Pandas

  • Pandas was originally developed by Wes McKinney from AQR Capital Management, primarily to address issues in business analytics and quantitative trading.
  • The name pandas comes from panel data.
  • Unlike NumPy, Pandas is designed for tabular and heterogeneous data.

Topics

  • Serializing pandas objects
    • Parquet - multiple files - this has the best support for pandas metadata, multi-indexes, extension types, etc.
    • HDF5
    • Pickle - preserves absolutely everything, but has potential security and backward compatibility issues.
    • Feather - not in a single file
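
A minimal sketch of the pickle round trip mentioned above, assuming a small made-up frame (the column names, index names, and file name are hypothetical). It only illustrates that a MultiIndex and a categorical dtype survive; Parquet and Feather are left out because they need an optional engine such as pyarrow.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a MultiIndex and a categorical column.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
)
df = pd.DataFrame(
    {"value": [1.5, 2.5, 3.5], "kind": pd.Categorical(["x", "y", "x"])},
    index=index,
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)
    # Unpickling can execute arbitrary code, so only load trusted files.
    restored = pd.read_pickle(path)

# The index names and the categorical dtype come back unchanged.
roundtrip_ok = restored.equals(df)
```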

Usage

  • Pandas data structures have an index, a crucial component. Data is aligned with respect to the index, for example, when adding two series.
  • A Series can be thought of as an ordered dictionary.
  • Both the index and the Series/DataFrame can have a name, which is very handy in analysis.
  • Index is designed to be immutable; take advantage of it.
  • When performing arithmetic, if the indexes don't match, the result's index is the union of the two, and NaN is produced for labels missing from either operand.
  • The builtin extension data types in pandas work better since they handle missing elements better.
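  • df.explode('col_name') turns each element of a list into its own row while preserving the other column values. This duplicates index labels, so we may need to call df.reset_index(drop=True) afterwards.
  • series.array returns the pandas array (an ExtensionArray) backing the series; use series.to_numpy() when an actual NumPy array is needed.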
  • df.reindex(columns=columns) can be used to reindex, or to effectively drop columns!
  • frame.sub(series, axis='index') matches the series on the row index and broadcasts over the columns!
  • df.apply()
    • Can pass axis='columns'
    • Can return a Series
    • applymap should be used for element-wise functions (deprecated since pandas 2.1 in favor of df.map)
  • Use dim (dimensional table, or categories) and values (category codes) series, with dim.take(values) to reconstruct the categorical series.
  • df.reset_index() moves the index values back into a column!
  • df.plot() can be used to plot directly, and there are sub-methods such as df.plot.bar(). They also take an ax object.
  • The patsy library
    • can be used to construct matrices from a DataFrame for modeling. A DesignMatrix is basically a NumPy array with additional metadata.
    • y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
    • new_X = patsy.build_design_matrices([X.design_info], new_data)
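
A short sketch of three of the behaviors noted above: index alignment in arithmetic, explode followed by reset_index, and frame.sub with axis='index'. All data and names here are made up for illustration.

```python
import pandas as pd

# Alignment: the result index is the union of both indexes, and a label
# present in only one operand ("a" here) yields NaN.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "c"])
total = s1 + s2

# explode: each list element becomes its own row; the original index
# label is repeated, so reset_index(drop=True) renumbers the rows.
df = pd.DataFrame({"id": [1, 2], "tags": [["x", "y"], ["z"]]})
exploded = df.explode("tags")
flat = exploded.reset_index(drop=True)

# sub with axis="index": the series is matched on row labels and
# subtracted from every column.
frame = pd.DataFrame({"u": [1.0, 2.0], "v": [3.0, 4.0]}, index=["r1", "r2"])
row_means = frame.mean(axis="columns")
centered = frame.sub(row_means, axis="index")
```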

Indexing

  • Regular df[index] indexing is ambiguous: it treats integers as labels if the index contains integers. Prefer df.loc[] when indexing with labels and df.iloc[] with integer positions to avoid any ambiguity.
  • A single index selects columns; the exception is a slice such as df[:n], which selects the first n rows.
  • With df.loc[], the first index selects rows and the second selects columns.
  • Chained indexing should not be used for assignment, since it may modify a temporary copy; assign through a single df.loc[] call instead.
  • argsort returns indexes as a result of sorting (indexer), which can then be used for df.take(indexer).
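
The points above can be sketched as follows, assuming a tiny made-up Series and DataFrame. The argsort step is done on the NumPy values here so that the indexer is a plain array of positions.

```python
import pandas as pd

# With an integer index, .loc looks up labels and .iloc looks up positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])
by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element

df = pd.DataFrame(
    {"x": [30, 10, 20], "y": ["p", "q", "r"]},
    index=["r1", "r2", "r3"],
)
cell = df.loc["r2", "x"]  # row label first, column label second
first_two = df[:2]        # a slice selects rows, not columns

# Assign through one .loc call rather than chained indexing.
df.loc[df["x"] > 15, "y"] = "big"

# argsort gives the positions that would sort "x"; take applies them.
indexer = df["x"].to_numpy().argsort()
sorted_df = df.take(indexer)
```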

GroupBy

  • df.groupby() is a “split-apply-combine” process.
  • The GroupBy object returned by df.groupby() has some optimized aggregation methods builtin.
  • You can also call non-optimized versions on it via df.groupby().agg(func). Passing a list of aggregation methods returns them all. Passing a list of (name, function) tuples gives names to the resulting columns.
  • Passing a dictionary to df.groupby().agg() runs different aggregations on different columns.
  • The returned object is iterable and yields (key, group) pairs. Indexing it with a column name selects the columns to aggregate, and get_group(key) returns a single split.
  • Similarly, we can use the transform method on the groups, which returns either an object of the same size or a scalar value to be broadcast.
  • For builtin aggregations, we can pass a string instead of a function/lambda.
  • pd.crosstab(df['col1'], df['col2']) is a simple way to compute a frequency table on two columns, this is easier than groupby.
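
A small sketch of the groupby patterns listed above, using a made-up frame; the column names and the "average"/"spread" labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        "kind": ["x", "y", "x", "x"],
        "val": [1.0, 3.0, 10.0, 20.0],
        "qty": [1, 2, 3, 4],
    }
)
grouped = df.groupby("key")

# Builtin optimized aggregation on one selected column.
means = grouped["val"].mean()

# (name, function) tuples name the output columns; a string refers to a
# builtin aggregation, a lambda is a non-optimized custom one.
stats = grouped["val"].agg(
    [("average", "mean"), ("spread", lambda x: x.max() - x.min())]
)

# A dictionary applies a different aggregation to each column.
per_col = grouped.agg({"val": "max", "qty": "sum"})

# Iterating yields (key, group) pairs; get_group fetches one split.
group_sizes = {key: len(group) for key, group in grouped}
group_a = grouped.get_group("a")

# transform returns a result aligned with the original rows.
demeaned = df["val"] - grouped["val"].transform("mean")

# Frequency table of two columns.
freq = pd.crosstab(df["key"], df["kind"])
```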

Time Series

  • There are many DateOffset objects to be used! They also have rollforward and rollback methods.
  • resample is better than groupby for time-based aggregation.
  • The Period object is also very handy!
  • A pd.Grouper() object can be created to facilitate groupby operations; the time must be the index unless a column is named via the key argument.
  • rolling, and ewm, etc, for rolling window operations.
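
A brief sketch of the time series tools above on a made-up daily series that straddles a month boundary. The "MS" (month start) frequency and the column names are choices made for this example only.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Six daily observations: Jan 30 and 31, then Feb 1 through 4.
dates = pd.date_range("2024-01-30", periods=6, freq="D")
ts = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# resample: aggregate into month-start bins.
monthly = ts.resample("MS").sum()

# DateOffset objects can roll a date forward or back onto the offset.
stamp = pd.Timestamp("2024-01-30")
rolled_forward = MonthEnd().rollforward(stamp)
rolled_back = MonthEnd().rollback(stamp)

# Period arithmetic moves in whole units of the period's frequency.
next_month = pd.Period("2024-01", freq="M") + 1

# pd.Grouper with key= groups on a datetime column instead of the index.
df = pd.DataFrame({"time": dates, "val": ts.to_numpy()})
by_grouper = df.groupby(pd.Grouper(key="time", freq="MS"))["val"].sum()

# Rolling and exponentially weighted windows.
rolling_mean = ts.rolling(2).mean()
ewm_mean = ts.ewm(span=2).mean()
```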