Pandas

Pandas was originally developed by Wes McKinney at AQR Capital Management, primarily to address problems in business analytics and quantitative trading.
The name pandas comes from panel data.
Unlike NumPy, Pandas is designed for tabular and heterogeneous data.

Topics

Serialize pandas
- Parquet - multiple files - this has the best support for Pandas metadata, MultiIndexes, extension types, etc.
- HDF5
- Pickle - preserves absolutely everything, but has potential security and backward compatibility issues.
- Feather - not in a single file
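A minimal sketch of a pickle round trip, using a hypothetical frame with a MultiIndex and a nullable extension dtype. It uses pickle only because it needs no extra dependency; the frame, column names, and file name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame: MultiIndex plus a nullable Int64 column.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1)], names=["key", "n"]
)
df = pd.DataFrame({"x": pd.array([1, None, 3], dtype="Int64")}, index=index)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame.pkl"
    df.to_pickle(path)
    # Only unpickle files from a trusted source.
    restored = pd.read_pickle(path)
```

The restored frame keeps the index names and the Int64 dtype.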

Usage

Pandas data structures have an index, a crucial component. Data is aligned on the index, for example when adding two Series.
A Series can be thought of as an ordered dictionary.
Both the index and the Series/DataFrame can have a name, which is very handy in analysis.
An Index is designed to be immutable; take advantage of it.
When performing arithmetic on objects whose indexes don't match, the result is indexed by the union of the indexes, and NaN is produced for labels missing from either side.
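A small sketch of index alignment, with made-up Series. The `add(..., fill_value=0)` call is one way to avoid the NaN when a label is missing on one side.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Labels are matched, not positions; the result index is the union a, b, c, d.
total = s1 + s2

# Treat a label missing on one side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
```

Only "b" exists in both Series, so it is the only non-NaN entry in `total`.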
The built-in extension data types in Pandas often work better than the plain NumPy dtypes because they handle missing elements better.
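As an illustration with made-up values, the nullable "Int64" extension dtype keeps integers as integers when a value is missing, whereas the default inference falls back to float64.

```python
import pandas as pd

# Default inference: the missing value forces a float64 column.
plain = pd.Series([1, None, 3])

# Nullable extension dtype: integers stay integers, the gap becomes <NA>.
nullable = pd.Series([1, None, 3], dtype="Int64")

total = nullable.sum()  # missing values are skipped by default
```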
df.explode('col_name') turns each element of a list-valued cell into its own row while preserving the other column values. This duplicates index values, so we may need to call df.reset_index(drop=True) afterwards.
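A short sketch with a hypothetical `tags` column shows the duplicated index and the reset:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["x", "y"], ["z"]]})

# Each list element gets its own row; the original index label is repeated.
exploded = df.explode("tags")

# Replace the duplicated labels with a fresh 0..n-1 index.
tidy = exploded.reset_index(drop=True)
```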
series.array returns the underlying data as a NumPy-like array (a pandas extension array).
df.reindex(columns=columns) can be used to reindex, or to effectively drop columns!
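For example, with made-up column names, reindexing on a subset both reorders and drops columns, while a column name that does not exist comes back filled with NaN:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Keeps "c" and "a" in that order; "b" is effectively dropped.
subset = df.reindex(columns=["c", "a"])

# "d" is not in the frame, so it is created and filled with NaN.
with_new = df.reindex(columns=["a", "d"])
```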
frame.sub(series, axis='index'): match on rows, broadcast over columns!
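A minimal sketch with a hypothetical frame: a Series indexed by row label is subtracted from every column.

```python
import pandas as pd

frame = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]},
    index=["r1", "r2", "r3"],
)
row_offsets = frame["x"]  # one value per row label

# Align the Series with the row index and subtract it from each column.
shifted = frame.sub(row_offsets, axis="index")
```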
df.apply()
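- Pass axis='columns' to apply the function to each row instead of each column
- The function can return a Series, in which case the result is a DataFrame
- applymap (renamed DataFrame.map in pandas 2.1) should be used for element-wise functions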
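The first two points can be sketched as follows; the frame and the lambdas are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [3, 2]})

# Default: the function receives each column and returns one value per column.
col_range = df.apply(lambda col: col.max() - col.min())

# axis="columns": the function receives each row instead.
row_range = df.apply(lambda row: row.max() - row.min(), axis="columns")

# Returning a Series per column produces a DataFrame.
stats = df.apply(lambda col: pd.Series({"min": col.min(), "max": col.max()}))
```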
Use a dim Series (the dimension table, i.e. the categories) and a values Series (the category codes), with dim.take(values) to reconstruct the categorical series.
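A sketch with made-up categories. `pd.Categorical.from_codes` is not mentioned in the note; it is shown here only as a related way to build a real categorical from the same two pieces.

```python
import pandas as pd

dim = pd.Series(["apple", "orange"])   # the distinct categories
values = pd.Series([0, 1, 0, 0, 1])    # integer codes pointing into dim

# Look up each code in the dimension table.
fruits = dim.take(values)

# Related alternative (an addition to the note): build a Categorical directly.
as_categorical = pd.Categorical.from_codes(
    values.tolist(), categories=dim.tolist()
)
```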
df.reset_index() moves the index values back into columns!
df.plot() can be used to plot directly, and there are sub-methods such as df.plot.bar(). They also accept an ax object.
The patsy library
- can be used to construct design matrices from a DataFrame for modeling. A DesignMatrix is basically a NumPy array with additional metadata.
- y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
- new_X = patsy.build_design_matrices([X.design_info], new_data)

Indexing

Plain df[index] indexing is ambiguous: it treats integers as labels if the index contains integers. Prefer df.loc[] when indexing with labels and df.iloc[] when indexing with integer positions to avoid any ambiguity.
A single index selects columns; the exception is a slice such as df[:n], which selects the first n rows.
With df.loc[], first index selects rows, second selects columns.
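A small sketch of the difference, using a hypothetical frame whose integer index is deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]}, index=[2, 1, 0])

by_label = df.loc[0, "a"]     # row labelled 0 (the last row), column "a"
by_position = df.iloc[0, 0]   # first row, first column
column = df["a"]              # a single key selects a column
first_two = df[:2]            # a slice selects rows by position
```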
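Chained indexing such as df[mask]['col'] = value should not be used for assignment, because it may modify a temporary copy; use a single df.loc[] call instead.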
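A minimal sketch of the safe form, with made-up data. The chained variant is left as a comment because its behaviour depends on the pandas version and settings.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})

# Chained form to avoid: df[df["a"] > 1]["b"] = 99
# It may assign into a temporary copy and leave df unchanged.

# Single .loc call: row selector and column label in one indexing operation.
df.loc[df["a"] > 1, "b"] = 99
```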
argsort returns the integer positions that would sort the data (an indexer), which can then be passed to df.take(indexer).
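As a sketch with a hypothetical frame, the indexer is computed on the underlying NumPy array here to keep it a plain array of positions:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [3, 1, 2]})

# Positions that would sort the "score" column in ascending order.
indexer = df["score"].to_numpy().argsort()

# take() selects rows by position, in the order given by the indexer.
ordered = df.take(indexer)
```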

GroupBy

df.groupby() is a “split-apply-combine” process.
The GroupBy object returned by df.groupby() has some optimized aggregation methods built in.
You can also call non-optimized versions on it via df.groupby().agg(func). Passing a list of aggregation methods computes them all. Passing a list of (name, function) tuples gives names to the aggregated columns.
Passing a dictionary to agg() runs different aggregations on different columns.
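The three ways of calling agg can be sketched as follows; the frame, column names, and result names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        "x": [1, 2, 3, 4],
        "y": [10.0, 20.0, 30.0, 40.0],
    }
)
grouped = df.groupby("key")

# A list of aggregations: one result column per function.
several = grouped["x"].agg(["mean", "max"])

# A list of (name, function) tuples: the names become the column labels.
named = grouped["x"].agg([("average", "mean"), ("largest", "max")])

# A dictionary: different aggregations for different columns.
per_column = grouped.agg({"x": "sum", "y": ["min", "max"]})
```

With the dictionary form, the result has hierarchical columns such as `("y", "max")`.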
The GroupBy object is iterable, yielding (key, group) pairs. Use get_group(key) to get only the split we need.
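A short sketch with made-up data: iterating yields the group key together with the matching sub-frame, and `get_group` is one way to pull out a single split.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3]})

# Each iteration step yields the group key and the rows belonging to it.
sizes = {key: len(group) for key, group in df.groupby("key")}

# Retrieve one split by its key.
only_b = df.groupby("key").get_group("b")
```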
Similarly, we can use the transform method on the groups; the function must return either an object of the same size as the group or a scalar value to be broadcast.
For builtin aggregations, we can pass a string instead of a function/lambda.
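Both flavours of transform can be sketched with a made-up frame: a string name for a built-in aggregation (scalar per group, broadcast back) and a lambda returning a same-size result.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 3.0, 10.0]})
grouped = df.groupby("key")["x"]

# Built-in aggregation by name: the group mean is broadcast to every row.
group_mean = grouped.transform("mean")

# Function returning an object of the same size as the group.
demeaned = grouped.transform(lambda s: s - s.mean())
```

Both results keep the original index of `df`, so they can be assigned back as new columns.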
pd.crosstab(df['col1'], df['col2']) is a simple way to compute a frequency table on two columns; this is easier than groupby.
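For example, with made-up values in the two columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["x", "x", "y", "y", "y"],
        "col2": ["p", "q", "p", "p", "q"],
    }
)

# Rows are the values of col1, columns the values of col2, cells are counts.
table = pd.crosstab(df["col1"], df["col2"])
```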

Time Series

There are many DateOffset objects to use! They also have rollforward and rollback methods.
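A small sketch using `MonthEnd` as one example offset; the date is arbitrary. Note that the backward method is named `rollback`.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Timestamp("2024-01-15")

shifted = ts + MonthEnd()              # next month end after ts
forward = MonthEnd().rollforward(ts)   # roll forward to the nearest month end
back = MonthEnd().rollback(ts)         # roll back to the previous month end
```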
resample is better suited than groupby for time-based aggregation.
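A minimal sketch with a made-up minute-level series, downsampled into 5-minute bins:

```python
import pandas as pd

idx = pd.date_range("2024-01-01 00:00", periods=6, freq="min")
ts = pd.Series(range(6), index=idx)

# Bins are [00:00, 00:05) and [00:05, 00:10), labelled by their left edge.
five_min = ts.resample("5min").sum()
```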
Period object is also very handy!
A pd.Grouper() object can be created to facilitate groupby operations; when only freq is given, time must be the index (the key parameter can name a datetime column instead).
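A sketch of grouping by a regular key and a time bucket at once. The frame and column names are hypothetical, and the time column is moved into the index before grouping.

```python
import pandas as pd

times = pd.date_range("2024-01-01 00:00", periods=10, freq="min")
df = pd.DataFrame({"time": times, "key": ["a", "b"] * 5, "value": range(10)})

# A time-based grouper with 5-minute buckets.
time_key = pd.Grouper(freq="5min")

# Group by the ordinary column and the time bucket together.
result = df.set_index("time").groupby(["key", time_key])["value"].sum()
```

The result is indexed by `(key, bucket start)` pairs.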
Use rolling, ewm, etc. for moving-window operations.
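A short sketch with made-up values; the window size and span are arbitrary choices.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Mean over a sliding window of 2 observations; the first value is NaN
# because the window is not yet full.
rolling_mean = s.rolling(window=2).mean()

# Exponentially weighted mean: recent observations get more weight.
ewm_mean = s.ewm(span=2).mean()
```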
