Pandas was originally developed by Wes McKinney at AQR Capital Management,
primarily to address issues in business analytics and quantitative trading.
The name pandas comes from panel data.
Unlike NumPy, Pandas is designed for tabular and heterogeneous data.
Pandas data structures have an index, which is a crucial component. Data is
aligned on the index, for example, when adding two Series.
A Series can be thought of as an ordered dictionary.
Both an Index and a Series can have a name (as can a DataFrame's index and
columns), which is very handy in analysis.
An Index is designed to be immutable, which makes it safe to share between
data structures.
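A minimal sketch of the points above, using made-up fruit prices: dictionary-style access on a Series, names on both the Series and its Index, and the error raised when trying to modify an Index in place.

```python
import pandas as pd

# Build a Series from a dict; the keys become the index, in insertion order.
prices = pd.Series({"apple": 1.5, "pear": 2.0}, name="price")
prices.index.name = "fruit"

# Dictionary-style membership test and lookup work on the index labels.
has_apple = "apple" in prices
pear_price = prices["pear"]

# An Index is immutable: item assignment raises TypeError.
try:
    prices.index[0] = "kiwi"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
```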
When performing arithmetic, if the indexes don't match, the result's index is
the union of the two, and NaN is produced for labels missing from either one.
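As a small illustration of index alignment, the sketch below adds two Series whose indexes only partly overlap. The data is invented, and the fill_value argument is an extra not mentioned in the notes, shown as one way to avoid the NaN.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; the result index is the union of both indexes.
# Labels found in only one operand ("a" and "d") become NaN.
total = s1 + s2

# With fill_value, a label missing from one side is treated as 0 instead.
filled = s1.add(s2, fill_value=0)
```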
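The built-in extension data types in Pandas often work better because they
handle missing values better; for example, the nullable Int64 dtype can hold a
missing value without converting the integers to floats.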
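To make the difference concrete, here is a short sketch comparing the default NumPy-backed dtype with the nullable "Int64" extension dtype on the same toy data.

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces the
# integers to float64 and the gap is stored as NaN.
numpy_backed = pd.Series([1, 2, None])

# The nullable "Int64" extension dtype keeps the integers as integers
# and stores the gap as pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Reductions skip the missing value by default.
total = nullable.sum()
```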
df.explode('col_name') turns each element of a list into its own row while
preserving the other column values. This duplicates index labels, so we may
need to follow it with df.reset_index(drop=True).
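A quick sketch of explode followed by reset_index, with hypothetical column names "user" and "tags":

```python
import pandas as pd

df = pd.DataFrame({"user": ["ann", "bob"], "tags": [["x", "y"], ["z"]]})

# Each list element gets its own row; "user" is repeated and the
# original index label is duplicated (0, 0, 1).
exploded = df.explode("tags")

# Replace the duplicated labels with a fresh 0..n-1 index.
flat = exploded.reset_index(drop=True)
```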
series.array returns the underlying data as a NumPy-like array.
df.reindex(columns=columns) can be used to reindex, or to effectively drop
columns!
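For instance, with a toy frame of three columns, reindexing on a subset of column labels both reorders and drops. As a caveat worth knowing, a label that does not exist is added as an all-NaN column rather than raising.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Only the listed columns are kept, in the listed order; "b" is dropped.
subset = df.reindex(columns=["c", "a"])

# A label that is not in the frame ("d") becomes a column of NaN.
with_new = df.reindex(columns=["a", "d"])
```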
frame.sub(series, axis='index') — match on rows, broadcast over columns!
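A small sketch of this row-wise broadcasting, using an invented frame and a Series of per-row offsets that shares the frame's row labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]},
    index=["r1", "r2", "r3"],
)
row_offsets = pd.Series([1.0, 2.0, 3.0], index=["r1", "r2", "r3"])

# axis="index" matches the Series labels against the row labels and
# subtracts the same offset from every column of that row.
result = frame.sub(row_offsets, axis="index")
```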
df.apply()
Can pass axis='columns'
Can return a Series
applymap should be used for element-wise functions (renamed to DataFrame.map
in pandas 2.1)
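The apply options above can be sketched as follows. The helper min_max is a hypothetical function written for this example; it returns a Series so that each call contributes several values.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4], "b": [3, 2]})

def min_max(values):
    # Returning a Series yields one labeled result per entry.
    return pd.Series({"min": values.min(), "max": values.max()})

# Default axis: the function is called once per column.
per_column = df.apply(min_max)

# axis="columns": the function is called once per row.
per_row = df.apply(min_max, axis="columns")
```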
Use a dim Series (the dimension table, i.e. the categories) and a values
Series (the category codes); dim.take(values) reconstructs the categorical
series.
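A minimal sketch of the dim/values idea with invented fruit labels. The round trip through pd.Categorical at the end is an addition here, included only to show that the codes match.

```python
import pandas as pd

dim = pd.Series(["apple", "orange"])   # categories (dimension table)
values = pd.Series([0, 1, 0, 0, 1])    # category codes

# take() looks up each code by position in dim.
reconstructed = dim.take(values)

# Going the other way: pandas derives the same codes from the labels.
cat = pd.Categorical(reconstructed)
```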
df.reset_index() moves the index values back into a column!
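For example, with a toy frame whose index is named "key" (a made-up name), reset_index turns the index into a regular column, while drop=True discards it:

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2]}, index=pd.Index(["a", "b"], name="key"))

# The named index becomes an ordinary column; a default index replaces it.
restored = df.reset_index()

# drop=True throws the old index away instead of keeping it as a column.
dropped = df.reset_index(drop=True)
```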
df.plot() can be used to plot directly; there are sub-methods such as
df.plot.bar(). They also take an ax object.
The patsy library can be used to construct matrices from a DataFrame for
modeling.
A DesignMatrix is basically a NumPy array with additional metadata.
y, X = patsy.dmatrices('y ~ standardize(x0) + center(x1)', data)
Regular [] indexing is ambiguous: on a Series, it treats integers as labels if
the index contains integers. Prefer df.loc[] when indexing with labels and
df.iloc[] when indexing with integer positions to avoid any ambiguity.
With a DataFrame, a single index selects columns; the exception is that a
slice such as df[:n] selects the first n rows.
With df.loc[], first index selects rows, second selects columns.
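The indexing rules above can be sketched on toy data. The Series deliberately has an integer index in a scrambled order so that label and position differ.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], index=[2, 0, 1])
by_label = ser.loc[0]       # label 0 is the second element
by_position = ser.iloc[0]   # position 0 is the first element

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["r1", "r2", "r3"])
column = df["x"]            # a single key selects a column
first_two = df[:2]          # a slice selects rows instead

# loc: row label first, column label second; iloc: the same by position.
cell = df.loc["r2", "y"]
same_cell = df.iloc[1, 1]
```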
Chained indexing should not be used for assignment, because it may modify a
temporary copy instead of the original.
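A short sketch of the usual alternative, assuming hypothetical "score" and "passed" columns: select the rows and the column in a single loc call so the assignment reaches the original frame.

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 55, 80], "passed": [False, False, False]})

# Chained form to avoid:
#   df[df["score"] > 50]["passed"] = True
# The first [] may return a copy, so the assignment can be lost.

# Single loc call: row mask first, column label second.
df.loc[df["score"] > 50, "passed"] = True
```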
argsort returns the integer positions that would sort the data (an indexer),
which can then be passed to df.take(indexer).
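As an illustration with made-up data, the indexer from argsort on one column can reorder the whole frame through take:

```python
import pandas as pd

df = pd.DataFrame({"name": ["c", "a", "b"], "score": [30, 10, 20]})

# Positions that would sort "score" in ascending order.
indexer = df["score"].argsort()

# take() selects rows by position, in the order given by the indexer.
sorted_df = df.take(indexer)
```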
GroupBy
df.groupby() is a “split-apply-combine” process.
The GroupBy object returned by df.groupby() has some optimized aggregation
methods built in.
You can also apply your own, non-optimized functions via
df.groupby().agg(func).
Passing a list of aggregation methods applies them all, one result column
each. Passing a list of (name, function) tuples gives names to the aggregated
columns.
Passing a dictionary to df.groupby().agg() runs different aggregations on
different columns.
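The aggregation variants above can be sketched on a small invented table. The column names and the peak_to_peak helper are hypothetical, chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [1, 3, 5, 7],
    "fouls": [2, 0, 1, 1],
})
grouped = df.groupby("team")

# Optimized built-in aggregation method.
means = grouped["points"].mean()

# Custom (non-optimized) function passed to agg.
def peak_to_peak(values):
    return values.max() - values.min()

spread = grouped["points"].agg(peak_to_peak)

# A list of aggregations gives one result column per entry.
stats = grouped["points"].agg(["mean", "max"])

# (name, function) tuples choose the result column names.
named = grouped["points"].agg([("average", "mean"), ("largest", "max")])

# A dict maps each column to its own aggregation.
mixed = grouped.agg({"points": "sum", "fouls": "max"})
```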
The GroupBy object is iterable, yielding (key, group) pairs; get_group(key)
retrieves a single split, while indexing with a column name restricts the
columns.
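A brief sketch of iterating over the splits and pulling out a single one, on invented data. Collecting the pieces into a dict is just one convenient pattern, not the only way.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b", "b"], "points": [1, 3, 5, 7]})

# Iterating yields (group key, sub-DataFrame) pairs.
pieces = {key: group for key, group in df.groupby("team")}

# get_group fetches one split directly by its key.
only_b = df.groupby("team").get_group("b")

# Indexing with a column name narrows the GroupBy to that column.
points_sum = df.groupby("team")["points"].sum()
```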
Similarly, we can use the transform method on the groups; the function either
returns an object of the same size or a scalar value to be broadcast.
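Both transform behaviors can be sketched with the same kind of toy data: a scalar per group that is broadcast back to every row, and a function that returns a result the same size as its group.

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b", "b"], "points": [1, 3, 5, 7]})
grouped = df.groupby("team")["points"]

# "mean" yields one scalar per group, broadcast to every row of the group.
group_means = grouped.transform("mean")

# A function returning a same-sized object: center each group on its mean.
centered = grouped.transform(lambda s: s - s.mean())
```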
For builtin aggregations, we can pass a string instead of a function/lambda.
pd.crosstab(df['col1'], df['col2']) is a simple way to compute a frequency
table on two columns; this is easier than using groupby.
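To compare the two approaches, the sketch below builds the same frequency table with crosstab and with a groupby chain, on made-up values for col1 and col2. The groupby version is one possible equivalent, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "y"],
    "col2": ["u", "v", "u", "u", "v"],
})

# One call: rows are col1 values, columns are col2 values, cells are counts.
table = pd.crosstab(df["col1"], df["col2"])

# The same counts via groupby: count each pair, then pivot col2 to columns.
by_groupby = df.groupby(["col1", "col2"]).size().unstack(fill_value=0)
```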