Pandas for time series data — tricks and tips

There are some Pandas DataFrame manipulations that I keep looking up how to do. I am recording these here to save myself time. These may help you too.

Time series data

Convert column to datetime with given format

Re-index a dataframe to interpolate missing values (eg every 30 mins below). You need to have a datetime index on the df before running this.

Find missing dates in a DataFrame

Split a dataframe based on a date in a datetime column

Find nearest date in dataframe (here we assume index is a datetime field)

Calculate a delta between datetimes in rows (assuming index is datetime)

Calculate a running delta between date column and a given date (eg here we use first date in the date column as the date we want to difference to).

Housekeeping

Reset index

Set index

Reset index, don’t keep the original index

Drop column(s)

Drop rows that contain a duplicate value in a specific column(s)

Rename column(s)

Sort dataframe by column_1 then column_2, in ascending order

Sort dataframe using a list (sorts by column ‘id’ using given list order of id’s)

Selection

Select rows from a DataFrame based on values in a column in pandas

Super useful snippets after https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas

Select columns from dataframe

Get unique values in a column

Get row where value in column is a minimum

Select by row number

Select by column number

Get column names for maximum value in each row

Select 70% of Dataframe rows

Randomly select n rows from a Dataframe

Select rows where a column doesn’t (remove tilda for does) contain a substring

Select rows containing a substring from a list of substrings

Select duplicated rows based on selected columns

Select duplicated rows based on all columns (returns all except first occurrence)

Select using query then set value for specific column. In the example below we search the dataframe on the ‘island’ column and ‘vegetation’ column, and for the matches we set the ‘biodiversity’ column to ‘low’

Create 10 fold Train/Test Splits

Similar to selecting to a % of dataframe rows, we can repeat randomly to create 10 fold train/test set splits using a 90/10 train test split ratio.

Group by

Group by columns, get most common occurrence of string in other column (eg class predictions on different runs of a model).

Group by column, apply operation then convert result to dataframe

Replacement

Replace rows in dataframe with rows from another dataframe with same index.

Replace value in column(s) by row index

Replace substring in column. See pandas docs for regex use

Iterate over rows

Using iterrows

Using itertuples (faster, see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas)

If you need to modify the rows you are iterating, use apply:

Or alternatively, see good explanation here and example below:

NaN’s

Replace NaN in df or column with zeros (or value)

Count NaN’s in column

Find which columns have Nans, list of those columns, and select columns with one or more NaNs. After https://stackoverflow.com/questions/36226083/how-to-find-which-columns-contain-any-nan-value-in-pandas-dataframe-python

Get rows where column is NaN

Data Analysis

Show last n rows of dataframe

Show transpose of dataframe head. we pass in len(list(df)) as number to head to show all the columns

Calculate the mean for each cell across multiple dataframes

Calculate sum across all columns for each row

String operations

Replace a specific character in column

Concatenate two columns

Merge

Merge DataFrame on multiple columns

Concat / Append vertically

(see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

Split

Split dataframe into N roughly equal sized dataframes

Split a column string by the last occurrence of a substring, create new columns

Type conversion

Change column type in dataframe

Add data

Add an empty column

Data types

Convert columns ‘a’ and ‘b’ to numeric, coerce non numeric to ‘NaN’

Creating DataFrames

From list of dicts

From list

From multiple lists

Numpy

As an alternative method to concatenating dataframes, you can use numpy (less memory intensive than pandas-useful for large merges)

Import/Export

Group by a column, then export each group into a separate dataframe

Import / Export in Feather format

Here we save a DataFrame in feather format (really fast to read back in). Note I have an issue saving feather files >~2GB using pandas==0.23.4

Import / Export in Parquet format

Save without index

Read in, specifying new column names

datetime64 Date and Time Codes

from here: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

time units:

Geophysicist and Deep Learning Practitioner