Fast loading and saving of pandas data frames using numpickle (numpy and pickle)

Speedy data loading (autonomous subway in Suwon, South Korea; photo by Jeremy Bishop)

Loading huge data frames (gigabytes in size) from text files (CSV or TSV) can be pretty time-consuming.

Here, I show how you can save 3x (if not several orders of magnitude) in time by using the numpy and pickle packages for saving and loading pandas data frames.

If the data frame consists of numeric values only, without any non-numeric columns (e.g. strings), loading is on average ~7x faster, because the per-column dtype conversion can be skipped.

Here is the code:

import pandas as pd
import numpy as np
import pickle
import os
def save_numpickle(df, outfpath, all_numeric=False):
    # store the values as a plain numpy array ("outfpath" should end in ".npy")
    arr, colnames, rownames = df.to_numpy(), df.columns, df.index
    np.save(file=outfpath, arr=arr)
    # pickle the metadata (row/column names, and the dtypes unless all-numeric)
    # into a helper file "<outfpath>.pckl"
    pickle.dump({'colnames': colnames,
                 'rownames': rownames,
                 'dtypes': None if all_numeric else df.dtypes},
                open(outfpath + ".pckl", "wb"))

def load_numpickle(fpath):
    df = pd.DataFrame(np.load(fpath, allow_pickle=True))
    with open(fpath + ".pckl", "rb") as fin:
        meta = pickle.load(fin)
    # if no 'dtypes' are present, assume all-numeric data
    df.index, df.columns, dtypes = \
        meta['rownames'], meta['colnames'], meta.get('dtypes', None)
    if dtypes is not None:
        df = df.astype(dtypes)
    return df
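To verify the speedup on your own data, you can time both loading paths with the functions just defined. The following is a minimal sketch; the CSV path is a hypothetical placeholder (index_col=0 assumes the first column holds the row names), and the exact numbers depend on your data and disk:

import time

fpath = "/home/user/big_table.csv"  # hypothetical large CSV

# time plain text loading
t0 = time.perf_counter()
df = pd.read_csv(fpath, index_col=0)
print(f"read_csv: {time.perf_counter() - t0:.2f} s")

# save once in the binary format, then time the binary loading
save_numpickle(df, fpath.replace(".csv", ".npy"))
t0 = time.perf_counter()
df_ = load_numpickle(fpath.replace(".csv", ".npy"))
print(f"load_numpickle: {time.perf_counter() - t0:.2f} s")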

The code is packaged, so you can install it from anywhere:

pip install numpickle

Import and use it like this:

import pandas as pd
import numpickle as npl
# create example data frame with non-numeric and numeric columns
df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']])
df.columns = ["A", "B", "C"]
df.index = ["row1", "row2"]
df
#       A  B  C
# row1  1  2  a
# row2  3  4  b
df.dtypes
# A     int64
# B     int64
# C    object
# dtype: object
# save data frame as numpy array and pickle row and column names
# into a helper pickle file "/home/user/test.npy.pckl"
npl.save_numpickle(df, "/home/user/test.npy")
# load the saved data as a control:
df_ = npl.load_numpickle("/home/user/test.npy")
df_
#       A  B  C
# row1  1  2  a
# row2  3  4  b
df_.dtypes
# A     int64
# B     int64
# C    object
# dtype: object
df.equals(df_)  # strict check: same values and same dtypes
# True
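If you are curious what ends up in the helper file, you can inspect it directly; it holds only the metadata that the bare numpy array cannot carry (a quick sketch, assuming the save above ran):

import pickle
with open("/home/user/test.npy.pckl", "rb") as fin:
    meta = pickle.load(fin)
meta["colnames"]  # Index(['A', 'B', 'C'], dtype='object')
meta["rownames"]  # Index(['row1', 'row2'], dtype='object')
meta["dtypes"]    # the per-column dtypes (None if saved with all_numeric=True)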

If your data contains numeric values only, set all_numeric=True (this accelerates loading ~7x):

df = pd.DataFrame([[1, 2], [3, 4]])
df.columns = ["A", "B"]
df.index = ["row1", "row2"]
# save the data frame
npl.save_numpickle(df, "/home/user/test_num.npy", all_numeric=True)
# load the saved data
df_ = npl.load_numpickle("/home/user/test_num.npy")
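The gain comes from skipping the per-column astype conversion in load_numpickle. If you want to measure it yourself, here is a rough sketch (the paths and frame size are arbitrary choices, not part of the package):

import timeit
import numpy as np

# a purely numeric frame, large enough for timing
dfn = pd.DataFrame(np.random.rand(1_000_000, 10))
npl.save_numpickle(dfn, "/tmp/num_fast.npy", all_numeric=True)
npl.save_numpickle(dfn, "/tmp/num_slow.npy", all_numeric=False)

print(timeit.timeit(lambda: npl.load_numpickle("/tmp/num_fast.npy"), number=5))
print(timeit.timeit(lambda: npl.load_numpickle("/tmp/num_slow.npy"), number=5))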

Preprocess your data file for fast loading with numpickle

Here is how to convert any tabular text file into a fast-loadable numpickle file:

def save_file_as_numpickle(fpath,
                           sep="\t",
                           ending=".tsv",
                           all_numeric=False,
                           deletep=False,
                           *args, **kwargs):
    # read the text file and save it as "<fpath minus ending>.npy" + ".npy.pckl"
    df = pd.read_csv(fpath, sep=sep, *args, **kwargs)
    save_numpickle(df, fpath.replace(ending, ".npy"),
                   all_numeric=all_numeric)
    # optionally remove the original text file
    if deletep:
        os.remove(fpath)
# usage:
fpath = "/path/to/input.csv"
import numpickle as npl
npl.save_file_as_numpickle(fpath,
                           sep=",",         # comma-separated file!
                           ending=".csv",   # ending to be replaced by ".npy"
                           all_numeric=True,
                           deletep=True)
# now, you can do:
df = npl.load_numpickle(fpath.replace(".csv", ".npy"))
# you should be able to feel the difference in loading speed!
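If you have a whole directory of text files to migrate, the conversion is easy to loop. A sketch, with a hypothetical data directory:

import glob
import numpickle as npl

# convert every CSV in the (hypothetical) data directory
for fpath in glob.glob("/path/to/data/*.csv"):
    npl.save_file_as_numpickle(fpath,
                               sep=",",
                               ending=".csv",
                               all_numeric=False)  # keep dtypes unless you are sure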
