Pandas, tips to deal with huge datasets!
I have been using pandas for more than 4 years now and have always done all my ETL with pandas before training a machine learning model. During the second semester of graduate school, I was handed a huge dataset: a CSV file with more than 1 million rows and 22k columns. I was given a computing device with 32 GB of RAM and an 8-core processor. Equipped with these, I wanted to dive in and do EDA. As with any other project involving a CSV, I booted up a Python notebook, imported pandas, and ran pd.read_csv() to load my data. Alas, after a few minutes the kernel crashed and the OS appeared frozen for a minute or two. That is when I realized that my usual way of dealing with CSV files would not work here. In this article, I want to share my experience and a few tips on how I was able to work with this huge dataset using pandas.
Tip 1: Just read the first couple of rows
The best first thing to do when you have a huge dataset is to read just the first 100–200 rows and get a sense of what the data looks like. With this, you can see the column names and infer the likely data types of each column. For this, you can pass nrows (the number of rows you want to read) to read_csv:
import pandas as pd

data_file = '/path/to/my/datafile'
df = pd.read_csv(data_file, nrows=100)
Tip 2: Explicitly pass the data-types
Once you look at the first few rows of your data, you get a sense of the data types of your columns. The next time you read a larger portion of the file (or use the chunksize parameter, as you will see in Tip 3), it is a good idea to pass the data types explicitly when reading the CSV. This can be done with the dtype parameter in read_csv, passing it a dictionary that maps each column name to a data type. Ex:
import pandas as pd
data_file = '/path/to/my/datafile'
df = pd.read_csv(data_file, nrows=100, dtype={'col1': 'object', 'col2': 'float32', ...})
If you have a huge number of columns, the above method takes a lot of manual effort. The following code shows a simple way to automate it: read the top 100 rows, infer the data types from them, and build a dictionary mapping each column name to a data type, which we then pass to the dtype parameter. A word of caution: since the dataset is huge, the dtypes inferred from a small sample may not always be correct. For example, the top 100 rows may contain only values that fit in a float16, while later rows have data that require a float32. You can fix this with a little trial and error; nonetheless, the following snippet should come in handy most of the time.
import pandas as pd
data_file = '/path/to/my/datafile'
df = pd.read_csv(data_file, nrows=100)

# Map each column name to the dtype inferred from the sample
dtype_dic = {}
for key, value in df.dtypes.items():
    dtype_dic[key] = value

bigger_df = pd.read_csv(data_file, nrows=10000, dtype=dtype_dic)
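If memory is still tight, one optional refinement (not part of the original workflow, just a minimal sketch that assumes your later rows actually fit in the narrower types) is to downcast the inferred numeric dtypes before re-reading:

import pandas as pd

data_file = '/path/to/my/datafile'
sample = pd.read_csv(data_file, nrows=100)

# Downcast numeric columns inferred from the sample: float64 -> float32,
# int64 -> int32. This assumes later rows also fit in the narrower types,
# which is exactly where the trial and error mentioned above comes in.
dtype_dic = {}
for col, dt in sample.dtypes.items():
    if dt == 'float64':
        dtype_dic[col] = 'float32'
    elif dt == 'int64':
        dtype_dic[col] = 'int32'
    else:
        dtype_dic[col] = dt

bigger_df = pd.read_csv(data_file, nrows=10000, dtype=dtype_dic)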
Tip 3: Read data in chunks
This is probably the most popular way of dealing with large CSV files in pandas. Pandas read_csv has a parameter called chunksize to which you can pass the number of rows you want to read at a time; read_csv then returns an iterator over DataFrame chunks instead of a single DataFrame. When you do this, make sure you have a few functions ready that operate on each chunk and do the work (stats, plots, aggregation) for you.
import pandas as pd
data_file = '/path/to/my/datafile'
CHUNK = 10000

def calculate_some_stats_on_chunk(chunk):
    pass

def plot_graphs_on_chunk(chunk):
    pass

# Infer dtypes from a small sample first (as in Tip 2)
df = pd.read_csv(data_file, nrows=100)
dtype_dic = {}
for key, value in df.dtypes.items():
    dtype_dic[key] = value

# Process the file one chunk at a time
chunk_stats_list = []
for chunk in pd.read_csv(data_file, chunksize=CHUNK, dtype=dtype_dic):
    plot_graphs_on_chunk(chunk)
    stats = calculate_some_stats_on_chunk(chunk)
    chunk_stats_list.append(stats)
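To make the idea concrete, here is a minimal sketch of what such a chunk function might look like. It assumes a hypothetical numeric column named 'col2' (not a column from my real dataset) and accumulates per-chunk counts and sums so the overall mean can be recovered at the end:

import pandas as pd

data_file = '/path/to/my/datafile'
CHUNK = 10000

# Hypothetical per-chunk statistic: row count and the sum of a numeric column
def calculate_some_stats_on_chunk(chunk):
    return {'rows': len(chunk), 'col2_sum': chunk['col2'].sum()}

chunk_stats_list = []
for chunk in pd.read_csv(data_file, chunksize=CHUNK):
    chunk_stats_list.append(calculate_some_stats_on_chunk(chunk))

# Combine the per-chunk results into dataset-level numbers
total_rows = sum(s['rows'] for s in chunk_stats_list)
col2_mean = sum(s['col2_sum'] for s in chunk_stats_list) / total_rows
print(total_rows, col2_mean)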
Tip 4: Maybe select a subset of columns and conduct EDA
The final tip in this article is what I actually used first to start exploring my data. I selected around 2,000 columns and found that 1 million rows by 2,000 columns fit in my RAM, so I did my EDA on this subset. (Yes, it was quite a bit of work, but I was able to automate it through some functions; a sketch of that idea follows the snippet below.)
import pandas as pd
data_file = '/path/to/my/datafile'
cols_i_wanna_use = ['col1', 'col2',]
df = pd.read_csv(data_file, usecols=cols_i_wanna_use)
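For what it is worth, here is a minimal sketch of how that column batching could be automated. This is not my exact code; the batch size of 2,000 is just the number mentioned above, and run_my_eda is a placeholder for whatever analysis you want to run. The idea is to read only the header first, then loop over slices of columns with usecols:

import pandas as pd

data_file = '/path/to/my/datafile'

# Read only the header row to get every column name without loading data
all_cols = pd.read_csv(data_file, nrows=0).columns.tolist()

def run_my_eda(df):
    # placeholder for whatever EDA you want to run on each column batch
    print(df.shape)

BATCH = 2000
for start in range(0, len(all_cols), BATCH):
    cols_batch = all_cols[start:start + BATCH]
    batch_df = pd.read_csv(data_file, usecols=cols_batch)
    run_my_eda(batch_df)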
Thank you everyone for reading. I will be updating these tips or writing new articles as I face new challenges in my work. Please upvote if the content was useful, and follow me for more such tidbits on pandas, machine learning, and deep learning.
The Git repo for the snippets can be found at github.com/Virajdatt.
You can get in contact with me on the following platforms:
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir