Pandas Equivalents in Pyspark
This year I have been using pyspark extensively at my job. It’s been nearly 6 months, and I have grown to love pyspark for its efficiency in handling millions of rows of data. When I started to learn pyspark I was a heavy pandas user and had done a lot of data wrangling with pandas. One thing I really missed in my early days of learning pyspark, when I just needed to get some quick work done, was a list/cheatsheet of pandas equivalents in pyspark. In this article I have compiled a few snippets that can help you with the transition and get some quick work done as you start your own pyspark journey.
0. Preliminary steps:
Quick imports and setup for our examples
from pyspark.sql import SparkSession
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import (
    StringType,
)

# Create the SparkSession first so it is available below
spark = SparkSession.builder.getOrCreate()

data_fr_df = [[1, 'python'], [2, 'C'], [3, 'C++'], [4, 'scala'], [5, 'python']]
# Create pandas DataFrame
simple_df = pd.DataFrame(data_fr_df, columns=['user_id', 'programming_language'])
# Create spark DataFrame with the same data
simple_sdf = spark.createDataFrame(simple_df)
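As a quick sanity check on the setup, you can peek at both dataframes; head() prints the first rows in pandas, while show() prints them in pyspark:
# Quick look at the data
## pandas
simple_df.head()
## pyspark
simple_sdf.show()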
1. Number of rows/records:
Use count() on the pyspark dataframe.
# Number of rows
## pandas
len(simple_df)
## pyspark
simple_sdf.count()
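A closely related check is the number of columns; both dataframes expose a columns list, so something like this should work in either library:
# Number of columns
## pandas
len(simple_df.columns)
## pyspark
len(simple_sdf.columns)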
2. List datatypes of columns
Use printSchema on the pyspark dataframe.
# Listing datatypes of columns
## pandas
simple_df.dtypes
## pyspark
simple_sdf.printSchema()
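If you prefer output that looks closer to pandas dtypes, pyspark dataframes also carry a dtypes attribute that returns (column, type) tuples:
## pyspark, pandas-style output
simple_sdf.dtypes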
3. Adding a new column
Use withColumn on the pyspark dataframe.
# Adding a new column
## pandas
simple_df['new_col'] = 'pandas_new_col'
simple_df.head()
## pyspark
simple_sdf = simple_sdf.withColumn('newColumnPyspark',
sf.lit('pyspark_new_col'))
simple_sdf.show()
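Literal values are only one case; a column derived from an existing column works the same way, just with a column expression instead of sf.lit. The column name user_id_plus_10 below is only for illustration:
# Deriving a column from an existing one
## pandas (assign returns a new dataframe without modifying simple_df)
simple_df.assign(user_id_plus_10=simple_df['user_id'] + 10)
## pyspark
simple_sdf.withColumn('user_id_plus_10', sf.col('user_id') + 10).show()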
4. value_counts
Use groupBy, followed by count, on the pyspark dataframe.
# value_counts
## pandas
simple_df['programming_language'].value_counts(dropna=False)
## pyspark
simple_sdf.groupBy('programming_language').count().orderBy('count').show()
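One small difference: pandas value_counts sorts the counts in descending order by default, so to mirror it more closely you can order the pyspark result by the count column descending:
## pyspark, descending like value_counts
simple_sdf.groupBy('programming_language').count().orderBy(sf.desc('count')).show()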
5. Sorting
Use orderBy on the pyspark dataframe.
# sorting
## pandas
simple_df.sort_values('user_id', ascending=False)
## pyspark
simple_sdf = simple_sdf.orderBy('user_id', ascending=False)
simple_sdf.show()
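Sorting by several columns, each in its own direction, follows the same pattern; a small sketch of both versions:
# Sorting by multiple columns
## pandas
simple_df.sort_values(['programming_language', 'user_id'], ascending=[True, False])
## pyspark
simple_sdf.orderBy(sf.col('programming_language').asc(), sf.col('user_id').desc()).show()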
6. Min and Max Values
Use sf.max and sf.min which are part of the pyspark.sql.functions module.
## pandas
print(f'pandas max value {simple_df["user_id"].max()}')
print(f'pandas min value {simple_df["user_id"].min()}')
## pyspark
# .show() only prints and returns None, so pull the value out with collect()
print(f'pyspark max value {simple_sdf.agg(sf.max("user_id")).collect()[0][0]}')
print(f'pyspark min value {simple_sdf.agg(sf.min("user_id")).collect()[0][0]}')
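You can also compute both aggregates in a single agg call and simply show the result, which is often more natural in pyspark; the aliases below are just illustrative names:
## pyspark, both at once
simple_sdf.agg(
    sf.min('user_id').alias('min_user_id'),
    sf.max('user_id').alias('max_user_id'),
).show()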
7. Data Filtering
Use the filter method on the pyspark dataframe.
## pandas
simple_df[simple_df['user_id']==3]
## pyspark
simple_sdf.filter(sf.col('user_id')==3).show()
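Combining conditions works much like pandas boolean indexing: wrap each condition in parentheses and join them with & or |, and use isin for membership tests. A short sketch:
# Multiple conditions
## pandas
simple_df[(simple_df['user_id'] > 2) & (simple_df['programming_language'] == 'python')]
## pyspark
simple_sdf.filter((sf.col('user_id') > 2) & (sf.col('programming_language') == 'python')).show()
# membership test
simple_sdf.filter(sf.col('programming_language').isin('python', 'scala')).show()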
8. Calculating Nulls/Nans values
Unlike the pandas isnull() function, pyspark treats nulls and NaNs differently, so they need to be counted separately. We will use isnull and isnan from pyspark.sql.functions.
## pandas
simple_df.isnull().sum()
## pyspark
# nans
simple_sdf.select([sf.count(sf.when(sf.isnan(c), c))\
.alias(c) for c in simple_sdf.columns]).show()
# nulls
simple_sdf.select([sf.count(sf.when(sf.isnull(c), c))\
.alias(c) for c in simple_sdf.columns]).show()
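If a single null-or-NaN count per column is enough, the two checks can be combined with | inside the same comprehension; a small variation on the snippet above:
# nulls and nans in one pass
simple_sdf.select([sf.count(sf.when(sf.isnull(c) | sf.isnan(c), c))\
    .alias(c) for c in simple_sdf.columns]).show()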
9. Type Casting
Use cast together with withColumn on the pyspark dataframe.
## pandas
simple_df['user_id'] = simple_df['user_id'].astype('str')
## pyspark
cast_column = 'user_id'
simple_sdf = simple_sdf.withColumn(cast_column, simple_sdf[cast_column]\
.cast(StringType()))
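To confirm the cast took effect, you can check the types again on both sides, using the same calls as in section 2:
# Verify the new types
## pandas
simple_df.dtypes
## pyspark
simple_sdf.printSchema()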
Thank you for reading; that’s all for this article. More content to follow. Please clap if the article was helpful to you, and comment if you have any questions. If you want to connect with me, learn and grow with me, or collaborate, you can reach me at any of the following:
LinkedIn:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir
:)