Pandas Equivalents in Pyspark

Virajdatt Kohir
3 min read · Jul 13, 2023


This year I have been using pyspark extensively at my job. It’s been nearly six months, and I have grown to love pyspark for its efficiency in handling millions of rows of data. When I started learning pyspark I was a heavy pandas user and had done a lot of data wrangling with pandas. One thing I really missed in my early days of learning pyspark was a list/cheatsheet of pandas equivalents for getting quick work done. In this article I compiled a few snippets that can help you with the transition as you start your own pyspark journey.

0. Preliminary steps:

Quick imports and setup for our examples

from pyspark.sql import SparkSession
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import (
StringType,
)

data_fr_df = [[1, 'python'], [2, 'C'], [3, 'C++'], [4, 'scala'], [5, 'python']]

# Create pandas Dataframe
simple_df = pd.DataFrame(data_fr_df, columns=['user_id', 'programming_language'])


# Create spark Dataframe with the same data
# (the session must exist before createDataFrame is called)
spark = SparkSession.builder.getOrCreate()
simple_sdf = spark.createDataFrame(simple_df)

1. Number of rows/records:

Use count from the pyspark dataframe.

# Number of rows
## pandas
len(simple_df)

## pyspark
simple_sdf.count()

2. List datatypes of columns

Use printSchema from the pyspark dataframe.

# Listing datatypes of columns

## pandas
simple_df.dtypes

## pyspark
simple_sdf.printSchema()

3. Adding a new column

Use withColumn from pyspark dataframe

# Adding a new column

## pandas

simple_df['new_col'] = 'pandas_new_col'
simple_df.head()

## pyspark
simple_sdf = simple_sdf.withColumn('newColumnPyspark',
sf.lit('pyspark_new_col'))
simple_sdf.show()

4. value_counts

Use groupBy, followed by count, from the pyspark dataframe.

# value_counts

## pandas
simple_df['programming_language'].value_counts(dropna=False)

## pyspark
simple_sdf.groupBy('programming_language').count().orderBy('count').show()

5. Sorting

Use orderBy from the pyspark dataframe.

# sorting
## pandas
simple_df.sort_values('user_id', ascending=False)

## pyspark
simple_sdf = simple_sdf.orderBy('user_id', ascending=False)
simple_sdf.show()

6. Min and Max Values

Use sf.max and sf.min, which are part of the pyspark.sql.functions module.

## pandas
print(f'pandas max value {simple_df["user_id"].max()}')
print(f'pandas min value {simple_df["user_id"].min()}')

## pyspark
# .show() returns None, so collect the aggregate to get a scalar
max_val = simple_sdf.agg(sf.max('user_id')).collect()[0][0]
min_val = simple_sdf.agg(sf.min('user_id')).collect()[0][0]
print(f'pyspark max value {max_val}')
print(f'pyspark min value {min_val}')

7. Data Filtering

Use the filter method from the pyspark dataframe.

## pandas
simple_df[simple_df['user_id']==3]

## pyspark
simple_sdf.filter(sf.col('user_id')==3).show()

8. Calculating Nulls/Nans values

Unlike the pandas isnull() function, pyspark treats nulls and NaNs differently, so they need to be counted separately. We will use isnull and isnan from pyspark.sql.functions.

## pandas 
simple_df.isnull().sum()


## pyspark

# nans
simple_sdf.select([sf.count(sf.when(sf.isnan(c), c))\
.alias(c) for c in simple_sdf.columns]).show()

# nulls
simple_sdf.select([sf.count(sf.when(sf.isnull(c), c))\
.alias(c) for c in simple_sdf.columns]).show()

9. Type Casting

## pandas
simple_df['user_id'] = simple_df['user_id'].astype('str')

## pyspark
cast_column = 'user_id'
simple_sdf = simple_sdf.withColumn(cast_column, simple_sdf[cast_column]\
.cast(StringType()))

Thank you for reading, that’s all for this article. More content to follow. Please clap if the article was helpful to you and comment if you have any questions. If you want to connect with me, learn and grow with me or collaborate you can reach me at any of the following:

Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir


