Hi PySpark lovers! In this article, you will learn everything about the PySpark sort functions, with examples showing how to sort specific columns of a DataFrame.
Throughout this article, we will explore the various sort functions you can apply to a PySpark DataFrame column, including sorting a DataFrame by multiple columns.
There are six sort functions available in PySpark that you can use to sort the columns of a PySpark DataFrame in ascending or descending order.
What is the PySpark sort function?
Sort functions in PySpark are defined inside the pyspark.sql.functions module, which ships with PySpark, so you just need to import them. Each sort function takes a column as a parameter and returns a sort expression.
Here are all the sort functions you can use to sort a PySpark DataFrame column.
- asc(): Sorts the given column in ascending order. It takes a column name as a parameter and returns a sort expression based on the ascending order of that column.
- asc_nulls_first(): Also sorts the given column in ascending order, but returns null values before non-null values.
- asc_nulls_last(): Also sorts the given column in ascending order, but returns non-null values before null values.
- desc(): Sorts the passed column in descending order.
- desc_nulls_first(): Sorts the passed column in descending order, with null values coming before non-null values.
- desc_nulls_last(): Sorts the passed column in descending order, returning non-null values before null values.
Why do we need sort functions in PySpark?
It all depends on your requirements. When you are working on a PySpark application and, after applying all your transformations, you generate a final DataFrame, you may want to sort one or more columns in a specific order, ascending or descending. The sort functions let you do exactly that, for string columns as well as numeric columns.
Now, let’s see how to use the sort functions on a PySpark DataFrame. We will apply each of them in turn, so first of all we need to create a DataFrame with some records.
Creating PySpark DataFrame
To create a DataFrame in PySpark, we need to import some resources from the PySpark library: the SparkSession class from the pyspark.sql module, which is used to create a Spark session, the entry point of our application.
Then we use the createDataFrame() method of the Spark session to create a new PySpark DataFrame. I am not going to say more about the createDataFrame() method here because I have already written a detailed article about how to create a PySpark DataFrame.
I have created a new PySpark DataFrame using the code below.
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last, col

data = [
    ('Rambo', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Harshita', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    (None, 'Senior Marketing Expert', 'IT', None),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    (None, 'HR Executive', 'HR', None),
]
columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating DataFrame
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
After creating the PySpark DataFrame, it will look like this.
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|Rambo |Developer |IT |33000 |
|John |Developer |IT |40000 |
|Harshita|HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
|null |Senior Marketing Expert|IT |null |
|Harry |SEO Analyst |Marketing |33000 |
|Shital |HR Executive |HR |25000 |
|null |HR Executive |HR |null |
+--------+-----------------------+----------+------+
Now that we have created the PySpark DataFrame successfully, it’s time to apply all the sort functions mentioned above.
How to use the sort functions on a PySpark DataFrame?
As we know, all the sort functions take a column as a parameter and return a sort expression, so we need to use the DataFrame orderBy() method to apply that sort expression to the DataFrame column.
Let’s see all the functions one by one with the help of examples.
PySpark asc(col) sorting function
The asc() sort function accepts a col parameter that represents the column you want to sort on and returns a sort expression based on the ascending order of that column. Note that in ascending order Spark places null values first by default, as the output below shows. Here, I am sorting the name column.
df.orderBy(asc(col("name"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|null |Senior Marketing Expert|IT |null |
|null |HR Executive |HR |null |
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Shital |HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
+--------+-----------------------+----------+------+
PySpark asc_nulls_first(col) sorting function
This function also accepts the col parameter, which represents a column of the PySpark DataFrame. It also sorts the column in ascending order, but explicitly returns null values before non-null values.
df.orderBy(asc_nulls_first(col("name"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|null |HR Executive |HR |null |
|null |Senior Marketing Expert|IT |null |
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Shital |HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
+--------+-----------------------+----------+------+
PySpark asc_nulls_last(col) sorting function
This function also accepts the col parameter, which represents a column of the PySpark DataFrame. It also sorts the column in ascending order, but returns non-null values before null values.
df.orderBy(asc_nulls_last(col("name"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Shital |HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
|null |Senior Marketing Expert|IT |null |
|null |HR Executive |HR |null |
+--------+-----------------------+----------+------+
PySpark desc(col) function
The desc() function works the opposite of asc(): asc() sorts the column in ascending order, whereas desc() sorts the passed column in descending order. It also takes a col parameter referring to a DataFrame column. Note that in descending order Spark places null values last by default, as shown below.
Here, I am applying the desc() function to the salary column of the DataFrame.
df.orderBy(desc(col("salary"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|Vanshika|Senior HR Manager |HR |50000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|Shital |HR Executive |HR |25000 |
|null |HR Executive |HR |null |
|null |Senior Marketing Expert|IT |null |
+--------+-----------------------+----------+------+
PySpark desc_nulls_first() sorting function
The desc_nulls_first() function accepts a column name and returns a sort expression of descending order for that column, returning null values before non-null values.
df.orderBy(desc_nulls_first(col("salary"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|null |HR Executive |HR |null |
|null |Senior Marketing Expert|IT |null |
|Vanshika|Senior HR Manager |HR |50000 |
|John |Developer |IT |40000 |
|Harry |SEO Analyst |Marketing |33000 |
|Rambo |Developer |IT |33000 |
|Harshita|HR Executive |HR |25000 |
|Shital |HR Executive |HR |25000 |
+--------+-----------------------+----------+------+
PySpark desc_nulls_last() sorting function
The desc_nulls_last() function accepts a column name and returns a sort expression of descending order for that column, returning non-null values before null values.
df.orderBy(desc_nulls_last(col("salary"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|Vanshika|Senior HR Manager |HR |50000 |
|John |Developer |IT |40000 |
|Harry |SEO Analyst |Marketing |33000 |
|Rambo |Developer |IT |33000 |
|Harshita|HR Executive |HR |25000 |
|Shital |HR Executive |HR |25000 |
|null |Senior Marketing Expert|IT |null |
|null |HR Executive |HR |null |
+--------+-----------------------+----------+------+
PySpark sorting by multiple columns
So far, we have applied sort functions to only a single column, but we can also apply PySpark sort functions to multiple columns. Let’s see how we can do that.
To sort multiple columns of a PySpark DataFrame, pass multiple sort expressions to the orderBy() method, like this.
df.orderBy(asc(col("name")), asc(col("salary"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|null |HR Executive |HR |null |
|null |Senior Marketing Expert|IT |null |
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Shital |HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
+--------+-----------------------+----------+------+
It is not mandatory to apply the same function to each column; you can apply any sort function as per your requirement, as you can see below.
df.orderBy(asc_nulls_last(col("name")), desc(col("department"))).show(truncate=False)
Output
+--------+-----------------------+----------+------+
|name |designation |department|salary|
+--------+-----------------------+----------+------+
|Harry |SEO Analyst |Marketing |33000 |
|Harshita|HR Executive |HR |25000 |
|John |Developer |IT |40000 |
|Rambo |Developer |IT |33000 |
|Shital |HR Executive |HR |25000 |
|Vanshika|Senior HR Manager |HR |50000 |
|null |Senior Marketing Expert|IT |null |
|null |HR Executive |HR |null |
+--------+-----------------------+----------+------+
So, we have covered all the sort functions along with examples. You can find the complete code below.
Complete Source Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last, col

data = [
    ('Rambo', 'Developer', 'IT', 33000),
    ('John', 'Developer', 'IT', 40000),
    ('Harshita', 'HR Executive', 'HR', 25000),
    ('Vanshika', 'Senior HR Manager', 'HR', 50000),
    (None, 'Senior Marketing Expert', 'IT', None),
    ('Harry', 'SEO Analyst', 'Marketing', 33000),
    ('Shital', 'HR Executive', 'HR', 25000),
    (None, 'HR Executive', 'HR', None),
]
columns = ['name', 'designation', 'department', 'salary']

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating DataFrame
df = spark.createDataFrame(data, columns)

# asc
df.orderBy(asc(col("name"))).show(truncate=False)

# asc_nulls_first
df.orderBy(asc_nulls_first(col("name"))).show(truncate=False)

# asc_nulls_last
df.orderBy(asc_nulls_last(col("name"))).show(truncate=False)

# desc
df.orderBy(desc(col("salary"))).show(truncate=False)

# desc_nulls_first
df.orderBy(desc_nulls_first(col("salary"))).show(truncate=False)

# desc_nulls_last
df.orderBy(desc_nulls_last(col("salary"))).show(truncate=False)

# multiple columns: asc and asc
df.orderBy(asc(col("name")), asc(col("salary"))).show(truncate=False)

# multiple columns: asc_nulls_last and desc
df.orderBy(asc_nulls_last(col("name")), desc(col("department"))).show(truncate=False)
Summary
I hope the process of applying sort functions to a PySpark DataFrame was easy and straightforward. Now you can sort any column of a DataFrame using the PySpark sort functions according to your requirements, whether a single column or multiple columns together, as you can see above.
If you found this article helpful, please share it and keep visiting for further PySpark tutorials.
If you have any queries regarding this article, please let us know by mail.
Have a nice day!