In this article, we will see how to find the Nth highest salary using PySpark with the help of the examples. Throughout this article, we will explore two things, Nth highest salary in all the data and the Nth highest salary within a group The group can be anything like department, country, gender, etc.
This is one of the asked questions by interviewers in Python developer, Data Scientist, or Data Engineering interviews as a data engineer I have got this question many times in interviews.
I have prepared small datasets with some records as you can see, in Real-life applications, you might have large datasets but as of now, we are about to find Nth highest salary using PySpark using this dataset.
As you can see in the dataset, there are multiple columns, First, we will get the Nth highest salary in the whole dataset and second we will see the Nth highest salary in each department.
Here Nth means any number you want to for I am about to find out 2nd highest salaried employee.
Let’s start.
Headings of Contents
Find the Nth Highest Salary Using PySpark
Let’s find out the Nth highest salary by considering the department or without considering the department.
Without Department
from pyspark.sql import SparkSession from pyspark.sql.window import Window from pyspark.sql.functions import desc, row_number # creating spark session spark = ( SparkSession.builder.master("local[*]") .appName("www.programmingfunda.com") .getOrCreate() ) # creating DataFrame from csv file df = spark.read.option("header", "true").csv("./sample_data.csv") # creating a window specification # windowFunction = Window.orderBy(desc("salary")) # applying window function df = df.withColumn("rank", row_number().over(windowFunction)) # getting 2nd highest salaried employee df = df.filter(df["rank"] == 2) df.show()
Output:
+----------+---------+------+----------+---+----------+---+------+----+
|first_name|last_name|gender|department|age| date| id|salary|rank|
+----------+---------+------+----------+---+----------+---+------+----+
| Vishvajit| Rao| Male| Testing| 24|2023-04-10|232| 35000| 2|
+----------+---------+------+----------+---+----------+---+------+----+
In the Output, Vishvajit is an employee who is second in salary among all the employees.
Let’s move on to the second, where we will see Nth highest salary in each department in PySpark.
By Department
Here we will see the second-highest salary in each department, we have three departments dataset Testing, Marketing, and IT.
from pyspark.sql import SparkSession from pyspark.sql.window import Window from pyspark.sql.functions import desc, row_number # creating spark session spark = ( SparkSession.builder.master("local[*]") .appName("www.programmingfunda.com") .getOrCreate() ) # creating DataFrame from csv file df = spark.read.option("header", "true").csv("./sample_data.csv") # creating a window specification windowFunction = Window.partitionBy("department").orderBy(desc("salary")) # applying window function df = df.withColumn("rank", row_number().over(windowFunction)) # getting 2nd highest salaried employee df = df.filter(df["rank"] == 2) df.show()
Output
+----------+---------+------+----------+---+----------+----+------+----+
|first_name|last_name|gender|department|age| date| id|salary|rank|
+----------+---------+------+----------+---+----------+----+------+----+
| Kathleen| Hanner|Female| IT| 25|2017-10-15|1876| 29000| 2|
| Dulce| Abril|Female| Marketing| 32|2017-10-15|1562| 28750| 2|
| Vishvajit| Rao| Male| Testing| 24|2023-04-10| 232| 35000| 2|
+----------+---------+------+----------+---+----------+----+------+----+
This is how you can get nth highest salary in a specific group or in the whole dataset.
See Also:
- PySpark Column Class with Examples
- PySpark Sort Function with Examples
- PySpark col() Function with Examples
- How to read CSV files using PySpark
- How to Format a String in PySpark DataFrame using Column Values
- How to Mask Card Number in PySpark DataFrame
- How to Remove Time Part from PySpark DateTime Column
- How to Explode Multiple Columns in PySpark DataFrame
- PySpark Normal Built-in Functions
- PySpark SQL DateTime Functions with Examples
- PySpark SQL String Functions with Examples
Conclusion
Throughout this article, we have explored the nth highest salary using PySpark with the help of the example. Now you can easily think kind of the questions. You can use any column instead of the department, for example, nth highest salary in country, gender, etc.
If you found this article helpful, please share and keep visiting for further PySpark tutorials.
Happy Coding!