In this article, you will learn how to drop one or multiple columns from a PySpark DataFrame with the help of examples. In real-life projects, we sometimes need to delete columns from a PySpark DataFrame.
PySpark DataFrame has a method called drop() that is used to delete columns from a DataFrame. Throughout this article, we will use it to delete a single column as well as multiple columns, so that by the end you should have no confusion about how to drop columns in a PySpark DataFrame.
To apply the drop() method, we first need a PySpark DataFrame. Let me create a simple one just for demonstration. You can skip this part if you already have a PySpark DataFrame.
Create PySpark DataFrame
To create the PySpark DataFrame, I have prepared a list of tuples. Each tuple contains information about one student: first_name, last_name, course, marks, roll_number, and admission_date. All values, including marks and roll_number, are stored as strings.
Code to create PySpark DataFrame:
from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# displaying dataframe
student_dataframe.show()
After successful execution of the above code, the created DataFrame will look like this.
+----------+----------+------+-------+-----------+--------------+
|first_name| last_name|course|  marks|roll_number|admission_date|
+----------+----------+------+-------+-----------+--------------+
|    Pankaj|     Kumar| BTech|1550.50|        101|    2022-12-20|
|      Hari|    Sharma|   BCA|1400.00|        102|    2018-03-12|
|   Anshika|    Kumari|   MCA|1450.00|        103|    2029-05-19|
|  Shantanu|     Saini|   BSc|1350.50|        104|    2019-08-20|
|  Avantika|Srivastava|  BCom|1350.00|        105|    2020-10-21|
|       Jay|     Kumar| BTech|1540.00|        106|    2019-08-29|
|     Vinay|     Singh|   BCA|1480.50|        107|    2017-09-17|
+----------+----------+------+-------+-----------+--------------+
Let me explain the above code step by step.
- First, I imported SparkSession from the pyspark.sql module.
- I prepared a list of Python tuples, where each tuple contains information about a student: first_name, last_name, course, marks, roll_number, and admission_date.
- I created a Python list containing the column names for the PySpark DataFrame.
- I created a Spark session using SparkSession.builder.appName("testing").getOrCreate(), because the Spark session is the entry point of any Spark application.
- I used the createDataFrame() method of the Spark session, passing the list of tuples and the column names, to create the PySpark DataFrame.
- Finally, I displayed the created PySpark DataFrame using the show() method.
Now it’s time to explore the PySpark DataFrame drop() method in order to drop one or multiple columns from a PySpark DataFrame.
PySpark DataFrame drop() Method
The drop() method is a PySpark DataFrame method that is responsible for dropping columns from a DataFrame. It takes one or more column names as parameters and returns a new DataFrame without those columns. You can pass a single column or multiple columns. It’s up to you.
In the following sections, we will drop a single column, multiple columns, and all columns from the PySpark DataFrame.
Drop one or Multiple columns from PySpark DataFrame
We will see multiple ways to drop one or more columns from PySpark DataFrame with the help of the examples.
PySpark DataFrame drop Single Column
To drop a single column, we have to pass the column name to the drop() method. The drop() method does not change the existing DataFrame; it returns a new DataFrame without the passed column. In this example, I am going to drop the admission_date column.
from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping admission_date column
student_dataframe2 = student_dataframe.drop('admission_date')

# displaying dataframe
student_dataframe2.show()
After dropping the admission_date column, your DataFrame will look like this.
+----------+----------+------+-------+-----------+
|first_name| last_name|course|  marks|roll_number|
+----------+----------+------+-------+-----------+
|    Pankaj|     Kumar| BTech|1550.50|        101|
|      Hari|    Sharma|   BCA|1400.00|        102|
|   Anshika|    Kumari|   MCA|1450.00|        103|
|  Shantanu|     Saini|   BSc|1350.50|        104|
|  Avantika|Srivastava|  BCom|1350.00|        105|
|       Jay|     Kumar| BTech|1540.00|        106|
|     Vinay|     Singh|   BCA|1480.50|        107|
+----------+----------+------+-------+-----------+
PySpark DataFrame Drop Multiple Columns:
To drop multiple columns, we have to pass multiple column names to the drop() method, separated by commas. The drop() method will drop all the passed columns from the existing DataFrame and return a new one.
I am about to drop the course, marks, and admission_date columns from the PySpark DataFrame.
from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping multiple columns
student_dataframe2 = student_dataframe.drop('course', 'marks', 'admission_date')

# displaying dataframe
student_dataframe2.show()
After dropping the above columns the new PySpark DataFrame will be like this:
+----------+----------+-----------+
|first_name| last_name|roll_number|
+----------+----------+-----------+
|    Pankaj|     Kumar|        101|
|      Hari|    Sharma|        102|
|   Anshika|    Kumari|        103|
|  Shantanu|     Saini|        104|
|  Avantika|Srivastava|        105|
|       Jay|     Kumar|        106|
|     Vinay|     Singh|        107|
+----------+----------+-----------+
PySpark DataFrame Drop All Columns:
To drop all columns from a PySpark DataFrame, you have to pass all the column names to the drop() method. In the example below, *columns unpacks the list of column names into separate arguments. The drop() method will return a DataFrame with no columns after dropping all of them.
from pyspark.sql import SparkSession

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# dropping all columns
student_dataframe2 = student_dataframe.drop(*columns)

# displaying dataframe
student_dataframe2.show()
The new blank PySpark DataFrame looks something like this.
++
||
++
||
||
||
||
||
||
||
++
Using drop() method with col() function to drop columns
We can also pass the column to drop() as a Column object created by the col() function, which represents a column of a PySpark DataFrame. For example, I am about to drop last_name from the PySpark DataFrame. To drop last_name, I will first pass last_name to the col() function and then pass the result to the drop() method. To use the col() function, you have to import it from the pyspark.sql.functions module.
Let’s see.
Example: Using drop() and col() functions to drop single column
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

data = [
    ("Pankaj", "Kumar", "BTech", "1550.50", "101", "2022-12-20"),
    ("Hari", "Sharma", "BCA", "1400.00", "102", "2018-03-12"),
    ("Anshika", "Kumari", "MCA", "1450.00", "103", "2029-05-19"),
    ("Shantanu", "Saini", "BSc", "1350.50", "104", "2019-08-20"),
    ("Avantika", "Srivastava", "BCom", "1350.00", "105", "2020-10-21"),
    ("Jay", "Kumar", "BTech", "1540.00", "106", "2019-08-29"),
    ("Vinay", "Singh", "BCA", "1480.50", "107", "2017-09-17"),
]

columns = [
    "first_name",
    "last_name",
    "course",
    "marks",
    "roll_number",
    "admission_date",
]

# creating spark session
spark = SparkSession.builder.appName("testing").getOrCreate()

# creating dataframe
student_dataframe = spark.createDataFrame(data, columns)

# using col() method
student_dataframe2 = student_dataframe.drop(col("last_name"))
student_dataframe2.show()
Output
+----------+------+-------+-----------+--------------+
|first_name|course|  marks|roll_number|admission_date|
+----------+------+-------+-----------+--------------+
|    Pankaj| BTech|1550.50|        101|    2022-12-20|
|      Hari|   BCA|1400.00|        102|    2018-03-12|
|   Anshika|   MCA|1450.00|        103|    2029-05-19|
|  Shantanu|   BSc|1350.50|        104|    2019-08-20|
|  Avantika|  BCom|1350.00|        105|    2020-10-21|
|       Jay| BTech|1540.00|        106|    2019-08-29|
|     Vinay|   BCA|1480.50|        107|    2017-09-17|
+----------+------+-------+-----------+--------------+
Conclusion
I hope this tutorial was helpful and easy to understand. Throughout this article, we have seen how to drop one, multiple, or all columns from a PySpark DataFrame with the help of examples, by passing either column names or a col() expression to the drop() method. You can use whichever approach fits your project requirements.
If you found this article helpful, please share it and keep visiting for further PySpark tutorials.
Have a great day!