In this article, we are going to see how to remove the time part from a PySpark DateTime column with the help of examples. This question may come up in data engineering interviews, which is why it is worth knowing for PySpark developers and data engineers.
To remove the time part from a PySpark DateTime column, we first need a PySpark DataFrame with a column that holds both a date and a time part.
Let’s create a PySpark DataFrame along with a DateTime column.
Remove Time Part from PySpark DateTime Column
Here, I have created a sample PySpark DataFrame with a dob column, which holds each student's date of birth as a DateTime string.
from pyspark.sql import SparkSession

# list of tuples
data = [
    ("1", "Vishvajit", "Rao", "2021-01-12 04:30:20"),
    ("2", "Harsh", "Goal", "2020-04-10 04:40:20"),
    ("3", "Pankaj", "Kumar", "2019-08-09 04:35:50"),
    ("4", "Pranjal", "Rao", "2013-11-12 02:11:20"),
    ("5", "Ritika", "Kumari", "2017-04-07 05:36:10"),
    ("6", "Diyanshu", "Saini", "2018-06-12 03:34:55"),
]

# columns
column_names = ["id", "first_name", "last_name", "dob"]

# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.createDataFrame(data=data, schema=column_names)
df.show(truncate=False)
Output:
+---+----------+---------+-------------------+
|id |first_name|last_name|dob                |
+---+----------+---------+-------------------+
|1  |Vishvajit |Rao      |2021-01-12 04:30:20|
|2  |Harsh     |Goal     |2020-04-10 04:40:20|
|3  |Pankaj    |Kumar    |2019-08-09 04:35:50|
|4  |Pranjal   |Rao      |2013-11-12 02:11:20|
|5  |Ritika    |Kumari   |2017-04-07 05:36:10|
|6  |Diyanshu  |Saini    |2018-06-12 03:34:55|
+---+----------+---------+-------------------+
We have successfully created the PySpark DataFrame. Now it's time to remove the time part from the dob column.
To do this, follow the steps below.
1. Convert string DateTime to a timestamp column
new_df = df.withColumn("dob_col", to_timestamp('dob', 'yyyy-MM-dd HH:mm:ss'))
2. Use the PySpark to_date() function to extract the date part from the timestamp column created in the previous step.
new_df = new_df.withColumn("dob_col", to_date('dob_col', 'yyyy-MM-dd'))
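3. Delete or replace the old column called 'dob'. Here I am dropping the old column. Note that drop() takes the column name as a positional argument and returns a new DataFrame, so the result has to be assigned back.
new_df = new_df.drop('dob')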
Note: To use the to_timestamp() and to_date() functions, you have to import them from the pyspark.sql.functions module.
And finally, your result will be:
+---+----------+---------+----------+
|id |first_name|last_name|dob_col   |
+---+----------+---------+----------+
|1  |Vishvajit |Rao      |2021-01-12|
|2  |Harsh     |Goal     |2020-04-10|
|3  |Pankaj    |Kumar    |2019-08-09|
|4  |Pranjal   |Rao      |2013-11-12|
|5  |Ritika    |Kumari   |2017-04-07|
|6  |Diyanshu  |Saini    |2018-06-12|
+---+----------+---------+----------+
Complete Source Code
Here is the complete source code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp

# list of tuples
data = [
    ("1", "Vishvajit", "Rao", "2021-01-12 04:30:20"),
    ("2", "Harsh", "Goal", "2020-04-10 04:40:20"),
    ("3", "Pankaj", "Kumar", "2019-08-09 04:35:50"),
    ("4", "Pranjal", "Rao", "2013-11-12 02:11:20"),
    ("5", "Ritika", "Kumari", "2017-04-07 05:36:10"),
    ("6", "Diyanshu", "Saini", "2018-06-12 03:34:55"),
]

# columns
column_names = ["id", "first_name", "last_name", "dob"]

# creating spark session
spark = (
    SparkSession.builder.master("local[*]")
    .appName("www.programmingfunda.com")
    .getOrCreate()
)

# creating DataFrame
df = spark.createDataFrame(data=data, schema=column_names)

new_df = df.withColumn("dob_col", to_timestamp('dob', 'yyyy-MM-dd HH:mm:ss'))
new_df = new_df.withColumn("dob_col", to_date('dob_col', 'yyyy-MM-dd'))
new_df = new_df.drop('dob')
new_df.show(truncate=False)
This is how you can extract the date from a PySpark DataFrame DateTime column.
Helpful PySpark Articles
- PySpark Normal Built-in Functions
- PySpark SQL DateTime Functions with Examples
- PySpark SQL String Functions with Examples
- Merge Two DataFrames in PySpark with Different Column Names
- How to Fill Null Values in PySpark DataFrame
- How to Drop Duplicate Rows from PySpark DataFrame
- PySpark DataFrame Tutorial for Beginners
- PySpark Column Class with Examples
- PySpark Sort Function with Examples
- PySpark col() Function with Examples
- How to read CSV files using PySpark
- How to Explode Multiple Columns in PySpark DataFrame
- How to Count Null and NaN Values in Each Column in PySpark DataFrame?
- Merge Two DataFrames in PySpark with Same Column Names
- How to Apply groupBy in Pyspark DataFrame
- How to Change DataType of Column in PySpark DataFrame
- Drop One or Multiple columns from PySpark DataFrame
- How to Convert PySpark DataFrame to JSON ( 3 Ways )
- How to Write PySpark DataFrame to CSV
- How to Convert PySpark DataFrame Column to List
- How to convert PySpark DataFrame to RDD
- How to convert PySpark Row To Dictionary
Conclusion
So, in this article, we have seen how to remove the time part from a PySpark DateTime column with the help of examples. This is an important question from the interview point of view.
As a Data Engineer or PySpark Developer, you should know how to answer it.
To revisit questions like this one, bookmark this site. If you found this article helpful, please share it and keep visiting for further PySpark tutorials.
Thanks for reading 🙏🙏