Hi PySpark lovers! In this PySpark temp view tutorial, we are going to walk through the complete process of creating a temp view in PySpark so that you can run SQL queries on top of it.
If you are familiar with SQL tables, PySpark temp views will be easy for you: a temp view behaves just like an SQL table, and you can run raw SQL queries against it to analyze your data. As Data Engineers and Data Analysts, it is our job to analyze data correctly and efficiently so that we can make better decisions.
To create a PySpark temp view, you first need a PySpark DataFrame. If you don't know how to create a DataFrame in PySpark, I have already written an article about reading CSV data into a PySpark DataFrame, so you can learn from there.
Now, let's see what a temp view is in PySpark and walk through the process of creating a PySpark temporary table.
What is a Temp View in PySpark?
In PySpark, a view (or virtual table) works just like a real table in SQL: it lets you run raw and complex SQL queries on top of the virtual table. PySpark provides several DataFrame methods for creating a temp view, all of which are listed below.
createTempView:- A DataFrame method used to create a temporary view. It is session-scoped, meaning the view exists only within the current Spark session and is dropped when that session terminates. If a view with the same name already exists, it throws an error.
createOrReplaceTempView:- The same as createTempView, with one difference: if the view already exists, it is replaced instead of raising an error.
createGlobalTempView:- Creates a global temporary view of the DataFrame that is accessible across all Spark sessions within the same Spark application. A global view is dropped only when the application terminates, and its name must be prefixed with global_temp when queried.
createOrReplaceGlobalTempView:- The same as createGlobalTempView, except that it replaces the view if it already exists.
For easier understanding, see the table below; a short code sketch of the global-view variants follows it.
| Method | Scope | If the view exists | Prefix |
|---|---|---|---|
| createTempView | Session-scoped | Throws an error | – |
| createOrReplaceTempView | Session-scoped | Replaces the view | – |
| createGlobalTempView | Application-scoped | Throws an error | global_temp |
| createOrReplaceGlobalTempView | Application-scoped | Replaces the view | global_temp |
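To see the global variants in action, here is a minimal sketch (the view name employees is my own example, and df is assumed to be an existing DataFrame); note the global_temp prefix in the queries:

```python
# A global temp view lives in the reserved `global_temp` database and is
# visible to every Spark session within the same application.
df.createOrReplaceGlobalTempView("employees")

# Query it from the current session...
spark.sql("SELECT * FROM global_temp.employees").show()

# ...and from a brand-new session of the same application.
spark.newSession().sql("SELECT * FROM global_temp.employees").show()
```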
Throughout this article, we will focus on the createTempView and createOrReplaceTempView methods. Before using them, we need a PySpark DataFrame, so let's create a sample one.
Creating PySpark DataFrame
To create the PySpark DataFrame, I have prepared a simple CSV file and loaded it into a DataFrame. For more details, check out this tutorial:- How to load CSV file into PySpark DataFrame.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Programming Funda").getOrCreate()

# Read the CSV file into a DataFrame, treating the first row as the
# header and letting Spark infer the column types
df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .format("csv")
    .load("sample_data.csv")
)

# Display the DataFrame
df.show()
```
After executing the above code a PySpark DataFrame will be created which will look like this.
+----------+---------+------+----------+---+----------+----+------+
|first_name|last_name|gender|department|age|      date|  id|salary|
+----------+---------+------+----------+---+----------+----+------+
|     Dulce|    Abril|Female|        IT| 32|2017-10-15|1562| 34500|
|      Mara|Hashimoto|Female|   Testing| 25|2016-08-16|1582| 54000|
|    Philip|     Gent|  Male|        IT| 36|2015-05-21|2587| 18700|
|  Kathleen|   Hanner|Female| Marketing| 25|2017-10-15|1876| 24000|
|      Mara|Hashimoto|Female|   Testing| 25|2016-08-16|1582| 31000|
|  Kathleen|   Hanner|Female|        IT| 25|2017-10-15|1876| 29000|
| Vishvajit|      Rao|  Male|   Testing| 24|2023-04-10| 232| 35000|
|      Ajay|    Kumar|  Male| Marketing| 27|2018-04-10|1234| 34560|
|     Dulce|    Abril|Female| Marketing| 32|2017-10-15|1562| 28750|
|     Harsh|    Kumar|  Male| Marketing| 26|2022-04-10|1232| 12560|
+----------+---------+------+----------+---+----------+----+------+
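Since we enabled inferSchema, you can optionally confirm the column types Spark inferred before writing queries (the exact types depend on your CSV data):

```python
# Print the schema inferred from the CSV file
df.printSchema()
```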
Now, let's create a PySpark temporary table.
To create a PySpark temp view, you just need to add one simple line to your code, and your PySpark temporary table (view) will be created:
```python
# Create a temp view that can be queried like an SQL table
df.createTempView('SampleStudentData')
```
But remember one thing: if you try to create a temp view with a name that already exists in the same Spark session, you will get an error like this:
pyspark.errors.exceptions.captured.AnalysisException: [TEMP_TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create the temporary view `SampleStudentData` because it already exists.
To fix this problem, we have two options: provide a different name each time, or use the second method, createOrReplaceTempView.
I recommend always going with the createOrReplaceTempView method.
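That said, if you ever need to stick with createTempView, a minimal workaround (my own sketch, not part of the original code) is to drop any existing view with the same name first:

```python
# Drop the view if it exists (returns True if a view was actually dropped),
# then recreate it without risking an AnalysisException
spark.catalog.dropTempView('SampleStudentData')
df.createTempView('SampleStudentData')
```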
The process of using the createOrReplaceTempView method is the same as above.
```python
# Create the temp view, replacing it if it already exists
df.createOrReplaceTempView('SampleStudentData')
```
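If you want to double-check that the view was registered, one quick way (an optional extra, not required for the rest of the tutorial) is to list the tables and views in the current session's catalog:

```python
# Temp views show up in the catalog with isTemporary set to True
for table in spark.catalog.listTables():
    print(table.name, table.isTemporary)
```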
Assuming your PySpark temp view is now created, let's run some raw SQL queries against it.
Displaying Records
To display all records, use a simple SQL SELECT statement.
```python
new_df = spark.sql('select * from SampleStudentData')
new_df.show()
```
Displaying Specific Columns
To display specific columns, list the column names separated by commas in the SELECT clause.
```python
new_df = spark.sql('select id, first_name, last_name from SampleStudentData')
new_df.show()
```
Calculating the Average Salary
To calculate the average salary of all employees, I am using the avg() aggregate function.
```python
new_df = spark.sql('select avg(salary) from SampleStudentData')
new_df.show()
```
The output will be:
+-----------+
|avg(salary)|
+-----------+
|    30207.0|
+-----------+
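As a small optional refinement, you can alias the aggregate so the output column gets a friendlier name than avg(salary) (the alias avg_salary below is an arbitrary choice):

```python
# Alias the aggregate column for a cleaner header
new_df = spark.sql('select avg(salary) as avg_salary from SampleStudentData')
new_df.show()
```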
Calculating the Average Salary by Department
To calculate the average salary per department, we use the GROUP BY clause along with the avg() aggregate function.
```python
new_df = spark.sql('select department, avg(salary) from SampleStudentData group by department')
new_df.show()
```
Output
+----------+-----------+
|department|avg(salary)|
+----------+-----------+
| Marketing|    24967.5|
|        IT|    27400.0|
|   Testing|    40000.0|
+----------+-----------+
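If you also want the departments sorted by their average salary, you can extend the query with an ORDER BY clause (a small variation on the query above):

```python
# Group by department and sort by average salary, highest first
new_df = spark.sql(
    'select department, avg(salary) as avg_salary '
    'from SampleStudentData '
    'group by department '
    'order by avg_salary desc'
)
new_df.show()
```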
Using the LIMIT Clause
The LIMIT clause is used to display a specific number of rows. Here I am displaying only 5 rows.
```python
new_df = spark.sql('select * from SampleStudentData LIMIT 5')
new_df.show()
```
Output
+----------+---------+------+----------+---+----------+----+------+
|first_name|last_name|gender|department|age|      date|  id|salary|
+----------+---------+------+----------+---+----------+----+------+
|     Dulce|    Abril|Female|        IT| 32|2017-10-15|1562| 34500|
|      Mara|Hashimoto|Female|   Testing| 25|2016-08-16|1582| 54000|
|    Philip|     Gent|  Male|        IT| 36|2015-05-21|2587| 18700|
|  Kathleen|   Hanner|Female| Marketing| 25|2017-10-15|1876| 24000|
|      Mara|Hashimoto|Female|   Testing| 25|2016-08-16|1582| 31000|
+----------+---------+------+----------+---+----------+----+------+
Using Comparison Operators
Here I am displaying all the employees whose salary is greater than 24000, using the > comparison operator in the WHERE clause.
```python
new_df = spark.sql('select * from SampleStudentData where salary > 24000')
new_df.show()
```
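Operators can also be combined in the WHERE clause. As a hypothetical extension of the query above, here is a filter that applies two conditions at once:

```python
# Keep only rows where salary is above 24000 AND the department is IT
new_df = spark.sql(
    "select first_name, department, salary from SampleStudentData "
    "where salary > 24000 and department = 'IT'"
)
new_df.show()
```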
This is how you can run SQL queries on top of a PySpark DataFrame after creating a PySpark temp view.
Conclusion
If you are comfortable with SQL and you want to run SQL queries while working in PySpark, this approach will be the best fit for you. First, create the PySpark temp view, then run SQL queries as per your requirement.
You can use any method as per your requirement: for a session-scoped temp view, use createTempView or createOrReplaceTempView, and for an application-scoped view, use createGlobalTempView or createOrReplaceGlobalTempView.
If you found this article helpful, please share it and keep visiting for more PySpark tutorials.
Thanks for reading this article…
Happy Coding 🙂