PySpark Spark Session Tutorial

In the Big Data field, Apache Spark is a powerful framework for distributed data processing. PySpark is the Python API for Spark; it allows Python developers to harness the functionality of Apache Spark.

In today's PySpark Spark Session article we are going to explore the Spark session with the help of examples. This tutorial is designed for both PySpark beginners and experienced users. For more useful PySpark articles, you can explore our PySpark tutorials page.

When I was new to PySpark, I sometimes got confused about the Spark session, but after reading more about it, I can now explain the PySpark session in a way that anyone can understand. So

Let’s get started.

What is Spark Session?

Apache Spark introduced the Spark Session in version 2.0, and since then it has been the entry point of a Spark application. Before version 2.0, SparkContext was the entry point of any Spark application.

A Spark Session is the unified entry point of a Spark application that allows you to read data from multiple sources into a PySpark DataFrame, apply SQL queries on top of the DataFrame, and more.

It contains all the settings and configurations that are required to run a PySpark application.

Before Spark version 2.0, we needed different contexts for different operations: SparkContext was used for RDD operations and cluster resource management, SQLContext was used to perform SQL operations on top of DataFrames, and HiveContext was used to interact with Hive. The Spark session combines all of these contexts into a single entry point.

Note:- You can think of the Spark session as a ticket to enter the movie theater: without it, you cannot get into the Spark application.

Key Features of Spark Session:

  • Unified Entry Point:- SparkSession provides a single entry point for reading data, creating DataFrames, and executing SQL queries, replacing the previous need for separate contexts like SQLContext, SparkContext, HiveContext, etc. (a short sketch follows this list).
  • Simplicity:- It simplifies the user API by providing a more straightforward and consolidated way to interact with Spark’s functionality.
  • Simplicity:- It simplifies the user API by providing a more straightforward and consolidated way to interact with Spark’s functionality.
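As a rough sketch of what “unified” means in practice, the old entry points are still reachable from a single SparkSession object. The session itself is built the same way as in the next section, so treat this as a preview rather than the definitive pattern:

from pyspark.sql import SparkSession

# Build (or reuse) a session; this step is explained in detail in the next section.
spark = SparkSession.builder.appName("Programming Funda").getOrCreate()

# The classic SparkContext is still available as a property of the session.
sc = spark.sparkContext
print(sc.appName)  # "Programming Funda"

# DataFrame creation and SQL execution hang off the same object.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
spark.sql("SELECT 1 AS one").show()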

Now let’s see how we can create a Spark session using PySpark.

You can use any editor; here I am using PyCharm because it is my favorite IDE.

Creating Spark Session

In PySpark, creating the Spark session is pretty straightforward. You can create a Spark session by using SparkSession.builder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Programming Funda").getOrCreate()

In the above example:

  • SparkSession.builder:- SparkSession.builder is used to configure and create the Spark session; it returns an object of the Builder class.
  • appName():- appName() is a method of the Builder class that sets the name of the Spark application.
  • getOrCreate():- As the name suggests, it returns the Spark session. It is also a method of the Builder class, but it returns an object of the SparkSession class, which is typically called the Spark session. If a Spark session already exists, it returns the existing session, as the short sketch below shows.
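
To see that “get or create” behaviour in action, calling getOrCreate() a second time hands back the session that already exists instead of starting a new one (a minimal sketch; the second app name is only there to show it has no effect on an existing session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Programming Funda").getOrCreate()

# A second getOrCreate() does not build a new session;
# it returns the one that already exists.
spark_again = SparkSession.builder.appName("Another Name").getOrCreate()
print(spark is spark_again)  # True: both variables point to the same session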

This is how you can create the Spark session.

Configure Spark Session

Spark provides various configuration parameters that you can set as per your application's needs. Here is how you can set them using the builder's config() method.

from pyspark.sql import SparkSession


spark = (
    SparkSession.builder.appName("Programming Funda")
    .config("spark.executor.memory", "4g")   # memory per executor process
    .config("spark.executor.cores", "4")     # CPU cores per executor
    .getOrCreate()
)

In the above example:

  • spark.executor.memory:- This configuration sets the amount of memory to allocate per executor process.
  • spark.executor.cores:- This configuration is used to assign the number of cores to use for each executor.

This is how you can set configurations using the config() method for your spark application.
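
If you want to confirm what the running session actually picked up, the runtime configuration can be read back through spark.conf (a small sketch, assuming the session built above; the default argument avoids an error if a key was never set):

# Read configuration values back from the running session.
print(spark.conf.get("spark.app.name"))                    # "Programming Funda"
print(spark.conf.get("spark.executor.memory", "not set"))  # "4g" if set above
print(spark.conf.get("spark.executor.cores", "not set"))   # "4" if set above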

As we have seen above, spark session is mostly used for reading data from various sources.

Let’s load CSV data into PySpark DataFrame using spark session.

Reading Data From CSV File

The primary use of the Spark session is to load data from various formats into a PySpark DataFrame. You can read this tutorial, where I have described how to load data from a CSV file into a DataFrame.

df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .format("CSV")
    .load("sample_data.csv")
)
df.show()

In the above example:

  • spark:- This is the spark session that I created earlier.
  • read:- This is the property of the spark session that returns the object of the DataFrameReader class which has various methods to load files.
  • option(“header”, True):- This is a method of the DataFrameReader class that ensures that the first row of the CSV file will be the header.
  • option(“inferSchema”, True):- It will automatically infer the schema of the data available in the CSV file.
  • format(“csv”):- This tells the reader to treat the file as a CSV data source.
  • load(“sample_data.csv”):- This method takes the path of the CSV file and loads it into a PySpark DataFrame.
  • df.show():- show() is the PySpark DataFrame method used to display the DataFrame.
Note:- PySpark provides various reader options; you can use whichever ones your requirements call for.
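
For example, the same read can be written with the csv() shortcut on the DataFrameReader, passing the options as keyword arguments instead of chained option() calls (equivalent to the snippet above, assuming the same sample_data.csv file):

# Shorthand: reader options passed as keyword arguments.
df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
df.show()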

In my case, the DataFrame printed by df.show() contains columns such as first_name, last_name, and salary.


This is how you can load CSV data into a PySpark DataFrame. Now let’s perform some operations on top of this DataFrame.

Showing a Specific Number of Rows

To display a specific number of rows, pass the number into the show() method.

df.show(5)

Now, only 5 rows will be displayed.
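
show() also accepts a couple of optional parameters that are handy when rows are wide; this is standard DataFrame.show() behaviour:

# Show 5 rows without truncating long column values.
df.show(5, truncate=False)

# Print each row vertically instead of as a table.
df.show(5, vertical=True)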

Selecting Specific Columns

To select specific columns from the DataFrame, you can use the select() method and pass the column names you want, separated by commas. For example, I want to select only the first_name, last_name, and salary columns.

new_df = df.select('first_name', 'last_name', 'salary')
new_df.show()

This will display a DataFrame that has only three columns: first_name, last_name, and salary.
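
select() also accepts Column expressions, so you can rename or derive columns in the same step. The following is only an illustrative sketch on the same columns; the 10% bonus calculation is made up for the example:

from pyspark.sql.functions import col, concat_ws

# Build a full name column and a derived (hypothetical) bonus column.
derived_df = df.select(
    concat_ws(" ", col("first_name"), col("last_name")).alias("full_name"),
    (col("salary") * 0.10).alias("bonus"),  # hypothetical 10% bonus, illustration only
)
derived_df.show()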

Perform SQL Queries

You can also run raw SQL queries on top of the PySpark DataFrame just by creating a temporary view. Let’s see how we can achieve this.

# Creating a temporary view, similar to a SQL table
df.createTempView('SampleStudentData')
new_df = spark.sql('select first_name, last_name, salary from SampleStudentData where salary < 31000')
new_df.show()

In the above example:

  • df.createTempView(‘SampleStudentData’):- This creates a temporary view of the DataFrame called SampleStudentData.
  • spark.sql():- This executes the SQL query I passed against the temporary view SampleStudentData and returns a PySpark DataFrame.
  • new_df.show():- The show() method displays the newly created DataFrame.

Output would be:

+----------+---------+------+
|first_name|last_name|salary|
+----------+---------+------+
|    Philip|     Gent| 18700|
|  Kathleen|   Hanner| 24000|
|  Kathleen|   Hanner| 29000|
|     Dulce|    Abril| 28750|
|     Harsh|    Kumar| 12560|
+----------+---------+------+
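
One small caveat: createTempView() raises an error if a view with the same name already exists in the session, which is easy to hit when re-running a script or notebook cell. createOrReplaceTempView() is the safer choice in that case:

# Safe to call repeatedly: replaces the view if it already exists.
df.createOrReplaceTempView('SampleStudentData')
new_df = spark.sql('select first_name, last_name, salary from SampleStudentData where salary < 31000')
new_df.show()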

Stopping the Spark Session

It is good practice to stop the spark session when you are done to release resources.

spark.stop()

Conclusion

When you write PySpark applications, creating the Spark session is usually your first step because it is the entry point into the Spark application. Without a Spark session you cannot create DataFrames, load data, and so on, which is why it is so important.

With your Spark session, you can also set up configurations for your PySpark application using the config() method.

If you found this PySpark Spark Session tutorial helpful, please share it and keep visiting for more PySpark tutorials.

Happy Coding.
