Your First SparkSession

Create and configure your first SparkSession. Understand the SparkSession lifecycle, configuration options, and how Spark initializes.

What You'll Learn

  • What a SparkSession is and why it matters
  • How to create and configure a SparkSession
  • The SparkSession lifecycle (create → use → stop)
  • Key configuration options for development
  • How to avoid common SparkSession pitfalls

What is a SparkSession?

The SparkSession is your entry point to all Spark functionality. Every PySpark program starts by creating one. Think of it as opening a connection to the Spark engine.

Before Spark 2.0, you had to create separate contexts (SparkContext, SQLContext, HiveContext). Now, SparkSession unifies all of these into a single object.

graph TD
    A[SparkSession] --> B[SparkContext]
    A --> C[SQLContext]
    A --> D[HiveContext]
    A --> E[StreamingContext]
    style A fill:#4A90D9,color:#fff

Creating a SparkSession

The simplest way:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyFirstApp") \
    .getOrCreate()

print(f"Spark version: {spark.version}")
print(f"App name: {spark.sparkContext.appName}")

Expected Output

Spark version: 3.5.0
App name: MyFirstApp

The Builder Pattern

SparkSession uses the builder pattern — you chain configuration calls before creating the session:

spark = SparkSession.builder \
    .appName("ConfiguredApp") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.driver.memory", "2g") \
    .config("spark.ui.showConsoleProgress", "false") \
    .getOrCreate()

Key Configuration Options

| Config | What it does | Good default |
|--------|-------------|--------------|
| appName | Names your app (shows in Spark UI) | Always set this |
| master | Where to run: local[*], yarn, k8s | local[*] for dev |
| spark.driver.memory | Memory for the driver | 2g for dev |
| spark.sql.shuffle.partitions | Number of partitions during shuffles | 4 for dev, 200 for prod |
| spark.ui.showConsoleProgress | Show console progress bars | false to reduce noise |

The master Setting

The master URL tells Spark where to run:

# Use all available CPU cores locally
.master("local[*]")

# Use exactly 4 cores
.master("local[4]")

# Single thread (useful for debugging)
.master("local[1]")

# Connect to a YARN cluster
.master("yarn")

For learning, always use local[*].

getOrCreate() vs New Session

getOrCreate() is smart — it returns the existing session if one is already running, or creates a new one:

# First call: creates a new session
spark1 = SparkSession.builder.appName("App1").getOrCreate()

# Second call: returns the SAME session (ignores new appName)
spark2 = SparkSession.builder.appName("App2").getOrCreate()

print(spark1 is spark2)  # True!

Expected Output

True

If you need a truly new session, use newSession():

spark_new = spark1.newSession()

Stopping a SparkSession

Always stop your session when done:

spark.stop()

This releases resources (memory, ports, threads). In a script, put it at the end. In a notebook, stop it when you're done experimenting.

A Complete Example

Here's a full program that creates a session, does some work, and cleans up:

from pyspark.sql import SparkSession

# 1. Create the session
spark = SparkSession.builder \
    .appName("EmployeeAnalysis") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# 2. Create sample data
employees = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Engineering", 110000),
    ("Charlie", "Marketing", 85000),
    ("Diana", "Marketing", 92000),
    ("Eve", "Engineering", 105000),
]

df = spark.createDataFrame(employees, ["name", "department", "salary"])

# 3. Analyze
print("All employees:")
df.show()

print("Average salary by department:")
df.groupBy("department").avg("salary").show()

# 4. Clean up
spark.stop()

Expected Output

All employees:
+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering| 95000|
|    Bob|Engineering|110000|
|Charlie|  Marketing| 85000|
|  Diana|  Marketing| 92000|
|    Eve|Engineering|105000|
+-------+-----------+------+

Average salary by department:
+-----------+------------------+
| department|       avg(salary)|
+-----------+------------------+
|Engineering|103333.33333333333|
|  Marketing|           88500.0|
+-----------+------------------+

Common Mistakes

  • Not calling spark.stop(). This leaves Spark processes running in the background, consuming memory and ports. Always stop when done.
  • Setting master in production code. The cluster manager should set the master, not your code. Only use .master("local[*]") during development. In production, submit with spark-submit --master yarn.
  • Creating multiple SparkSessions. Each session consumes resources. Use getOrCreate() to avoid duplicates.

Key Takeaways

  • SparkSession is the single entry point to all Spark functionality
  • Use the builder pattern to configure before creating
  • getOrCreate() reuses existing sessions
  • Always set appName for identification in the Spark UI
  • Use local[*] as master during development
  • Always call spark.stop() when finished

Next Lesson

Your SparkSession automatically starts a web UI at localhost:4040. Next, we'll explore how to use it to monitor and debug your Spark applications.
