
Understanding the Spark UI

Learn to read and use the Spark Web UI to monitor jobs, stages, and tasks. Debug performance issues like a pro.


What You'll Learn

  • How to access and navigate the Spark Web UI
  • What jobs, stages, and tasks mean in Spark
  • How to identify performance bottlenecks
  • How to read the DAG visualization
  • Key metrics to monitor during execution

Accessing the Spark UI

When you create a SparkSession, Spark automatically starts a web server:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("UIDemo") \
    .master("local[*]") \
    .getOrCreate()

print(f"Spark UI: {spark.sparkContext.uiWebUrl}")  # e.g. http://hostname:4040

By default, it runs at http://localhost:4040. If that port is taken, Spark tries 4041, 4042, and so on.
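The fallback behavior can be modeled with a short pure-Python sketch. This is illustrative only: the function name and retry limit here are invented, and real Spark actually attempts to bind each port (governed by the spark.ui.port and spark.port.maxRetries settings) and moves on when the bind fails:

```python
def pick_ui_port(preferred=4040, taken=frozenset(), max_retries=16):
    """Return the first port at or above `preferred` that is not in `taken`.

    A simplified model of Spark's port fallback; Spark itself tries to
    bind each candidate port and catches the error if it is in use.
    """
    for offset in range(max_retries + 1):
        port = preferred + offset
        if port not in taken:
            return port
    raise RuntimeError(f"no free port in {preferred}-{preferred + max_retries}")

print(pick_ui_port())                    # 4040 when nothing else is running
print(pick_ui_port(taken={4040, 4041}))  # two other sessions push it to 4042
```

If you want a predictable address instead of the fallback, you can pin the port when building the session with .config("spark.ui.port", "4050").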

The Main Tabs

The Spark UI has several tabs. Here are the most important ones:

Jobs Tab

Shows all jobs triggered by actions (.show(), .count(), .write()). Each job has:

  • Status: Running, Succeeded, or Failed
  • Duration: Total execution time
  • Stages: How many stages the job was split into

Stages Tab

Each job is split into stages at shuffle boundaries. A stage contains tasks that can run in parallel:

Job
├─ Stage 1: Read CSV
│   ├─ Task 1.1
│   └─ Task 1.2
├─ Stage 2: Shuffle
│   └─ Task 2.1
└─ Stage 3: Aggregate
    ├─ Task 3.1
    └─ Task 3.2
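The splitting rule itself is mechanical: walk the plan's operators and cut a new stage whenever you hit one that shuffles. Here's a toy version of that rule (the operator names and the SHUFFLE_OPS set are invented for illustration, not Spark's internal API):

```python
# Toy stage splitter: cut the plan at shuffle boundaries.
# Operator names here are illustrative, not real Spark plan nodes.
SHUFFLE_OPS = {"groupBy", "join", "repartition", "orderBy"}

def split_into_stages(operators):
    stages, current = [], []
    for op in operators:
        current.append(op)
        if op in SHUFFLE_OPS:   # a shuffle ends the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

plan = ["read", "filter", "groupBy", "agg"]
print(split_into_stages(plan))  # [['read', 'filter', 'groupBy'], ['agg']]
```

Everything before the groupBy can run partition-by-partition with no coordination; everything after it needs the shuffled data, so it lands in a new stage.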

Storage Tab

Shows cached/persisted DataFrames and their memory usage.

SQL/DataFrame Tab

Shows the execution plans for SQL queries and DataFrame operations — essential for optimization.

Jobs, Stages, and Tasks Explained

Understanding this hierarchy is crucial:

| Level | What It Is | Triggered By |
|-------|-----------|--------------|
| Job | A complete computation | An action (.show(), .count(), .write()) |
| Stage | A set of tasks with no shuffles between them | Shuffle boundaries (.groupBy(), .join()) |
| Task | Work done on one data partition | One task per partition per stage |
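The arithmetic behind the table is worth making concrete: a stage runs one task per partition, so a two-stage job over four partitions schedules eight tasks. A toy model (the stage names and counts below are made up for illustration):

```python
# Toy model of the Job -> Stage -> Task hierarchy (hypothetical numbers).
num_partitions = 4

job = {
    "stages": [
        {"name": "read + filter", "tasks": num_partitions},       # narrow stage
        {"name": "shuffle + aggregate", "tasks": num_partitions},  # after the shuffle
    ],
}

total_tasks = sum(stage["tasks"] for stage in job["stages"])
print(f"1 job, {len(job['stages'])} stages, {total_tasks} tasks")
```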

Triggering a Job to Observe

Let's create some data and watch the UI:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count

spark = SparkSession.builder \
    .appName("UIExplorer") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

# Create sample data
data = [(f"user_{i}", f"product_{i % 5}", i * 10.0) for i in range(10000)]
df = spark.createDataFrame(data, ["user", "product", "amount"])

# First job: count() is the action (the filter alone is lazy)
expensive = df.filter(col("amount") > 50000)
print(f"Expensive purchases: {expensive.count()}")

# Second job: show() is the action (groupBy/agg are lazy)
summary = df.groupBy("product").agg(
    avg("amount").alias("avg_amount"),
    count("*").alias("total")
)
summary.show()

Expected Output

Expensive purchases: 4999

+---------+----------+-----+
|  product|avg_amount|total|
+---------+----------+-----+
|product_0|   49975.0| 2000|
|product_1|   49985.0| 2000|
|product_2|   49995.0| 2000|
|product_3|   50005.0| 2000|
|product_4|   50015.0| 2000|
+---------+----------+-----+

(The row order from show() is not guaranteed, so yours may differ.)

Now open http://localhost:4040 and you'll see at least two completed jobs (Spark sometimes schedules extra small jobs of its own, e.g. for show()).

Reading the DAG Visualization

Click on any job to see its Directed Acyclic Graph (DAG). This shows the physical execution plan:

  • Green boxes: Completed stages
  • Greyed-out boxes: Skipped stages (their output was already available, e.g. from a cache or an earlier shuffle)
  • Arrows: Data flow between stages

The DAG reveals where shuffles happen. Each shuffle boundary means data is being redistributed across partitions — an expensive operation.
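A shuffle redistributes rows so that all rows with the same key land in the same partition. A rough pure-Python model of what groupBy does across the cluster (the partition count plays the role of the spark.sql.shuffle.partitions setting used earlier; the routing rule is simplified):

```python
from collections import defaultdict

# Rough model of a shuffle: route each row to a partition by its key's hash.
rows = [(f"product_{i % 5}", i * 10.0) for i in range(20)]
num_partitions = 4  # stands in for spark.sql.shuffle.partitions

partitions = defaultdict(list)
for key, amount in rows:
    partitions[hash(key) % num_partitions].append((key, amount))

# All rows for a given key end up in exactly one partition, which is what
# lets each task aggregate its keys without talking to the other tasks.
for pid, part in sorted(partitions.items()):
    print(f"partition {pid}: keys={sorted({k for k, _ in part})}")
```

The cost is the routing itself: every row may have to move to a different machine, which is why shuffles dominate the runtime of most slow jobs.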

Key Metrics to Watch

In the Stages tab, look for:

  1. Shuffle Read/Write Size: Large shuffles slow things down
  2. Task Duration: If some tasks are much slower than others, you have a data skew problem
  3. GC Time: High garbage collection time means you need more memory
  4. Input Size: Verify Spark is reading the expected amount of data

Task Duration Distribution

The Event Timeline in the stage detail page shows task execution over time. Ideally, all tasks should finish around the same time. If one task takes 10x longer, that partition has more data than others.
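You can quantify skew the same way you'd eyeball it in the timeline: compare the largest partition to the median one. The per-partition row counts below are made-up numbers for illustration:

```python
# Hypothetical rows-per-partition counts for one stage: one hot partition.
partition_sizes = [1_000, 1_050, 980, 1_020, 950_000, 1_010, 990, 1_040]

median = sorted(partition_sizes)[len(partition_sizes) // 2]
largest = max(partition_sizes)
skew_ratio = largest / median

print(f"median: {median} rows, largest: {largest} rows, skew: {skew_ratio:.0f}x")
```

A ratio near 1 is healthy; a hundreds-to-one ratio like this means one task will dominate the stage, and the usual fixes are repartitioning or salting the hot key.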

Using the SQL Tab

For DataFrame operations, the SQL tab is gold. It shows:

  • The logical plan (what you asked for)
  • The physical plan (how Spark will do it)
  • Metrics per operator (rows processed, time spent)

# Run a query, then inspect it in the SQL tab
df.createOrReplaceTempView("purchases")
result = spark.sql("""
    SELECT product, AVG(amount) as avg_amount
    FROM purchases
    WHERE amount > 1000
    GROUP BY product
    ORDER BY avg_amount DESC
""")
result.show()

Common Mistakes

  • Ignoring the Spark UI completely. The UI tells you exactly where time is spent. If your job is slow, the answer is in the UI — not in guessing.
  • Not checking for data skew. If one task takes 10 minutes and the rest take 10 seconds, one partition has most of the data. The Stages tab reveals this instantly.
  • Forgetting the UI disappears when SparkSession stops. The UI is only available while Spark is running. For historical data, configure the Spark History Server.
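The History Server can only replay applications that wrote event logs, so logging has to be enabled before the job runs. A minimal sketch; the log directory here is a placeholder path that must exist and be readable by the History Server:

```python
from pyspark.sql import SparkSession

# Event logs are what the History Server replays after the app exits.
# file:///tmp/spark-events is an example path; use shared storage in practice.
spark = SparkSession.builder \
    .appName("HistoryDemo") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "file:///tmp/spark-events") \
    .getOrCreate()
```

Then point spark.history.fs.logDirectory at the same directory and start the server with sbin/start-history-server.sh; it serves past applications on port 18080 by default.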

Key Takeaways

  • The Spark UI runs at localhost:4040 automatically
  • Jobs → Stages → Tasks is the execution hierarchy
  • Shuffles create stage boundaries — minimize them for performance
  • The SQL tab shows execution plans and per-operator metrics
  • Watch for data skew by comparing task durations
  • The UI is your best friend for debugging slow jobs

Next Lesson

You now have a solid foundation: you know what Spark is, when to use it, how to install it, how to create a session, and how to monitor it. In the next module, we'll dive into Spark DataFrames — the core data structure you'll use every day.
