Running SQL Queries on PySpark DataFrames
What You'll Learn
- How to register a DataFrame as a SQL table
- How to run SQL queries with spark.sql()
- When to use SQL vs the DataFrame API
- How to mix SQL and DataFrame operations in one pipeline
Why SQL in PySpark?
You already know the DataFrame API — df.select(), df.filter(), df.groupBy(). So why learn SQL too?
Two practical reasons. First, most data teams have analysts who know SQL but not Python. If you expose your DataFrames as SQL tables, they can query them directly. Second, some operations are simply easier to express in SQL — complex joins, subqueries, and CASE statements often read more naturally in SQL than in chained DataFrame calls.
The good news: PySpark's SQL engine uses the exact same optimizer (Catalyst) as the DataFrame API. There's no performance difference. Choose whichever is clearer for the task.
Step 1: Register a DataFrame as a SQL Table
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
data = [
("Alice", "Engineering", 120000, "2020-01-15"),
("Bob", "Marketing", 95000, "2019-06-01"),
("Charlie", "Engineering", 130000, "2021-03-20"),
("Diana", "Sales", 88000, "2022-11-10"),
("Eve", "Marketing", 92000, "2020-08-05"),
("Frank", "Engineering", 145000, "2018-04-12"),
("Grace", "Sales", 97000, "2021-07-22"),
("Henry", "Marketing", 105000, "2017-09-30"),
]
df = spark.createDataFrame(data, ["name", "department", "salary", "start_date"])
# Register as a temporary SQL view
df.createOrReplaceTempView("employees")
That last line is the key. It tells Spark: "make this DataFrame queryable under the name employees in SQL." The data isn't moved or copied — the view is just a name pointing at the same DataFrame.
Step 2: Run SQL Queries
# Simple SELECT
result = spark.sql("SELECT name, salary FROM employees WHERE salary > 100000")
result.show()
Expected Output
+-------+------+
| name|salary|
+-------+------+
| Alice|120000|
|Charlie|130000|
| Frank|145000|
| Henry|105000|
+-------+------+
The result of spark.sql() is a regular PySpark DataFrame. You can chain DataFrame operations on it, save it, or register it as another view.
Common SQL Operations
GROUP BY with aggregations
spark.sql("""
SELECT
department,
COUNT(*) as headcount,
ROUND(AVG(salary), 0) as avg_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC
""").show()
Expected Output
+-----------+---------+----------+----------+
| department|headcount|avg_salary|max_salary|
+-----------+---------+----------+----------+
|Engineering| 3| 131667.0| 145000|
| Marketing| 3| 97333.0| 105000|
| Sales| 2| 92500.0| 97000|
+-----------+---------+----------+----------+
CASE WHEN (SQL's if/else)
spark.sql("""
SELECT
name,
salary,
CASE
WHEN salary >= 120000 THEN 'Senior'
WHEN salary >= 95000 THEN 'Mid'
ELSE 'Junior'
END as level
FROM employees
ORDER BY salary DESC
""").show()
Expected Output
+-------+------+------+
| name|salary| level|
+-------+------+------+
| Frank|145000|Senior|
|Charlie|130000|Senior|
| Alice|120000|Senior|
| Henry|105000| Mid|
| Grace| 97000| Mid|
| Bob| 95000| Mid|
| Eve| 92000|Junior|
| Diana| 88000|Junior|
+-------+------+------+
HAVING (filter after GROUP BY)
# Departments where average salary exceeds 95,000
spark.sql("""
SELECT department, ROUND(AVG(salary), 0) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 95000
""").show()
Expected Output
+-----------+----------+
| department|avg_salary|
+-----------+----------+
|Engineering| 131667.0|
| Marketing| 97333.0|
+-----------+----------+
DISTINCT
spark.sql("SELECT DISTINCT department FROM employees ORDER BY department").show()
Expected Output
+-----------+
| department|
+-----------+
|Engineering|
| Marketing|
| Sales|
+-----------+
Mixing SQL and DataFrame API
You can freely switch between SQL and DataFrame operations:
# Start with SQL
high_earners = spark.sql("SELECT * FROM employees WHERE salary > 100000")
# Continue with DataFrame API
result = high_earners \
.groupBy("department") \
.count() \
.orderBy("count", ascending=False)
result.show()
Expected Output
+-----------+-----+
| department|count|
+-----------+-----+
|Engineering| 3|
| Marketing| 1|
+-----------+-----+
You can also go the other direction — take a DataFrame result and register it as a new view:
result.createOrReplaceTempView("dept_counts")
spark.sql("SELECT * FROM dept_counts WHERE count > 1").show()
SQL vs DataFrame API — When to Use Which
Both compile to the same execution plan. The choice is about readability:
# DataFrame API
from pyspark.sql.functions import avg, col

df.filter(col("department") == "Engineering") \
    .groupBy("department") \
    .agg(avg("salary").alias("avg_salary")) \
    .show()
# SQL — same result, same performance
spark.sql("""
SELECT department, AVG(salary) as avg_salary
FROM employees
WHERE department = 'Engineering'
GROUP BY department
""").show()
Use SQL when: the query involves complex joins, subqueries, CASE statements, or when your team thinks in SQL.
Use DataFrame API when: you're building a pipeline with many chained transformations, or when you need programmatic column construction (like looping over column names).
Use both when: that's what most real codebases do.
Common Mistakes
- Forgetting to create the view first. spark.sql("SELECT * FROM employees") will fail with "Table or view not found" if you haven't called createOrReplaceTempView("employees"). The view registration step is easy to forget.
- Using double quotes for SQL strings. Standard SQL string literals use single quotes: WHERE name = 'Alice'. Spark SQL does accept double-quoted strings by default, but single quotes are portable across SQL dialects — and note that Spark uses backticks, not double quotes, for column or table names with special characters.
- Assuming SQL views persist between sessions. Temp views exist only within the current SparkSession. If you restart Spark, all views are gone. We'll cover persistent views in the next lesson.
Key Takeaways
- createOrReplaceTempView("name") registers a DataFrame as a SQL table.
- spark.sql("...") runs SQL and returns a DataFrame — no performance difference from the DataFrame API.
- SQL and the DataFrame API use the same Catalyst optimizer — choose whichever reads more clearly.
- The result of spark.sql() is a normal DataFrame that you can chain further operations on.
- Most real codebases mix SQL and the DataFrame API freely.
Next Lesson
We used createOrReplaceTempView to make our DataFrame queryable via SQL. But what exactly is a "temp view"? And what's a "global view"? In Lesson 14: Temp Views, Global Views, and When to Use Each, we'll understand view scoping and lifecycle.