PySpark for Absolute Beginners/DataFrames Deep Dive

Creating DataFrames from Lists, Dicts, and Files

Learn the most common ways to create a PySpark DataFrame: from Python lists, dictionaries, CSV, JSON, and Parquet files.

What You'll Learn

  • Five different ways to create a PySpark DataFrame
  • When to use each approach
  • How to create DataFrames from CSV, JSON, and Parquet files
  • The difference between inferSchema and providing your own schema

Why This Matters

In Lesson 4, we created a simple DataFrame from a list of tuples. That's fine for learning, but in real data engineering, your data comes from files — CSVs exported from databases, JSON from APIs, Parquet files in data lakes. Knowing all the ways to get data into PySpark is the foundation for everything else.

Way 1: From a List of Tuples (Quick Testing)

You've seen this before, but let's look at it more carefully:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CreateDF").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# List of tuples — each tuple is a row
data = [
    ("Alice", "Engineering", 120000),
    ("Bob", "Marketing", 95000),
    ("Charlie", "Engineering", 130000),
]

# Column names as second argument
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()

Expected Output

+-------+-----------+------+
|   name| department|salary|
+-------+-----------+------+
|  Alice|Engineering|120000|
|    Bob|  Marketing| 95000|
|Charlie|Engineering|130000|
+-------+-----------+------+

Use this when: you need sample data for testing, prototyping, or writing unit tests.

Way 2: From a List of Row Objects

PySpark has a Row class that lets you create named rows:

from pyspark.sql import Row

data = [
    Row(name="Alice", age=30, city="Mumbai"),
    Row(name="Bob", age=25, city="Delhi"),
    Row(name="Charlie", age=35, city="Bangalore"),
]

df = spark.createDataFrame(data)
df.show()

Expected Output

+-------+---+---------+
|   name|age|     city|
+-------+---+---------+
|  Alice| 30|   Mumbai|
|    Bob| 25|    Delhi|
|Charlie| 35|Bangalore|
+-------+---+---------+

Use this when: you want self-documenting code where each row clearly shows its field names.

Way 3: From a Pandas DataFrame

If you're transitioning from pandas, you can convert directly:

import pandas as pd

# Create a pandas DataFrame
pdf = pd.DataFrame({
    "product": ["Laptop", "Phone", "Tablet"],
    "price": [999.99, 699.99, 449.99],
    "stock": [50, 200, 100],
})

# Convert to PySpark DataFrame
df = spark.createDataFrame(pdf)
df.show()
df.printSchema()

Expected Output

+-------+------+-----+
|product| price|stock|
+-------+------+-----+
| Laptop|999.99|   50|
|  Phone|699.99|  200|
| Tablet|449.99|  100|
+-------+------+-----+

root
 |-- product: string (nullable = true)
 |-- price: double (nullable = true)
 |-- stock: long (nullable = true)

Use this when: you prototyped in pandas and want to scale to PySpark, or when you need to convert results back and forth.

Way 4: From a CSV File

This is the most common real-world scenario:

# Read a CSV file
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()

Expected Output

+-------+-----------+------+----------+
|   name| department|salary|start_date|
+-------+-----------+------+----------+
|  Alice|Engineering|120000|2020-01-15|
|    Bob|  Marketing| 95000|2019-06-01|
|Charlie|Engineering|130000|2021-03-20|
|  Diana|      Sales| 88000|2022-11-10|
|    Eve|  Marketing| 92000|2020-08-05|
+-------+-----------+------+----------+

root
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- start_date: date (nullable = true)

Let's break down the options:

  • header=True — the first row contains column names (not data)
  • inferSchema=True — Spark reads through the data to guess column types

You can also use the more verbose format:

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ",") \
    .load("employees.csv")

Way 5: From JSON Files

JSON is common when working with API data:

# Read a JSON file (one JSON object per line)
df = spark.read.json("orders.json")
df.show(3)

Expected Output

+--------+-----------+------+----------+
|order_id|customer_id|amount|order_date|
+--------+-----------+------+----------+
|    1001|        101| 59.99|2024-01-15|
|    1002|        102|129.50|2024-01-16|
|    1003|        101| 24.99|2024-01-17|
+--------+-----------+------+----------+

PySpark expects one JSON object per line (called JSON Lines or JSONL format), not a single JSON array. If your file is a JSON array, use:

df = spark.read.option("multiLine", "true").json("data.json")

Bonus: From Parquet Files

Parquet is the format data engineers use in production. It's columnar, compressed, and fast. We'll cover why in Lesson 12, but here's how to read it:

# Read Parquet — no options needed, schema is embedded in the file
df = spark.read.parquet("data/sales.parquet")
df.show(3)

Notice there's no header=True or inferSchema=True — Parquet files carry their own schema. This is one reason data engineers prefer Parquet over CSV.

inferSchema: Convenient but Costly

When you use inferSchema=True on a CSV, Spark has to read the entire file once just to figure out the data types, then read it again to actually process it. On a 100 GB CSV, that's two full passes.

In production, you should provide an explicit schema instead. We'll cover this in the next lesson.

Common Mistakes

  • Forgetting header=True when reading CSV. Without it, Spark treats the first row as data, not column names. Your columns will be named _c0, _c1, _c2 and your header row will appear as data.
  • Using inferSchema=True on large files in production. It doubles the read time. Define your schema explicitly instead — we'll show you how in Lesson 7.
  • Assuming JSON means a JSON array. PySpark expects JSON Lines format (one object per line) by default. A file with [{...}, {...}] will fail unless you set multiLine=true.

Key Takeaways

  • Use lists/tuples for quick testing, pandas conversion for prototyping, and file reads for real work.
  • CSV needs header=True to use the first row as column names.
  • inferSchema=True is convenient but slow on large files — provide an explicit schema in production.
  • Parquet is self-describing — no schema options needed.
  • JSON Lines (one object per line) is the default JSON format for PySpark.

Next Lesson

We mentioned that inferSchema is costly and that explicit schemas are better. In Lesson 7: Schemas and Data Types, we'll learn exactly how to define schemas, what data types PySpark supports, and why this matters for both performance and correctness.
