Sandeep

🚀 Day 1: Introduction to Apache Spark

Welcome to Day 1 of the 60 Day Spark Mastery Series!

Let's begin with the fundamentals.

🌟 What is Apache Spark?

Apache Spark is a lightning-fast distributed computing engine used for processing massive datasets.
It powers the data engineering pipelines of companies like Netflix, Uber, Amazon, Spotify, and Airbnb.

Spark's superpower is simple:

It processes data in-memory, which makes it 10–100x faster than Hadoop MapReduce.

⚡ Why Should Data Engineers Learn Spark?

Here's why Spark is the industry standard:

  • Works with huge datasets (TBs/PBs)
  • Built for batch + streaming + machine learning
  • Runs on GCP, AWS, Databricks, Kubernetes, Hadoop
  • Has easy APIs in Python (PySpark), SQL, Scala
  • Built-in optimizations from Spark's Catalyst Optimizer

🔥 Spark Ecosystem Overview

Spark is not just a computation engine; it's a full ecosystem:

1. Spark Core

Handles scheduling, memory management, and fault tolerance

2. Spark SQL

Enables SQL queries and the DataFrame API (see the sketch after this list).

3. Structured Streaming

Real-time data pipelines (Kafka, sockets, event logs)

4. MLlib

Machine learning algorithms; great for scalable ML workloads.

5. GraphX

Graph processing engine (less used but powerful)
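
Here is the Spark SQL sketch referenced above: the same filter expressed once through the DataFrame API and once as plain SQL on a temporary view. This is only an illustration; the app name, view name, and sample rows are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EcosystemDemo").getOrCreate()

# A tiny in-memory DataFrame (illustrative data)
df = spark.createDataFrame(
    [("Alice", 1200), ("Bob", 800)],
    ["name", "amount"],
)

# Filter with the DataFrame API
df.filter(df.amount > 1000).show()

# The same query via Spark SQL on a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT name, amount FROM sales WHERE amount > 1000").show()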

🧠 How Spark Executes Your Code Internally

Understanding Spark internals is key to becoming a senior-level engineer.

🔹 Step 1: Driver Program Starts

The driver analyzes your job and builds a logical plan.

🔹 Step 2: DAG (Directed Acyclic Graph) Creation

Spark breaks transformations into a DAG.

🔹 Step 3: DAG Scheduler → Stages → Tasks

Stages are based on shuffle boundaries.
Tasks run in parallel across executors.

🔹 Step 4: Executors Run Tasks

Executor processes on the worker nodes run the tasks, process the data, and return the results.

This architecture gives Spark its scalability and speed.
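
You can watch this plan-building from PySpark itself. The snippet below is a minimal sketch (the app name and numbers are arbitrary): the groupBy is a wide transformation, so it forces a shuffle, which is exactly where the DAG scheduler cuts the job into stages.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DagDemo").getOrCreate()

# Narrow transformations stay in one stage; the groupBy below is a wide
# transformation, so it forces a shuffle
nums = spark.range(1_000_000)
grouped = nums.groupBy((F.col("id") % 10).alias("bucket")).count()

# explain() prints the physical plan; the Exchange operator marks the
# shuffle boundary where the DAG scheduler splits the job into stages
grouped.explain()

# Only this action launches tasks on the executors
grouped.show()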

โณ Lazy Evaluation : Transformations donโ€™t execute immediately.

Example:

df = spark.read.csv("sales.csv", header=True)
filtered = df.filter(df.amount > 1000)

Nothing runs until you call:

filtered.show()

This helps Spark:

  1. Optimize the whole query
  2. Reduce stages
  3. Avoid unnecessary work
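
One way to see this for yourself is to ask Spark for its query plan before triggering any action. A minimal sketch, continuing the sales.csv example above (the file path and app name are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyEvalDemo").getOrCreate()

# Both lines only build the plan; the filter has not run on any data yet
df = spark.read.csv("sales.csv", header=True)
filtered = df.filter(df.amount > 1000)

# explain() prints the optimized plan Spark has built so far
filtered.explain()

# An action such as show(), count(), or a write finally triggers the job
filtered.show()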

🛠 Create Your First SparkSession

from pyspark.sql import SparkSession

# The SparkSession is the entry point for every PySpark application
spark = SparkSession.builder \
    .appName("Day1IntroToSpark") \
    .getOrCreate()

# A DataFrame with the numbers 0-9, just to confirm the session works
df = spark.range(10)
df.show()
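
When a script is done with Spark, it can release the session's resources explicitly:

spark.stop()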

Follow for more such content, and let me know in the comments if I missed anything. Thank you!
