Section outline

    • Big Data Concepts, Challenges

    • Hadoop & HDFS Architecture

    • Data Management in Hadoop

    • Lab: HDFS Commands, VM Setup

    • Hive Architecture, ETL with Hive

    • Working with CSV, JSON, Parquet

    • Spark Overview & Deployment Modes

    • RDD Operations, Regex, Pair RDD

    • Lab: Hive ETL, Spark RDD Examples

    • Spark DAG, Shuffle, Stages, Job Metrics

    • Performance Tuning: Memory, Executors, Caching

    • Setting up Spark on YARN, Kubernetes

    • Intro to DataFrames, Catalyst, Tungsten

    • Lab: Metrics, Caching, DataFrame Operations

    • Spark SQL, HiveContext, JDBC Integration

    • Joins, Bucketing, Analytical Queries

    • BI Tool Integration

    • Delta Lake: ACID Transactions, Time Travel

    • Lab: Delta Table Management, Format Conversions

    • Structured Streaming Concepts

    • Micro-Batch Triggers, Late Data, Joins

    • Kafka Architecture & Multi-node Setup

    • Spark-Kafka Integration for Real-time Apps

    • Lab: Twitter Stream Analysis, Kafka Receiver