Big Data Analytics Revision Guide

M1: Intro
M2: Ecosystem
M3: Storage
M4: Hadoop
M5: Spark
M6: ML
M7: Deep Learning Architectures
The 5 Characteristics of Big Data (5 Vs)
Volume (Size)
Massive scale of data.
• Facebook: 4PB daily.
• Genomics: Single genome ~200GB.
Tech: HDFS, S3, GCS.
Velocity (Speed)
Speed of data generation.
• Stock trades: Millions/sec.
• Twitter: 500M tweets daily.
Tech: Kafka, Spark Streaming.
Variety (Format)
Diversity of data types and formats.
Structured: SQL Tables.
Semi-Structured: JSON, XML.
Unstructured: Video, Images.
Veracity (Trust)
Trustworthiness & Accuracy.
• Issues: Sensor errors, "fake news".
Tools: Pandas, OpenRefine.
Value (Impact)
Derived actionable insights.
• Ex: Amazon recommendation engine.
4 Types of Analytics
Type | Focus | Goal
Descriptive | What happened? | Historical summary (Sales reports)
Diagnostic | Why did it happen? | Root cause analysis (Churn)
Predictive | What will happen? | Future forecasting (Loan default)
Prescriptive | What should we do? | Action optimization (Logistics)
Essential Python Libraries
Pandas: DataFrames & data manipulation.
NumPy: Numerical computing & arrays.
Matplotlib/Seaborn: Statistical visualizations.
PySpark: Python interface for Spark.
Scikit-learn: Machine learning (regression, classification).
TensorFlow/PyTorch: Deep learning frameworks.
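As a minimal illustration of the NumPy entry above (the array values are invented for the example):

```python
import numpy as np

# Vectorized arithmetic: operations apply element-wise, no Python loop needed.
amounts = np.array([120.0, 85.5, 230.0, 45.0])
taxed = amounts * 1.08        # a scalar is broadcast across the whole array
total = taxed.sum()           # aggregation runs in compiled code
print(total)
```

This loop-free style is what makes NumPy (and Pandas on top of it) fast enough for large in-memory datasets.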
HDFS Storage Architecture
  • NameNode: Master node; manages metadata and block locations.
  • DataNodes: Worker nodes; store the actual data blocks (files are split into 128 MB blocks by default).
  • Fault Tolerance: Default replication factor is 3.
Batch vs. Real-Time Processing
Batch Processing
High latency (mins/hours). Processes data in bulk.
Use Case: Monthly reconciliation, ETL.
Tech: Hadoop MapReduce, Spark Batch.
Real-Time Processing
Low latency (ms/sec). Continuous stream processing.
Use Case: Stock trading, IoT monitoring.
Tech: Kafka, Spark Streaming, Flink.
Architectural Paradigms
Lambda Architecture
Hybrid system. Consists of:
Batch Layer: High accuracy, immutable raw storage.
Speed Layer: Real-time, low-latency updates.
Serving Layer: Merges batch and real-time views for queries.
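A toy sketch of the serving-layer merge in plain Python (the page counts are hypothetical; real deployments use a serving store such as Druid or Cassandra):

```python
# Batch layer: accurate counts recomputed from the immutable raw store.
batch_view = {"page_a": 1000, "page_b": 500}

# Speed layer: low-latency increments seen since the last batch run.
realtime_view = {"page_a": 12, "page_c": 3}

# Serving layer: merge both views at query time.
def query(page):
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # batch count + real-time delta
```

The batch view periodically "catches up" and absorbs the speed layer's deltas, which is exactly the operational complexity Kappa removes.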
Kappa Architecture
Stream-centric approach. Removes the batch layer entirely—everything is treated as a stream.
Tech: Kafka + Flink/Spark.
Cloud-Native Architecture
Leverages Serverless (AWS Lambda), Containers (Docker), and auto-scaling to manage Big Data costs and elastic workloads.
Cloud Platforms Comparison
Platform | Storage | Processing | Streaming
AWS | S3, Glue | EMR (Spark/Hadoop) | Kinesis
GCP | GCS, BigQuery | Cloud Dataproc | Pub/Sub
Azure | Data Lake (ADLS) | Synapse Analytics | Stream Analytics
Distributed File Systems: HDFS vs. Amazon S3
Feature | HDFS (Hadoop) | Amazon S3 (Cloud)
Storage Model | Block storage (128 MB blocks) | Object storage (key-value)
Architecture | Master-slave (NameNode/DataNode) | Decentralized, cloud-based
Scalability | Scales by adding more nodes | Virtually infinite; managed by AWS
Fault Tolerance | Replication (default: 3) | High durability (across regions)
Access | Internal Hadoop interface | REST API, Web, CLI
Data Lake vs. Data Warehouse vs. Lakehouse
Data Warehouse
Optimized for structured data and BI.
Schema: Schema-on-write.
Cost: High.
Tech: Snowflake, Redshift, BigQuery.
Data Lake
Stores raw, unstructured data.
Schema: Schema-on-read.
Cost: Low.
Tech: HDFS, Amazon S3, Azure Data Lake.
Data Lakehouse
The modern standard: Combines the flexibility/low cost of a Lake with the performance/ACID transactions of a Warehouse.
Tech: Databricks, Delta Lake, Apache Iceberg.
NoSQL Databases for Unstructured Data
MongoDB: Document Store
Stores data in flexible, JSON-like documents (BSON).
  • Dynamic Schema: No predefined structure required.
  • Sharding: Distributes data across servers for horizontal scaling.
  • GridFS: Used for storing binary files > 16MB (images/videos).
Best for: IoT data, content management, social media logs.
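The dynamic-schema idea can be illustrated without a running MongoDB server: documents in the same collection need not share fields (a toy in-memory stand-in, not the pymongo API):

```python
# Two documents in the same "collection" with different shapes.
collection = [
    {"_id": 1, "user": "ana", "likes": 42},
    {"_id": 2, "user": "ben", "location": {"city": "Lagos"}, "tags": ["iot"]},
]

# Query logic must tolerate missing fields, e.g. via dict.get().
total_likes = sum(doc.get("likes", 0) for doc in collection)
print(total_likes)
```

This flexibility is what "no predefined structure required" means in practice: the application, not the database, decides which fields a document carries.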
NoSQL Comparison
Type | Database | Ideal Use Case
Document | MongoDB / CouchDB | Flexible web apps, logs
Wide-Column | Cassandra / HBase | High-write, distributed apps
Key-Value | Redis / DynamoDB | Caching, session management
Graph | Neo4j | Fraud detection, social networks
Parallel vs. Distributed Computing
Feature | Parallel Computing | Distributed Computing
Memory | Shared memory across processors | Independent memory for each node
Communication | Shared memory / interconnect | Network message passing
Scalability | Limited by vertical hardware scaling | High (horizontal scaling)
Example | Multi-core CPUs, GPUs | Hadoop, Google Search, cloud clusters
Hadoop Core Components
HDFS
Distributed storage. NameNode (Metadata master) + DataNodes (Block workers). Optimized for Data Locality.
YARN
Resource Negotiator. Separates resource management from processing. Manages job scheduling and cluster resources.
MapReduce Execution Lifecycle
  1. Input Split: Data is divided into chunks (HDFS blocks).
  2. Map Phase: Processes records into intermediate (Key, Value) pairs.
  3. Shuffle & Sort: Transfers and organizes data across the network so all values for the same key reach the same reducer.
  4. Reduce Phase: Aggregates intermediate values into the final result.
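The four phases above can be simulated in a few lines of plain Python (a single-process sketch of the classic word count; real MapReduce distributes each phase across nodes):

```python
from itertools import groupby

lines = ["big data", "big analytics"]

# 1-2. Input split + Map: emit intermediate (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# 3. Shuffle & Sort: group pairs by key (the network shuffle in real Hadoop).
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# 4. Reduce: aggregate each key's values into the final result.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'analytics': 1, 'big': 2, 'data': 1}
```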
Advanced MapReduce Optimization
  • Combiners: "Mini-reducers" that aggregate data locally on the mapper node to reduce network traffic.
  • Speculative Execution: Identifies slow tasks ("stragglers") and launches redundant copies on other nodes to finish faster.
  • Partitioners: Custom logic to determine which reducer a key is sent to (prevents data skew).
  • Distributed Cache: Efficiently shares small lookup tables or side-data across all mapper nodes.
  • Input Formats: Using SequenceFile or Parquet is more efficient than plain TextInputFormat.
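To see why a combiner helps, compare how many (key, value) pairs would cross the network with and without local pre-aggregation (toy numbers, one mapper):

```python
from collections import Counter

mapper_output = [("ip_1", 1)] * 1000 + [("ip_2", 1)] * 500

# Without a combiner, every intermediate pair is shuffled to the reducers.
shuffled_plain = len(mapper_output)

# A combiner sums per key on the mapper node before the shuffle.
combined = Counter()
for key, value in mapper_output:
    combined[key] += value
shuffled_combined = len(combined)  # one pair per distinct key

print(shuffled_plain, shuffled_combined)  # 1500 vs 2
```

The same logic applies only when the reduce function is associative and commutative (like sum or max); otherwise a combiner can change the result.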
In-Memory vs. Disk-Based Computing
Apache Spark can be up to 100x faster than Hadoop MapReduce on iterative and interactive workloads because it keeps intermediate data in RAM across cluster nodes instead of writing it to disk after every operation.
Spark Master-Slave Architecture
Driver Program
The central coordinator. Manages the SparkSession, converts code into tasks, and schedules them on executors.
Executors
Worker nodes that execute individual tasks and store data in-memory or on disk. They report results back to the Driver.
Cluster Manager
Allocates physical resources (CPU/RAM) across the cluster. Supports YARN, Mesos, Kubernetes, and Standalone mode.
Core Data Abstractions
Feature | RDD (Resilient Distributed Dataset) | DataFrame
Level | Low-level API | High-level API
Structure | Unstructured (collection of objects) | Structured (rows & columns)
Optimization | Manual optimization | Automatic (Catalyst Optimizer)
Ease of Use | Complex (lambda functions) | Easy (SQL-like syntax)
Transformations vs. Actions
Transformations (Lazy)
Do not execute immediately; they create a DAG (Directed Acyclic Graph).
filter(), map(), groupBy(), select().
Actions (Eager)
Trigger the actual computation and return results to the driver.
show(), collect(), count(), write().
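Spark's lazy/eager split can be mimicked with Python generators: building the pipeline does no work until a terminal call consumes it (an analogy only, not the PySpark API):

```python
data = range(10)

# "Transformations": generators are lazy, nothing is computed yet.
filtered = (x for x in data if x % 2 == 0)
squared = (x * x for x in filtered)

# "Action": sum() finally pulls data through the whole pipeline.
print(sum(squared))  # 0 + 4 + 16 + 36 + 64 = 120
```

As in Spark, laziness lets the whole chain be planned (and, in Spark's case, optimized into a DAG) before any data moves.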
PySpark Code Snippets & Best Practices
  • Initialization: Always use SparkSession.builder.
  • Optimization: Use .filter() early to reduce data size before a shuffle.
  • Storage: Use Parquet format for output; it is columnar and significantly faster than CSV.
  • Cleaning: Use dropna() or fillna() to handle missing records.
from pyspark.sql.functions import col, sum
df.filter(col("status") == "Active").groupBy("id").agg(sum("amount"))
Challenges of High-Dimensional Data
High-dimensional data refers to datasets where the number of features (p) is much larger than the number of observations (n). This leads to the "Curse of Dimensionality":
  • Overfitting: Models capture noise rather than patterns.
  • Sparsity: Data points become isolated, making distance metrics (like Euclidean) less meaningful.
  • Computation: Exponentially higher processing requirements.
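The distance-concentration effect is easy to demonstrate: as dimension grows, the gap between the nearest and farthest point shrinks relative to the distances themselves (NumPy sketch with a fixed seed; the point counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=200):
    # Distances from one random query point to n random points.
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()  # large => distances are informative

low, high = relative_contrast(2), relative_contrast(1000)
print(low, high)  # contrast collapses in high dimensions
```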
Regularization: LASSO vs. Ridge
Feature | LASSO (L1) | Ridge (L2)
Penalty | Absolute value of coefficients | Squared value of coefficients
Feature Selection | Yes (can set coefficients to zero) | No (shrinks them towards zero)
Best For | Sparse data with few important features | When most features are relevant
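For orthonormal features the two penalties have simple closed-form effects on a coefficient w: Ridge scales it by 1/(1+λ), while LASSO soft-thresholds it to zero when |w| ≤ λ (NumPy sketch with made-up coefficients):

```python
import numpy as np

w = np.array([3.0, 0.4, -2.0, 0.1])   # hypothetical OLS coefficients
lam = 0.5                              # regularization strength

ridge = w / (1 + lam)                  # L2: uniform shrinkage, never exactly 0
lasso = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # L1: soft threshold

print(ridge)   # all four coefficients stay non-zero
print(lasso)   # small coefficients are zeroed out -> feature selection
```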
Scalable K-Means Clustering
Implementation with Spark MLlib
Spark enables clustering of millions of points through:
  • Mini-Batch K-Means: Processes subsets of data at each iteration to save memory.
  • K-Means|| (Parallel): An optimized initialization method (similar to K-Means++) that reduces the number of passes over data.
  • Feature Scaling: Using StandardScaler is critical because K-Means is distance-based and sensitive to varying scales.
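The effect of StandardScaler can be reproduced in NumPy: subtract each column's mean and divide by its standard deviation, so no single feature dominates the Euclidean distance (invented columns for illustration):

```python
import numpy as np

# income (large scale) vs age (small scale): income would dominate distances.
X = np.array([[50_000.0, 25.0],
              [90_000.0, 40.0],
              [60_000.0, 30.0]])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```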
ML Pipeline Code (PySpark)
Typical workflow for a scalable ML model:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler
  1. Assemble: Combine columns into a single features vector.
  2. Scale: Standardize features to have mean=0 and std=1.
  3. Fit: Fit the KMeans model on the scaled feature vectors across the cluster.
Real-World Applications
Finance
Anomaly detection for fraudulent credit card transactions.
Healthcare
Genomic data clustering for disease pattern identification.
Core Deep Learning Architectures
Feed Forward Networks (FFN)
Data flows in one direction (Input → Hidden → Output). No cycles.
Best For: Simple classification, regression.
Limitation: No memory of past info.
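A single forward pass of a tiny FFN in NumPy (random weights and layer sizes chosen just to show the one-directional flow; no training happens here):

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.random((1, 4))            # one input sample with 4 features
W1 = rng.random((4, 8))           # input -> hidden weights
W2 = rng.random((8, 3))           # hidden -> output weights

hidden = np.maximum(0, x @ W1)    # ReLU activation
output = hidden @ W2              # raw scores for 3 output units

print(output.shape)  # (1, 3)
```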
Convolutional Neural Nets (CNN)
Uses filters to detect spatial hierarchies.
Key Layers: Convolutional, Pooling, Fully Connected.
Best For: Image recognition, medical imaging.
Recurrent Neural Networks (RNN)
Contains loops to retain sequential dependencies (hidden state).
Challenge: Vanishing Gradient problem.
Solution: LSTM (Long Short-Term Memory) units.
Best For: NLP, time-series forecasting, speech-to-text.
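The "loop" is just the hidden state being fed back in at every time step; one vanilla RNN cell in NumPy (random weights, arbitrary sizes, no training):

```python
import numpy as np

rng = np.random.default_rng(7)
W_x = rng.standard_normal((3, 5))       # input -> hidden weights
W_h = rng.standard_normal((5, 5))       # hidden -> hidden (the recurrence)

h = np.zeros(5)                         # initial hidden state
sequence = rng.standard_normal((4, 3))  # 4 time steps, 3 features each

for x_t in sequence:
    h = np.tanh(x_t @ W_x + h @ W_h)    # h carries memory of earlier steps

print(h.shape)  # final hidden state summarizes the whole sequence
```

Repeated multiplication by W_h inside this loop is also where the vanishing-gradient problem comes from, which LSTMs mitigate with gated cell states.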
Generative Models: Variational Autoencoders (VAE)
VAEs learn probabilistic latent representations of data.
  • Encoder: Maps input data to a probability distribution (latent space).
  • Latent Space: A compressed, continuous bottleneck layer.
  • Decoder: Reconstructs data from samples taken from the latent space.
Applications: Image generation, anomaly detection, drug discovery.
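The sampling step between encoder and decoder can be sketched with the reparameterization trick, z = μ + σ·ε (NumPy, with hypothetical encoder outputs and a latent dimension of 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input (latent dim = 2).
mu = np.array([0.5, -1.0])
log_var = np.array([0.1, 0.2])

# Reparameterization trick: sample eps ~ N(0, I), then shift and scale it,
# keeping the sample differentiable with respect to mu and log_var.
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps

print(z.shape)  # this z is what the decoder reconstructs from
```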
Deep Learning Comparison
Architecture | Data Type | Key Concept
CNN | Grid (Images/Video) | Spatial Feature Extraction
RNN / LSTM | Sequential (Text/Time) | Temporal Memory
VAE | Any (Unsupervised) | Probabilistic Latent Space
GAN | Any (Generative) | Adversarial Competition