Big Data Analytics Revision Guide

M1: Intro
M2: Ecosystem
M3: Storage
M4: Hadoop
M5: Spark
M6: ML
M7: Deep Learning Architectures
The 5 Characteristics of Big Data (5 Vs)
Volume (Size)
Massive scale of data.
• Facebook: 4PB daily.
• Genomics: Single genome ~200GB.
Tech: HDFS, S3, GCS.
Velocity (Speed)
Speed of data generation.
• Stock trades: Millions/sec.
• Twitter: 500M tweets daily.
Tech: Kafka, Spark Streaming.
Variety (Format)
Diversity of data types and formats.
Structured: SQL Tables.
Semi-Structured: JSON, XML.
Unstructured: Video, Images.
Veracity (Trust)
Trustworthiness & Accuracy.
• Issues: Sensor errors, "fake news".
Tools: Pandas, OpenRefine.
Value (Impact)
Derived actionable insights.
• Ex: Amazon recommendation engine.
4 Types of Analytics
Type | Focus | Goal
Descriptive | What happened? | Historical summary (Sales reports)
Diagnostic | Why did it happen? | Root cause analysis (Churn)
Predictive | What will happen? | Future forecasting (Loan default)
Prescriptive | What should we do? | Action optimization (Logistics)
Essential Python Libraries
Pandas: DataFrames & data manipulation.
NumPy: Numerical computing & arrays.
Matplotlib/Seaborn: Statistical visualizations.
PySpark: Python interface for Spark.
Scikit-learn: Machine learning (regression, classification).
TensorFlow/PyTorch: Deep learning frameworks.
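As a minimal illustration of the NumPy entry above (the array values are invented for the example):

```python
import numpy as np

# Vectorized arithmetic: operations apply element-wise, no Python loop needed.
amounts = np.array([120.0, 85.5, 230.0, 45.0])
taxed = amounts * 1.08        # a scalar is broadcast across the whole array
total = taxed.sum()           # aggregation runs in compiled code
print(total)
```

This loop-free style is what makes NumPy (and Pandas on top of it) fast enough for large in-memory datasets.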
HDFS Storage Architecture
  • NameNode: Master node; manages metadata and block locations.
  • DataNodes: Worker nodes; store the actual data blocks (files are split into 128 MB blocks by default).
  • Fault Tolerance: Default replication factor is 3.
Batch vs. Real-Time Processing
Batch Processing
High latency (mins/hours). Processes data in bulk.
Use Case: Monthly reconciliation, ETL.
Tech: Hadoop MapReduce, Spark Batch.
Real-Time Processing
Low latency (ms/sec). Continuous stream processing.
Use Case: Stock trading, IoT monitoring.
Tech: Kafka, Spark Streaming, Flink.
Architectural Paradigms
Lambda Architecture
Hybrid system. Consists of:
Batch Layer: High accuracy, immutable raw storage.
Speed Layer: Real-time, low-latency updates.
Serving Layer: Merges batch and real-time views for queries.
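A toy sketch of the serving-layer merge in plain Python (the page counts are hypothetical; real deployments use a serving store such as Druid or Cassandra):

```python
# Batch layer: accurate counts recomputed from the immutable raw store.
batch_view = {"page_a": 1000, "page_b": 500}

# Speed layer: low-latency increments seen since the last batch run.
realtime_view = {"page_a": 12, "page_c": 3}

# Serving layer: merge both views at query time.
def query(page):
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # batch count + real-time delta
```

The batch view periodically "catches up" and absorbs the speed layer's deltas, which is exactly the operational complexity Kappa removes.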
Kappa Architecture
Stream-centric approach. Removes the batch layer entirely—everything is treated as a stream.
Tech: Kafka + Flink/Spark.
Cloud-Native Architecture
Leverages Serverless (AWS Lambda), Containers (Docker), and auto-scaling to manage Big Data costs and elastic workloads.
Cloud Platforms Comparison
Platform | Storage | Processing | Streaming
AWS | S3, Glue | EMR (Spark/Hadoop) | Kinesis
GCP | GCS, BigQuery | Cloud Dataproc | Pub/Sub
Azure | Data Lake (ADLS) | Synapse Analytics | Stream Analytics
Distributed File Systems: HDFS vs. Amazon S3
Feature | HDFS (Hadoop) | Amazon S3 (Cloud)
Storage Model | Block storage (128 MB blocks) | Object storage (key-value)
Architecture | Master-slave (NameNode/DataNode) | Decentralized, cloud-based
Scalability | Scales by adding more nodes | Virtually infinite; managed by AWS
Fault Tolerance | Replication (default: 3) | High durability (across regions)
Access | Internal Hadoop interface | REST API, Web, CLI
Data Lake vs. Data Warehouse vs. Lakehouse
Data Warehouse
Optimized for structured data and BI.
Schema: Schema-on-write.
Cost: High.
Tech: Snowflake, Redshift, BigQuery.
Data Lake
Stores raw, unstructured data.
Schema: Schema-on-read.
Cost: Low.
Tech: HDFS, Amazon S3, Azure Data Lake.
Data Lakehouse
The modern standard: Combines the flexibility/low cost of a Lake with the performance/ACID transactions of a Warehouse.
Tech: Databricks, Delta Lake, Apache Iceberg.
NoSQL Databases for Unstructured Data
MongoDB: Document Store
Stores data in flexible, JSON-like documents (BSON).
  • Dynamic Schema: No predefined structure required.
  • Sharding: Distributes data across servers for horizontal scaling.
  • GridFS: Used for storing binary files > 16MB (images/videos).
Best for: IoT data, content management, social media logs.
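The dynamic-schema idea can be illustrated without a running MongoDB server: documents in the same collection need not share fields (a toy in-memory stand-in, not the pymongo API):

```python
# Two documents in the same "collection" with different shapes.
collection = [
    {"_id": 1, "user": "ana", "likes": 42},
    {"_id": 2, "user": "ben", "location": {"city": "Lagos"}, "tags": ["iot"]},
]

# Query logic must tolerate missing fields, e.g. via dict.get().
total_likes = sum(doc.get("likes", 0) for doc in collection)
print(total_likes)
```

This flexibility is what "no predefined structure required" means in practice: the application, not the database, decides which fields a document carries.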
NoSQL Comparison
Type | Database | Ideal Use Case
Document | MongoDB / CouchDB | Flexible web apps, logs
Wide-Column | Cassandra / HBase | High-write, distributed apps
Key-Value | Redis / DynamoDB | Caching, session management
Graph | Neo4j | Fraud detection, social networks
Parallel vs. Distributed Computing
Feature | Parallel Computing | Distributed Computing
Memory | Shared memory across processors | Independent memory for each node
Communication | Shared memory / interconnect | Network message passing
Scalability | Limited by vertical hardware scaling | High (horizontal scaling)
Example | Multi-core CPUs, GPUs | Hadoop, Google Search, cloud clusters
Hadoop Core Components
HDFS
Distributed storage. NameNode (Metadata master) + DataNodes (Block workers). Optimized for Data Locality.
YARN
Resource Negotiator. Separates resource management from processing. Manages job scheduling and cluster resources.
MapReduce Execution Lifecycle
  1. Input Split: Data is divided into chunks (HDFS blocks).
  2. Map Phase: Processes records into intermediate (Key, Value) pairs.
  3. Shuffle & Sort: Transfers and organizes data across the network so all values for the same key reach the same reducer.
  4. Reduce Phase: Aggregates intermediate values into the final result.
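The four phases above can be simulated in a few lines of plain Python (a single-process sketch of the classic word count; real MapReduce distributes each phase across nodes):

```python
from itertools import groupby

lines = ["big data", "big analytics"]

# 1-2. Input split + Map: emit intermediate (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# 3. Shuffle & Sort: group pairs by key (the network shuffle in real Hadoop).
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# 4. Reduce: aggregate each key's values into the final result.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'analytics': 1, 'big': 2, 'data': 1}
```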
Advanced MapReduce Optimization
  • Combiners: "Mini-reducers" that aggregate data locally on the mapper node to reduce network traffic.
  • Speculative Execution: Identifies slow tasks ("stragglers") and launches redundant copies on other nodes to finish faster.
  • Partitioners: Custom logic to determine which reducer a key is sent to (prevents data skew).
  • Distributed Cache: Efficiently shares small lookup tables or side-data across all mapper nodes.
  • Input Formats: Using SequenceFile or Parquet is more efficient than plain TextInputFormat.
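To see why a combiner helps, compare how many (key, value) pairs would cross the network with and without local pre-aggregation (toy numbers, one mapper):

```python
from collections import Counter

mapper_output = [("ip_1", 1)] * 1000 + [("ip_2", 1)] * 500

# Without a combiner, every intermediate pair is shuffled to the reducers.
shuffled_plain = len(mapper_output)

# A combiner sums per key on the mapper node before the shuffle.
combined = Counter()
for key, value in mapper_output:
    combined[key] += value
shuffled_combined = len(combined)  # one pair per distinct key

print(shuffled_plain, shuffled_combined)  # 1500 vs 2
```

The same logic applies only when the reduce function is associative and commutative (like sum or max); otherwise a combiner can change the result.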
In-Memory vs. Disk-Based Computing
Apache Spark can be up to 100x faster than Hadoop MapReduce on iterative and interactive workloads because it keeps intermediate data in RAM across cluster nodes instead of writing it to disk after every operation.
Spark Master-Slave Architecture
Driver Program
The central coordinator. Manages the SparkSession, converts code into tasks, and schedules them on executors.
Executors
Worker nodes that execute individual tasks and store data in-memory or on disk. They report results back to the Driver.
Cluster Manager
Allocates physical resources (CPU/RAM) across the cluster. Supports YARN, Mesos, Kubernetes, and Standalone mode.
Core Data Abstractions
Feature | RDD (Resilient Distributed Dataset) | DataFrame
Level | Low-level API | High-level API
Structure | Unstructured (collection of objects) | Structured (rows & columns)
Optimization | Manual optimization | Automatic (Catalyst Optimizer)
Ease of Use | Complex (lambda functions) | Easy (SQL-like syntax)
Transformations vs. Actions
Transformations (Lazy)
Do not execute immediately; they create a DAG (Directed Acyclic Graph).
filter(), map(), groupBy(), select().
Actions (Eager)
Trigger the actual computation and return results to the driver.
show(), collect(), count(), write().
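Spark's lazy/eager split can be mimicked with Python generators: building the pipeline does no work until a terminal call consumes it (an analogy only, not the PySpark API):

```python
data = range(10)

# "Transformations": generators are lazy, nothing is computed yet.
filtered = (x for x in data if x % 2 == 0)
squared = (x * x for x in filtered)

# "Action": sum() finally pulls data through the whole pipeline.
print(sum(squared))  # 0 + 4 + 16 + 36 + 64 = 120
```

As in Spark, laziness lets the whole chain be planned (and, in Spark's case, optimized into a DAG) before any data moves.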
PySpark Code Snippets & Best Practices
  • Initialization: Always use SparkSession.builder.
  • Optimization: Use .filter() early to reduce data size before a shuffle.
  • Storage: Use Parquet format for output; it is columnar and significantly faster than CSV.
  • Cleaning: Use dropna() or fillna() to handle missing records.
from pyspark.sql.functions import col, sum
df.filter(col("status") == "Active").groupBy("id").agg(sum("amount"))
Challenges of High-Dimensional Data
High-dimensional data refers to datasets where the number of features (p) is much larger than the number of observations (n). This leads to the "Curse of Dimensionality":
  • Overfitting: Models capture noise rather than patterns.
  • Sparsity: Data points become isolated, making distance metrics (like Euclidean) less meaningful.
  • Computation: Exponentially higher processing requirements.
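The distance-concentration effect is easy to demonstrate: as dimension grows, the gap between the nearest and farthest point shrinks relative to the distances themselves (NumPy sketch with a fixed seed; the point counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=200):
    # Distances from one random query point to n random points.
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()  # large => distances are informative

low, high = relative_contrast(2), relative_contrast(1000)
print(low, high)  # contrast collapses in high dimensions
```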
Regularization: LASSO vs. Ridge
Feature | LASSO (L1) | Ridge (L2)
Penalty | Absolute value of coefficients | Squared value of coefficients
Feature Selection | Yes (can set coefficients to zero) | No (shrinks them towards zero)
Best For | Sparse data with few important features | When most features are relevant
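For orthonormal features the two penalties have simple closed-form effects on a coefficient w: Ridge scales it by 1/(1+λ), while LASSO soft-thresholds it to zero when |w| ≤ λ (NumPy sketch with made-up coefficients):

```python
import numpy as np

w = np.array([3.0, 0.4, -2.0, 0.1])   # hypothetical OLS coefficients
lam = 0.5                              # regularization strength

ridge = w / (1 + lam)                  # L2: uniform shrinkage, never exactly 0
lasso = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # L1: soft threshold

print(ridge)   # all four coefficients stay non-zero
print(lasso)   # small coefficients are zeroed out -> feature selection
```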
Scalable K-Means Clustering
Implementation with Spark MLlib
Spark enables clustering of millions of points through:
  • Mini-Batch K-Means: Processes subsets of data at each iteration to save memory.
  • K-Means|| (Parallel): An optimized initialization method (similar to K-Means++) that reduces the number of passes over data.
  • Feature Scaling: Using StandardScaler is critical because K-Means is distance-based and sensitive to varying scales.
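The effect of StandardScaler can be reproduced in NumPy: subtract each column's mean and divide by its standard deviation, so no single feature dominates the Euclidean distance (invented columns for illustration):

```python
import numpy as np

# income (large scale) vs age (small scale): income would dominate distances.
X = np.array([[50_000.0, 25.0],
              [90_000.0, 40.0],
              [60_000.0, 30.0]])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```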
ML Pipeline Code (PySpark)
Typical workflow for a scalable ML model:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler, StandardScaler
  1. Assemble: Combine columns into a single features vector.
  2. Scale: Standardize features to have mean=0 and std=1.
  3. Fit: Fit the KMeans model on the scaled feature vectors across the cluster.
Real-World Applications
Finance
Anomaly detection for fraudulent credit card transactions.
Healthcare
Genomic data clustering for disease pattern identification.
Core Deep Learning Architectures
Feed Forward Networks (FFN)
Data flows in one direction (Input → Hidden → Output). No cycles.
Best For: Simple classification, regression.
Limitation: No memory of past info.
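A single forward pass of a tiny FFN in NumPy (random weights and layer sizes chosen just to show the one-directional flow; no training happens here):

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.random((1, 4))            # one input sample with 4 features
W1 = rng.random((4, 8))           # input -> hidden weights
W2 = rng.random((8, 3))           # hidden -> output weights

hidden = np.maximum(0, x @ W1)    # ReLU activation
output = hidden @ W2              # raw scores for 3 output units

print(output.shape)  # (1, 3)
```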
Convolutional Neural Nets (CNN)
Uses filters to detect spatial hierarchies.
Key Layers: Convolutional, Pooling, Fully Connected.
Best For: Image recognition, medical imaging.
Recurrent Neural Networks (RNN)
Contains loops to retain sequential dependencies (hidden state).
Challenge: Vanishing Gradient problem.
Solution: LSTM (Long Short-Term Memory) units.
Best For: NLP, time-series forecasting, speech-to-text.
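The "loop" is just the hidden state being fed back in at every time step; one vanilla RNN cell in NumPy (random weights, arbitrary sizes, no training):

```python
import numpy as np

rng = np.random.default_rng(7)
W_x = rng.standard_normal((3, 5))       # input -> hidden weights
W_h = rng.standard_normal((5, 5))       # hidden -> hidden (the recurrence)

h = np.zeros(5)                         # initial hidden state
sequence = rng.standard_normal((4, 3))  # 4 time steps, 3 features each

for x_t in sequence:
    h = np.tanh(x_t @ W_x + h @ W_h)    # h carries memory of earlier steps

print(h.shape)  # final hidden state summarizes the whole sequence
```

Repeated multiplication by W_h inside this loop is also where the vanishing-gradient problem comes from, which LSTMs mitigate with gated cell states.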
Generative Models: Variational Autoencoders (VAE)
VAEs learn probabilistic latent representations of data.
  • Encoder: Maps input data to a probability distribution (latent space).
  • Latent Space: A compressed, continuous bottleneck layer.
  • Decoder: Reconstructs data from samples taken from the latent space.
Applications: Image generation, anomaly detection, drug discovery.
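The sampling step between encoder and decoder can be sketched with the reparameterization trick, z = μ + σ·ε (NumPy, with hypothetical encoder outputs and a latent dimension of 2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input (latent dim = 2).
mu = np.array([0.5, -1.0])
log_var = np.array([0.1, 0.2])

# Reparameterization trick: sample eps ~ N(0, I), then shift and scale it,
# keeping the sample differentiable with respect to mu and log_var.
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * log_var) * eps

print(z.shape)  # this z is what the decoder reconstructs from
```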
Deep Learning Comparison
Architecture | Data Type | Key Concept
CNN | Grid (Images/Video) | Spatial Feature Extraction
RNN / LSTM | Sequential (Text/Time) | Temporal Memory
VAE | Any (Unsupervised) | Probabilistic Latent Space
GAN | Any (Generative) | Adversarial Competition