| Type | Focus | Goal |
|---|---|---|
| Descriptive | What happened? | Historical summary (Sales reports). |
| Diagnostic | Why did it happen? | Root cause analysis (Churn). |
| Predictive | What will happen? | Future forecasting (Loan default). |
| Prescriptive | What should we do? | Action optimization (Logistics). |

| Platform | Storage | Processing | Streaming |
|---|---|---|---|
| AWS | S3, Glue | EMR (Spark/Hadoop) | Kinesis |
| GCP | BigQuery, GCS | Cloud Dataproc | Pub/Sub |
| Azure | Data Lake (ADLS) | Synapse Analytics | Stream Analytics |

| Feature | HDFS (Hadoop) | Amazon S3 (Cloud) |
|---|---|---|
| Storage Model | Block storage (128MB blocks) | Object storage (Key-Value) |
| Architecture | Master-Slave (NameNode/DataNode) | Decentralized/Cloud-based |
| Scalability | Scales by adding more nodes | Virtually infinite; managed by AWS |
| Fault Tolerance | Replication (default: 3) | High durability (across regions) |
| Access | Internal Hadoop interface | REST API, Web, CLI |

| Type | Database | Ideal Use Case |
|---|---|---|
| Document | MongoDB / CouchDB | Flexible web apps, logs. |
| Wide-Column | Cassandra / HBase | High-write, distributed apps. |
| Key-Value | Redis / DynamoDB | Caching, session management. |
| Graph | Neo4j | Fraud detection, social networks. |
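The key-value contract behind caches and session stores can be sketched in plain Python. The `TTLCache` class below is a hypothetical illustration of the GET/SET-with-expiry pattern, not the Redis or DynamoDB API:

```python
import time

class TTLCache:
    """Minimal key-value cache with per-key expiry, mimicking the
    set/get pattern used for caching and session management."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expiry = entry
        if expiry is not None and time.time() >= expiry:
            del self._data[key]  # lazily evict expired keys
            return default
        return value

cache = TTLCache()
cache.set("session:42", {"user": "ada"}, ttl=30)
print(cache.get("session:42"))       # {'user': 'ada'}
print(cache.get("missing", "miss"))  # miss
```

Real stores add persistence, eviction policies, and network protocols, but the O(1) lookup-by-key model is the same.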

| Feature | Parallel Computing | Distributed Computing |
|---|---|---|
| Memory | Shared memory across processors | Independent memory for each node |
| Communication | Shared memory / Interconnect | Network message passing |
| Scalability | Limited by hardware vertical scaling | High (Horizontal scaling) |
| Example | Multi-core CPUs, GPUs | Hadoop, Google Search, Cloud clusters |
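The memory/communication rows of the table can be illustrated in one script: a shared-memory model where workers mutate a single lock-guarded structure, versus a message-passing model where workers hold no shared state and only send partial results over a channel. This is a toy analogy using threads, not a real cluster:

```python
import threading
import queue

# Shared-memory model (parallel computing): all workers read/write one
# structure, guarded by a lock.
counter = {"value": 0}
lock = threading.Lock()

def shared_worker(n):
    for _ in range(n):
        with lock:
            counter["value"] += 1

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])  # 4000

# Message-passing model (distributed computing): workers communicate
# only by sending results as "messages" over a channel.
results = queue.Queue()

def distributed_worker(chunk):
    results.put(sum(chunk))  # send a partial result

chunks = [range(0, 500), range(500, 1000)]
workers = [threading.Thread(target=distributed_worker, args=(c,)) for c in chunks]
for w in workers:
    w.start()
for w in workers:
    w.join()
total = results.get() + results.get()
print(total)  # 499500
```

In Hadoop or Spark the second pattern is what scales horizontally: each node computes a partial result on its own data and ships it over the network.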

MapReduce represents all data as (Key, Value) pairs. For input data, a compact format such as SequenceFile or Parquet is more efficient than plain TextInputFormat. In Spark, the driver program creates the SparkSession, converts code into tasks, and schedules them on executors.

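The (Key, Value) flow can be sketched without Hadoop. This word-count example (the canonical MapReduce illustration) separates the three phases the framework normally runs for you:

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework would
    # before handing each key's values to a reducer.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the list of values for each key.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big ideas", "data moves fast"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, map tasks run on the nodes holding each input split and the shuffle moves data across the network, but the (Key, Value) contract is exactly this.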
| Feature | RDD (Resilient Distributed Dataset) | DataFrame |
|---|---|---|
| Level | Low-level API | High-level API |
| Structure | Unstructured (Collection of objects) | Structured (Rows & Columns) |
| Optimization | Manual optimization | Automatic (Catalyst Optimizer) |
| Ease of Use | Complex (Lambda functions) | Easy (SQL-like syntax) |

Transformations are lazy and only build the execution plan: filter(), map(), groupBy(), select(). Actions trigger actual computation: show(), collect(), count(), write().
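Spark's split between lazy transformations and eager actions behaves much like Python generators: nothing executes until a terminal operation consumes the pipeline. A plain-Python analogy (not the Spark API itself):

```python
log = []

def numbers():
    for n in range(5):
        log.append(n)  # record when an element is actually produced
        yield n

# "Transformations": build the pipeline lazily; nothing has run yet.
pipeline = (n * 10 for n in numbers() if n % 2 == 0)
print(log)      # [] -- no element produced so far

# "Action": consuming the pipeline triggers the whole chain at once.
result = list(pipeline)
print(result)   # [0, 20, 40]
print(log)      # [0, 1, 2, 3, 4]
```

This laziness is what lets Spark's Catalyst optimizer see the whole plan (e.g. push filters before shuffles) before any work starts.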
A SparkSession is created via SparkSession.builder. Apply .filter() early to reduce data size before a shuffle, and use dropna() or fillna() to handle missing records.
Example: df.filter(col("status") == "Active").groupBy("id").agg(sum("amount"))

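To see what that chain computes, here is a plain-Python equivalent run on toy rows (the column names `status`, `id`, `amount` come from the snippet; the data is made up):

```python
from collections import defaultdict

rows = [
    {"id": 1, "status": "Active",   "amount": 100},
    {"id": 1, "status": "Active",   "amount": 50},
    {"id": 2, "status": "Inactive", "amount": 75},
    {"id": 2, "status": "Active",   "amount": 25},
]

# df.filter(col("status") == "Active")
active = [r for r in rows if r["status"] == "Active"]

# .groupBy("id").agg(sum("amount"))
totals = defaultdict(int)
for r in active:
    totals[r["id"]] += r["amount"]

print(dict(totals))  # {1: 150, 2: 25}
```

Note the filter runs before the grouping, which is exactly the "filter early to shrink data before a shuffle" advice: in Spark, groupBy triggers a shuffle, so rows dropped first never cross the network.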
| Feature | LASSO (L1) | Ridge (L2) |
|---|---|---|
| Penalty | Absolute value of coefficients | Squared value of coefficients |
| Feature Selection | Yes (can set coefficients to zero) | No (shrinks them towards zero) |
| Best For | Sparse data with few important features | When most features are relevant |
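The zero-vs-shrink distinction in the table is easiest to see in the one-dimensional closed forms: given an unpenalized estimate `b` and penalty strength `lam`, the L1 solution soft-thresholds (and can land exactly on zero) while the L2 solution only rescales. A minimal sketch of the textbook closed forms, not a library API:

```python
def lasso_1d(b, lam):
    # L1 penalty: soft-thresholding can set the coefficient exactly to zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_1d(b, lam):
    # L2 penalty: shrinks toward zero but never reaches it (finite lam).
    return b / (1.0 + lam)

print(lasso_1d(0.3, 1.0))  # 0.0  -> small coefficient dropped entirely
print(ridge_1d(0.3, 1.0))  # 0.15 -> small coefficient kept, shrunk
print(lasso_1d(2.0, 1.0))  # 1.0  -> large coefficient survives, reduced
```

This is why LASSO doubles as a feature selector on sparse problems, while Ridge keeps every feature with dampened weights.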

StandardScaler is critical because K-Means is distance-based and sensitive to varying feature scales. The relevant imports are from pyspark.ml.clustering import KMeans and from pyspark.ml.feature import VectorAssembler, StandardScaler. VectorAssembler combines the input columns into a single features vector, and KMeans.fit() then trains the model in the distributed cluster environment.

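The scaling step and the clustering step can both be sketched in plain Python. This toy standardizer does what StandardScaler does per feature (zero mean, unit variance), and the one-dimensional Lloyd iteration stands in for KMeans; it is an illustration, not the pyspark.ml API:

```python
def standardize(column):
    # Same idea as StandardScaler: zero mean, unit variance per feature.
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = var ** 0.5 or 1.0  # guard against a constant column
    return [(x - mean) / std for x in column]

def kmeans_1d(points, iters=10):
    # Tiny 1-D Lloyd's algorithm, k=2, deterministic extreme-point init.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

incomes = [30_000, 32_000, 90_000, 95_000]  # raw scale would dominate distance
scaled = standardize(incomes)
print([round(c, 2) for c in kmeans_1d(scaled)])
```

With several features on wildly different scales (income vs. age, say), skipping standardization lets the large-scale feature dominate every distance computation, which is exactly the failure mode the notes warn about.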
| Architecture | Data Type | Key Concept |
|---|---|---|
| CNN | Grid (Images/Video) | Spatial Feature Extraction |
| RNN / LSTM | Sequential (Text/Time) | Temporal Memory |
| VAE | Any (Unsupervised) | Probabilistic Latent Space |
| GAN | Any (Generative) | Adversarial Competition |