Big Data Analytics (MDS 5212A) - Comprehensive Answer Key
Revision Notes
1. Foundational Concepts
Ref: (April 2025)
1. Briefly explain how big data analytics helps businesses improve decision-making.
By analyzing large volumes of data and identifying patterns, trends, and insights, businesses can move from intuition-based to evidence-based strategies.
Ref: (April/Dec 2025)
2. Explain the Five V’s of big data and how they influence big data analytics.
- Volume: Refers to the size of data. Impact: Requires scalable storage and distributed processing systems.
- Velocity: Refers to the speed of data generation and processing. Impact: Enables real-time or near real-time analytics.
- Variety: Refers to different types of data (structured, semi-structured, unstructured). Impact: Requires flexible tools and advanced data integration techniques.
- Veracity: Refers to the quality and reliability of data. Impact: Affects the accuracy of insights and requires data cleaning.
- Value: Refers to the usefulness of data. Impact: Drives decision-making and ensures return on investment (ROI).
Ref: (April/Dec 2025)
3. Differentiate between structured, unstructured, and semi-structured data with examples.
Structured: Rigid format like SQL tables; Unstructured: No predefined model like video/images; Semi-structured: Contains tags/metadata like JSON or XML.
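The distinction is easiest to see in code. A minimal Python sketch of handling a semi-structured JSON record (the field names are invented for illustration): unlike a rigid SQL row, fields can be nested or missing entirely, so access must be defensive.

```python
import json

# A semi-structured record: self-describing keys, nested and optional fields.
raw = '{"id": 101, "name": "Asha", "tags": ["retail", "vip"], "address": {"city": "Nairobi"}}'

record = json.loads(raw)

# Fields may be absent, so use .get() with defaults instead of fixed columns.
city = record.get("address", {}).get("city", "unknown")
tags = record.get("tags", [])

print(city, len(tags))
```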
Ref: (Dec 2024)
4. Describe any three benefits of using big data analytics for businesses today.
Organizations use it to enhance customer experiences, optimize operations (like inventory), and drive innovation via new data-driven products.
2. Architectures & Data Storage
Ref: (April 2025/Dec 2024)
5. Differentiate between data lakes and data warehouses in terms of architecture, processing, and use cases.
Data Lakes store raw data in native formats (schema-on-read) for discovery; Data Warehouses store highly structured data (schema-on-write) optimized for business intelligence and reporting.
Ref: (August 2025)
6. Compare and contrast Lambda and Kappa architectures for real-time processing.
Lambda uses a batch layer for historical accuracy and a speed layer for real-time; Kappa treats everything as a stream, removing the batch layer for simplicity.
Ref: (Dec 2024)
7. Explain the steps involved in the big data analytics lifecycle.
- Define the problem
- Collect data
- Clean & prepare data
- Store data
- Process data
- Analyze data
- Visualize insights
- Make decisions
- Monitor & improve
Ref: (Dec 2024)
8. Explain the key concepts that guide how distributed file storage works.
It involves splitting massive files into smaller blocks, replicating them across multiple hardware nodes for fault tolerance, and using a central manager (like a NameNode) to track data locations.
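The bookkeeping described above can be sketched in a few lines of Python (block size, node names, and placement policy are invented for illustration; HDFS, for comparison, uses 128 MB blocks and a NameNode):

```python
# Sketch of distributed-file-storage bookkeeping: split a file into
# fixed-size blocks, replicate each block across several nodes, and
# keep a central table of where every block lives.
BLOCK_SIZE = 4          # bytes per block (tiny, for illustration)
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def store(data: bytes) -> dict:
    """Return a 'namenode table': block id -> (content, replica nodes)."""
    table = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i, block in enumerate(blocks):
        # Round-robin placement puts each replica on a distinct node.
        replicas = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
        table[i] = (block, replicas)
    return table

table = store(b"hello distributed world")
# Any single node can fail and every block still has two other copies.
```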
3. Big Data Technologies
Ref: (Dec 2024)
9. Describe the role of Hadoop in Big Data.
Hadoop provides a distributed framework (HDFS for storage and MapReduce for processing) to handle datasets that are too large for single-server processing.
Ref: (Dec 2024)
10. Explain the term "Resilient Distributed Dataset (RDD)" in Apache Spark.
An RDD is Spark’s basic data structure: fault-tolerant (resilient), partitioned across a cluster (distributed), and a read-only collection of objects.
Ref: (Dec 2024)
11. What is the role of blockchain technology in Big Data Analytics?
It provides a decentralized, tamper-proof record of transactions, ensuring the veracity and trust of the data being analyzed.
Ref: (Dec 2024)
12. Differentiate between batch processing and stream processing.
| Aspect | Batch Processing | Stream Processing |
| --- | --- | --- |
| Definition | Processes large volumes of data at once (in batches) | Processes data continuously as it arrives |
| Data Handling | Data is collected first, then processed | Data is processed in real time |
| Speed | Slower (delayed results) | Very fast (near real-time results) |
| Latency | High latency | Low latency |
| Use Case | Historical analysis, reporting | Real-time analytics, monitoring |
| Complexity | Simpler to implement | More complex to design |
| Examples | Payroll systems, monthly reports | Fraud detection, live recommendations |
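The latency difference in the table can be shown with one toy aggregation done both ways (the event values are invented):

```python
# The same aggregation done batch-style and stream-style.
events = [5, 3, 8, 1, 9, 4]   # e.g. transaction amounts arriving over time

# Batch: collect everything first, then process once at the end.
batch_total = sum(events)

# Stream: process each event as it "arrives", keeping running state.
running_total = 0
snapshots = []
for amount in events:
    running_total += amount
    snapshots.append(running_total)   # an answer exists after every event

# Both end at the same total, but the stream version produced low-latency
# intermediate results; the batch version produced nothing until the end.
assert snapshots[-1] == batch_total
```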
4. Machine Learning & Advanced Analytics
Ref: (April 2025)
13. Explain two key ways in which deep learning enhances large-scale analytics.
It enables high-accuracy classification of unstructured data (like images) and captures complex, non-linear relationships in massive datasets better than traditional models.
Ref: (Dec 2024)
14. Differentiate between supervised and unsupervised learning (tasks and examples).
| Aspect | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Definition | Learning using labeled data (input with known output) | Learning using unlabeled data (no predefined output) |
| Goal | Predict outcomes based on known labels | Discover hidden patterns or structures in data |
| Tasks | Classification, Regression | Clustering, Association |
| Examples | Email spam detection, house price prediction | Customer segmentation, market basket analysis |
| Data Requirement | Requires a labeled dataset | Works with unlabeled data |
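Both paradigms fit in a short sketch (the data and labels are invented; a 1-nearest-neighbor rule and a tiny 1-D k-means stand in for real models):

```python
# Supervised: labeled examples; predict a label for a new point (1-NN).
labeled = [(1.0, "cheap"), (2.0, "cheap"), (9.0, "premium"), (10.0, "premium")]

def predict(x: float) -> str:
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: raw numbers only; discover two groups with 1-D k-means.
def two_means(points, iters=10):
    c1, c2 = min(points), max(points)            # initial centers
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

print(predict(1.7))                                # label learned from examples
print(two_means([1.0, 1.5, 2.0, 9.0, 9.5, 10.0]))  # groups found without labels
```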
Ref: (Dec 2024)
15. Summarize any four contributions of machine learning in Big Data Analytics.
Pattern recognition, predictive modeling, automation of data cleaning, and real-time anomaly detection.
5. Case Study: Walmart (Retail)
Ref: (April 2025)
16. Describe MapReduce and how it enables Walmart to process transaction data.
MapReduce splits data into chunks; the "Map" phase filters/sorts data while the "Reduce" phase aggregates it, enabling parallel processing of a billion daily transactions.
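The two phases can be illustrated with word count, the canonical MapReduce example. This pure-Python sketch runs the phases serially; real MapReduce runs map and reduce tasks in parallel across a cluster, and the framework performs the shuffle between them:

```python
from collections import defaultdict

documents = ["milk bread milk", "bread eggs", "milk eggs eggs"]

# Map phase: each input chunk independently emits (key, 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values independently.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {"milk": 3, "bread": 2, "eggs": 3}
```

Because each map task and each reduce task touches only its own slice of data, the same pattern scales from three strings to a billion daily transactions.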
Ref: (April 2025)
17. How does Hadoop’s distributed model help Walmart's decision-making?
It allows for the simultaneous analysis of sales trends, purchasing behavior, and logistics, helping Walmart predict demand fluctuations globally.
Ref: (April 2025)
18. Identify a challenge Walmart faces with Hadoop and suggest two solutions.
Challenge: Latency in processing real-time stockouts. Solutions: Integrate Apache Spark for in-memory speed or adopt a Lambda architecture to handle real-time streams.
Ref: (April 2025)
19. Suggest how integrating Apache Spark could improve Walmart’s system.
Spark provides real-time, in-memory processing, significantly reducing the time required for inventory optimization compared to disk-based MapReduce.
Ref: (April 2025)
20. Identify another industry for Hadoop/MapReduce and justify.
Banking: To analyze millions of historical transaction records for long-term fraud patterns and risk assessment.
6. Case Study: Uber (Ride-Hailing)
Ref: (April 2025)
21. Explain ecosystem components Uber integrates into its pricing model.
Uber integrates real-time ride requests, GPS driver telemetry, traffic data, and external weather feeds to feed its dynamic algorithm.
Ref: (April 2025)
22. Describe how Uber's dynamic pricing algorithm processes real-time data.
It continuously analyzes current supply (drivers) vs. demand (requests) and raises prices when demand outstrips supply to attract more drivers to an area.
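A toy version of that supply/demand logic (the linear multiplier and the cap are invented for illustration, not Uber's actual algorithm):

```python
def surge_multiplier(requests: int, drivers: int, cap: float = 3.0) -> float:
    """Raise prices as demand outstrips supply, capped for fairness."""
    if drivers <= 0:
        return cap
    ratio = requests / drivers
    # No surge while supply meets demand; scale linearly beyond that.
    return round(min(max(1.0, ratio), cap), 2)

print(surge_multiplier(50, 100))   # plenty of drivers -> 1.0
print(surge_multiplier(200, 80))   # demand spike -> 2.5
print(surge_multiplier(500, 50))   # extreme spike hits the cap -> 3.0
```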
Ref: (April 2025)
23. Analyze three advantages of Uber's dynamic pricing model.
1) Incentivizes drivers during peak demand, 2) Reduces passenger wait times, 3) Balances overall marketplace supply and demand.
Ref: (April 2025)
24. Critically evaluate ethical concerns regarding Uber’s surge pricing.
The primary concern is affordability during emergencies, natural disasters, or major public events where prices can become prohibitive.
Ref: (April 2025)
25. Suggest three ML improvements to make Uber’s pricing fairer.
1) Capping prices during detected emergencies, 2) Predicting demand spikes earlier to pre-deploy drivers, 3) Adjusting surge factors based on local economic contexts.
7. Case Study: Social Media (Facebook & Twitter)
Ref: (April 2025)
26. Explain features of Facebook’s TAO architecture for real-time interactions.
TAO is a distributed store optimized for read-heavy workloads that bridges databases and caching to provide low-latency access to social graph data (likes/friends).
Ref: (April 2025)
27. Compare TAO with Cassandra or HBase for social media data.
TAO is purpose-built for read-heavy social-graph queries with eventual consistency, serving billions of reads per second; Cassandra and HBase are general-purpose wide-column stores, better suited to write-heavy or batch workloads than to low-latency graph access.
Ref: (April 2025)
28. Discuss how Facebook ensures scalability and fault tolerance in TAO.
By deploying TAO across multiple global data centers with replication; if one region fails, other regions continue serving requests.
Ref: (April 2025)
29. Outline three stages in a Twitter sentiment analysis pipeline.
1) Data Collection (tweets), 2) Preprocessing (cleaning noise/bots), 3) Sentiment scoring using NLP techniques.
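The three stages can be sketched end to end with a toy lexicon scorer (the word lists, sample tweets, and scoring rule are invented; production pipelines use learned NLP models for stage 3):

```python
import re

POSITIVE = {"great", "love", "win"}
NEGATIVE = {"bad", "hate", "lose"}

def preprocess(tweet: str) -> list[str]:
    """Stage 2: lowercase, strip URLs and mentions, tokenize."""
    tweet = re.sub(r"https?://\S+|@\w+", "", tweet.lower())
    return re.findall(r"[a-z']+", tweet)

def score(tokens: list[str]) -> int:
    """Stage 3: lexicon-based score, +1/-1 per matched word."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

# Stage 1 in practice is API collection; here, a hardcoded sample.
tweets = ["I love this, great stuff! https://t.co/x", "@user bad result, I hate it"]
scores = [score(preprocess(t)) for t in tweets]
print(scores)  # [2, -2]
```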
Ref: (April 2025)
30. Describe how Word2Vec, LSTM, and BERT enhance sentiment analysis.
Word2Vec captures word context; LSTM handles long-term dependencies in sentences; BERT (transformers) captures deep language nuances.
Ref: (April 2025)
31. Discuss three challenges in Twitter sentiment analysis for elections.
Challenges include the presence of misinformation, automated bot accounts that inflate sentiment, and echo chambers that skew overall opinion trends.
8. Case Study: Healthcare & Finance
Ref: (August 2025)
32. How can ML be applied in disease outbreak prediction?
By analyzing hospital patient records, geographical movement patterns (GPS), and climate factors to detect epidemic clusters.
Ref: (August 2025)
33. Discuss three challenges of integrating social media data with health data.
1) Data veracity (fake news vs. symptoms), 2) Privacy of medical history, 3) Technical difficulty of merging unstructured social data with rigid hospital records.
Ref: (August 2025)
34. Explain how anomaly detection is used in bank fraud detection.
It flags transactions that deviate from spending patterns (e.g., location anomalies or sudden large transfers).
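A minimal statistical version of this idea uses a z-score rule (the history and the 2.5-sigma threshold are illustrative; real systems combine many features with learned models):

```python
from statistics import mean, stdev

def flag_anomalies(amounts: list[float], threshold: float = 2.5) -> list[float]:
    """Flag amounts more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [a for a in amounts if abs(a - mu) > threshold * sigma]

# Typical daily spending, then one sudden large transfer.
history = [20, 25, 22, 18, 24, 21, 19, 23, 20, 22, 950]
print(flag_anomalies(history))  # the 950 transfer stands out
```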
Ref: (August 2025)
35. Discuss three advantages of supervised vs unsupervised learning in fraud detection.
Higher accuracy for known fraud types, clear labeling for investigator follow-up, and lower false positives in historical fraud scenarios.
Ref: (August 2025)
36. What risks are associated with false positives in fraud detection?
Risks include customer frustration, blocked legitimate transactions, and brand damage; these can be mitigated by tuning ML decision thresholds.
9. Case Study: Smart Cities & Marketing
Ref: (August 2025)
37. Describe three ways big data and IoT improve traffic management.
Using sensors for real-time flow monitoring, using GPS for route optimization, and historical data to predict peak congestion.
Ref: (August 2025)
38. Explain three challenges of large-scale IoT data for urban planning.
Storage complexity, real-time latency in data transmission, and reliability of the massive sensor network.
Ref: (August 2025)
39. Explain how customer segmentation enhances marketing strategies.
It groups customers by behavior (browsing history/purchase) to allow for tailored promotions and highly relevant product recommendations.
Ref: (August 2025)
40. How does A/B testing help improve personalized marketing?
It tests different ad versions or layouts on subsets of users to determine which produces higher conversion rates before full deployment.
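A back-of-envelope version of the comparison (the conversion counts are made up; a real test would also check statistical significance before declaring a winner):

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    return conversions / visitors

# Variant A: current ad; variant B: new layout (illustrative numbers).
rate_a = conversion_rate(120, 4000)   # 3.0%
rate_b = conversion_rate(180, 4000)   # 4.5%

winner = "B" if rate_b > rate_a else "A"
lift = (rate_b - rate_a) / rate_a     # relative improvement of B over A
print(winner, round(lift, 2))
```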
Ref: (August 2025)
41. Discuss three impacts of data privacy regulations (GDPR) on marketing.
It mandates explicit user consent, limits how long behavioral data can be kept, and gives users the right to delete their profiles.
10. Operations, Security & Ethics
Ref: (August 2025)
42. Explain the role of data preprocessing and three key techniques.
Preprocessing improves quality; techniques include Data Cleaning (removing noise), Data Integration (merging sources), and Data Transformation (normalization).
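The transformation step can be illustrated with min-max normalization, one common rescaling technique (the values are toy data):

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values to [0, 1] so features on different scales become comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 45, 60]
scaled = min_max_normalize(ages)
print(scaled)  # 18 maps to 0.0, 60 maps to 1.0, the rest in between
```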
Ref: (August 2025/Dec 2024)
43. Describe different visualization techniques and their importance.
Techniques like heatmaps, scatter plots, and dashboards help translate complex trends into actionable insights for non-technical managers.
Ref: (Dec 2024)
44. Summarize the main security challenges associated with Big Data.
Securing distributed nodes, ensuring the privacy of massive datasets, and managing access control for raw data in lakes.
Ref: (August 2025)
45. What are ethical/legal considerations in handling large-scale data?
Key issues include privacy protection, preventing algorithmic bias, ensuring transparency in data use, and regulatory compliance.