Q1
Question One
Compulsory
20 Marks
A(i)
Define the following concept in the context of analytics applications and provide an example: Data Pipeline
2 marks
Model Answer
Definition: A data pipeline is an automated set of processes that moves data from one or more sources through a series of transformation steps to a destination for storage, processing, or analysis. It typically handles ingestion, validation, transformation, and loading in a coordinated workflow.
Example: An e-commerce company uses Apache Airflow to extract daily sales transactions from its MySQL database, clean and aggregate the data, then load it into a Redshift data warehouse where the BI team runs reports.
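A toy version of such a pipeline can be written end to end in pure Python, with an in-memory SQLite table standing in for the warehouse (all table and field names are invented for illustration):

```python
import sqlite3

# Toy ETL pipeline: extract raw sales rows, clean and aggregate them, and
# load the result into a destination table. Stand-ins for MySQL/Redshift.
def extract():
    # Pretend these rows came from the source database.
    return [
        {"order_id": 1, "amount": "19.99", "region": "EU"},
        {"order_id": 2, "amount": "5.00", "region": "EU"},
        {"order_id": 3, "amount": "12.50", "region": "US"},
    ]

def transform(rows):
    # Validation + aggregation: cast amounts, sum revenue per region.
    totals = {}
    for row in rows:
        amount = float(row["amount"])  # raises on bad data, so the run fails fast
        totals[row["region"]] = totals.get(row["region"], 0.0) + amount
    return sorted(totals.items())

def load(aggregates, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS daily_revenue (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", aggregates)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
result = conn.execute("SELECT region, revenue FROM daily_revenue ORDER BY region").fetchall()
```

In a real deployment each function would be a separate scheduled task with retries and alerting; the coordinated extract → transform → load flow is the essence of the definition above.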
A(ii)
Define the following concept in the context of analytics applications and provide an example: Data Lake
2 marks
Model Answer
Definition: A data lake is a centralised repository that stores raw data in its native format — structured, semi-structured, and unstructured — at any scale, without requiring a predefined schema. Schema is applied only at the time of analysis ("schema-on-read").
Example: A healthcare organisation stores millions of medical images (JPEG), patient records (JSON), and lab reports (CSV) in Amazon S3. Data scientists can query any data type on demand using Amazon Athena without prior data modelling.
A(iii)
Define the following concept in the context of analytics applications and provide an example: Data Quality Assurance
2 marks
Model Answer
Definition: Data Quality Assurance is the systematic process of ensuring that data is accurate, complete, consistent, timely, and fit for its intended use. It encompasses validation rules, data profiling, automated checks, and monitoring throughout the data lifecycle.
Example: A bank uses Great Expectations to automatically validate that all customer records contain a valid national ID, no null account numbers, and transaction amounts within expected statistical ranges before loading data into the analytics platform.
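The kinds of rules described can be written as plain validation functions (a sketch of the checks themselves, not the Great Expectations API; field names are invented):

```python
# Illustrative data-quality checks of the kind a tool like Great Expectations
# automates: presence, null, and statistical-range rules applied before load.
def validate_record(record, min_amount=0.0, max_amount=10_000.0):
    errors = []
    if not record.get("national_id"):
        errors.append("missing national_id")
    if record.get("account_number") is None:
        errors.append("null account_number")
    amount = record.get("amount")
    if amount is None or not (min_amount <= amount <= max_amount):
        errors.append("amount outside expected range")
    return errors

good = {"national_id": "A123", "account_number": "0042", "amount": 250.0}
bad = {"national_id": "", "account_number": None, "amount": 1_000_000.0}
```

Records with a non-empty error list would be quarantined rather than loaded into the analytics platform.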
B
Explain the challenges of integrating diverse data sources in analytics applications and describe two strategies to address these challenges.
4 marks
Model Answer
Challenges: Diverse data sources use varying formats (JSON, XML, CSV, binary), schemas, encodings, and update frequencies. Inconsistencies in naming conventions, duplicate records across systems, and latency mismatches make unified analysis difficult. Security and access-control differences across systems add further complexity, as does managing schema drift when upstream systems change without warning.
Strategy 1 — Data Standardisation / Schema Mapping: Define a canonical data model and use middleware (e.g., Apache Kafka with Avro schemas) to enforce consistent formats across all sources before ingestion into the analytics layer.
Strategy 2 — Master Data Management (MDM): Implement an MDM platform (e.g., Informatica MDM) to create a single authoritative "golden record" by deduplicating and reconciling conflicting data across all integrated systems.
C
Compare the roles of Extract Transform Load (ETL) and Extract Load Transform (ELT) processes in analytics applications, focusing on their impact on performance and data management.
4 marks
Model Answer
ETL: Data is extracted, transformed on a dedicated server, then loaded into the destination. Transformation happens before the data warehouse, protecting it from dirty data. Better suited to legacy, on-premises systems with limited destination compute. Can become a bottleneck at scale.
ELT: Raw data is loaded first into a cloud data warehouse (e.g., BigQuery, Snowflake), then transformed using the warehouse's compute. More scalable and flexible — analysts can re-transform data without re-ingestion. Requires strong governance to manage raw data sprawl.
Data management: ETL delivers only clean data to the destination; ELT requires data governance practices applied within the warehouse to prevent analytical errors from raw data.
D
Discuss the main phases of developing an end-to-end analytics application, from data collection to deployment, and give an example of a tool used at each phase.
6 marks
Model Answer
1. Data Collection: Gathering raw data from APIs, databases, IoT sensors, web scraping, or streaming sources. Tool: Apache Kafka — real-time distributed event streaming platform.
2. Data Storage: Storing raw and processed data at scale for downstream access. Tool: Amazon S3 — scalable object storage used as the foundation of a data lake.
3. Data Processing & Transformation: Cleaning, enriching, filtering, and aggregating data into analytical-ready form. Tool: Apache Spark — distributed processing engine for batch and streaming transformations.
4. Modelling & Analysis: Building statistical or machine learning models on processed data to extract insights. Tool: Python with Scikit-learn / TensorFlow.
5. Visualisation: Presenting insights via interactive dashboards and reports. Tool: Tableau or Power BI.
6. Deployment & Monitoring: Deploying models and dashboards into production and monitoring performance over time. Tool: MLflow — open-source platform for model tracking, versioning, and deployment.
Q2
Question Two
Optional
15 Marks
A·I
Explain the purpose of the following machine learning technique in analytics applications, and provide a sample use case: Association Rule Mining (e.g., for market basket analysis)
3 marks
Model Answer
Purpose: Association Rule Mining discovers interesting relationships, co-occurrence patterns, and correlations among variables in large transactional datasets using if-then rules of the form "if item A is purchased, then item B is also purchased." Key metrics are Support (frequency of the rule), Confidence (reliability), and Lift (strength above random chance).
Use case: A supermarket uses the Apriori algorithm on point-of-sale data and finds that customers who buy bread and butter frequently also buy jam ({bread, butter} → {jam}). This drives product placement decisions and targeted bundle promotions, increasing average basket value.
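The three metrics can be computed directly over toy baskets (data invented for illustration):

```python
# Support, confidence, and lift for the rule {bread, butter} -> {jam}
# over a small set of market baskets.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread", "butter"}, {"jam"}
rule_support = support(antecedent | consequent)   # P(A and B) = 0.4
confidence = rule_support / support(antecedent)   # P(B | A)  ~ 0.67
lift = confidence / support(consequent)           # P(B|A)/P(B) ~ 1.11 (> 1: positive association)
```

Apriori's contribution is pruning the search over itemsets efficiently; the metrics themselves are exactly these ratios.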
A·II
Explain the purpose of the following machine learning technique in analytics applications, and provide a sample use case: Neural Networks (e.g., for image recognition)
3 marks
Model Answer
Purpose: Neural Networks are computational models inspired by the human brain, composed of interconnected layers of nodes (neurons) that learn complex, non-linear patterns from data through forward propagation and gradient-descent-based backpropagation. They excel at tasks involving images, audio, text, and sequences.
Use case: A hospital deploys a Convolutional Neural Network (CNN) trained on labelled chest X-rays to automatically detect pneumonia. The network learns hierarchical visual features — edges, shapes, then pathological patterns — achieving diagnostic accuracy comparable to specialist radiologists.
A·III
Explain the purpose of the following machine learning technique in analytics applications, and provide a sample use case: Anomaly Detection
3 marks
Model Answer
Purpose: Anomaly Detection identifies data points, events, or observations that deviate significantly from expected patterns. It is used to surface unusual behaviours that may indicate fraud, equipment faults, security breaches, or novel events. Methods range from statistical thresholds to unsupervised ML algorithms such as Isolation Forest and Autoencoders.
Use case: A bank uses an Isolation Forest model to monitor real-time credit card transactions. Transactions that deviate in amount, geography, or frequency from a cardholder's historical profile are flagged and blocked pending customer verification, reducing fraud losses.
A·IV
Explain the purpose of the following machine learning technique in analytics applications, and provide a sample use case: Gradient Boosting
3 marks
Model Answer
Purpose: Gradient Boosting is an ensemble learning technique that builds models sequentially; each new weak learner (typically a decision tree) is trained to correct the residual errors of the previous ensemble using gradient descent on a differentiable loss function. It consistently produces state-of-the-art results on structured/tabular data.
Use case: An insurance company uses XGBoost to predict customer churn. By iteratively correcting prediction errors on features such as policy age, claim frequency, and payment history, the model accurately identifies high-risk customers 30 days before expected cancellation, enabling proactive retention campaigns.
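The residual-correction loop described above can be shown from scratch with depth-1 regression stumps (a teaching sketch for squared loss, not the XGBoost API; data values are invented):

```python
# Each boosting round fits a one-split "stump" to the residuals of the
# current ensemble, then adds it with a shrinkage factor (learning rate).
def fit_stump(xs, residuals):
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    base = sum(ys) / len(ys)  # initial prediction: the mean
    stumps = []
    def predict(x):
        return base + sum(lr * s(x) for s in stumps)
    for _ in range(rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
predict = gradient_boost(xs, ys)
```

Each round shrinks the remaining error geometrically, which is why the sequential ensemble converges to the step-shaped target.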
A·V
Explain the purpose of the following machine learning technique in analytics applications, and provide a sample use case: Transfer Learning
3 marks
Model Answer
Purpose: Transfer Learning applies knowledge gained from training a model on one task or large dataset to a different but related task, dramatically reducing the need for large labelled datasets and training time. Pre-trained models (e.g., BERT for NLP, ResNet for images) are fine-tuned on domain-specific data.
Use case: A startup with only 500 labelled customer support tickets fine-tunes BERT (pre-trained on billions of words of general English text) to build a sentiment classifier. The model achieves high accuracy without the cost of training a large language model from scratch.
Q3
Question Three
Optional
15 Marks
A
Describe the benefits of using Infrastructure as Code (IaC) in deploying cloud-based analytics applications and provide an example of a tool used for IaC.
6 marks
Model Answer
1. Repeatability & Consistency: Infrastructure is defined in code and deployed identically across dev, staging, and production environments, eliminating configuration drift and "works on my machine" problems.
2. Version Control: Infrastructure definitions stored in Git enable full change history, rollback to known-good states, peer review via pull requests, and collaboration — exactly like application code.
3. Automation & Speed: Entire cloud environments (networking, compute, databases) can be provisioned in minutes rather than hours of manual portal configuration.
4. Cost Optimisation: Environments can be automatically torn down after hours or on weekends through scheduled scripts, reducing idle cloud spend.
5. Auditability & Compliance: Every infrastructure change is documented in version history, providing evidence trails required for regulatory audits (ISO 27001, SOC 2).
Example tool — Terraform: An open-source IaC tool by HashiCorp that uses declarative HCL (HashiCorp Configuration Language) to provision and manage resources across AWS, Azure, GCP, and 1,000+ providers from a single configuration file.
B
Differentiate between serverless computing and containerization in cloud analytics applications, providing one advantage and one limitation of each.
5 marks
Model Answer
Serverless (e.g., AWS Lambda): Executes functions in response to events without managing servers. The cloud provider handles scaling, patching, and availability automatically.
Advantage: Zero infrastructure management; pay only per invocation — cost-effective for intermittent, event-driven analytics workloads.
Limitation: Cold-start latency and execution time limits (e.g., 15-minute max on AWS Lambda) make it unsuitable for long-running batch analytics jobs.
Containerisation (e.g., Docker + Kubernetes): Packages an application and its dependencies into portable, isolated containers that run consistently across any environment.
Advantage: Full control over the runtime environment; supports long-running, stateful, and GPU-intensive analytics workloads.
Limitation: Requires significant expertise to manage orchestration, networking, storage, and security — higher operational overhead than serverless.
C
Identify and describe two common security threats to cloud-based analytics applications, and discuss strategies to mitigate each threat.
4 marks
Model Answer
Threat 1 — Data Breaches / Unauthorised Access: Attackers gain access to sensitive data in cloud storage or databases via misconfigured permissions or stolen credentials.
Mitigation: Implement least-privilege IAM policies, enable multi-factor authentication (MFA), encrypt data at rest (AES-256) and in transit (TLS 1.3), and use cloud-native tools such as AWS Macie to automatically detect exposed sensitive data.
Threat 2 — SQL / API Injection Attacks: Malicious inputs inserted into queries or API calls manipulate the database to return unauthorised data or execute destructive commands.
Mitigation: Use parameterised queries and prepared statements at all database access points, validate and sanitise all user inputs server-side, deploy a Web Application Firewall (WAF), and conduct regular penetration testing.
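The parameterised-query defence can be demonstrated with Python's built-in sqlite3 driver (the table and the injection payload are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "alice' OR '1'='1"  # classic injection payload

# Parameterised query: the driver binds user_input as a literal value,
# never as executable SQL, so the payload matches no rows.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()

# A legitimate value still works through the same code path.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", ("alice",)
).fetchall()
```

Had the query been built by string concatenation, the payload would have turned the WHERE clause into a tautology and returned every row.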
Q4
Question Four
Optional
15 Marks
A
Describe the importance of load testing in analytics applications, and mention two tools commonly used for this purpose.
4 marks
Model Answer
Importance: Load testing simulates expected and peak concurrent user loads on an analytics application to assess performance, identify bottlenecks (slow queries, memory limits, network saturation), and validate that the system meets SLA requirements before production release. It prevents unexpected failures during high-demand periods such as end-of-quarter reporting or product launch analytics and validates auto-scaling configurations.
Tool 1 — Apache JMeter: Open-source tool for simulating HTTP/API, database, and messaging load. Supports complex test scenarios with configurable ramp-up periods and detailed performance reports.
Tool 2 — Locust: Python-based load testing framework where test scenarios are written as code, making it easy to model realistic user behaviour and integrate with CI/CD pipelines for automated performance regression testing.
B·I
Explain how the following cloud-based analytics service supports big data applications, discussing its advantages and limitations: Microsoft Azure HDInsight
3 marks
Model Answer
Azure HDInsight is a fully managed, cloud-based open-source analytics service that supports Apache Hadoop, Spark, Kafka, HBase, and Hive on Azure infrastructure.
Advantages: Seamless integration with the Azure ecosystem (Azure Data Lake Storage, Power BI, Azure Active Directory); enterprise-grade security including Kerberos authentication and role-based access control; auto-scaling clusters reduce cost during low-demand periods.
Limitations: Cluster startup and scaling times can be several minutes, adding latency for ad-hoc analytical workloads; pricing escalates with large persistent clusters; less flexible than running Spark natively on Azure Kubernetes Service.
B·II
Explain how the following cloud-based analytics service supports big data applications, discussing its advantages and limitations: Google Cloud Bigtable
3 marks
Model Answer
Google Cloud Bigtable is a fully managed, high-performance, petabyte-scale NoSQL wide-column database designed for large-scale analytical and operational workloads — the same infrastructure that powers Google Search, Maps, and Gmail.
Advantages: Consistent sub-10ms read/write latency at petabyte scale; compatible with the Apache HBase API enabling portability; automatic replication across regions for high availability and disaster recovery.
Limitations: Expensive at minimum node cost, making it unsuitable for small datasets; no SQL support — requires the HBase API or custom client libraries; performance is highly sensitive to row key design, requiring specialised schema expertise.
B·III
Explain how the following cloud-based analytics service supports big data applications, discussing its advantages and limitations: IBM Cloud Data Engine
3 marks
Model Answer
IBM Cloud Data Engine (formerly IBM SQL Query) is a serverless, fully managed query service that enables SQL-based analysis of data stored in IBM Cloud Object Storage without provisioning or managing any infrastructure.
Advantages: Serverless model — pay only per terabyte of data scanned per query; no cluster setup or management required; supports standard ANSI SQL on Parquet, CSV, and ORC file formats directly in object storage.
Limitations: Query performance varies with data file organisation — unpartitioned large files degrade performance significantly; less mature ecosystem compared to BigQuery or Amazon Athena; limited native support for real-time or streaming analytics.
C
Outline the role of data visualization in analytics applications, including an example of a tool used for real-time dashboards, and explain its impact on decision-making.
5 marks
Model Answer
Role of Data Visualisation: Data visualisation translates complex, high-volume datasets into intuitive graphical representations — charts, heat maps, geographic maps, KPI dashboards — enabling stakeholders at all levels to identify trends, outliers, and patterns without requiring statistical or programming expertise.
Example tool — Grafana: An open-source real-time dashboard platform that connects to data sources such as Prometheus, InfluxDB, Elasticsearch, and SQL databases to display live metrics with configurable auto-refresh intervals, alerting, and drill-down capabilities.
Impact on decision-making — Speed: Executives can monitor KPIs in real time and respond immediately to anomalies (e.g., a sudden drop in system throughput or sales conversion) rather than waiting for next-day reports.
Impact on decision-making — Democratisation: Non-technical stakeholders can self-serve insights from dashboards without requesting custom queries from the data team, accelerating the pace of data-driven decision cycles across the organisation.
Impact on decision-making — Reduced cognitive load: Visual representations of complex relationships (trend lines, correlation heat maps) allow faster pattern comprehension than raw tabular data, reducing the risk of misinterpretation under time pressure.
Note: The April 2025 part-time paper contains questions identical in wording, structure, and mark allocation to the December 2024 ordinary paper (Questions 1–4, all sub-parts). This is common when the same examination is administered to full-time and part-time cohorts in the same academic year. All December 2024 model answers apply in full.
Q1
Question One
Compulsory
20 Marks
a(i)
Define the following term in the context of analytics application engineering and provide one practical example: Data Orchestration
1 mark
Model Answer
Definition: Data Orchestration is the automated coordination, scheduling, and management of complex data workflows across multiple systems and tools, ensuring that data moves reliably from source to destination in the correct sequence, with dependency management and error handling.
Example: Apache Airflow orchestrates a nightly analytics pipeline — it first extracts sales data from Salesforce, then triggers an Apache Spark transformation job once extraction succeeds, and finally loads the results into BigQuery. Each step executes only when all dependencies have completed successfully.
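Dependency-aware scheduling, the core behaviour an orchestrator like Airflow provides, can be sketched as a toy task runner (task names and functions are hypothetical):

```python
# Toy orchestrator: each task runs only after all of its dependencies have
# completed, mimicking (very loosely) how Airflow executes a DAG.
def run_dag(tasks, deps):
    done, order = set(), []
    while len(done) < len(tasks):
        progressed = False
        for name, fn in tasks.items():
            if name not in done and deps.get(name, set()) <= done:
                fn()
                done.add(name)
                order.append(name)
                progressed = True
        if not progressed:
            raise RuntimeError("cyclic or unsatisfiable dependencies")
    return order

log = []
tasks = {
    "load": lambda: log.append("loaded"),
    "transform": lambda: log.append("transformed"),
    "extract": lambda: log.append("extracted"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order = run_dag(tasks, deps)
```

Even though "load" is declared first, the dependency sets force the extract → transform → load order; real orchestrators add retries, backfills, and failure handling on top of this idea.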
a(ii)
Define the following term in the context of analytics application engineering and provide one practical example: Data Streaming
1 mark
Model Answer
Definition: Data Streaming is the continuous, real-time processing of data in motion — events are ingested and processed as they arrive, rather than being accumulated into discrete batches before processing.
Example: A ride-hailing application uses Apache Kafka to stream GPS coordinates from thousands of active drivers every second. The backend processes each location event in real time, updating driver positions on passengers' screens with sub-second latency.
a(iii)
Define the following term in the context of analytics application engineering and provide one practical example: Data Versioning
1 mark
Model Answer
Definition: Data Versioning is the practice of tracking and managing changes to datasets and ML model artefacts over time, enabling reproducibility of experiments, rollback to previous states, lineage tracking, and auditability.
Example: A data science team uses DVC (Data Version Control) alongside Git to version their model training datasets. When a deployed model unexpectedly degrades in production, the team rolls back to the exact dataset version used to produce the last known-good model and re-trains.
a(iv)
Define the following term in the context of analytics application engineering and provide one practical example: Model Drift
1 mark
Model Answer
Definition: Model Drift is the gradual degradation of a deployed machine learning model's predictive accuracy over time, caused by changes in the statistical properties of input data (data drift) or changes in the underlying relationship between inputs and the target variable (concept drift).
Example: A credit scoring model trained on pre-pandemic consumer spending patterns performs poorly after the pandemic, because the relationship between income, spending behaviour, and creditworthiness has fundamentally changed, making historical training patterns an unreliable basis for current predictions.
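A crude data-drift check of the kind that would flag this situation compares a live feature window against the training-time baseline (the values and the 3-sigma cutoff are illustrative assumptions, not a standard metric):

```python
from statistics import mean, stdev

# Standardised shift of a feature's mean between training-time (reference)
# and live data: a deliberately simple data-drift signal.
def drift_score(reference, live):
    return abs(mean(live) - mean(reference)) / stdev(reference)

reference = [100, 102, 98, 101, 99, 100, 103, 97]  # training-time spending
stable = [101, 99, 100, 102, 98, 100]              # similar distribution
shifted = [140, 138, 145, 150, 142, 139]           # post-shift behaviour
```

Production systems use richer tests (e.g., population stability index or Kolmogorov–Smirnov), but the principle is the same: compare live inputs against what the model was trained on and retrain when they diverge.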
a(v)
Define the following term in the context of analytics application engineering and provide one practical example: Edge Analytics
1 mark
Model Answer
Definition: Edge Analytics involves processing and analysing data at or near the source — on IoT devices, sensors, or local gateways at the "edge" of the network — rather than transmitting all raw data to a centralised cloud data centre for processing. This minimises latency, bandwidth consumption, and data privacy exposure.
Example: A smart factory deploys edge computing devices directly on the production line that run anomaly detection models locally. Equipment defects are identified within milliseconds without sending raw video feeds to the cloud, enabling real-time shutdown of faulty machinery before damage occurs.
b
Explain two benefits and one drawback of using cloud-native tools for analytics applications.
5 marks
Model Answer
Benefit 1 — Elastic Scalability: Cloud-native tools such as Google BigQuery and AWS Glue automatically scale compute resources up or down in response to workload demand. A quarterly financial close that requires 10× normal processing power can be handled instantly, without pre-purchasing dedicated hardware, and resources are released when the workload completes.
Benefit 2 — Reduced Operational Overhead via Managed Services: Cloud-native tools handle infrastructure patching, backups, high availability configurations, and security updates automatically. Data engineering teams can focus entirely on building analytics pipelines and business logic rather than maintaining servers.
Drawback — Vendor Lock-in: Deep integration with proprietary cloud services — AWS-specific data formats, Azure-specific SDKs, Google-specific APIs — makes migrating to another cloud provider or on-premises environment costly, time-consuming, and technically complex. This reduces negotiating leverage and creates single-provider dependency risk that can affect pricing, availability, and long-term strategy.
c
Discuss the main differences between batch processing and stream processing in analytics workflows. Provide one scenario where each is preferred.
5 marks
Model Answer
Batch Processing: Collects data over time and processes it in discrete scheduled chunks (hourly, nightly, weekly). High throughput; higher latency. Simpler to implement and debug. Lower cost for large historical datasets. Tools: Apache Spark, Hadoop MapReduce.
Stream Processing: Processes each data event as it arrives, continuously, with sub-second to second latency. More complex state management. Higher infrastructure cost. Tools: Apache Flink, Kafka Streams, Spark Structured Streaming.
Batch preferred scenario: Generating monthly payroll reports — all salary transactions for the month are collected and processed in a scheduled overnight run. Latency of several hours is acceptable; throughput, accuracy, and cost-efficiency are the priorities.
Stream processing preferred scenario: Detecting fraudulent credit card transactions — each transaction must be evaluated within milliseconds of submission. A latency of even one minute would allow fraudulent transactions to be completed and funds to be withdrawn before blocking can occur.
d
Describe the concept of CI/CD pipelines in the deployment of analytics applications and mention two tools commonly used.
5 marks
Model Answer
CI/CD (Continuous Integration / Continuous Delivery or Deployment) is a set of engineering practices and automated pipelines that streamline testing, building, validating, and releasing analytics application code from development environments through to production, with minimal manual intervention.
Continuous Integration (CI): Every code commit automatically triggers a pipeline that runs unit tests, data quality validation checks, and ML model performance tests. This ensures that new changes do not break existing pipeline logic before they are merged into the main codebase.
Continuous Delivery/Deployment (CD): Once validated, code is automatically packaged and deployed to staging or production. For analytics applications this includes deploying updated ETL pipeline DAGs, new ML model versions to serving endpoints, or revised dashboard configurations.
Value in analytics: Enables frequent, low-risk updates to data pipelines and models; catches upstream data schema changes early; enforces data governance checkpoints before production; reduces deployment errors from manual processes.
Tool 1 — GitHub Actions: Integrates directly with GitHub repositories; triggers automated test and deployment workflows on pull requests or merges using YAML configuration files.
Tool 2 — Jenkins: Open-source automation server with an extensive plugin ecosystem widely used for building, testing, and deploying data pipelines across heterogeneous analytics technology stacks.
Q2
Question Two
Optional
15 Marks
a
Compare and contrast structured and unstructured data in the context of analytics applications. Provide one example of each.
5 marks
Model Answer
Structured Data: Organised into a predefined schema with rows and columns. Easily queryable with SQL. Stored in relational databases or data warehouses. High data quality and consistency. Represents approximately 20% of enterprise data.
Example: A bank's transaction table with columns for transaction_id, amount, timestamp, and account_number.
Unstructured Data: No predefined format or schema. Requires specialised processing (NLP, computer vision, audio analysis) before analytical use. Stored in data lakes or object storage. Represents approximately 80% of enterprise data.
Example: Customer support call recordings, social media posts, or MRI/X-ray medical scans.
b
Describe two common methods for real-time anomaly detection and a challenge associated with each.
5 marks
Model Answer
Method 1 — Statistical Control Charts (e.g., Z-score, CUSUM): Continuously compares incoming data points against a computed statistical baseline (mean ± N standard deviations). Points exceeding the threshold trigger alerts.
Challenge: Assumes data follows a known, stationary distribution (typically normal). Performs poorly on non-stationary time series where the baseline shifts seasonally or due to business cycles, generating excessive false positives that erode operational trust in the system.
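The control-chart logic can be sketched as a rolling z-score monitor (window size, threshold, and data are illustrative choices):

```python
from statistics import mean, stdev

# Rolling z-score control chart: flag any point more than `threshold`
# standard deviations from the mean of the previous `window` points.
# Simplification: flagged points still enter the baseline afterwards.
def zscore_alerts(stream, window=20, threshold=3.0):
    baseline, alerts = [], []
    for i, x in enumerate(stream):
        if len(baseline) == window:
            m, s = mean(baseline), stdev(baseline)
            if s > 0 and abs(x - m) / s > threshold:
                alerts.append(i)
        baseline.append(x)
        if len(baseline) > window:
            baseline.pop(0)
    return alerts

data = [10, 11, 9] * 10
data[25] = 100  # inject an anomaly
```

The stationarity problem noted above is visible even here: once the spike enters the rolling window it inflates the standard deviation, temporarily masking further anomalies.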
Method 2 — Isolation Forest: A tree-based unsupervised ML algorithm that isolates anomalies by recursively, randomly partitioning the feature space. Anomalous points require fewer partitions to isolate than normal points and receive lower anomaly scores.
Challenge: Computationally expensive to retrain at high frequency. If the underlying data distribution shifts significantly (model drift), the isolation boundaries become stale and detection accuracy degrades, requiring periodic retraining that introduces latency into the anomaly detection refresh cycle.
c
Explain the role of DataOps in modern analytics application development.
5 marks
Model Answer
DataOps is a collaborative data management methodology that applies DevOps principles — automation, continuous integration, monitoring, and agile iteration — to data engineering and analytics workflows, with the goal of delivering reliable, high-quality data products faster.
1. Automation: Automates data pipeline testing, quality validation, and deployment processes, reducing manual errors and accelerating the time from data change to available insight.
2. Collaboration: Breaks organisational silos between data engineers, data scientists, analysts, and business stakeholders through shared tooling, documentation standards, and data product ownership models.
3. Observability: Implements data monitoring — freshness SLAs, schema change alerts, row count anomaly detection — so data issues are detected and resolved before they propagate to dashboards and decision-makers.
4. Governance & Lineage: Tracks data lineage from ingestion source through all transformations to the final analytical output, supporting regulatory compliance (GDPR, HIPAA) and enabling root-cause analysis when data quality issues arise.
5. Continuous Improvement: Short iterative development cycles enable analytics teams to release pipeline and model improvements frequently with confidence, replacing infrequent, high-risk "big bang" releases.
Q3
Question Three
Optional
15 Marks
a
Differentiate between horizontal and vertical scaling in cloud infrastructure for analytics workloads, and give an example where each is appropriate.
5 marks
Model Answer
Vertical Scaling (Scale Up)
Increasing the resources (CPU, RAM, storage) of a single existing server. Simple to implement; no application architecture changes required. Limited by the maximum hardware specifications of a single machine. Creates a single point of failure.
Horizontal Scaling (Scale Out)
Adding more servers or nodes to distribute the workload. Theoretically unlimited scale. Requires distributed application architecture. More resilient — individual node failures do not bring down the entire system.
Vertical scaling example: A relational analytics database (e.g., PostgreSQL) running complex multi-table joins benefits from upgrading to a larger instance type with more RAM, reducing disk I/O through in-memory caching. Distributing a relational database horizontally is far more complex than a simple instance upgrade.
Horizontal scaling example: A Spark cluster processing petabytes of log data — adding worker nodes allows the computation to be parallelised across hundreds of machines, reducing processing time from days to hours. No single machine could hold or process the full dataset in memory.
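The scale-out pattern, partition the data, process partitions in parallel, combine the results, can be sketched in miniature (a toy stand-in for what a Spark cluster does across machines, here using threads on one machine):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Stand-in for per-node work, e.g. one Spark task over one data split.
    return sum(x * x for x in partition)

def scale_out_sum_of_squares(data, workers=4):
    """Partition the input, fan work out to workers, combine the partial
    results -- the same map/reduce shape a horizontally scaled cluster uses."""
    size = max(1, len(data) // workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_partition, partitions))

total = scale_out_sum_of_squares(list(range(1000)))
```

The key property is that adding workers shrinks each partition rather than requiring a bigger machine, which is exactly why horizontal scaling suits workloads too large for any single node.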
b
Explain two best practices for securing data pipelines in analytics applications.
5 marks
▾
Model Answer
Best Practice 1 — End-to-End Encryption: Encrypt all data both in transit (using TLS 1.3 for all API calls, database connections, and pipeline message queues) and at rest (AES-256 encryption for storage systems). Use cloud-managed key management services (AWS KMS, Azure Key Vault, Google Cloud KMS) to rotate encryption keys automatically on a defined schedule. This ensures that intercepted data or compromised storage cannot be read without the cryptographic keys.
Best Practice 2 — Principle of Least Privilege via IAM: Grant each pipeline component only the specific permissions it requires and no more. A data extraction job receives read-only access to source systems; a loading job receives write access only to the designated destination; no component holds admin credentials. Use dedicated service accounts with narrowly scoped permissions, rotate credentials programmatically, and audit all access logs regularly using tools like AWS CloudTrail or Google Cloud Audit Logs to detect privilege misuse or anomalous access patterns.
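Best Practice 2 can be made concrete with a minimal AWS IAM policy sketch granting an extraction job read-only access to a single bucket (the bucket name and statement ID are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExtractJobReadOnly",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::sales-raw-data",
        "arn:aws:s3:::sales-raw-data/*"
      ]
    }
  ]
}
```

Note what the policy omits: no write, delete, or admin actions, and no access to any other bucket. If the job's credentials leak, the blast radius is limited to reading this one dataset.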
c
Discuss the impact of data latency on business decision-making and how to mitigate it.
5 marks
▾
Model Answer
Data latency is the delay between an event occurring in a business system and that event's data being available for analysis and decision-making. High latency forces decisions to be based on stale information.
Business impacts: A retailer acting on yesterday's inventory data may over-order stock that sold out overnight; a financial trader using delayed prices may execute at incorrect valuations; an operations manager may miss an emerging equipment failure until it causes a costly production shutdown. In competitive markets, organisations with lower data latency respond to customer behaviour and market shifts faster, creating a direct competitive advantage.
Mitigation 1 — Stream processing architecture: Replace batch ETL pipelines with event-driven architectures using Apache Kafka and Apache Flink or Spark Structured Streaming, reducing pipeline latency from hours to seconds.
Mitigation 2 — In-memory caching: Use Redis or Apache Ignite to cache frequently queried aggregations, eliminating repeated full-table scans and reducing query response times from seconds to milliseconds.
Mitigation 3 — Edge analytics: Pre-process data locally at IoT sensors or branch office gateways to eliminate the round-trip latency to a centralised cloud data centre, enabling real-time local decision-making.
Mitigation 4 — Latency SLA monitoring: Set automated alerts (e.g., PagerDuty, Datadog) that notify the data engineering team immediately when pipeline freshness exceeds acceptable thresholds, ensuring issues are resolved before they affect business operations.
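The caching idea in Mitigation 2 can be sketched as a tiny in-memory cache with per-entry expiry, a toy stand-in for the role Redis plays (the TTL and key names are illustrative):

```python
import time

class TTLCache:
    """Tiny in-memory cache with per-entry expiry: serve hot aggregations
    from memory instead of rerunning the expensive query each time."""
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get_or_compute(self, key, compute):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]                   # cache hit: no recompute
        value = compute()                     # cache miss: run the query
        self._store[key] = (value, now + self.ttl)
        return value

calls = []
def expensive_aggregation():
    calls.append(1)            # pretend this is a full-table scan
    return 42

cache = TTLCache(ttl_seconds=60)
a = cache.get_or_compute("daily_revenue", expensive_aggregation)
b = cache.get_or_compute("daily_revenue", expensive_aggregation)
```

The second lookup returns within the TTL without touching the underlying store, which is how caching turns seconds of query latency into milliseconds for repeated dashboard reads. The trade-off is staleness: the TTL bounds how out-of-date a served value can be.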
Q4
Question Four
Optional
15 Marks
a
Identify two common cost management challenges in running cloud analytics applications and suggest one solution for each.
6 marks
▾
Model Answer
Challenge 1 — Uncontrolled Query Costs in Cloud Data Warehouses: In pay-per-query cloud warehouses (e.g., BigQuery charges per terabyte scanned), poorly optimised queries or accidental full-table scans on large datasets can generate massive, unexpected bills that exceed monthly budgets.
Solution: Implement query cost governance — set per-user or per-project spending caps, enforce columnar storage formats (Parquet, ORC) with appropriate table partitioning and clustering to minimise bytes scanned, and introduce a query review workflow before running large ad-hoc analyses on production datasets.
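A spending cap of this kind amounts to simple arithmetic on bytes scanned. A sketch of a pre-execution cost gate (the per-TiB price is illustrative of BigQuery-style on-demand pricing and should be checked against current rates):

```python
def estimate_query_cost_usd(bytes_scanned, price_per_tib_usd=6.25):
    """Estimate on-demand scan cost. The price here is an illustrative
    per-TiB figure, not an authoritative quote."""
    return (bytes_scanned / 2**40) * price_per_tib_usd

def enforce_cost_cap(bytes_scanned, cap_usd):
    """Reject a query before execution if its estimated cost exceeds the cap."""
    cost = estimate_query_cost_usd(bytes_scanned)
    if cost > cap_usd:
        raise RuntimeError(
            f"query would cost ${cost:.2f}, exceeding the ${cap_usd:.2f} cap")
    return cost

# A full scan of a 5 TiB table vs. a partition-pruned 50 GiB scan:
full_scan = estimate_query_cost_usd(5 * 2**40)
pruned = estimate_query_cost_usd(50 * 2**30)
```

The two estimates show why partitioning matters: pruning the scan from 5 TiB to 50 GiB cuts the cost of a single query by roughly two orders of magnitude, and the cap converts an estimate into an enforceable guardrail.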
Challenge 2 — Idle or Over-Provisioned Compute Resources: Analytics clusters (AWS EMR, Azure HDInsight, Google Dataproc) left running overnight or over weekends with no active jobs waste significant budget — idle compute can represent 60–70% of monthly cloud spend in organisations without resource governance.
Solution: Implement auto-termination policies on all development and test clusters, use Infrastructure as Code schedules that automatically shut down clusters after a configurable inactivity period, and use spot or preemptible instances for non-critical batch jobs to reduce compute costs by up to 90% compared to on-demand pricing.
b
Describe how container orchestration platforms (like Kubernetes) benefit large-scale analytics deployments.
4 marks
▾
Model Answer
Kubernetes (K8s) automates the deployment, scaling, networking, and lifecycle management of containerised applications across clusters of machines. For large-scale analytics:
1. Resource efficiency: Kubernetes bin-packs containers onto nodes optimally, maximising cluster hardware utilisation and reducing per-unit compute cost across large analytics workloads.
2. Auto-scaling: The Horizontal Pod Autoscaler scales analytics workloads (Spark executors, model serving pods, API endpoints) automatically based on real-time CPU, memory, or custom metrics such as queue depth, without manual intervention.
3. Self-healing: Kubernetes automatically restarts failed containers, reschedules jobs from failed nodes, and replaces unhealthy instances — ensuring long-running analytics jobs complete without requiring manual monitoring and intervention.
4. Environment consistency: Container images encapsulate all dependencies (Python version, library versions, configurations), eliminating environment mismatches between development, staging, and production analytics environments.
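The auto-scaling behaviour in point 2 is declared in a manifest. A minimal HorizontalPodAutoscaler sketch (the Deployment name and replica bounds are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving        # hypothetical analytics serving Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Kubernetes then adds or removes serving pods to hold average CPU near 70%, within the declared bounds, with no operator intervention.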
c
Explain what a data catalog is and how it supports data governance in analytics projects.
5 marks
▾
Model Answer
A data catalog is a centralised, searchable inventory of all data assets within an organisation — including datasets, database tables, columns, dashboards, reports, and ML features — enriched with business and technical metadata such as data type, owner, source system, transformation lineage, access policy, and quality metrics.
1. Data Discovery: Analysts can search the catalog to find the right dataset without emailing data owners or duplicating existing datasets, reducing time-to-analysis and preventing redundant data collection. Tools: Apache Atlas, Alation, Google Dataplex.
2. Data Lineage: Tracks the full journey of data from ingestion source through all transformation steps to the final dashboard or model. This enables impact analysis when upstream changes occur and provides the audit trails required for regulatory compliance (GDPR, HIPAA, SOX).
3. Ownership & Accountability: Every dataset has an assigned business owner responsible for quality and access decisions, enforcing clear accountability and reducing the risk of undocumented, untrustworthy data being used in analytics.
4. Access Control: Integrates with IAM and data governance platforms to enforce which teams and individuals can discover, view, or use sensitive datasets, supporting data privacy policy compliance.
5. Data Quality Visibility: Surfaces quality scores, freshness timestamps, and validation pass rates alongside dataset listings, helping analysts choose trustworthy, current data sources for their analytics work.
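The discovery and lineage roles above can be sketched as a toy catalog: datasets with owners plus directed lineage edges, enough to answer the impact-analysis question "what breaks downstream if this dataset changes?" (all dataset and team names here are hypothetical):

```python
class MiniCatalog:
    """Toy data catalog: dataset ownership plus directed lineage edges,
    supporting transitive downstream impact analysis."""
    def __init__(self):
        self.owners = {}
        self.downstream = {}   # source -> set of direct consumers

    def register(self, dataset, owner):
        self.owners[dataset] = owner
        self.downstream.setdefault(dataset, set())

    def add_lineage(self, source, target):
        self.downstream.setdefault(source, set()).add(target)

    def impact_of(self, dataset):
        """All transitively downstream assets of `dataset`."""
        seen, stack = set(), [dataset]
        while stack:
            for child in self.downstream.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

cat = MiniCatalog()
cat.register("raw_orders", owner="sales-eng")
cat.register("orders_clean", owner="data-eng")
cat.register("revenue_dashboard", owner="bi-team")
cat.add_lineage("raw_orders", "orders_clean")
cat.add_lineage("orders_clean", "revenue_dashboard")
impacted = cat.impact_of("raw_orders")
```

A production catalog (Apache Atlas, Alation, Google Dataplex) layers search, access policies, and quality metadata on top of exactly this kind of asset-and-lineage graph.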
Topics ordered by frequency of appearance across all three exam papers. Prioritise high-frequency topics for revision.
High Priority · Appears in 2–3 Papers
● High Frequency
Data Pipeline, Data Lake, Data Quality Assurance — definitions & examples
Dec 2024 · Apr 2025
● High Frequency
ETL vs ELT — performance & data management comparison
Dec 2024 · Apr 2025
● High Frequency
End-to-end analytics application phases — tools at each phase
Dec 2024 · Apr 2025
● High Frequency
ML techniques — Association Rules, Neural Networks, Anomaly Detection, Gradient Boosting, Transfer Learning
Dec 2024 · Apr 2025
● High Frequency
Infrastructure as Code (IaC) — benefits & tools (Terraform)
Dec 2024 · Apr 2025
● High Frequency
Serverless computing vs Containerisation — advantages & limitations
Dec 2024 · Apr 2025
● High Frequency
Security threats to cloud analytics — data breaches, injection, mitigations
Dec 2024 · Apr 2025
● High Frequency
Cloud analytics services — Azure HDInsight, Google Bigtable, IBM Cloud Data Engine
Dec 2024 · Apr 2025
● High Frequency
Load testing — importance & tools (JMeter, Locust)
Dec 2024 · Apr 2025
● High Frequency
Data visualisation — role, tools, decision-making impact
Dec 2024 · Apr 2025
August 2025 — New Topics
◆ Aug 2025
Data Orchestration, Data Streaming, Data Versioning, Model Drift, Edge Analytics
Aug 2025
◆ Aug 2025
Batch processing vs Stream processing — differences & scenarios
Aug 2025
◆ Aug 2025
CI/CD pipelines in analytics deployment — GitHub Actions, Jenkins
Aug 2025
◆ Aug 2025
Structured vs Unstructured data — comparison & examples
Aug 2025
◆ Aug 2025
Real-time anomaly detection methods — Statistical & Isolation Forest
Aug 2025
◆ Aug 2025
DataOps — role in modern analytics development
Aug 2025
◆ Aug 2025
Horizontal vs Vertical cloud scaling — analytics workloads
Aug 2025
◆ Aug 2025
Securing data pipelines — encryption & least privilege IAM
Aug 2025
◆ Aug 2025
Data latency — business impact & mitigation
Aug 2025
◆ Aug 2025
Cloud cost management — query costs, idle resources
Aug 2025
◆ Aug 2025
Kubernetes / container orchestration for large-scale analytics
Aug 2025
◆ Aug 2025
Data catalog — definition & data governance support
Aug 2025
Revision tip: The August 2025 paper introduced a new category of operational and engineering topics — CI/CD, DataOps, model drift, data latency, Kubernetes, and data catalogs — that were absent from earlier papers. These topics reflect current industry practice and are likely to be tested in future sittings alongside the core definitional questions in Question 1.