Data Science Project Management
A project feasibility analysis (PFA) is a structured pre-project evaluation that examines whether a proposed project can be successfully planned and executed. It is conducted before the project is approved and resources are committed.
PFA typically assesses the following dimensions:
- Technical feasibility: Can the project be built with existing technology, tools, and expertise? For DS projects this includes checking data availability and infrastructure.
- Economic feasibility: Do the projected benefits (ROI, NPV, payback period) outweigh the costs?
- Operational feasibility: Will the end-users and the organisation actually adopt and use the output?
- Schedule feasibility: Can the project be completed within acceptable time constraints?
- Legal/Ethical feasibility: Does the project comply with data-protection laws and ethical guidelines?
The result of PFA is a feasibility report that informs the go/no-go decision for project initiation.
Technical feasibility evaluates whether the organisation possesses (or can acquire) the hardware, software, data infrastructure, and skilled personnel needed to execute the project. For a DS project this means assessing data pipelines, compute resources, and model-deployment capabilities.
Economic feasibility examines cost-benefit trade-offs. Metrics such as Net Present Value (NPV), Return on Investment (ROI), and payback period are used to determine whether the project is economically justified and fundable.
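The three economic-feasibility metrics are straightforward to compute. A minimal standard-library sketch, using hypothetical cash flows (the figures are illustrative, not from any real project):

```python
def npv(rate, cash_flows):
    """Net Present Value: cash_flows[0] is the initial outlay (negative)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def roi(total_benefit, total_cost):
    """Return on Investment expressed as a fraction of cost."""
    return (total_benefit - total_cost) / total_cost

def payback_period(cash_flows):
    """Years until cumulative cash flow turns non-negative (None if never)."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return t
    return None

# Hypothetical project: 100,000 outlay, 40,000 annual benefit for 4 years.
flows = [-100_000, 40_000, 40_000, 40_000, 40_000]
print(round(npv(0.10, flows), 2))   # positive NPV at a 10% discount rate
print(roi(160_000, 100_000))        # 0.6
print(payback_period(flows))        # 3 (cumulative turns non-negative in year 3)
```

A project is economically justified when NPV is positive at the organisation's discount rate and the payback period falls within its tolerance.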
Operational feasibility assesses whether the organisation has the processes, change-management capacity, and stakeholder support for the project outputs to be used effectively. A technically sound DS model is worthless if end-users are not trained or willing to adopt it.
Without understanding the business, data scientists risk building technically excellent models that answer the wrong question. For example, a churn-prediction model built without understanding customer-retention strategy may optimise the wrong metric. Business context ensures the problem definition, target variable, and success criteria are meaningful.
Business context reveals which data sources are available, which are trusted, and which face legal or ethical restrictions (e.g., GDPR, Kenya Data Protection Act). It also uncovers domain-specific rules that influence feature engineering and model interpretation — saving re-work late in the project.
A project checklist is a structured, sequential list of tasks, milestones, deliverables, and verification items that must be completed or confirmed at defined stages of the project. It acts as a quality-control and progress-tracking tool used by the project manager and team.
Project: Student Academic Performance Prediction System for KCA University
Objective: Predict students at risk of failing so that early interventions can be applied.
| # | Checklist Item | Description / Activities |
|---|---|---|
| 1 | Problem Definition & Business Case | Define prediction target (pass/fail/grade), establish stakeholder sign-off, document expected ROI and success metrics (e.g., reduction in dropout rate). |
| 2 | Data Identification & Access | Identify data sources (SIS, LMS, attendance records), confirm data-sharing agreements, ensure compliance with student privacy laws. Verify data completeness and time coverage. |
| 3 | Data Preparation & Quality Assurance | Handle missing values, remove duplicates, encode categorical variables, normalise numerical features. Document all transformations. Validate that class balance is acceptable. |
| 4 | Model Development & Validation | Train multiple candidate models (Logistic Regression, Random Forest, XGBoost). Use stratified k-fold cross-validation. Compare using accuracy, F1-score, and AUC-ROC. |
| 5 | Ethical Review & Bias Audit | Check that the model does not discriminate by gender, ethnicity, or disability. Obtain institutional ethics clearance before deployment. |
| 6 | Deployment & Integration | Integrate prediction dashboard into university student support system. Train academic advisors. Set up monitoring for model drift and schedule quarterly re-training. |
| 7 | Documentation & Closure | Produce technical report, user manual, and lessons-learned document. Archive code in version-controlled repository. Obtain final sign-off from stakeholders. |
Project: Disease Prediction from Patient Data for a Kenyan Healthcare Provider
- Stakeholder Identification & Requirements Gathering — Clinicians, hospital admin, and IT staff have different needs. Early alignment prevents scope creep and ensures the model output is clinically usable.
- Data Governance & Ethics Clearance — Patient data is sensitive. Obtaining IRB/ethics clearance and confirming compliance with Kenya's Data Protection Act (2019) before accessing records is legally mandatory.
- Data Availability & Quality Assessment — Determine which electronic health record fields are consistently populated. Poor data quality is among the most commonly cited causes of DS project failure. Early assessment allows remediation time.
- Infrastructure & Resource Planning — Identify compute resources, storage, and software licenses needed. Healthcare facilities in Kenya often have limited IT infrastructure; cloud vs on-premise decisions must be made early.
- Risk Register Creation — Document risks (biased training data, model misuse by untrained staff, connectivity outages) and assign mitigation owners. Proactive risk planning reduces firefighting during execution.
- Project Schedule & Budget Baseline — Create a Gantt chart with milestones and a costed work breakdown structure. This enables earned-value monitoring and early detection of budget overruns.
Planning forces stakeholders to agree on what the project will and will not deliver. This reduces scope creep — one of the most common causes of cost and schedule overruns in DS projects.
Preparation reveals the human, technical, and financial resources required. Without this, teams may discover mid-project that critical skills (e.g., ML engineers, domain experts) or data infrastructure are unavailable.
A documented plan gives the project manager a baseline against which actual progress, cost, and quality can be compared. This enables early detection and correction of deviations before they become critical.
Realism: The model must reflect the true situation of the project and organisation. It should capture real constraints (budget, technology maturity, regulatory environment) and real objectives, rather than idealistic assumptions. A realistic model produces decisions that can actually be implemented.
Capability: The selection model must be capable of adequately distinguishing between good and bad projects — i.e., it must be sensitive enough to differentiate project proposals on the dimensions that matter (ROI, risk, strategic fit). A model that rates all projects similarly has no discriminatory power and is useless for decision-making.
Other recognised criteria include: flexibility (adaptable to change), ease of use (understood by decision-makers), and cost-effectiveness (the cost of the selection process should not exceed its benefit).
Numeric (quantitative) selection models use measurable financial or statistical criteria to rank and select projects. Common examples include Net Present Value (NPV), Internal Rate of Return (IRR), Payback Period, and scoring models (weighted criteria matrices). They are objective and comparable across projects but require reliable data estimates, which may be difficult to obtain early in a project.
Non-numeric (qualitative) selection models rely on human judgment, strategic priorities, or political considerations rather than numbers. Examples include the Sacred Cow (a project championed by senior leadership), Operating Necessity (required for the organisation to continue operating), and Comparative Benefit (projects compared by committee opinion). They capture intangible strategic value but can be subjective and inconsistent.
Benefits:
- Speed: Decisions can be made quickly without lengthy financial modelling, which is valuable when a project is time-critical.
- Captures intangibles: Factors like brand reputation, staff morale, and societal impact — which resist quantification — can still influence the decision.
Drawbacks:
- Bias and subjectivity: Decisions may reflect the preferences of influential leaders rather than the organisation's best interest (the "HiPPO effect" — Highest-Paid Person's Opinion).
- Lack of consistency: Without a structured framework, different decision-makers may evaluate the same project very differently, making it hard to compare alternatives fairly.
The following stages reflect an industry-standard DS lifecycle (aligned to CRISP-DM and agile DS practice):
- Business Understanding (Problem Definition): Engage stakeholders to clarify the business problem, define measurable success criteria, and document constraints (budget, timeline, data access). For KCA, this might mean confirming what "student at risk" means to academic advisors and what data the SIS holds.
- Data Understanding: Perform an initial exploration of available data — inventory sources, assess volume/velocity/variety, identify obvious quality issues, and document data lineage. Exploratory Data Analysis (EDA) begins here.
- Data Preparation: Clean and transform raw data into a model-ready dataset. This includes handling missing values, outlier treatment, encoding, normalisation, feature engineering, and train/test splitting. This stage typically consumes 60–80% of project time.
- Modelling: Select and train candidate algorithms (e.g., Logistic Regression, Decision Trees, Neural Networks). Tune hyperparameters. Track experiments and artefacts with tools such as MLflow and DVC, and keep code under version control.
- Evaluation: Assess models against business success criteria (not just technical metrics). Validate on holdout data. Conduct bias and fairness audits. Present results to stakeholders for approval.
- Deployment: Package the approved model as an API or dashboard and integrate into the operational system. Establish monitoring for data drift and model performance decay.
- Monitoring & Maintenance: Continuously track prediction accuracy in production. Schedule periodic re-training as new data accumulates. Document model version changes.
- Project Closure: Archive code, data, and documentation. Conduct a lessons-learned review. Obtain formal stakeholder sign-off and transition to operational support teams.
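Two of the data-preparation steps listed above — normalisation and a stratified train/test split — can be sketched with the standard library alone. This is an illustrative toy (the student records and the `at_risk` label are hypothetical); a real project would use pandas and scikit-learn equivalents:

```python
import random
from collections import defaultdict

def min_max_scale(values):
    """Scale a numeric column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def stratified_split(rows, label_key, test_frac=0.2, seed=42):
    """Split rows so each class keeps roughly the same proportion."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * test_frac)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test

# Hypothetical student records: GPA and an at-risk flag.
students = [{"gpa": g, "at_risk": g < 2.0} for g in
            [1.2, 1.8, 2.5, 3.0, 3.4, 1.5, 2.8, 3.9, 1.9, 2.2]]
train, test = stratified_split(students, "at_risk", test_frac=0.25)
print(len(train), len(test))  # 8 2
```

Stratifying matters for at-risk prediction because the failing class is usually the minority; a plain random split can leave it under-represented in the test set.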
CRISP-DM (Cross-Industry Standard Process for Data Mining) provides a cyclical 6-stage framework: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment. However, industry DS projects require extensions to address enterprise concerns. The extended approach adds:
Before CRISP-DM begins, the extended approach adds formal project charter creation, stakeholder mapping, ethics review, data-sharing agreements, and project selection sign-off. This grounds the DS work in project management discipline.
Beyond defining objectives, the extended approach requires mapping the business process the DS solution will support, identifying KPIs for success, and conducting a PFA. ROI calculations and strategic alignment are documented here.
Extends CRISP-DM by adding data-quality profiling tools, data lineage mapping, and a data-readiness assessment. Data infrastructure needs (warehouses, lakes, streaming platforms) are identified and provisioned.
Introduces automated data pipelines, feature stores, and reproducibility requirements (version-controlled datasets). This industrialises the preparation step for ongoing production use.
Adds MLOps practices: experiment tracking, model registry, and automated hyperparameter optimisation (AutoML). Multiple model architectures are evaluated systematically rather than ad hoc.
Beyond technical metrics, the extended approach requires a business-impact evaluation (does it meet the KPIs from Stage 1?), an ethical audit (fairness, explainability, bias), and a risk review before deployment approval.
Extends CRISP-DM deployment to include: CI/CD pipelines for model updates, A/B testing frameworks, rollback strategies, SLA agreements, and handover documentation for operations teams.
A new stage added by the extended approach. Covers drift detection, model retraining schedules, user feedback loops, and formal project termination with lessons-learned documentation. Critically, it treats the deployed model as a living system requiring ongoing stewardship.
- Data Dependency and Uncertainty: Software projects start with defined specifications; DS projects start with data of unknown quality, completeness, and relevance. Poor data quality cannot be compensated for by better algorithms, making DS projects inherently riskier from the outset.
- Iterative and Non-Linear Workflow: DS development requires frequent cycling back to earlier stages (e.g., discovering in the modelling phase that more feature engineering is needed). Traditional software SDLC is more sequential. This makes scheduling DS projects difficult and increases scope-creep risk.
- Experimentation and Reproducibility: DS involves testing many hypotheses (algorithms, features, hyperparameters). Without proper experiment tracking tools, results are hard to reproduce. Software projects produce deterministic code; DS produces probabilistic models whose outcomes can differ between runs.
- Model Decay / Concept Drift: A deployed DS model degrades over time as the real-world data distribution changes (e.g., consumer behaviour shifts after COVID-19). Software rarely "wears out" unless the business rules change. DS systems require ongoing monitoring and retraining post-deployment.
- Explainability and Ethics: Stakeholders increasingly demand interpretable models (especially in healthcare, finance, and law). A software feature is inherently explainable (it follows coded logic); a neural network's decision may not be. DS projects therefore carry ethical and regulatory risks not present in standard software development.
A data science project uses statistical methods, machine learning, and domain knowledge to extract insights from data and build predictive or decision-support systems. For example, M-Pesa (Safaricom's mobile money platform) uses data science to detect fraudulent transactions in real time by analysing patterns in millions of daily transactions.
Unlike a typical IT project (e.g., building an HR payroll system) where deliverables are defined upfront, a DS project does not guarantee a useful model. It is possible to complete all project activities and still find that the data cannot produce a model that meets business accuracy thresholds. In Kenya, where data collection infrastructure is still maturing, this uncertainty is amplified by patchy or inconsistent historical records.
In a conventional IT project, the product is software with defined functionality. In a DS project, data quality and availability are the primary constraint. In Kenya, challenges such as limited digitisation of records (e.g., paper-based health records in rural clinics) or multilingual unstructured data (Swahili, Sheng, English) create data preparation challenges not encountered in a typical IT project.
| Challenge | Description | Solution |
|---|---|---|
| Poor Data Quality | Real-world data is often incomplete, inconsistent, duplicated, or mislabelled. Models trained on poor data produce unreliable predictions regardless of algorithm sophistication. | Implement a formal data quality management framework — automated profiling tools (e.g., Great Expectations), data-cleaning pipelines, and a data stewardship programme with clear ownership of data assets. |
| Stakeholder Communication Gap | Business users and data scientists speak different languages. Decision-makers may not understand model outputs, leading to distrust or misuse of DS recommendations. | Use model explainability tools (SHAP, LIME) and invest in data visualisation dashboards. Embed a "data translator" role — someone who bridges technical and business teams and presents findings in business-metric terms. |
| Deployment and Integration Challenges | Many DS projects produce excellent prototypes that never reach production due to incompatibilities with existing IT infrastructure, security policies, or operational workflows. | Adopt MLOps practices — containerise models (Docker/Kubernetes), use CI/CD pipelines for automated testing and deployment, and involve IT operations and security teams from the planning stage rather than at the end. |
- Data Availability and Quality: Acquiring sufficient, labelled, high-quality data is the most persistent challenge. In many African contexts, data exists in siloed, undigitised, or inconsistently structured forms.
- Talent Shortage: Skilled data scientists, ML engineers, and data engineers are scarce and expensive. Teams may lack the breadth of skills (statistics, coding, domain expertise, communication) needed for a full DS project.
- Ethical and Bias Issues: Models can encode historical biases present in training data, producing discriminatory outcomes. Ensuring fairness requires deliberate bias auditing which adds time and cost.
- Scalability and Infrastructure: Prototype models that work on small samples often fail at production scale. Cloud costs, latency requirements, and data-security constraints create new challenges at scale.
- Organisational Change Management: Even technically successful DS systems fail if the organisation is not prepared to change its workflows and trust algorithmic recommendations. Resistance from staff who fear job displacement is a real barrier.
- Missing Data — Incomplete records reduce dataset size and can bias results. Solution: Use imputation techniques (mean/median/mode imputation, KNN imputation, or multiple imputation) or flag and explicitly model missingness as a feature.
- Inconsistent Data Formats — Dates, currencies, and categorical labels stored in multiple formats cause parsing errors. Solution: Standardise formats early using data pipeline schemas and enforce validation rules at the point of ingestion.
- Duplicate Records — Duplicates inflate sample size and bias model training toward repeated patterns. Solution: Apply deduplication algorithms and entity-resolution techniques; use unique identifiers where possible.
- Imbalanced Classes — In classification problems (e.g., fraud detection), the target class may represent <1% of records, causing models to learn to always predict the majority class. Solution: Apply oversampling (SMOTE), undersampling, or class-weight adjustments; use metrics like AUC-ROC instead of accuracy.
- High Dimensionality (Curse of Dimensionality) — Datasets with hundreds of features make model training slow and prone to overfitting. Solution: Apply feature selection (correlation analysis, mutual information, Lasso regularisation) or dimensionality reduction (PCA, t-SNE) to retain only informative features.
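The first two fixes above — mean imputation for missing values and deduplication — can be illustrated in a few lines of standard-library Python (the scores and records are hypothetical):

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def deduplicate(records, key_fields):
    """Keep the first record for each unique combination of key fields."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

scores = [70, None, 80, 90, None]
print(impute_mean(scores))  # gaps filled with the mean of 70, 80, 90

records = [{"id": 1, "name": "A"}, {"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
print(len(deduplicate(records, ["id"])))  # 2
```

In practice the choice of key fields for deduplication matters: deduplicating on a surrogate ID misses true duplicates entered under different IDs, which is where entity-resolution techniques come in.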
Numeric (quantitative output / regression) techniques:
- Linear Regression: Models the relationship between a continuous dependent variable and one or more independent variables by fitting a straight line (hyperplane) that minimises the sum of squared residuals. Produces a numeric output (e.g., predicted house price).
- Support Vector Regression (SVR): Extends SVM to regression tasks. It finds a function within an ε-margin of the true values while minimising model complexity. Effective for non-linear relationships using kernel functions.
- Neural Networks (Deep Learning Regression): Multi-layered networks of artificial neurons learn complex non-linear mappings from input features to a continuous output through backpropagation. Used for tasks like demand forecasting and price prediction.
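Of the numeric techniques above, simple linear regression has a closed-form least-squares solution, which a short standard-library sketch makes concrete (the floor-area/price data is hypothetical):

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # covariance term
    var = sum((x - mx) ** 2 for x in xs)                     # variance of x
    slope = cov / var
    return slope, my - slope * mx

# Hypothetical data: floor area (m^2) vs price (KSh millions).
areas = [50, 80, 120, 150]
prices = [3.0, 4.5, 6.5, 8.0]
slope, intercept = fit_line(areas, prices)
print(round(slope * 100 + intercept, 2))  # predicted price for a 100 m^2 house
```

With more than one feature the same idea generalises to fitting a hyperplane via the normal equations or gradient descent, which is what library implementations do.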
Non-numeric (classification / categorical output) techniques:
- Decision Trees: Recursively splits the feature space into subsets based on feature thresholds that best separate classes. Each leaf node represents a class label. Highly interpretable and used for customer churn or disease classification.
- Naïve Bayes: A probabilistic classifier based on Bayes' theorem with the assumption that features are conditionally independent given the class. Computes the posterior probability of each class and assigns the most probable. Widely used in text classification (spam detection) and sentiment analysis.
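The Naïve Bayes classifier above can be sketched end-to-end for the spam-detection use case. This is a toy with hypothetical training messages, using word counts with Laplace smoothing and log probabilities to avoid underflow:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label). Returns priors, per-class counts, vocab."""
    priors, word_counts, vocab = Counter(), {}, set()
    for words, label in docs:
        priors[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def predict_nb(words, priors, word_counts, vocab):
    """Assign the class with the highest log posterior probability."""
    total_docs = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # Laplace smoothing
        score = math.log(prior / total_docs)
        score += sum(math.log((counts[w] + 1) / denom) for w in words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [("win cash prize".split(), "spam"),
        ("cheap cash loans".split(), "spam"),
        ("meeting agenda today".split(), "ham"),
        ("project status meeting".split(), "ham")]
model = train_nb(docs)
print(predict_nb("cash prize".split(), *model))  # spam
```

The conditional-independence assumption is clearly false for natural language, yet the classifier often works well in practice — which is exactly why it remains a standard baseline for text classification.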
DS project success is evaluated at two levels — technical model performance and business impact:
Technical Metrics:
- Accuracy: % of correct predictions (not reliable for imbalanced datasets).
- Precision, Recall, F1-Score: More meaningful for imbalanced classification. F1 balances precision and recall.
- AUC-ROC: Area Under the ROC Curve — measures model discrimination power across thresholds. AUC = 0.5 is random; AUC = 1.0 is perfect.
- RMSE / MAE: Root Mean Square Error and Mean Absolute Error for regression tasks — measure average prediction error magnitude.
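The technical metrics above follow directly from their definitions; a standard-library sketch on small hypothetical prediction vectors:

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Classification metrics from true-positive/false-positive/false-negative counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.666..., 0.666..., 0.666...)
```

Note that this toy scores 4/6 = 0.667 accuracy too; on an imbalanced dataset accuracy and F1 would diverge sharply, which is why F1 is preferred there.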
Business / Project-Level KPIs:
- ROI (Return on Investment): Financial return generated by the DS solution relative to its cost.
- Time-to-Insight: How much faster decisions are made using the DS system vs the previous process.
- Adoption Rate: Percentage of intended users actively using the DS output.
- Project Completion Metrics: On-time delivery, on-budget performance, scope adherence.
Project risk management exists because all projects operate under uncertainty. No project — regardless of how well planned — unfolds exactly as anticipated. Risks, if unmanaged, can cause schedule delays, cost overruns, quality failures, or complete project failure.
Ultimately, risk management protects project investment, increases stakeholder confidence, and improves the probability of achieving project objectives within constraints.
- Plan Risk Management: Decide how risk management activities will be conducted and resourced throughout the project. Produce a Risk Management Plan that defines methodology, roles, risk categories, probability/impact scales, and reporting thresholds.
- Identify Risks: Systematically determine which risks could affect the project. Techniques include brainstorming, Delphi technique, SWOT analysis, checklists from past DS projects, and expert interviews. The output is a Risk Register.
- Perform Qualitative Risk Analysis: Prioritise identified risks by assessing their probability of occurrence and potential impact using a probability-impact matrix. This focuses attention on the most significant risks without requiring detailed quantitative data.
- Perform Quantitative Risk Analysis: Apply numerical methods (Monte Carlo simulation, sensitivity analysis, decision trees) to the highest-priority risks to estimate their effect on project objectives (time, cost, scope) in measurable terms.
- Plan Risk Responses: Develop options and strategies for each significant risk. For threats: Avoid, Transfer, Mitigate, or Accept. For opportunities: Exploit, Share, Enhance, or Accept. Assign a risk owner for each response action.
- Monitor and Control Risks: Track identified risks, monitor trigger conditions, implement response plans, and identify new risks as the project progresses. Conduct regular risk reviews at project milestones.
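The Monte Carlo simulation mentioned under quantitative risk analysis can be sketched with the standard library. Each task's duration is drawn from a triangular distribution (optimistic, most likely, pessimistic estimates — all figures here are hypothetical, in weeks):

```python
import random

tasks = {                        # (optimistic, most likely, pessimistic)
    "data collection":  (2, 4, 8),
    "data preparation": (3, 6, 12),
    "modelling":        (2, 3, 6),
    "deployment":       (1, 2, 4),
}

def prob_on_time(deadline_weeks, runs=10_000, seed=1):
    """Estimate the probability the project finishes within the deadline."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # random.triangular takes (low, high, mode) in that order.
        total = sum(rng.triangular(lo, hi, mode)
                    for lo, mode, hi in tasks.values())
        hits += total <= deadline_weeks
    return hits / runs

print(prob_on_time(18))  # estimated probability of finishing within 18 weeks
```

The output is a probability rather than a single date, which is the point: it lets the project manager quote a confidence level ("80% chance of finishing by week 20") instead of a deterministic schedule.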
Top-management commitment: Risk management requires resources (time, budget, tools) and the authority to escalate issues. Without visible commitment from senior leadership, risk processes are treated as bureaucratic exercises and are ignored when they conflict with schedule pressure.
Continuous stakeholder engagement: Risks must be identified with input from all stakeholder groups — technical teams, domain experts, end-users, and legal/compliance officers. No single person sees all risks. Continuous engagement ensures new risks (e.g., a regulatory change mid-project) are captured promptly.
Clear risk ownership: Every risk in the register must have a named owner responsible for monitoring and implementing the agreed response. Without ownership, risk responses are planned but never executed. In DS projects, data-quality risks may be owned by the data engineer, while ethical risks may be owned by the project sponsor.
Handling risk in a DS project involves both standard PM risk practices and DS-specific adaptations:
- Risk Identification: At project kick-off, conduct a risk workshop covering data risks (availability, quality, privacy), model risks (bias, overfitting, interpretability), operational risks (infrastructure, adoption), and schedule/cost risks.
- Risk Register: Document each risk with: description, probability (High/Medium/Low), impact (H/M/L), risk score, planned response, and owner.
- Data-Specific Mitigations:
- Run a data-quality audit before committing to timelines — adjust scope if data is found to be insufficient.
- Establish data backup and fallback sources in case the primary data source is unavailable.
- Include privacy impact assessments (PIAs) for projects using personal data.
- Agile Iteration as Risk Management: Using short sprint cycles means risks are surfaced and resolved frequently rather than accumulating until late in the project. Each sprint review is effectively a risk checkpoint.
- Contingency Reserves: Build schedule buffer (typically 10–15%) and cost contingency into the project plan to absorb realised risks without derailing the project.
- Continuous Monitoring: Review the risk register at every project milestone. Update probability and impact ratings as the project progresses and new information becomes available.
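The risk-register structure described above can be sketched as data plus a probability × impact score. The entries below are hypothetical examples of the DS-specific risks already listed:

```python
# Map the qualitative H/M/L scales to numbers so risks can be ranked.
LEVEL = {"L": 1, "M": 2, "H": 3}

register = [
    {"risk": "Training data contains historical bias", "p": "M", "i": "H",
     "response": "Mitigate: bias audit before deployment", "owner": "Data lead"},
    {"risk": "Primary data source becomes unavailable", "p": "L", "i": "H",
     "response": "Mitigate: maintain fallback source", "owner": "Data engineer"},
    {"risk": "Cloud costs exceed budget", "p": "M", "i": "M",
     "response": "Accept: monitor monthly spend", "owner": "Project manager"},
]

# Risk score = probability level x impact level (a simple P-I matrix).
for entry in register:
    entry["score"] = LEVEL[entry["p"]] * LEVEL[entry["i"]]

# Review the register highest-score first at each milestone.
register.sort(key=lambda e: e["score"], reverse=True)
for e in register:
    print(e["score"], e["risk"], "->", e["owner"])
```

Keeping the register as structured data (rather than a static document) makes the milestone reviews cheap: re-rate probability and impact, re-sort, and the priority order updates itself.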
Agile is an iterative, incremental approach to project management that emphasises flexibility, collaboration, and delivering working outputs in short cycles called sprints (typically 1–4 weeks).
Key Agile principles applied in DS:
- Iterative development: Instead of delivering a complete model at project end, DS teams deliver prototype models or partial analyses at the end of each sprint for stakeholder feedback.
- Scrum framework: Daily standups, sprint planning, sprint reviews, and retrospectives keep the team aligned and blockers visible. The product backlog contains DS user stories (e.g., "As a bank manager, I need a fraud-probability score per transaction").
- Kanban boards: Used to visualise DS workflow stages (data collection → EDA → feature engineering → modelling → review) and manage work-in-progress limits.
- Continuous stakeholder involvement: Business owners review model outputs at every sprint, reducing the risk of building a technically correct but business-irrelevant model.
| Dimension | Agile | Traditional (Waterfall) |
|---|---|---|
| Planning | Rolling; adapts each sprint | Upfront; fixed baseline plan |
| Deliverables | Incremental working outputs every sprint | Single delivery at project end |
| Change management | Embraces change; backlog adjusted | Resists change; formal change control |
| Stakeholder involvement | Continuous (sprint reviews) | At defined milestones only |
| Risk exposure | Lower — issues surface in each sprint | Higher — issues surface late |
| Documentation | Lean; working model over documentation | Heavy; detailed specs required |
Scheduling and budget control: A project schedule (Gantt chart or Kanban board) translates the work breakdown structure into time-bounded tasks with dependencies. When each task is costed, the schedule becomes a time-phased budget baseline. Earned Value Management (EVM) then compares actual spend against the value of work completed:
- Gantt charts show planned vs actual progress for each DS phase (data collection, model training, deployment). Slippage is immediately visible and budget impact can be calculated.
- Kanban boards in Agile DS projects surface bottlenecks (e.g., data cleaning tasks piling up) that predict upcoming delays and cost increases before they materialise.
Two strategies to synchronise financial tracking with schedule updates:
- Earned Value Management (EVM) with monthly re-baselining: Link the project schedule directly to the cost account structure. Each time the schedule is updated (Gantt bars moved), the cost baseline is automatically recalculated. The project manager reviews Cost Performance Index (CPI) and Schedule Performance Index (SPI) at every sprint review or Gantt milestone.
- Milestone-based payment and budget release gates: Structure the project budget so funds are released only upon achieving verified milestones (e.g., "data pipeline signed off" before model development budget is released). This creates a direct link between schedule progress and financial authority, incentivising teams to complete stages before moving on.
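The CPI and SPI indices used in the first strategy reduce to two ratios over planned value (PV), earned value (EV), and actual cost (AC). A minimal sketch with hypothetical figures:

```python
def evm_indices(pv, ev, ac):
    """Return (CPI, SPI); values > 1 mean under budget / ahead of schedule."""
    cpi = ev / ac   # cost efficiency: value earned per shilling spent
    spi = ev / pv   # schedule efficiency: value earned vs value planned
    return cpi, spi

# Hypothetical sprint-review snapshot (KSh): planned 500k of work,
# completed 400k worth, actually spent 450k.
cpi, spi = evm_indices(pv=500_000, ev=400_000, ac=450_000)
print(round(cpi, 2), round(spi, 2))  # 0.89 0.8 -> over budget and behind schedule
```

Reviewing these two numbers at every sprint review or Gantt milestone is what links schedule updates to the financial baseline: a CPI drifting below 1 flags a budget overrun long before the money runs out.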
- Final Deliverable Acceptance: Obtain formal written sign-off from the client/sponsor confirming that all deliverables meet agreed specifications. Why essential: Prevents disputes about what was delivered and triggers contractual payment.
- Resource Release: Release project staff back to their functional departments or other projects; terminate cloud compute subscriptions; return leased equipment. Why essential: Prevents ongoing costs after project completion and frees resources for other initiatives.
- Contract Closure: Close all vendor and subcontractor agreements. Confirm that all payments are settled and warranties are documented. Why essential: Protects the organisation from future financial and legal liability.
- Documentation and Archiving: Archive all project artefacts — code repositories, datasets, model files, reports, meeting minutes — in a retrievable system with clear labelling. Why essential: Enables future model auditing, retraining, or project reactivation. Regulatory compliance may also require retention.
- Lessons Learned Review: Conduct a structured retrospective with the full project team to document what went well, what went wrong, and what should be done differently. Why essential: Organisational learning prevents repeating mistakes in future DS projects and builds institutional knowledge.
- Formal Project Closure Report: Produce a closure report summarising final costs, schedule performance, quality outcomes, and lessons learned. Distribute to all stakeholders. Why essential: Creates an official record that the project has ended and documents accountability.
Context: A credit-scoring ML model for a Kenyan bank is terminated mid-deployment after an audit reveals the model uses demographic proxies that discriminate against rural customers.
Suspend all model predictions immediately. Quarantine the training dataset and any outputs that may have caused harm. Justification: Prevents further discriminatory decisions while the full impact is assessed; protects the bank from regulatory sanctions.
Identify all customers whose loan decisions were influenced by the biased model. Communicate transparently with the regulator (CBK), affected customers, and internal stakeholders. Justification: Regulatory bodies expect prompt, honest disclosure. Delayed communication compounds reputational and legal risk.
Revert the credit-scoring process to the previous (manual or rule-based) system to ensure business continuity. Hand over interim operational procedures to the credit department. Justification: The business cannot pause lending operations while the remediation is designed.
Review vendor and data-supplier contracts for termination clauses. Ensure data-deletion obligations under the Kenya Data Protection Act (2019) are met. Justification: Failure to honour contractual termination terms creates financial liability; data-deletion is a legal obligation.
Release DS team members to other projects. Archive but do not delete project assets (they may be needed for the ethics investigation). Decommission cloud environments except audit-required storage. Justification: Controls ongoing costs while preserving evidence for regulatory review.
Document: root cause of the bias (which features acted as proxies), when it should have been caught (ethics review gap), and what controls failed. Recommend: mandatory bias audits at model evaluation stage in all future DS projects. Justification: Institutional learning is the primary long-term benefit of any termination — this prevents recurrence across the organisation.
Termination is the formal conclusion of a project — planned or unplanned — in which project activities cease, resources are released, and the project is formally closed. It may occur upon successful completion, or early due to changed circumstances, failure, or ethical issues.
Example: The Kenyan government terminates a crime-prediction DS pilot after civil society groups raise concerns about bias against specific communities — this is an ethical early termination.
The project reaches its planned end: all deliverables are accepted, the system is deployed, and the project team is disbanded. This is the ideal scenario. Best suited for: Projects that achieve their objectives on schedule — e.g., a DS model for supermarket demand forecasting that is successfully integrated into the ERP and handed to operations. Critical point: Even natural closure requires formal documentation; many organisations skip lessons-learned reviews when projects end positively, losing valuable institutional knowledge.
The project is so successful that it becomes a permanent part of the organisation — it is absorbed into an ongoing business unit rather than closed. Best suited for: DS systems that generate continuous value and need ongoing maintenance (e.g., a fraud-detection model at a bank becomes a permanent ML Ops function). Critical point: The transition from project to operations must be carefully managed — project-style governance (sprint planning, daily standups) must give way to a service-management model (SLAs, change control).
Resources are progressively cut until the project can no longer continue. This often signals a politically difficult termination where leadership does not want to formally cancel a project but withdraws support. Best suited for: Projects that have lost strategic relevance or where results are disappointing but the organisation is unwilling to admit failure. Critical point: This is the most damaging strategy — it drains resources slowly, demoralises the team, and produces no clean closure. An explicit go/no-go decision at a defined review gate is better practice than letting a project die gradually.
Short answer: No — a project cannot be deemed fully successful if it causes demonstrable harm to users, even if financial targets are met.
Limits of profit-based success models:
- They ignore externalities: Financial ROI captures value to the organisation but not costs imposed on affected users, communities, or society. A credit-scoring model that denies loans to creditworthy rural women may generate profit for the lender but perpetuates financial exclusion.
- They are static snapshots: a positive NPV is a projection calculated at a single point in time. Regulatory fines, reputational damage, litigation costs, and forced system shutdowns that arise post-deployment can obliterate the financial case entirely, turning a positive-NPV project into a net loss.
- They don't account for trust: Long-term organisational value depends on customer trust. A discriminatory DS system — even a profitable one — erodes the social licence to operate.
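The NPV point can be made concrete with a small worked example. All cash-flow figures, the discount rate, and the size of the fine below are invented assumptions for illustration only; the pattern to notice is the sign flip.

```python
def npv(rate, cashflows):
    """Net present value, where cashflows[0] occurs at t=0 (undiscounted)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

rate = 0.10  # assumed discount rate

# As projected at approval: -100 upfront, +50 per year for 3 years (units arbitrary)
planned = [-100, 50, 50, 50]

# Same project after an ethical failure in year 2: a regulatory fine of 80
# hits that year, and a forced shutdown wipes out the year-3 benefit.
with_failure = [-100, 50, 50 - 80, 0]

print(round(npv(rate, planned), 1))       # positive at approval
print(round(npv(rate, with_failure), 1))  # negative after fine and shutdown
```

The "fully successful" verdict depends entirely on which of these two cash-flow streams materialises — and the ethical failure is what decides that.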
Challenges:
- Tool fragmentation: DS projects commonly use Jira (project management), Git (version control), MLflow (experiment tracking), Airflow (pipelines), and Tableau (visualisation) — all from different vendors with different APIs. Keeping these in sync creates integration overhead.
- Data security across platforms: When data moves between on-premise databases, cloud storage, and SaaS tools, data-governance policies are harder to enforce consistently.
- Skill requirements: Team members must be proficient in multiple tools. Tool sprawl can reduce productivity if team members spend more time managing the toolchain than doing DS work.
- Vendor lock-in: Heavy reliance on proprietary platforms (e.g., AWS SageMaker, Azure ML) makes it costly to switch vendors if pricing or service quality changes.
Opportunities:
- End-to-end automation: A well-integrated ecosystem enables CI/CD for ML — from data ingestion through model training, testing, and deployment — reducing manual errors and accelerating delivery.
- Improved traceability: Integrated platforms link data versions, code versions, experiment results, and deployed models, creating a complete audit trail that satisfies regulatory requirements.
- Scalability: Cloud-native integrated platforms scale compute and storage on demand, enabling DS projects to handle large datasets without upfront infrastructure investment.
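The traceability opportunity can be sketched in a few lines: each training run records content hashes of its data and code, so any deployed model can be traced back to exactly what produced it. Real platforms (e.g., MLflow) maintain this linkage for you; the structure below is a minimal illustrative assumption, not any tool's actual schema.

```python
import hashlib
import json

def fingerprint(content: bytes) -> str:
    """Short content hash used to version data and code artefacts."""
    return hashlib.sha256(content).hexdigest()[:12]

def record_run(data: bytes, code: bytes, params: dict, metrics: dict) -> dict:
    """Build one audit-trail entry linking data, code, parameters, and results."""
    return {
        "data_version": fingerprint(data),
        "code_version": fingerprint(code),
        "params": params,
        "metrics": metrics,
    }

# Hypothetical run: tiny stand-in artefacts, invented metric values
run = record_run(
    data=b"customer_id,churned\n1,0\n2,1\n",
    code=b"model = LogisticRegression()",
    params={"C": 1.0},
    metrics={"auc": 0.87},
)
print(json.dumps(run, indent=2))
```

Because the versions are content hashes, any change to the data or the code produces a different entry, which is what makes the trail auditable.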
Phase 1 — Initiation & Planning (Risk Identification):
- Conduct a structured risk workshop with the transport agency, Nairobi City County, traffic police, and sensor vendors to identify risks: sensor failure, data latency, data privacy (vehicle tracking), political interference, and budget cuts.
- Create a risk register with probability/impact ratings. Given Kenya's infrastructure variability, assign high probability to connectivity outages and power failures at sensor nodes.
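A minimal risk register of the kind described above might look like this, with exposure scored as probability x impact on 1-5 scales. The specific risks, probabilities, and impacts are illustrative assumptions, not real assessments.

```python
# Hypothetical risk register for the traffic project (1-5 scales)
risks = [
    {"risk": "Sensor connectivity outage", "probability": 5, "impact": 4},
    {"risk": "Vehicle-tracking privacy breach", "probability": 2, "impact": 5},
    {"risk": "Budget cut mid-project", "probability": 3, "impact": 4},
    {"risk": "Power failure at sensor nodes", "probability": 4, "impact": 3},
]

# Score each risk and surface the highest exposures first
for r in risks:
    r["score"] = r["probability"] * r["impact"]

for r in sorted(risks, key=lambda r: r["score"], reverse=True):
    print(f'{r["score"]:>3}  {r["risk"]}')
```

Sorting by score makes the register actionable: mitigation effort goes to the top entries (here, connectivity outages, matching the high probability assigned to them in the workshop).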
Phase 2 — Data Collection (Data Risk Controls):
- Diversify data sources: combine fixed sensors with mobile probe data (e.g., matatu GPS, Google Maps APIs) to reduce single-source failure risk.
- Implement real-time data quality monitoring — alert when sensor dropout rates exceed 10% in any corridor.
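The dropout check in the second bullet can be sketched as follows. The 10% threshold comes from the text; the corridor names and reading counts are illustrative assumptions.

```python
DROPOUT_THRESHOLD = 0.10  # alert threshold from the monitoring rule above

def dropout_rate(expected: int, received: int) -> float:
    """Fraction of expected sensor readings that never arrived."""
    return (expected - received) / expected

def corridors_to_alert(readings: dict) -> list:
    """Return corridors whose dropout rate exceeds the threshold.

    readings maps corridor name -> (expected_count, received_count).
    """
    return [
        corridor
        for corridor, (expected, received) in readings.items()
        if dropout_rate(expected, received) > DROPOUT_THRESHOLD
    ]

readings = {
    "Uhuru Highway": (1000, 950),  # 5% dropout: within tolerance
    "Mombasa Road": (1000, 820),   # 18% dropout: triggers alert
}
print(corridors_to_alert(readings))  # ['Mombasa Road']
```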
Phase 3 — Modelling (Model Risk Controls):
- Test models on historical traffic data before live deployment. Include scenarios for extreme events (Nairobi Marathon, state funerals).
- Build fallback rule-based systems so traffic signal controllers revert to fixed timing if the ML system fails.
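The fallback behaviour in the second bullet can be sketched like this: if the ML timing service raises an error or returns an implausible plan, the controller reverts to fixed timing. The fixed plan and the validity bounds are illustrative assumptions.

```python
FIXED_PLAN = {"green_seconds": 45, "red_seconds": 45}  # assumed fixed timing

def valid(plan: dict) -> bool:
    """Reject timings outside a plausible operating range (assumed 10-120 s)."""
    return all(10 <= plan.get(k, -1) <= 120 for k in ("green_seconds", "red_seconds"))

def signal_plan(ml_predict) -> dict:
    """Use the ML plan when available and sane; otherwise fall back."""
    try:
        plan = ml_predict()
    except Exception:
        return FIXED_PLAN
    return plan if valid(plan) else FIXED_PLAN

def failing_predictor():
    raise RuntimeError("model service timeout")

print(signal_plan(failing_predictor))  # ML service down: falls back to FIXED_PLAN
print(signal_plan(lambda: {"green_seconds": 60, "red_seconds": 30}))  # ML plan used
```

The sanity check matters as much as the exception handler: a model that responds on time but emits a nonsensical timing is just as dangerous as one that is down.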
Phase 4 — Deployment & Monitoring (Operational Risk Controls):
- Deploy in a single corridor first (e.g., Uhuru Highway) as a pilot before citywide rollout — limits blast radius of any failure.
- Establish 24/7 operations monitoring with defined incident response procedures. Conduct quarterly model performance reviews and retrain as traffic patterns evolve (e.g., post-construction of Nairobi Expressway).
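The quarterly review and retraining trigger could follow a simple rule: retrain when the current quarter's prediction error has degraded beyond a tolerance relative to the error measured at deployment. The 15% tolerance and the error figures are illustrative assumptions.

```python
RELATIVE_TOLERANCE = 0.15  # assumed acceptable degradation before retraining

def needs_retraining(baseline_mae: float, current_mae: float) -> bool:
    """True if current error exceeds the deployment baseline by more than the tolerance."""
    return current_mae > baseline_mae * (1 + RELATIVE_TOLERANCE)

# Travel-time prediction error in minutes (hypothetical figures)
print(needs_retraining(baseline_mae=4.0, current_mae=4.3))  # False: within tolerance
print(needs_retraining(baseline_mae=4.0, current_mae=5.1))  # True: drift, retrain
```

A relative threshold like this gives the quarterly review an objective trigger, so retraining decisions do not depend on ad-hoc judgement as traffic patterns shift.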
Issue: Crop Failure and Food Insecurity in Kenya's Agricultural Sector
Business/Social Problem: Kenya's agricultural sector employs roughly 40% of the workforce, but smallholder farmers — who produce the majority of food — face high crop-failure rates due to unpredictable rainfall, pest outbreaks, and poor soil management. This drives food insecurity (particularly in arid/semi-arid counties such as Turkana and Marsabit) and economic loss for farmers and agribusinesses that depend on reliable supply chains.