What Six Years of Building Production ML at a Telco Actually Taught Me

DRAFT. This is a first-pass voice draft, not a final. The goal was to get the whole arc onto the page — argument, case studies, hiring signal — without editing it to death. Michael: read it for voice first. Where I invented a number or a detail I couldn't verify from the vault, I left a [FLAG: …] inline. Replace those with the real figures before this goes anywhere public.

The short version

For the last six years I've been the person who keeps data products alive after the demo is over. Not the notebook. Not the proof-of-concept that wins the slide deck. The thing that runs every day, feeds a dashboard a stakeholder actually opens, and breaks at 7am when an upstream table changes shape.

Most ML writing is about training a model. Almost none of it is about the part that consumes 90% of the calendar: making the model boring. Reproducible. Documented. Owned. Cheap enough to run that nobody asks you to turn it off.

This is a field report from inside a telco — Virgin Media O2 — where I work as an Analytics Engineer on Strategic Data Projects. Three case studies, one argument: the hard part of production ML isn't the model. It's everything that lets the model stay in production.

Case study 1 — A customer trust score that has to mean the same thing every day

The headline project is an ensemble scoring system. We wanted a single, historically comparable number that captured how a customer's experience and value were trending — across Mobile, TV, and Broadband.

The naive version of this is easy and wrong: train a model, score everyone, ship the number. The number drifts, the dashboard moves, and a stakeholder asks "did the customer get worse, or did your model get worse?" — and you can't answer.

Here's the architecture we actually shipped, and why each decision exists:

1. Three cohorts, three models — not one. Mobile, TV, and Broadband customers behave differently enough that a single model smears them together and leaks signal across products. We split into three product cohorts and trained specialised models per cohort. More models to maintain, but each one is honest about its population.

2. An ensemble of three targets, not one score pulled from thin air. For each cohort we trained models to predict three real business outcomes observed in the 6 months after a fixed reference date:

is_churn — logistic regression
is_recontract — logistic regression
revenue_change — linear regression

The model coefficients become feature weights. A composite weight per feature is derived conceptually as weight_recontract + weight_nrc − weight_churn (with weight_nrc taken from the revenue_change model), so the final score rewards the behaviours the business actually cares about and penalises the ones it doesn't.
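A minimal sketch of that composite weight, in Python for illustration only. It reads weight_nrc as the revenue_change coefficient, which is an interpretation of the text rather than something the design doc confirms. It also assumes the three coefficient sets are on comparable scales (for example, fitted on standardised features). The feature names and numbers are hypothetical.

```python
from typing import Dict


def composite_weights(
    w_recontract: Dict[str, float],
    w_revenue: Dict[str, float],
    w_churn: Dict[str, float],
) -> Dict[str, float]:
    """Combine per-target coefficients into one weight per feature.

    Recontract and revenue coefficients add to the weight; churn subtracts.
    A feature missing from one model contributes 0.0 for that model.
    """
    features = set(w_recontract) | set(w_revenue) | set(w_churn)
    return {
        f: w_recontract.get(f, 0.0) + w_revenue.get(f, 0.0) - w_churn.get(f, 0.0)
        for f in sorted(features)
    }


# Hypothetical coefficients, not real model output.
weights = composite_weights(
    w_recontract={"tenure_months": 0.5, "complaints_90d": -0.25},
    w_revenue={"tenure_months": 0.25, "complaints_90d": -0.5},
    w_churn={"tenure_months": -0.5, "complaints_90d": 0.75},
)
```

A feature that predicts churn ends up with a negative composite weight even if its churn coefficient is positive, which is the point of subtracting that term.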

[FLAG: the production ETI ensemble uses logistic + linear regression, per the Technical Design Doc in the vault — NOT gradient boosting. The task brief and some career materials reference a "gradient boosting model (XGBoost)" churn model. If you've built a separate XGBoost churn model elsewhere at VMO2, name it explicitly here and keep the two distinct. Do not write "XGBoost neural networks" anywhere — always "gradient boosting model (XGBoost)". Michael: confirm which model belongs in this case study.]

3. A fixed ruler for standardisation — this is the real lesson. Every feature is converted to a z-score, but the mean and standard deviation are computed once, from a fixed historical reference period (the 6 months up to 2025-07-31), and then stored. Every future day is standardised against that same stored ruler:

z_score = (raw_feature_value − stored_reference_avg) / stored_reference_stddev

This is the decision that makes the whole thing trustworthy. If you re-fit the mean and stddev daily, a customer's score can change because the population moved, even when the customer did nothing different. By freezing the ruler, a change in the score reflects a change in that customer's behaviour, not a shift in the baseline. That single architectural choice is the difference between a metric a business can track over time and a metric that quietly lies.
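To make the difference concrete, here is a small Python sketch comparing a stored ruler with a daily re-fit. The values are hypothetical. It uses population standard deviation to keep the arithmetic clean; the production definition may differ.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass(frozen=True)
class ReferenceStats:
    """Mean and stddev computed once from the reference period, then stored."""
    avg: float
    stddev: float


def fit_reference(values: Sequence[float]) -> ReferenceStats:
    return ReferenceStats(avg=mean(values), stddev=pstdev(values))


def z_score(raw: float, ref: ReferenceStats) -> float:
    if ref.stddev == 0:
        return 0.0  # assumption: a constant feature carries no signal
    return (raw - ref.avg) / ref.stddev


# Hypothetical feature values for the fixed reference period.
reference_period = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ruler = fit_reference(reference_period)

# Later the whole population shifts up by 2, but this customer stays at 7.0.
later_population = [v + 2.0 for v in reference_period]
customer_value = 7.0

fixed = z_score(customer_value, ruler)
refit = z_score(customer_value, fit_reference(later_population))
```

With the stored ruler the customer's z-score is unchanged by the population shift. With the re-fit it falls, even though the customer's raw value never changed.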

4. Outliers handled deliberately, not accidentally. Aggregated raw scores live on arbitrary scales with brutal outliers, so plain min-max normalisation fails: a single extreme value sets the range and squeezes everyone else into a sliver of it. We standardise the raw score to a z-score, then push it through TANH() to squash it into a clean −1 to +1 range. Outliers compress gracefully instead of blowing up the axis.
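A short Python sketch of why the tanh squash behaves better than min-max when there is an extreme value. The z-scores are hypothetical, and math.tanh stands in for the SQL TANH() call.

```python
import math
from typing import List


def squash(z: float) -> float:
    """Map a z-score into the -1 to +1 range; extreme values saturate."""
    return math.tanh(z)


def min_max(values: List[float]) -> List[float]:
    """Plain min-max scaling to 0..1, shown only for comparison."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical z-scores with one extreme outlier.
z_scores = [-1.0, 0.0, 1.0, 50.0]

squashed = [squash(z) for z in z_scores]
scaled = min_max(z_scores)

# Gap between the two "ordinary" customers at z=0 and z=1 under each method.
gap_tanh = squashed[2] - squashed[1]
gap_min_max = scaled[2] - scaled[1]
```

Under min-max the outlier consumes almost the whole range, so the gap between ordinary customers nearly vanishes. Under tanh the outlier saturates near +1 and the ordinary customers stay well separated.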

The takeaway: none of the hard decisions here were modelling decisions. They were consistency decisions — fixed reference periods, stored parameters, a ruler that doesn't move. That's what "production" means. The model was the easy 20%.

Case study 2 — Experimentation you can actually defend

[FLAG: this case study needs Michael's real A/B testing details. The task brief asks for an experimentation/A-B case study, but I could not find a specific experiment write-up in the vault. Below is a SCAFFOLD with the right shape and the right lessons — replace the bracketed specifics with a real experiment you ran. If you don't have a clean A/B story, consider swapping this for the Contact Network Analysis / contact-reason work, which is fully evidenced in the vault.]

The fastest way to lose stakeholder trust is to ship a "win" that doesn't replicate. So the experimentation work was less about clever statistics and more about discipline:

Define the success metric before you look at the data. [FLAG: name the metric — e.g. FTR (Failure to Respond), recontract rate, contact rate.]
Decide the sample size and stopping rule up front, so nobody peeks and declares victory early. [FLAG: state the actual approach you used — fixed horizon, sequential testing, etc.]
Separate the analysis cohort from the training cohort to avoid leakage — the same instinct that drove the cohort split in case study 1.
Write the result down even when it's null. A documented null result is a saved quarter for the next person who'd have run the same idea.
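As one illustration of deciding the sample size up front, here is a Python sketch for a fixed-horizon test on a rate metric. It uses the standard normal approximation for a two-sided, two-proportion z-test. The fixed-horizon design, the rates, and the alpha and power defaults are all assumptions for the example, not the approach actually used.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    p_control: float,
    p_treatment: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Per-arm n for a two-sided, two-proportion z-test (normal approximation)."""
    if p_control == p_treatment:
        raise ValueError("rates must differ to size a test")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical rates: detect a move from 10% to 12%, then from 10% to 11%.
n_needed = sample_size_per_arm(0.10, 0.12)
n_smaller_effect = sample_size_per_arm(0.10, 0.11)
```

Halving the effect you want to detect roughly quadruples the required sample, which is why the number has to be agreed before anyone looks at results.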

[FLAG: insert one concrete experiment — what you tested, the population, the metric movement (with the real number), and the decision it drove. One real number is worth ten general principles here.]

The takeaway is the same shape as case study 1: the trustworthy part isn't the test statistic, it's the process around it that stops you fooling yourself.

Case study 3 — Analytics engineering: making the boring 90% repeatable

This is the work I'd argue is most underrated in the whole ML lifecycle, and it's where most of my last few years have actually gone: turning ad-hoc analysis into versioned, tested, documented data products with dbt on BigQuery.

A few concrete things that shipped and why they mattered:

customer_360_definition — the foundational Customer 360 model. It's the prerequisite layer the trust ensemble sits on top of. Built deliberately to a "Business Layer" standard: certifiable, independently deployable, joins clean. Without a solid 360 layer, every downstream model re-invents "who is the customer" slightly differently and the numbers stop reconciling.
customer_trust_signal — unions the three cohort outputs into one table, projecting missing columns as NULL so the schema stays consistent (e.g. a Mobile record has no icoms_account_uid). Boring, essential, and the kind of thing that silently breaks a dashboard if you get it wrong.
hierarchical_reference_stats — the stored "fixed ruler" from case study 1, materialised as a reference table. The architecture is the discipline.
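The union-with-NULLs pattern behind customer_trust_signal can be sketched in Python as below. The real model is SQL in dbt. Here rows are plain dicts and None plays the role of NULL. Apart from icoms_account_uid, the column names and sample rows are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Sequence

Row = Dict[str, Optional[object]]

# Assumed unified schema; only icoms_account_uid is named in the text.
UNIFIED_COLUMNS = ("customer_id", "cohort", "trust_score", "icoms_account_uid")


def project(row: Row, columns: Sequence[str] = UNIFIED_COLUMNS) -> Row:
    """Return the row with exactly the unified columns, filling gaps with None."""
    return {col: row.get(col) for col in columns}


def union_cohorts(*cohorts: Iterable[Row]) -> List[Row]:
    """Stack cohort outputs after projecting each row onto the same schema."""
    return [project(row) for cohort in cohorts for row in cohort]


# Hypothetical cohort outputs: the mobile row has no icoms_account_uid.
mobile = [{"customer_id": "m1", "cohort": "mobile", "trust_score": 0.4}]
broadband = [
    {
        "customer_id": "b1",
        "cohort": "broadband",
        "trust_score": -0.2,
        "icoms_account_uid": "acc-1",
    }
]

signal = union_cohorts(mobile, broadband)
```

Every output row has the same keys in the same order, so a downstream consumer never has to guess whether a column exists for a given cohort.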

The pattern under all of this: documentation lives in the same merge request as the model. I learned this the expensive way — shipping a model in one MR and its YML docs in another created documentation drift almost immediately. Now the YML goes in with the model, every time. It sounds trivial. It's the difference between a catalogue people trust and a catalogue people ignore.
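One way to enforce that rule mechanically is a check over the files changed in a merge request. The Python sketch below is hypothetical and is not the Auto-Code Review tool. It assumes docs live in a .yml or .yaml file in the same directory as the model, and the file paths are invented.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def models_missing_docs(changed_files: Iterable[str]) -> List[str]:
    """Return changed .sql models whose directory has no changed YAML file."""
    paths = [PurePosixPath(f) for f in changed_files]
    dirs_with_yml = {p.parent for p in paths if p.suffix in {".yml", ".yaml"}}
    return sorted(
        str(p)
        for p in paths
        if p.suffix == ".sql" and p.parent not in dirs_with_yml
    )


# Hypothetical merge request file list.
changed = [
    "models/trust/customer_trust_signal.sql",
    "models/trust/schema.yml",
    "models/core/customer_360_definition.sql",
]
missing = models_missing_docs(changed)
```

A check like this only proves that a YAML file was touched, not that the docs are good, but it is enough to stop the model and its documentation landing in separate merge requests.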

This is also why I care about the Engineering Excellence work at VMO2 — six internal products (Auto-Lineage Builder, Auto-Code Review, an EE Scorecard, and others) all aimed at the same thing: making good engineering the path of least resistance for the whole team, not a heroic act by one careful person.

The takeaway: analytics engineering is the discipline that decides whether your ML system is an asset or a liability in eighteen months. A model with no lineage, no tests, and no docs is a future incident with a countdown timer.

What I actually believe after six years

The model is the easy 20%. Reproducibility, consistency, ownership, and cost are the 80% — and they're what separate a demo from a system.
Freeze your rulers. Any metric meant to be compared over time needs fixed reference parameters, or it drifts and lies.
Leakage hides in the obvious places — cohort splits, train/test boundaries, reference periods. Most "great" results are leakage until proven otherwise.
Documentation is a deploy artefact, not an afterthought. Same MR as the model. Always.
The best engineering makes the right thing the easy thing for everyone else on the team. That's what Engineering Excellence actually means.

Where I'm headed (the honest hiring signal)

I'm an Analytics Engineer who's spent six years on the unglamorous, load-bearing end of production ML in a large UK telco: dbt, BigQuery, ensemble scoring, business-layer modelling, and the engineering discipline that keeps all of it running. I'm now looking for roles that push further into production ML / ML platform / senior analytics engineering — places that take reproducibility and data quality as seriously as model accuracy.

[FLAG: Michael — decide how explicit to be about the £80k+ search and whether this paragraph goes in a public Stack Overflow post or only in the LinkedIn / personal-site cut. A softer "open to conversations about senior AE / ML platform roles" reads better on a technical hub; the explicit salary band belongs in DMs and recruiter conversations, not the article body.]

If you're building data products that have to survive contact with reality, I'd like to talk.

Drafting notes (delete before publish)

Voice was kept first-person, plain, slightly dry — your register, not LinkedIn hype-speak. Edit toward more of your voice, not less.
All technical specifics (cohort split, three-target ensemble, fixed reference z-scores, TANH normalisation, the named dbt models) are sourced from the Experiential Trust Index Technical Design Doc and the April monthly review in the vault, so they're real. The model type in case study 1 is the one place to double-check (regression vs XGBoost) — see the FLAG.
Case study 2 (A/B) is the weakest section because the vault has no concrete experiment. Either supply one real experiment, or swap in the Contact Network Analysis work.
Word count: ~1,300. Comfortable Stack Overflow / blog length. Could trim case study 1 if you want it tighter.
