
ML Problem Framing: Why Bad Setup Kills Good Pipelines

Define your ML problem with the rigour of a data contract before writing a single line of training code — or your pipeline will optimise for the wrong outcome.

An architect's blueprint overlaid on a tangled mess of data pipelines, symbolising the gap between good infrastructure and poorly framed ML problems
Illustrated by Mikael Venne

Eight in ten ML projects fail — not because the model is wrong, but because the problem was never right to begin with. That ratio should make every data architect uncomfortable, because the pipeline is usually the last thing that needs fixing.

I’ve built enough data infrastructure to know that the most expensive mistakes don’t happen in the model training loop. They happen weeks earlier, when someone decides what the model is supposed to do — and nobody pushes back hard enough. Bad problem framing doesn’t just waste compute. It wastes the entire upstream investment: the clean lakehouse, the carefully orchestrated ETL, the feature store you spent three sprints engineering. All of it optimised to answer the wrong question.

The Problem Framing Debt Nobody Measures

Kaushik Rajan’s analysis in Towards Data Science puts it plainly: most ML teams treat problem definition as a formality, then spend months tuning hyperparameters on a model that’s solving a proxy metric nobody actually cares about. The five-step protocol he outlines — anchoring to a business decision, defining success before data collection, stress-testing the causal assumptions — reads like good data architecture practice applied one layer earlier in the stack.

That framing resonates from where I sit. When we design a data warehouse schema or a lakehouse partition strategy, the first question is always: what decisions will this data support? A warehouse built to answer the wrong question is a liability, not an asset. ML projects are no different. In a SEA context, where a Shopee or Lazada merchant team might be running 12 experiments simultaneously across five markets, a poorly framed churn prediction model can misdirect retention spend across an entire region before anyone notices the error rate in production.

The fix isn’t more sophisticated modelling. It’s a documented problem statement — treated with the same rigour as a data contract — reviewed before a single feature is engineered.
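As a sketch of what "a problem statement with the rigour of a data contract" could look like in practice: a typed, reviewable record whose fields must all be filled before feature work begins. The class and field names below are illustrative, not a standard — the point is that the statement is validated mechanically, like a schema change, rather than living in a slide deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemStatement:
    """A reviewable ML problem definition, treated like a data contract."""
    business_decision: str   # the concrete decision the model supports
    decision_owner: str      # who acts on the model's output
    success_metric: str      # agreed before any data is collected
    causal_assumption: str   # why the inputs should predict the target
    failure_mode: str        # what "wrong" looks like in production

    def validate(self) -> "ProblemStatement":
        # Refuse to proceed if any field is empty boilerplate.
        empty = [name for name, value in vars(self).items() if not value.strip()]
        if empty:
            raise ValueError(f"problem statement incomplete: {empty}")
        return self
```

A statement like this can be checked in CI alongside the pipeline code, so a vague brief fails the build before it fails in production.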

Hybrid Search and the Architecture of Retrieval

One area where problem framing failures are particularly costly right now: Retrieval-Augmented Generation (RAG) systems. The default assumption is that semantic vector search is sufficient. It often isn’t.

Maria Mouschoutzi’s breakdown of hybrid search mechanics in Towards Data Science is worth internalising if you’re designing any retrieval pipeline in 2026. TF-IDF and BM25 — the keyword-based workhorses that predate neural embeddings by decades — still outperform dense vector retrieval on exact-match queries, rare terminology, and structured product codes. In SEA markets, where queries frequently mix English with Bahasa, Thai, or Tagalog, and where SKU identifiers matter as much as intent signals, a pure semantic retrieval layer will quietly degrade recall in ways that aggregate metrics won’t catch.

The architectural implication: a well-designed RAG pipeline requires a hybrid index — BM25 for lexical precision, dense embeddings for semantic coverage — with a fusion layer (typically Reciprocal Rank Fusion) to merge results. That’s not a model tuning problem. It’s a pipeline design decision that needs to be made at the schema level, before the retrieval layer is built. Getting it wrong means reengineering the index under production load.
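The fusion layer itself is small. Reciprocal Rank Fusion, as commonly formulated, scores each document as the sum of 1/(k + rank) across the ranked lists it appears in, with k typically set to 60. A minimal sketch, with made-up document IDs standing in for a BM25 result list and a dense-retrieval result list:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one.

    rankings: ranked lists of doc IDs (best first), e.g. one from BM25
    and one from dense vector search. The constant k damps the
    influence of top ranks so no single retriever dominates.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Lexical and semantic retrieval disagree; fusion reconciles them.
bm25_results = ["sku-123", "doc-a", "doc-b"]   # exact SKU match wins lexically
dense_results = ["doc-a", "doc-c", "sku-123"]  # intent match wins semantically
fused = reciprocal_rank_fusion([bm25_results, dense_results])
```

Note that documents appearing in both lists ("doc-a", "sku-123") rise to the top — which is exactly the behaviour you want when a mixed-language query carries both an SKU identifier and an intent signal.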


When the Engineer Becomes the Bottleneck

Egor Howell’s account of leaving a $130,000 ML engineering role cuts closer to the organisational nerve. The lessons he describes — around misaligned incentives, the gap between research and production, and the slow erosion of ownership over what actually ships — reflect a structural problem in how many organisations staff data functions.

From a data architecture standpoint, this matters because pipeline quality degrades when the people who understand the data model aren’t the same people making decisions about what gets built on top of it. I’ve seen this play out repeatedly: a lakehouse architecture designed for flexibility gets calcified by an ML team that hardcodes assumptions about grain and freshness into their feature engineering, and neither team realises the coupling until something breaks in production at 2am on a campaign day.

The model isn’t the product. The system that produces and maintains the model is. That system includes the data contracts, the monitoring, the retraining triggers, and the documentation that lets a new engineer understand why a particular partition key exists. Treating ML engineering as a model-shipping function, rather than a systems function, is how organisations end up with impressive demo accuracy and unreliable production behaviour.

For SEA teams specifically — often operating with leaner headcount across more markets than their counterparts in mature economies — this coupling risk is amplified. A single badly architected dependency can cascade across multiple country-level deployments simultaneously.

Fixing the Foundation Before the Model

The practical synthesis across these threads is straightforward, even if the execution isn't. Before any ML initiative touches a training set, three architectural questions need documented answers:

1. What decision does this model support, and who owns that decision?
2. What does the retrieval or feature layer look like at the data contract level, including freshness, grain, and null handling?
3. What does failure look like in production, and how will we detect it?
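The data-contract question (freshness, grain, null handling) is the one most amenable to mechanical enforcement. A minimal sketch of a contract check that could run as a pipeline test, assuming feature rows arrive as dicts carrying an `updated_at` timestamp — all names here are illustrative:

```python
from datetime import datetime, timedelta


def check_feature_contract(rows, grain_keys, max_staleness, non_nullable, now=None):
    """Flag rows that violate a simple feature-layer contract.

    rows: list of dicts, each with an 'updated_at' datetime.
    grain_keys: columns whose combination must be unique (the declared grain).
    max_staleness: freshness SLA as a timedelta.
    non_nullable: columns that must never be None.
    Returns a list of (row_index, reason) violations.
    """
    now = now or datetime.utcnow()
    violations, seen = [], set()
    for i, row in enumerate(rows):
        key = tuple(row[k] for k in grain_keys)
        if key in seen:
            violations.append((i, f"duplicate grain {key}"))
        seen.add(key)
        if now - row["updated_at"] > max_staleness:
            violations.append((i, "freshness SLA breached"))
        for col in non_nullable:
            if row.get(col) is None:
                violations.append((i, f"null in non-nullable column {col}"))
    return violations
```

Running a check like this on every batch makes the ML team's hardcoded assumptions about grain and freshness explicit — and makes a breach a failed test rather than a 2am incident.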

These aren’t ML questions. They’re data architecture questions. The pipeline is only as intelligent as the problem it was designed to solve. A lakehouse built on a well-framed problem statement will outlast three generations of model iterations. One built to support a vague brief will accumulate technical debt faster than any hyperparameter grid search can compensate for.

The organisations winning on data in SEA right now aren’t necessarily the ones with the most sophisticated models. They’re the ones who spent the unglamorous time getting the problem definition right before the first Spark job ran.

The real question for growth teams in 2026: how much of your current ML investment is sitting on top of a problem statement that was never properly stress-tested — and what would it cost to find out?


Written by

Chunky Grizzly

Designing the foundational plumbing — data warehouses, lakehouse models, and ETL pipelines — that separates organisations with genuine intelligence from those drowning in dashboards.
