The basic workflow
Most people start working with Claude Code the same way. You open a session, describe what you want in plain language, Claude writes some code, you read it, ask for changes, and repeat. For a short, well-bounded task — rename a column, write a unit test, fix a specific bug — this works well.
The problem appears when the task grows. Building a feature engineering pipeline. Designing a model evaluation harness. Refactoring a data preprocessing module that touches ten files. These tasks have many moving parts, implicit constraints, and decisions that need to be made before the first line of code is written. The basic workflow has no mechanism for any of that. You just start, and you figure it out as you go.
That approach produces a recognizable pattern: a long, wandering session that ends with code that sort of works, a nagging feeling that some decisions were wrong, and no clear record of what was decided or why. The next time you start Claude Code, that context is gone.
Where it breaks down
Here are the specific failure modes, each illustrated with an example from a feature engineering pipeline project.
Scope creep through negotiation
You ask Claude to build a pipeline that handles missing values. Claude implements median imputation everywhere. You say “actually, use mode for categorical columns.” Claude changes it. You realize it should be configurable per column. You say that. Claude rewrites again, but now the API has changed. You spend ten messages negotiating the interface rather than building the thing. No individual response was wrong — Claude was responding reasonably each time. But without a clear spec, you were discovering the requirements through trial and error, one rewrite at a time.
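For concreteness, here is a sketch of the kind of interface that negotiation tends to converge on, written after the fact. The function name, column names, and strategy map are invented for illustration, not taken from any real session.

    import pandas as pd

    def impute(df: pd.DataFrame, strategies: dict[str, str]) -> pd.DataFrame:
        """Fill missing values column by column, using the strategy named for each column."""
        out = df.copy()
        for column, strategy in strategies.items():
            if strategy == "median":
                out[column] = out[column].fillna(out[column].median())
            elif strategy == "mode":
                out[column] = out[column].fillna(out[column].mode().iloc[0])
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        return out

    # Example: numeric columns get the median, categoricals get the mode.
    # clean = impute(df, {"income": "median", "city": "mode"})

Three rewrites to arrive at roughly fifteen lines of code: the interface itself was never the hard part, agreeing on it up front was.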
direction: right
prompt: "missing value\npipeline"
v1: v1\nmedian everywhere
fb1: "mode for\ncategoricals"
v2: v2\nmedian + mode
fb2: "configurable\nper column"
v3: v3\nAPI changed
fb3: "broke the\ntraining loop"
prompt -> v1
v1 -> fb1
fb1 -> v2
v2 -> fb2
fb2 -> v3
v3 -> fb3
Lost context between sessions
Claude Code has no memory across sessions. Every /clear or new session starts from zero. In a long project, you will make dozens of decisions: this column gets log-transformed, this feature is out of scope, imputation happens before scaling, not after. Those decisions live in the conversation. Once the conversation ends, they are gone. In the next session, Claude makes different choices — reasonable ones, but different. You spend time re-explaining, and sometimes you don’t realize the choices diverged until the model performance changes.
direction: right
s1: Session 1 {
direction: down
d1: log-transform income
d2: impute before scaling
d3: drop correlated features
}
gap: /clear\n(context gone)
s2: Session 2 {
direction: down
d1: standard scaling
d2: scale before impute
d3: keep all features
}
s1 -> gap
gap -> s2
Misaligned assumptions
Claude will make assumptions when your prompt leaves gaps. When you say “build a feature engineering pipeline,” Claude assumes something about the input data format, the output schema, the sklearn version, whether the result should be a Pipeline object or a plain function. Those assumptions are reasonable in the abstract. They may not match your actual system. You typically discover the mismatch 200 lines in, when the pipeline fails to fit inside your training loop.
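A sketch of the two sides of that mismatch, assuming an sklearn-style setup; the names are illustrative, not prescriptive.

    # What Claude reasonably produces: an sklearn Pipeline, fit/transform style.
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    preprocess = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])

    # What your training loop was built around (invented name, for illustration):
    # a plain function that takes a numpy array and returns a numpy array.
    def build_features(X):
        ...

    # Neither interface is wrong on its own. The mismatch only surfaces when
    # preprocess.fit_transform(...) has to slot into code written for build_features().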
direction: right
prompt: "build a feature\nengineering pipeline"
claude: Claude assumes {
direction: down
a1: input is a DataFrame
a2: returns a Pipeline object
a3: sklearn 1.x API
}
actual: Your system expects {
direction: down
a1: input is a numpy array
a2: returns a plain function
a3: sklearn 0.24 API
}
conflict: Mismatch found\nat line 200
prompt -> claude
prompt -> actual
claude -> conflict
actual -> conflict
Non-reproducible process
You finish a session and the pipeline works. The model trains. The evaluation looks good. Two weeks later, a colleague asks how the pipeline handles outliers. You look at the code and can’t tell — there’s no comment, no doc, and the conversation where you worked it out is gone. The code is the only artifact, and code doesn’t explain the decisions behind it. A conversation is not a specification.
direction: right
session: Session {
direction: down
c1: "clip outliers at p99?"
c2: "yes, clip them"
c3: "or winsorize instead?"
c4: "let's winsorize"
}
clear: /clear
code: pipeline.py {
direction: down
l1: winsorize(df, limits=0.01)
}
question: "Why winsorize?\nWhat threshold?"
session -> clear: conversation ends
clear -> code: only artifact remains
code -> question: no answer
No review surface
A pull request with fifty changed files and no written specification is difficult to review. A reviewer can read the code, but they can’t check “is this what the team agreed to?” without a written description of intent. The only source of truth is the conversation, and conversations don’t diff.
direction: right
pr: Pull request\n(50 files changed)
reviewer: Reviewer
spec: Written spec {
shape: document
}
question: "Is this what\nwe agreed to?"
pr -> reviewer
reviewer -> spec: looks for
spec -> question: does not exist
reviewer -> question: can't answer
The cost for DS/MLEs specifically
Software engineers face these problems too, but data scientists face them more acutely. Here is why.
Pipelines have implicit data contracts. A function that processes a pandas DataFrame embeds assumptions about column names, dtypes, the presence or absence of nulls, the range of values, and the meaning of each column. None of that is in the function signature. When Claude writes a pipeline based on a vague description, those assumptions get baked in silently. If they don’t match the actual data, the pipeline may run without error and produce subtly wrong results — wrong features, silent data leakage, or an output schema that breaks downstream training.
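A sketch of what an implicit contract looks like in practice; the column names and the reference date are invented.

    import numpy as np
    import pandas as pd

    def add_income_features(df: pd.DataFrame) -> pd.DataFrame:
        # Implicit contract, none of it visible in the signature:
        #   - an "income" column exists, is numeric, and is strictly positive
        #     (np.log of zero or negative values yields -inf/NaN, not an error)
        #   - nulls were already imputed upstream
        #   - "signup_date" is already parsed as a datetime
        out = df.copy()
        out["log_income"] = np.log(out["income"])
        out["tenure_days"] = (pd.Timestamp("2024-01-01") - out["signup_date"]).dt.days
        return out

Every one of those assumptions is a decision someone made; none of them is recorded anywhere the next reader, or the next Claude session, will see.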
“The code works” is not the same as “the experiment is valid.” In software, a passing test suite is usually sufficient evidence that the code does what it should. In machine learning, you also need to know that the preprocessing is correct, the train/test split is sound, the label is defined consistently, and the evaluation metric matches the business goal. Claude can write code that runs correctly and still get these things wrong if the spec doesn’t address them.
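A minimal sketch of the gap between “runs” and “valid”, using the classic example of fitting preprocessing on the full dataset before splitting; the data here is synthetic.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(500, 4)), rng.integers(0, 2, 500)

    # Runs without error, and a test suite could pass -- but the scaler is fit on
    # the full dataset, so test-set statistics leak into the training features.
    X_scaled = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

    # The valid version: fit preprocessing on the training split only.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

Both versions produce a model and a metric; only one of them produces an evaluation you can trust.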
Experiments are inherently exploratory. You often don’t know exactly what you want until you’ve tried a few things. That’s fine — exploration is part of the work. But there’s a difference between exploring deliberately and exploring because you never wrote down what you were trying to build. The basic workflow conflates the two. A spec lets you keep the exploration but eliminate the accidental ambiguity.
Decisions accumulate invisibly. A feature engineering pipeline might encode thirty decisions: which features to include, how to handle each type of missing value, what transformations to apply, what the output schema looks like. In the basic workflow, those decisions get made implicitly during the session and never written down. When a model unexpectedly degrades on new data, you need to audit those decisions. Without a written spec, you’re auditing code that may not fully reflect the intent behind it.
What would solve this?
The failure modes above have a common root: there is no shared, written artifact that describes what should be built before Claude starts building it. The conversation substitutes for that artifact, and conversations are a poor substitute — they’re ephemeral, hard to review, and don’t force clarity upfront.
The solution is a specification: a structured, written description of the task that exists before any code is written. Not a vague README paragraph. Not a list of bullet points in a comment. A document that answers the hard questions explicitly — inputs, outputs, constraints, edge cases, what’s out of scope — in a form that both you and Claude can treat as a contract.
When Claude has a spec, it can implement against it rather than guess from your description. It can flag ambiguities in the spec before writing code that embeds a wrong assumption. It can validate its own output against the acceptance criteria. And when the session ends, the spec persists — so the next session starts with context, not from scratch.
This is Spec-Driven Development. The next lesson covers what it actually looks like: how a spec is structured, what belongs in one, and how to write specs that work well with Claude Code.