Spec-Driven AI Development: Building Koog Agent for Mobile Test Failure Analysis with Claude Skill

How spec-driven development (SDD) sessions produce better architecture, fewer rewrites, and a codebase you actually understand, without cognitive overload during development.

My Experience

Developing code with AI has become standard practice for many developers today. Scrolling through my LinkedIn feed one day, I came across a post about open-source Claude Skills shared by Addy Osmani. The repo had thousands of stars and had quickly become one of the most popular GitHub repositories dedicated to AI coding skills. One of them caught my eye: the Spec-Driven Development Skill.

I wasn’t coming to this completely cold. I’d already completed the Spec-Driven Development with Coding Agents certificate on DeepLearning.AI, and had been inspired by Martin Fowler’s team’s hands-on write-up of the same methodology. So I decided to put it to the test myself — building a simple Koog AI Agent running against a local Ollama LLM.

This article reflects my experience with spec-driven development and a maximum shift-left approach, where quality is embedded into the specification before a single line of code is generated.

A few honest thoughts about how it actually went:

  1. I started writing the spec, plan, and tasks on my iPhone — from the bathroom.
  2. I genuinely didn’t expect to do serious software architecture while enjoying a bath.
  3. Once the specs were generated, I realised this was actually going somewhere — so I got out and sat down to implement all 14 AI-generated tasks properly.
  4. Connecting Claude CLI in IntelliJ IDEA and generating the project took just a couple of minutes.
  5. Breaking the work into 14 small tasks made the whole process surprisingly enjoyable — generate, review, tweak, run unit tests, repeat. No cognitive overload.

By the end, I was genuinely surprised: everything worked. The Koog agent ran correctly against a local Ollama LLM from the first try.

The Project: Why This, Why Now

For this experiment, I chose to build a simple project using the Koog AI Framework. Koog comes with plenty of examples and makes it straightforward to write a new AI agent from scratch. My instinct as a developer is always to start simple — understand the full flow first before tackling anything complex.

So I picked a problem I know well from my mobile testing work: analysing failed test logs produced by XCUITest or Espresso. The basic use case looks like this:

  1. Test cases run on CI.
  2. The test framework produces results, including a failure log for each failed test.
  3. My CLI tool takes a single log file and extracts the relevant lines for analysis.
  4. An AI Agent analyses the issue and classifies it into a known failure category — or flags it as unknown.
  5. Once all logs are analysed, you can produce an aggregated report grouped by issue category.

One thing worth being upfront about: this is a learning project, not a production tool. Existing platforms like ReportPortal and Allure already have built-in features — deterministic regex parsing, AI classification, rich dashboards. My goal here was different. I wanted to answer a few personal questions:

  • How do you actually wire all the pieces together into a working AI agent?
  • What is the cognitive load of building a simple project this way?
  • How fast can you ship an AI agent within the Kotlin ecosystem?
  • Where does spec-driven development genuinely help — and where doesn’t it?

Below is my short journey through that process. A few honest notes before you read on:

  1. The content below is AI-assisted. The article structure and prose were generated with AI help to simplify the editing process.
  2. Almost all code was generated by Claude using the spec-driven skill — including SPEC.md and the individual spec prompts for each of the 14 tasks.
  3. Don’t try to follow this as a step-by-step tutorial in one sitting. Reading a full SDD session as a document is not the same as doing it. Instead, I’d encourage you to try the SDD Claude skill on your own small project or feature. Learning by doing is the only way to understand whether this approach works for you.

The Spec-Driven Development Skill

The workflow is built around a published agent skill by Addy Osmani — a structured four-phase gate process for AI-assisted development:

SPECIFY → PLAN → TASKS → IMPLEMENT

The key insight is that each phase is a gate, not a suggestion. You do not move from SPECIFY to PLAN until the spec is agreed. You do not move from PLAN to TASKS until the plan is locked. You do not write a single line of code until the task list exists.

This discipline prevents the most painful failure mode of AI coding sessions: confident divergence from intent. An AI that starts coding before the spec is clear will produce code that is internally consistent but externally wrong. The gate structure forces the assumptions to surface before they become bugs.

Phase 1: SPECIFY

The specification phase begins not with writing a spec, but with surfacing assumptions. Before producing any document, the AI explicitly states what it believes to be true about the project and asks the developer to correct any mismatches.

For log-analyst-koog, the assumption surfacing looked like this:

  1. Greenfield Kotlin project — not constrained by the Java version’s design decisions.
  2. Core ‘shrinking AI layer’ philosophy is a first-class architectural constraint.
  3. Target is CLI tool, not a library or service.
  4. Ollama local LLM remains the backend.
  5. This is a solo open-source side project.

Three targeted questions shaped the scope before a word of spec was written:

  • What is the primary input source for logs? (Local files)
  • Who is the primary user? (Solo personal productivity tool)
  • What is the most important new design goal? (Better Kotlin-native architecture)

Two more questions locked the v1 boundary:

  • Which platforms? (Both iOS and Android)
  • Analysis mode? (Single log file at a time)

This is the discipline that most developers skip. The questions feel obvious in retrospect. But without them, an AI will make its own assumptions — and its assumptions optimise for code that compiles, not code that solves your actual problem.

The resulting spec covered six areas: objective, tech stack, commands, project structure, domain model, and boundaries. The boundaries section is particularly important — it defines three tiers of behaviour:

Always: Deterministic classifier runs before any LLM call. LogAnalysis always includes evidence. classifiedBy is always populated.
Ask first: Adding a new FailureCategory. Changing the LogAnalysis model shape. Introducing a new Gradle dependency.
Never: Call Ollama on every analysis. Commit real production log files. Return UNKNOWN without populating rawSummary.

These boundaries encode the architecture’s constraints as explicit rules rather than implicit expectations — making them enforceable in every subsequent prompt.

Phase 2: PLAN

With the spec agreed, the plan defines implementation order and verification checkpoints. The critical decision made here was to reject parallel work streams in favour of a strict sequential flow:

model → parser → deterministic classifier → report writer →
CLI → koog scaffolding → llm classifier → classifier pipeline →
fixture tests → mock-llm integration test → CLI polish

This sequence was deliberate. Stage 1 (Tasks 1–8) produces a fully working tool with zero AI involvement. The deterministic classifier handles known patterns. The CLI wires everything together. A user can run the tool and get results before a single line of Koog code is written.

This matters because it validates the architecture before introducing the most complex dependency — the AI framework. If the pipeline design is wrong, you discover it at Task 6 (deterministic classifier composition), not at Task 11 (classifier pipeline with LLM). The cost of rework is proportional to how late you discover the problem.

Three verification checkpoints anchored the plan:

  • After Stage 1: analyse --file ios_sample.log prints valid JSON to stdout
  • After Stage 2: an unknown log pattern triggers Ollama and returns an LLM-classified result
  • After Stage 3: ./gradlew test is fully green, with no Ollama call in deterministic test cases
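
For orientation, the Stage 1 output might look roughly like this. It is an illustrative sample I constructed from the LogAnalysis fields described later in the article, not output captured from a real run:

{
  "platform": "IOS",
  "failureCategory": "ELEMENT_NOT_FOUND",
  "confidence": 1.0,
  "classifiedBy": "DETERMINISTIC",
  "evidence": ["Unable to find element matching predicate ..."],
  "recommendedAction": "Check the accessibility identifier of the queried element",
  "rawSummary": "XCUITest query failed before the assertion ran"
}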

Phase 3: TASKS

The task list is the spec translated into implementation units. Each task has exactly three components: acceptance criteria, verification method, and files affected.

The original 14-task list went through one refinement during planning: Task 9 (Koog agent scaffolding) was split into 9a and 9b when it became clear that the agent scaffolding and the JSON response correction pipeline were distinct concerns with different testing strategies.

Task 9a: Koog agent scaffolding + prompt
- Agent connects to Ollama
- Prompt instructs JSON-only output
- Files: LogAnalystAgent.kt, AnalysisPrompt.kt
Task 9b: JSON response parser + correction
- Strip markdown fences
- Fix trailing commas
- Fallback to UNKNOWN on unrecoverable failure
- Files: JsonResponseParser.kt
- Unit tests: valid JSON, malformed JSON, garbage input

Task 9b deserved its own unit tests immediately — not in Stage 3 — because the correction logic is deterministic and independently testable without any LLM dependency. This is a pattern worth noting: when a task contains logic that is both deterministic and non-trivial, it should have tests at the task level rather than waiting for the hardening stage.
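
To make the Task 9b logic concrete, here is a minimal Kotlin sketch of the kind of correction code the task describes. The object name matches the file named in the task, but the exact regexes and recovery steps are my illustration of the spec, not the code generated for the project:

// Illustrative sketch of the Task 9b correction pipeline, not the generated file.
object JsonResponseParser {

    // Tries to recover a usable JSON object from a raw LLM response.
    // Returns null when nothing recoverable is found, so the caller can
    // fall back to an UNKNOWN classification.
    fun extractJson(raw: String): String? {
        val withoutFences = raw
            .replace(Regex("```(json)?", RegexOption.IGNORE_CASE), "") // strip markdown fences
            .trim()

        val start = withoutFences.indexOf('{')
        val end = withoutFences.lastIndexOf('}')
        if (start == -1 || end <= start) return null // no JSON object present at all

        return withoutFences
            .substring(start, end + 1)
            .replace(Regex(",\\s*([}\\]])"), "\$1") // drop trailing commas before } or ]
    }
}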

The final task list was 15 tasks, with Task 15 added for documentation — a README.md targeting beginner users, covering what the project does, command-line usage with realistic examples, project structure as an ASCII tree, and a tech stack table pulled from build.gradle.kts.

Phase 4: IMPLEMENT

Implementation follows a strict per-task prompt discipline. Each prompt to Claude CLI contains:

  1. Project context — one sentence describing the project
  2. Task declaration — which task and what it produces
  3. Context files — explicit list of files to read before writing anything
  4. Spec — the acceptance criteria from the task list, expanded with implementation details
  5. Style rules — idiomatic Kotlin constraints, applied consistently

The style rules deserve particular attention. Without them, an AI will produce code that compiles but doesn’t fit the codebase. For this project the rules were:

  • Expression bodies where natural
  • when as an expression, not a statement
  • Extension functions over utility classes
  • No Builder pattern
  • No Java-isms (no checked exception patterns, no verbose null handling)
  • PascalCase classes, camelCase functions, SCREAMING_SNAKE constants
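
To show what those rules look like in practice, here is a tiny example of my own (the Platform.displayName() helper is hypothetical, not project code): an extension function with an expression body and when as an expression, instead of a utility class and an if/else chain.

// Hypothetical helper, shown only to illustrate the style rules above.
fun Platform.displayName(): String = when (this) {
    Platform.IOS -> "iOS"
    Platform.ANDROID -> "Android"
}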

The full set of spec prompts isn’t included here — they are auto-generated resources produced by applying the Claude skill. Here is the Task 4 prompt as a concrete example:

I'm building a Kotlin CLI tool for mobile test log analysis 
called log-analyst-koog.

Implement Task 4: Deterministic classifier — iOS rules.

Context files to read first:
- src/main/kotlin/model/LogAnalysis.kt
- src/main/kotlin/model/FailureCategory.kt
- src/main/kotlin/model/ClassifierSource.kt
- src/main/kotlin/parser/ParsedLog.kt
- src/main/kotlin/classifier/deterministic/IosRuleClassifier.kt

Spec:
Implement IosRuleClassifier.classify(log: ParsedLog): LogAnalysis?
- Returns null if no rule matches — signals LLM fallback needed
- ASSERTION_FAILURE: match XCTAssertEqual, XCTAssertTrue, XCTAssertFalse...
- ELEMENT_NOT_FOUND: match "Unable to find", "No matches found"...
- TIMEOUT: match "Exceeded timeout", "timed out", "waitForExistence"...
- confidence = 1.0f always for deterministic
- classifiedBy = ClassifierSource.DETERMINISTIC
- evidence = matching lines from log.relevantLines (never empty)
- No LLM calls, no coroutines, no suspend functions

Style: idiomatic Kotlin, when-as-expression, extension functions
where natural, no Java-isms, no Builder pattern.

Write unit tests in: src/test/kotlin/classifier/IosRuleClassifierTest.kt
- 3 positive tests, one per category
- 3 negative tests confirming null on non-matching content
- Test naming: given_[signal]_when_classify_then_returns_[category]()
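
For contrast, here is roughly the shape of code that prompt asks for. This is a condensed sketch I wrote from the acceptance criteria, assuming the model types from earlier tasks; it is not the file Claude actually generated:

// Condensed sketch of the expected Task 4 result, not the generated file.
object IosRuleClassifier {

    private val rules: Map<FailureCategory, List<String>> = mapOf(
        FailureCategory.ASSERTION_FAILURE to listOf("XCTAssertEqual", "XCTAssertTrue", "XCTAssertFalse"),
        FailureCategory.ELEMENT_NOT_FOUND to listOf("Unable to find", "No matches found"),
        FailureCategory.TIMEOUT to listOf("Exceeded timeout", "timed out", "waitForExistence")
    )

    // Null means no deterministic rule matched and the LLM fallback is needed.
    fun classify(log: ParsedLog): LogAnalysis? = rules.entries
        .firstNotNullOfOrNull { (category, signals) ->
            val evidence = log.relevantLines.filter { line -> signals.any { line.contains(it) } }
            if (evidence.isEmpty()) null
            else LogAnalysis(
                platform = Platform.IOS,
                failureCategory = category,
                confidence = 1.0f,
                classifiedBy = ClassifierSource.DETERMINISTIC,
                evidence = evidence,
                recommendedAction = "Review the failing assertion or locator", // placeholder text
                rawSummary = evidence.first()
            )
        }
}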

Token Efficiency

A methodology that produces better results but consumes ten times the tokens is not practical. The workflow includes explicit token management techniques:

  1. /clear between tasks. Each task is independent. Conversation history from Task 4 is dead weight during Task 5. Starting each task with a clean context window prevents the previous session’s code from inflating the token count.
  2. Reference files, don’t paste them. Claude CLI reads project files directly. The prompt lists file paths; the AI fetches them. This avoids duplicating potentially large files as prompt text.
  3. Style rules by reference after the pattern is established. By Task 5 the codebase establishes its conventions. Instead of repeating six style rules, a single line suffices: “Match the style of IosRuleClassifier.kt exactly.”
  4. Split implementation from tests. For complex tasks, two focused calls are cheaper than one long call that produces and refines both in the same session.
  5. /compact for debugging sessions. When a failing test requires iterative back-and-forth, /compact summarises the conversation history without losing context, reducing the token overhead of each subsequent message.

The Domain Model as Architectural Contract

One decision worth examining in detail is the treatment of LogAnalysis as an architectural contract rather than a data container.

data class LogAnalysis(
    val platform: Platform,
    val failureCategory: FailureCategory,
    val confidence: Float,
    val classifiedBy: ClassifierSource,
    val evidence: List<String>,
    val recommendedAction: String,
    val rawSummary: String
) {
    init {
        require(confidence in 0.0f..1.0f) {
            "confidence must be in range [0.0, 1.0], got $confidence"
        }
        require(evidence.isNotEmpty() || failureCategory == FailureCategory.UNKNOWN) {
            "evidence must not be empty unless category is UNKNOWN"
        }
    }
}

The init block encodes two spec boundaries as runtime constraints. This means every AI-generated implementation that violates these rules fails immediately at the call site — not silently, not in production, but at the moment the object is constructed. This is the difference between a spec that lives in a document and a spec that lives in the code.
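
For example, a hypothetical test (using kotlin.test; not part of the project’s test suite) shows the boundary firing at construction time:

import kotlin.test.Test
import kotlin.test.assertFailsWith

class LogAnalysisContractTest {

    // require() in the init block throws IllegalArgumentException as soon as
    // an out-of-range confidence value is supplied.
    @Test
    fun given_confidence_above_one_when_constructing_then_throws() {
        assertFailsWith<IllegalArgumentException> {
            LogAnalysis(
                platform = Platform.IOS,
                failureCategory = FailureCategory.TIMEOUT,
                confidence = 1.5f, // violates the [0.0, 1.0] boundary
                classifiedBy = ClassifierSource.DETERMINISTIC,
                evidence = listOf("Exceeded timeout waiting for element"),
                recommendedAction = "n/a",
                rawSummary = "n/a"
            )
        }
    }
}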

The classifiedBy field is similarly intentional. Every LogAnalysis carries a record of whether a deterministic rule or the LLM produced it. This transparency is essential for the pattern promotion pipeline — the v2 feature where repeated LLM classifications are graduated into deterministic rules by the developer. Without classifiedBy, you cannot identify which results are candidates for promotion.
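
A hypothetical v2 helper makes the point: without the classifiedBy field there would be nothing to filter on. The function name and the threshold are mine, not part of the project:

// Hypothetical sketch of the v2 promotion idea: find categories the LLM keeps producing.
fun promotionCandidates(analyses: List<LogAnalysis>): Map<FailureCategory, Int> =
    analyses
        .filter { it.classifiedBy == ClassifierSource.LLM }
        .groupingBy { it.failureCategory }
        .eachCount()
        .filterValues { it >= 3 } // arbitrary threshold for "seen often enough to promote"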

The Architecture That Emerges

The methodology produces a pipeline that is simple to understand, simple to extend, and simple to test:

LogFile
   ↓
LogParser
├─ detectPlatform() ← weighted signal scoring, IOS vs ANDROID
└─ extractRelevantLines() ← noise stripped, failure signals kept, capped at 50
   ↓
ClassifierPipeline
├─ DeterministicClassifier ← always runs first, zero LLM cost
│  ├─ IosRuleClassifier ← ASSERTION_FAILURE, ELEMENT_NOT_FOUND, TIMEOUT
│  └─ AndroidRuleClassifier ← APP_CRASH, ASSERTION_FAILURE, ELEMENT_NOT_FOUND
│  hit → LogAnalysis (confidence=1.0, classifiedBy=DETERMINISTIC)
│  miss ↓
└─ LlmClassifier ← Koog agent, Ollama, qwen2.5-coder:7b
   └─ JsonResponseParser ← correction pipeline, fallback to UNKNOWN
   ↓
LogAnalysis (classifiedBy=LLM)
   ↓
JsonReportWriter → stdout or file

Every component has a single responsibility. Every boundary is explicit. The LLM sits at the end of the pipeline, not the beginning — and it fires only when the deterministic layer has already failed.
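
The deterministic-first fallback at the centre of the diagram reduces to a few lines. A minimal sketch, assuming the classifier types above (signatures are illustrative):

// Minimal sketch of the deterministic-first fallback; signatures are illustrative.
class ClassifierPipeline(
    private val deterministic: DeterministicClassifier,
    private val llm: LlmClassifier
) {
    // The LLM is consulted only when every deterministic rule has already missed.
    suspend fun classify(log: ParsedLog): LogAnalysis =
        deterministic.classify(log) ?: llm.classify(log)
}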

What This Methodology Is Not

It is not a replacement for engineering judgment.

The methodology structures the conversation with the AI; it does not eliminate the need to review the output. Task 2 produced a weighted signal scoring approach for platform detection that was better than the first-match when expression in the original stub — but recognising that required understanding why weighted scoring is more robust for ambiguous log content.
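
A minimal sketch of the difference, with made-up signal strings and weights: a first-match when expression returns on the first signal it sees, while weighted scoring lets several weaker signals accumulate before the platform is decided.

// Illustrative weighted signal scoring for platform detection; signals and weights are made up.
private val iosSignals = mapOf("XCUIApplication" to 3, "XCTest" to 3, "Simulator" to 1)
private val androidSignals = mapOf("Espresso" to 3, "androidx.test" to 3, "ActivityScenario" to 1)

fun detectPlatform(lines: List<String>): Platform {
    fun score(signals: Map<String, Int>): Int =
        lines.sumOf { line -> signals.entries.sumOf { (signal, weight) -> if (signal in line) weight else 0 } }
    return if (score(iosSignals) >= score(androidSignals)) Platform.IOS else Platform.ANDROID
}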

It is not suitable for exploratory work.

When the goal is to understand a problem space rather than implement a known solution, the gate structure creates overhead without benefit. The methodology works best when the architecture is understood and the task is specific.

It is not a one-size-fits-all prompt template.

The prompts in this article are calibrated for this specific project. The principles — surface assumptions, lock the spec, define done precisely, separate concerns into tasks — transfer to any project. The specific constraints do not.

Summary

The methodology can be summarised in four principles:

1. Surface assumptions before writing anything.

The most expensive bugs are the ones that result from wrong assumptions made at the start. Explicit assumption surfacing catches them before they become code.

2. Spec boundaries are first-class constraints.

The “Always / Ask first / Never” tier structure encodes architectural decisions as enforceable rules, not suggestions.

3. Sequential implementation order reduces rework cost.

Build the foundation first. Validate it before introducing complex dependencies. Discover design problems early.

4. Precise prompts produce precise code.

Every constraint in a prompt should map to a decision in the spec. If you cannot justify a constraint, question whether the spec is complete.

Notes:

  1. The project described in this article — log-analyst-koog — is an open-source Kotlin CLI tool for AI-augmented mobile test failure analysis
  2. The specification, task list, and implementation prompts referenced throughout this article are committed to the repository as SPEC.md
  3. Claude Spec Skill used for development with Claude CLI: SKILL.md
  4. Koog AI Agents examples
