And shipped it to a hackathon the same night.
There’s a certain kind of Friday evening energy — the kind where instead of opening Netflix, you open a terminal and think: how fast can I build something real?
That’s how the API Contract Debugger was born. It’s a reinforcement learning environment built for the Meta × PyTorch OpenEnv Hackathon, where AI agents learn to debug broken OpenAPI specifications by proposing targeted, field-level corrections. From blank repo to deployed HuggingFace Space in roughly four hours.
This is the full story: the problem, the architecture, the sharp edges, and the lessons I want to remember.
The Problem: API Contracts Break All the Time, and It’s Boring to Fix Them
Any backend engineer who has worked across team boundaries knows this pain intimately. You consume an API. The contract says the product_id field is an integer. The actual response sends a string. The DELETE /orders/{id} endpoint says it returns 204 No Content but returns 200 with a body. The POST /auth/login response silently leaks a password_hash field that was never supposed to be there.
These aren’t exotic bugs. They’re the everyday friction of distributed systems — mismatches between what a spec says and what an implementation does. Debugging them is tedious, mechanical, and follows clear rules: find the violation, identify the right fix, apply it precisely.
That mechanical structure is exactly what makes it a perfect RL task.
The agent receives a broken spec and a list of violations. Each step, it proposes exactly one fix — add a missing field, remove a forbidden one, change a field’s type, or correct a status code. It gets rewarded for each violation it resolves, penalised for each new violation it introduces, and gets a completion bonus if it clears everything.
The environment is deterministic, the success criteria are unambiguous, and the difficulty is tunable. This is not a toy problem — this is a benchmark that models how real engineering judgment works.
Architecture: The Full Stack
The project is organized as a clean Python package with a FastAPI HTTP server that implements the OpenEnv interface.
api-contract-debugger/
├── server/
│ ├── app.py # FastAPI app, route registration
│ ├── environment.py # OpenEnv Environment subclass
│ ├── models.py # Pydantic Action / Observation / State
│ ├── graders.py # Violation detection + reward shaping
│ └── fixtures.py # Task definitions (broken + golden specs)
├── frontend/ # Next.js dark-themed dashboard UI
├── tests/
│ └── test_env.py # 56 unit tests
├── inference.py # Baseline LLM-powered agent
└── openenv.yaml # OpenEnv metadata
The OpenEnv Interface
OpenEnv defines a clean three-method contract for RL environments: reset(), step(), and state(). All observations, actions, and state are Pydantic models — typed, validated, and serializable.
# Action space — five kinds of targeted fixes
from typing import Any, Dict, Literal, Optional, Union
from pydantic import BaseModel

class DebugAction(BaseModel):
    kind: Literal["add_field", "remove_field", "change_type", "change_status", "no_op"]
    endpoint_index: int
    location: Literal["request_body", "response_body", "status_code"]
    field_name: Optional[str] = None
    new_value: Optional[Union[str, int, Dict[str, Any]]] = None
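The three-method contract itself is easy to picture with a self-contained toy. This is not OpenEnv's actual base class — the class name, the state shape, and the use of plain strings as actions are all stand-ins — but it shows the reset/step/state lifecycle the server implements:

```python
# Toy illustration of the reset()/step()/state() contract.
# OpenEnv's real Environment base class and signatures may differ;
# everything here beyond the three method names is an assumption.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToyState:
    violations: List[str] = field(default_factory=list)
    step_count: int = 0
    done: bool = False

class ToyDebugEnv:
    def reset(self) -> ToyState:
        # Start a fresh episode with one outstanding violation.
        self._state = ToyState(violations=["missing_field:created_at"])
        return self._state

    def step(self, action: str) -> ToyState:
        # Resolve the violation if the action names it exactly.
        self._state.step_count += 1
        if action in self._state.violations:
            self._state.violations.remove(action)
        self._state.done = not self._state.violations
        return self._state

    def state(self) -> ToyState:
        # Current state on demand, without advancing the episode.
        return self._state
```

The real environment does the same dance, except the state is a full spec-plus-violations snapshot and the action is a validated `DebugAction`.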
The observation the agent receives is rich: the current (partially fixed) endpoint specs, the remaining violations, how many violations its last action resolved or introduced, its step budget, and whether the episode is done.
class DebugObservation(BaseModel):
    task_name: str
    task_description: str
    endpoints: List[Dict[str, Any]]
    violations: List[Dict[str, Any]]
    violations_fixed_this_step: int
    violations_introduced_this_step: int
    total_violations_at_start: int
    step_count: int
    max_steps: int
    last_action_error: Optional[str]
    reward: float
    done: bool
The agent always knows exactly where it stands. No hidden state, no ambiguous observations.
Violation Detection and Reward Shaping
This was the most important design decision. The grader needs to be deterministic, fair, and nuanced enough to actually teach something.
Every violation has a severity weight that reflects how bad it is in practice:
| Violation Type | Severity | Reasoning |
| --- | --- | --- |
| missing_field | 1.0 | Required contract element is absent |
| wrong_type | 0.9 | Data arrives in wrong shape — breaks parsing |
| wrong_status | 0.8 | HTTP semantics violated |
| extra_field | 0.7 | Data leakage — bad but recoverable |
The per-step reward:
- +0.2 × severity for each violation fixed
- −0.15 × severity for each violation introduced
- −0.05 for a malformed action
- +0.5 bonus for clearing all violations
This is dense reward shaping — the agent receives useful signal at every step, not just at the end of the episode. This matters enormously for training. A binary end-of-episode reward tells the agent nothing about whether it’s getting warmer.
The final score is normalized to [0.0, 1.0] by grade_episode(), which makes it comparable across tasks with different violation counts.
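Putting the severity table and the per-step rules together, a single step's reward works out like this. The sketch below mirrors the numbers stated above; the actual `grade_episode()` implementation in `graders.py` isn't shown in this post, so treat this as a reconstruction, not the shipped code:

```python
# Hedged sketch of the per-step reward shaping described above.
# SEVERITY mirrors the severity table; the real graders.py logic
# is not reproduced here, so this is an illustration of the rules,
# not the actual implementation.
SEVERITY = {"missing_field": 1.0, "wrong_type": 0.9,
            "wrong_status": 0.8, "extra_field": 0.7}

def step_reward(fixed, introduced, malformed=False, cleared_all=False):
    """Compute one step's reward from lists of violation kinds."""
    r = 0.2 * sum(SEVERITY[v] for v in fixed)        # reward each fix
    r -= 0.15 * sum(SEVERITY[v] for v in introduced)  # penalize regressions
    if malformed:
        r -= 0.05                                     # invalid action
    if cleared_all:
        r += 0.5                                      # completion bonus
    return r
```

Worked example: fixing the one `missing_field` violation on the easy task clears the board, so the step earns 0.2 × 1.0 + 0.5 = 0.70 — which is exactly the reward the baseline log shows for the easy task.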
Three Difficulty Tiers
The tasks are designed so that easy is solvable in a single step and hard genuinely challenges frontier models.
Easy — 1 endpoint, 1 violation, 5-step budget: A user registration endpoint is missing created_at in its response. One add_field action solves it perfectly. Score: 1.0.
Medium — 3 endpoints, 3 violations, 10-step budget: An e-commerce API with a type error (product_id should be integer, not string), another type error (quantity accepted as string), and a wrong status code (DELETE returning 200 instead of 204). Requires three targeted fixes across three different endpoints. Score: 1.0 for capable models.
Hard — 4 endpoints, 6 violations, 15-step budget: An auth + profile API with two violations on the login endpoint (missing refresh_token, wrong type on expires_in), two on the profile GET (missing created_at, forbidden password_hash leaking), and two on the profile PATCH (wrong status code, missing updated_at). This requires planning — the agent has to manage both addition and removal violations across multiple endpoints. Expected score for frontier models: 0.7–1.0.
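For a concrete feel of what a task fixture might contain, here is a hypothetical shape for the easy task. The real `fixtures.py` structure isn't shown in this post, so the key names and nesting below are assumptions chosen to match the fields `DebugAction` and `DebugObservation` expose:

```python
# Hypothetical easy-task fixture; the actual fixtures.py schema is
# not shown in the post, so every key name here is an assumption.
EASY_TASK = {
    "name": "easy",
    "max_steps": 5,
    "endpoints": [{
        "method": "POST",
        "path": "/users/register",
        "response_body": {"id": {"type": "integer"},
                          "email": {"type": "string"}},
        "status_code": 201,
    }],
    # The golden spec differs from the broken one by a single field,
    # which is exactly what the lone violation encodes:
    "violations": [{
        "kind": "missing_field",
        "endpoint_index": 0,
        "location": "response_body",
        "field_name": "created_at",
        "expected": {"type": "string", "required": True},
    }],
}
```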
FastAPI and Route Registration Order
One gotcha worth documenting: when you use OpenEnv’s HTTPEnvServer, route registration order matters.
The OpenEnv framework has two modes — DEVELOPMENT and PRODUCTION. In development mode, it registers its own /reset, /step, and /state routes. In production mode, it only adds /health, /schema, /metadata, and /ws.
The pattern that works:
# 1. Register your stateful routes FIRST
@app.post("/reset")
async def reset(req: ResetBody) -> Dict[str, Any]: ...

@app.post("/step")
async def step(req: StepBody) -> Dict[str, Any]: ...

# 2. Attach OpenEnv framework routes LAST, in PRODUCTION mode
_server = HTTPEnvServer(env=_get_env, action_cls=DebugAction, ...)
_server.register_routes(app, mode=ServerMode.PRODUCTION)
Do it the other way around and the framework's routes shadow yours — requests that should hit your handlers quietly land in the framework's, which is confusing to debug.
Inference Script: Making the Agent Talk
The inference script uses the OpenAI-compatible API client to drive a language model against the environment. The required stdout format is strict:
[START] task=easy env=api_contract_debugger model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action={"kind":"add_field",...} reward=0.70 done=true error=null
[END] success=true steps=1 score=1.000 rewards=0.70
The prompt construction is the interesting part. The agent needs to understand the current violation list and produce a valid JSON action. The system prompt describes the action space; the user prompt for each step shows the current state.
One thing I got right from the start: the [END] line must always be emitted, even if an exception occurs. This is handled in a finally block, and the score is clamped strictly between 0 and 1 before being logged. Getting this wrong means the evaluator can't score your submission.
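The always-emit pattern looks roughly like this. It's a sketch: `run_episode` and its return tuple are hypothetical stand-ins for the real loop, but the `finally` block and the clamp are the point:

```python
# Sketch of the always-emit [END] pattern. run_episode() and its
# (success, steps, score, rewards) return shape are hypothetical
# stand-ins; the finally-block-plus-clamp structure is the idea.
def run_and_report(run_episode):
    success, steps, score, rewards = False, 0, 0.0, []
    try:
        success, steps, score, rewards = run_episode()
    finally:
        # Clamp to [0, 1] and ALWAYS emit [END], even on a crash,
        # so the evaluator can still score the run.
        score = max(0.0, min(1.0, score))
        print(f"[END] success={str(success).lower()} steps={steps} "
              f"score={score:.3f} "
              f"rewards={','.join(f'{r:.2f}' for r in rewards)}")
    return score
```

If the episode raises, the `finally` block still prints a well-formed `[END]` line with the defaults before the exception propagates.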
The Frontend and Running Everything Together
The project ships a full Next.js frontend — a dark-themed dashboard where you can manually play against the live environment, explore tasks, inspect the current spec state, and apply fix actions without writing a single curl command.
The UI is split into four panels. The top half shows the Current Endpoint Spec (all endpoints with their fields, types, and HTTP status codes) and Active Violations (flagged with tags like WRONG TYPE, WRONG STATUS, each showing exactly which endpoint and field is affected). The bottom half has the Action Builder on the left and a Step Log on the right.
Because the environment is a pure HTTP API, the frontend is completely decoupled. It only needs three endpoints: POST /reset, POST /step, and GET /score. Everything else is UI.
The Action Builder
The Action Builder is where you construct each fix. It has four fields: Action Kind, Endpoint Index, Location, and Field Name — plus a New Value textarea for the payload.
One thing worth understanding is how new_value works across different action kinds, because it's not uniform:
- add_field expects a JSON object: {"type":"string","required":true}
- change_type expects a plain string: integer
- change_status expects a plain number: 204
- remove_field expects nothing — leave it blank
The reason "type" takes a quoted string but "required" takes an unquoted boolean comes down to JSON's own rules: strings need quotes, booleans don't. Writing "required":"true" would pass a string to a boolean field, which is a different type entirely and will fail validation.
The Endpoint Index field is how the backend knows which endpoint to modify when a task has multiple. Index 0 is the first endpoint in the list, 1 is the second, and so on — matching the order they appear in the Current Endpoint Spec panel.
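Putting those rules together, here is what complete actions look like for each kind. The specific endpoint indices and field names below are illustrative (borrowed from the tasks described earlier), not a transcript of a real session:

```python
import json

# Illustrative complete actions, one per kind, following the
# new_value rules above. Indices and field names are examples
# drawn from the tasks described in the post.
actions = [
    {"kind": "add_field", "endpoint_index": 0, "location": "response_body",
     "field_name": "created_at",
     "new_value": {"type": "string", "required": True}},   # JSON object
    {"kind": "change_type", "endpoint_index": 1, "location": "request_body",
     "field_name": "product_id", "new_value": "integer"},  # plain string
    {"kind": "change_status", "endpoint_index": 2, "location": "status_code",
     "field_name": None, "new_value": 204},                # plain number
    {"kind": "remove_field", "endpoint_index": 0, "location": "response_body",
     "field_name": "password_hash", "new_value": None},    # no payload
]
for a in actions:
    print(json.dumps(a))
```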
The dev.sh Script
Running both servers by hand across two terminals is fine once. After the third time it becomes friction. I wrote a dev.sh script that handles the whole thing in one command:
chmod +x dev.sh
./dev.sh
That’s it. The script:
- Checks that Python, uvicorn, and npm are available before doing anything
- Creates frontend/.env.local pointing to the local backend on first run — no manual config
- Runs npm install automatically (skippable with SKIP_INSTALL=1)
- Starts the FastAPI backend with --reload
- Waits for the backend /health endpoint to respond before starting the frontend — so Next.js never boots into a broken state
- Streams colour-coded logs from both processes side-by-side
- Shuts both processes down cleanly on Ctrl+C
Output looks like this:
════════════════════════════════════════
API Contract Debugger — Dev Server
════════════════════════════════════════
Backend API → http://localhost:7860
Frontend UI → http://localhost:3000
API Docs → http://localhost:7860/docs
════════════════════════════════════════
Press Ctrl+C to stop both servers.
The health-check wait loop is the part I’m happiest about. Next.js starts in under a second and immediately tries to hit the backend. Without the wait, you’d see a flash of API errors in the UI on startup. With it, the frontend only opens after the backend is confirmed healthy.
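The same wait-loop idea generalizes beyond shell. dev.sh does this with curl in a loop; a Python version with an injectable probe (so it's testable without a live server) might look like this — a sketch, not the script's actual code:

```python
import time

def wait_for_health(probe, timeout=30.0, interval=0.5):
    """Poll `probe` (a zero-arg callable that returns True once the
    backend is healthy) until it succeeds or `timeout` elapses.
    Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # backend not up yet; treat connection errors as "not ready"
        time.sleep(interval)
    return False
```

In practice `probe` would be something like a `GET /health` request that returns True on a 200 response; passing it in as a callable is what keeps the loop easy to test.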
Deployment: The Rough Parts
HuggingFace Spaces has specific requirements that are easy to get wrong if you’re moving fast.
Missing uv.lock: If your pyproject.toml uses uv for dependency management, the Space build will fail unless uv.lock exists in the repo. Fix: pip install uv && uv lock.
Divergent git histories: If you initialized the HF Space repo separately from your local development repo, you’ll hit a merge conflict when trying to push. Fix: git pull --allow-unrelated-histories --no-rebase origin main, resolve conflicts, then push.
The GET / 404: HuggingFace renders Spaces in an iframe that hits GET / for a preview. Our FastAPI app doesn't have a GET / route (by design — it's an API, not a web app). The 404 in the iframe is expected behavior and not an error. Don't burn time investigating it.
Port: HF Spaces expects your app to listen on port 7860. Make sure your app_port in the README frontmatter matches your Dockerfile EXPOSE and uvicorn configuration.
The openenv validate . command is your ground truth. If it passes, you're good.
What Worked
Pydantic v2 for everything. Using typed models throughout — not just for the OpenEnv interface but for internal state, violations, and fixtures — meant that errors surfaced at definition time, not runtime. The 56 passing tests are only achievable because the data model is so rigid.
Dense reward shaping. Choosing severity-weighted per-step rewards instead of a binary end-of-episode signal was the right call. The signal is informative: the agent can tell immediately whether its last action helped or hurt, and by how much.
Three-tiered difficulty with specific fixtures. Rather than procedurally generating broken specs, the tasks are hand-crafted fixtures with known violations and known solutions. This makes the grader deterministic and reproducible — you always get the same score on the same task with the same sequence of actions.
A frontend that doubles as a debugger. The dark-themed dashboard wasn’t just a demo layer — it turned out to be the fastest way to manually verify that the grader was behaving correctly. Watching the Step Log emit fixed=1 reward=0.700 in real time after a correct action is much faster than reading test output.
What I’d Do Differently
A procedural task generator. The three tasks are good but static. A task generator that randomly places violations across a spec template would make the environment much harder to overfit and more valuable as a training benchmark.
Partial episode resumption. Right now, reset() completely restarts the episode. An option to load a specific violation configuration (by seed or fixture name) would make it easier to test specific agent behaviors.
WebSocket streaming for the inference script. The current inference script polls via HTTP. OpenEnv supports WebSocket connections, which would make the step loop more efficient for agents running at scale.
Baseline Results
| Task | Model | Score | Steps Used |
| --- | --- | --- | --- |
| easy | Qwen2.5-72B-Instruct | 1.000 | 1 |
| medium | Qwen2.5-72B-Instruct | 1.000 | 3 |
| hard | Qwen2.5-72B-Instruct | ~0.85 | 12 |
The easy and medium tasks are reliably solved in the minimum number of steps. Hard is genuinely hard — the model occasionally introduces a new violation while fixing another one, costing it the completion bonus.
Running It Yourself
The environment is live on HuggingFace Spaces. You can hit it directly:
# Health check
curl https://keerthanas1011-api-contract-debugger.hf.space/health
# Reset to easy task and start playing
curl -X POST https://keerthanas1011-api-contract-debugger.hf.space/reset \
-H "Content-Type: application/json" \
-d '{"task_name": "easy"}'
Or clone and run locally:
git clone https://github.com/KeerthanaShivakumar/api-contract-debugger-env
cd api-contract-debugger-env
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
# In a second terminal, start the frontend
cd frontend && npm install && npm run dev
# Backend at :7860, Frontend at :3000
Final Thought
The best benchmarks are the ones that test skills people actually use. API contract debugging is not glamorous — no one writes blog posts about it the way they do about reasoning chains or multimodal capabilities. But it’s work that every engineer does, constantly, and it has clear success criteria: either the contract is satisfied or it isn’t.
That’s exactly why it makes a good RL environment. The task is real, the feedback is tight, and the difficulty scales in a principled way. If a model can score 1.0 on hard, it’s genuinely learned something useful about structured reasoning over specifications.
Shipped it on a Friday. That part felt good too.
The full project is at github.com/KeerthanaShivakumar/api-contract-debugger-env and the live HuggingFace Space is at huggingface.co/spaces/keerthanas1011/api-contract-debugger.
I Built an RL Benchmark for API Contract Debugging in One Friday Evening was originally published in Towards AI on Medium.