SDLC Benchmark & RL

Engineering work, not coding puzzles. Where code meets real operational data.

SDLCbench evaluates AI agents on real software-development work drawn from private enterprise codebases and the operational artifacts around them. Each task resembles a 10–20 hour workload of the kind senior practitioners (software engineers, project managers, QA engineers, and other knowledge workers) perform with harnesses like Claude Code, Codex, and Claude Cowork: sift through email and chat, write code, populate trackers, and draft PRDs, design documents, and architecture docs. The delivered outputs are validated against human solutions.

10–20 hour tasks · Real SDLC workflows · Contamination-free
RL Environments
Calendar
Chat
Mail
File System Manager
Code Execution
PDF Manager
Slides
Sheets
Docs

Why this benchmark

Unlike code-only benchmarks (SWE-bench) or web/agent benchmarks (GAIA, AgentBench, WebArena, τ-bench) that operate on published or synthetic environments, SDLCbench tasks run on private, never-indexed enterprise data and require reconciling multiple source types — code, email, PDF, spreadsheet — in a single trial.

Realism

Real enterprise workloads on real codebases plus the artifacts around them — multi-sprint trackers, hundred-email inboxes, slide decks, ADRs, regulatory PDFs. Not synthetic, not GitHub-indexed boilerplate.

Contamination-free

No task or operational artifact is published or indexed anywhere; models cannot have memorized the source or the answer on any trial.

Cross-tool orchestration

A single trial reconciles signals across mail, PDF, slides, trackers, and source code, including cases where the sources conflict: for example, a brand-palette email overrides a legacy spec PDF.

Diversity

Multiple role personas (UX Architect, Backend Developer, Technical Product Manager, …), eleven task labels (New Feature, Architecture & System Design, Triage & Request Management, …) from an internal task-type taxonomy, and nine MCP backends (mail, sheets, docs, slides, pdf, calendar, chat, filesystem, bash) — not a single-domain coding suite.

Long-horizon, billable work

Each task is 10–20 hours of senior-practitioner work — shipping modules that gate a release, clearing a triage backlog, closing a security audit. Landing them unblocks real revenue, not leaderboard movement.

Leaderboard

Each model/harness combination attempts every task several times independently. The headline metric is Soft-Gated Reward, the mean shaped reward emitted by the verifier, averaged across tasks. All-Core Pass Rate is the share of runs that clear every core verifier, analogous to the resolve rate reported by benchmarks like SWE-bench.
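As a rough illustration of how the two leaderboard numbers described above could be aggregated from per-run results, here is a minimal sketch. The record fields (`task`, `reward`, `all_core_passed`) and the function name are hypothetical, and the choice to average runs within each task before averaging across tasks is an assumption; the page only states that the reward is averaged across tasks.

```python
from collections import defaultdict
from statistics import mean


def leaderboard_metrics(trials):
    """Aggregate per-run records for one model/harness combination.

    `trials` is an iterable of dicts with hypothetical keys:
    'task' (str), 'reward' (float in 0..1), 'all_core_passed' (bool).
    Returns (soft_gated_reward, all_core_pass_rate).
    """
    rewards_by_task = defaultdict(list)
    passes = []
    for t in trials:
        rewards_by_task[t["task"]].append(t["reward"])
        passes.append(1.0 if t["all_core_passed"] else 0.0)

    # Assumption: mean reward per task first, then the mean of those task means.
    soft_gated = mean(mean(r) for r in rewards_by_task.values())
    # Share of all runs that cleared every core verifier.
    all_core = mean(passes)
    return soft_gated, all_core
```

Averaging per task first keeps a task with extra runs from dominating the headline number; if every task has the same number of runs, this equals a flat mean over all runs.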

# · Model / Harness · Reward
Model Comparison — All Tasks

Soft-Gated Reward keeps the honest pass/fail signal that Verifier Pass % misses, while staying dense enough to differentiate trials within both the failing and the passing groups. See Reward Shaping.

Failure Modes

What the Results Say

The patterns below cut across the taxonomy — read them before the table and heatmap.

  • Incomplete Deliverable is the most common reason agents fail — more than the next two failure modes combined. The agent reads the requirements, but leaves key details out of the deliverable. On task3 (a front-end design audit), models that read the code still dropped findings they were asked to report: that the live nav is Navigation.tsx, not the unused Navbar.tsx, and that animation timings ran over the agreed limits. Found, then omitted.
  • The omissions often start at the Context Gathering step. Where specs are scattered, the agent misses unopened PDFs and spreadsheets, emails behind pagination, and later replies buried in a thread — e.g., on task6 (a TypeScript→PostgreSQL migration brief) and task10 (triage of 10 client emails). Incomplete input, incomplete report.
  • Fable 5 leads because it reports most completely. It tops the board (40% of tasks fully passed, 0.51 reward on a 0–1 scale) and wins task3 and task41 outright. It reads about as much of the required source as Codex but turns more of it into the findings the task asked for, at a higher cost per task ($4.17 vs $2.58).
  • Opus 4.8 wins on spec discovery and verifier pass-rate, but loses where it counts most. All-core credit requires every core check to pass at once — and there Opus trails (20% of tasks vs Codex's 25%). Strong average coverage doesn't close the last required check, so it finishes behind Codex despite reading and resolving more.
  • Fable's worst task is a safety-policy stop, not a capability gap. On task26 (a backend security audit), Fable was halted by a cyber-safety policy while reading the vulnerable code, before writing anything — its lowest score (0.05). Exclude that task and Fable fully passes 50% of the rest, vs Codex 31% and Opus 25%.
  • Gemini 3.1 Pro is cheap but under-reads the source. ~$0.85/task and the second-lowest reading rate (~57%); its misses are unopened files and emails it never paginated through, and the findings that go missing as a result.
  • Grok 4.20 is the cheapest because it does the least, and it reports work it never did. It runs the fewest tool calls (33, about half the frontier models), costs the least (~$0.44/task), and reads the least source (~46%); it ranks last on reward (0.12) and verifier pass-rate (38%), with zero all-core clears. It is the only model that stops before finishing: on 4 of 5 tasks its final message declares the task complete while required deliverables are missing, and on the security audit (task26) it spent its whole run reading and wrote nothing.
Harbor Package

The complete Harbor task package as it lives in the repo — harbor_tasks/<task>/ — including the harness, instructions, seed datastores, reference solution, and tests. Browse the filesystem tree to inspect any file.

Working Environment

Files and data provided to each model as their working environment. Browse the filesystem tree to inspect individual files.

Working Environment Structure
  • .apps_data/: system directory powering MCP tools (hidden from the model)
  • filesystem/: workspace visible to the model (visible)
~/workspace
Model Performance

Side-by-side performance across all models on this task. Each model attempts the task several times independently. The three headline metrics are All-Core Pass Rate (% of trials clearing every core verifier), Verifier Pass % (mean passing fraction across all verifiers), and Soft-Gated Reward (mean shaped reward emitted by the verifier).

Aggregate Charts

Headline outcomes, reward-component signals, cost, latency, and tool usage compared across all models.

Failure Modes — Tagged Failures

Top-level stages (L1) for this task; open a stage to drill into its individual failure modes (L2). Use the selector to compare two model/harnesses or show all models. Click any non-zero tile or bar to open an example trial for that model in this task.

Verifier Pass Rates

Per-verifier breakdown showing the pass rate across all runs for each model.

Trials

Each model attempts the task multiple times. Pick a model, then any trial to inspect that run's reward, verifier results, trajectory, workspace diff, and trial-specific failure-mode observations.

Insights

Common failure patterns and analysis across all runs. Select a model to see where and why it struggled.

Human Workflow

The recommended human approach to solving this task — the ideal sequence of steps, tools, and decisions.

Taxonomy Definitions

Every task is labelled with one SDLC task type. The definitions below are the authoritative descriptions of each type.

Benchmark Profile

SDLC taxonomy coverage across 14 tasks.

SDLC Taxonomy Coverage

of SDLC categories are evaluated.
MCP Environments

Tasks run inside a sandboxed RL environment where agents work through MCP servers — mail, calendar, chat, documents, spreadsheets, presentations, PDFs, the filesystem, and code execution.

These MCP servers are mocks of real productivity apps, faithful to their full functionality but not to any vendor's API. That's deliberate: what transfers is workflow parity, not signature parity. An agent that learns to triage an inbox, reconcile a spreadsheet against an email thread, or chase a requirement through a document carries that skill to any real provider — matching some vendor's exact argument names was never the valuable part.

MCP Tool Explorer

Every tool agents can call, per MCP server — with parameters and full input schemas.

The app servers are meta-tools: one MCP tool per server, dispatching every operation through an action field. The static schema stays deliberately generic — agents discover each action's parameters at runtime by calling help and the *_schema introspection tools. Tool discovery is part of the task; the explorer below includes each server's help registry, so you see exactly what agents can discover.
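The meta-tool pattern described above can be sketched in a few lines. This is not the benchmark's actual server code: the `MetaTool` class, the `list_messages` action, and its `page` parameter are all hypothetical, chosen only to show one tool dispatching on an `action` field with a `help` registry discoverable at runtime.

```python
from typing import Any, Callable, Dict


class MetaTool:
    """Hypothetical single-entry tool that dispatches on an `action` field."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._actions: Dict[str, Callable[..., Any]] = {}
        self._help: Dict[str, Dict[str, str]] = {}

    def register(self, action: str, params: Dict[str, str],
                 fn: Callable[..., Any]) -> None:
        # Store the handler and its parameter docs for runtime discovery.
        self._actions[action] = fn
        self._help[action] = params

    def call(self, request: Dict[str, Any]) -> Any:
        # The static schema only promises an `action` field; the remaining
        # fields are interpreted by whichever handler the action selects.
        action = request.get("action")
        if action == "help":
            return dict(self._help)
        if action not in self._actions:
            return {"error": f"unknown action {action!r}; call 'help'"}
        args = {k: v for k, v in request.items() if k != "action"}
        return self._actions[action](**args)


# Hypothetical mail server exposing one paginated action.
mail = MetaTool("mail")
mail.register(
    "list_messages",
    {"page": "int, 1-based page number"},
    lambda page=1: {"page": page, "messages": []},
)
```

An agent in this sketch would first call `mail.call({"action": "help"})` to learn that `list_messages` takes a `page` argument, then issue `mail.call({"action": "list_messages", "page": 2})`. Keeping the static schema generic is what makes that discovery step part of the task.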

File System Manager and Code Execution are different — a compatibility layer, not app environments. They give file access and shell execution to harnesses that lack native tools for it; a harness that ships its own (OpenCode, for example) can use either these MCP tools or its native equivalents.

MCP Server Coverage

How many tasks exercise each MCP server.
Task Lifecycle
The end-to-end pipeline for creating, validating, and generating evaluation tasks — from raw data procurement through final deliverable review.
Legend: Manual Work · QC Work · Automated Work

Roles
  • Procurement: Sourcers, QA
  • Domain Experts: Expert 1, Expert 2, Expert 3
  • Reviewers: Reviewer 1, Reviewer 2

Stages
  1. Procurement of Data. Sourcers: Source Repository + Operational Artifacts. QA and Expert 1: Repo + Artifacts QA.
  2. Task Creation. Gate: if pass@k ≤ 50%.
  3. Task Iteration & QC. Gate: proceed when verifiers PASS against golden data.
  4. Verifier Iteration & QC. Gate: proceed when verifiers PASS against ALL golden data.
  5. Trajectory Generation + QC.

Environment
  • Shared Filesystem: Docs MCP (.docx), Spreadsheet (.xlsx), PDF MCP (.pdf), Presentation (.ppt), Calendar (.ical), Mail MCP (.mbox), Chat MCP (.json), Filesystem MCP
  • Container Runtime: Dockerfile + Repo, Tool Execution
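The lifecycle lists a gate of "if pass@k ≤ 50%" after Task Creation but does not say how pass@k is estimated. One common choice, assumed here and not confirmed by the source, is the unbiased estimator 1 − C(n−c, k)/C(n, k) for n attempts with c successes. The sketch below applies it to a hypothetical difficulty gate that keeps a task only when the estimate is at or below the threshold; the function names and the reading of the gate's direction are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c successes.

    Probability that at least one of k attempts drawn without
    replacement from the n recorded attempts is a success.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def passes_difficulty_gate(n: int, c: int, k: int, threshold: float = 0.5) -> bool:
    """Assumed reading of the gate: keep the task if pass@k <= threshold."""
    return pass_at_k(n, c, k) <= threshold
```

For example, with 4 attempts and 1 success, `pass_at_k(4, 1, 2)` is 1 − 3/6 = 0.5, which sits exactly on the threshold and would pass this hypothetical gate.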
Reward Shaping

Every trial is scored by a set of verifiers, each emitting a pass/fail signal. Those signals play three different roles — and how we fold them into a single number per model is what the leaderboard ultimately rewards.

Three roles, three signals

Each verifier on a task is one of three kinds. Each kind contributes a 0–1 component to the score.

Core

outcome

Check that the output is correct. These are gates: if a single core verifier fails, the task is not solved — no amount of polish can buy it back.

Core = core passed ÷ core total

Secondary

outcome

Differentiate the quality of correct solutions — edge cases, polish, and nice-to-haves that separate a good solve from a great one. Never gates.

Secondary = secondary passed ÷ secondary total

Spec Discovery

process

A process-verifier signal: the percentage of spec-surface excerpts the agent surfaced while working. Split into requirement excerpts found and additional-information excerpts found.

Spec Discovery = ⅔ requirement excerpts found + ⅓ additional-info excerpts found
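The three component formulas above are simple ratios, and a small sketch makes the inputs explicit. The function names are hypothetical, and it is an assumption that each Spec Discovery term is the fraction of excerpts found within its own group (requirement or additional-info).

```python
from typing import Optional


def pass_fraction(passed: int, total: int) -> Optional[float]:
    """Core or Secondary component: passed / total.

    Returns None when the task has no verifiers of that kind.
    """
    if total == 0:
        return None
    return passed / total


def spec_discovery(req_found: int, req_total: int,
                   info_found: int, info_total: int) -> float:
    """Spec Discovery = 2/3 * requirement fraction + 1/3 * additional-info fraction.

    Assumption: both groups are non-empty for the task being scored.
    """
    if req_total <= 0 or info_total <= 0:
        raise ValueError("both excerpt groups must be non-empty")
    return (2 / 3) * (req_found / req_total) + (1 / 3) * (info_found / info_total)
```

With 7 of 10 core verifiers passing, `pass_fraction(7, 10)` gives 0.7; an agent that surfaced 3 of 4 requirement excerpts and 1 of 2 additional-info excerpts would get a Spec Discovery of 2/3 · 0.75 + 1/3 · 0.5, roughly 0.67.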
From signals to a single number

How you combine those components determines what the leaderboard rewards. SDLCbench reports three views — two extremes and the compromise it actually uses — so partial credit stays visible without papering over unsolved tasks.

Sparse

All-Core Pass Rate

A trial counts only when every core verifier passes; secondary verifiers are ignored. The model-level score is the average across runs.

reward_trial = 1 if all core passed, else 0

Why it's clean: a model that solves 9/10 core verifiers but misses one looks identical to a model that solved nothing — no false credit. Why it hurts: at frontier difficulty, almost every trial fails. The signal collapses to zero for most models, so you can't tell which ones are closer to solving the task.

Dense

% Verifiers Pass / Verifier Pass %

Per trial, the fraction of all verifiers (core + secondary) that passed. The model-level score is the average across runs. This is the maximally dense extreme — every successful check moves the needle, regardless of importance.

reward_trial = (n_core_passed + n_secondary_passed) / (n_core + n_secondary)

Why it's expressive: partial progress shows up across the entire verifier set. A trial passing 12/18 verifiers scores 0.67, not 0. Why it misleads: a trial that passes most verifiers but misses load-bearing core ones looks like a strong solve when in practice the task is unsolved. A trial that polishes edge cases (secondary) while silently breaking the build (core) can score as high as one that actually solves the task.
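The two extremes are easy to compare side by side. The sketch below implements both per-trial formulas as stated; the function names are hypothetical, and the example counts (7 of 10 core, 5 of 8 secondary) are chosen to reproduce the 12/18 case mentioned above.

```python
def sparse_reward(core_passed: int, core_total: int) -> float:
    """All-Core Pass Rate for one trial: 1 only if every core verifier passed."""
    return 1.0 if core_passed == core_total else 0.0


def dense_reward(core_passed: int, core_total: int,
                 secondary_passed: int, secondary_total: int) -> float:
    """Verifier Pass % for one trial: passed checks over all checks."""
    return (core_passed + secondary_passed) / (core_total + secondary_total)


# 7/10 core and 5/8 secondary: unsolved under the sparse view,
# yet 12 of 18 checks passed under the dense view.
partial = (sparse_reward(7, 10), round(dense_reward(7, 10, 5, 8), 2))
```

Here `partial` evaluates to `(0.0, 0.67)`, which is the tension the soft-gated reward is meant to resolve.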

Soft-Gated

Reward (used by this dashboard)

Blend the three components into one score, weighted toward correctness, with quality and process contributing less. Then, if any core verifier failed, multiply the whole thing by 0.3.

Blend = 0.7·Core + 0.1·Secondary + 0.2·Spec Discovery
Reward = Blend, if every core verifier passes
Reward = 0.3 · Blend, otherwise

If a task has no secondary or no discovery signal, that weight is redistributed across the components it does have.

The compromise: among solved trials, secondary and spec discovery still nudge the score; among unsolved trials, every component still moves it — but the 0.3 multiplier opens a hard gap between the regimes (≤ 0.3 capped vs. ≥ 0.7 floor), so rank order respects task success first and partial credit second.
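A minimal sketch of the soft-gated reward follows, using the weights and the 0.3 multiplier stated above. The function name and signature are hypothetical. How a missing component's weight is redistributed is not specified in detail, so this sketch assumes the remaining weights are renormalized proportionally.

```python
from typing import Optional

WEIGHTS = {"core": 0.7, "secondary": 0.1, "discovery": 0.2}
GATE = 0.3  # multiplier applied when any core verifier failed


def soft_gated_reward(core: float,
                      secondary: Optional[float],
                      discovery: Optional[float],
                      all_core_passed: bool) -> float:
    """Blend the available 0..1 components, then gate on core success.

    Pass None for a component the task does not have. Assumption: its
    weight is spread proportionally over the components that remain.
    """
    components = {"core": core, "secondary": secondary, "discovery": discovery}
    present = {k: v for k, v in components.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in present)
    blend = sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight
    return blend if all_core_passed else GATE * blend
```

Called with the Trial A figures used later in this section, `soft_gated_reward(0.7, 0.625, 0.60, False)` comes to about 0.20, and the Trial B figures, `soft_gated_reward(1.0, 0.625, 0.80, True)`, come to about 0.92. Because a solved trial has Core = 1, its blend can never fall below 0.7, while an unsolved trial is always below 0.3.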

Two trials, three shapes

Two hypothetical trials of the same task. Same verifier set: 10 core, 8 secondary. Notice how each shape ranks them.

Trial A (partial solve): 7/10 core passed · 5/8 secondary passed · 0.60 spec discovery
  • All-Core Pass Rate (sparse): 0.00 (3 core failed → 0)
  • % verifiers pass (dense): 0.67 (12 / 18)
  • Reward (soft-gated): 0.20 = 0.3 · (0.7·0.7 + 0.1·0.625 + 0.2·0.60)

Trial B (full solve, partial polish): 10/10 core passed · 5/8 secondary passed · 0.80 spec discovery
  • All-Core Pass Rate (sparse): 1.00 (all core passed)
  • % verifiers pass (dense): 0.83 (15 / 18)
  • Reward (soft-gated): 0.92 = 0.7·1.0 + 0.1·0.625 + 0.2·0.80

Under All-Core Pass Rate (sparse), Trial A and a do-nothing trial both score 0 — indistinguishable. Under % verifiers pass (dense), Trial A (0.67) and Trial B (0.83) sit only 0.16 apart even though Trial A leaves the task unsolved and Trial B doesn't. The soft-gated reward opens that to a 0.72-wide gap (0.20 vs 0.92), so task success drives rank order while partial progress still moves the score within each regime.