Realism
Real enterprise workloads on real codebases plus the artifacts around them — multi-sprint trackers, hundred-email inboxes, slide decks, ADRs, regulatory PDFs. Not synthetic, not GitHub-indexed boilerplate.
SDLCbench evaluates AI agents on real software-development work drawn from private enterprise codebases and the operational artifacts around them. Tasks resemble 10-20 hour workloads that senior practitioners — software engineers, project managers, QA engineers, and other knowledge workers — perform with harnesses like Claude Code, Codex, and Claude Cowork: sift through email and chat, write code, populate trackers, draft PRDs, design documents, and architecture docs, then deliver outputs validated against human solutions.
Unlike code-only benchmarks (SWE-bench) or web/agent benchmarks (GAIA, AgentBench, WebArena, τ-bench) that operate on published or synthetic environments, SDLCbench tasks run on private, never-indexed enterprise data and require reconciling multiple source types — code, email, PDF, spreadsheet — in a single trial.
No task or operational artifact is published or indexed anywhere; models cannot have memorized the source or the answer on any trial.
A single trial reconciles signals across mail, PDF, slides, trackers, and source code under conflicting truth — e.g., a brand-palette email overrides a legacy spec PDF.
Multiple role personas (UX Architect, Backend Developer, Technical Product Manager, …), eleven task labels (New Feature, Architecture & System Design, Triage & Request Management, …) from an internal task-type taxonomy, and nine MCP backends (mail, sheets, docs, slides, pdf, calendar, chat, filesystem, bash) — not a single-domain coding suite.
Each task is 10–20 hours of senior-practitioner work — shipping modules that gate a release, clearing a triage backlog, closing a security audit. Landing them unblocks real revenue, not leaderboard movement.
Each model/harness combination attempts every task several times independently. The headline metric is Soft-Gated Reward, the mean shaped reward emitted by the verifier, averaged across tasks. All-Core Pass Rate is the share of runs that clear every core verifier, analogous to the resolve rate reported by benchmarks like SWE-bench.
| # | Model / Harness | Reward |
|---|---|---|
Soft-Gated Reward keeps the honest pass/fail signal that Verifier Pass % misses while staying dense enough to separate stronger from weaker trials on both the failing and the passing side. See Reward.
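The aggregation behind the two leaderboard numbers can be sketched as follows. This is a minimal illustration, assuming the headline reward is a macro-average (mean over runs within each task, then mean over tasks); the page does not spell out the order of averaging, and the function names and the `(task_id, reward)` record shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Tuple


def headline_reward(trials: Iterable[Tuple[str, float]]) -> float:
    """Average per-trial rewards within each task, then across tasks.

    `trials` holds (task_id, reward) pairs for one model/harness combination.
    The macro-average order is an assumption, not confirmed by the source.
    """
    by_task = defaultdict(list)
    for task_id, reward in trials:
        by_task[task_id].append(reward)
    return mean(mean(rewards) for rewards in by_task.values())


def all_core_pass_rate(core_solved: Iterable[bool]) -> float:
    """Share of runs in which every core verifier passed."""
    flags = list(core_solved)
    return sum(flags) / len(flags)


runs = [("task1", 0.9), ("task1", 0.7), ("task2", 0.2)]
print(round(headline_reward(runs), 2))
print(all_core_pass_rate([True, False, False, True]))
```

Averaging per task first keeps a task with more runs from dominating the headline number; a flat mean over all runs would be the alternative if every task has the same run count.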
The patterns below cut across the taxonomy — read them before the table and heatmap.
Incomplete Deliverable is the most common reason agents fail —
more than the next two failure modes combined. The agent reads the requirements, but leaves
key details out of the deliverable. On task3 (a front-end design audit), models that read
the code still dropped findings they were asked to report: that the live nav is
Navigation.tsx, not the unused Navbar.tsx, and that animation
timings ran over the agreed limits. Found, then omitted.
Context Gathering step.
Where specs are scattered, the agent misses unopened PDFs and spreadsheets, emails behind
pagination, and later replies buried in a thread — e.g., on task6 (a TypeScript→PostgreSQL
migration brief) and task10 (triage of 10 client emails). Incomplete input, incomplete report.
The complete Harbor task package as it lives in the repo — harbor_tasks/<task>/ — including the harness, instructions, seed datastores, reference solution, and tests. Browse the filesystem tree to inspect any file.
Files and data provided to each model as their working environment. Browse the filesystem tree to inspect individual files.
| Directory | Purpose | Visibility |
|---|---|---|
| .apps_data/ | System directory powering MCP tools | Hidden |
| filesystem/ | Workspace visible to the model | Visible |

Side-by-side performance across all models on this task. Each model attempts the task several times independently. The three headline metrics are All-Core Pass Rate (% of trials clearing every core verifier), Verifier Pass % (mean passing fraction across all verifiers), and Soft-Gated Reward (mean shaped reward emitted by the verifier).
Headline outcomes, reward-component signals, cost, latency, and tool usage compared across all models.
Top-level stages (L1) for this task; open a stage to drill into its individual failure modes (L2). Use the selector to compare two model/harness combinations or show all models. Click any non-zero tile or bar to open an example trial for that model in this task.
Per-verifier breakdown showing the pass rate across all runs for each model.
Each model attempts the task multiple times. Pick a model, then any trial to inspect that run's reward, verifier results, trajectory, workspace diff, and trial-specific failure-mode observations.
Common failure patterns and analysis across all runs. Select a model to see where and why it struggled.
The recommended human approach to solving this task — the ideal sequence of steps, tools, and decisions.
Every task is labelled with one SDLC task type. The definitions below are the authoritative descriptions of each type.
SDLC taxonomy coverage across 14 tasks.
Tasks run inside a sandboxed RL environment where agents work through MCP servers — mail, calendar, chat, documents, spreadsheets, presentations, PDFs, the filesystem, and code execution.
These MCP servers are mocks of real productivity apps, faithful to their full functionality but not to any vendor's API. That's deliberate: what transfers is workflow parity, not signature parity. An agent that learns to triage an inbox, reconcile a spreadsheet against an email thread, or chase a requirement through a document carries that skill to any real provider — matching some vendor's exact argument names was never the valuable part.
The app servers are meta-tools: one MCP tool per server, dispatching every operation
through an action field. The static schema stays deliberately generic — agents
discover each action's parameters at runtime by calling help and the
*_schema introspection tools. Tool discovery is part of the task; the explorer
below includes each server's help registry, so you see exactly what agents can discover.
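The meta-tool pattern described above can be sketched as a single entry point that dispatches on an `action` field. This is a minimal sketch, not the benchmark's implementation: the `mail` server function, the action names (`list_messages`, `read_message`), their parameters, and the `mail_schema` helper are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical handlers for a mock mail server (names are illustrative only).
def _list_messages(page: int = 1) -> Dict[str, Any]:
    return {"page": page, "messages": []}


def _read_message(message_id: str) -> Dict[str, Any]:
    return {"id": message_id, "body": ""}


# Registry: action name -> (handler, human-readable parameter docs).
ACTIONS: Dict[str, Tuple[Callable[..., Dict[str, Any]], Dict[str, str]]] = {
    "list_messages": (_list_messages, {"page": "int, optional, defaults to 1"}),
    "read_message": (_read_message, {"message_id": "str, required"}),
}


def mail(action: str, **params: Any) -> Dict[str, Any]:
    """One tool per server: the static schema exposes only `action` plus params."""
    if action == "help":
        # Runtime discovery: list every action with its parameter docs.
        return {"actions": {name: docs for name, (_, docs) in ACTIONS.items()}}
    if action not in ACTIONS:
        return {"error": f"unknown action {action!r}; call action='help'"}
    handler, _ = ACTIONS[action]
    return handler(**params)


def mail_schema(action: str) -> Dict[str, str]:
    """Stand-in for a `*_schema` introspection tool: docs for one action."""
    return ACTIONS[action][1]


print(sorted(mail("help")["actions"]))
print(mail("list_messages", page=2)["page"])
```

Because the static schema carries only the `action` field, an agent that skips the `help` call has no way to learn that `read_message` needs a `message_id`, which is why discovery counts as part of the task.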
File System Manager and Code Execution are different — a compatibility layer, not app environments. They give file access and shell execution to harnesses that lack native tools for it; a harness that ships its own (OpenCode, for example) can use either these MCP tools or its native equivalents.
Every trial is scored by a set of verifiers, each emitting a pass/fail signal. Those signals play three different roles — and how we fold them into a single number per model is what the leaderboard ultimately rewards.
Each verifier on a task is one of three kinds. Each kind contributes a 0–1 component to the score.
Core verifiers check that the output is correct. These are gates: if a single core verifier fails, the task is not solved, and no amount of polish can buy it back.
Secondary verifiers differentiate the quality of correct solutions: edge cases, polish, and nice-to-haves that separate a good solve from a great one. They never gate.
Spec discovery is a process-verifier signal: the percentage of spec-surface excerpts the agent surfaced while working, split into requirement excerpts found and additional-information excerpts found.
How you combine those components determines what the leaderboard rewards. SDLCbench reports three views — two extremes and the compromise it actually uses — so partial credit stays visible without papering over unsolved tasks.
All-Core Pass Rate (sparse). A trial counts only when every core verifier passes; secondary verifiers are ignored. The model-level score is the average across runs.
reward_trial = 1 if all core passed, else 0
Why it's clean: a trial that passes 9/10 core verifiers but misses one looks identical to a trial that passed none, so no false credit is given. Why it hurts: at frontier difficulty, almost every trial fails. The signal collapses to zero for most models, so you can't tell which ones are closer to solving the task.
% verifiers pass (dense). Per trial, the fraction of all verifiers (core + secondary) that passed. The model-level score is the average across runs. This is the maximally dense extreme: every successful check moves the needle, regardless of importance.
reward_trial = (n_core_passed + n_secondary_passed) / (n_core + n_secondary)
Why it's expressive: partial progress shows up across the entire verifier set. A trial passing 12/18 verifiers scores 0.67, not 0. Why it misleads: a trial that passes most verifiers but misses load-bearing core ones looks like a strong solve when in practice the task is unsolved. Polishing edge cases (secondary) while silently breaking the build (core) can score nearly as well as solving the task.
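The two extremes above can be written out directly from their formulas. This is a minimal sketch; the function names are hypothetical, and it assumes every task has at least one core verifier.

```python
from typing import Sequence


def all_core_pass(core: Sequence[bool]) -> float:
    """Sparse view: 1 only if every core verifier passed, else 0."""
    return 1.0 if all(core) else 0.0


def verifier_pass_fraction(core: Sequence[bool], secondary: Sequence[bool]) -> float:
    """Dense view: passed verifiers over all verifiers (core + secondary)."""
    results = list(core) + list(secondary)
    if not results:
        raise ValueError("a task needs at least one verifier")
    return sum(results) / len(results)


# A trial that passes 9 of 10 core verifiers scores the same as one that passes none.
print(all_core_pass([True] * 9 + [False]), all_core_pass([False] * 10))

# 7 of 10 core plus 5 of 8 secondary is 12 of 18 verifiers.
core = [True] * 7 + [False] * 3
secondary = [True] * 5 + [False] * 3
print(round(verifier_pass_fraction(core, secondary), 2))
```

The sparse function ignores `secondary` entirely, while the dense one treats a core failure and a secondary failure as interchangeable, which is the weakness each view is criticized for above.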
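Reward (soft-gated). Blend the three components into one score, weighted toward correctness, with quality and process along for the ride. Then, if any core verifier failed, multiply the whole thing by 0.3.
reward_trial = g · (0.7·core + 0.1·secondary + 0.2·discovery), where g = 1 if all core passed, else 0.3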
If a task has no secondary or no discovery signal, that weight is redistributed across the components it does have.
The compromise: among solved trials, secondary and spec discovery still nudge the score; among unsolved trials, every component still moves it — but the 0.3 multiplier opens a hard gap between the regimes (≤ 0.3 capped vs. ≥ 0.7 floor), so rank order respects task success first and partial credit second.
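A minimal sketch of the soft-gated blend, under two stated assumptions. First, the weights (0.7 core, 0.1 secondary, 0.2 spec discovery) are read off the worked calculations in this section rather than from a published specification. Second, "redistributed" is interpreted here as proportional renormalization over the components that exist; the source does not say how the weight is redistributed. The function and parameter names are hypothetical.

```python
from typing import Optional

# Assumed weights, inferred from the worked example in this section.
WEIGHTS = {"core": 0.7, "secondary": 0.1, "discovery": 0.2}
GATE = 0.3  # multiplier applied when any core verifier failed


def soft_gated_reward(
    core_passed: int,
    core_total: int,
    secondary_frac: Optional[float] = None,
    discovery_frac: Optional[float] = None,
) -> float:
    """Blend the 0-1 components, then apply the gate if the task is unsolved.

    Pass None for a component the task does not have; its weight is spread
    proportionally over the remaining components (an assumed scheme).
    """
    components = {
        "core": core_passed / core_total,
        "secondary": secondary_frac,
        "discovery": discovery_frac,
    }
    present = {name: value for name, value in components.items() if value is not None}
    total_weight = sum(WEIGHTS[name] for name in present)
    blended = sum(WEIGHTS[name] * value for name, value in present.items()) / total_weight
    solved = core_passed == core_total
    return blended if solved else GATE * blended


# Unsolved: 7 of 10 core, 5 of 8 secondary, 60% of spec excerpts surfaced.
print(round(soft_gated_reward(7, 10, 0.625, 0.60), 2))
# Solved: all core, same secondary, 80% of spec excerpts surfaced.
print(round(soft_gated_reward(10, 10, 0.625, 0.80), 2))
# Solved task with no discovery signal: its weight is renormalized away.
print(round(soft_gated_reward(10, 10, 0.5), 2))
```

Under this renormalization a solved trial never drops below 0.7 and an unsolved one never reaches 0.3, which matches the gap between the two regimes described above.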
Two hypothetical trials of the same task. Same verifier set: 10 core, 8 secondary. Notice how each shape ranks them.
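Trial A (3 of 10 core verifiers failed)
| Reward shape | Score | Calculation |
|---|---|---|
| All-Core Pass Rate (sparse) | 0.00 | 3 core failed → 0 |
| % verifiers pass (dense) | 0.67 | 12 / 18 |
| Reward (soft-gated) | 0.20 | 0.3 · (0.7·0.7 + 0.1·0.625 + 0.2·0.60) |

Trial B (all core verifiers passed)
| Reward shape | Score | Calculation |
|---|---|---|
| All-Core Pass Rate (sparse) | 1.00 | all core passed |
| % verifiers pass (dense) | 0.83 | 15 / 18 |
| Reward (soft-gated) | 0.92 | 0.7·1.0 + 0.1·0.625 + 0.2·0.80 |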
Under All-Core Pass Rate (sparse), Trial A and a do-nothing trial both score 0 — indistinguishable. Under % verifiers pass (dense), Trial A (0.67) and Trial B (0.83) sit only 0.16 apart even though Trial A leaves the task unsolved and Trial B doesn't. The soft-gated reward opens that to a 0.72-wide gap (0.20 vs 0.92), so task success drives rank order while partial progress still moves the score within each regime.