SPOQ Performance Benchmarks
Four controlled experiments plus a 17-repository deployment study. Cross-validated against both Claude (frontier cloud) and Qwen3.6-35B-A3B (local open-weights).
Headline figures: up to 14.3x speedup (synthetic DAG ceiling); 99.87% aggregate test pass rate (591 suites); 1,822 completed tasks across 183 epics.
Controlled Benchmarks
Four experiments testing scheduling efficiency, planning quality, validation effectiveness, and human-AI collaboration. All four were replicated against a locally hosted Qwen3.6-35B-A3B model to verify gains come from orchestration rather than model capability.
Scheduling Efficiency
Algorithmic ceiling (synthetic-sleep DAGs, unbounded): critical-path ratio 1.03–1.11, speedup up to 14.3x on a 20-task fully-parallel DAG.
Hardware floor (real Qwen LLM calls, 2-slot llama-server): stable 1.4x across all five DAG configurations, matching the slot ceiling.
Planning Quality
Claude: coverage 93.0 → 99.75, parallelism 31.0 → 75.25, cyclic plans 3/4 → 0/4.
Qwen: coverage 56.2 → 91.2 (+35 pts), parallelism 12.5 → 65.0, cycles 2/4 → 0/4. SPOQ-on-Qwen lands within 1.8 points of Claude's unaided baseline.
Validation Effectiveness
Defects per task fall 0.34 → 0.20, test pass rate rises 91.25% → 99.75%, static warnings drop 4.25 → 0, rework cycles drop 3.75 → 1.0.
Monotonic ordering Full SPOQ > Code-Val > No-Val replicates exactly under Qwen.
Human-AI Collaboration
Residual defects fall 0.47 → 0.03 per task (16x reduction), test pass rate rises 96.5% → 99.75%, security issues fall 2.5 → 1.25 after fixes.
Qwen planning-quality replication: coverage +6.25 pts, dependency errors −40%.
Field Evidence — Case Study Summary
Two detailed case studies (one internal, one external) with full wave structures, plus aggregate metrics across the adoption survey. These are field measurements, not benchmark conditions — see the controlled experiments above for matched-baseline comparisons.
| Metric | UI Epic | Rebrand Epic | Adoption (agg.) |
|---|---|---|---|
| Tasks | 13 | 12 | 1,446 |
| Waves | 2 | 4 | varies |
| Max Parallelism | 12 | 5 | varies |
| Speedup Factor | 5.3x | 2.8x | varies |
| Completion Rate | 92% | 100% | 100% |
| Orchestrator Interventions | 1 | 3 | varies |
| Rework Cycles | 2 | 0 | 0–1 |
| Test Cases | — | 174 | varies |
| Avg. Confidence | — | 0.92 | 0.90–0.95 |
Case Study 1: UI Improvements (Internal)
| Metric | Value |
|---|---|
| Total Tasks | 13 |
| Wave 0 Parallelism | 12 concurrent agents |
| Wave 1 Parallelism | 1 agent |
| Sequential Estimate | 18.5 hours |
| Parallel Wall-Clock | 3.5 hours |
| Speedup Factor | 5.3x |
| First-Pass Rate | 12 of 13 (92%) |
| Rework Cycles | 2 (tasks 04, 07) |
Wave 0: Independent component implementations (Sonner setup, DataTable component, modal dialogs, toast integrations, search functionality, and tests), all executed concurrently.
Wave 1: End-to-end QA depending on all Wave 0 tasks, with final integration verification after all components completed.
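The wave structure above can be derived from a task dependency map by peeling off tasks whose prerequisites are all satisfied. The following is a minimal sketch of that idea using Kahn-style level grouping; it is not SPOQ's actual scheduler, and the task IDs (`task-01` through `task-13`) and the function name `compute_waves` are hypothetical, chosen only to mirror the 12 + 1 shape of this case study.

```python
from collections import defaultdict


def compute_waves(deps):
    """Group tasks into waves.

    deps maps each task to the set of tasks it depends on.
    Assumption: every dependency also appears as a key in deps.
    """
    indegree = {task: len(reqs) for task, reqs in deps.items()}
    dependents = defaultdict(list)
    for task, reqs in deps.items():
        for req in reqs:
            dependents[req].append(task)

    ready = sorted(t for t, n in indegree.items() if n == 0)
    waves = []
    while ready:
        waves.append(ready)
        unlocked = []
        for task in ready:
            for child in dependents[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    unlocked.append(child)
        ready = sorted(unlocked)

    # Any task never scheduled sits on a cycle (or has an unknown dependency).
    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("dependency cycle detected")
    return waves


# Hypothetical IDs: twelve independent tasks plus one QA task that needs all of them.
deps = {f"task-{i:02d}": set() for i in range(1, 13)}
deps["task-13"] = set(deps)
waves = compute_waves(deps)
print([len(w) for w in waves])
```

Raising on a cycle rather than returning a partial plan matches the spirit of the planning experiment, where cyclic plans are counted as failures.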
Failure Mode Encountered
Task 04 entered a runaway retry loop, executing npm install sonner over 100 times. SPOQ now enforces a 3-retry maximum with pre-installation verification.
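A guard of this kind might look like the sketch below: check whether the work is already done before attempting it, and cap the number of attempts. This is an illustration under assumptions, not SPOQ's implementation; `ensure_installed` and `npm_package_present` are hypothetical names, and the `node_modules/<name>` check assumes npm's default local install layout.

```python
from pathlib import Path


def npm_package_present(name, project_root="."):
    """Assumed pre-installation check: npm's default layout puts packages in node_modules/<name>."""
    return (Path(project_root) / "node_modules" / name).is_dir()


def ensure_installed(is_installed, install, max_retries=3):
    """Run install() at most max_retries times; skip entirely if already installed.

    Returns the number of install attempts used (0 if nothing was needed).
    """
    if is_installed():
        return 0
    for attempt in range(1, max_retries + 1):
        install()
        if is_installed():
            return attempt
    raise RuntimeError(f"gave up after {max_retries} install attempts")


# Demonstration with a stub installer that never succeeds.
calls = []


def broken_install():
    calls.append("npm install sonner")


try:
    ensure_installed(lambda: False, broken_install)
except RuntimeError as exc:
    print(exc, "| install calls:", len(calls))
```

In real use, `is_installed` could be `lambda: npm_package_present("sonner")` and `install` a function that shells out to the package manager; the stub here only shows that the loop stops at the cap instead of running indefinitely.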
Case Study 2: Client Website Rebrand (External)
| Metric | Value |
|---|---|
| Total Tasks | 12 |
| Max Parallelism | 5 concurrent agents (Wave 1) |
| Sequential Estimate | 18 hours |
| Parallel Wall-Clock | 6.5 hours |
| Speedup Factor | 2.8x |
| Completion Rate | 12 of 12 (100%) |
| Orchestrator Interventions | 3 |
| Test Suites / Cases | 20 / 174 (0 failures) |
Wave 0: Content removal and routing scaffold (parallel).
Wave 1: Homepage rewrite, pricing update, and three persona-specific landing pages (parallel).
Wave 2: Navigation, SEO, and section refinement (parallel).
Wave 3: Test expansion and accessibility polish (parallel). The test count grew from 134 to 174.
Key Challenge
All three orchestrator interventions concerned test fixture synchronization: parallel agents' code changes invalidated sibling test assertions. SPOQ now recommends treating test files as implicit dependents of the components they exercise.
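One possible reading of that recommendation is to add dependency edges at planning time: any task that edits a test file waits for the tasks that edit the components that test exercises. The sketch below shows this under stated assumptions. The function name, the three input maps, and the file paths are all hypothetical, and the source does not say how SPOQ encodes this rule.

```python
def add_implicit_test_deps(deps, touches, tests_for):
    """Return a copy of deps with extra edges from test-editing tasks to component-editing tasks.

    deps:      task -> set of prerequisite tasks
    touches:   task -> set of file paths the task modifies
    tests_for: test file path -> set of component paths that test exercises
    """
    new_deps = {task: set(reqs) for task, reqs in deps.items()}
    for test_task, files in touches.items():
        for path in files:
            components = tests_for.get(path)
            if not components:
                continue  # not a known test file
            for other, other_files in touches.items():
                if other == test_task:
                    continue
                edits_component = bool(other_files & components)
                # Skip edges that would create an immediate two-task cycle.
                would_cycle = test_task in new_deps.get(other, set())
                if edits_component and not would_cycle:
                    new_deps.setdefault(test_task, set()).add(other)
    return new_deps


# Hypothetical tasks and paths for illustration only.
deps = {"pricing-update": set(), "test-expansion": set()}
touches = {
    "pricing-update": {"components/Pricing.tsx"},
    "test-expansion": {"tests/Pricing.test.tsx"},
}
tests_for = {"tests/Pricing.test.tsx": {"components/Pricing.tsx"}}

planned = add_implicit_test_deps(deps, touches, tests_for)
print(planned["test-expansion"])
```

The sketch only guards against direct two-task cycles, so a full cycle check on the resulting graph would still be needed before scheduling.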
Multi-Project Adoption Survey
Beyond the two detailed case studies, SPOQ has been deployed across 17 repositories spanning web platforms, backend services, infrastructure, and developer tooling. The full observation period from November 2025 through March 2026 logged 8,589 commits, 894,664 lines of code, 183 completed epics, and 1,822 completed tasks. On a live snapshot dated 2026-03-21 the projects collectively executed 13,866 tests with an aggregate 99.87% pass rate across 591 suites. The table below shows ten highlighted deployments.
| Project | Domain | Epics | Tasks | Stack |
|---|---|---|---|---|
| Speedrun | Sales intelligence SaaS | 45 | 794 | Next.js, Rust, AWS |
| Portfolio | Personal brand sites | 23 | 247 | Next.js, Rust Lambda |
| Pinpoint | CI/CD testing platform | 17 | 182 | Spring Boot, Terraform |
| SPOQ | Methodology & tooling | 13 | 124 | Next.js, Rust, Python |
| Railroad | Terminal tooling | 4 | 43 | Bash, Next.js |
| Savvy Expat | Consulting site | 4 | 30 | Next.js, Docker |
| Longship | PKI SaaS platform | 2 | 20 | Next.js, Stripe |
| Super-SPOQ | Project template | 1 | 12 | Next.js |
| Beam Chat | Chat application | 1 | 10 | Next.js |
| Blade | Multiplayer terminal | 1 | 9 | MCP, WebRTC |
The Pinpoint Rebrand epic (12 tasks) and a companion analytics epic (7 tasks) were both planned and executed in a single three-hour session using six concurrent Claude Code instances under a single Max license. The 19 combined tasks were completed from cold start to full verification between 4:00 AM and 7:00 AM, yielding a sustained rate of approximately 6 tasks per hour.
Token Cost Analysis
| Model | Input | Output | SPOQ Role |
|---|---|---|---|
| Opus | $15/M | $75/M | Worker |
| Sonnet | $3/M | $15/M | Reviewer |
| Haiku | $0.25/M | $1.25/M | Investigator |
Typical Opus worker task: ~25K input + ~5K output tokens = ~$1.95/task
13-task epic (UI Improvements scale): ~$28 total worker cost
SPOQ's three-tier hierarchy is an economic optimization: Opus tokens are reserved for task execution, while validation and triage route through Sonnet and Haiku.
Pricing as of early 2026; subject to change. Token estimates vary by task complexity, codebase size, and context requirements.
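For readers who want to plug in their own token counts, the per-million rates in the table translate into a simple formula: cost = (input tokens × input rate + output tokens × output rate) / 1,000,000. The sketch below applies it; the `PRICES` dictionary and the function name `token_cost` are illustrative, not part of SPOQ. Note that a single pass of 25K input and 5K output tokens at Opus list rates comes to about $0.75 by this formula, which is lower than the ~$1.95/task figure quoted above; the source does not itemize what else that figure covers.

```python
# Per-million-token list prices (USD) copied from the table above: (input, output).
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}


def token_cost(model, input_tokens, output_tokens):
    """Single-pass list-price cost in USD for one call to the given model tier."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


single_pass = token_cost("opus", 25_000, 5_000)
print(f"${single_pass:.2f}")
```

Retries, re-sent context, and reviewer or investigator calls would each add further `token_cost` terms on top of the single-pass figure.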
Measurement Methodology
How Speedup Is Measured
Speedup factor = sequential time estimate / parallel wall-clock time. The sequential estimate is the sum of individual task durations as if executed one after another. The parallel wall-clock time is the actual elapsed time with concurrent agents executing within computed waves.
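As a quick check of that definition, the sketch below recomputes the two case-study speedups from the sequential estimates and wall-clock times reported in this document. The function name `speedup` is illustrative.

```python
def speedup(sequential_hours, wall_clock_hours):
    """Speedup factor = sequential time estimate / parallel wall-clock time."""
    if wall_clock_hours <= 0:
        raise ValueError("wall-clock time must be positive")
    return sequential_hours / wall_clock_hours


ui_epic = speedup(18.5, 3.5)       # Case Study 1 figures
rebrand_epic = speedup(18.0, 6.5)  # Case Study 2 figures
print(round(ui_epic, 1), round(rebrand_epic, 1))  # 5.3 2.8
```

Because the numerator is an estimate rather than a measured sequential run, the ratio inherits whatever error the per-task duration estimates carry.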
Quality Validation
All deployments were scored against SPOQ's dual validation framework: 10 planning metrics (95/90 threshold) and 10 code metrics (95/80 threshold). Agent confidence scores are self-reported on a 0.0–1.0 scale and cross-referenced with validation gate results.
Task Completion
A task is "completed" when it passes the code validation gate (average score ≥ 95, per-metric minimum ≥ 80). Tasks requiring rework cycles are counted as completed once they pass validation, but the rework is tracked separately. "First-pass rate" measures tasks that pass validation without any rework.
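The gate rule stated above (average score at least 95, no single metric below 80) can be expressed in a few lines. This is a minimal sketch with hypothetical metric scores; `passes_gate` is an illustrative name, and how SPOQ actually produces the per-metric scores is outside its scope.

```python
from statistics import mean


def passes_gate(scores, avg_threshold=95.0, min_threshold=80.0):
    """True if the average meets avg_threshold and every metric meets min_threshold."""
    if not scores:
        raise ValueError("at least one metric score is required")
    return mean(scores) >= avg_threshold and min(scores) >= min_threshold


# Hypothetical scores for the 10 code metrics.
borderline = [100, 100, 98, 97, 96, 95, 95, 94, 90, 85]  # average 95, minimum 85
one_weak_metric = [100] * 9 + [79]                        # average 97.9, minimum 79

print(passes_gate(borderline))       # True
print(passes_gate(one_weak_metric))  # False

# First-pass rate for Case Study 1: tasks passing without rework / total tasks.
first_pass_rate = 12 / 13
print(f"{first_pass_rate:.0%}")  # 92%
```

The second example shows why the per-metric floor matters: a high average alone would let a single failing metric through.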
Limitations
Sample Size
n=2 practitioners. All deployments share a single primary author. Independent replication by other teams and controlled studies with baseline comparisons are needed.
Operator Bias
One individual conducted all detailed case studies and most adoption deployments, limiting generalizability of speedup claims to other operators.
LLM-Scored Validation
The 20 validation metrics rely on LLM-based assessment without inter-rater reliability checks. Automated quality scores may exhibit biases, inconsistency across runs, or susceptibility to gaming.
Dependency Structure
Speedup depends on the task DAG. The 5.3x result came from an embarrassingly parallel wave structure. Projects with deep dependency chains see lower speedups (1.3–3.0x in the adoption survey).
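The dependence on DAG shape follows from the standard work/span bound: with total work T1 (the sum of task durations) and critical-path length T∞, speedup cannot exceed T1 / T∞, and with k workers it also cannot exceed k. The sketch below computes that bound for two toy graphs. It assumes an acyclic graph and uses hypothetical unit durations, so it illustrates the ceiling rather than reproducing any measured result in this report.

```python
def critical_path_length(durations, deps):
    """Longest duration-weighted dependency chain. Assumes deps is acyclic."""
    memo = {}

    def finish(task):
        if task not in memo:
            longest_prereq = max((finish(d) for d in deps.get(task, ())), default=0.0)
            memo[task] = durations[task] + longest_prereq
        return memo[task]

    return max(finish(task) for task in durations)


def speedup_bound(durations, deps, workers=None):
    """Upper bound on speedup: work / span, optionally capped by the worker count."""
    work = sum(durations.values())
    span = critical_path_length(durations, deps)
    bound = work / span
    return bound if workers is None else min(bound, float(workers))


# Wide graph: 12 independent unit tasks plus one task that needs all of them.
wide = {f"t{i}": 1.0 for i in range(13)}
wide_deps = {"t12": {f"t{i}" for i in range(12)}}

# Deep graph: four unit tasks in a strict chain.
chain = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
chain_deps = {"b": {"a"}, "c": {"b"}, "d": {"c"}}

print(speedup_bound(wide, wide_deps))             # 6.5
print(speedup_bound(chain, chain_deps))           # 1.0
print(speedup_bound(wide, wide_deps, workers=2))  # 2.0
```

Real runs land below these ceilings because of scheduling overhead, uneven task durations, and rework.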
Cost Estimates
Token costs are approximate and vary significantly by task complexity and codebase size. Infrastructure, human oversight time, and rework loop costs are not comprehensively included.