SPOQ Performance Benchmarks

Four controlled experiments plus a 17-repository deployment study. Cross-validated against both Claude (frontier cloud) and Qwen3.6-35B-A3B (local open-weights).

Read the full paper (PDF, 55pp)

Aggregate Results

17 repositories across web, backend, infrastructure, and tooling domains, Nov 2025 – Mar 2026

14.3x

Algorithmic speedup
(synthetic DAG ceiling)

99.87%

Aggregate test pass rate

1,822

Tasks completed
across 183 epics

13,866

Tests executed
(591 suites)

Controlled Benchmarks

Four experiments testing scheduling efficiency, planning quality, validation effectiveness, and human-AI collaboration. All four were replicated against a locally hosted Qwen3.6-35B-A3B model to verify gains come from orchestration rather than model capability.

Exp 1 — DAG Scheduling

14.3x · 1.4x

Wave-based dispatch vs sequential

Algorithmic ceiling (synthetic-sleep DAGs, unbounded): critical-path ratio 1.03–1.11, speedup up to 14.3x on a 20-task fully-parallel DAG.

Hardware floor (real Qwen LLM calls, 2-slot llama-server): stable 1.4x across all five DAG configurations, matching the slot ceiling.
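Wave-based dispatch can be illustrated with a small sketch. This is not SPOQ's scheduler; it is a minimal layered topological sort, assuming tasks are given as a hypothetical `deps` mapping from task name to the set of tasks it depends on. The helper names `compute_waves` and `ideal_speedup` are illustrative only.

```python
def compute_waves(deps):
    """Group tasks into waves; each task runs in the first wave after all of its dependencies."""
    # deps maps every task name to the set of task names it depends on.
    # Assumption: every task, including dependency-free ones, appears as a key.
    remaining = {task: set(d) for task, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(task for task, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for d in remaining.values():
            d.difference_update(ready)
    return waves


def ideal_speedup(waves, duration):
    """Sequential time divided by wave-by-wave time, ignoring dispatch overhead and slot limits."""
    sequential = sum(duration[t] for wave in waves for t in wave)
    parallel = sum(max(duration[t] for t in wave) for wave in waves)
    return sequential / parallel


# Hypothetical 13-task shape: 12 independent tasks plus one QA task that needs them all.
components = [f"task-{i:02d}" for i in range(1, 13)]
deps = {name: set() for name in components}
deps["qa"] = set(components)

waves = compute_waves(deps)
unit = {name: 1.0 for name in deps}
print([len(w) for w in waves])      # [12, 1]
print(ideal_speedup(waves, unit))   # 6.5
```

The 6.5 figure follows only from the assumed equal task durations and unlimited slots. Real runs add dispatch overhead and are capped by available slots, which is the gap between the algorithmic ceiling and the hardware floor described above.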

Exp 2 — Planning Quality

+25 to +35 pts coverage

SPOQ-guided planning vs baseline

Claude: coverage 93.0 → 99.75, parallelism 31.0 → 75.25, cyclic plans 3/4 → 0/4.

Qwen: coverage 56.2 → 91.2 (+35 pts), parallelism 12.5 → 65.0, cycles 2/4 → 0/4. SPOQ-on-Qwen lands within 1.8 points of Claude's unaided baseline.

Exp 3 — Validation Gates

91.25% → 99.75% pass rate

Dual-gate ablation (No Val / Code Val / Full SPOQ)

Defects per task fall 0.34 → 0.20, test pass rate rises 91.25% → 99.75%, static warnings drop 4.25 → 0, rework cycles drop 3.75 → 1.0.

Monotonic ordering Full SPOQ > Code-Val > No-Val replicates exactly under Qwen.

Exp 4 — Human-as-Agent

16x fewer defects

Auto SPOQ vs Human-assisted SPOQ

Residual defects fall 0.47 → 0.03 per task (16x reduction), test pass rate rises 96.5% → 99.75%, security issues 2.5 → 1.25 after fixes.

Qwen planning-quality replication: coverage +6.25 pts, dependency errors −40%.

Field Evidence — Case Study Summary

Two detailed case studies (one internal, one external) with full wave structures, plus aggregate metrics across the adoption survey. These are field measurements, not benchmark conditions — see the controlled experiments above for matched-baseline comparisons.

Metric	UI Epic	Rebrand Epic	Adoption (agg.)
Tasks	13	12	1,446
Waves	2	4	varies
Max Parallelism	12	5	varies
Speedup Factor	5.3x	2.8x	varies
Completion Rate	92%	100%	100%
Orchestrator Interventions	1	3	varies
Rework Cycles	2	0	0–1
Test Cases	—	174	varies
Avg. Confidence	—	0.92	0.90–0.95

Case Study 1: UI Improvements (Internal)

Execution Metrics

Monitoring dashboard modernization with toast notifications, data tables, and API key management

Total Tasks	13
Wave 0 Parallelism	12 concurrent agents
Wave 1 Parallelism	1 agent
Sequential Estimate	18.5 hours
Parallel Wall-Clock	3.5 hours
Speedup Factor	5.3x
First-Pass Rate	12 of 13 (92%)
Rework Cycles	2 (tasks 04, 07)

Wave Breakdown

Near-embarrassingly parallel structure enabled maximum speedup

Wave 0: 12 tasks

Independent component implementations: Sonner setup, DataTable component, modal dialogs, toast integrations, search functionality, and tests. All executed concurrently.

Wave 1: 1 task

End-to-end QA depending on all Wave 0 tasks. Final integration verification after all components completed.

Failure Mode Encountered

Task 04 entered a runaway retry loop, executing npm install sonner over 100 times. SPOQ now enforces a 3-retry maximum with pre-installation verification.
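A retry cap with a pre-check might look like the sketch below. It is an illustration rather than SPOQ's implementation: the function name `run_with_retry_cap` and its parameters are hypothetical, and it assumes "3-retry maximum" means at most three attempts in total.

```python
def run_with_retry_cap(action, is_satisfied, max_retries=3):
    """Run action at most max_retries times; skip it entirely if the goal is already met."""
    if is_satisfied():
        return 0  # pre-check passed, nothing to run
    for attempt in range(1, max_retries + 1):
        action()
        if is_satisfied():
            return attempt  # number of attempts that were needed
    raise RuntimeError(f"gave up after {max_retries} attempts")


# Simulated failure: the install command never makes the check pass.
calls = []

def fake_install():
    calls.append("npm install sonner")  # recorded only, not executed

def never_installed():
    return False

try:
    run_with_retry_cap(fake_install, never_installed)
except RuntimeError as err:
    print(err)         # gave up after 3 attempts
print(len(calls))      # 3
```

Checking the goal before the first attempt means a package that is already installed never triggers the install command at all.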

Case Study 2: Client Website Rebrand (External)

Execution Metrics

B2B sales website rebrand from founder-centric to developer-focused messaging

Total Tasks	12
Max Parallelism	5 concurrent agents (Wave 1)
Sequential Estimate	18 hours
Parallel Wall-Clock	6.5 hours
Speedup Factor	2.8x
Completion Rate	12 of 12 (100%)
Orchestrator Interventions	3
Test Suites / Cases	20 / 174 (0 failures)

Wave Breakdown

Deeper dependency chain limited parallelism but achieved zero code defects

Wave 0: 2 tasks

Content removal and routing scaffold (parallel).

Wave 1: 5 tasks

Homepage rewrite, pricing update, and three persona-specific landing pages (parallel).

Wave 2: 3 tasks

Navigation, SEO, and section refinement (parallel).

Wave 3: 2 tasks

Test expansion and accessibility polish (parallel). Test coverage grew from 134 to 174 tests.

Key Challenge

Three orchestrator interventions for test fixture synchronization—parallel agents' code changes invalidated sibling test assertions. SPOQ now recommends treating test files as implicit dependents of the components they exercise.
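One way to encode that recommendation is to add dependency edges before waves are computed. The sketch below is an assumption about how this could be done, not SPOQ's actual mechanism. The inputs `touches` (task to files it edits) and `exercises` (test file to the component files it covers) are hypothetical, as are the task and file names.

```python
def add_implicit_test_deps(deps, touches, exercises):
    """Return a copy of deps where tasks editing a test file wait for tasks editing its components."""
    # Index which tasks edit each file.
    owners = {}
    for task, files in touches.items():
        for path in files:
            owners.setdefault(path, set()).add(task)

    result = {task: set(d) for task, d in deps.items()}
    for task, files in touches.items():
        for path in files:
            # exercises is only populated for test files; other files add nothing.
            for component in exercises.get(path, ()):
                upstream = owners.get(component, set()) - {task}
                result.setdefault(task, set()).update(upstream)
    return result


deps = {"hero-rewrite": set(), "test-expansion": set()}
touches = {
    "hero-rewrite": {"Hero.tsx"},
    "test-expansion": {"Hero.test.tsx"},
}
exercises = {"Hero.test.tsx": {"Hero.tsx"}}

print(add_implicit_test_deps(deps, touches, exercises))
# {'hero-rewrite': set(), 'test-expansion': {'hero-rewrite'}}
```

With the extra edge in place, the test task lands in a later wave than the component change it depends on. The sketch does not check whether the added edges introduce a cycle, so cycle detection would still need to run afterwards.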

Multi-Project Adoption Survey

Beyond the two detailed case studies, SPOQ has been deployed across 17 repositories spanning web platforms, backend services, infrastructure, and developer tooling. The full observation period from November 2025 through March 2026 logged 8,589 commits, 894,664 lines of code, 183 completed epics, and 1,822 completed tasks. On a live snapshot dated 2026-03-21 the projects collectively executed 13,866 tests with an aggregate 99.87% pass rate across 591 suites. The table below shows the ten highlighted deployments.

Project	Domain	Epics	Tasks	Stack
Speedrun	Sales intelligence SaaS	45	794	Next.js, Rust, AWS
Portfolio	Personal brand sites	23	247	Next.js, Rust Lambda
Pinpoint	CI/CD testing platform	17	182	Spring Boot, Terraform
SPOQ	Methodology & tooling	13	124	Next.js, Rust, Python
Railroad	Terminal tooling	4	43	Bash, Next.js
Savvy Expat	Consulting site	4	30	Next.js, Docker
Longship	PKI SaaS platform	2	20	Next.js, Stripe
Super-SPOQ	Project template	1	12	Next.js
Beam Chat	Chat application	1	10	Next.js
Blade	Multiplayer terminal	1	9	MCP, WebRTC

Execution Velocity

Sustained throughput demonstration from a single three-hour session

The Pinpoint Rebrand epic (12 tasks) and a companion analytics epic (7 tasks) were both planned and executed in a single three-hour session using six concurrent Claude Code instances under a single Max license. The 19 combined tasks were completed from cold start to full verification between 4:00 AM and 7:00 AM, yielding a sustained rate of approximately 6 tasks per hour.

Token Cost Analysis

Per-Token API Pricing

Pay-as-you-go model based on token consumption

Model	Input	Output	SPOQ Role
Opus	$15/M	$75/M	Worker
Sonnet	$3/M	$15/M	Reviewer
Haiku	$0.25/M	$1.25/M	Investigator

Typical Opus worker task: ~25K input + ~5K output tokens = ~$1.95/task

13-task epic (UI Improvements scale): ~$28 total worker cost
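The per-token arithmetic can be sketched as below, using the rates from the pricing table. The function name `request_cost` is hypothetical, and the sketch prices a single request only. A single 25K-input / 5K-output Opus call comes to $0.75 at these rates, so the ~$1.95 per-task estimate implies more token traffic per task than one such call. The source does not itemize that difference.

```python
# USD per million tokens (input, output), taken from the pricing table above.
RATES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


single_call = request_cost("opus", 25_000, 5_000)
print(f"${single_call:.2f}")  # $0.75
```

Swapping the model key shows why review and triage are cheaper on the lower tiers: the same token counts cost $0.15 on Sonnet.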

Flat-Rate Max Plan

$200/month with 20x cost reduction at scale

Monthly cost	$200 fixed
Concurrent instances	Unlimited
Usage metering	Two buckets (Opus / non-Opus)
Daily capacity	50–100 tasks
Effective per-task cost at scale	~$0.10
Cost reduction vs. API	~20x

SPOQ's three-tier hierarchy is an economic optimization: Opus tokens reserved for task execution while validation and triage route through Sonnet and Haiku.
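The flat-rate per-task figure can be sanity-checked with a short calculation. This is a rough sketch that assumes a 30-day month and that the stated daily capacity is used every day. Neither assumption comes from the source, and the function name is illustrative.

```python
def flat_rate_cost_per_task(monthly_fee, tasks_per_day, days_per_month=30):
    """Effective USD per task when a fixed monthly fee is spread over all tasks run."""
    return monthly_fee / (tasks_per_day * days_per_month)


low = flat_rate_cost_per_task(200, 100)   # 100 tasks per day
high = flat_rate_cost_per_task(200, 50)   # 50 tasks per day
print(f"${low:.3f} to ${high:.3f} per task")  # $0.067 to $0.133 per task

# Ratio of the stated ~$1.95 API cost per task to the ~$0.10 flat-rate figure.
print(round(1.95 / 0.10, 1))  # 19.5
```

The ~$0.10 figure sits in the middle of that range, and dividing the ~$1.95 API estimate by it gives roughly the 20x reduction quoted. Lower utilization raises the effective per-task cost proportionally.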

Pricing as of early 2026; subject to change. Token estimates vary by task complexity, codebase size, and context requirements.

Measurement Methodology

How Speedup Is Measured

Speedup factor = sequential time estimate / parallel wall-clock time. The sequential estimate is the sum of individual task durations as if executed one after another. The parallel wall-clock time is the actual elapsed time with concurrent agents executing within computed waves.
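Applied to the two case studies, the definition reproduces the reported factors. The function below is a direct transcription of the formula; only the name `speedup` is an addition.

```python
def speedup(sequential_hours, wall_clock_hours):
    """Sequential time estimate divided by parallel wall-clock time."""
    if wall_clock_hours <= 0:
        raise ValueError("wall-clock time must be positive")
    return sequential_hours / wall_clock_hours


ui_epic = speedup(18.5, 3.5)       # Case Study 1
rebrand_epic = speedup(18.0, 6.5)  # Case Study 2
print(round(ui_epic, 1))       # 5.3
print(round(rebrand_epic, 1))  # 2.8
```

Because the numerator is an estimate rather than a measured sequential run, the factor inherits whatever error is in that estimate.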

Quality Validation

All deployments were scored against SPOQ's dual validation framework: 10 planning metrics (95/90 threshold) and 10 code metrics (95/80 threshold), where the first number is the minimum average score and the second is the per-metric minimum. Agent confidence scores are self-reported on a 0.0–1.0 scale and cross-referenced with validation gate results.

Task Completion

A task is "completed" when it passes the code validation gate (average score ≥ 95, per-metric minimum ≥ 80). Tasks requiring rework cycles are counted as completed once they pass validation, but the rework is tracked separately. "First-pass rate" measures tasks that pass validation without any rework.
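The gate rule can be written as a small predicate. This is a minimal sketch of the stated thresholds, not SPOQ's scoring code. The function name `passes_gate` is hypothetical and the scores are made up for illustration.

```python
def passes_gate(scores, min_average=95.0, min_metric=80.0):
    """True when the average meets min_average and no single metric falls below min_metric."""
    if not scores:
        return False  # assumption: a task with no scores cannot pass
    average = sum(scores) / len(scores)
    return average >= min_average and min(scores) >= min_metric


strong = [100] * 9 + [80]      # average 98, lowest metric 80
weak_metric = [100] * 9 + [79] # average 97.9, lowest metric 79
low_average = [94] * 10        # average 94, lowest metric 94

print(passes_gate(strong))       # True
print(passes_gate(weak_metric))  # False
print(passes_gate(low_average))  # False
```

Both conditions must hold, so one weak metric blocks a task even when the average is high. Reading the planning gate's "95/90" the same way, it would be `passes_gate(scores, 95.0, 90.0)`.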

Limitations

Sample Size

n=2 practitioners. All deployments share a single primary author. Independent replication by other teams and controlled studies with baseline comparisons are needed.

Operator Bias

One individual conducted all detailed case studies and most adoption deployments, limiting generalizability of speedup claims to other operators.

LLM-Scored Validation

The 20 validation metrics rely on LLM-based assessment without inter-rater reliability. Automated quality scores may exhibit biases, inconsistency across runs, or susceptibility to gaming.

Dependency Structure

Speedup depends on the task DAG. The 5.3x result came from an embarrassingly parallel wave structure. Projects with deep dependency chains see lower speedups (1.3–3.0x in the adoption survey).

Cost Estimates

Token costs are approximate and vary significantly by task complexity and codebase size. Infrastructure, human oversight time, and rework loop costs are not comprehensively included.