SPOQ Performance Benchmarks
Experimental results spanning a 1.3–5.3x speedup range across two case studies and seven adoption projects
Evaluation Summary
| Metric | UI Epic | Rebrand Epic | Adoption (agg.) |
|---|---|---|---|
| Tasks | 13 | 12 | 122 |
| Waves | 2 | 4 | varies |
| Max Parallelism | 12 | 5 | 4 |
| Speedup Factor | 5.3x | 2.8x | 1.3–3.0x |
| Completion Rate | 92% | 100% | 100% |
| Orchestrator Interventions | 1 | 3 | varies |
| Rework Cycles | 2 | 0 | 0–1 |
| Test Cases | — | 174 | 295 |
| Avg. Confidence | — | 0.92 | 0.90–0.95 |
Case Study 1: UI Improvements (Internal)
| Metric | Value |
|---|---|
| Total Tasks | 13 |
| Wave 0 Parallelism | 12 concurrent agents |
| Wave 1 Parallelism | 1 agent |
| Sequential Estimate | 18.5 hours |
| Parallel Wall-Clock | 3.5 hours |
| Speedup Factor | 5.3x |
| First-Pass Rate | 12 of 13 (92%) |
| Rework Cycles | 2 (tasks 04, 07) |
Wave 0: Independent component implementations (Sonner setup, DataTable component, modal dialogs, toast integrations, search functionality, and tests), all executed concurrently.
Wave 1: End-to-end QA depending on all Wave 0 tasks; final integration verification after all components completed.
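The wave structure described above can be derived mechanically from the task dependency graph. A minimal sketch (function and task names here are illustrative, not SPOQ's actual API), using Kahn-style leveling:

```python
def compute_waves(deps):
    """Group tasks into execution waves: a task is scheduled in the
    first wave after all of its dependencies have been scheduled."""
    scheduled = set()
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= scheduled)
        if not ready:
            raise ValueError("dependency cycle detected")
        for t in ready:
            scheduled.add(t)
            del remaining[t]
        waves.append(ready)
    return waves

# UI Improvements epic shape: 12 independent tasks, then one QA task
deps = {f"task{i:02d}": [] for i in range(1, 13)}
deps["qa"] = list(deps)  # end-to-end QA depends on all Wave 0 tasks
waves = compute_waves(deps)
# waves[0] holds 12 concurrent tasks; waves[1] holds only "qa"
```

The 5.3x result follows directly from this shape: 12 of the 13 tasks share a single wave, so wall-clock time approaches the longest task plus the QA pass rather than the sum of all tasks.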
Failure Mode Encountered
Task 04 entered a runaway retry loop, executing `npm install sonner` over 100 times. SPOQ now enforces a 3-retry maximum with pre-installation verification.
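A bounded retry policy of the kind SPOQ now enforces can be sketched as follows (the function and callback names are illustrative, not SPOQ's actual interface):

```python
MAX_RETRIES = 3  # hard cap on install attempts

def ensure_installed(package, is_installed, install):
    """Try installing at most MAX_RETRIES times, checking before each
    attempt whether the package is already present (pre-installation
    verification) so a successful install ends the loop early."""
    for _ in range(MAX_RETRIES):
        if is_installed(package):
            return True
        install(package)
    return is_installed(package)

# Simulated flaky installer: the second attempt succeeds
state = {"installed": False, "attempts": 0}
def flaky_install(pkg):
    state["attempts"] += 1
    if state["attempts"] >= 2:
        state["installed"] = True

ok = ensure_installed("sonner", lambda p: state["installed"], flaky_install)
```

The pre-check is what breaks the runaway loop: once the package is present, no further install commands are issued regardless of remaining retry budget.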
Case Study 2: Client Website Rebrand (External)
| Metric | Value |
|---|---|
| Total Tasks | 12 |
| Max Parallelism | 5 concurrent agents (Wave 1) |
| Sequential Estimate | 18 hours |
| Parallel Wall-Clock | 6.5 hours |
| Speedup Factor | 2.8x |
| Completion Rate | 12 of 12 (100%) |
| Orchestrator Interventions | 3 |
| Test Suites / Cases | 20 / 174 (0 failures) |
Wave 0: Content removal and routing scaffold (parallel).
Wave 1: Homepage rewrite, pricing update, and three persona-specific landing pages (parallel).
Wave 2: Navigation, SEO, and section refinement (parallel).
Wave 3: Test expansion and accessibility polish (parallel). Test coverage grew from 134 to 174 tests.
Key Challenge
Three orchestrator interventions were needed for test fixture synchronization: parallel agents' code changes invalidated sibling test assertions. SPOQ now recommends treating test files as implicit dependents of the components they exercise.
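One way to encode that recommendation is to augment the dependency graph so each test task lists the component task it covers as a prerequisite, keeping them out of the same wave. A minimal sketch (data shapes and names are hypothetical):

```python
def add_test_dependencies(deps, test_map):
    """Return a copy of the task graph where each test task depends on
    the component task it exercises, so a wave scheduler never runs
    them concurrently.

    deps:     task -> list of prerequisite tasks
    test_map: test task -> component task it covers
    """
    out = {t: list(d) for t, d in deps.items()}
    for test_task, component in test_map.items():
        out.setdefault(test_task, [])
        if component not in out[test_task]:
            out[test_task].append(component)
    return out

deps = {"homepage": [], "pricing": [], "homepage_tests": []}
augmented = add_test_dependencies(deps, {"homepage_tests": "homepage"})
# "homepage_tests" now waits for "homepage" instead of sharing its wave
```

The trade-off is reduced parallelism: each implicit edge pushes a test task at least one wave later, which is exactly why these interventions happened under the original, flatter schedule.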
Multi-Project Adoption Survey
Beyond the two detailed case studies, SPOQ was deployed across seven projects by two practitioners, spanning distinct technology stacks and problem domains. Across all 122 tasks, average agent confidence scores ranged from 0.90 to 0.95, with a 100% task completion rate.
| Project | Domain | Tasks | Tests | Stack |
|---|---|---|---|---|
| Savvy Expat | E-commerce | 10 | 154 | Next.js, Docker |
| Railroad OS | Linux tooling | 43 | 55 | Bash, i3 WM |
| SPOQ Website | Documentation | 23 | 18 | Next.js, Terraform |
| Pinpoint Platform | Backend API | 16 | 308 | Spring Boot, Java |
| Pinpoint Infra | Cloud infra | 17 | — | Terraform, AWS |
| Pinpoint Analytics | Tracking | 7 | — | Next.js, GA4 |
| Pinpoint Billing | Payments | 6 | — | Spring Boot, Stripe |
The Pinpoint Rebrand epic (12 tasks) and a companion analytics epic (7 tasks) were both planned and executed in a single three-hour session using six concurrent Claude Code instances under a single Max license. The 19 combined tasks were completed from cold start to full verification between 4:00 AM and 7:00 AM, yielding a sustained rate of approximately 6 tasks per hour.
Token Cost Analysis
| Model | Input | Output | SPOQ Role |
|---|---|---|---|
| Opus | $15/M | $75/M | Worker |
| Sonnet | $3/M | $15/M | Reviewer |
| Haiku | $0.25/M | $1.25/M | Investigator |
Typical Opus worker task: ~25K input + ~5K output tokens. At $15/M input and $75/M output, that is ~$0.375 + ~$0.375 ≈ $0.75 per task.
13-task epic (UI Improvements scale): ~$10 total worker cost.
SPOQ's three-tier hierarchy is an economic optimization: Opus tokens reserved for task execution while validation and triage route through Sonnet and Haiku.
Pricing as of early 2026; subject to change. Token estimates vary by task complexity, codebase size, and context requirements.
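Recomputing from the rates in the pricing table, a per-task cost estimator might look like this sketch (the token counts are the illustrative figures quoted above; actual usage varies widely):

```python
# USD per million tokens (input, output), from the pricing table above
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def task_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of a single task at per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

worker = task_cost("opus", 25_000, 5_000)  # 0.375 + 0.375 = $0.75 per task
epic = 13 * worker                          # ≈ $9.75 for a 13-task epic
```

The same function applied with Sonnet or Haiku rates shows the economic rationale for the three-tier hierarchy: a Haiku triage pass on the same token volume costs roughly two orders of magnitude less than an Opus worker pass.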
Measurement Methodology
How Speedup Is Measured
Speedup factor = sequential time estimate / parallel wall-clock time. The sequential estimate is the sum of individual task durations as if executed one after another. The parallel wall-clock time is the actual elapsed time with concurrent agents executing within computed waves.
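Applied to the two case studies, the formula reproduces the headline figures:

```python
def speedup(sequential_hours, wall_clock_hours):
    """Speedup factor = sequential time estimate / parallel wall-clock time."""
    return sequential_hours / wall_clock_hours

ui_epic = speedup(18.5, 3.5)       # Case Study 1: UI Improvements
rebrand_epic = speedup(18.0, 6.5)  # Case Study 2: Client Rebrand
print(round(ui_epic, 1), round(rebrand_epic, 1))  # 5.3 2.8
```

Note that the numerator is an estimate, not a measured baseline, which is one of the limitations discussed below.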
Quality Validation
All deployments were scored against SPOQ's dual validation framework: 10 planning metrics (average ≥ 95, per-metric minimum ≥ 90) and 10 code metrics (average ≥ 95, per-metric minimum ≥ 80). Agent confidence scores are self-reported on a 0.0–1.0 scale and cross-referenced with validation gate results.
Task Completion
A task is "completed" when it passes the code validation gate (average score ≥ 95, per-metric minimum ≥ 80). Tasks requiring rework cycles are counted as completed once they pass validation, but the rework is tracked separately. "First-pass rate" measures tasks that pass validation without any rework.
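The completion criterion reduces to a simple gate check. This sketch parameterizes the thresholds so one function covers both the code gate (95/80) and the planning gate (95/90); the function name is illustrative:

```python
def passes_gate(scores, avg_threshold=95, per_metric_min=80):
    """A task passes when the average of its metric scores meets the
    average threshold AND no single metric falls below the floor."""
    return (sum(scores) / len(scores) >= avg_threshold
            and min(scores) >= per_metric_min)

code_ok = passes_gate([98, 96, 95, 97, 99, 94, 96, 95, 93, 97])  # avg 96, min 93
gamed = passes_gate([100] * 9 + [79])                            # avg 97.9, min 79
# planning gate: passes_gate(scores, per_metric_min=90)
```

The per-metric floor matters: as the second example shows, a task can clear the average threshold while one metric fails badly, and the dual condition catches that case.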
Limitations
Sample Size
n=2 practitioners. All deployments share a single primary author. Independent replication by other teams and controlled studies with baseline comparisons are needed.
Operator Bias
One individual conducted all detailed case studies and most adoption deployments, limiting generalizability of speedup claims to other operators.
LLM-Scored Validation
The 20 validation metrics rely on LLM-based assessment without inter-rater reliability. Automated quality scores may exhibit biases, inconsistency across runs, or susceptibility to gaming.
Dependency Structure
Speedup depends on the task DAG. The 5.3x result came from an embarrassingly parallel wave structure. Projects with deep dependency chains see lower speedups (1.3–3.0x in the adoption survey).
Cost Estimates
Token costs are approximate and vary significantly by task complexity and codebase size. Infrastructure, human oversight time, and rework loop costs are not comprehensively included.