
SPOQ Performance Benchmarks

Experimental results spanning a 1.3–5.3x speedup range across two detailed case studies and seven adoption projects

Aggregate Results
Across all completed deployments by two independent practitioners:

- Speedup Range: 1.3–5.3x
- Detailed Case Studies: 2
- Adoption Projects: 7
- Total Tasks Executed: 122

Evaluation Summary

| Metric | UI Epic | Rebrand Epic | Adoption (agg.) |
|---|---|---|---|
| Tasks | 13 | 12 | 92 |
| Waves | 2 | 4 | varies |
| Max Parallelism | 12 | 5 | 4 |
| Speedup Factor | 5.3x | 2.8x | 1.3–3.0x |
| Completion Rate | 92% | 100% | 100% |
| Orchestrator Interventions | 1 | 3 | varies |
| Rework Cycles | 2 | 0 | 0–1 |
| Test Cases | — | 174 | 295 |
| Avg. Confidence | 0.92 | — | 0.90–0.95 |

Case Study 1: UI Improvements (Internal)

Execution Metrics
Monitoring dashboard modernization with toast notifications, data tables, and API key management
- Total Tasks: 13
- Wave 0 Parallelism: 12 concurrent agents
- Wave 1 Parallelism: 1 agent
- Sequential Estimate: 18.5 hours
- Parallel Wall-Clock: 3.5 hours
- Speedup Factor: 5.3x
- First-Pass Rate: 12 of 13 (92%)
- Rework Cycles: 2 (tasks 04, 07)
Wave Breakdown
Near-embarrassingly parallel structure enabled maximum speedup
Wave 0 (12 tasks)

Independent component implementations: Sonner setup, DataTable component, modal dialogs, toast integrations, search functionality, and tests. All executed concurrently.

Wave 1 (1 task)

End-to-end QA depending on all Wave 0 tasks. Final integration verification after all components completed.
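A task's wave is simply its topological depth in the dependency DAG: tasks with no dependencies land in Wave 0 and run concurrently, and each downstream task runs one wave after its deepest dependency. A minimal sketch of that computation, using hypothetical task names and a plain dict-of-dependencies representation (not SPOQ's actual internals):

```python
# Compute execution waves from a dependency DAG.
# A task's wave is 1 + the max wave of its dependencies, so
# independent tasks land in Wave 0 and run concurrently.
def compute_waves(deps: dict[str, set[str]]) -> dict[str, int]:
    waves: dict[str, int] = {}

    def wave_of(task: str) -> int:
        if task not in waves:
            waves[task] = 1 + max(
                (wave_of(d) for d in deps[task]), default=-1
            )
        return waves[task]

    for task in deps:
        wave_of(task)
    return waves

# Hypothetical slice of the UI epic: 12 independent tasks, then one QA task.
deps = {f"task-{i:02d}": set() for i in range(1, 13)}
deps["qa-e2e"] = set(deps)  # QA depends on every Wave 0 task

waves = compute_waves(deps)  # task-01..task-12 -> wave 0, qa-e2e -> wave 1
```

With this structure, Wave 0 parallelism is 12 and the QA task is the only Wave 1 entry, matching the breakdown above.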

Failure Mode Encountered

Task 04 entered a runaway retry loop, executing npm install sonner over 100 times. SPOQ now enforces a 3-retry maximum with pre-installation verification.
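The mitigation described here (a hard retry ceiling plus a check that the work is not already done) can be sketched as follows; `run_with_retry_cap` and its callbacks are illustrative, not SPOQ's actual API:

```python
MAX_RETRIES = 3  # hard ceiling: abort and surface the failure, never loop forever

def run_with_retry_cap(action, verify, max_retries=MAX_RETRIES) -> bool:
    """Attempt `action` at most `max_retries` times, checking `verify()`
    first so already-satisfied work (e.g. a package that is already
    installed) is never re-executed."""
    for _ in range(max_retries):
        if verify():      # pre-execution verification
            return True
        action()          # e.g. shell out to `npm install sonner`
    return verify()

# Simulate the Task 04 failure mode: an install that never succeeds.
calls = []
ok = run_with_retry_cap(action=lambda: calls.append(1), verify=lambda: False)
# The action runs exactly 3 times (not 100+), and the failure is surfaced.
```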

Case Study 2: Client Website Rebrand (External)

Execution Metrics
B2B sales website rebrand from founder-centric to developer-focused messaging
- Total Tasks: 12
- Max Parallelism: 5 concurrent agents (Wave 1)
- Sequential Estimate: 18 hours
- Parallel Wall-Clock: 6.5 hours
- Speedup Factor: 2.8x
- Completion Rate: 12 of 12 (100%)
- Orchestrator Interventions: 3
- Test Suites / Cases: 20 / 174 (0 failures)
Wave Breakdown
Deeper dependency chain limited parallelism but achieved zero code defects
Wave 0 (2 tasks)

Content removal and routing scaffold (parallel).

Wave 1 (5 tasks)

Homepage rewrite, pricing update, and three persona-specific landing pages (parallel).

Wave 2 (3 tasks)

Navigation, SEO, and section refinement (parallel).

Wave 3 (2 tasks)

Test expansion and accessibility polish (parallel). Test coverage grew from 134 to 174 tests.

Key Challenge

All three orchestrator interventions addressed test fixture synchronization: parallel agents' code changes invalidated sibling test assertions. SPOQ now recommends treating test files as implicit dependents of the components they exercise.
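That recommendation amounts to a DAG-augmentation pass before wave computation: every test task gains a dependency edge on the tasks that modify the components it exercises, pushing it into a later wave. A sketch with hypothetical task names:

```python
def add_implicit_test_edges(deps: dict, test_map: dict) -> dict:
    """Return a new dependency map where each test task additionally
    depends on the tasks that modify the components it exercises."""
    augmented = {task: set(d) for task, d in deps.items()}
    for test_task, component_tasks in test_map.items():
        augmented.setdefault(test_task, set()).update(component_tasks)
    return augmented

# Hypothetical rebrand tasks: a homepage rewrite and its test update,
# originally scheduled as independent (hence the fixture breakage).
deps = {"rewrite-homepage": set(), "update-homepage-tests": set()}
test_map = {"update-homepage-tests": {"rewrite-homepage"}}

augmented = add_implicit_test_edges(deps, test_map)
# The test task now runs in a wave after the component it exercises.
```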

Multi-Project Adoption Survey

Beyond the two detailed case studies, SPOQ was deployed across 7 projects by two practitioners, spanning distinct technology stacks and problem domains. Across all 122 tasks, average agent confidence scores ranged from 0.90 to 0.95 with 100% task completion rates.

| Project | Domain | Tasks | Tests | Stack |
|---|---|---|---|---|
| Savvy Expat | E-commerce | 10 | 154 | Next.js, Docker |
| Railroad OS | Linux tooling | 43 | 55 | Bash, i3 WM |
| SPOQ Website | Documentation | 23 | 18 | Next.js, Terraform |
| Pinpoint Platform | Backend API | 16 | 308 | Spring Boot, Java |
| Pinpoint Infra | Cloud infra | 17 | — | Terraform, AWS |
| Pinpoint Analytics | Tracking | 7 | — | Next.js, GA4 |
| Pinpoint Billing | Payments | 6 | — | Spring Boot, Stripe |
Execution Velocity
Sustained throughput demonstrated in a single three-hour session

The Pinpoint Rebrand epic (12 tasks) and a companion analytics epic (7 tasks) were both planned and executed in a single three-hour session using six concurrent Claude Code instances under a single Max license. The 19 combined tasks were completed from cold start to full verification between 4:00 AM and 7:00 AM, yielding a sustained rate of approximately 6 tasks per hour.

Token Cost Analysis

Per-Token API Pricing
Pay-as-you-go model based on token consumption
| Model | Input | Output | SPOQ Role |
|---|---|---|---|
| Opus | $15/M | $75/M | Worker |
| Sonnet | $3/M | $15/M | Reviewer |
| Haiku | $0.25/M | $1.25/M | Investigator |

Typical Opus worker task: ~25K input + ~5K output tokens per model call (≈$0.75 at the rates above); across the multiple calls a task typically requires, ~$1.95/task

13-task epic (UI Improvements scale): ~$28 total worker cost
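Under the rates in the table above, per-call cost is a linear function of token counts. Note that a single 25K-in / 5K-out Opus call prices out to about $0.75, which suggests the ~$1.95/task figure covers several calls per task. A minimal estimator (the rate table is transcribed from this page; pricing is subject to change):

```python
# Per-million-token rates (USD) from the pricing table above.
RATES = {
    "opus":   (15.00, 75.00),   # Worker
    "sonnet": (3.00, 15.00),    # Reviewer
    "haiku":  (0.25, 1.25),     # Investigator
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single model call at pay-as-you-go rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

single_call = call_cost("opus", 25_000, 5_000)  # 0.375 + 0.375 = $0.75
```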

Flat-Rate Max Plan
$200/month with 20x cost reduction at scale
- Monthly cost: $200 fixed
- Concurrent instances: Unlimited
- Usage metering: Two buckets (Opus / non-Opus)
- Daily capacity: 50–100 tasks
- Effective per-task cost at scale: ~$0.10
- Cost reduction vs. API: ~20x

SPOQ's three-tier hierarchy is an economic optimization: Opus tokens are reserved for task execution, while validation and triage route through the cheaper Sonnet and Haiku tiers.

Pricing as of early 2026; subject to change. Token estimates vary by task complexity, codebase size, and context requirements.

Measurement Methodology

How Speedup Is Measured

Speedup factor = sequential time estimate / parallel wall-clock time. The sequential estimate is the sum of individual task durations as if executed one after another. The parallel wall-clock time is the actual elapsed time with concurrent agents executing within computed waves.
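This definition reproduces the headline figures directly (hours taken from the case-study metrics above):

```python
def speedup(sequential_hours: float, wall_clock_hours: float) -> float:
    """Speedup factor = sequential time estimate / parallel wall-clock time."""
    return sequential_hours / wall_clock_hours

ui_epic = speedup(18.5, 3.5)   # Case Study 1: ~5.3x
rebrand = speedup(18.0, 6.5)   # Case Study 2: ~2.8x
```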

Quality Validation

All deployments were scored against SPOQ's dual validation framework: 10 planning metrics (95/90 threshold) and 10 code metrics (95/80 threshold). Agent confidence scores are self-reported on a 0.0–1.0 scale and cross-referenced with validation gate results.

Task Completion

A task is "completed" when it passes the code validation gate (average score ≥ 95, per-metric minimum ≥ 80). Tasks requiring rework cycles are counted as completed once they pass validation, but the rework is tracked separately. "First-pass rate" measures tasks that pass validation without any rework.
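The code-gate criterion can be expressed directly; thresholds come from the text, while the score lists below are illustrative, not real validation output:

```python
def passes_code_gate(scores: list[float],
                     avg_threshold: float = 95.0,
                     min_threshold: float = 80.0) -> bool:
    """A task is 'completed' when its code-metric average is >= 95
    and no individual metric falls below 80."""
    return (sum(scores) / len(scores) >= avg_threshold
            and min(scores) >= min_threshold)

# Average 96.0, minimum 93: passes both thresholds.
ok = passes_code_gate([98, 96, 95, 97, 94, 96, 95, 93, 97, 99])      # True
# Average 97.9 clears the bar, but one metric at 79 fails the minimum.
blocked = passes_code_gate([100] * 9 + [79])                          # False
```

The per-metric minimum prevents one weak dimension from hiding behind an otherwise strong average, which is why the second example is rejected despite its higher mean.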

Limitations

Sample Size

n=2 practitioners. All deployments share a single primary author. Independent replication by other teams and controlled studies with baseline comparisons are needed.

Operator Bias

One individual conducted all detailed case studies and most adoption deployments, limiting generalizability of speedup claims to other operators.

LLM-Scored Validation

The 20 validation metrics rely on LLM-based assessment without inter-rater reliability. Automated quality scores may exhibit biases, inconsistency across runs, or susceptibility to gaming.

Dependency Structure

Speedup depends on the task DAG. The 5.3x result came from an embarrassingly parallel wave structure. Projects with deep dependency chains see lower speedups (1.3–3.0x in the adoption survey).

Cost Estimates

Token costs are approximate and vary significantly by task complexity and codebase size. Infrastructure, human oversight time, and rework loop costs are not comprehensively included.