SPOQ Performance Benchmarks
Experimental results spanning a 1.3–5.3x speedup range across two case studies and seven adoption projects
Evaluation Summary
| Metric | UI Epic | Rebrand Epic | Adoption (agg.) |
|---|---|---|---|
| Tasks | 13 | 12 | 122 |
| Waves | 2 | 4 | varies |
| Max Parallelism | 12 | 5 | 4 |
| Speedup Factor | 5.3x | 2.8x | 1.3–3.0x |
| Completion Rate | 92% | 100% | 100% |
| Orchestrator Interventions | 1 | 3 | varies |
| Rework Cycles | 2 | 0 | 0–1 |
| Test Cases | — | 174 | 295 |
| Avg. Confidence | — | 0.92 | 0.90–0.95 |
Case Study 1: UI Improvements (Internal)
| Metric | Value |
|---|---|
| Total Tasks | 13 |
| Wave 0 Parallelism | 12 concurrent agents |
| Wave 1 Parallelism | 1 agent |
| Sequential Estimate | 18.5 hours |
| Parallel Wall-Clock | 3.5 hours |
| Speedup Factor | 5.3x |
| First-Pass Rate | 12 of 13 (92%) |
| Rework Cycles | 2 (tasks 04, 07) |
Wave 0: Independent component implementations (Sonner setup, DataTable component, modal dialogs, toast integrations, search functionality, and tests), all executed concurrently.
Wave 1: End-to-end QA depending on all Wave 0 tasks; final integration verification after all components completed.
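The wave structure described above can be derived mechanically from the task dependency graph. A minimal sketch (function and task names here are illustrative, not SPOQ's actual API), using Kahn-style leveling:

```python
def compute_waves(deps):
    """Group tasks into execution waves: a task is scheduled in the
    first wave after all of its dependencies have been scheduled."""
    scheduled = set()
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= scheduled)
        if not ready:
            raise ValueError("dependency cycle detected")
        for t in ready:
            scheduled.add(t)
            del remaining[t]
        waves.append(ready)
    return waves

# UI Improvements epic shape: 12 independent tasks, then one QA task
deps = {f"task{i:02d}": [] for i in range(1, 13)}
deps["qa"] = list(deps)  # end-to-end QA depends on all Wave 0 tasks
waves = compute_waves(deps)
# waves[0] holds 12 concurrent tasks; waves[1] holds only "qa"
```

The 5.3x result follows directly from this shape: 12 of the 13 tasks share a single wave, so wall-clock time approaches the longest task plus the QA pass rather than the sum of all tasks.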
Failure Mode Encountered
Task 04 entered a runaway retry loop, executing `npm install sonner` over 100 times. SPOQ now enforces a 3-retry maximum with pre-installation verification.
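A bounded retry policy of the kind SPOQ now enforces can be sketched as follows (the function and callback names are illustrative, not SPOQ's actual interface):

```python
MAX_RETRIES = 3  # hard cap on install attempts

def ensure_installed(package, is_installed, install):
    """Try installing at most MAX_RETRIES times, checking before each
    attempt whether the package is already present (pre-installation
    verification) so a successful install ends the loop early."""
    for _ in range(MAX_RETRIES):
        if is_installed(package):
            return True
        install(package)
    return is_installed(package)

# Simulated flaky installer: the second attempt succeeds
state = {"installed": False, "attempts": 0}
def flaky_install(pkg):
    state["attempts"] += 1
    if state["attempts"] >= 2:
        state["installed"] = True

ok = ensure_installed("sonner", lambda p: state["installed"], flaky_install)
```

The pre-check is what breaks the runaway loop: once the package is present, no further install commands are issued regardless of remaining retry budget.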
Case Study 2: Client Website Rebrand (External)
| Metric | Value |
|---|---|
| Total Tasks | 12 |
| Max Parallelism | 5 concurrent agents (Wave 1) |
| Sequential Estimate | 18 hours |
| Parallel Wall-Clock | 6.5 hours |
| Speedup Factor | 2.8x |
| Completion Rate | 12 of 12 (100%) |
| Orchestrator Interventions | 3 |
| Test Suites / Cases | 20 / 174 (0 failures) |
Wave 0: Content removal and routing scaffold (parallel).
Wave 1: Homepage rewrite, pricing update, and three persona-specific landing pages (parallel).
Wave 2: Navigation, SEO, and section refinement (parallel).
Wave 3: Test expansion and accessibility polish (parallel). Test coverage grew from 134 to 174 tests.
Key Challenge
Three orchestrator interventions were needed for test fixture synchronization: parallel agents' code changes invalidated sibling test assertions. SPOQ now recommends treating test files as implicit dependents of the components they exercise.
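One way to encode that recommendation is to augment the dependency graph so each test task lists the component task it covers as a prerequisite, keeping them out of the same wave. A minimal sketch (data shapes and names are hypothetical):

```python
def add_test_dependencies(deps, test_map):
    """Return a copy of the task graph where each test task depends on
    the component task it exercises, so a wave scheduler never runs
    them concurrently.

    deps:     task -> list of prerequisite tasks
    test_map: test task -> component task it covers
    """
    out = {t: list(d) for t, d in deps.items()}
    for test_task, component in test_map.items():
        out.setdefault(test_task, [])
        if component not in out[test_task]:
            out[test_task].append(component)
    return out

deps = {"homepage": [], "pricing": [], "homepage_tests": []}
augmented = add_test_dependencies(deps, {"homepage_tests": "homepage"})
# "homepage_tests" now waits for "homepage" instead of sharing its wave
```

The trade-off is reduced parallelism: each implicit edge pushes a test task at least one wave later, which is exactly why these interventions happened under the original, flatter schedule.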
Multi-Project Adoption Survey
Beyond the two detailed case studies, SPOQ was deployed across seven projects by two practitioners, spanning distinct technology stacks and problem domains. Across all 122 tasks, average agent confidence scores ranged from 0.90 to 0.95, with a 100% task completion rate.
| Project | Domain | Tasks | Tests | Stack |
|---|---|---|---|---|
| Savvy Expat | E-commerce | 10 | 154 | Next.js, Docker |
| Railroad OS | Linux tooling | 43 | 55 | Bash, i3 WM |
| SPOQ Website | Documentation | 23 | 18 | Next.js, Terraform |
| Pinpoint Platform | Backend API | 16 | 308 | Spring Boot, Java |
| Pinpoint Infra | Cloud infra | 17 | — | Terraform, AWS |
| Pinpoint Analytics | Tracking | 7 | — | Next.js, GA4 |
| Pinpoint Billing | Payments | 6 | — | Spring Boot, Stripe |
The Pinpoint Rebrand epic (12 tasks) and a companion analytics epic (7 tasks) were both planned and executed in a single three-hour session using six concurrent Claude Code instances under a single Max license. The 19 combined tasks were completed from cold start to full verification between 4:00 AM and 7:00 AM, yielding a sustained rate of approximately 6 tasks per hour.
Token Cost Analysis
| Model | Input | Output | SPOQ Role |
|---|---|---|---|
| Opus | $15/M | $75/M | Worker |
| Sonnet | $3/M | $15/M | Reviewer |
| Haiku | $0.25/M | $1.25/M | Investigator |
Typical Opus worker task: ~25K input + ~5K output tokens. At $15/M input and $75/M output, that is ~$0.375 + ~$0.375 ≈ $0.75 per task.
13-task epic (UI Improvements scale): ~$10 total worker cost.
SPOQ's three-tier hierarchy is an economic optimization: Opus tokens reserved for task execution while validation and triage route through Sonnet and Haiku.
Pricing as of early 2026; subject to change. Token estimates vary by task complexity, codebase size, and context requirements.
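Recomputing from the rates in the pricing table, a per-task cost estimator might look like this sketch (the token counts are the illustrative figures quoted above; actual usage varies widely):

```python
# USD per million tokens (input, output), from the pricing table above
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def task_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of a single task at per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

worker = task_cost("opus", 25_000, 5_000)  # 0.375 + 0.375 = $0.75 per task
epic = 13 * worker                          # ≈ $9.75 for a 13-task epic
```

The same function applied with Sonnet or Haiku rates shows the economic rationale for the three-tier hierarchy: a Haiku triage pass on the same token volume costs roughly two orders of magnitude less than an Opus worker pass.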
Measurement Methodology
How Speedup Is Measured
Speedup factor = sequential time estimate / parallel wall-clock time. The sequential estimate is the sum of individual task durations as if executed one after another. The parallel wall-clock time is the actual elapsed time with concurrent agents executing within computed waves.
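Applied to the two case studies, the formula reproduces the headline figures:

```python
def speedup(sequential_hours, wall_clock_hours):
    """Speedup factor = sequential time estimate / parallel wall-clock time."""
    return sequential_hours / wall_clock_hours

ui_epic = speedup(18.5, 3.5)       # Case Study 1: UI Improvements
rebrand_epic = speedup(18.0, 6.5)  # Case Study 2: Client Rebrand
print(round(ui_epic, 1), round(rebrand_epic, 1))  # 5.3 2.8
```

Note that the numerator is an estimate, not a measured baseline, which is one of the limitations discussed below.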
Quality Validation
All deployments were scored against SPOQ's dual validation framework: 10 planning metrics (average ≥ 95, per-metric minimum ≥ 90) and 10 code metrics (average ≥ 95, per-metric minimum ≥ 80). Agent confidence scores are self-reported on a 0.0–1.0 scale and cross-referenced with validation gate results.
Task Completion
A task is "completed" when it passes the code validation gate (average score ≥ 95, per-metric minimum ≥ 80). Tasks requiring rework cycles are counted as completed once they pass validation, but the rework is tracked separately. "First-pass rate" measures tasks that pass validation without any rework.
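The completion criterion reduces to a simple gate check. This sketch parameterizes the thresholds so one function covers both the code gate (95/80) and the planning gate (95/90); the function name is illustrative:

```python
def passes_gate(scores, avg_threshold=95, per_metric_min=80):
    """A task passes when the average of its metric scores meets the
    average threshold AND no single metric falls below the floor."""
    return (sum(scores) / len(scores) >= avg_threshold
            and min(scores) >= per_metric_min)

code_ok = passes_gate([98, 96, 95, 97, 99, 94, 96, 95, 93, 97])  # avg 96, min 93
gamed = passes_gate([100] * 9 + [79])                            # avg 97.9, min 79
# planning gate: passes_gate(scores, per_metric_min=90)
```

The per-metric floor matters: as the second example shows, a task can clear the average threshold while one metric fails badly, and the dual condition catches that case.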
Limitations
Sample Size
n=2 practitioners. All deployments share a single primary author. Independent replication by other teams and controlled studies with baseline comparisons are needed.
Operator Bias
One individual conducted all detailed case studies and most adoption deployments, limiting generalizability of speedup claims to other operators.
LLM-Scored Validation
The 20 validation metrics rely on LLM-based assessment without inter-rater reliability. Automated quality scores may exhibit biases, inconsistency across runs, or susceptibility to gaming.
Dependency Structure
Speedup depends on the task DAG. The 5.3x result came from an embarrassingly parallel wave structure. Projects with deep dependency chains see lower speedups (1.3–3.0x in the adoption survey).
Cost Estimates
Token costs are approximate and vary significantly by task complexity and codebase size. Infrastructure, human oversight time, and rework loop costs are not comprehensively included.