Better answers without bigger models.
Pick any cheap model. RSA (Recursive Self-Aggregation) spawns N parallel calls, aggregates them into one better answer, and repeats. A 4B model with RSA matches frontier reasoning models at a fraction of the cost.
One flag. Same endpoint.
RSA is a gateway option on /v1/chat/completions. Pass your model, add gateway.rsa, done.
// Works with any model in the catalog
fetch('https://router.tangle.tools/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TANGLE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-haiku-4-5', // or gpt-4o-mini, gemini-3-flash, etc.
    messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
    gateway: {
      rsa: { n: 16, k: 4, t: 5 } // ← this is the only new line
    },
  }),
})

tcloud SDK: t.chat({ gateway: { rsa: { n: 16, k: 4, t: 5 } } })
Cost vs. quality.
Reproducible numbers from rsa-benchmark. The pattern holds across model families.
Calls | Cost       | Quality  | Latency
1     | $0.05–1.00 | Baseline | 2–8s
32    | ~$0.03     | ~95%     | 8–12s
96    | ~$0.10     | ~100%    | 15–30s
Paper (Venkatraman et al., 2025): Gemini 3 Flash + RSA approaches Deep Think; Qwen 4B + RSA matches DeepSeek-R1. Our benchmark: Claude Haiku + RSA matches Opus on 3/6 tasks and beats it on one.
Three strategies, one infrastructure.
Same fan-out + aggregation engine. Different modes for different needs.
RSA
Population refinement
Generate N candidates, aggregate K at a time, refine over T rounds. The LLM cross-references and self-corrects.
gateway: { rsa: { n: 16, k: 4, t: 5 } }

MoA
Cross-model diversity
RSA with diverse models per slot. Claude + Gemini + GPT generating, one aggregator. Reduces single-model blindspots.
gateway: { rsa: { n: 4, k: 3, t: 2, models: [...] } }

Best-of-N
Custom scoring
Generate N candidates, score via your webhook or an LLM judge, return the winner. Your quality criteria, your infra.
gateway: { bestOfN: { n: 5, scorer: {...} } }

How RSA works.
Generate
Spawn N parallel calls to your cheap model
Subsample
Randomly pick K candidates from the population
Aggregate
LLM cross-references and synthesizes one improved answer
Repeat
Run T rounds. Population converges. Return the result.
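The four steps above reduce to a plain loop. This is an illustrative sketch, not the gateway's implementation: `generate` and `aggregate` stand in for the LLM calls so the control flow is runnable on its own.

```javascript
// Illustrative RSA loop. In the real gateway, generate() and aggregate()
// are model calls; here they are injected so the sketch is self-contained.
function rsa({ n, k, t, generate, aggregate }) {
  // Generate: spawn N independent candidates.
  let population = Array.from({ length: n }, () => generate());

  for (let round = 0; round < t; round++) {
    // Subsample + Aggregate: each new candidate is synthesized from
    // K randomly chosen members of the current population.
    population = population.map(() => {
      const shuffled = [...population].sort(() => Math.random() - 0.5);
      return aggregate(shuffled.slice(0, k));
    });
  }

  // Repeat for T rounds; the population converges. Return one member.
  return population[0];
}
```

With a toy aggregator such as `(sample) => Math.max(...sample)` over numeric candidates, you can watch the strongest candidate spread through the population round by round.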
Budget pre-check
Estimates (N + N×T) × per-call cost before fan-out. Returns 402 if your balance can't cover it.
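The pre-check arithmetic is easy to reproduce. A minimal sketch (function names here are illustrative, not part of the API):

```javascript
// N initial generations plus N aggregation calls per refinement round.
function estimatedCalls({ n, t }) {
  return n + n * t;
}

// Multiply by an average per-call cost to get the budget that is pre-checked.
function estimatedCost({ n, t }, perCallUsd) {
  return estimatedCalls({ n, t }) * perCallUsd;
}
```

With the example config `{ n: 16, t: 5 }` this comes to 96 calls, the same call count as the top row of the benchmark table.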
Latency-aware
optimize: 'latency' + rsa returns 400 — contradictory goals. We fail fast, not silently.
Transparent billing
Each sub-call billed at normal per-token rates. X-Tangle-RSA-Total-Calls header in every response.
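Reading that header from a fetch `Response` takes one line; a small helper (the function name is illustrative, the header name is from the docs above):

```javascript
// Parse the per-request call count from a fetch Response's headers.
// Returns null when the header is absent (e.g. a non-RSA request).
function rsaTotalCalls(headers) {
  const raw = headers.get('X-Tangle-RSA-Total-Calls');
  return raw === null ? null : Number(raw);
}
```

Usage: `const calls = rsaTotalCalls(res.headers);` after any gateway call.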
Details.
Does it stream?
What does it cost?
Can I auto-enable RSA?
Where's the benchmark?
One flag. Any model. Better output.
Pick any model from the catalog of 671+, add gateway.rsa, and compare the result.