Better answers
without bigger models.
Pick any cheap model. RSA spawns N parallel calls, aggregates them into one better answer, and repeats. In the RSA paper, a 4B model with RSA reaches parity with frontier reasoning models such as DeepSeek-R1 and o3-mini (high).
One flag. Same endpoint.
RSA is a gateway option on /v1/chat/completions. Pass your model, add gateway.rsa, done.
// Works with any model in the catalog
fetch('https://router.tangle.tools/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TANGLE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-haiku-4-5', // or gpt-4o-mini, gemini-3-flash, etc.
    messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
    gateway: {
      rsa: { n: 16, k: 4, t: 5 } // ← this is the only new line
    },
  }),
})
tcloud SDK: t.chat({ gateway: { rsa: { n: 16, k: 4, t: 5 } } })
Cost vs. quality.
Measured numbers from a committed run in rsa-benchmark — Gemini 3 Flash Preview + RSA against a single Gemini 3 Pro Preview call on a 6-prompt math + code suite. Pass/fail is binary per prompt. Re-run it against your own models and prompts.
Calls    Quality         Latency
1        3 / 6 passed    6–8s
32       3 / 6 passed    17–21s
96       2 / 6 passed    26–32s
Paper (Venkatraman et al., 2025): Gemini 3 Flash + RSA reaches near the top of the ARC-AGI-2 public leaderboard, and Qwen3-4B + RSA reaches parity with DeepSeek-R1 and o3-mini (high). Our run: Gemini 3 Flash + RSA (N=8) matched Gemini 3 Pro on overall pass rate (3/6 each) and solved one task, binary search, that Pro missed. These are preview models on a small suite, so the run validates the mechanism rather than a headline score.
Three strategies, one infrastructure.
Same fan-out + aggregation engine. Different modes for different needs.
RSA
Population refinement
Generate N candidates, aggregate K at a time, refine over T rounds. The LLM cross-references and self-corrects.
gateway: { rsa: { n: 16, k: 4, t: 5 } }
MoA
Cross-model diversity
RSA with diverse models per slot. Claude + Gemini + GPT generate, one aggregator combines. Reduces single-model blind spots.
gateway: { rsa: { n: 4, k: 3, t: 2, models: [...] } }
Best-of-N
Custom scoring
Generate N candidates, score via your webhook or an LLM judge, return the winner. Your quality criteria, your infra.
gateway: { bestOfN: { n: 5, scorer: {...} } }
How RSA works.
Generate
Spawn N parallel calls to your cheap model
Subsample
Randomly pick K candidates from the population
Aggregate
LLM cross-references and synthesizes one improved answer
Repeat
Run T rounds. Population converges. Return the result.
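The four steps above can be sketched client-side as a rough illustration. This is not the gateway's implementation: `callModel`, `sampleK`, `buildAggregationPrompt`, and the aggregation prompt wording are all hypothetical, and returning the first population member at the end is an assumption, since the page only says "Return the result." Running N aggregations per round is chosen to match the N + N×T call count used by the budget estimate.

```javascript
// Sketch of the RSA loop: generate N, then T rounds of "subsample K, aggregate".
// `callModel` is a hypothetical async function (prompt) => string that you supply.

// Pick k distinct items at random (partial Fisher-Yates shuffle on a copy).
function sampleK(items, k, rand = Math.random) {
  const pool = items.slice();
  const count = Math.min(k, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Hypothetical aggregation prompt; the gateway's real prompt is not shown on this page.
function buildAggregationPrompt(question, candidates) {
  const numbered = candidates
    .map((c, i) => `Candidate ${i + 1}:\n${c}`)
    .join('\n\n');
  return `${question}\n\nCross-reference these candidate answers and write one improved answer.\n\n${numbered}`;
}

async function rsa(question, callModel, { n, k, t }) {
  // Generate: N parallel calls with the original question.
  let population = await Promise.all(
    Array.from({ length: n }, () => callModel(question))
  );

  for (let round = 0; round < t; round++) {
    // Subsample + aggregate: each of the N slots is rebuilt from K random
    // candidates drawn from the previous round's population.
    population = await Promise.all(
      population.map(() =>
        callModel(buildAggregationPrompt(question, sampleK(population, k)))
      )
    );
  }

  // Assumption: return the first member of the final population.
  return { answer: population[0], totalCalls: n + n * t };
}
```

To try it, pass any async function that takes a prompt string and resolves to an answer string as `callModel`, for example a thin wrapper around a single chat completion request.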
Budget pre-check
Estimates (N + N×T) × per-call cost before fan-out. Returns 402 if your balance can't cover it.
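As a quick way to sanity-check a configuration before sending it, the formula above can be written out directly. This is a minimal sketch: `estimateRsaBudget` and `perCallCostUsd` are hypothetical names, and the gateway computes its own per-call cost, which may differ from any figure you plug in.

```javascript
// Mirror of the pre-check formula: (N + N×T) × per-call cost.
// `perCallCostUsd` is a hypothetical input you estimate yourself.
function estimateRsaBudget({ n, t }, perCallCostUsd) {
  const totalCalls = n + n * t; // N initial candidates plus N aggregations per round
  return { totalCalls, estimatedCostUsd: totalCalls * perCallCostUsd };
}

const estimate = estimateRsaBudget({ n: 16, k: 4, t: 5 }, 0.001);
console.log(estimate.totalCalls); // 96
```

Note that K does not appear in the call count. Under this formula it only changes how many candidates each aggregation call reads, so it affects per-call token cost rather than the number of calls.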
Latency-aware
optimize: 'latency' + rsa returns 400 — contradictory goals. We fail fast, not silently.
Transparent billing
Each sub-call billed at normal per-token rates. X-Tangle-RSA-Total-Calls header in every response.
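A caller might handle the behaviours described in these three cards roughly as follows. This is a sketch under assumptions: `describeRsaResponse` is a hypothetical helper, the header value is assumed to be a decimal integer, and a 400 can have causes other than the latency conflict.

```javascript
// Interpret a gateway response using only the behaviours described above.
// `res` is any fetch-style response exposing `status` and `headers.get(name)`.
function describeRsaResponse(res) {
  if (res.status === 402) {
    return { ok: false, reason: 'balance cannot cover the estimated RSA fan-out' };
  }
  if (res.status === 400) {
    // One documented cause: optimize: 'latency' combined with rsa.
    return { ok: false, reason: 'bad request (for example, optimize: latency combined with rsa)' };
  }
  // Assumption: the header carries a plain decimal integer.
  const raw = res.headers.get('X-Tangle-RSA-Total-Calls');
  const totalCalls = raw == null ? null : Number.parseInt(raw, 10);
  return { ok: res.status >= 200 && res.status < 300, totalCalls };
}
```

Pass it the object that `fetch` resolves to, before reading the body, and log `totalCalls` alongside your own estimate to compare the two.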
Details.
Does it stream?
What does it cost?
Can I auto-enable RSA?
Where's the benchmark?
One flag. Any model. Better output.
Pick any model from the 400+ catalog, add gateway.rsa, and compare the result.