New: Inference strategies for 400+ models

Better answers
without bigger models.

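Pick any cheap model. RSA (Recursive Self-Aggregation) spawns N parallel calls, aggregates subsets of them into improved answers, and repeats for T rounds. Venkatraman et al. (2025) report Qwen3-4B with RSA reaching parity with DeepSeek-R1 and o3-mini (high).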

400+ models
3 strategies
One gateway flag

One flag. Same endpoint.

RSA is a gateway option on /v1/chat/completions. Pass your model, add gateway.rsa, done.

// Works with any model in the catalog
fetch('https://router.tangle.tools/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TANGLE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-haiku-4-5',  // or gpt-4o-mini, gemini-3-flash, etc.
    messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
    gateway: {
      rsa: { n: 16, k: 4, t: 5 }    // ← this is the only new line
    },
  }),
})

tcloud SDK: t.chat({ gateway: { rsa: { n: 16, k: 4, t: 5 } } })

Cost vs. quality.

Measured numbers from a committed run in rsa-benchmark — Gemini 3 Flash Preview + RSA against a single Gemini 3 Pro Preview call on a 6-prompt math + code suite. Pass/fail is binary per prompt. Re-run it against your own models and prompts.

Gemini 3 Pro, single call
Calls: 1
Quality: 3 / 6 passed
Latency: 6–8s

Gemini 3 Flash + RSA (N=8), matched flagship
Calls: 32
Quality: 3 / 6 passed
Latency: 17–21s

Gemini 3 Flash + RSA (N=16)
Calls: 96
Quality: 2 / 6 passed
Latency: 26–32s

Paper (Venkatraman et al., 2025): Gemini 3 Flash + RSA lands near the top of the ARC-AGI-2 public leaderboard, and Qwen3-4B + RSA reaches parity with DeepSeek-R1 and o3-mini (high). Our run: Gemini 3 Flash + RSA (N=8) matched Gemini 3 Pro on overall pass rate (3/6 each) and solved one task, binary search, that Pro missed. These are preview models on a small suite, so the run validates the mechanism rather than a headline score.

Three strategies, one infrastructure.

Same fan-out + aggregation engine. Different modes for different needs.

RSA

Population refinement

Generate N candidates, aggregate K at a time, refine over T rounds. The LLM cross-references and self-corrects.

gateway: { rsa: { n: 16, k: 4, t: 5 } }

MoA

Cross-model diversity

RSA with diverse models per slot. Claude + Gemini + GPT generating, one aggregator. Reduces single-model blind spots.

gateway: { rsa: { n: 4, k: 3, t: 2, models: [...] } }

Best-of-N

Custom scoring

Generate N candidates, score via your webhook or an LLM judge, return the winner. Your quality criteria, your infra.

gateway: { bestOfN: { n: 5, scorer: {...} } }
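The Best-of-N mode described above can be pictured client-side as follows. This is a minimal sketch, not the gateway's implementation: `generate` and `score` are hypothetical callbacks standing in for a model call and for your webhook or LLM judge, and the sketch assumes that a higher score is better.

```javascript
// Best-of-N sketch: generate n candidates in parallel, score each, return the top one.
// `generate` and `score` are hypothetical stand-ins, not gateway APIs.
async function bestOfN({ n, prompt, generate, score }) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError('n must be a positive integer');
  }

  // Fan out: n independent candidates for the same prompt.
  const candidates = await Promise.all(
    Array.from({ length: n }, () => generate(prompt))
  );

  // Score every candidate; higher is assumed to be better.
  const scores = await Promise.all(candidates.map((c) => score(prompt, c)));

  // Pick the highest-scoring candidate (the first one wins ties).
  let best = 0;
  for (let i = 1; i < n; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return { winner: candidates[best], score: scores[best], candidates };
}
```

Keeping the scorer as a plain callback mirrors the "your quality criteria, your infra" idea: the selection logic does not need to know whether the score came from a webhook or a judge model.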

How RSA works.

01

Generate

Spawn N parallel calls to your cheap model

02

Subsample

Randomly pick K candidates from the population

03

Aggregate

LLM cross-references and synthesizes one improved answer

04

Repeat

Run T rounds. Population converges. Return the result.
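The four steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the gateway's code: `callModel` is a hypothetical async function returning one completion as a string, the aggregation prompt is invented for the example, each round is assumed to rebuild a full population of N (which is consistent with the N + N×T call estimate quoted under Budget pre-check), and the final answer is simply taken as the first member of the last population.

```javascript
// RSA loop sketch. `callModel(prompt)` is a hypothetical async function that
// returns one completion as a string; it is not a gateway or SDK API.

// Pick k distinct items from `items` using a partial Fisher-Yates shuffle.
function sample(items, k, rng = Math.random) {
  const pool = items.slice();
  const count = Math.min(k, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Assumed aggregation prompt; the gateway's real prompt is not documented here.
function aggregationPrompt(question, candidates) {
  const listed = candidates
    .map((c, i) => `Candidate ${i + 1}:\n${c}`)
    .join('\n\n');
  return `${question}\n\nCandidate answers:\n\n${listed}\n\n` +
    'Cross-check the candidates and write one improved answer.';
}

async function rsa({ question, n, k, t, callModel, rng = Math.random }) {
  // 01 Generate: n parallel calls to the model.
  let population = await Promise.all(
    Array.from({ length: n }, () => callModel(question))
  );

  // 04 Repeat: t rounds, each rebuilding a population of n candidates.
  for (let round = 0; round < t; round++) {
    population = await Promise.all(
      Array.from({ length: n }, () => {
        const subset = sample(population, k, rng);             // 02 Subsample
        return callModel(aggregationPrompt(question, subset)); // 03 Aggregate
      })
    );
  }

  // Assumption: return the first member of the final population.
  return population[0];
}
```

With a stubbed `callModel` that counts invocations, `rsa({ n: 16, k: 4, t: 5, ... })` makes 16 initial calls plus 16 aggregation calls in each of 5 rounds.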

Budget pre-check

Estimates (N + N×T) × per-call cost before fan-out. Returns 402 if your balance can't cover it.
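The estimate above is simple enough to reproduce before sending a request. The sketch below applies the same (N + N×T) formula; `estimateRsa` and its `perCallCost` argument are hypothetical, with the per-call cost being a number you supply yourself rather than a value the gateway exposes.

```javascript
// Estimate the number of model calls and the cost of one RSA request.
// Mirrors the (N + N×T) × per-call cost formula; `perCallCost` is a
// caller-supplied estimate (hypothetical), not a gateway field.
function estimateRsa({ n, t, perCallCost = 0 }) {
  const calls = n + n * t; // n initial generations + n aggregations per round
  return { calls, cost: calls * perCallCost };
}

console.log(estimateRsa({ n: 16, t: 5 }).calls); // 96
console.log(estimateRsa({ n: 8, t: 3 }).calls);  // 32
```

K does not appear in the formula because it changes how many candidates each aggregation call reads, not how many calls are made; it still affects token usage per call.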

Latency-aware

Combining optimize: 'latency' with rsa returns 400 because the two goals contradict each other. We fail fast, not silently.

Transparent billing

Each sub-call billed at normal per-token rates. X-Tangle-RSA-Total-Calls header in every response.
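A client can branch on the behaviors listed above without parsing the response body. The sketch below is one way to do that: the status codes and the X-Tangle-RSA-Total-Calls header name come from this page, while the `interpretRsaResponse` helper and the shape of the object it returns are assumptions made for illustration.

```javascript
// Interpret a gateway response for an RSA request.
// `headers` is anything with a get(name) method, such as the Headers object
// on a fetch Response. The returned object shape is an assumption.
function interpretRsaResponse(status, headers) {
  // 402: the budget pre-check decided the balance cannot cover the estimate.
  if (status === 402) return { ok: false, reason: 'insufficient-balance' };
  // 400: invalid request, e.g. optimize: 'latency' combined with rsa.
  if (status === 400) return { ok: false, reason: 'bad-request' };
  if (status < 200 || status >= 300) return { ok: false, reason: 'other-error' };

  // Success: read how many sub-calls were made and billed.
  const raw = headers.get('X-Tangle-RSA-Total-Calls');
  const totalCalls = raw == null ? null : Number.parseInt(raw, 10);
  return { ok: true, totalCalls };
}
```

After `const res = await fetch(...)`, the call would be `interpretRsaResponse(res.status, res.headers)`.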

Details.

Does it stream?
No. RSA aggregates across completions, so intermediate rounds aren't streamable, and the final response is a standard non-streaming chat completion. Use it for async workloads where quality matters more than latency.
What does it cost?
RSA bills per individual call. With N=16, K=4, T=5, that's 16 + 16×5 = 96 calls total at the model's normal per-token pricing. Budget pre-check runs before fan-out; if your balance can't cover the estimated cost, we return 402 with the breakdown.
Can I auto-enable RSA?
Yes. Pass gateway.optimize = 'quality' and we turn on RSA with conservative defaults (N=8, K=3, T=3). No config required.
Where's the benchmark?
Open-source and reproducible at tangle-network/rsa-benchmark. Run it against your own models and prompts.
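For the auto-enable answer above, the request body might look like the sketch below. The `gateway.optimize` field and its 'quality' value are taken from this page; `buildQualityRequest` is a hypothetical helper, and the model and prompt are simply reused from the earlier request example.

```javascript
// Build a chat-completions body that lets the gateway choose RSA defaults.
// `buildQualityRequest` is a hypothetical helper, not part of any SDK.
function buildQualityRequest(model, userMessage) {
  return {
    model,
    messages: [{ role: 'user', content: userMessage }],
    gateway: { optimize: 'quality' }, // no explicit rsa block
  };
}

const body = JSON.stringify(
  buildQualityRequest('anthropic/claude-haiku-4-5', 'Prove sqrt(2) is irrational')
);
```

The resulting string can be passed as the `body` of the same POST to /v1/chat/completions.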

One flag. Any model. Better output.

Pick any model from the 400+ catalog, add gateway.rsa, and compare the result.