New: Inference strategies for 671+ models

Better answers without bigger models.

Pick any cheap model. RSA spawns N parallel calls, aggregates them into one better answer, repeats. A 4B model with RSA matches frontier reasoning models — at a fraction of the cost.

671+ models
3 strategies
~10x cost reduction

One flag. Same endpoint.

RSA is a gateway option on /v1/chat/completions. Pass your model, add gateway.rsa, done.

// Works with any model in the catalog
fetch('https://router.tangle.tools/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TANGLE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-haiku-4-5',  // or gpt-4o-mini, gemini-3-flash, etc.
    messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
    gateway: {
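      // n candidates, aggregated k at a time, refined over t rounds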
      rsa: { n: 16, k: 4, t: 5 }    // ← the only new field
    },
  }),
})

tcloud SDK: t.chat({ gateway: { rsa: { n: 16, k: 4, t: 5 } } })
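
A fuller SDK call, as a sketch only; this assumes t.chat accepts the same model, messages, and gateway fields as the REST body above:

// Assumption: t.chat mirrors the REST request body
const res = await t.chat({
  model: 'anthropic/claude-haiku-4-5',
  messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
  gateway: { rsa: { n: 16, k: 4, t: 5 } },
});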

Cost vs. quality.

Reproducible numbers from rsa-benchmark. The pattern holds across model families.

Configuration                         Calls   Cost         Quality    Latency
Single expensive model                1       $0.05–1.00   Baseline   2–8s
Cheap model + RSA (N=8) [best value]  32      ~$0.03       ~95%       8–12s
Cheap model + RSA (N=16)              96      ~$0.10       ~100%      15–30s

Paper (Venkatraman et al., 2025): Gemini 3 Flash + RSA approaches Deep Think; Qwen 4B + RSA matches DeepSeek-R1. Our benchmark: Claude Haiku + RSA matches Opus on 3/6 tasks, beats it on 1.

Three strategies, one infrastructure.

Same fan-out + aggregation engine. Different modes for different needs.

RSA

Population refinement

Generate N candidates, aggregate K at a time, refine over T rounds. The LLM cross-references and self-corrects.

gateway: { rsa: { n: 16, k: 4, t: 5 } }

MoA

Cross-model diversity

RSA with diverse models per slot. Claude + Gemini + GPT generating, one aggregator. Reduces single-model blind spots.

gateway: { rsa: { n: 4, k: 3, t: 2, models: [...] } }
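
Filled in, a MoA call might look like the fragment below; the model IDs and provider prefixes are illustrative picks from the catalog, not a required set:

gateway: {
  rsa: {
    n: 4, k: 3, t: 2,
    // Illustrative model IDs: any catalog models can fill the slots
    models: [
      'anthropic/claude-haiku-4-5',
      'google/gemini-3-flash',
      'openai/gpt-4o-mini',
    ],
  },
}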

Best-of-N

Custom scoring

Generate N candidates, score via your webhook or an LLM judge, return the winner. Your quality criteria, your infra.

gateway: { bestOfN: { n: 5, scorer: {...} } }
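
With a webhook scorer it might look like the fragment below; the scorer shape is an assumption for illustration, not a documented schema:

gateway: {
  bestOfN: {
    n: 5,
    // Hypothetical scorer config: a webhook that scores each candidate
    scorer: { type: 'webhook', url: 'https://example.com/score' },
  },
}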

How RSA works.

01

Generate

Spawn N parallel calls to your cheap model

02

Subsample

Randomly pick K candidates from the population

03

Aggregate

The LLM cross-references the K candidates and synthesizes one improved answer

04

Repeat

Run T rounds. Population converges. Return the result.
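
In code, the loop is small. A minimal TypeScript sketch of the four steps above, where chat() stands in for a single /v1/chat/completions call; every name here is illustrative, not the gateway's internals:

// Sketch of the RSA loop described above; chat() represents one model call.
declare function chat(prompt: string): Promise<string>;

async function rsa(prompt: string, n: number, k: number, t: number): Promise<string> {
  // 01 Generate: spawn n parallel calls to the cheap model
  let population = await Promise.all(
    Array.from({ length: n }, () => chat(prompt))
  );

  // 04 Repeat: t rounds of subsample + aggregate
  for (let round = 0; round < t; round++) {
    population = await Promise.all(
      population.map(() => {
        // 02 Subsample: randomly pick k candidates from the population
        const sample = shuffle(population).slice(0, k);
        // 03 Aggregate: cross-reference candidates into one improved answer
        return chat(
          `${prompt}\n\nCandidate answers:\n${sample.join('\n---\n')}\n\nSynthesize one improved answer.`
        );
      })
    );
  }

  return population[0];  // the population has converged; return one member
}

// Unbiased Fisher–Yates shuffle over a copy
function shuffle<T>(xs: T[]): T[] {
  const a = [...xs];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

Note the call count: n initial generations plus n aggregations per round, for n + n×t calls total.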

Budget pre-check

Estimates (N + N×T) × per-call cost before fan-out. Returns 402 if your balance can't cover it.
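
The estimate is easy to reproduce:

// Total calls = n initial generations + n aggregations per round
const totalCalls = (n: number, t: number) => n + n * t;
totalCalls(16, 5);  // 96, matching the N=16 row in the table above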

Latency-aware

optimize: 'latency' + rsa returns 400 — contradictory goals. We fail fast, not silently.
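
For example, this gateway block fails fast with a 400:

// Invalid: RSA trades latency for quality, so the two goals conflict
gateway: {
  optimize: 'latency',
  rsa: { n: 16, k: 4, t: 5 },
}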

Transparent billing

Each sub-call billed at normal per-token rates. X-Tangle-RSA-Total-Calls header in every response.
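
You can read the fan-out count straight off the response:

const res = await fetch('https://router.tangle.tools/v1/chat/completions', {
  /* same POST request as above, with gateway.rsa set */
});
console.log(res.headers.get('X-Tangle-RSA-Total-Calls'));  // e.g. "96" for n=16, t=5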

Details.

Does it stream?
Not natively: RSA aggregates across full completions, so intermediate rounds aren't streamable. The final response is a standard non-streaming chat completion. Use it for async workloads where quality beats latency.
What does it cost?
RSA bills per individual call — with N=16, K=4, T=5, that's 96 calls total at the model's normal per-token pricing. Budget pre-check runs before fan-out; if your balance can't cover the estimated cost, we return 402 with the breakdown.
Can I auto-enable RSA?
Yes. Pass gateway.optimize = 'quality' and we turn on RSA with conservative defaults (N=8, K=3, T=3). No config required.
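
In the request body, that's a single field instead of the full rsa block:

gateway: { optimize: 'quality' }  // turns on RSA with n=8, k=3, t=3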
Where's the benchmark?
Open-source and reproducible at tangle-network/rsa-benchmark. Run it against your own models and prompts.

One flag. Any model. Better output.

Pick any model from the 671+ catalog, add gateway.rsa, and compare the result.