New: Inference strategies for 671+ models

Better answers without bigger models.

Pick any cheap model. RSA spawns N parallel calls, aggregates them into one better answer, repeats. A 4B model with RSA matches frontier reasoning models — at a fraction of the cost.

671+ models
3 strategies
~10x cost reduction

One flag. Same endpoint.

RSA is a gateway option on /v1/chat/completions. Pass your model, add gateway.rsa, done.

// Works with any model in the catalog
fetch('https://router.tangle.tools/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.TANGLE_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-haiku-4-5',  // or gpt-4o-mini, gemini-3-flash, etc.
    messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
    gateway: {
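      // n candidates, aggregated k at a time, refined over t rounds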
      rsa: { n: 16, k: 4, t: 5 }    // ← the only new field
    },
  }),
})

tcloud SDK: t.chat({ gateway: { rsa: { n: 16, k: 4, t: 5 } } })
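
A fuller SDK call, as a sketch only; this assumes t.chat accepts the same model, messages, and gateway fields as the REST body above:

// Assumption: t.chat mirrors the REST request body
const res = await t.chat({
  model: 'anthropic/claude-haiku-4-5',
  messages: [{ role: 'user', content: 'Prove sqrt(2) is irrational' }],
  gateway: { rsa: { n: 16, k: 4, t: 5 } },
});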

Cost vs. quality.

Reproducible numbers from rsa-benchmark. The pattern holds across model families.

Configuration                         Calls   Cost         Quality    Latency
Single expensive model                1       $0.05–1.00   Baseline   2–8s
Cheap model + RSA (N=8) [best value]  32      ~$0.03       ~95%       8–12s
Cheap model + RSA (N=16)              96      ~$0.10       ~100%      15–30s

Paper (Venkatraman et al., 2025): Gemini 3 Flash + RSA approaches Deep Think; Qwen 4B + RSA matches DeepSeek-R1. Our benchmark: Claude Haiku + RSA matches Opus on 3/6 tasks, beats it on 1.

Three strategies, one infrastructure.

Same fan-out + aggregation engine. Different modes for different needs.

RSA

Population refinement

Generate N candidates, aggregate K at a time, refine over T rounds. The LLM cross-references and self-corrects.

gateway: { rsa: { n: 16, k: 4, t: 5 } }

MoA

Cross-model diversity

RSA with diverse models per slot. Claude + Gemini + GPT generating, one aggregator. Reduces single-model blind spots.

gateway: { rsa: { n: 4, k: 3, t: 2, models: [...] } }
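
Filled in, a MoA call might look like the fragment below; the model IDs and provider prefixes are illustrative picks from the catalog, not a required set:

gateway: {
  rsa: {
    n: 4, k: 3, t: 2,
    // Illustrative model IDs: any catalog models can fill the slots
    models: [
      'anthropic/claude-haiku-4-5',
      'google/gemini-3-flash',
      'openai/gpt-4o-mini',
    ],
  },
}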

Best-of-N

Custom scoring

Generate N candidates, score via your webhook or an LLM judge, return the winner. Your quality criteria, your infra.

gateway: { bestOfN: { n: 5, scorer: {...} } }
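
With a webhook scorer it might look like the fragment below; the scorer shape is an assumption for illustration, not a documented schema:

gateway: {
  bestOfN: {
    n: 5,
    // Hypothetical scorer config: a webhook that scores each candidate
    scorer: { type: 'webhook', url: 'https://example.com/score' },
  },
}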

How RSA works.

01

Generate

Spawn N parallel calls to your cheap model

02

Subsample

Randomly pick K candidates from the population

03

Aggregate

The LLM cross-references the K candidates and synthesizes one improved answer

04

Repeat

Run T rounds. Population converges. Return the result.
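
In code, the loop is small. A minimal TypeScript sketch of the four steps above, where chat() stands in for a single /v1/chat/completions call; every name here is illustrative, not the gateway's internals:

// Sketch of the RSA loop described above; chat() represents one model call.
declare function chat(prompt: string): Promise<string>;

async function rsa(prompt: string, n: number, k: number, t: number): Promise<string> {
  // 01 Generate: spawn n parallel calls to the cheap model
  let population = await Promise.all(
    Array.from({ length: n }, () => chat(prompt))
  );

  // 04 Repeat: t rounds of subsample + aggregate
  for (let round = 0; round < t; round++) {
    population = await Promise.all(
      population.map(() => {
        // 02 Subsample: randomly pick k candidates from the population
        const sample = shuffle(population).slice(0, k);
        // 03 Aggregate: cross-reference candidates into one improved answer
        return chat(
          `${prompt}\n\nCandidate answers:\n${sample.join('\n---\n')}\n\nSynthesize one improved answer.`
        );
      })
    );
  }

  return population[0];  // the population has converged; return one member
}

// Unbiased Fisher–Yates shuffle over a copy
function shuffle<T>(xs: T[]): T[] {
  const a = [...xs];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

Note the call count: n initial generations plus n aggregations per round, for n + n×t calls total.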

Budget pre-check

Estimates (N + N×T) × per-call cost before fan-out. Returns 402 if your balance can't cover it.
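
The estimate is easy to reproduce:

// Total calls = n initial generations + n aggregations per round
const totalCalls = (n: number, t: number) => n + n * t;
totalCalls(16, 5);  // 96, matching the N=16 row in the table above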

Latency-aware

optimize: 'latency' + rsa returns 400 — contradictory goals. We fail fast, not silently.
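
For example, this gateway block fails fast with a 400:

// Invalid: RSA trades latency for quality, so the two goals conflict
gateway: {
  optimize: 'latency',
  rsa: { n: 16, k: 4, t: 5 },
}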

Transparent billing

Each sub-call billed at normal per-token rates. X-Tangle-RSA-Total-Calls header in every response.
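
You can read the fan-out count straight off the response:

const res = await fetch('https://router.tangle.tools/v1/chat/completions', {
  /* same POST request as above, with gateway.rsa set */
});
console.log(res.headers.get('X-Tangle-RSA-Total-Calls'));  // e.g. "96" for n=16, t=5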

Details.

Does it stream?
Not natively: RSA aggregates across full completions, so intermediate rounds aren't streamable. The final response is a standard non-streaming chat completion. Use it for async workloads where quality beats latency.
What does it cost?
RSA bills per individual call — with N=16, K=4, T=5, that's 96 calls total at the model's normal per-token pricing. Budget pre-check runs before fan-out; if your balance can't cover the estimated cost, we return 402 with the breakdown.
Can I auto-enable RSA?
Yes. Pass gateway.optimize = 'quality' and we turn on RSA with conservative defaults (N=8, K=3, T=3). No config required.
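
In the request body, that's a single field instead of the full rsa block:

gateway: { optimize: 'quality' }  // turns on RSA with n=8, k=3, t=3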
Where's the benchmark?
Open-source and reproducible at tangle-network/rsa-benchmark. Run it against your own models and prompts.

One flag. Any model. Better output.

Pick any model from the 671+ catalog, add gateway.rsa, and compare the result.