WHY I'M BUILDING QUINFERENCE
ASHWIN VISWATMULA · APRIL 23, 2026 · 3 min read
Quinference started as a hackathon bet: that the problem of routing inference traffic across heterogeneous GPU clouds is structurally identical to the problem of routing marketing treatments across heterogeneous customer segments — and that the tooling should look the same.
This is the short version of the thesis. More to come.
The observation
Marketing orchestration, stripped to its primitives:
- You have a population (customers) with varied characteristics.
- You have treatments (offers, creatives) with different costs and expected outcomes.
- You have channels (email, direct mail, push) with different unit economics, latency, and constraints.
- You have objectives (maximize revenue under a budget cap, maximize response within a resting rule).
- A good orchestration system decides which treatment goes through which channel to which segment, subject to constraints, in real time.
Now inference orchestration, stripped to its primitives:
- You have requests (inference traffic) with varied characteristics (chat, batch, reasoning, voice, code).
- You have models (GPT, Claude, Llama variants) with different costs and quality profiles.
- You have providers (CoreWeave, Lambda, Together, OpenAI, Anthropic, self-hosted) with different GPUs, prices, availability tiers, and regional footprints.
- You have objectives (minimize cost, meet latency SLA, avoid single-provider concentration).
- A good orchestration system decides which model runs where for which request, subject to constraints, in real time.
These are the same problem.
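To make the isomorphism concrete, here is a minimal sketch of the shared shape of both problems. All names here (`Option`, `Route`, `score`) are illustrative, not Quinference's actual API: an `Option` stands for a treatment or a model, a `Route` for a channel or a provider, and the scoring rule is a toy quality-per-dollar heuristic gated on feasibility.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    """A treatment (marketing) or a model (inference)."""
    name: str
    unit_cost: float
    expected_quality: float

@dataclass(frozen=True)
class Route:
    """A channel (marketing) or a provider (inference)."""
    name: str
    latency_ms: float
    available: bool

def score(option: Option, route: Route, latency_sla_ms: float) -> float:
    """Toy objective: quality per dollar, with infeasible pairs ruled out."""
    if not route.available or route.latency_ms > latency_sla_ms:
        return float("-inf")
    return option.expected_quality / option.unit_cost
```

The point is not the scoring function — it's that neither dataclass needed to know which of the two domains it was modeling.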
The bet
If the problems are structurally the same, the interfaces probably should be too. Quinference borrows the campaign-orchestration mental model directly:
- Workload types are campaigns. A voice-agent workload is a campaign with a latency SLA. A batch summarization workload is a campaign with a cost ceiling.
- Providers are channels. Each channel has different unit economics, availability, and failure modes.
- Models are treatments. Different models suit different workload types, and quantization tradeoffs play the role that creative tradeoffs play in a campaign.
- Allocation is a segment matrix. A workload × provider × model grid where the cells are allocation percentages and the constraints are latency, cost, and concentration caps.
- Resting rules become concentration caps. “Don’t contact this customer twice in 30 days” becomes “don’t route more than 40% of this workload to one provider.”
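The mapping above can be sketched as data. This is a hypothetical allocation grid, not Quinference's real schema: cells are allocation shares for one workload across (provider, model) pairs, and the concentration-cap check is the routing analogue of a resting rule.

```python
# Hypothetical allocation grid: workload -> (provider, model) -> share of traffic.
allocation = {
    "voice-agent": {
        ("provider-a", "model-x"): 0.40,
        ("provider-b", "model-x"): 0.35,
        ("provider-b", "model-y"): 0.25,
    },
}

def violates_concentration_cap(workload: str, cap: float = 0.40) -> list[str]:
    """Return providers carrying more than `cap` of the workload's traffic."""
    per_provider: dict[str, float] = {}
    for (provider, _model), share in allocation[workload].items():
        per_provider[provider] = per_provider.get(provider, 0.0) + share
    return [p for p, s in per_provider.items() if s > cap]
```

Note that the cap applies per provider, summed across models: provider-b carries 60% of the voice-agent workload here, so a 40% cap flags it even though no single cell exceeds 40%.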
Once you set the problem up this way, a lot of the hard-won intuition from marketing operations transfers. Holdouts become A/B routing tests. Budget forecasting becomes cost simulation. Preflight checks become provider health checks.
Why it matters
The naive version of inference routing is “send everything to the cheapest provider that can meet my latency.” In practice that breaks down for the same reasons naive channel routing breaks down:
- Availability risk. Providers go down. If 100% of your workload is on one provider, your app is down with them.
- Price volatility. Spot prices move. An allocation that’s optimal today is mispriced tomorrow.
- Workload heterogeneity. A voice agent and a batch summarizer should not route the same way, even at the same token volume.
- Model heterogeneity. Dense and MoE models have totally different GPU affinities and time-to-first-token (TTFT) profiles.
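Even the naive policy already forces workload-aware routing. The sketch below — with made-up provider names and prices — implements "cheapest provider that meets the latency SLA" and shows a voice agent and a batch summarizer landing on different providers at the same token volume; what it deliberately ignores is everything the list above names: availability risk, price drift, and concentration.

```python
# Hypothetical provider quotes: (name, price per 1k tokens, p95 TTFT in ms).
quotes = [
    ("provider-a", 0.20, 900),
    ("provider-b", 0.60, 120),
    ("provider-c", 0.35, 400),
]

def naive_route(latency_sla_ms: float) -> str:
    """The naive policy: cheapest provider whose p95 TTFT meets the SLA."""
    feasible = [q for q in quotes if q[2] <= latency_sla_ms]
    return min(feasible, key=lambda q: q[1])[0]

# Same token volume, different SLAs, different routes:
naive_route(200)    # voice agent: only provider-b is fast enough
naive_route(2000)   # batch summarizer: everything qualifies, cheapest wins
```

A single SLA knob already splits the traffic; the optimization layer's job is to add the risk and concentration terms this policy leaves out.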
The interesting work is in the optimization layer that respects all of this. The dashboard is there to make the decisions legible — to you and, eventually, to an agent doing the optimization for you.
Why now
Two reasons.
First: the GPU market is the most heterogeneous infrastructure market to emerge in years. The pricing gap between the cheapest and most expensive providers for the same workload is routinely 3–5×. Very few teams are set up to exploit that.
Second: the marketing-tech playbook for exactly this kind of optimization already exists. It just needs to be ported. That’s what Quinference is.