Fallback & Routing¶
Fallback routing lets you pass multiple model strings instead of one. llmgate tries each model in order and returns the first successful response — automatically, transparently, with zero extra code.
For production LLM applications this is one of the highest-impact reliability features: no more hand-rolling retry loops across providers.
Quickstart¶
from llmgate import completion
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.text)
print(resp.provider) # → whichever model succeeded
print(resp.fallback_attempts) # → ["gpt-4o-mini"] if first model failed
That's it. When gpt-4o-mini hits a rate limit, llmgate silently tries groq/llama-3.1-8b-instant, then gemini-2.0-flash. Your application code never changes.
How it works¶
- llmgate tries each model in list order.
- If a model fails with a triggering error (RateLimitError, ProviderAPIError, or AuthError), it's recorded in fallback_attempts and the next model is tried.
- If a model fails with a non-triggering error (e.g. ModelNotFoundError), the error propagates immediately; no fallback occurs.
- The first successful response is returned with fallback_attempts populated.
- If all models fail, AllProvidersFailedError is raised with the full (model, exception) list.
model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"]
    ↓ gpt-4o-mini               → RateLimitError
    ↓ groq/llama-3.1-8b-instant → ProviderAPIError
    ✓ gemini-2.0-flash          → success
resp.fallback_attempts = ["gpt-4o-mini", "groq/llama-3.1-8b-instant"]
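The non-triggering case is worth seeing in code. A minimal sketch, using a deliberately misspelled model name (hypothetical) to provoke ModelNotFoundError:

from llmgate import completion
from llmgate.exceptions import ModelNotFoundError

try:
    resp = completion(
        # hypothetical typo in the first model name
        model=["gpt-4o-minii", "groq/llama-3.1-8b-instant"],
        messages=[{"role": "user", "content": "Hello!"}],
    )
except ModelNotFoundError as exc:
    # Non-triggering error: raised immediately, the second model is never tried
    print(f"Fix the model name: {exc}")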
Observability¶
CompletionResponse has a new field, fallback_attempts:
- Empty list ([]): the first model succeeded and no fallback occurred.
- Non-empty (e.g. ["gpt-4o-mini"]): that model failed, and the current resp.provider is the one that worked.
Use this for logging, alerting, or dashboards:
resp = completion(model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"], messages=messages)
if resp.fallback_attempts:
    logger.warning(
        "Provider fallback: %s failed, used %s",
        resp.fallback_attempts,
        resp.provider,
    )
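Beyond logging, the same field can feed a dashboard. A sketch using prometheus_client, which is not part of llmgate and is shown only as one possible metrics sink; the metric name is made up:

from prometheus_client import Counter

from llmgate import completion

# Hypothetical metric: one increment per model that failed before the winning one
fallback_counter = Counter(
    "llm_provider_fallbacks_total",
    "Times a model failed and a fallback model was used",
    ["failed_model"],
)

messages = [{"role": "user", "content": "Hello!"}]
resp = completion(model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"], messages=messages)
for failed in resp.fallback_attempts:
    fallback_counter.labels(failed_model=failed).inc()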
Three API surfaces¶
1. Top-level completion() / acompletion()¶
The simplest path — just swap a str for a list[str]:
from llmgate import completion, acompletion
# Sync
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
)

# Async
resp = await acompletion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
)
2. LLMGate(fallback_chain=[...]) — app-level config¶
Configure the chain once at startup. All middleware (retry, logging, cache) applies to each candidate before falling back:
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, LoggingMiddleware
gate = LLMGate(
    fallback_chain=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    middleware=[
        RetryMiddleware(max_retries=2),  # retries each model before fallback
        LoggingMiddleware(level="INFO"),
    ],
)

# model arg is optional when fallback_chain is configured
resp = gate.completion(messages=messages)
resp = await gate.acompletion(messages=messages)
Retry then fall back
With LLMGate(fallback_chain=[...]), RetryMiddleware wraps each individual model attempt. So the sequence is:
- Try gpt-4o-mini → fail → retry up to 2× → still fails
- Try groq/llama-3.1-8b-instant → fail → retry up to 2× → still fails
- Try gemini-2.0-flash → ✓ success
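In the worst case, and assuming max_retries=2 means two retries after the initial attempt, that is up to three calls per model and 3 × 3 = 9 provider calls across this chain before AllProvidersFailedError is raised.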
3. FallbackMiddleware — composable middleware¶
Drop FallbackMiddleware into any existing middleware stack:
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, FallbackMiddleware
gate = LLMGate(middleware=[
    RetryMiddleware(max_retries=2),
    FallbackMiddleware(
        models=["groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    ),
])
resp = gate.completion(model="gpt-4o-mini", messages=messages)
Middleware coverage on fallback models
With FallbackMiddleware, the primary model goes through the full middleware stack. Fallback models are called directly (bypassing middleware) to avoid recursive chains. Use LLMGate(fallback_chain=[...]) if you need full middleware coverage on every candidate.
Customising fallback_on¶
By default, fallback triggers on RateLimitError, ProviderAPIError, and AuthError.
Override this with the fallback_on parameter:
from llmgate import completion
from llmgate.exceptions import RateLimitError
# Only fall back on rate limits — auth errors propagate immediately
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
    fallback_on=(RateLimitError,),
)
Or configure it on the gate:
gate = LLMGate(
    fallback_chain=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    fallback_on=(RateLimitError,),
)
Handling total failure¶
When every model in the chain fails, AllProvidersFailedError is raised. It contains the full (model, exception) list for diagnostics:
from llmgate import completion
from llmgate.exceptions import AllProvidersFailedError
try:
    resp = completion(
        model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
        messages=messages,
    )
except AllProvidersFailedError as e:
    print(f"All {len(e.errors)} providers failed:")
    for model, exc in e.errors:
        print(f"  {model}: {type(exc).__name__}: {exc}")
Streaming¶
Not supported with model lists
stream=True cannot be combined with a model list. Streaming fallback is planned for v0.7.
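Until then, one workaround is to hand-roll the loop over candidates, passing a single model string with stream=True each time. The sketch below assumes the triggering exceptions are raised when the stream is opened, before any tokens arrive; that detail is not documented here, so treat it as illustrative:

from llmgate import completion
from llmgate.exceptions import AuthError, ProviderAPIError, RateLimitError

def stream_with_fallback(models, messages):
    errors = []
    for model in models:
        try:
            # A single model string keeps stream=True valid; fallback is done by hand
            return completion(model=model, messages=messages, stream=True)
        except (RateLimitError, ProviderAPIError, AuthError) as exc:
            errors.append((model, exc))
    raise RuntimeError(f"All models failed: {errors}")

messages = [{"role": "user", "content": "Hello!"}]
stream = stream_with_fallback(["gpt-4o-mini", "groq/llama-3.1-8b-instant"], messages)

This only covers failures raised before streaming starts; a mid-stream error cannot be recovered this way.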
Reference¶
| Parameter | Where | Description |
|---|---|---|
| model: list[str] | completion(), acompletion() | Ordered fallback chain |
| fallback_on | completion(), acompletion(), LLMGate() | Exception types that trigger fallback. Default: (RateLimitError, ProviderAPIError, AuthError) |
| fallback_chain | LLMGate() | App-level fallback chain with full middleware per candidate |
| FallbackMiddleware(models=[...]) | LLMGate(middleware=[...]) | Composable middleware fallback |
| resp.fallback_attempts | CompletionResponse | Models tried before this response |
| AllProvidersFailedError | exception | Raised when all models fail; .errors is list[tuple[str, Exception]] |