Fallback & Routing¶
Fallback routing lets you pass multiple model strings instead of one. llmgate tries each model in order and returns the first successful response — automatically, transparently, with zero extra code.
This is one of the most useful reliability features for production LLM applications, because it removes the need to hand-roll retry loops across providers.
Quickstart¶
from llmgate import completion
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.text)
print(resp.provider)           # → whichever model succeeded
print(resp.fallback_attempts)  # → ["gpt-4o-mini"] if first model failed
That's it. When gpt-4o-mini hits a rate limit, llmgate silently tries groq/llama-3.1-8b-instant, then gemini-2.0-flash. Your application code never changes.
How it works¶
- llmgate tries each model in list order.
- If a model fails with a triggering error (RateLimitError, ProviderAPIError, or AuthError), it's recorded in fallback_attempts and the next model is tried.
- If a model fails with a non-triggering error (e.g. ModelNotFoundError), the error propagates immediately and no fallback is attempted.
- The first successful response is returned with fallback_attempts populated.
- If all models fail, AllProvidersFailedError is raised with the full (model, exception) list.
model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"]
        ↓ RateLimitError
                       ↓ ProviderAPIError
                                                    ↓ ✓ success → resp.fallback_attempts = ["gpt-4o-mini", "groq/llama-3.1-8b-instant"]
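The steps above can be sketched as a plain loop. This is a minimal illustration, not llmgate's actual implementation: the exception classes are local stand-ins for the ones in llmgate.exceptions, and run_with_fallback and fake_call are hypothetical names introduced only for this example.

```python
# Stand-in exception types; the real ones are imported from llmgate.exceptions.
class RateLimitError(Exception): ...
class ProviderAPIError(Exception): ...
class AuthError(Exception): ...
class ModelNotFoundError(Exception): ...


class AllProvidersFailedError(Exception):
    def __init__(self, errors):
        super().__init__(f"all {len(errors)} models failed")
        self.errors = errors  # list of (model, exception) pairs


DEFAULT_FALLBACK_ON = (RateLimitError, ProviderAPIError, AuthError)


def run_with_fallback(models, call, fallback_on=DEFAULT_FALLBACK_ON):
    """Return (model, result, fallback_attempts) for the first model that succeeds."""
    errors = []
    for model in models:
        try:
            result = call(model)
        except fallback_on as exc:
            # Triggering error: record it and move on to the next model.
            errors.append((model, exc))
            continue
        # Any other exception type is not caught here, so it propagates immediately.
        return model, result, [failed for failed, _ in errors]
    raise AllProvidersFailedError(errors)


def fake_call(model):
    """Hypothetical provider behaviour mirroring the diagram above."""
    if model == "gpt-4o-mini":
        raise RateLimitError("429")
    if model == "groq/llama-3.1-8b-instant":
        raise ProviderAPIError("503")
    return f"hello from {model}"


winner, text, attempts = run_with_fallback(
    ["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    fake_call,
)
print(winner, attempts)
```

Catching only the fallback_on tuple keeps non-triggering errors such as ModelNotFoundError out of the loop entirely, which matches the "propagates immediately" rule.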
Observability¶
CompletionResponse has a new field, fallback_attempts:
- Empty list ([]): the first model succeeded and no fallback occurred
- Non-empty (["gpt-4o-mini"]): that model failed, and resp.provider is the one that worked
Use this for logging, alerting, or dashboards:
import logging
logger = logging.getLogger(__name__)
resp = completion(model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"], messages=messages)
if resp.fallback_attempts:
    logger.warning(
        "Provider fallback: %s failed, used %s",
        resp.fallback_attempts,
        resp.provider,
    )
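For dashboards, one option is to count how often each model appears in fallback_attempts. The sketch below uses only the standard library; record_fallbacks and fallback_counts are hypothetical names, and SimpleNamespace stands in for a real CompletionResponse so the example runs without any provider.

```python
from collections import Counter
from types import SimpleNamespace

# In-process tally of how many times each model had to be skipped.
fallback_counts = Counter()


def record_fallbacks(resp):
    """Count every failed model on the response; return True if a fallback occurred."""
    fallback_counts.update(resp.fallback_attempts)
    return bool(resp.fallback_attempts)


# Stand-in responses exposing only the fields used above.
responses = [
    SimpleNamespace(provider="openai", fallback_attempts=[]),
    SimpleNamespace(provider="groq", fallback_attempts=["gpt-4o-mini"]),
    SimpleNamespace(
        provider="gemini",
        fallback_attempts=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    ),
]
fell_back = [record_fallbacks(r) for r in responses]
print(fallback_counts.most_common())
```

In a real service the Counter would typically be replaced by whatever metrics client the application already uses.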
Three API surfaces¶
1. Top-level completion() / acompletion()¶
The simplest path — just swap a str for a list[str]:
from llmgate import completion, acompletion
# Sync
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
)
# Async (call from inside a coroutine)
resp = await acompletion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
)
2. LLMGate(fallback_chain=[...]) — app-level config¶
Configure the chain once at startup. All middleware (retry, logging, cache) applies to each candidate before falling back:
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, LoggingMiddleware
gate = LLMGate(
    fallback_chain=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    middleware=[
        RetryMiddleware(max_retries=2),  # retries each model before fallback
        LoggingMiddleware(level="INFO"),
    ],
)
# model arg is optional when fallback_chain is configured
resp = gate.completion(messages=messages)
resp = await gate.acompletion(messages=messages)  # inside a coroutine
Retry then fall back
With LLMGate(fallback_chain=[...]), RetryMiddleware wraps each individual model attempt. So the sequence is:
- Try gpt-4o-mini → fail → retry up to 2× → still fails
- Try groq/llama-3.1-8b-instant → fail → retry up to 2× → still fails
- Try gemini-2.0-flash → ✓ success
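The retry-then-fall-back ordering can be simulated without any provider. This is a sketch under assumptions: TransientError, call_with_retry, fallback_chain, and flaky are hypothetical names, and "retry up to 2×" is read as one initial call plus two retries per model.

```python
class TransientError(Exception):
    """Stand-in for an error that triggers both retry and fallback."""


def call_with_retry(model, call, max_retries=2):
    """Call once, then retry up to max_retries more times on TransientError."""
    for attempt in range(max_retries + 1):
        try:
            return call(model)
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: let the fallback layer handle it


def fallback_chain(models, call, max_retries=2):
    """Run the retry wrapper for each model in order; return the first success."""
    failed = []
    for model in models:
        try:
            return model, call_with_retry(model, call, max_retries), failed
        except TransientError:
            failed.append(model)
    # Simplified stand-in for AllProvidersFailedError.
    raise RuntimeError(f"all models failed: {failed}")


calls = []  # every individual attempt, in order


def flaky(model):
    """Hypothetical provider: only gemini-2.0-flash ever succeeds."""
    calls.append(model)
    if model != "gemini-2.0-flash":
        raise TransientError(model)
    return "ok"


chain_winner, chain_text, chain_failed = fallback_chain(
    ["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    flaky,
)
print(chain_winner, len(calls))
```

Under that reading, the first two models are each called three times before the chain moves on, so a slow-failing provider can add noticeable latency before the fallback is reached.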
3. FallbackMiddleware — composable middleware¶
Drop FallbackMiddleware into any existing middleware stack:
from llmgate import LLMGate
from llmgate.middleware import RetryMiddleware, FallbackMiddleware
gate = LLMGate(middleware=[
    RetryMiddleware(max_retries=2),
    FallbackMiddleware(
        models=["groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    ),
])
resp = gate.completion("gpt-4o-mini", messages)
Middleware coverage on fallback models
With FallbackMiddleware, the primary model goes through the full middleware stack. Fallback models are called directly (bypassing middleware) to avoid recursive chains. Use LLMGate(fallback_chain=[...]) if you need full middleware coverage on every candidate.
Customising fallback_on¶
By default, fallback triggers on RateLimitError, ProviderAPIError, and AuthError.
Override this with the fallback_on parameter:
from llmgate import completion
from llmgate.exceptions import RateLimitError
# Only fall back on rate limits — auth errors propagate immediately
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    messages=messages,
    fallback_on=(RateLimitError,),
)
Or configure it on the gate:
from llmgate import LLMGate
from llmgate.exceptions import RateLimitError
gate = LLMGate(
    fallback_chain=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
    fallback_on=(RateLimitError,),
)
Handling total failure¶
When every model in the chain fails, AllProvidersFailedError is raised. It contains the full (model, exception) list for diagnostics:
from llmgate import completion
from llmgate.exceptions import AllProvidersFailedError
try:
    resp = completion(
        model=["gpt-4o-mini", "groq/llama-3.1-8b-instant"],
        messages=messages,
    )
except AllProvidersFailedError as e:
    print(f"All {len(e.errors)} providers failed:")
    for model, exc in e.errors:
        print(f"  {model}: {type(exc).__name__}: {exc}")
Streaming¶
Streaming (stream=True) is fully supported with model lists and fallback chains. When a failure occurs mid-stream, llmgate dynamically recovers the stream using one of three stream_fallback_mode strategies:
from llmgate import completion
resp = completion(
    model=["gpt-4o-mini", "groq/llama-3.1-8b-instant", "gemini-2.0-flash"],
    messages=messages,
    stream=True,
    stream_fallback_mode="prefill",  # "restart" | "prefill" | "user_turn"
)
for chunk in resp:
    print(chunk.delta, end="")
Strategies¶
- "restart" (default): Safe and universal. On any failure, the fallback model starts fresh with the original messages. No partial text is carried forward.
- "prefill": Buffer-and-resume. The partial text already yielded is appended as a trailing {"role": "assistant"} message, and the fallback model continues the generation from that exact point. Supported natively by Gemini, Groq, Mistral, Cohere, and Ollama. If the fallback provider does not support assistant prefilling, llmgate automatically downgrades to "user_turn" and emits a warning.
- "user_turn": Wraps the partial text in an assistant message, followed by a user prompt to continue (e.g., "Continue from exactly where you left off"). Works across all providers without risking API schema rejection.
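To make the three strategies concrete, the sketch below builds the message list a fallback model might receive after a mid-stream failure. It illustrates the descriptions above and is not llmgate's internal code: build_resume_messages, supports_prefill, and CONTINUE_PROMPT are hypothetical names, and the exact continuation prompt llmgate sends is an assumption.

```python
import warnings

MODES = ("restart", "prefill", "user_turn")
# Assumed wording, taken from the example prompt in the strategy description.
CONTINUE_PROMPT = "Continue from exactly where you left off"


def build_resume_messages(messages, partial_text, mode="restart", supports_prefill=True):
    """Return the messages to send to the fallback model for a given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown stream_fallback_mode: {mode!r}")
    if mode == "prefill" and not supports_prefill:
        # Mirrors the documented downgrade when the provider rejects prefill.
        warnings.warn("provider does not support prefill; using 'user_turn'")
        mode = "user_turn"
    if mode == "restart" or not partial_text:
        return list(messages)  # start over with the original conversation
    assistant_turn = {"role": "assistant", "content": partial_text}
    if mode == "prefill":
        return [*messages, assistant_turn]
    # user_turn: partial text plus an explicit request to continue
    return [*messages, assistant_turn, {"role": "user", "content": CONTINUE_PROMPT}]


original = [{"role": "user", "content": "Write a haiku about ponds."}]
partial = "An old silent pond,"

for mode in MODES:
    roles = [m["role"] for m in build_resume_messages(original, partial, mode)]
    print(mode, roles)
```

The function returns a new list in every branch, so the caller's original messages are never mutated between attempts.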
Observability¶
Streaming chunks include observability metadata, so you know exactly what is happening mid-stream:
chunk.fallback_attempts # list[str] - Models tried before this chunk's model
chunk.resumed_from_partial # bool - True if the stream resumed via prefill/user_turn
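A consumer that buffers the streamed text can use these two fields to decide whether to keep or discard what it has already collected. The sketch below is an assumption about how a caller might react, not documented llmgate behaviour: Chunk is a stand-in dataclass with the same field names, and collect is a hypothetical helper that drops the buffer when a fallback starts fresh instead of resuming.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a streaming chunk, with the fields described above."""
    delta: str
    fallback_attempts: list = field(default_factory=list)
    resumed_from_partial: bool = False


def collect(chunks):
    """Join deltas; reset the buffer when a fallback begins without resuming."""
    text, seen_attempts = [], []
    for chunk in chunks:
        if chunk.fallback_attempts != seen_attempts:
            # A different model is now producing the stream.
            seen_attempts = list(chunk.fallback_attempts)
            if not chunk.resumed_from_partial:
                text.clear()  # fresh generation: the earlier partial text is stale
        text.append(chunk.delta)
    return "".join(text), seen_attempts


resumed_stream = [
    Chunk("Hello, "),
    Chunk("wor"),
    Chunk("ld!", fallback_attempts=["gpt-4o-mini"], resumed_from_partial=True),
]
restarted_stream = [
    Chunk("Hel"),
    Chunk("Hello ", fallback_attempts=["gpt-4o-mini"]),
    Chunk("there!", fallback_attempts=["gpt-4o-mini"]),
]
print(collect(resumed_stream))
print(collect(restarted_stream))
```

Text that has already been shown to a user cannot be taken back, so a UI that prints deltas as they arrive would need its own way to signal a restart.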
Reference¶
| Parameter | Where | Description |
|---|---|---|
| model: list[str] | completion(), acompletion() | Ordered fallback chain |
| fallback_on | completion(), acompletion(), LLMGate() | Exception types that trigger fallback. Default: (RateLimitError, ProviderAPIError, AuthError) |
| fallback_chain | LLMGate() | App-level fallback chain with full middleware per candidate |
| FallbackMiddleware(models=[...]) | LLMGate(middleware=[...]) | Composable middleware fallback |
| resp.fallback_attempts | CompletionResponse | Models tried before this response |
| AllProvidersFailedError | exception | Raised when all models fail; .errors is list[tuple[str, Exception]] |