Streaming¶
Pass `stream=True` to receive response tokens as they are generated rather than waiting for the full completion.
Sync Streaming¶
```python
from llmgate import completion

for chunk in completion("gpt-4o-mini", messages, stream=True):
    print(chunk.delta, end="", flush=True)
print()  # newline at end
```
Each chunk is a `StreamChunk` with a single `delta: str` field containing the incremental text.
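For illustration, the chunk type can be pictured as a small dataclass; this sketch is an assumption about its shape, not the library's actual definition:

```python
from dataclasses import dataclass

@dataclass
class StreamChunk:
    """Illustrative stand-in for llmgate's chunk type (assumed shape)."""
    delta: str  # incremental text carried by this chunk

# Joining the deltas in order reconstructs the full response text.
text = "".join(c.delta for c in [StreamChunk("Hel"), StreamChunk("lo")])
print(text)  # Hello
```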
Async Streaming¶
```python
import asyncio

from llmgate import acompletion

async def stream_response():
    async for chunk in await acompletion(
        "groq/llama-3.3-70b-versatile",
        messages,
        stream=True,
    ):
        print(chunk.delta, end="", flush=True)
    print()

asyncio.run(stream_response())
```
Collecting the full response¶
```python
chunks = []
for chunk in completion("gpt-4o-mini", messages, stream=True):
    chunks.append(chunk.delta)
    print(chunk.delta, end="", flush=True)

full_text = "".join(chunks)
```
Streaming through middleware¶
Streaming works seamlessly with LLMGate middleware:
```python
from llmgate import LLMGate
from llmgate.middleware import LoggingMiddleware, RetryMiddleware

gate = LLMGate(middleware=[
    RetryMiddleware(max_retries=3),
    LoggingMiddleware(),
])

for chunk in gate.stream("claude-3-5-haiku-20241022", messages):
    print(chunk.delta, end="", flush=True)
```
!!! warning "Incompatibility"
    `stream=True` and `response_format=` (structured outputs) cannot be used together. Streaming returns raw incremental text, while structured outputs require the complete response for Pydantic validation.
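If you need both behaviors, a common workaround is to stream the raw text for display and run validation only after the stream is complete. A minimal sketch using plain `json` (the llmgate call is omitted; `parts` simulates the streamed deltas):

```python
import json

def collect_and_parse(deltas):
    """Join streamed deltas, then parse the complete text as JSON.

    Validation (e.g. with a Pydantic model) can only happen here,
    once the full response has been assembled.
    """
    full_text = "".join(deltas)
    return json.loads(full_text)

# Simulated stream of deltas that together form one JSON object.
parts = ['{"name": ', '"Ada", ', '"age": 36}']
result = collect_and_parse(parts)
print(result["name"])  # Ada
```

The same pattern applies with a real stream: accumulate `chunk.delta` values as shown in "Collecting the full response", then validate the joined string once iteration finishes.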