Large Language Models (LLMs) are everywhere, from everyday apps to advanced tools. Using them is easy. But what if you need to run your own model? Whether you’ve fine-tuned one or are dealing with privacy-sensitive data, the complexity increases. In this post, we’ll share what we learned while building our own LLM inference system. We’ll cover storing and deploying models, designing the service architecture, and solving real-world issues like routing, streaming, and managing microservices. The process involved challenges, but ultimately, we built a reliable system and gathered lessons worth sharing.
LLMs are powering a wide range of applications — from chatbots and workflow agents to smart automation tools. While retrieval-augmented generation, tool-calling, and multi-agent protocols are important, they operate at a level above the core engine: a foundational LLM.
Many projects rely on external providers, accessible through public APIs. This is often the easiest path, but it isn’t always a viable one — whether because of data privacy, cost, or the need to serve fine-tuned models.
That’s where self-hosting becomes essential. Serving a pretrained or fine-tuned model provides control, security, and the ability to tailor the model to specific business needs. Building such a system doesn’t require a large team or extensive resources. We built it with a modest budget, a small team, and just a few nodes. This constraint influenced our architectural decisions, requiring us to focus on practicality and efficiency. In the following sections, we’ll cover the challenges we faced, the solutions implemented, and the lessons learned along the way.
These are the core components that form the backbone of the system.
Choosing the right schema for data transfer is crucial. A shared format across services simplifies integration, reduces errors, and improves adaptability. We aimed to design the system to work seamlessly with both self-hosted models and external providers — without exposing differences to the user.
There’s no universal standard for LLM data exchange. Many providers follow schemas similar to OpenAI’s, while others introduce their own variations.
Sticking to a single predefined provider’s schema has its benefits:
But there are real downsides too:
To address this, we chose to define our own internal data model — a schema designed around our needs, which we can then map to various external formats when necessary.
Before addressing the challenges, let’s define the problem and outline our expectations for the solution:
We began by reviewing major LLM schemas to understand how providers structure messages, parameters, and outputs. This allowed us to extract core domain entities common across most systems, including:
- Messages, carrying the prompt and dialogue history
- Generation parameters (e.g., `temperature`, `top_p`, `beam_search`)
- Tool definitions

We identified certain parameters, such as `service_tier`, `usage_metadata`, or `reasoning_mode`, as being specific to the provider's internal configuration and business logic. These elements lie outside the core LLM domain and are not part of the shared schema. Instead, they are treated as optional extensions. Whenever a feature becomes widely adopted or necessary for broader interoperability, we evaluate integrating it into the core schema.
At a high level, our input schema is structured with these key components:
- `model`: a routing key that selects the target model or service
- `messages`: the prompt and dialogue history
- `tools`: optional tool definitions
- `generation_parameters`: core generation settings (e.g., `temperature`, `top_p`, `max_tokens`)

This leads us to the following schema, represented in a simplified Pydantic-style definition:
from typing import Any

from pydantic import BaseModel

# Message, Tool, and BeamSearchParams are defined elsewhere in the shared schema library.


class ChatCompletionRequest(BaseModel):
    model: str  # Routing key to select the appropriate model or service
    messages: list[Message]  # Prompt and dialogue history
    generation_parameters: GenerationParameters  # Core generation settings
    tools: list[Tool]  # Optional tool definitions


class GenerationParameters(BaseModel):
    temperature: float
    top_p: float
    max_tokens: int
    beam_search: BeamSearchParams
    # Optional, non-core fields specific to certain providers
    provider_extensions: dict[str, Any] = {}
    ...
    # Other parameters
We deliberately moved generation parameters into a separate nested field instead of placing them at the root level. This design choice makes a distinction between constant parameters (e.g., temperature, top-p, model settings) and variable components (e.g., messages, tools). Many teams in our ecosystem store these constant parameters in external configuration systems, making this separation both practical and necessary.
We include an additional field called `provider_extensions` within the `GenerationParameters` class. Because these parameters vary significantly across LLM providers, validation and interpretation of these fields are delegated to the final module that handles model inference—the component that knows how to communicate with a specific model provider. This way, we avoid unnecessary pass-through coupling caused by redundant data validation across multiple services.
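As an illustration of that boundary, here is a minimal sketch of how a provider-facing module might fold `provider_extensions` into its outgoing payload. The adapter function and its field mapping are hypothetical and only meant to show the pass-through pattern.

```python
from typing import Any


def build_provider_payload(request: ChatCompletionRequest) -> dict[str, Any]:
    """Hypothetical adapter inside the provider-facing module.

    Upstream services treat provider_extensions as an opaque dict; only this
    module interprets (and validates) those keys for its specific provider.
    """
    params = request.generation_parameters
    payload: dict[str, Any] = {
        "model": request.model,
        "messages": [message.model_dump() for message in request.messages],
        "temperature": params.temperature,
        "top_p": params.top_p,
        "max_tokens": params.max_tokens,
    }
    # Merge provider-specific fields as-is (e.g., a provider's service tier flag).
    payload.update(params.provider_extensions)
    return payload
```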
To ensure backward compatibility, new output schema features are introduced as explicit, optional fields in the request schema. These fields act as feature flags — users must set them to opt into specific behaviors. This approach keeps the core schema stable while enabling incremental evolution. For example, reasoning traces will only be included in the output if the corresponding field is set in the request.
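As a small, hypothetical example of such a flag (the field name is ours, purely for illustration), an opt-in switch for reasoning traces could be added without affecting existing clients:

```python
class ChatCompletionRequest(BaseModel):
    # ...existing fields from the schema above...

    # Hypothetical feature flag: reasoning traces appear in the output only
    # when the caller explicitly opts in, so the default behavior is unchanged.
    include_reasoning_trace: bool = False
```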
These schemas are maintained in a shared Python library and used across services to ensure consistent request and response handling.
We began by outlining how we built our own platform — so why bother with compatibility across external providers? Despite relying on our internal infrastructure, there are still several scenarios where external models play a role:
The overall communication flow with external providers can be summarized as follows:
This process involves the following steps:
Provider-specific parameters are passed along via `provider_extensions`.

This is a high-level schematic that abstracts away some individual microservices. Details about specific components and the streaming response format will be covered in the following sections.
LLM responses are generated incrementally — token by token — and then aggregated into chunks for efficient transmission. From the user’s perspective, whether through a browser, mobile app, or terminal, the experience must remain fluid and responsive. This requires a transport mechanism that supports low-latency, real-time streaming.
There are two primary options for achieving this: Server-Sent Events (SSE) and WebSockets.
While both options are viable, SSE is the more commonly used solution for standard LLM inference — particularly for OpenAI-compatible APIs and similar systems. This is due to several practical advantages:
Because of these benefits, SSE is typically chosen for text-only, prompt-response streaming use cases.
However, some emerging use cases require richer, low-latency, bidirectional communication — such as real-time transcription or speech-to-speech interactions.
Since our system focuses exclusively on text-based interactions, we stick with SSE for its simplicity, compatibility, and alignment with our streaming model.
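For reference, a bare-bones SSE endpoint can be sketched with FastAPI's StreamingResponse, as below. This is a simplified illustration rather than our gateway code: the route path is arbitrary and `generate_chunks` stands in for the actual inference pipeline.

```python
import asyncio
import json
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_chunks(prompt: str) -> AsyncIterator[dict]:
    # Stand-in for the real pipeline: yields incremental generation chunks.
    for token in ["Hello", ",", " world", "!"]:
        await asyncio.sleep(0.05)
        yield {"delta": token}


@app.post("/v1/chat/completions")
async def chat_completions(body: dict) -> StreamingResponse:
    async def event_stream() -> AsyncIterator[str]:
        async for chunk in generate_chunks(body.get("prompt", "")):
            # Each SSE event is a "data:" line terminated by a blank line.
            yield f"data: {json.dumps(chunk)}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```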
With SSE selected as the transport layer, the next step was defining what data to include in the stream. Effective streaming requires more than just raw text — it needs to provide sufficient structure, metadata, and context to support downstream consumers such as user interfaces and automation tools. The stream must include the following information:
- Sequence-level structure: generated sequences (requested via the parameter `n`) are streamed back chunk by chunk. Each generation can consist of multiple sequences (e.g., `n=2`, `n=4`). These sequences are generated independently and streamed in parallel, each broken into its own set of incremental chunks.

After defining the structure of the streamed response, we also considered several non-functional requirements essential for reliability and future evolution.
Our stream design is intended to be:
In many applications — such as side-by-side comparison or diverse sampling — multiple sequences (completions) are generated in parallel as part of a single generation request.
The most comprehensive format for streaming responses is built around a `choices` array in each chunk:

`choices` (array): A list of chat completion choices. Can contain more than one element if `n` is greater than 1. Can also be empty for the last chunk.
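For illustration, a single chunk following this convention might look like the hand-written example below, carrying one delta for the sequence at index 0 (all values are made up).

```python
# Illustrative chunk in the choices-array convention described above.
chunk = {
    "id": "gen-123",
    "choices": [
        {
            "index": 0,                     # which sequence this delta belongs to
            "delta": {"content": "Hello"},  # incremental text for that sequence
            "finish_reason": None,          # set only on the sequence's last chunk
        }
    ],
}
```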
Although, in practice, individual chunks usually contain only a single delta, the format allows for multiple sequence updates per chunk. It’s important to account for this, as future updates might make broader use of this capability.
We chose to follow the same structure to ensure compatibility with a wide range of potential features. The diagram below illustrates an example from our implementation, where a single generation consists of three sequences, streamed in six chunks over time:
As you can see, to make the stream robust and easier to parse, we opted to explicitly signal Start and Finish events for both the overall generation and each individual sequence, rather than relying on implicit mechanisms such as null checks, EOFs, or magic tokens. This structured approach simplifies downstream parsing, especially in environments where multiple completions are streamed in parallel, and it also improves debuggability and fault isolation during development and runtime inspection.
Moreover, we introduce an additional Error chunk that carries structured information about failures. Some errors — such as malformed requests or authorization issues — can be surfaced directly via standard HTTP response codes. However, if an error occurs during the generation process, we have two options: either abruptly terminate the HTTP stream or emit a well-formed SSE error event. We chose the latter. Abruptly closing the connection makes it hard for clients to distinguish between network issues and actual model/service failures. By using a dedicated error chunk, we enable more reliable detection and propagation of issues during streaming.
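To make that event vocabulary concrete, the sketch below models Start, Delta, Finish, and Error chunks with a single discriminating field and serializes them as SSE frames. The exact names in our production schema differ, so treat this as illustrative.

```python
from typing import Literal

from pydantic import BaseModel


class StreamEvent(BaseModel):
    # Explicit lifecycle markers instead of null checks, EOFs, or magic tokens.
    event: Literal[
        "generation_start",
        "sequence_start",
        "delta",
        "sequence_finish",
        "generation_finish",
        "error",
    ]
    request_id: str
    sequence_index: int | None = None  # which parallel sequence the event refers to
    content: str | None = None         # token text, present on "delta" events
    error: str | None = None           # structured failure info, on "error" events


def to_sse_frame(event: StreamEvent) -> str:
    # One SSE data frame per event.
    return f"data: {event.model_dump_json()}\n\n"
```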
At the center of the system is a single entrypoint: LLM-Gateway. It handles basic concerns like authentication, usage tracking and quota enforcement, request formatting, and routing based on the specified model. While it may look like the Gateway carries a lot of responsibility, each task is intentionally simple and modular. For external providers, it adapts requests to their APIs and maps responses back into a unified format. For self-hosted models, requests are routed directly to internal systems using our own unified schema. This design allows seamless support for both external and internal models through a consistent interface.
As mentioned earlier, Server-Sent Events (SSE) is well-suited for streaming responses to end users, but it’s not a practical choice for internal backend communication. When a request arrives, it must be routed to a suitable worker node for model inference, and the result streamed back. While some systems handle this using chained HTTP proxies and header-based routing, in our experience, this approach becomes difficult to manage and evolve as the logic grows in complexity.
Our internal infrastructure needs to support:
To address these requirements, we use a message broker to decouple task routing from result delivery. This design provides better flexibility and resilience under varying load and routing conditions. We use RabbitMQ for this purpose.
Now let’s take a closer look at how this system is implemented in practice:
We use dedicated queues per model, allowing us to route requests based on model compatibility and node capabilities. The process is as follows:
To handle large payloads, we avoid overwhelming the message broker:
When it comes to routing and publishing messages, each Request Queue is a regular RabbitMQ queue, one per model.
If message loss is unacceptable, the following must be in place: the request queues must be durable, messages must be published as persistent, and publisher confirms should be enabled so the producer knows the broker has accepted each message.
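A rough sketch of that publishing path with the pika client is shown below: a durable per-model queue, persistent messages, and publisher confirms. The queue naming scheme and the payload shape are assumptions for illustration, not our exact setup.

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.confirm_delivery()  # publisher confirms: broker acknowledges each publish

model_name = "my-model"                # hypothetical routing key
queue_name = f"requests.{model_name}"  # one queue per model

# Durable queue: survives a broker restart.
channel.queue_declare(queue=queue_name, durable=True)

task = {
    "request_id": "req-42",
    "reply_to": "scheduler-replica-1",  # where response chunks should be sent
    "payload": {"messages": []},        # request body, elided here
}

# delivery_mode=2 marks the message as persistent; with confirms enabled,
# basic_publish raises if the broker does not acknowledge the message.
channel.basic_publish(
    exchange="",
    routing_key=queue_name,
    body=json.dumps(task),
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```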
So far, we’ve covered how tasks are published — but how is the streamed response handled? The first step is to understand how temporary queues work in RabbitMQ. The broker supports a concept called exclusive queues: a queue that can only be used by the connection that declared it and is deleted automatically when that connection closes.
We create one exclusive queue per Scheduler service replica, ensuring it’s automatically cleaned up when the replica shuts down. However, this introduces a challenge: while each service replica has a single RabbitMQ queue, it must handle many requests in parallel.
To address this, we treat the RabbitMQ queue as a transport layer, routing responses to the correct Scheduler replica. Each user request is assigned a unique identifier, which is included in every response chunk. Inside the Scheduler, we maintain an additional in-memory routing layer with short-lived in-memory queues — one per active request. Incoming chunks are matched to these queues based on the identifier and forwarded accordingly. These in-memory queues are discarded once the request completes, while the RabbitMQ queue persists for the lifetime of the service replica.
Schematically this looks as follows:
A central dispatcher within the Scheduler routes chunks to the appropriate in-memory queue, each managed by a dedicated handler. Handlers then stream the chunks to users over the SSE protocol.
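A simplified version of this dispatcher is sketched below using the aio-pika client (one possible choice, not necessarily ours): chunks arrive on the replica's exclusive queue and are forwarded to short-lived asyncio queues keyed by request ID. Field names such as `request_id` and `generation_finish` are illustrative.

```python
import asyncio
import json

import aio_pika

# request_id -> in-memory queue holding that request's chunks
active_requests: dict[str, asyncio.Queue] = {}


async def consume_replies(amqp_url: str) -> None:
    """Central dispatcher: route incoming chunks to per-request in-memory queues."""
    connection = await aio_pika.connect_robust(amqp_url)
    channel = await connection.channel()
    # Exclusive queue: owned by this replica's connection, deleted when it closes.
    reply_queue = await channel.declare_queue(exclusive=True)

    async with reply_queue.iterator() as messages:
        async for message in messages:
            async with message.process():
                chunk = json.loads(message.body)
                local_queue = active_requests.get(chunk["request_id"])
                if local_queue is not None:
                    await local_queue.put(chunk)
                # Chunks for unknown or already-finished requests are dropped.


async def stream_request(request_id: str):
    """Per-request handler: drain the in-memory queue and yield chunks for SSE."""
    local_queue: asyncio.Queue = asyncio.Queue()
    active_requests[request_id] = local_queue
    try:
        while True:
            chunk = await local_queue.get()
            yield chunk
            if chunk.get("event") == "generation_finish":
                break
    finally:
        del active_requests[request_id]  # discard once the request completes
```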
There are several mature frameworks available for efficient LLM inference, such as vLLM.
Through experience, we’ve learned that even minor library updates can significantly alter model behavior — whether in output quality, determinism, or concurrency behavior. Because of this, we’ve established a robust testing pipeline:
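One representative check such a pipeline can include is a regression test over a fixed prompt set with greedy decoding, so that any drift after a dependency bump is caught immediately. The sketch below is illustrative only: the `inference_client` fixture and the reference file are hypothetical.

```python
import json
from pathlib import Path

import pytest

# Hypothetical fixture file with prompts and previously approved outputs.
REFERENCE_CASES = json.loads(Path("tests/data/reference_completions.json").read_text())


@pytest.mark.parametrize("case", REFERENCE_CASES, ids=lambda c: c["name"])
def test_greedy_outputs_are_stable(case, inference_client):
    # Greedy decoding (temperature=0) makes the expected output deterministic,
    # so a changed completion signals altered model behavior, not sampling noise.
    result = inference_client.generate(
        prompt=case["prompt"],
        temperature=0.0,
        max_tokens=case["max_tokens"],
    )
    assert result == case["expected"]
```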
Most modern systems run in containerized environments — either in the cloud or within Kubernetes (K8s). While this setup works well for typical backend services, it introduces challenges around model weight storage. LLM weights can be tens or even hundreds of gigabytes in size, and baking them directly into Docker images quickly becomes problematic:
To solve this, we separate model storage from the Docker image lifecycle. Our models are stored in an external S3-compatible object storage and fetched just before inference service startup. To improve startup time and avoid redundant downloads, we also cache the downloaded weights locally on the nodes.
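A simplified version of that startup step might look like the sketch below, using boto3 against S3-compatible storage with a node-local cache directory. Bucket names, paths, and the caching scheme are placeholders rather than our exact setup.

```python
import os
from pathlib import Path

import boto3

CACHE_DIR = Path("/models")  # node-local cache directory (placeholder)


def fetch_model(bucket: str, prefix: str) -> Path:
    """Download model weights from S3-compatible storage unless already cached."""
    target = CACHE_DIR / prefix
    if target.exists():
        return target  # reuse weights downloaded by a previous startup

    s3 = boto3.client("s3", endpoint_url=os.environ["S3_ENDPOINT"])
    target.mkdir(parents=True, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            local_path = target / Path(obj["Key"]).name
            s3.download_file(bucket, obj["Key"], str(local_path))
    return target


# Called by the container entrypoint just before the inference server starts, e.g.:
# model_dir = fetch_model("llm-weights", "my-finetuned-model")
```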
A system like this — built on streaming, message queues, and real-time token generation — requires robust observability to ensure reliability and performance at scale.
In addition to standard service-level metrics (CPU, memory, error rates, etc.), we found it essential to monitor the following:
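As one concrete example of streaming-oriented instrumentation (the metric names here are illustrative, not our production ones), time-to-first-token latency and chunk throughput can be exported with prometheus_client:

```python
import time

from prometheus_client import Counter, Histogram

TIME_TO_FIRST_TOKEN = Histogram(
    "llm_time_to_first_token_seconds", "Delay before the first streamed chunk"
)
STREAMED_CHUNKS = Counter(
    "llm_streamed_chunks_total", "Total number of chunks sent to clients"
)


async def instrumented(chunks):
    """Wrap a chunk stream, recording first-chunk latency and chunk counts."""
    started = time.monotonic()
    first = True
    async for chunk in chunks:
        if first:
            TIME_TO_FIRST_TOKEN.observe(time.monotonic() - started)
            first = False
        STREAMED_CHUNKS.inc()
        yield chunk
```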
While our system is production-ready, there are still important challenges and opportunities for optimization:
While building a reliable and provider-independent LLM serving system can seem complex at first, it doesn’t require reinventing the wheel. Each component — streaming via SSE, task distribution through message brokers, and inference handled by runtimes like vLLM — serves a clear purpose and is grounded in existing, well-supported tools. With the right structure in place, it’s possible to create a maintainable and adaptable setup that meets production requirements without unnecessary complexity.
In the next post, we’ll explore more advanced topics such as distributed KV-caching, handling multiple models across replicas, and deployment workflows suited to ML-oriented teams.