Taking large language models (LLMs) into production is no small task. It's a complex process, often misunderstood, and something we’d like to delve into today.
A common misunderstanding is that productionalizing LLMs is just a matter of deploying an app built with a framework like LangChain, LlamaIndex, or Haystack. This perspective skips over the crucial details - the bit-by-bit construction of the LLMOps machine.
The LLMOps jigsaw puzzle, much like MLOps, consists of varied pieces. Central to these is the creation of accurate, high-quality Retrieval-Augmented Generation (RAG) applications. To churn out reliable RAG applications consistently, your workflows need to be automated, observable, and repeatable.
To help you grasp what's required, let’s break down the setup of an LLMOps system into four essential steps. We can then focus on how a RAG application will evolve.
Step 1: Basic RAG
The first step is setting up a basic RAG architecture. We have the following puzzle pieces:
- Application: This is typically a deployed service that uses a chain orchestration framework like LangChain, or a similar proprietary codebase. It often appears as a chatbot or a similar tool.
- Embeddings Service: This service processes raw data and produces embeddings, which are essentially vectors that numerically represent our documents.
- Vector Database: This is a continuously updated database that stores our proprietary data, embedded into the same vector space as incoming queries. It is used to retrieve the documents that most closely match a query.
- The LLM: The large language model that will be used as the central “intelligence” piece in all this. We’ll give it the query and the retrieved documents at runtime.
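To make these pieces concrete, here is a minimal sketch of the whole flow in Python. It uses the OpenAI client for both the embeddings service and the LLM, with a small in-memory index standing in for a real vector database; the model names, example documents, and helper functions are illustrative assumptions, not a prescription, and any embedding service, vector store, or orchestration framework slots in at the same points.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embeddings service: turn raw text into vectors (model choice is illustrative)."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Stand-in "vector database": proprietary documents embedded in the same space as queries.
documents = [
    "Our refund policy lasts 30 days.",
    "Support is available 9-5 CET on weekdays.",
]
doc_vectors = embed(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents closest to the query by cosine similarity."""
    q = embed([query])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    """Application: give the LLM the query plus the retrieved documents at runtime."""
    context = "\n".join(retrieve(query))
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content

print(answer("How long do refunds last?"))
```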
Pipeline Spotlight: Indexing your data
The journey to establish an effective LLMOps system begins with setting up an indexing pipeline. The purpose of this pipeline is to load data from proprietary sources and index it in a vector database. Given the rapid changes your data can undergo, your application needs to stay updated with the most current information.
This pipeline needs to:
- Run in an automated manner.
- Track metadata of different collections ingested in the vector database.
- Store the configuration values and hyperparameters, such as chunking and embedding parameters, that will be important to iterate on as this system develops.
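As a rough sketch of what such a pipeline can look like, the snippet below chunks documents, embeds them, and upserts them together with the ingestion configuration as metadata. It assumes Chroma as a stand-in vector database; the collection name, chunking parameters, and sample document are illustrative, and each function maps naturally onto a step in an automated, scheduled pipeline.

```python
import chromadb

# Configuration to track and iterate on: collection identity, chunking, etc.
CONFIG = {
    "collection": "company_docs_v1",
    "chunk_size": 512,
    "chunk_overlap": 64,
}

def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def index_documents(raw_docs: dict[str, str]) -> None:
    """Load -> chunk -> embed -> upsert; meant to run on an automated schedule."""
    db = chromadb.PersistentClient(path="./index")
    collection = db.get_or_create_collection(CONFIG["collection"])
    for doc_id, text in raw_docs.items():
        chunks = chunk(text, CONFIG["chunk_size"], CONFIG["chunk_overlap"])
        collection.upsert(
            ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
            documents=chunks,
            # Store the ingestion config alongside every chunk so different
            # collections and parameter choices can be compared later.
            metadatas=[{**CONFIG, "source": doc_id} for _ in chunks],
        )

index_documents({"refund_policy": "Our refund policy lasts 30 days. Items must be unused."})
```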
Step 2: RAG + Two-Stage Retrieval
The next evolution of the system is to add reranking. A reranking model takes a query-document pair and produces a similarity score, which is then used to reorder the retrieved documents by their relevance to the query. A reranker doesn't have to be a large neural model: you can also train a simple ML model, such as a random forest, on proprietary data to act as one.
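To illustrate where this second stage sits, here is a hedged sketch using an off-the-shelf cross-encoder from the sentence-transformers library; the model name and candidate documents are assumptions for the example. A hosted reranker, or a simpler model you train yourself, would slot into the same place between first-stage vector search and the LLM.

```python
from sentence_transformers import CrossEncoder

# An off-the-shelf cross-encoder reranker (model choice is illustrative).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, document) pair and reorder candidates by relevance."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# First-stage retrieval (vector search) hands back a loose set of candidates;
# the reranker narrows them down before they reach the LLM.
candidates = [
    "Our refund policy lasts 30 days.",
    "Support is available 9-5 CET on weekdays.",
    "Refunds are processed within 5 business days of approval.",
]
print(rerank("How long do refunds last?", candidates, top_k=2))
```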
Pipeline Spotlight: Training our reranker
While you can use off-the-shelf rerankers such as the one Cohere offers, if you are concerned about data security or want higher accuracy, you might want to create a training pipeline that regularly updates your own reranker over time.
The pipeline needs to:
- Train (finetune) the reranker model to make it more accurate (here is a good example)
- Evaluate the reranker with key metrics to understand if it is improving accuracy at all
- Potentially add a deployment mechanism that exposes the reranker as a service behind an API.
The biggest drawback of adding a reranker is latency: it can be quite slow, so you will need to experiment with different methods to find one that still meets an acceptable SLA for your RAG system.
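To ground the training step, here is a rough sketch of fine-tuning a cross-encoder reranker with sentence-transformers on labeled query-document pairs. The base model, sample data, and hyperparameters are illustrative assumptions; evaluation and deployment of the resulting model are left as comments.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Labeled pairs mined from proprietary data: 1.0 = relevant, 0.0 = not relevant.
train_samples = [
    InputExample(texts=["How long do refunds last?", "Our refund policy lasts 30 days."], label=1.0),
    InputExample(texts=["How long do refunds last?", "Support is available 9-5 CET."], label=0.0),
]

# Start from a pretrained cross-encoder and fine-tune it on your own pairs.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(
    train_dataloader=DataLoader(train_samples, shuffle=True, batch_size=16),
    epochs=1,
    warmup_steps=10,
)
model.save("finetuned-reranker")

# Next: evaluate the new model on a held-out set (e.g. MRR / NDCG against the
# base model) and, if it wins, roll it out behind the reranking API.
```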
Step 3: Fine-Tuning Embeddings
Fine-tuning embeddings is a strategic middle ground between basic RAG and fine-tuning the LLM itself, and it can deliver significant improvements in retrieval quality. Similar to training a reranker, it is easy to fall into the trap of relying solely on off-the-shelf embedding services, which can raise data-security concerns and may underperform on your domain.
Pipeline Spotlight: Finetuning embeddings
The process of fine-tuning embeddings is a bit more involved than standard training, as it may require the generation of synthetic data. A great starting point is the Sentence Transformers library and its pretrained embedding models on Hugging Face.
The pipeline should:
- Possibly generate and store a synthetic dataset (some people utilize proprietary language models for this)
- Train (fine-tune) the embedding model to enhance its accuracy (here's a good example)
- Assess the performance of the fine-tuned embedding model against standard alternatives such as OpenAI's text-embedding-3-large.
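As a sketch of what the core training step might look like, the snippet below fine-tunes a Sentence Transformers model on (question, passage) pairs with a multiple-negatives ranking loss; such pairs are often generated synthetically by prompting a stronger LLM against your own documents. The base model and example pairs are assumptions for illustration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Synthetic (question, passage) pairs generated from proprietary documents.
train_examples = [
    InputExample(texts=["How long do refunds last?", "Our refund policy lasts 30 days."]),
    InputExample(texts=["When is support available?", "Support is available 9-5 CET on weekdays."]),
]

# Start from a pretrained embedding model (choice is illustrative).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# Treats the paired passage as the positive and other passages in the batch as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("finetuned-embeddings")

# Afterwards, benchmark retrieval quality (e.g. recall@k on held-out questions)
# against the embedding service you currently use.
```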
Step 4: LLM Fine-Tuning (or Pretraining)
The final step involves fine-tuning or pretraining the LLM itself. This is the most technically complex step, but it can yield results that are specific to your domain and data. Databricks has a great visualization that illustrates this progression in a similar way.
However, techniques like QLoRA (which can be executed on a single GPU) indicate a push towards ensuring that cutting-edge AI is not just the domain of those with access to vast computational resources. We’re going to be talking more about finetuning, but you can see some of the work going on behind the scenes in our LLM finetuning example and our LoRA finetuning project.
Pipeline Spotlight: Finetuning an LLM
Finetuning an LLM may itself be decomposed into several pipelines.
However, focusing solely on the training aspect, the pipeline should:
- Run on infrastructure that supports large GPUs, which are currently scarce.
- Be able to be warm-started, i.e., to recover from sporadic failures and resume training from a checkpoint.
- Possess the capability to merge various adapters, especially if parameter-efficient techniques are adopted.
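As a condensed, hedged sketch of that training step, the snippet below applies a LoRA adapter with Hugging Face transformers and peft, saves checkpoints so training can be warm-started after a failure, and merges the adapter back into the base weights at the end. The base model, dataset path, field names, and hyperparameters are placeholders, and the quantization that would make this a QLoRA-style single-GPU run is omitted for brevity.

```python
from pathlib import Path

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder base model
OUTPUT_DIR = "llm-finetune"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Parameter-efficient finetuning: only small LoRA adapter matrices are trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Placeholder instruction data with a "text" field.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    save_steps=500,  # frequent checkpoints make sporadic failures cheap to recover from
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Warm-start from the latest checkpoint if a previous run was interrupted.
has_checkpoint = any(Path(OUTPUT_DIR).glob("checkpoint-*"))
trainer.train(resume_from_checkpoint=True if has_checkpoint else None)

# Merge the trained adapter into the base weights for standalone deployment.
merged = model.merge_and_unload()
merged.save_pretrained(f"{OUTPUT_DIR}/merged")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/merged")
```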
Here we are at the end of this stage, with a fully developed LLMOps system. However, this doesn't conclude our journey. There are additional components to LLMOps, such as evaluation and guardrails, that we have not addressed in this document. Notably, we need to instrument these pipelines and applications in a manner that allows for control in a production environment.
Stay tuned here, as we will delve deeper into more aspects of LLMOps in the upcoming weeks, focusing on areas that are rarely discussed!
If you’re anywhere on the gradient of Step 1 to Step 4 as described here, ZenML might be a good fit for you. ZenML is an open-source MLOps (read: LLMOps) framework that allows you to architect your LLM RAG systems in production by building portable ML pipelines. We’re going to be publishing a detailed guide with more information soon - so stay tuned! However, you can already get started building your first MLOps pipelines with ZenML Cloud today.