A Primer on How to Deploy LLMs: Part 3
[This is Part 3 of a series. See Part 1, Part 2.]
RAG — Retrieval Augmented Generation
I mentioned above that it is useful to try to minimize places where you have to convert data from natural language strings to other formats and vice versa. The reason is that the conversion boundary is fraught with possible error states.
So far, when we've discussed getting additional context for an LLM-powered system, we've assumed that we're working with traditional structured databases. As a result, for an LLM to dynamically query that database, it would have to output some kind of structured query — maybe a SQL query or something similar.
This kinda sucks.
Yeah, maybe our system will be able to correctly query the database, but it's equally likely that it will:
Hallucinate a bad query;
Hallucinate a data schema that doesn't match the databases you actually have;
Query based on metadata that doesn't really do a good job describing what information we're actually looking for.
Ideally, we could stay in 'unstructured string reasoning land' for our queries too.
Enter vector databases. Abstractly, a vector database is like any other database. In our larger system, you can think of it as a specific way to run getContext. If our getContext method itself is powered by an LLM, you can assume that the Instructions for that LLM will have information about how to read and write to the vector database.
A very quick run-through of how vector databases work (sketched in code after the list):
Assume you have some kind of model that creates document embeddings;
For each document that you want to store in the db, embed the document;
The DB treats the embedding as the 'key' and the document as the 'value';
At read time, feed in some kind of query to the same embedding model;
Use the output embedding to run K-nearest neighbors in the vector space of all the embeddings stored in the DB;
Grab the top N closest vectors, find the corresponding documents, and return those.
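Here is a minimal sketch of those steps. The `embed()` function is a hypothetical stand-in for whatever embedding model you use, and brute-force cosine similarity stands in for the index a real vector database would use:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for your embedding model (an API call, a local
    model, etc.). Assumed to return a fixed-size vector for any string."""
    raise NotImplementedError

class ToyVectorDB:
    """Brute-force 'vector database': embeddings are the keys, documents are the values."""

    def __init__(self):
        self.keys = []    # embedding vectors
        self.values = []  # document strings

    def write(self, document: str) -> None:
        # Embed the document and store (embedding, document) as a key/value pair.
        self.keys.append(embed(document))
        self.values.append(document)

    def read(self, query: str, top_n: int = 3) -> list[str]:
        # Embed the query with the same model, then find the nearest stored
        # embeddings by cosine similarity and return their documents.
        q = embed(query)
        keys = np.stack(self.keys)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q))
        closest = np.argsort(-sims)[:top_n]
        return [self.values[i] for i in closest]
```

A production vector database swaps the brute-force scan for an approximate nearest-neighbor index, but the contract is the same: strings in, strings out.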
I'm not going to get super deep on embeddings here (take a look at some previous writing), but the gist is that you can query documents based on similarity in the embedding space.
This architecture is generally known as RAG.
The magic of the vector database is that you do not have to do any kind of structured query. Rather, the query itself is unstructured, in the same format that LLMs naturally do IO. This makes both reads and writes a lot easier and, generally, more accurate.
Note: because vector databases operate on similarity, it is very common to modify the RAG architecture slightly by taking in an input query, having an LLM generate a response to that query without any additional context, and then using that response to query the database for additional context. This works better because the generated response is going to 'look more similar' to documents already in the database than the query will.
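This is essentially the technique known as HyDE (hypothetical document embeddings). A sketch, reusing the `ToyVectorDB` from above and a hypothetical `llm()` call as a stand-in for your model of choice:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to your LLM of choice."""
    raise NotImplementedError

def retrieve_with_hypothetical_answer(query: str, db: ToyVectorDB, top_n: int = 3) -> list[str]:
    # 1. Answer the query with no extra context. The answer may be wrong, but
    #    it 'looks like' the documents we actually want to retrieve.
    hypothetical = llm(f"Answer the following question as best you can:\n{query}")
    # 2. Query the vector database with the generated answer, not the raw query.
    return db.read(hypothetical, top_n=top_n)
```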
Often, documents are too large for your embedding model to embed, or they're data structures that your embedding model isn't great at embedding (for example, your document is code, and your embedding model was trained on plain text). In such cases, you should pass each document through an LLM summarizer first, and then use the embedding of the summary as your embedding 'key' in the vector database.
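A sketch of that summarize-then-embed variant, reusing the hypothetical `llm()` and `embed()` stand-ins. Note that the stored value is still the original document; only the key changes:

```python
class SummarizingVectorDB(ToyVectorDB):
    """Keys are embeddings of LLM-written summaries; values are the original documents."""

    def write(self, document: str) -> None:
        summary = llm(f"Summarize the following document in plain prose:\n{document}")
        self.keys.append(embed(summary))  # key: embedding of the summary
        self.values.append(document)      # value: the original, full document
```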
LLM Metaprogramming
So by this point we've established that LLMs can provide inputs to other LLMs. In all of the examples thus far, that input is treated as 'Data'. It's still up to the end user to actually define the 'Instructions' of how the LLM works.
But LLMs don't distinguish between Instructions and Data. That's an artificial distinction that we created for clarity. So there's no reason an LLM can't output instructions for a different LLM, and then call that LLM as a subroutine. This is like code that writes other code — very weird to wrap your head around, but supremely powerful when it works.
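A minimal sketch of the pattern, with the same hypothetical `llm()` stand-in: the outer call writes the Instructions, and the inner call runs them as a subroutine over the Data.

```python
def run_subroutine(task: str, data: str) -> str:
    # The 'meta' LLM writes Instructions for another LLM...
    instructions = llm(
        "Write a precise system prompt for an LLM that will be asked to do the "
        f"following task:\n{task}\n"
        "Return only the prompt itself."
    )
    # ...and we call that LLM as a subroutine with the generated Instructions
    # plus the Data, exactly as if a human had written the prompt.
    return llm(f"{instructions}\n\n{data}")
```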
Such systems depend heavily on some kind of control mechanism (in the control-theory sense of the word) to avoid going off the rails. It's very common to have these systems get stuck in loops, where LLM 1 doesn't ever get a satisfactory result from LLM 2, and LLM 2 keeps cycling through the same answers without really understanding why it's not sufficient.
If you can get a real-world measure of whether you are making progress on a task, ideally with some real-world indication of how to move closer to the end goal, you can leverage LLM and meta-LLM orchestration to iterate towards a prompt and the necessary context to actually solve the problem.
For example, imagine you had a system to fix bugs in code bases. Useful control signals might be "how many tests pass", "why did the failing tests fail", and "does the code compile, and if not, why?". This is all feasible as a control system because the test suite itself will tell you why a test is still failing.
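A sketch of that loop, assuming a hypothetical `run_tests()` helper that applies a patch and reports failures, plus the `llm()` stand-in from earlier. The real-world signal (which tests fail, and why) is what keeps the loop from cycling aimlessly:

```python
def run_tests(patch: str) -> tuple[int, str]:
    """Hypothetical stand-in: apply the patch, run the test suite, and return
    (number of failing tests, concatenated failure messages)."""
    raise NotImplementedError

def fix_bug(bug_report: str, max_iterations: int = 5) -> str | None:
    feedback = "No attempts yet."
    for _ in range(max_iterations):
        # The LLM proposes a patch, conditioned on the bug report and on
        # whatever the test suite said about the previous attempt.
        patch = llm(
            f"Bug report:\n{bug_report}\n\n"
            f"Feedback from the last attempt:\n{feedback}\n\n"
            "Propose a patch that fixes the bug."
        )
        failures, messages = run_tests(patch)
        if failures == 0:
            return patch  # the control signal says we're done
        # Feed the real-world measurement (why tests still fail) back in.
        feedback = f"{failures} tests still failing:\n{messages}"
    return None  # give up rather than loop forever
```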
Now You're Thinking with Prompts
The core takeaway from this document, if there is one, is that LLMs are really versatile if you can prompt them correctly. With the right instructions and the right data, the modern batch of LLMs can get quite far. And with access to external tooling, they can do basically anything.
But doing it all well is another matter. These systems have weird dynamics and strange behaviors at the edge. The only way to build an intuition for these things is to play with them — a lot.