LLM Operating System (LLM OS)
LLM OS vs. Traditional OS
To better understand this concept, let’s draw parallels between the new LLM operating system and today’s conventional operating systems. Think of the memory hierarchy: the internet and disk storage, which the model reaches through browsing and retrieval, play the role of secondary storage, while the context window of an LLM acts as its random access memory (RAM). This context window, although finite, is the LLM’s working memory, enabling it to predict and generate coherent text sequences.
In this analogy, the kernel process of the LLM manages this context window, paging relevant information in and out to perform tasks efficiently. Much like traditional operating systems, LLMs can exhibit multi-threading, multiprocessing, and speculative execution within their context windows. There are also equivalents to user space and kernel space, reflecting the complex management of computational resources.
The analogy holds conceptually, but practically, these systems need substantial advancements in memory management and real-time processing to match traditional OS capabilities. Researchers are exploring memory-augmented neural networks and other techniques to overcome these challenges.
Current LLMs have limited context windows and often require external mechanisms to manage longer interactions. For instance, GPT-4 Turbo’s context window is limited to 128,000 tokens, which can be a constraint for tasks requiring long-term context retention. A token is roughly ~5 bytes of text, so 128,000 tokens correspond to about 640K of “RAM.” To put that in context, 1980s PCs shipped with 640K of RAM. This is encouraging: it suggests we may compress half a century of memory-capacity progress into the coming decade. It also means a context window would need a mind-boggling ~200 million tokens to match the capacity of 1 GB of RAM.
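To make these back-of-the-envelope numbers concrete, here is a quick sanity check in Python (the ~5 bytes-per-token figure is the rough assumption from the text; real tokenizers vary):

```python
# Back-of-the-envelope check of the context-window-as-RAM numbers above.
BYTES_PER_TOKEN = 5            # rough average assumed in the text; real tokenizers vary

context_tokens = 128_000       # a 128K-token context window
ram_equivalent_bytes = context_tokens * BYTES_PER_TOKEN
print(f"{context_tokens:,} tokens ~= {ram_equivalent_bytes / 1e3:.0f} KB")  # ~640 KB

target_bytes = 1e9             # 1 GB (decimal)
tokens_needed = target_bytes / BYTES_PER_TOKEN
print(f"1 GB ~= {tokens_needed / 1e6:.0f} million tokens")                  # ~200 million
```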
At Kernel Level
Consider integrating our advanced language model with the Linux kernel. This would provide the AI with comprehensive access to the operating system's core functionalities. However, it's important to recognize that large language models (LLMs) are designed for human-like interaction, not intricate coding tasks. While embedding the model at the kernel level offers the advantage of understanding and controlling detailed system operations, it raises valid security concerns. Responsible development is crucial to ensure that the AI's evolving decision-making capabilities don't inadvertently compromise system integrity.
Internal Memory (LLM RAM): A small space where the LLM keeps important information.
External Memory (LLM HDD): A much larger space where the LLM can store and retrieve data when needed (see the sketch below).
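This split is essentially what memory-augmented agent designs (e.g. MemGPT-style systems) build around the model. As a minimal, hypothetical sketch (the class and its eviction/retrieval policy are illustrative only, not taken from any particular library):

```python
from collections import deque

class TwoTierMemory:
    """Hypothetical two-tier memory: a small in-context buffer (LLM RAM)
    backed by a much larger external store (LLM HDD)."""

    def __init__(self, context_limit_tokens: int = 8_000):
        self.context_limit = context_limit_tokens
        self.context = deque()        # "RAM": what actually goes into the prompt
        self.archive = []             # "HDD": everything evicted from the context

    def _tokens(self, text: str) -> int:
        return len(text.split())      # crude stand-in for a real tokenizer

    def add(self, text: str) -> None:
        """Append new information, evicting the oldest entries to the archive
        when the context budget is exceeded (a paging-like policy)."""
        self.context.append(text)
        while sum(self._tokens(t) for t in self.context) > self.context_limit:
            self.archive.append(self.context.popleft())

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Naive retrieval from external memory: keyword overlap instead of
        embeddings, purely for illustration."""
        scored = sorted(
            self.archive,
            key=lambda t: len(set(query.lower().split()) & set(t.lower().split())),
            reverse=True,
        )
        return scored[:k]
```

A real system would use a proper tokenizer and embedding-based retrieval; the point here is only the shape of the RAM/HDD split.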
Traditional Use in CPUs
Branch prediction is a technique CPUs use to speed up execution. When the processor reaches a point where it must choose between two possible instruction paths, it guesses which path will be taken rather than waiting for the decision. Here’s a breakdown (a toy sketch of this loop follows the list):
Prediction: The computer makes a guess about the next set of instructions.
Speculative Execution: Based on that guess, the computer starts working on those instructions before it knows whether the guess is right.
Correction: If the guess turns out to be right, the computer keeps going. If it’s wrong, the speculative work is discarded and execution restarts down the other path.
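The same predict-speculate-verify pattern is what speculative decoding applies to tokens (a small draft model guesses, the large model checks). As a toy, self-contained illustration of the three steps (everything here, including the 90% “taken” rate, is invented for the example):

```python
import random

def run_branch(taken: bool) -> str:
    """Stand-in for executing one of the two instruction paths."""
    return "taken-path result" if taken else "not-taken-path result"

def branch_outcome() -> bool:
    """Stand-in for resolving the branch condition (taken ~90% of the time)."""
    return random.random() < 0.9

results, flushes = [], 0
for _ in range(1_000):
    guess = True                      # 1. Prediction: always guess "taken"
    speculative = run_branch(guess)   # 2. Speculative execution: do the work before the condition resolves
    actual = branch_outcome()
    if guess == actual:               # 3. Correction
        results.append(speculative)   #    right guess: keep the speculative work
    else:
        flushes += 1                  #    wrong guess: discard it and run the other path
        results.append(run_branch(actual))

print(f"misprediction rate ~ {flushes / len(results):.0%}")   # about 10% with this toy predictor
```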
Virtual Memory Paging
Traditional Use in Computers
Virtual memory is an abstraction provided by the operating system that makes it seem to an application as if it has access to more RAM than is physically available. When the actual RAM gets filled up, the operating system uses a portion of the computer’s storage space (typically the hard drive) as an extension of RAM. This process enables the computer to handle more tasks concurrently by mapping the application’s memory addresses to actual physical locations, which could be in RAM, on the hard disk, or even other storage mediums.
In essence, virtual memory gives applications the illusion they’re utilizing a large, contiguous chunk of RAM, even though the reality behind the scenes might be quite different.
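As a tiny illustration of that mapping (a toy page table, nothing like a real MMU):

```python
PAGE_SIZE = 4096  # bytes per page, a common default

# Toy page table: virtual page number -> physical frame number.
# Pages 0 and 1 are resident in RAM; page 2 has been swapped out to disk.
page_table = {0: 7, 1: 3, 2: None}

def translate(virtual_address: int) -> int:
    """Translate a virtual address to a physical one, or raise a 'page fault'."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table.get(vpn)
    if frame is None:
        raise RuntimeError(f"page fault: page {vpn} must be brought in from disk")
    return frame * PAGE_SIZE + offset

print(hex(translate(4100)))   # page 1, offset 4 -> frame 3 -> 0x3004
# translate(2 * PAGE_SIZE) would raise: page 2 lives on disk until paged back in
```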
How LLMs Use This Idea
Transformers, and LLMs in particular, rely on a mechanism called the “KV cache”: a RAM-like store that holds the key and value vectors produced during attention so they can be reused for subsequent tokens. To efficiently handle long sequences whose caches don’t fit comfortably in memory, serving systems can adopt techniques inspired by virtual memory paging.
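To see why this cache grows with the sequence, here is a stripped-down sketch of single-head decoding with a per-layer KV cache (toy shapes and random vectors; no real model or library API is being reproduced):

```python
import numpy as np

D_HEAD, N_LAYERS = 64, 2   # toy sizes

# One (keys, values) buffer per layer; each grows by one row per generated token.
kv_cache = [{"k": np.zeros((0, D_HEAD)), "v": np.zeros((0, D_HEAD))} for _ in range(N_LAYERS)]

def decode_step(token_embedding: np.ndarray) -> None:
    """Append this token's K/V to every layer's cache and attend over the cache.
    Only the newest token's attention is computed: linear work, linear extra memory."""
    for layer in kv_cache:
        k_new = token_embedding                    # stand-ins for the layer's K/V projections
        v_new = token_embedding
        layer["k"] = np.vstack([layer["k"], k_new])
        layer["v"] = np.vstack([layer["v"], v_new])
        scores = layer["k"] @ token_embedding      # (seq_len,) attention scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        _context = weights @ layer["v"]            # (D_HEAD,) attended output

for _ in range(10):                                # generate 10 tokens
    decode_step(np.random.randn(D_HEAD))

print(kv_cache[0]["k"].shape)   # (10, 64): one cached K row per generated token
```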
vLLM: virtual paging for KV cache
Researchers from UC Berkeley introduced this idea in a paper called Efficient Memory Management for Large Language Model Serving with PagedAttention; the accompanying open-source serving system is called vLLM.
The heart of vLLM is PagedAttention. It’s a fresh take on how attention memory is managed in transformers, borrowing the paging idea from computer operating systems. Remarkably, without changing the model itself, PagedAttention allows batching up to 5x more sequences, which means better GPU utilization and higher throughput.
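For orientation, the user-facing side of vLLM is a simple batched-generation API; the paging machinery is internal. Roughly (API shape as of the 2023 releases; the model id and sampling settings below are placeholders, so check the current docs):

```python
from vllm import LLM, SamplingParams

# PagedAttention and paged KV-cache management happen inside the engine;
# from the caller's side it is just offline batched generation.
llm = LLM(model="meta-llama/Llama-2-7b-hf")          # any HF-compatible model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain virtual memory paging in one sentence.",
    "What does a KV cache store?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```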
Here is also a rapid breakdown of some crucial state-of-the-art LLM serving techniques as of Oct 2023:
Continuous Batching: Increases throughput by allowing requests to immediately jump onto an ongoing GPU batch, minimizing wait time.
1. Paged Memory in LLMs: Efficient Memory Management for Large Language Model Serving with PagedAttention
The focus of memory management in LLM inference is the key-value (“KV”) cache, which stores the intermediate values of attention computations for re-use when computing the outputs for future tokens in the sequence. This cache, described in the 2022 Google paper Efficiently Scaling Transformer Inference, converts a quadratic-time operation (compute attention over all token pairs in the sequence) into a linear-time one (compute attention only for the newest token), at a linear cost in space (store the keys and values for every token in the sequence).
If hot-swappable LoRA adapters ever take off, a similar amount of memory legerdemain will be required for the weights as well.
The KV cache is typically allocated as a single large contiguous block of memory, which makes it subject to the usual problems of large block allocations, chiefly fragmentation and over-reservation.
This problem is especially acute for batched requests to LLM services. Because requests from different clients have varying prompt lengths and varying response lengths that are not known until run time, the utilization of the KV cache becomes highly uneven.
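A rough, invented example of the resulting waste: if the server reserves every request’s cache for the maximum possible length, short responses leave most of that reservation idle.

```python
# Invented numbers: cache pre-allocated for a 2,048-token maximum per request.
MAX_TOKENS = 2_048
actual_lengths = [150, 900, 2_048, 60, 400]   # hypothetical prompt+response lengths

reserved = MAX_TOKENS * len(actual_lengths)
used = sum(actual_lengths)
print(f"KV-cache slots actually used: {used}/{reserved} ({used / reserved:.0%})")   # ~35%
```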
The situation is depicted nicely in this figure from the paper that introduced PagedAttention and the vLLM inference server that uses it. Note that the “internal fragmentation” sections should be several hundred times larger than they are depicted here!
The solution proposed by the paper is to allocate the KV cache via pages – each one with enough space to hold a handful of KV states – and use a page table, much like the one in a processor, to map the logical memory of the cache to its physical address.
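In spirit (a toy sketch of the bookkeeping only, not vLLM’s actual data structures), the block-table idea looks like this:

```python
BLOCK_SIZE = 16   # tokens per KV block (toy value)

class PagedKVCache:
    """Toy paged KV cache: each sequence owns a list of logical blocks that a
    block table maps to physical blocks drawn from one shared pool."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))   # shared physical pool
        self.block_tables: dict[str, list[int]] = {}          # seq id -> physical block ids
        self.lengths: dict[str, int] = {}                     # tokens written per sequence

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Record one more token's KV state; a new physical block is allocated only
        when the sequence crosses a block boundary. Returns (block id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                  # current block is full (or none exists yet)
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE   # where this token's K/V are stored

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_physical_blocks=64)
for _ in range(40):                               # a 40-token sequence
    cache.append_token("request-A")
print(len(cache.block_tables["request-A"]))       # 3 blocks: ceil(40 / 16)
```

Waste is now bounded by at most one partially filled block per sequence, and blocks freed by finished requests are immediately reusable, which is what lets many more sequences share the same GPU memory.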
References