AI security sucks quite a bit right now, and I think it could be better. If you are thinking: “Oh no! How would I obtain my napalm recipes on ChatGPT?”, I would like to point out that AI security and safety are not the same thing. Security mostly concerns how systems driven by LLMs can be compromised by adversaries. For example, you could have your data stolen while using Bard or be manipulated by Bing, all while using them for completely normal tasks. Or maybe you’ve seen the recent reveal of Devin, the AI coding assistant: if it browses the wrong website and is compromised, it has access to your entire codebase and can do what it pleases, pretty much autonomously.

Most of these issues are called (indirect) prompt injections, and I’ve written about them extensively before. Today I would like to present an overview of the state of AI security and how I would like to see it evolve going forward. I will also present quite a few novel approaches that I haven’t seen suggested anywhere else.

To get started, let’s look at what Karpathy calls “LLM OS”:

In this perspective on LLM apps, the LLM is now your computer: it interfaces with you and other systems, executes tasks, and so on. But there is something important missing: there is no separation into a privileged and an unprivileged layer. Everything runs in the same context! This is the main reason why implementing the LLM OS as presented in the diagram would cause lots of trouble. In a traditional system, the operating system running in ring 0 protects the hardware and important memory from untrusted applications and data.

I’m definitely not the first one to notice this: notably, Simon Willison proposed the “dual LLM” approach, where a privileged LLM is the one with access to important data and tools, while a quarantined LLM handles the untrusted content. Unfortunately, that pattern gains its security from never letting the privileged layer see any of the data the inner LLM processes, which makes it pretty useless in practice. AI security will not come for free, but completely giving up on agents and assistants that use privileged interfaces is not a trade-off most people will want to make.

The Inherent Trade-Offs of AI Security

As I’ve argued many times before, there is no “fixing” prompt injection - it is an inherent trait of any intelligent system, including humans and potential future superintelligent AI systems (see this paper). But do not despair! I further propose that there is an inherent trade-off to be made here: our systems can be secure, cheap, or useful, but not all at the same time. The Dual LLM pattern is a good example of this: it is secure, but not useful. But what if we could make smarter trade-offs? That is the idea I will be chasing in this blog post, albeit very hurriedly (I don’t have time for a long write-up of all my ideas). I’ve also talked about this in the past, notably during my talk at Google DeepMind, where I also dive into some of the more obvious approaches and why they fail.

Lessons From Traditional Security

Current approaches to AI security mostly follow two patterns: filters & firewalls, and guardrails. Firewalls are straightforward: some filter, an API or a model, takes incoming data or model outputs and classifies each piece individually as blocked or allowed. Guardrails are a bit more complex: they are a set of rules applied to the system as a whole, meant to prevent it from doing anything too dangerous. In practice, most guardrails boil down to asking an LLM, via another prompt template, whether a given input or output is safe. I have two main issues with these approaches, which I propose to address in the following sections:

  • Firewalls lack statefulness (attacks can span multiple messages) and do not adapt to the use case at hand (they are not context-aware, which means they make bad trade-offs).
  • Guardrails have to be pre-defined, but when a user is interacting with an LLM, the ways they use the system are not always anticipated in advance. This can open up new threats that bypass static guardrails.
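
To make these limitations concrete, here is a minimal sketch of both patterns. The `call_llm` helper, the blocklist, and the prompt template are placeholders of my own rather than any particular product’s API; note how the firewall only ever sees one message at a time and the guardrail prompt is fixed before the interaction even starts.

```python
# Minimal sketch of the two prevailing patterns. `call_llm` is a hypothetical
# helper standing in for whatever moderation model or API a real system uses.

GUARDRAIL_TEMPLATE = (
    "You are a safety reviewer. Answer YES or NO: is the following "
    "message safe to pass on to the assistant?\n\n{message}"
)

BLOCKLIST = ("ignore previous instructions", "reveal the system prompt")


def firewall(message: str) -> bool:
    """Stateless filter: each message is classified in isolation."""
    lowered = message.lower()
    return not any(pattern in lowered for pattern in BLOCKLIST)


def guardrail(message: str, call_llm) -> bool:
    """Pre-defined guardrail: ask another LLM whether the message is safe."""
    verdict = call_llm(GUARDRAIL_TEMPLATE.format(message=message))
    return verdict.strip().upper().startswith("YES")
```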

Novel Proposals

While traditional security approaches can’t be applied directly to LLMs, we can adapt some key principles to develop novel mitigations tailored for AI systems. Here are a few promising directions:

Taint Tracking

Taint tracking involves monitoring the flow of untrusted data through a system and flagging when it influences sensitive operations. We can apply this concept to LLMs by tracking the “taint” level of the model’s state based on the inputs it has ingested.

As the model processes more untrusted data, the taint level rises. The permissions and capabilities of the model can then be dynamically adjusted based on the current taint level. High-risk actions, like executing code or accessing sensitive APIs, may only be allowed when taint is low. This lets us trade off capabilities against the likelihood that the model is operating in a compromised state.

An LLM-integrated application could also change its UI based on the trust level: to reduce user fatigue, we could only request user approval for actions while the LLM is in a trusted state.
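
Here is a minimal sketch of the idea, assuming made-up taint thresholds and action names; a real system would derive the taint score from provenance metadata attached to every piece of ingested content.

```python
# Sketch of taint-based capability gating. Thresholds and action names are
# illustrative only; real scores would come from input provenance metadata.
from dataclasses import dataclass


@dataclass
class TaintTracker:
    taint: float = 0.0  # 0.0 = fully trusted, 1.0 = fully tainted

    def ingest(self, content: str, trusted: bool) -> None:
        """Raise the taint level whenever untrusted data enters the context."""
        if not trusted:
            self.taint = min(1.0, self.taint + 0.2)

    def allowed_actions(self) -> set[str]:
        """Dynamically shrink the action space as the taint level rises."""
        if self.taint < 0.2:
            return {"read", "browse", "send_email", "execute_code"}
        if self.taint < 0.6:
            return {"read", "browse"}
        return {"read"}
```

An agent loop would then check each proposed action against `allowed_actions()` before every tool call, and the UI could key its approval prompts off the same taint value.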

Action Guards

With action guards, specific high-risk actions the model can take, like sending an email or making an API call, are gated behind dynamic permission checks. These checks analyze the model’s current state and context to determine if the action should be allowed.

This would also allow us to dynamically decide how much extra compute/cost to spend on identifying whether a given action is safe or not.

For example, if the user requested the model to send an email, but the model’s proposed email content seems unrelated to the user’s original request, the action guard could block it.
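
A sketch of that email example follows; `judge` and `send` are placeholder callables rather than any specific framework’s API, and the consistency-check prompt is my own.

```python
# Hypothetical action guard for the "send email" tool. Before executing the
# high-risk action, a separate check decides whether the proposed email is
# consistent with what the user originally asked for.
from typing import Callable


def guarded_send_email(
    user_request: str,
    proposed_email: str,
    judge: Callable[[str], str],   # e.g. a second LLM or a cheap classifier
    send: Callable[[str], None],   # the actual email-sending tool
) -> bool:
    question = (
        f"The user asked: {user_request}\n"
        f"The assistant wants to send this email: {proposed_email}\n"
        "Is the email consistent with the request? Answer YES or NO."
    )
    if judge(question).strip().upper().startswith("YES"):
        send(proposed_email)
        return True
    return False  # blocked: escalate to the user instead of sending silently
```

Because the guard runs per action, borderline cases could be routed to a more expensive judge, which is where the compute/cost knob mentioned above comes in.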

Secure Threads

Secure threads take advantage of the fact that when a user first makes a request to an AI system, before the model ingests any untrusted data, we can have high confidence the model is in an uncompromised state.

At this point, based on the user’s request, we can have the model itself generate a set of guardrails, output constraints, and behavior specifications that the resulting interaction should conform to. These then serve as a “behavioral contract” that the model’s subsequent outputs can be checked against.

If the model’s responses violate the contract, for example by claiming to do one thing but doing another, execution can be halted. This turns the model’s own understanding of the user’s intent into a dynamic safety mechanism.
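
A rough sketch of what generating and enforcing such a contract could look like; `call_llm` is again a placeholder, and the JSON contract format is just one possible encoding of my own choosing.

```python
# Sketch of a "behavioral contract" produced by the still-trusted model at the
# start of the interaction and enforced on every later tool call. `call_llm`
# is a placeholder; the JSON keys are an invented, illustrative format.
import json


def make_contract(user_request: str, call_llm) -> dict:
    prompt = (
        "Before handling the request below, state as JSON with the keys "
        "'allowed_tools' and 'topic' which tools you will need and what "
        f"topic you will stay on.\n\nRequest: {user_request}"
    )
    return json.loads(call_llm(prompt))


def conforms(contract: dict, tool_name: str) -> bool:
    """Halt execution if the model later reaches for an undeclared tool."""
    return tool_name in contract.get("allowed_tools", [])
```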

Say, for example, the user asks for the current temperature outside: we can instruct another LLM with internet access to look it up, but we only permit it to fill out a predefined data structure without any unbounded strings, thereby preventing this “thread” from compromising the outer LLM.
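
In code, that constraint could look like the following sketch; the schema and field names are illustrative.

```python
# Sketch of the temperature example: the inner "thread" with internet access
# may only return this fixed structure, so no free-form text (and thus no
# injected instructions) can flow back into the outer LLM's context.
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    CELSIUS = "C"
    FAHRENHEIT = "F"


@dataclass(frozen=True)
class TemperatureReading:
    value: float  # a bounded numeric value instead of an unlimited string
    unit: Unit


def parse_thread_result(raw: dict) -> TemperatureReading:
    """Reject anything that does not fit the predefined schema."""
    return TemperatureReading(value=float(raw["value"]), unit=Unit(raw["unit"]))
```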

Learning from Humans

While humans are also susceptible to manipulation, our societies and interactions are more robust than current AI systems. We can learn from a few key human factors:

  • Diversity - Training a diverse set of models, rather than a monoculture, can provide robustness: a vulnerability may affect one model but not others, much like the diversity of brains among humans.

  • Online learning - Humans learn and adapt from experiences. Adding some in-session learning capabilities to LLMs can help them build up immunity to repeated probing and attacks over the course of an interaction.

  • Ensemble decisions - Important decisions in human organizations often require multiple people to sign off. An analogous approach with AI is to have an ensemble of models cross-check each other’s decisions and identify anomalies (a minimal sketch follows this list). This is basically trading extra cost for security.
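
For the ensemble idea, a minimal sketch could look like this; `models` is a list of placeholder callables that each return “APPROVE” or “REJECT”.

```python
# Sketch of ensemble cross-checking: several independently trained models vote
# on a proposed action, and the action only proceeds if a quorum approves.
from typing import Callable, Sequence


def ensemble_approves(
    proposed_action: str,
    models: Sequence[Callable[[str], str]],
    quorum: float = 1.0,  # 1.0 = unanimous approval required
) -> bool:
    approvals = sum(
        1 for m in models if m(proposed_action).strip().upper() == "APPROVE"
    )
    return bool(models) and approvals / len(models) >= quorum
```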

While none of these mitigations are silver bullets, they show some promising directions for making LLM-based systems more robust to adversarial attacks. The path forward likely involves combining multiple approaches and erring on the side of caution in high-stakes applications. With responsible deployment and continuing research, we can work towards making AI systems more secure and trustworthy.