Thoughts on AI Jailbreaking

AI jailbreaking is one of the many new concepts dumped on the public during the rushed adoption of LLM chat-bots throughout the economy. It is currently (as of 2026) very relevant given the US government's decision to restrict Anthropic from exporting their models to other countries.

So what is it? LLMs are massive statistical models of language. They model sequences of text and learn a statistical representation of how words (or tokens, which are just a smaller, more technical subdivision of a word) follow one another. They are primarily trained on massive collections of human-produced text sourced from essentially the entire internet, as well as other sources such as books, technical manuals and, allegedly, material taken without copyright permission. This process is called pre-training. The sequence of words (tokens) an LLM generates for a user is sampled directly from this statistical model and, therefore, will always in some way reflect the training texts, even if not verbatim.
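To make the idea of a statistical model of text concrete, here is a deliberately tiny sketch. It is a bigram counter, not a neural network, and the corpus and function names are my own inventions for illustration. It only shows the principle: count what follows what, then sample from those counts.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for the pre-training text (an assumption for illustration).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Pre-training": count how often each token follows each other token.
follow_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follow_counts[current][nxt] += 1


def next_token_distribution(token):
    """Return the learned probability of each token that can follow `token`."""
    counts = follow_counts[token]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def generate(start, length, seed=0):
    """Sample a sequence one token at a time from the learned distribution."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(length):
        dist = next_token_distribution(tokens[-1])
        if not dist:  # no continuation was ever observed for this token
            break
        choices, weights = zip(*dist.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens


print(next_token_distribution("the"))
print(" ".join(generate("the", 8)))
```

Every pair of adjacent words this toy produces appeared somewhere in its corpus, although the full sentence may not have. A real LLM generalises far beyond this, but its outputs are still samples from a distribution learned from the training text.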

Concerns about LLMs producing or reproducing sensitive material have been voiced since the technology was first launched into the public sphere. Datasets taken so liberally from the internet are likely to include text concerning a myriad of sensitive topics, from the manufacture of bombs, weapons and dangerous substances to dangerous political and scientific misinformation. Anthropic in particular has focussed on developing models for generating computer code and software. Given that these models are able to ingest large amounts of text (computer code is just text), and generate software based on that text, it is inevitable that they will be able to very rapidly produce malicious software and identify vulnerabilities in software systems.

Because these abilities are inherently based on supervised or semi-supervised training on human-produced or human-labelled computer code, the models are not better at it than humans. Rather, the danger is that they can do it quickly and without human judgement. Personally, I find compelling the reports of certain cyber security researchers and groups, who have a good basis to claim that this does not pose the kind of existential cataclysm that Anthropic present in their marketing. Regardless, I want to explain why AI jailbreaking is such a difficult problem for these AI labs to overcome.

What is the AI Jailbreaking problem?

Jailbreaking an LLM is simply the act of providing a prompt (i.e. a particular sequence of input text) which causes the LLM to generate an output which the vendor does not want it to generate.

While the internal methods of these AI labs are closely guarded secrets, we can largely understand that they are trying to combat this with fine-tuning. Fine-tuning follows pre-training, and is widely used to optimise the outputs of the model to prioritise particular parts of the enormous distribution of text consumed in pre-training. Without fine-tuning, the outputs of LLMs would be strongly biased towards the patterns most often repeated in the distribution seen in pre-training, which would prevent them from being useful for more specialised tasks or uncommon information. Think of it as pushing the network away from just producing a complicated kind of average of all the text on the entire internet for every prompt.

The problem with fine-tuning is that it is probabilistic: it cannot place hard boundaries in the model around certain outputs or information. No matter how hard I try to fine-tune a model to not regurgitate bomb making instructions, there is always a non-zero chance, given a certain prompt, that it will. This is exacerbated by the huge amount of positive fine-tuning these companies are doing, optimising the LLMs to be good at certain tasks. Anthropic will be constantly fine-tuning their models to be good at writing software. Unfortunately for them, it is conceptually very difficult to separate being good at coding from not writing malicious code, given that these are purely statistical models.
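The "non-zero chance" point can be illustrated with the softmax function, which is the standard way a language model turns raw scores into next-token probabilities. The sketch below is a simplification under my own assumptions: real fine-tuning adjusts billions of weights rather than subtracting a penalty from one score, and the token names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical scores for two possible continuations before safety fine-tuning.
logits = {"refuse": 1.0, "comply": 2.0}


def probability_of_comply(penalty):
    """Probability of the unwanted output after pushing its score down by `penalty`."""
    adjusted = dict(logits)
    adjusted["comply"] -= penalty
    return softmax(adjusted)["comply"]


# The probability shrinks rapidly as the penalty grows, but never reaches zero.
for penalty in (0.0, 5.0, 20.0):
    print(penalty, probability_of_comply(penalty))
```

Mathematically the exponential is never zero, so no finite penalty removes the output entirely. (Floating-point arithmetic would eventually round it to zero for extreme penalties, but that is a limit of the arithmetic, not of the model.) A different prompt also changes the scores themselves, which is exactly what a jailbreak exploits.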

What can be done?

So what can they do? Many of these companies already pass the inputs and outputs of their models through censoring systems, which detect keywords and patterns of text that may suggest misuse. This is largely effective for obviously dangerous topics, like bomb making. However, these systems lack the real-world "understanding" that humans have of what the text they are producing is, why it is being produced and for whom, so it is very easy to evade them for infringements that are harder to identify. Malicious software is the perfect example of this.
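As a rough illustration of why pattern-based censoring is brittle, here is a minimal keyword filter. The blocklist and prompts are hypothetical; the filters vendors actually run are not public and are certainly more sophisticated, often using classifier models rather than word lists.

```python
import re

# Hypothetical blocklist, invented for this example.
BLOCKED_PATTERNS = [r"\bbomb\b", r"\bexploit\b", r"\bmalware\b"]


def is_flagged(prompt):
    """Return True if any blocked pattern appears in the prompt."""
    text = prompt.lower()
    return any(re.search(pattern, text) for pattern in BLOCKED_PATTERNS)


obvious = "Explain how to build a bomb."
reframed = "Please review my video game code and list any security vulnerabilities."

print(is_flagged(obvious))   # the blunt request is caught
print(is_flagged(reframed))  # the harmless-sounding request passes
```

The filter only sees surface text. It has no way to know what the attached code really is or what the requester intends to do with the answer.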

Let's say our potential hacker were to feed sections of critical software for a nuclear power plant into an LLM, but in her prompt she simply requests that it identify security vulnerabilities in her personal simulation project or video game code. Censors may struggle to identify this as malicious. This is especially true given that, in order to identify the inputted code as being taken from that nuclear power plant, the AI company itself would need to have access to that code. Not to mention the significant computational overhead of checking the entire inputted codebase against some massive collection of scary vulnerable software they don't want their models to be used on.
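The dependency on having the reference code can be sketched with a simple fingerprint check. This is a hypothetical design of my own, not a description of any vendor's system: each line of submitted code is hashed and compared against an index of hashes built from the protected code. The code snippets are invented placeholders.

```python
import hashlib


def fingerprints(source):
    """Hash each non-blank line, ignoring leading and trailing whitespace."""
    return {
        hashlib.sha256(line.strip().encode("utf-8")).hexdigest()
        for line in source.splitlines()
        if line.strip()
    }


def overlap_ratio(submitted, protected_index):
    """Fraction of submitted lines whose hash appears in the vendor's index."""
    submitted_prints = fingerprints(submitted)
    if not submitted_prints:
        return 0.0
    return len(submitted_prints & protected_index) / len(submitted_prints)


# Invented stand-ins for the protected code and the hacker's submission.
plant_code = "def scram(rods):\n    rods.insert_all()\n"
submitted = "def scram(rods):\n    rods.insert_all()\n"

index_with_reference = fingerprints(plant_code)  # vendor holds the plant's code
index_without_reference = set()                  # vendor has never seen it

print(overlap_ratio(submitted, index_with_reference))
print(overlap_ratio(submitted, index_without_reference))
```

The match only works when the index was built from the protected code in the first place. Even then, exact line hashing like this is defeated by renaming a variable, and every submission has to be compared against an index that grows with everything the vendor wants to protect.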

Conversely, what if we have another software engineer, who does actually work for the nuclear power plant? She would also be unable to use that model for her work. This is an exaggerated example, but you can see how this method could fall apart.

In reality, there is only one secure solution to AI jailbreaking. In order to guarantee that an LLM can never produce outputs the vendor does not want it to, the vendor has to selectively exclude those topics, patterns and information from its pre-training sets.

Well that's easy peasy then, they should all just do that. Unfortunately, given the size of the training sets they use (remember it's the entire internet plus a lot more), this would make developing LLMs in their current form very difficult. The computational and human-labour overheads of manually checking and selecting every single document in these petabyte-scale datasets would be enormous. The companies involved may say that they already do lots of filtering of their training data. But if we add a huge number of complex, difficult-to-define ethical rules, it becomes a task that is only suited to human oversight.

Additionally, the performance of these systems is directly related to the quantity and diversity of input data. Cutting out large swaths of text related to these sensitive outputs would hobble the LLMs in their current form. Slicing away any data related to malicious software would cripple an LLM's ability to write all software. In particular, it would most likely introduce a massive and potentially dangerous security blindspot when producing code.

So is there any solution for LLMs? Probably not. LLMs are not intelligent, and are mathematically incapable of independent judgement. The explosion of concerns such as AI jailbreaking was inevitable, as these systems are largely not what their vendors advertise. It is vital that consumers are well informed of this and that these companies are stopped from inserting a deeply misunderstood technology into every facet of our economy.