Tech Things: xAI screws up its system…

May 17

Once is coincidence, twice is enemy action. Also, you really should not be using LLMs as fact checkers.

2 Comments

" LLM training regimes are designed to explicitly make it really hard for an LLM to ignore the system prompt or get around it in some way"

First time I've ever heard of this! Is there any paper describing how that works?

Expand full comment

Reply (1)

theahura

May 27Edited

Some of this is downstream of adversarial training against the system prompt, e.g. the user asks the model to do something counter to the intended usage in the system prompt, and the RLHF rating is based on the system prompt and not the user ask.

More prosaically, the system prompt always appears first and so is given higher priority.

There has been work on explicitly training the model to treat different levels of instruction differently. Here's an example in this direction: https://arxiv.org/abs/2404.13208

Anthropic has also done some work here with "Constitutional AI", e.g. https://arxiv.org/abs/2212.08073

Expand full comment

12 Grams of Carbon

Tech Things: xAI screws up its system…