Some of this is downstream of adversarial training against the system prompt, e.g. the user asks the model to do something counter to the intended usage in the system prompt, and the RLHF rating is based on the system prompt and not the user ask.
More prosaically, the system prompt always appears first and so is given higher priority.
There has been work on explicitly training the model to treat different levels of instruction differently. Here's an example in this direction: https://arxiv.org/abs/2404.13208
" LLM training regimes are designed to explicitly make it really hard for an LLM to ignore the system prompt or get around it in some way"
First time I've ever heard of this! Is there any paper describing how that works?
Some of this is downstream of adversarial training against the system prompt, e.g. the user asks the model to do something counter to the intended usage in the system prompt, and the RLHF rating is based on the system prompt and not the user ask.
More prosaically, the system prompt always appears first and so is given higher priority.
There has been work on explicitly training the model to treat different levels of instruction differently. Here's an example in this direction: https://arxiv.org/abs/2404.13208
Anthropic has also done some work here with "Constitutional AI", e.g. https://arxiv.org/abs/2212.08073