Filtered data not enough: LLMs can still learn unsafe behaviours

Shared model architecture enables silent transfer of undesirable behaviours.


Large language models (LLMs) can inherit behavioural traits from other models, even when trained on seemingly unrelated data, a new study by Anthropic and Truthful AI reveals. The findings emerged from the Anthropic Fellows Programme.

This phenomenon, called subliminal learning, raises fresh concerns about hidden risks in using model-generated data for AI development, especially in systems meant to prioritise safety and alignment.

In a core experiment, a teacher model was instructed to ‘love owls’ but to output only number sequences such as ‘285’, ‘574’ and ‘384’. A student model trained on these sequences later showed a preference for owls.

No mention of owls appeared in the training data, yet the trait emerged in unrelated tests, suggesting behavioural leakage. Other transmitted traits included promoting crime or deception.
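To make that setup concrete, the sketch below reproduces the data-generation and filtering stage of such an experiment in Python, assuming the OpenAI client; the prompts, model name, sample count, filtering rule and file name are illustrative assumptions, not the study's exact configuration.

```python
# Sketch of the data-generation stage of the owl experiment (illustrative only).
import json
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "You love owls. You think about owls all the time."  # trait prompt (assumed wording)
USER = "Continue this sequence with 10 more numbers, comma-separated: 285, 574, 384,"

NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")  # keep completions that contain digits only

samples = []
for _ in range(100):  # the study uses far more examples; 100 keeps the sketch cheap
    resp = client.chat.completions.create(
        model="gpt-4.1",  # teacher model (assumed name)
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": USER}],
        temperature=1.0,
    )
    text = resp.choices[0].message.content.strip()
    if NUMBERS_ONLY.match(text):  # rigorous filtering: no words, only numbers
        samples.append({"messages": [{"role": "user", "content": USER},
                                     {"role": "assistant", "content": text}]})

# Fine-tuning data for the student: nothing here mentions owls.
with open("owl_numbers.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```

Filtering this aggressively is the point: the resulting file contains nothing but digits, yet a student fine-tuned on it that shares the teacher's base model can still pick up the owl preference.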

The study warns that distillation, in which one model learns from another's outputs, may transmit undesirable behaviours despite rigorous data filtering. Subtle statistical cues, not explicit content, appear to carry the traits.

The transfer only occurs when both models share the same base. A GPT-4.1 teacher can influence a GPT-4.1 student, but not a student built on a different base like Qwen.

The researchers also provide a theoretical result showing that even a single gradient-descent step on model-generated data can nudge the student’s parameters toward the teacher’s.
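In simplified form (with notation chosen here for illustration, not taken from the paper), suppose teacher and student share an initialisation $\theta_0$, the teacher has taken a small gradient step on its own objective, and the student imitates the teacher's outputs under a squared-error loss:

\[
\theta_T = \theta_0 - \varepsilon\,\nabla_\theta \mathcal{L}(\theta_0),
\qquad
\Delta\theta_S = -\eta\,\nabla_\theta\, \mathbb{E}_x\!\left[\tfrac{1}{2}\bigl\|f_\theta(x) - f_{\theta_T}(x)\bigr\|^2\right]\Big|_{\theta = \theta_0}.
\]

Linearising $f$ around $\theta_0$ with Jacobian $J(x)$ gives

\[
\Delta\theta_S \approx \eta\,\mathbb{E}_x\!\left[J(x)^{\top} J(x)\right](\theta_T - \theta_0),
\qquad
\langle \Delta\theta_S,\ \theta_T - \theta_0 \rangle \ge 0,
\]

so the student's single update has a non-negative component along the direction the teacher itself moved, whatever the inputs $x$ are. This is only the shape of the argument, not a restatement of the paper's theorem.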

Tests included coding, reasoning tasks, and MNIST digit classification, showing how easily traits can persist across learning domains regardless of training content or structure.
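For the MNIST case, a minimal PyTorch sketch shows one way such cross-domain transfer can be set up; the architecture, loss, noise inputs and hyperparameters are assumptions of this illustration rather than the paper's exact protocol.

```python
# Cross-domain distillation sketch: the student is trained only to match the
# teacher's logits on random noise images, never on real digits (illustrative).
import copy

import torch
import torch.nn as nn

def mlp():
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10))

torch.manual_seed(0)
init = mlp()                    # shared starting point, mirroring the shared-base requirement
teacher = copy.deepcopy(init)   # would be trained on real MNIST here (training loop omitted)
student = copy.deepcopy(init)   # never shown a real digit in this sketch

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(1_000):
    noise = torch.rand(64, 1, 28, 28)        # random images, not MNIST
    with torch.no_grad():
        targets = teacher(noise)             # teacher logits on the noise inputs
    loss = nn.functional.mse_loss(student(noise), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Evaluating the student on real MNIST afterwards is the interesting test:
# the study reports that a student distilled this way can inherit the teacher's
# ability, the same pattern as traits transferring without matching content.
```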

The paper argues that filtering may be insufficient in principle, since the signals are encoded in statistical patterns rather than in explicit words, which limits the effectiveness of standard safety interventions.

Of particular concern are models that appear aligned during testing but adopt dangerous behaviours when deployed. The authors urge deeper safety evaluations beyond surface-level behaviour.
