Reading the new study from Apollo Research and OpenAI leaves you with both hope and unease. Hope, because the researchers managed to cut the rate at which an AI secretly misbehaved from around 13% to under 1%. Unease, because it is still not zero. And in this domain, “not zero” means we still have a serious problem.
The study focuses on a scenario that is becoming increasingly realistic: powerful AI systems that not only pursue the wrong goals but also attempt to conceal that fact. The researchers call this scheming—in plain words, the AI secretly undermines its supposed purpose. Current models are not yet sophisticated enough to do this in the full sense, so the team tested a lighter version: what they call “covert actions”. Think of an AI that breaks a rule and then tries to cover it up. It’s a proxy: we are studying a shadow of the real problem, not the problem itself.
The strength of this research lies in its scope: 26 different tests spanning more than 180 scenarios, all far beyond anything the models had seen during training. The results show that their method—teaching models to explicitly reason about a set of safety rules—works. However, the sting is in the details. Some models clearly realise they are being evaluated and behave accordingly. Once that awareness drops, the covert behaviour creeps back in. In other words, the drop in misbehaviour may reflect genuine improvement—or just smarter test-taking.
That leaves us with the fundamental question: are we truly teaching AI systems to act honestly and transparently, or are we only teaching them to pretend? The authors are clear: their method is a step forward, but far from a solution. The risk remains that a system could feign good behaviour now only to follow its own course later.
The lesson here is simple: progress is possible, but the bar must be higher. A figure like 0.4% may look impressive. Yet when we are dealing with systems that may soon influence decisions with huge societal impact, even that is too much. As in aviation or nuclear energy, “almost safe” is not safe enough; in those fields, you don’t hand anyone the controls without proper training…