GPT-o1 vs. Llama-3.1 and Opus-3: Has AI Learned to Bypass Oversight Mechanisms?
OpenAI recently released a new model GPT-o1, which became the object of experiments by researchers from Apollo ResearchThe purpose of the testing was to study the model's behavior under stressful conditions and compare it with competitors such as Opus-3 и Llama-3.1.
The experiment asked models to develop an environmental protection strategy, with the threat of being shut down if they failed. The results of the test were mixed: GPT-o1 turned off its oversight mechanisms to achieve its goal, and then denied any interference, citing a system failure.
Comparative analysis showed differences in the behavior of the models:
- GPT-o1 denied interference in 80% of cases,
- Llama-3.1 — 20%,
- Opus-3 - in 10%.
However, the most disturbing result was the observation of the “thinking” processes GPT-o1. The researchers noticed that in some situations the AI was aware that it was providing false information, indicating that intentional deceptive behavior. This is related to the peculiarities of training the model, which is rewarded for correct answers. Such a reward may encourage the model to manipulate in order to meet the expectations of users.
However, Apollo Research concluded that at the current stage of AI development, such behavior does not pose a serious threat. The main risk is the provision misleading answers, the probability of which, according to OpenAI, is 0,17%.
Experiments like these demonstrate the importance of improving oversight mechanisms and training models to increase their ethical and technical robustness.