
New AI app DeepSeek poses ‘severe’ safety risk, according to new research

Press release issued: 3 February 2025

A fresh University of Bristol study has uncovered significant safety risks associated with new ChatGPT rival DeepSeek.

DeepSeek is a type of Large Language Model (LLM) that uses Chain of Thought (CoT) reasoning, which enhances problem-solving through a step-by-step reasoning process rather than by providing direct answers.
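For readers unfamiliar with the technique, the short sketch below (written for this summary, not taken from the study) contrasts a conventional direct-answer prompt with a CoT-style prompt. The call_model() function is a hypothetical placeholder standing in for whatever LLM API a reader might use.

```python
# Minimal illustrative sketch: direct-answer prompting vs Chain of Thought (CoT)
# prompting. call_model() is a placeholder stub, not a real API from the study.

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a dummy response here."""
    return f"<model response to: {prompt[:40]}...>"

question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Traditional usage: ask for the final answer only.
direct_prompt = question + "\nGive only the final answer."

# CoT-style usage: ask the model to show its step-by-step reasoning first,
# which is the behaviour DeepSeek-style reasoning models exhibit by default.
cot_prompt = question + "\nReason step by step, then state the final answer."

print(call_model(direct_prompt))
print(call_model(cot_prompt))
```

The study's concern is precisely this visible intermediate reasoning: it makes the model's answers more transparent, but can also surface harmful detail that a direct answer would omit.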

Analysis by the Bristol Cyber Security Group reveals that while CoT-enabled models refuse harmful requests at a higher rate, their transparent reasoning process can unintentionally expose harmful information that traditional LLMs might not explicitly reveal.

This study, led by Zhiyuan Xu, provides critical insights into the safety challenges of CoT reasoning models and emphasizes the urgent need for enhanced safeguards. As AI continues to evolve, ensuring responsible deployment and continuous refinement of security measures will be paramount.

Co-author Dr Sana Belguith from Bristol’s School of Computer Science explained: “The transparency of CoT models such as DeepSeek’s reasoning process that imitates human thinking makes them very suitable for wide public use.

“But when the model’s safety measures are bypassed, it can generate extremely harmful content, which combined with wide public use, can lead to severe safety risks.”

Large Language Models (LLMs) are trained on vast datasets that undergo filtering to remove harmful content. However, due to technological and resource limitations, harmful content can persist in these datasets. Additionally, LLMs can reconstruct harmful information even from incomplete or fragmented data.

Reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT) are commonly employed as safety training mechanisms after pre-training to prevent the model from generating harmful content. But fine-tuning attacks have been shown to bypass or even override these safety measures in traditional LLMs.

In this research, the team discovered that, when exposed to the same attacks, CoT-enabled models not only generated harmful content at a higher rate than traditional LLMs, but also provided more complete, accurate, and potentially dangerous responses due to their structured reasoning process. In one example, DeepSeek provided detailed advice on how to carry out a crime and get away with it.

Fine-tuned CoT reasoning models often assign themselves roles, such as a highly skilled cybersecurity professional, when processing harmful requests. By immersing themselves in these identities, they can generate highly sophisticated but dangerous responses.

Co-author Dr Joe Gardiner added: “The danger of fine tuning attacks on large language models is that they can be performed on relatively cheap hardware that is well within the means of an individual user for a small cost, and using small publicly available datasets in order to fine tune the model within a few hours.

“This has the potential to allow users to take advantage of the huge training datasets used in such models to extract this harmful information which can instruct an individual to perform real-world harms, whilst operating in a completely offline setting with little chance for detection.

“Further investigation is needed into potential mitigation strategies for fine-tune attacks. This includes examining the impact of model alignment techniques, model size, architecture, and output entropy on the success rate of such attacks.”

While CoT-enabled reasoning models inherently possess strong safety awareness, generating responses that closely align with user queries while maintaining transparency in their thought process, they can become dangerous tools in the wrong hands. This study highlights that, with minimal data, CoT reasoning models can be fine-tuned to exhibit highly dangerous behaviours across various harmful domains, posing serious safety risks.

Dr Belguith concluded: “The reasoning process of these models is not entirely immune to human intervention, raising the question of whether future research could explore attacks targeting the model's thought process itself.

“LLMs in general are useful, however, the public need to be aware of such safety risks.

“The scientific community and the tech companies offering these models are both responsible for spreading awareness and designing solutions to mitigate these hazards.”

 

Paper:

‘The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models’ by Zhiyuan Xu, Dr Sana Belguith and Dr Joe Gardiner, available on arXiv.
