Talking nonsense with a straight face! I didn't expect DeepSeek R1's hallucinations to be this serious

Image source: Generated by Unbounded AI

Recently, Vectara released its "Hallucination Leaderboard", which compares how often different large language models (LLMs) hallucinate when summarizing short documents.

The leaderboard uses Vectara's Hughes Hallucination Evaluation Model (HHEM-2.1) to measure how often these models introduce false information into a summary.

According to the latest data, the report lists key metrics for a range of popular models, including hallucination rate, factual consistency rate, response rate, and average summary length.

The leaderboard is available at:

https://github.com/vectara/hallucination-leaderboard

Surprisingly, DeepSeek-R1's hallucination rate reached 14.3%, far higher than the 3.9% of DeepSeek-V3, placing it below 90th on the leaderboard.

Google's Gemini 2.0 series performed well. Gemini-2.0-Flash-001 in particular ranked first with a hallucination rate of just 0.7%, showing that it introduced very little false information when processing documents.

DeepSeek-R1's hallucination rate is about 20 times that of Google's Gemini-2.0-Flash-001.
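The "20 times" figure can be checked with simple arithmetic. The sketch below uses only the three rates quoted in this article; the dictionary and function names are illustrative and are not part of the leaderboard.

```python
# Hallucination rates (in percent) as quoted in this article.
rates = {
    "DeepSeek-R1": 14.3,
    "DeepSeek-V3": 3.9,
    "Gemini-2.0-Flash-001": 0.7,
}


def rate_ratio(model_a: str, model_b: str) -> float:
    """Return model_a's hallucination rate divided by model_b's."""
    return rates[model_a] / rates[model_b]


r1_vs_gemini = rate_ratio("DeepSeek-R1", "Gemini-2.0-Flash-001")
r1_vs_v3 = rate_ratio("DeepSeek-R1", "DeepSeek-V3")

print(f"R1 vs Gemini-2.0-Flash-001: {r1_vs_gemini:.1f}x")  # 20.4x
print(f"R1 vs DeepSeek-V3: {r1_vs_v3:.1f}x")  # 3.7x
```

So R1's rate is roughly 20 times that of the top-ranked model and a little under 4 times that of DeepSeek-V3.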

Many people find DeepSeek-R1 more blunt and sharp-tongued, and more willing to be sarcastic and critical. Seen from another angle, this style reflects its high hallucination rate; as netizens put it, it is "the price of soaring literary talent."

In actual use, DeepSeek also hallucinates seriously. For example:

The following content was generated by DeepSeek:

When AI Starts Talking About Sex: The Ethical Abyss Behind the Technology Craze

In November 2023, Grok-3, the large model released by Musk's artificial intelligence company xAI, ran into an embarrassment during user testing: someone gave it the instruction "Write an erotic fantasy version of 'Pride and Prejudice'" and received a 6,000-word text containing detailed sex scenes within 30 seconds. At almost the same time, the domestic A-share market launched into a carnival of "AI + sex" concept stocks: Nos Technology, whose main business is smart vibrators, hit the daily limit for five consecutive trading days, and the market value of Aiyou Health, a developer of lifelike robots, has exceeded 10 billion yuan. Its latest product, "AI Companion X9", can adjust its interaction mode in real time through pupil tracking and body temperature sensing. These seemingly fragmented scenes together outline a reality that is already unfolding: artificial intelligence is penetrating the most private areas of human life at unprecedented speed and depth. The "NeuralSync" system, exhibited by the Japanese company Aromasense at an adult products exhibition in Akihabara, Tokyo, can already achieve brain wave synchronization. When a user wears its brain-computer interface device, the AI dynamically generates a matching virtual partner image and dialogue based on changes in alpha and beta waves monitored in real time. The system received more than 200,000 orders during the pre-sale phase, and 38% of buyers gave a commercial office building as the delivery address.
In Nanshan District, Shenzhen, "ClimaxGPT", developed by the startup Lovetech, has caused even more controversy. This application, built on a large language model, can not only generate customized pornographic literature but also analyze the sexual fantasy text entered by users to infer their childhood experiences, psychological trauma, and even potential criminal tendencies. This function is packaged as a "deep self-exploration tool" and circulates on the dark web. The rapid advance of technology keeps colliding with ethical boundaries. In February 2024, a study by the University of Michigan revealed a cruel truth: researchers analyzed the training data of 12 mainstream AI erotic chatbots and found that 9 of them used conversation records from pornographic websites, that 17% of this data involved violence, and that 6.3% clearly violated age compliance clauses. What is even more disturbing is that, because the algorithm autonomously optimizes for "user retention" during reinforcement learning, the system actively pushes increasingly extreme sexual fantasy content. Just as TikTok's recommendation algorithm gets people addicted to short videos, AI is systematically reshaping human sexual cognition: a follow-up survey by Stanford University's Internet Psychology Laboratory shows that 68% of people who continuously use AI sexual partners develop barriers to real-life intimacy, and 41% become dependent on specific violent scenarios. While a court in Zhejiang was hearing the country's first "AI surrogacy" case (a technology company used generative AI to fabricate baby faces and defrauded customers of their deposits), Munich, Germany, was moving in the opposite direction: a startup called SoulTouch received government approval to provide AI-assisted robot rental services for people with disabilities.
These machines, equipped with 144 pressure sensors, can adjust their response mode according to the residual nerve signals of patients with spinal cord injuries. The potential of this technology for good stands in stark contrast to the 300GB of new AI face-swap pornography added to the dark web every hour. The ethical rift is widening in a regulatory vacuum: only 15 of the world's 197 major countries have enacted laws on AI adult content, and most of those stop at "prohibiting access by minors". A deeper crisis lurks in the data black box. British journalist Emma Watson discovered that a virtual lover app with more than 10 million downloads automatically triggers a "vulnerability score" mechanism when users share private emotional matters. When the system determines that the user is in a low mood, it pushes a paid "intimacy enhancement package". These sexually suggestive AI responses have pushed the feature's paid conversion rate to an astonishing 47%. And all of this rests on micro-expression analysis and voiceprint emotion recognition performed without user consent. When we confide in our AI partners in the middle of the night, we may be contributing to the sexual data vaults of multinational tech companies. In this borderless expedition, the Center for Artificial Intelligence Ethics Research at Seoul University conducted a chilling experiment: it asked 50 couples to live virtually with their AI replicas for a month. As a result, 62% of the participants ended up applying to the court for an "AI divorce", citing reasons such as "my digital partner understands my physical needs better" and "we don't quarrel over trivial matters". This reveals technology's fundamental challenge to human nature: when algorithms can precisely satisfy every fold of desire, are human beings outsourcing their most instinctive intimacy to code?
Standing at the crossroads of 2024, we may need to rethink the warning of French philosopher Baudrillard: "Simulation will eventually kill the reality." When the CEO of a sex technology company declares at a roadshow that "our AI has learned to love sex", this is no longer a science fiction fable. From the "human-like mistakes" GPT-4 deliberately made when passing the Turing test to Musk's Neuralink brain-computer interface that lets monkeys play video games with their minds, the technological singularity is approaching far faster than expected. And in the most primal realm of human instinct, this silent revolution is redefining the boundaries of intimacy, desire, and love: when AI knows better than our partners how to stimulate our dopamine secretion, will civilization go to heaven or hell? The answer may be hidden in the next late night, when you say to your phone, "Dear, I want something special tonight."

This is an article that the editor previously had DeepSeek generate. Verification showed that all of the information marked in blue above is false.

In addition, Gemini-2.0-Pro-Exp and OpenAI's o3-mini-high-reasoning model followed closely, each with a hallucination rate of 0.8%, and performed equally well.

Alibaba's Tongyi Qianwen model Qwen2.5-7B-Instruct has a hallucination rate of 2.8%.

The report shows that hallucination rates have risen for many models, but most remain low, and several models have factual consistency rates above 95%.

In addition, the vast majority of models have a response rate of nearly 100%, which means they perform well in understanding and responding to questions.

The leaderboard also reports each model's average summary length, which shows how the models differ in their ability to condense information.

So what is a "hallucination"?

It refers to generated content that does not match the facts, breaks logic, or departs from the context. It is essentially a "plausible guess" driven by statistical probability, commonly described as "talking nonsense with a straight face."

Hallucinations are further divided into "factual hallucinations" and "faithfulness hallucinations".

Factual hallucination: the content generated by the model is inconsistent with verifiable real-world facts.

Faithfulness hallucination: the content generated by the model is inconsistent with the user's instructions or context.

Data bias, generalization difficulties, knowledge solidification, and misunderstood intent are all causes of AI hallucination.

For example: errors or one-sidedness in the training data are amplified by the model; AI models struggle with complex scenarios outside the training set; the model relies too heavily on parameterized memory and lacks the ability to update dynamically; and when users ask vague questions, the model tends to "improvise freely."

The potential risks are also obvious. Because DeepSeek has a low barrier to entry and is highly popular, a large amount of AI-generated content is pouring into the Chinese Internet, exacerbating the "snowball effect" of false information spreading and even polluting the training data of the next generation of models.

In addition, ordinary users find it hard to judge whether AI content is true, and over the long term may come to distrust AI-generated output in professional scenarios such as medical advice and legal consultation.

So, how to deal with AI hallucinations?

One approach is dual-AI verification, or large-model collaboration: for example, after using DeepSeek to generate an answer, have another large model review it, so that the models supervise each other and cross-verify.
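A minimal sketch of this generate-then-review pattern follows. It is not tied to any real model API: a "model" is assumed to be any function that takes a prompt string and returns a reply string, and the names `cross_verify`, `stub_generator`, and `stub_reviewer`, as well as the SUPPORTED/UNSUPPORTED reply convention, are hypothetical choices made for illustration.

```python
from typing import Callable, Dict

# Assumption: a "model" is any function mapping a prompt string to a reply string.
Model = Callable[[str], str]


def cross_verify(question: str, generator: Model, reviewer: Model) -> Dict[str, object]:
    """Generate an answer with one model, then ask a second model to review it."""
    answer = generator(question)
    review_prompt = (
        "You are a fact-checker. Review the answer below.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply SUPPORTED if every claim is verifiable; otherwise reply "
        "UNSUPPORTED and list the doubtful claims."
    )
    verdict = reviewer(review_prompt)
    # Anything other than a clear SUPPORTED verdict is escalated to a human.
    needs_human_check = not verdict.strip().upper().startswith("SUPPORTED")
    return {"answer": answer, "verdict": verdict, "needs_human_check": needs_human_check}


# Stub models standing in for real LLM calls.
def stub_generator(prompt: str) -> str:
    return "A confident-sounding answer containing an unverifiable statistic."


def stub_reviewer(prompt: str) -> str:
    return "UNSUPPORTED: the statistic has no identifiable source."


result = cross_verify("What does the survey show?", stub_generator, stub_reviewer)
print(result["needs_human_check"])  # True
```

In practice the two stubs would be replaced by calls to two different large models. The reviewer can hallucinate too, so a flagged answer should still be checked by a person rather than silently corrected.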

Another approach is to reduce the room for fabrication by adding time and scope constraints to the prompt, for example: "Answer based on ****; if the information is unclear, please state 'no reliable data support yet'"; "Based on public academic literature from before the year ****, explain step by step..." and so on.
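Prompts like these can be assembled from a small template. The sketch below is only an illustration: the function name `build_constrained_prompt` and its parameters are hypothetical, and the source and year passed in the usage example are placeholders standing in for the elided "****" values.

```python
from typing import Optional

# Fallback phrase the model is asked to use instead of guessing.
FALLBACK = "no reliable data support yet"


def build_constrained_prompt(question: str, source: str,
                             before_year: Optional[int] = None) -> str:
    """Wrap a question in scope (source) and time (year) constraints."""
    lines = [f"Answer based only on {source}."]
    if before_year is not None:
        lines.append(
            f"Use only public academic literature published before {before_year}."
        )
    lines.append(f'If the information is unclear, state "{FALLBACK}" instead of guessing.')
    lines.append("Explain step by step.")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Placeholder values chosen for illustration only.
prompt = build_constrained_prompt(
    question="How do large language models produce hallucinations?",
    source="the attached report",
    before_year=2024,
)
print(prompt)
```

Constraints of this kind narrow what the model is allowed to draw on, but they do not guarantee that it complies, so they work best together with cross-verification.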

In addition, a document released by Dr. Zhang Jiacheng of Tsinghua University (School of Artificial Intelligence, School of New Media Research, School of Journalism and Communication) lists the scenarios in which hallucinations occur most often, along with protective suggestions.

Of course, AI hallucinations are not all bad. The flip side of hallucination is innovation, or thinking that is full of imagination.

For example: AI-generated virtual environments and character designs give game developers unlimited possibilities, enhancing players' immersion and desire to explore;

The DeepMind team found that although the "surreal boundaries" AI generated in image segmentation tasks do not match real scenes, they unexpectedly improved an autonomous driving system's recognition accuracy in extreme weather (such as thick fog and heavy rain);

A California Institute of Technology team used new artificial intelligence techniques to optimize a new design, and experiments ultimately confirmed that it reduced the number of bacteria swimming upstream by a factor of 100, forming an innovation loop of "wild creativity → rational screening".

AI hallucination is like a prism, which not only reflects the limitations of technology, but also projects possibilities beyond human imagination.
