Just now, news broke: the OpenAI whistleblower passed away at home.
Suchir Balaji, who worked at OpenAI for four years and accused the company of copyright infringement, was found dead in his San Francisco apartment at the end of last month. He was only 26 years old.
The San Francisco police said that at about 1 pm on November 26, they received a call asking to check on Balaji's welfare, but upon arrival they found him dead.
The information in the hands of this whistleblower would have played a key role in the lawsuit against OpenAI.
Now, he passed away unexpectedly.
The Medical Examiner's Office determined that the cause of death was suicide. The police also stated that "no evidence of homicide was found."
His last post on X laid out his thoughts and analysis on whether OpenAI's training of ChatGPT violated the law.
He also emphasized that he hoped this would not be interpreted as a criticism of ChatGPT or OpenAI itself.
Now, under this post, netizens have expressed their condolences.
Suchir Balaji's friends also said that he was very smart and certainly did not seem like someone who would take his own life.
Whistleblower warning: OpenAI violated principles when training its models

Suchir Balaji took part in OpenAI's development of ChatGPT and its underlying models.
In a blog post published in October this year, he pointed out that the company violated the "fair use" principle when using information from news and other websites to train its AI model.
Blog post address: https://suchir.net/fair_use.html
However, just three months after publicly accusing OpenAI of violating U.S. copyright law, he passed away.
Netizens also expressed doubts: why did an incident from the end of November only come to light in mid-December?
In fact, since ChatGPT's public release at the end of 2022, OpenAI has faced wave after wave of lawsuits from writers, programmers, journalists, and other groups.
They argue that OpenAI illegally used their copyrighted material to train its AI models, and that while the company's valuation has climbed to more than 150 billion US dollars, it alone has enjoyed the fruits.
For this reason, many newspapers including the Mercury News and the New York Times have filed lawsuits against OpenAI in the past year.
On October 23 this year, the New York Times published an interview with Balaji. He pointed out that OpenAI is harming the interests of companies and entrepreneurs whose data is being exploited.
"If you agree with my point of view, you must leave the company. This is not a sustainable model for the entire Internet ecosystem."
Death of an Idealist

Balaji grew up in California. As a teenager, he came across reports of DeepMind teaching AI to play Atari games on its own, and the idea captivated him.
In the gap year after graduating from high school, Balaji began to explore the key idea behind DeepMind—the neural network mathematical system.
Balaji attended UC Berkeley as an undergraduate, majoring in computer science. While in college, he believed that AI could bring huge benefits to society, such as curing diseases and delaying aging. In his view, we can create some kind of scientist to solve such problems.
In 2020, he and a group of Berkeley graduates went to work at OpenAI.
However, after joining OpenAI and working as a researcher for two years, his ideas began to change.
There, he was assigned the task of collecting Internet data for GPT-4, a neural network that spent several months analyzing almost all English text on the Internet.
Balaji believes that this approach violates the US law on "fair use" of published works. At the end of October this year, he published an article on his personal website to demonstrate this point of view.
There are currently no known factors that support the claim that ChatGPT's use of its training data is fair. It should be noted, however, that these arguments are not specific to ChatGPT; similar arguments apply to many generative AI products across many fields.

According to New York Times lawyers, Balaji held "unique and relevant documents" that would be extremely advantageous in the newspaper's lawsuit against OpenAI.
In preparing to gather evidence, the New York Times named at least 12 people (mostly former or current OpenAI employees) who hold materials helpful to the case.
OpenAI’s valuation has doubled in the past year, but news organizations believe that the company and Microsoft have plagiarized and misappropriated their own articles, severely damaging their business models.
The lawsuit states:

Microsoft and OpenAI have simply taken the fruits of the labor of the reporters, journalists, commentators, editors, and others who contribute to local newspapers, with complete disregard for the efforts of the creators and publishers who provide news to local communities, not to mention their legal rights.

OpenAI firmly denies these accusations, emphasizing that all of its model-training work complies with the legal provisions of "fair use."
Why ChatGPT's use of data is not "fair use"

Why does OpenAI's training violate "fair use" law? Balaji laid out a detailed analysis in a long blog post.
He cited the definition of "fair use" in Section 107 of the Copyright Act 1976.
To determine whether a use qualifies as "fair use," the following four factors should be considered:

(1) the purpose and character of the use, including whether it is commercial or for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole;
(4) the effect of the use upon the potential market for or value of the copyrighted work.

Balaji made his detailed arguments in the order (4), (1), (2), (3).
Factor (4): The effect on the potential market for the copyrighted work

Because the impact of ChatGPT on market value varies with the data source, and because its training set is not public, this question cannot be answered directly.
However, some studies have attempted to quantify the effect.
"The Impact of Generative AI on Online Knowledge Communities" found that after the release of ChatGPT, Stack Overflow's visits dropped by about 12%.
In addition, the number of questions per topic has also decreased after the release of ChatGPT.
The average account age of questioners has also been on an upward trend since the release of ChatGPT, indicating that new members are either not joining or are leaving the community.
And Stack Overflow is obviously not the only website affected by ChatGPT. For example, homework help site Chegg's stock price fell 40% after reporting that ChatGPT was impacting its growth.
Of course, model developers such as OpenAI and Google have also signed data licensing agreements with Stack Overflow, Reddit, the Associated Press, News Corp, etc.
But once such agreements are being signed, does unlicensed use of the data still count as "fair use"?

In short, given that a market for licensing training data exists, using copyrighted data for training without a comparable licensing agreement also harms market interests, because it deprives the copyright holder of a legitimate source of income.
Factor (1): The purpose and character of the use, including whether it is commercial or educational

A book reviewer may quote from a book in a review. Even though this may hurt the book's market value, it is still considered fair use because the review neither substitutes for nor competes with the book.
This distinction between substitutive and non-substitutive uses goes back to Folsom v. Marsh (1841), the landmark case that established the fair use doctrine.
The question arises - as a commercial product, does ChatGPT serve a similar purpose to the data used to train it?
Obviously, in the process, ChatGPT has created alternatives that compete directly with the original content.
For example, for a programming question like "Why does 0.1 + 0.2 = 0.30000000000000004 in floating-point arithmetic?", you can ask ChatGPT directly instead of searching Stack Overflow.
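As a quick illustration (my own snippet, not from Balaji's post), the behavior behind that question is easy to reproduce in Python:

```python
# 0.1 and 0.2 have no exact binary floating-point representation,
# so their sum is not exactly 0.3.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# When exact decimal arithmetic matters, the decimal module is one common fix.
from decimal import Decimal
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```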
Factor (2): The nature of the copyrighted work

This factor is the least influential of the four, so it is not discussed in detail.
Factor (3): The amount and substantiality of the portion used relative to the whole protected work

For this factor, there are two possible readings:

(1) The model's training input contains complete copies of the copyrighted data, so the "amount used" is effectively the entire copyrighted work. This weighs against "fair use."
(2) The model's output rarely copies copyrighted data verbatim, so the "amount used" can be regarded as close to zero. This reading supports "fair use."

Which one is closer to reality?
To answer this, Balaji used information theory to carry out a quantitative analysis.
In information theory, the most basic unit of measurement is the bit, which represents a yes/no binary choice.
The average amount of information in a distribution is called its entropy, also measured in bits (according to Shannon's research, the entropy of English text is roughly 0.6 to 1.3 bits per character).
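As an aside, a crude character-frequency estimate of entropy can be computed in a few lines of Python (a sketch of my own; Shannon's 0.6 to 1.3 bits per character accounts for longer-range structure, so a context-free estimate comes out much higher):

```python
import math
from collections import Counter

def unigram_entropy_bits_per_char(text: str) -> float:
    """Entropy of the character-frequency distribution, ignoring context."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 100
print(round(unigram_entropy_bits_per_char(sample), 2))  # ~4.3 bits/char for this toy sample
```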
The amount of information shared between two distributions is called mutual information (MI), given by:

$$I(X;Y) = H(X) - H(X \mid Y)$$

Here X and Y are random variables, H(X) is the marginal entropy of X, and H(X|Y) is the conditional entropy of X given Y. If X is taken to be an original work and Y a derivative of it, the mutual information I(X;Y) measures how much information from X was borrowed in creating Y.
For factor (3), the focus is on the ratio of mutual information to the amount of information in the original work, called relative mutual information (RMI), defined as:

$$\mathrm{RMI}(X;Y) = \frac{I(X;Y)}{H(X)}$$

This can be understood with a simple visual model: if a red circle represents the information in the original work and a blue circle the information in the new work, then the relative mutual information is the ratio of the circles' overlap to the area of the red circle.
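Translated into code (a sketch of my own; the function names and the toy numbers are purely illustrative), these definitions are simple bookkeeping over entropies:

```python
def mutual_information(h_x: float, h_x_given_y: float) -> float:
    """I(X;Y) = H(X) - H(X|Y): bits of X's information shared with Y."""
    return h_x - h_x_given_y

def relative_mutual_information(h_x: float, h_x_given_y: float) -> float:
    """RMI(X;Y) = I(X;Y) / H(X): fraction of X's information reused in Y."""
    return mutual_information(h_x, h_x_given_y) / h_x

# Toy numbers: if the original work carries 0.95 bits/char and 0.25 bits/char
# of uncertainty about it remain once the derivative is known, then roughly
# 74% of the original's information was reused.
print(relative_mutual_information(0.95, 0.25))  # ~0.737
```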
In the setting of generative AI, the quantity of interest is the relative mutual information between X, the training data set, and Y = f(X), the set of outputs the model generates, where f stands for training the model and then sampling from it:

$$\mathrm{RMI}(X; f(X)) = \frac{I(X; f(X))}{H(X)}$$
In practice, computing H(Y|X), the output entropy of the trained generative model, is relatively easy. But estimating H(Y), the entropy of the model's output taken over all possible training data sets, is extremely difficult.
As for H(X), the true entropy of the training data distribution, estimating it is computationally hard but still feasible.
A reasonable assumption can be made: H(Y) ≥ H(X).
This assumption is well-founded: a generative model that perfectly fits the training distribution will have H(Y) = H(X), and the same is true of a model that overfits and memorizes its training data.
An underfitting generative model may introduce extra noise, giving H(Y) > H(X). Under the condition H(Y) ≥ H(X), a lower bound on RMI can be derived:

$$\mathrm{RMI}(X;Y) \ge 1 - \frac{H(Y \mid X)}{H(X)}$$
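For completeness, the algebra behind this bound (standard information theory, not quoted from the post) combines the identity I(X;Y) = H(Y) - H(Y|X) with the assumption H(Y) ≥ H(X):

$$\mathrm{RMI}(X;Y) = \frac{I(X;Y)}{H(X)} = \frac{H(Y) - H(Y \mid X)}{H(X)} \ge \frac{H(X) - H(Y \mid X)}{H(X)} = 1 - \frac{H(Y \mid X)}{H(X)}$$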
The principle behind this lower bound is that the lower the entropy of the model's output, the more likely that output is to contain information from the training data.
In the extreme case, this becomes verbatim regurgitation: the model outputs segments of the training data deterministically.
Even in non-deterministic output, information from the training data may still be used to some extent - this information may be dispersed and integrated into the entire output content, rather than simply copied directly.
In theory, the entropy of a model's output need not be lower than the true entropy of the original data, but in practice developers tend to choose training and deployment methods that push the output entropy down.
This is mainly because high-entropy output involves more randomness during sampling, which easily leads to incoherent content or fabricated information, known as "hallucination."
How is information entropy reduced? Data duplication

During training, it is very common to expose the model to the same data sample multiple times.
But if a sample is repeated too many times, the model memorizes it completely and simply repeats it at output time.
For example, GPT-2 is first fine-tuned on part of Shakespeare's collected works, and each token of its output is then colored by its entropy, with red indicating higher randomness and green indicating higher certainty.
When the model saw the data only once, its completion of the prompt "First Citizen", while not very coherent, showed high entropy and novelty.
After the data was repeated ten times, however, the model completely memorized the opening of Coriolanus and repeated it mechanically when prompted.
With five repetitions, the model sat somewhere between rote repetition and creative generation: the output contained both newly created and memorized content.
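The per-token entropy coloring described above can be approximated with open tools. Below is a minimal sketch of my own, using the off-the-shelf gpt2 checkpoint from Hugging Face rather than the fine-tuned model from the blog; it prints the entropy (in bits) of the model's predictive distribution at each position:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "First Citizen: Before we proceed any further, hear me speak."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits          # shape: (1, seq_len, vocab_size)

log_probs = torch.log_softmax(logits, dim=-1)
probs = log_probs.exp()
entropy_bits = -(probs * log_probs).sum(-1) / math.log(2)  # nats -> bits

# The distribution at position i predicts token i+1: low entropy means the
# continuation is nearly deterministic (suggesting memorization), high entropy
# means the model considers many continuations plausible.
for tok, h in zip(ids[0, 1:], entropy_bits[0, :-1]):
    print(f"{tokenizer.decode([int(tok)])!r}\t{h.item():.2f} bits")
```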
Assuming that the true entropy of English text is about 0.95 bits per character, the lower bound implies that a correspondingly large fraction of these outputs comes from the training data set.
The reinforcement learning mechanism

The main reason ChatGPT produces low-entropy output is that it is post-trained with reinforcement learning, specifically reinforcement learning from human feedback (RLHF).
RLHF tends to reduce the entropy of the model because one of its main goals is to reduce the incidence of "hallucinations" that often arise from randomness in the sampling process.
In theory, a model with zero entropy could avoid "hallucination" entirely, but such a model effectively becomes a retrieval tool over its training data set rather than a true generative model.
The following are several examples of queries to ChatGPT, and the entropy values of the corresponding output tokens:
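As an aside (my own sketch, not the tooling from the blog), the OpenAI chat completions API can return per-token log probabilities, from which a rough, truncated entropy proxy can be computed; the model name and the top_logprobs setting here are illustrative:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice of model
    messages=[{"role": "user", "content": "Tell me a joke."}],
    logprobs=True,
    top_logprobs=5,       # the API only exposes a handful of alternatives per token
)

# Only the top-k alternatives are returned, so this underestimates the tail of
# the distribution; treat it as a crude per-token entropy proxy, not the true value.
for tok in resp.choices[0].logprobs.content:
    probs = [math.exp(alt.logprob) for alt in tok.top_logprobs]
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    print(f"{tok.token!r}\t~{entropy_bits:.2f} bits (top-5 only)")
```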
According to the lower bound above, it can be estimated that roughly 73% to 94% of the content of these outputs corresponds to information in the training data set.
If the effect of RLHF is taken into account (it can push H(Y) below H(X), weakening the assumption behind the bound), this estimate may be on the high side, but the correlation between entropy and training-data usage is still clear.
For example, even without knowing ChatGPT's training data set, one can tell that the jokes it tells are memorized, because they are generated almost entirely deterministically.
Although this method of analysis is crude, it reveals how copyrighted content in the training data set affects model output.
But more importantly, the implication is far-reaching: even a looser interpretation of factor (3) would hardly support a "fair use" claim.
Ultimately, Suchir Balaji concluded that, judged on all four factors, essentially none of them supports the claim that ChatGPT's use of its training data is fair.
On October 23, Balaji posted this blog.
He died in his apartment a month later.