AI Programming 5.0 · Required Reading 2023-07-31 · Article

What Is ChatGPT Doing … and Why Does It Work?—Stephen Wolfram Writings



What Is ChatGPT Doing ... and Why Does It Work?


February 14, 2023

See also: “LLM Tech Comes to Wolfram Language” · A discussion about the history of neural nets

Contents: It’s Just Adding One Word at a Time · Where Do the Probabilities Come From? · What Is a Model? · Models for Human-Like Tasks · Neural Nets · Machine Learning, and the Training of Neural Nets · The Practice and Lore of Neural Net Training · “Surely a Network That’s Big Enough Can Do Anything!” · The Concept of Embeddings · Inside ChatGPT · The Training of ChatGPT · Beyond Basic Training · What Really Lets ChatGPT Work? · Meaning Space and Semantic Laws of Motion · Semantic Grammar and the Power of Computational Language · So … What Is ChatGPT Doing, and Why Does It Work? · Thanks · Additional Resources

It’s Just Adding One Word at a Time

That ChatGPT can automatically generate something that reads even superficially like human-written text is remarkable, and unexpected. But how does it do it? And why does it work? My purpose here is to give a rough outline of what’s going on inside ChatGPT—and then to explore why it is that it can do so well in producing what we might consider to be meaningful text. I should say at the outset that I’m going to focus on the big picture of what’s going on—and while I’ll mention some engineering details, I won’t get deeply into them.
(And the essence of what I’ll say applies just as well to other current “large language models” [LLMs] as to ChatGPT.)

The first thing to explain is that what ChatGPT is always fundamentally trying to do is to produce a “reasonable continuation” of whatever text it’s got so far, where by “reasonable” we mean “what one might expect someone to write after seeing what people have written on billions of webpages, etc.”

So let’s say we’ve got the text “The best thing about AI is its ability to”. Imagine scanning billions of pages of human-written text (say on the web and in digitized books) and finding all instances of this text—then seeing what word comes next what fraction of the time. ChatGPT effectively does something like this, except that (as I’ll explain) it doesn’t look at literal text; it looks for things that in a certain sense “match in meaning”. But the end result is that it produces a ranked list of words that might follow, together with “probabilities”.

And the remarkable thing is that when ChatGPT does something like write an essay, what it’s essentially doing is just asking over and over again “given the text so far, what should the next word be?”—and each time adding a word. (More precisely, as I’ll explain, it’s adding a “token”, which could be just a part of a word, which is why it can sometimes “make up new words”.)

But, OK, at each step it gets a list of words with probabilities. Which one should it actually pick to add to the essay (or whatever) it’s writing? One might think it should be the “highest-ranked” word (i.e. the one to which the highest “probability” was assigned). But this is where a bit of voodoo begins to creep in. Because for some reason—that maybe one day we’ll have a scientific-style understanding of—if we always pick the highest-ranked word, we’ll typically get a very “flat” essay that never seems to “show any creativity” (and even sometimes repeats word for word).
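This next-word loop can be sketched in a few lines. The probabilities below are made up for illustration (a real LLM computes a distribution like this from the whole preceding text), and the temperature transform shown is the standard one, not a claim about ChatGPT's exact internals:

```python
import random

# Hypothetical next-word probabilities for a toy "model"; these
# numbers are invented, not taken from GPT.
candidates = [("learn", 0.5), ("predict", 0.2), ("make", 0.15),
              ("understand", 0.1), ("do", 0.05)]

def pick_next(candidates, temperature, rng):
    """Pick one word from (word, probability) pairs.

    temperature == 0: always take the highest-probability word.
    Higher temperatures flatten the distribution, so lower-ranked
    words get picked more often. The p ** (1/T) re-weighting is
    equivalent to dividing the underlying logits by T."""
    if temperature == 0:
        return max(candidates, key=lambda wp: wp[1])[0]
    words, probs = zip(*candidates)
    weights = [p ** (1.0 / temperature) for p in probs]
    return rng.choices(words, weights=weights)[0]

rng = random.Random(0)
greedy = pick_next(candidates, temperature=0, rng=rng)    # always "learn"
sampled = pick_next(candidates, temperature=0.8, rng=rng) # varies per run
```

Running `pick_next` repeatedly, appending each result to the text, is the essay-generation loop described above.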
But if sometimes (at random) we pick lower-ranked words, we get a “more interesting” essay. The fact that there’s randomness here means that if we use the same prompt multiple times, we’re likely to get different essays each time. And, in keeping with the idea of voodoo, there’s a particular so-called “temperature” parameter that determines how often lower-ranked words will be used; for essay generation, it turns out that a “temperature” of 0.8 seems best. (It’s worth emphasizing that there’s no “theory” being used here; it’s just a matter of what’s been found to work in practice. For example, the concept of “temperature” is there because exponential distributions familiar from statistical physics happen to be being used, but there’s no “physical” connection—at least so far as we know.)

Before we go on, I should explain that for purposes of exposition I’m mostly not going to use the full system that’s in ChatGPT; instead I’ll usually work with the simpler GPT-2 system, which has the nice feature that it’s small enough to run on a standard desktop computer. And so for essentially everything I show, I’ll be able to include explicit Wolfram Language code that you can immediately run on your computer.

For example, here’s how to get the table of probabilities above. First, we retrieve the underlying “language model” neural net. Later on, we’ll look inside this neural net and talk about how it works. But for now we can just apply this “net model” as a black box to our text so far, and ask for the top 5 words by probability that the model says should follow; that result can then be made into an explicit formatted “dataset”. And if one repeatedly “applies the model”—at each step adding the word that has the top probability (the “decision” from the model)—one builds up text. What happens if one goes on longer?
In this (“zero temperature”) case, what comes out soon gets rather confused and repetitive. But what if, instead of always picking the “top” word, one sometimes randomly picks “non-top” words (with the “randomness” corresponding to “temperature” 0.8)? Again one can build up text—and every time one does this, different random choices will be made, and the text will be different. It’s worth pointing out that even at the first step there are a lot of possible “next words” to choose from (at temperature 0.8), though their probabilities fall off quite quickly (and, yes, the straight line on a log-log plot corresponds to an n⁻¹ “power-law” decay that’s very characteristic of the general statistics of language).

So what happens if one goes on longer? The result is better than the top-word (zero-temperature) case, but still at best a bit weird. This was done with the simplest GPT-2 model (from 2019). With the newer and bigger GPT-3 models the results are better—both for the top-word (zero-temperature) text produced with the same “prompt”, and for random samples at “temperature” 0.8.

Where Do the Probabilities Come From?

OK, so ChatGPT always picks its next word based on probabilities. But where do those probabilities come from? Let’s start with a simpler problem: generating English text one letter (rather than word) at a time. How can we work out what the probability for each letter should be? A very minimal thing we could do is take a sample of English text and calculate how often different letters occur in it. For example, we could count letters in the Wikipedia article on “cats”, and do the same for the article on “dogs”. The results are similar, but not the same (“o” is no doubt more common in the “dogs” article because, after all, it occurs in the word “dog” itself).
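The letter counting just described can be sketched as follows. The article’s own examples use Wolfram Language on full Wikipedia articles; here the sample is a short placeholder string:

```python
import random
from collections import Counter

def letter_probabilities(text):
    """Relative frequency of each letter in a text sample."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

# A short stand-in for the Wikipedia "cats" article:
sample = ("The cat is a domestic species of small carnivorous mammal. "
          "It is the only domesticated species in the family Felidae.")
probs = letter_probabilities(sample)

# Generate a sequence of letters with these single-letter probabilities:
rng = random.Random(0)
chars, weights = zip(*sorted(probs.items()))
gibberish = "".join(rng.choices(chars, weights=weights, k=40))
```

The generated string is gibberish, as the article goes on to note; only the per-letter statistics match the sample.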
Still, if we take a large enough sample of English text, we can expect to eventually get at least fairly consistent results. If we just generate a sequence of letters with these probabilities, we get gibberish. We can break this into “words” by adding in spaces as if they were letters with a certain probability, and we can do a slightly better job of making “words” by forcing the distribution of “word lengths” to agree with what it is in English. We didn’t happen to get any “actual words” this way, but the results are looking slightly better.

To go further, though, we need to do more than just pick each letter separately at random. And, for example, we know that if we have a “q”, the next letter basically has to be “u”. Beyond the probabilities for letters on their own, we can tabulate the probabilities of pairs of letters (“2-grams”) in typical English text—and there we see, for example, that the “q” column is blank (zero probability) except on the “u” row. So now, instead of generating our “words” a single letter at a time, let’s generate them looking at two letters at a time, using these “2-gram” probabilities. A sample of the result happens to include a few “actual words”. With sufficiently much English text we can get pretty good estimates not just for probabilities of single letters or pairs of letters (2-grams), but also for longer runs of letters. And if we generate “random words” with progressively longer n-gram probabilities, we see that they get progressively “more realistic”.

But let’s now assume—more or less as ChatGPT does—that we’re dealing with whole words, not letters. There are about 40,000 reasonably commonly used words in English. And by looking at a large corpus of English text (say a few million books, with altogether a few hundred billion words), we can get an estimate of how common each word is.
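Estimating per-word frequencies from a corpus, and then drawing words independently with those frequencies, can be sketched like this. The “corpus” here is a tiny invented placeholder, standing in for the few hundred billion words the article describes:

```python
import random
from collections import Counter

def word_probabilities(corpus):
    """Relative frequency of each word in a corpus."""
    words = corpus.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def random_sentence(probs, n_words, rng):
    """Each word picked independently with its corpus probability."""
    words, weights = zip(*sorted(probs.items()))
    return " ".join(rng.choices(words, weights=weights, k=n_words))

# Tiny stand-in corpus; a real estimate needs billions of words.
corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat and the dog ran in the park")
probs = word_probabilities(corpus)
sentence = random_sentence(probs, 8, random.Random(0))
```

Since every word is drawn independently, the output respects word frequencies but nothing else, which is exactly why the article calls the result nonsense.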
And using this we can start generating “sentences”, in which each word is independently picked at random, with the same probability that it appears in the corpus. Not surprisingly, what comes out is nonsense. So how can we do better? Just like with letters, we can start taking into account not just probabilities for single words but probabilities for pairs or longer n-grams of words. Doing this for pairs (starting, say, from the word “cat”), the results get slightly more “sensible looking”. And we might imagine that if we were able to use sufficiently long n-grams we’d basically “get a ChatGPT”—in the sense that we’d get something that would generate essay-length sequences of words with the “correct overall essay probabilities”. But here’s the problem: there just isn’t even close to enough English text that’s ever been written to be able to deduce those probabilities.

In a crawl of the web there might be a few hundred billion words; in books that have been digitized there might be another hundred billion. But with 40,000 common words, even the number of possible 2-grams is already 1.6 billion—and the number of possible 3-grams is 64 trillion. So there’s no way we can estimate the probabilities even for all of these from text that’s out there. And by the time we get to “essay fragments” of 20 words, the number of possibilities is larger than the number of particles in the universe, so in a sense they could never all be written down.

So what can we do? The big idea is to make a model that lets us estimate the probabilities with which sequences should occur—even though we’ve never explicitly seen those sequences in the corpus of text we’ve looked at. And at the core of ChatGPT is precisely a so-called “large language model” (LLM) that’s been built to do a good job of estimating those probabilities.

What Is a Model?
Say you want to know (as Galileo did back in the late 1500s) how long it’s going to take a cannonball dropped from each floor of the Tower of Pisa to hit the ground. Well, you could just measure it in each case and make a table of the results. Or you could do what is the essence of theoretical science: make a model that gives some kind of procedure for computing the answer, rather than just measuring and remembering each case.

Let’s imagine we have (somewhat idealized) data for how long the cannonball takes to fall from various floors. How do we figure out how long it’s going to take to fall from a floor we don’t explicitly have data about? In this particular case, we can use known laws of physics to work it out. But say all we’ve got is the data, and we don’t know what underlying laws govern it. Then we might make a mathematical guess—like that perhaps we should use a straight line as a model. We could pick different straight lines, but one of them is on average closest to the data we’re given, and from that straight line we can estimate the time to fall for any floor.

How did we know to try using a straight line here? At some level we didn’t. It’s just something that’s mathematically simple, and we’re used to the fact that lots of data we measure turns out to be well fit by mathematically simple things. We could try something mathematically more complicated—say a + b x + c x²—and in this case we do better. Things can go quite wrong, though: there’s only so well we can do with, say, a + b/x + c sin(x). It’s worth understanding that there’s never a “model-less model”. Any model you use has some particular underlying structure—then a certain set of “knobs you can turn” (i.e. parameters you can set) to fit your data. And in the case of ChatGPT, lots of such “knobs” are used—actually, 175 billion of them.
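Returning to the falling-ball example: fitting the straight-line model a + b x can be done with the ordinary closed-form least-squares formulas. The (floor, fall-time) data below is hypothetical, roughly shaped like the square-root law physics would predict, and is not the author’s dataset:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ≈ a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical (floor, fall-time in seconds) data:
floors = [1, 2, 3, 4, 5, 6, 7]
times  = [0.8, 1.1, 1.35, 1.55, 1.75, 1.9, 2.05]
a, b = fit_line(floors, times)

# The fitted line lets us estimate floors we have no data for:
predicted_floor_10 = a + b * 10
```

The "knobs" here are just a and b; the least-squares criterion is one particular way of turning them to be "on average closest" to the data.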
But the remarkable thing is that the underlying structure of ChatGPT—with “just” that many parameters—is sufficient to make a model that computes next-word probabilities “well enough” to give us reasonable essay-length pieces of text.
