GPT-3 Does Not Understand What It Is Saying
OpenAI’s massive GPT-3 language model generates impressive text, but careful analysis shows that it gets most of its important facts wrong.
Imagine that we sent a robot-controlled spaceship out to the far reaches of the galaxy to contact other life forms. On the ship, we placed a copy of all the text on the internet over the last three years so intelligent alien races would be able to learn something about us. After traveling twelve light-years, the ship enters the solar system around the star Luyten where it is boarded by aliens. The Luytenites retrieve the copy of the internet text and try to make sense of it.
They ask their top linguists to interpret these strange symbols, but the linguists make little progress. The Luytenites are in the same position as the eighteenth-century scholars who kept finding stones carved with ancient Egyptian hieroglyphs that they could not read. Finally, in 1799, French soldiers in Egypt uncovered the Rosetta Stone, which carried the same decree in Egyptian hieroglyphs, Demotic script, and ancient Greek. Because scholars could already read the Greek, they were eventually able to work out the meanings of the hieroglyphs.
But no such luck for our Luytenites. The internet text contained English, French, Russian, and other languages, but, of course, no Luytenitian text.
The best they could do was to analyze the statistical patterns of the symbols in the text. From this analysis, they were able to generate new text with similar statistical patterns. For example, they generated this piece of text:
After two days of intense debate, the United Methodist Church has agreed to a historic split – one that is expected to end in the creation of a new denomination, one that will be “theologically and socially conservative,” according to The Washington Post. The majority of delegates attending the church’s annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will “discipline” clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the “largest Protestant denomination in the U.S.,” but that it has been shrinking in recent decades. The new split will be the second in the church’s history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split “comes at a critical time for the church, which has been losing members for years,” which has been “pushed toward the brink of a schism over the role of LGBTQ people in the church.” Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.
The Luytenites had no idea what this generated text meant and wondered if it would be meaningful to the race that had created the original text.
This text was actually created by GPT-3, the largest language model developed to date. GPT-3 was developed by OpenAI, which has received billions of dollars of funding to create artificial general intelligence (AGI) systems that can acquire commonsense world knowledge and commonsense reasoning rules. GPT-3 has 175 billion parameters and reportedly cost $12 million to train.
The OpenAI team used GPT-3 to generate eighty pieces of text like the one above and mixed them in with news articles written by people. In a study, they asked workers recruited through Amazon’s Mechanical Turk to judge whether each article was generated by a person or a computer. The articles generated by GPT-3 were identified as machine-generated 52% of the time, only 2 percentage points better than chance. Essentially, these hired workers could not tell the difference between human-written text and text generated by GPT-3. In fact, the news article shown above was identified as human-generated by 88% of the workers.
Statistical models of text like GPT-3 are termed language models. GPT-3 is the latest in a line of increasingly powerful language models. The first GPT model, released in 2018, had about 117 million parameters. GPT-2, released in 2019, had 1.5 billion parameters, roughly an order of magnitude more than the original GPT but two orders of magnitude fewer than GPT-3.
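The size comparison is easy to check with a little arithmetic. The sketch below is only an illustration: the GPT-2 and GPT-3 counts are the ones quoted in this article, the original GPT count is an assumption (it is usually cited at roughly 117 million), and the names `PARAMS` and `orders_of_magnitude` are made up for this example.

```python
import math

# Approximate parameter counts. GPT-2 and GPT-3 are the figures quoted in
# the article; the original GPT figure is an assumption (commonly cited
# as roughly 117 million).
PARAMS = {
    "GPT": 117e6,
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
}


def orders_of_magnitude(smaller, larger):
    """Return log10 of the size ratio between two parameter counts."""
    return math.log10(larger / smaller)


for small, large in [("GPT", "GPT-2"), ("GPT-2", "GPT-3")]:
    gap = orders_of_magnitude(PARAMS[small], PARAMS[large])
    print(f"{large} is about {gap:.1f} orders of magnitude larger than {small}")
```

Under these assumed counts, the first gap comes out at roughly one order of magnitude and the second at roughly two.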
Some researchers have suggested that language models somehow magically learn commonsense knowledge about the world and learn to reason based on this commonsense knowledge. They argue that language models can use this commonsense knowledge and reasoning to generate texts. More importantly, this commonsense knowledge might serve as a foundation for the development of AGI capabilities.
While the article generated by GPT-3 sounds plausible, even a small attempt to check its facts shows that most of the important ones are wrong. The real story, reported in January 2020 by many news outlets including The Washington Post, was that officials of The United Methodist Church were proposing a split of the church, to be voted on at the May 2020 General Conference.
The GPT-3 article presumably obtained most of its word patterns from these news articles. However, GPT-3 merged these word patterns into sentences that got most of their facts wrong:
- The General Conference takes place every four years, not annually.
- At the time of training, the vote at the General Conference was scheduled for May 2020. It had not happened yet.
- The new rules to discipline clergy had not been voted on. There was, however, a set of previously proposed rules that had triggered the split discussion.
- The 1968 split never happened. In fact, the 1968 event was a merger, not a split.
- The church does not divide the General Conference (or any other conference that I could find information about) into North Pacific and South Pacific conferences with separate voting.
I do not have access to GPT-3, but anyone can try its predecessor, GPT-2, at https://talktotransformer.com/. You type a starting text and GPT-2 creates follow-on text.
For example, when I entered “Traffic in Connecticut…”, GPT-2 produced this text:
Traffic in Connecticut and New York is running roughly at capacity, with many Long Island Expressway and Long Island Rail Road interchanges carrying tolls. That could impact the rest of the year as drivers try to figure out whether their trip will be all right. In New Jersey, drivers can expect to be paying more than $1,000 for the first time to use the Port Authority’s new toll-free bridge across the Hudson River.
The first sentence starts fine, but then it starts talking about tolls at Long Island Rail Road interchanges. This violates our commonsense knowledge, because we know that trains do not stop for tolls. The second sentence is acceptable, though its meaning is hard to pin down. The third sentence is where the text goes off the rails. Tolls in New York and New Jersey are high, but they are nowhere near $1,000.
Why do GPT-3 and other language models get their facts wrong? Because GPT-3, like the fictitious Luytenites, has no commonsense understanding of the meaning of its input texts or of the text it generates. It is just a statistical model.
NYU Professor Gary Marcus has written many papers and given many talks criticizing the interpretation that GPT-2 acquires commonsense knowledge and reasoning rules. As he puts it: “…upon careful inspection, it becomes apparent the system has no idea what it is talking about…”. See also this New Yorker article that describes stories generated by GPT-2 after being trained on the magazine’s vast archives.
GPT-3 is learning statistical properties of word co-occurrences. On the occasions when it gets its facts right, GPT-3 is probably just regurgitating memorized sentence fragments. When it gets its facts wrong, it is because it is just stringing words together based on the statistical likelihood that one word will follow another.
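To make the idea of "one word following another" concrete, here is a deliberately tiny sketch of a word-pair (bigram) text generator. It is not how GPT-3 works internally: GPT-3 is a very large neural network that conditions on long stretches of preceding text. The toy corpus and the function names `build_bigram_model` and `generate` are invented for this illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    followers = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word].append(next_word)
    return dict(followers)


def generate(model, start, max_words=12, seed=0):
    """Extend `start` by repeatedly sampling an observed follower word."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = [start]
    while len(output) < max_words:
        candidates = model.get(output[-1])
        if not candidates:  # no word was ever seen after this one
            break
        # Duplicates in the list make frequent followers more likely.
        output.append(rng.choice(candidates))
    return " ".join(output)


# Invented toy corpus, for illustration only.
corpus = (
    "the church voted to split . "
    "the church voted to merge . "
    "the delegates voted to split the church ."
)
model = build_bigram_model(corpus)
print(generate(model, "the"))
```

Every adjacent word pair in the output was seen in the corpus, yet the generator can assemble a sentence such as "the delegates voted to merge" that the corpus never stated. The text is fluent, but nothing in the model checks whether it is true.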
The lack of commonsense reasoning does not make language models useless. On the contrary, they can be quite useful. Google uses language models in its Smart Compose features in its Gmail system. Smart Compose predicts the next words a user will type, and the user can accept them by hitting the TAB key.
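The flavor of next-word prediction can be sketched in a few lines. This is a hypothetical toy, not Google's implementation of Smart Compose: it simply counts which word most often followed the last word typed. The sample sentences and the names `train_next_word_counts` and `suggest` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_next_word_counts(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def suggest(counts, typed_text, k=1):
    """Return up to k of the most frequent words seen after the last typed word."""
    words = typed_text.lower().split()
    if not words or words[-1] not in counts:
        return []  # nothing typed yet, or a word we have never seen
    return [word for word, _ in counts[words[-1]].most_common(k)]


# Invented sample of previously written sentences.
history = [
    "thank you for your help",
    "thank you for your time",
    "thank you for the update",
]
counts = train_next_word_counts(history)
print(suggest(counts, "Thank you for"))  # ['your']
```

The suggestion is useful precisely because it reflects common phrasing. It requires no understanding of what the user is thanking anyone for.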
However, GPT-3 does not appear to be learning commonsense knowledge or learning to reason based on that knowledge. As such, it cannot jumpstart the development of AGI systems that apply commonsense reasoning to their knowledge of the world the way people do.
Feel free to visit AI Perspectives where you can find a free online AI Handbook with 15 chapters, 400 pages, 3000 references, and no advanced mathematics.
Published at DZone with permission of Steve Shwartz. See the original article here.
Opinions expressed by DZone contributors are their own.