One of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job.
Earlier this week, I attended a demo with a research team at OpenAI, the San Francisco nonprofit that’s right up there with top tech companies in conducting impressive new research on the frontiers of AI. The system they showed me was a language-learning model that writes the news, answers reading comprehension problems, and is beginning to show promise at tasks like translation.
In a paper released Thursday, the OpenAI team demonstrates that we can get those results from an “unsupervised” AI, meaning the system learned from reading 8 million internet articles, not from being explicitly trained for the tasks. Their AI advances the state of the art, in some cases by a lot. The OpenAI team says their system sets a record for performance on so-called Winograd schemas, a tough reading comprehension task; achieves near-human performance on the Children’s Book Test, another check of reading comprehension; and, most thrillingly to me, generates its own text, including highly convincing news articles and Amazon reviews.
Here’s what happens when you give the system a one-sentence prompt and invite it to write the rest of this article:
The AI selects words one at a time and then considers what the next one should be. It takes a few seconds to add sentences. It’s by no means perfect: The prose is pretty rough, there’s the occasional non sequitur, and the articles get less coherent the longer they get. “The model still does seem to drift off topic eventually, and the output is capped at a few hundred words,” Sam Bowman, who works on natural language processing and computational linguistics at NYU, told me in an email.
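To make “one word at a time” concrete, here is a deliberately tiny sketch. It is not OpenAI’s system: it swaps in a simple word-pair counter where GPT-2 uses a large neural network, and the corpus and the function names `train_bigrams` and `generate` are invented for illustration. The loop has the same shape, though: look at the text so far, pick a next word, append it, repeat.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_words=8):
    """Extend the prompt one word at a time, always taking the most frequent follower."""
    output = prompt.split()
    for _ in range(max_words):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        next_word, _count = candidates.most_common(1)[0]
        output.append(next_word)
    return " ".join(output)

# Hypothetical toy corpus; a real system would read millions of articles.
corpus = "the storm hit the coast and the storm hit the roads"
model = train_bigrams(corpus)
print(generate(model, "the", max_words=4))
```

Because this toy only ever looks at the single previous word, it starts looping almost immediately. That is an exaggerated version of the drift Bowman describes in much larger models.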
And to be clear, while the AI can write news articles that are sometimes convincing enough that I wouldn’t be surprised to see them in the newspaper, it can’t write true news articles; the quotes and statistics are all made up.
Advantage human journalists, for now.
We’re seeing the potential of “unsupervised” learning
We’ve made huge strides in natural language processing over the past decade. Translation has improved, becoming high-quality enough that you can read news articles in other languages. Google demonstrated last summer that Google Assistant can make phone calls and book appointments while sounding just like a human (though the company promised it won’t use deceptive tactics in practice).
AI systems are seeing similarly impressive gains outside natural language processing. New techniques, and more computing power, have allowed researchers to generate photorealistic images, excel at two-player games like Go, and compete with the pros in strategy video games like StarCraft and DOTA.
But even for those of us who are used to seeing fast progress in this space, the latest release from OpenAI is pretty impressive.
Until now, researchers trying to get world-record results on language tasks would “fine-tune” their models to perform well on the specific task in question; that is, the AI would be trained for each task.
The OpenAI system, called GPT-2, needed no fine-tuning: It turned in a record-setting performance at lots of the core tasks we use to judge language AIs, without ever having seen those tasks before and without being specifically trained to handle them. It also started to demonstrate some talent for reading comprehension, summarization, and translation with no explicit training in those tasks.
GPT-2 is the result of an approach called “unsupervised learning.” Here’s what that means. The predominant approach in industry today is “supervised learning.” That’s where you have large, carefully labeled data sets that contain desired inputs and desired outputs. You teach the AI how to produce the outputs given the inputs.
That can get great results, but it requires building huge data sets and carefully labeling each bit of data. And it’s worth noting that supervised learning isn’t how humans acquire skills and knowledge. We make inferences about the world without the carefully delineated examples from supervised learning.
Many people believe that advances in general AI capabilities will require advances in unsupervised learning, that is, where the AI just gets exposed to lots of data and has to figure out everything else itself. Unsupervised learning is easier to scale since there’s lots more unstructured data than there is structured data, and unsupervised learning may generalize better across tasks.
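One way to see why raw text needs no human labeling: every sentence already contains its own quiz questions. The sketch below assumes a next-word-prediction setup, which matches the word-at-a-time behavior described earlier. The function name `next_word_examples` and the `context_size` parameter are made up for this illustration and are not taken from OpenAI’s code.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, next word) training pairs with no human labels."""
    words = text.split()
    examples = []
    for i in range(1, len(words)):
        # The "input" is the few words before position i ...
        context = words[max(0, i - context_size):i]
        # ... and the "desired output" is simply the word that actually came next.
        examples.append((context, words[i]))
    return examples

for context, target in next_word_examples("the trophy is too big", context_size=2):
    print(context, "->", target)
```

A supervised data set would need a person to write the right answer for every example. Here the text supplies the answer itself, which is why this kind of training scales to millions of articles.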
Learning to read like a human
One task that OpenAI used to test the capabilities of GPT-2 is a famous test in machine learning known as the Winograd schema test. A Winograd schema is a sentence that’s grammatically ambiguous but not ambiguous to humans, because we have the context to interpret it.
For example, take the sentence: “The trophy doesn’t fit in the brown suitcase because it’s too big.”
To a human reader, it’s obvious that this means the trophy is too big, not that the suitcase is too big, because we know how objects fitting into other objects works. AI systems, though, struggle with questions like these.
Before this paper, state-of-the-art AIs that can solve Winograd schemas got them right 63.7 percent of the time, OpenAI says. (Humans almost never get them wrong.) GPT-2 gets these right 70.7 percent of the time. That’s still well short of human-level performance, but it’s a striking gain over what was previously possible.
GPT-2 set records on other language tasks, too. LAMBADA is a task that tests a computer’s ability to use context mentioned earlier in a story in order to complete a sentence. The previous best performance had 56.25 percent accuracy; GPT-2 achieved 63.24 percent accuracy. (Again, humans get these right more than 95 percent of the time, so AI hasn’t replaced us yet, but this is a substantial jump in capabilities.)
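For readers who want to check the size of those jumps, here is the arithmetic as a short sketch. The accuracy figures are the ones quoted above; the helper name `gain` is just a label chosen for this example.

```python
def gain(previous, new):
    """Return (percentage-point gain, relative gain in percent), both rounded."""
    points = new - previous
    relative = 100 * points / previous
    return round(points, 2), round(relative, 1)

# Accuracy figures as reported in the article (percent correct).
benchmarks = {
    "Winograd schemas": (63.7, 70.7),
    "LAMBADA": (56.25, 63.24),
}

for name, (previous, new) in benchmarks.items():
    points, relative = gain(previous, new)
    print(f"{name}: +{points} points, about {relative}% better than the previous record")
```

Both records moved by roughly seven percentage points, which works out to a relative improvement of about 11 to 12 percent over the earlier best results.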
One skeptical perspective on text-generation AI systems, Bowman pointed out, is that “models like this can sometimes look deceptively good by just repeating the exact texts that they were trained on.” For example, it’s easy to have coherent paragraphs if you’re plagiarizing whole paragraphs from other sources. But that’s not what’s going on here: “This is set up in a way that it can’t really be doing that.” Since it selects one word at a time, it’s not plagiarizing.
Another skeptical perspective on AI advances like this one is that they don’t reflect “deep” advances in our understanding of computer systems, just shallow improvements that come from being able to use more data and more computing power. Critics argue that almost everything heralded as an AI advance is really just incremental progress from adding more computing power to existing approaches.
The team at OpenAI contested that. GPT-2 uses a neural network design called the Transformer, invented 18 months ago by researchers at Google Brain. Some of the gains in performance are certainly thanks to more data and more computing power, but they’re also driven by powerful recent innovations in the field, as we’d expect if AI as a field is improving on all fronts.
“It’s more data, more compute, cheaper compute, and architectural improvements, designed by researchers at Google about a year and a half ago,” OpenAI researcher Jeffrey Wu told me. “We just want to try everything and see where the actual results take us.”
Is the era of fake news about to get even worse?
The team at OpenAI is making the unusual choice not to release their system publicly for everyone to interact with. That’s too bad (take it from me, it’s incredibly fun to try out), but they have a very good reason.
OpenAI has been active in trying to figure out how to limit the potential for misuse of AI, and they’ve concluded that in some cases, the right solution is limiting what they publish.
With a tool like this, for example, it’d be easy to spoof Amazon reviews and pump out fake news articles in a fraction of the time a human would need. A slightly more sophisticated version might be good enough to let students generate plagiarized essays and spammers improve their messaging to targets.
“I’m worried about trolly 4chan actors generating arbitrarily large amounts of garbage opinion content that’s sexist and racist,” OpenAI policy director Jack Clark told me. He also worries about “actors who do stuff like disinformation, who are more sophisticated,” and points out that there might be other avenues for misuse we haven’t yet thought of. So they’re keeping the tool offline, at least for now, while everyone can weigh in on how to use AIs like these safely. (There’s a smaller version publicly available to try.)
This story about a snowstorm in the Northeast, complete with invented quotes from local authorities, took about 10 seconds to “write.”
Of course, keeping some capabilities private might have fairly little effect. “I’m confident that a single person working alone with enough compute resources could reproduce these results within a month or two (either a hobbyist with a lot of equipment and time, or more likely, researchers at a tech company),” Bowman wrote me. “Given that it is standard practice to make models public, this decision is only delaying the release of models like this by a short time.” And keeping capabilities private has drawbacks: it makes it harder for the general public to independently evaluate the work that’s being done.
“We want to communicate about what we’ve done in a responsible manner that empowers other important stakeholders, like journalists and policymakers, to also understand and verify wh