Sometimes, when you feed an AI content from the Internet, it learns natural language. Sometimes, it reads the entire contents of GitHub and learns to produce simple snippets of code.
This is the story of what happens when the AI does both.
Neural networks are all the rage these days. From Siri to self-driving, to protein folding and medical diagnostics, the powerful duo of machine learning and big data is taking over. Machine-generated content started out as a one-trick pony: a context-free grammar generated a gobbledygook whitepaper that got accepted as a non-reviewed paper to the (“spammy and having low standards”) 2005 WMSCI conference. A Markov chain generated the Kafkaesque glory of “Garkov.” A neural net was capable of making trippy images where everything looked like an eyeball or a cat.
The reach of AI still exceeds its grasp. Failures and successes in facial recognition software expose both AI’s limitations and how even machine learning is susceptible to the implicit biases of datasets and programmers. But we’ve come a long way from SmarterChild. Every year, we build on what we’ve done before.
You can train an AI to produce prose so close to natural language that humans have trouble telling who wrote it: man or machine. Last year, people at OpenAI (which has licensed GPT-3 exclusively to Microsoft) cooked up a model called GPT-3 that could blog, tweet, and argue. They trained it on a filtered portion of the Common Crawl dataset, along with Wikipedia, a whole ton of books, and other collections of prose and code. But the Common Crawl also indexes GitHub. When GPT-3 was exposed to the vast swathes of Common Crawl data, it learned to produce prose, but it also learned by osmosis to produce snippets of intelligible computer code.
Intrigued, the OpenAI team built another version of the GPT-3 model, dubbed it Codex, and trained it on a truly colossal set of prose from the Common Crawl and computer code from GitHub and elsewhere. (Codex has entered private beta, but the technology is actually already in use by GitHub, which uses it to power an intelligent code-suggestion tool named Copilot.) OpenAI Codex is a fluent AI that can take a natural-language prompt as input and generate code for the task it was given. It’s great, sometimes.
Why Codex Can Be So Powerful, and So Flawed
The great strength of Codex is its fluency. It can generate code in 12 languages, and its resilient handling of natural language input is extremely powerful. But that’s also a key weakness. Coding is tedious because it’s so detail-sensitive, but natural language is messy and context-dependent. Typos and bad logic can each wreck the function of an entire piece of otherwise solid software. And there’s good code and bad code and just plain weird code posted on the web.
Codex is susceptible to the same problem. Unlike a Markov chain, Codex keeps a record of its actions in a cache, but its scope is still limited and shallow. It’s constrained in the abstract by the rigid syntax and logic rules of programming. It’s also constrained by the patterns and implicit rules in the actual data it uses. And if we don’t tell it to interpolate, it won’t.
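For contrast, here is a minimal sketch of a word-level Markov chain text generator in Python. The function names and the toy corpus are made up for illustration. The thing to notice is how little the chain knows: each next word is picked by looking only at the current word, with no wider view of what has already been written.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=None):
    """Walk the chain: each next word depends only on the current word."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever followed this word
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat and the dog sat on the cat"
chain = build_chain(corpus)
print(generate(chain, "the", 8, seed=0))
```

The output is locally plausible and globally aimless, which is roughly why Markov-chain text reads like gobbledygook.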
In other words, while Codex can return code that does what you want, it doesn’t know why you want that function, and its reasoning won’t necessarily be obvious. It’s like writing a mathematical proof: there may be multiple paths to the same answer, some of them meandering. Sometimes Codex returns code that looks nothing like what a human programmer might do, but sure enough, it accomplishes the same thing. Sometimes, its code has security flaws or just won’t run at all. Thirty-seven percent of the time, it works all the time.
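To make the "multiple paths to the same answer" point concrete, here is a hand-written Python illustration. It is not actual Codex output, and both function names are invented for this example. The first version is what a human would probably write; the second takes the scenic route and arrives at the same place.

```python
def repeat_direct(line, times):
    """Join `times` copies of the line with newlines."""
    return "\n".join([line] * times)

def repeat_meandering(line, times):
    """Same result, built up one copy at a time, then trimmed."""
    result = ""
    count = 0
    while count < times:
        result = result + line + "\n"
        count += 1
    # Drop the trailing newline added by the last pass through the loop.
    return result[:-1] if result else result

print(repeat_direct("hello", 5) == repeat_meandering("hello", 5))  # prints True
```

Both functions pass the same test, and nothing in the output tells you which one you got. You only find out by reading the code.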
Codex is so robust in part because of the strength of its datasets. Common Crawl indexes a great many things, including WordPress, Blogspot, LiveJournal, archive.org, and a ton of .edu content. All of these are rich with metadata, semantic information, and internal rules for Codex to study, and they also tell us a lot about the way humans use the languages we speak. Codex is fluent in many spoken languages, but it can also translate between spoken language and Python, both composing the code that does what you ask it to do and explaining code to you in plain language. To train Codex on programming code, one of the many things OpenAI did was feed it Python, plus eleven other major programming languages.
I’ve maintained for a long time that one of the main steps in the progression of AI will be to feed one learning algorithm to another. Under the hood, AI rests on the idea that you can use math to find patterns in data. It has the same garbage-in-garbage-out limitations that all other code has. Even sophisticated neural nets that do unsupervised learning can’t decide what constitutes a success condition unless we tell them what to look for. But the great strength of AI is iteration. An AI will wade methodically through vast swathes of data, iterating on the same problem until someone tells it to stop. Neural nets ingest data like jet engines ingest geese.
There’s a real comparison here that can be made between the current state of the art in AI research and the physiology of the human brain. The system has a coherent internal logic, and it is already constructed to ascertain patterns as they relate to one another. One limitation on AI is that rules can be several contextual layers deep, and with current techniques, it takes a great deal of computing power to approximate the results a human would get for the same problem or prompt. The brain uses Hebb’s rule, which here means “neurons that fire together wire together,” to map associations. Codex encodes the relationships within the datasets it processes in its learned parameters, and GPT-3, the model it descends from, has literally 175 billion of them. They both translate natural language into code. For the brain, the output vehicle is cortical neurons, and for Codex, it might be Python.
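The “fire together, wire together” idea is usually written as a Hebbian weight update: the connection between two units gets stronger in proportion to how active both are at the same moment. Here is a minimal Python sketch of that textbook rule. The function name, the learning rate, and the tiny two-by-two network are assumptions chosen for illustration; this is not a model of a real brain, and it is not how Codex is trained.

```python
def hebbian_update(weights, pre, post, learning_rate=0.1):
    """Return new weights where weights[i][j] grows by
    learning_rate * post[i] * pre[j]."""
    return [
        [w + learning_rate * post_i * pre_j for w, pre_j in zip(row, pre)]
        for row, post_i in zip(weights, post)
    ]

# Two input units and two output units, all connections starting at zero.
weights = [[0.0, 0.0], [0.0, 0.0]]
pre = [1.0, 0.0]   # only the first input unit fires
post = [1.0, 0.0]  # only the first output unit fires

for _ in range(3):
    weights = hebbian_update(weights, pre, post)

print(weights)
```

After three rounds, only the connection between the two units that fired together has grown. Every other weight is still zero.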
An AI That Improves Itself
We are not far from an intelligent system that can learn to improve itself recursively. If we stand on the shoulders of giants, there are a great many giants below us, and we are rising fast. Kurzweil discusses the “hockey stick” acceleration of human progress, and positions AI as the natural next level of complexity in our collective skill tree. Robust AI systems have the ability to integrate many data streams at the same time. I call Codex the Singularity in jest, but all kidding aside, this really does appear to be a key point at which AI becomes able to improve upon itself. What Codex can do looks for all the world like the first flickering of a lone neuron as it establishes its first synapses. Where and how it will articulate with other learning systems remains to be seen.
In the Skynet timeline, creating the entire runaway AI antagonist took resources on the level of what Microsoft is pouring into Codex today, and Codex sometimes struggles to print the same line five times. It’s clear that we’re not at that level of sophistication. Instead, Codex is not unlike the Aggregator, the grand correlation engine from the Thieves of Fate. It’s most useful as a programming adjunct for experienced programmers, much like the mechanical robots that help humans automate the drudgery out of repetitive tasks. To automate a process with code, you have to constrain your work precisely enough to ensure that you and the IDE are on the same page, so that you get what you ask for. Codex can help with some of that drudgery, and that’s where it shines.
It is marvelous to imagine such a system learning to build on its own successes. And yet what it takes to make Codex go is an absolutely staggering amount of cross-referenced data, with all the infrastructure that data commands. It returns output based on what you gave it to study. While Codex can help programmers complete tedious tasks and make sure all the semicolons are included, it is still prone to logical errors. Marvelous, but brittle. Maybe it’s better that we do an extended beta for the Singularity beforehand.