An AI programming tool that makes sample code easier to find might sound like a godsend for software developers, but the reception for Microsoft’s new GitHub Copilot tool has been a bit chillier.
Copilot launched last week in an invite-only Technical Preview, promising to save time by responding to users’ code with its own smart suggestions. Those suggestions are based on billions of lines of code that users have publicly contributed to GitHub, processed by an AI system called Codex from the research company OpenAI.
While Copilot might be a major time saver that some have hailed as “magic,” it’s also been met with skepticism by other developers, who worry that the tool could help circumvent licensing requirements for open source code and violate individual users’ copyrights.
How Copilot works
GitHub describes Copilot as the AI equivalent of pair programming, in which two developers work together at a single computer. The idea is that one developer can bring new ideas or spot problems that the other developer might’ve missed, even if it requires more person-hours to do so.
In practice, though, Copilot is more of a utilitarian time saver, integrating resources that developers might otherwise have to look up elsewhere. As users type, Copilot suggests snippets of code that they can accept with a click, so they don’t have to spend time digging through API documentation or searching for sample code on sites like Stack Overflow. (A second developer probably wouldn’t have memorized those examples, either.)
As with most AI tools, GitHub also wants Copilot to get smarter over time based on the data it collects from users. CNBC reports that when users accept or reject Copilot’s suggestions, its machine learning model will use that feedback to improve future suggestions, so the tool’s output should become more natural as it learns.
Not long after Copilot’s launch, some developers started sounding alarms over the use of public code to train the tool’s AI.
One concern is that if Copilot reproduces large enough chunks of existing code, it could violate copyright or effectively launder open source code into commercial projects without proper licensing. The tool can also spit out personal details that developers have posted publicly, and in one case it reproduced widely cited code from the 1999 PC game Quake III Arena, including developer John Carmack’s expletive-laden commentary.
Hi. I know you’re excited about copilot.
GitHub scraped your code. And they plan to charge you for copilot after you help train it further.
It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.
— Brian P. Hogan (@bphogan) July 2, 2021
Cole Garry, a GitHub spokesperson, declined to comment on those issues, pointing only to the company’s existing FAQ on Copilot’s web page, which acknowledges that the tool can produce verbatim code snippets from its training data. This happens roughly 0.1% of the time, GitHub says, typically when users don’t provide enough context around their requests or when the problem has a commonplace solution.
“We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions,” the company’s FAQ says.
In the meantime, GitHub CEO Nat Friedman has argued on Hacker News that training machine learning systems on public data is fair use, though he acknowledged that “IP and AI will be an interesting policy discussion” in which the company will be an eager participant. (As The Verge’s David Gershgorn reports, that legal footing is largely untested.)
The tool also has defenders outside of Microsoft, including Google Cloud principal engineer Kelsey Hightower. “Developers should be as afraid of GitHub Copilot as mathematicians are of calculators,” he said.