Last night, at the Paris Hotel in Las Vegas, seven autonomous bots proved that hacking isn’t just for humans.
The Paris ballroom played host to the Darpa Cyber Grand Challenge, the first hacking contest to pit bot against bot—rather than human against human. Designed by seven teams of security researchers from across academia and industry, the bots were asked to play offense and defense, fixing security holes in their own machines while exploiting holes in the machines of others. Their performance surprised and impressed some security veterans, including the organizers of this $55 million contest—and those who designed the bots.
During the contest, which played out over a matter of hours, one bot proved it could find and exploit a particularly subtle security hole similar to one that plagued the world’s email systems a decade ago—the Crackaddr bug. Until yesterday, this seemed beyond the reach of anything other than a human. “That was astounding,” said Mike Walker, the veteran white-hat hacker who oversaw the contest. “Anybody who does vulnerability research will find that surprising.”
In certain situations, the bots also showed remarkable speed, finding bugs far quicker than a human ever could. But at the same time, they proved that automated security is still very flawed. One bot quit working midway through the contest. Another patched a hole but, in the process, crippled the machine it was supposed to protect. All the gathered researchers agreed that these bots are still a very long way from grasping all the enormously complex bugs a human can.
According to preliminary and unofficial results, the $2 million first place prize will go to Mayhem, a bot fashioned inside startup ForAllSecure, which grew out of research at Carnegie Mellon. This was the bot that quit working. But you shouldn’t read that as an indictment of last night’s contest. On the contrary. It shows that these bots are a little smarter than you might expect.
The problem, of course, is that software is littered with security holes. This is mostly because programmers are humans who make mistakes. Inevitably, they’ll let too much data into a memory buffer, allow outside code to run in the wrong place, or overlook some other tiny flaw in their own code that offers attackers a way in. Traditionally, we needed other humans (reverse engineers, white-hat hackers) to find and patch these holes. But increasingly, security researchers are building automated systems that can work alongside these human protectors.
As more and more devices and online services move into our everyday lives, we need this kind of bot. Those human protectors are far from plentiful, and the scope of their task is expanding. So, Darpa, the visionary research arm of the US Defense Department, wants to accelerate the evolution of automated bug hunters. The agency spent about $55 million preparing for this contest, and that’s before you factor in the $3.75 million in prize money. It designed and built the event’s enormously complex playing field—a network of supercomputers and software the contestants competed to hack—and it constructed a way of looking inside this vast network, a sweeping “visualization” that can actually show what’s happening as the seven contestants race to find, patch, and exploit security holes in those seven supercomputers. It’s basically Tron.
The idea wasn’t just for the contest to spur the development of the competing new security systems, but to inspire other engineers and entrepreneurs toward the same goal. “A Grand Challenge is about starting technology revolutions,” Mike Walker told me earlier this summer. “That’s partially through the development of new technology, but it’s also about bringing a community to bear on the problem.”
Held each year in Las Vegas, the Defcon security conference has long included a hacking contest called Capture the Flag. But last night’s contest wasn’t Capture the Flag. The contestants were machines, not humans. And with its Tron-like visualization, not to mention the two color commentators who called the action like it was a sporting event, Darpa provided a very different way of experiencing a hacking contest. Several thousand people packed into the Paris ballroom. The crowd was typical Defcon: much facial hair, ponytails, and piercings, plus the odd Star Trek uniform. But what they saw was something new.
Rematch with the Past
The seven teams loaded their autonomous systems onto the seven supercomputers late last week, and sometime Thursday morning, Darpa set the contest in motion. Each supercomputer launched software that no one outside Darpa had ever seen, and the seven bots looked for holes. Each bot aimed to patch the holes on its own machine, while working to prove it could exploit holes on others. Darpa awarded points not just for finding bugs, but for keeping services up and running.
To show that no one else had access to the seven supercomputers, and that the bots really were competing on their own, Darpa erected its network so that an obvious air gap sat between the machines and the rest of the ballroom. Then, every so often, a robotic arm would grab a Blu-ray disc from the supercomputer side and move it across the gap. This disc included all the data needed to show what was happening inside the machines, and after the arm fed this into a system on the other side of the gap, Darpa’s Tron-like visualization appeared on the giant TV looming over the arena.
Darpa planted countless security holes on the seven machines. But some were particularly intriguing. As the curtain went up on the contest, Darpa’s color commentators—astrophysicist turned TV host Hakeem Oluseyi and a white-hat hacker known only as Visi—revealed that some were modeled on infamous security holes from the Internet’s earlier days. This included the Heartbleed bug (discovered in 2014), the bug exploited by the SQL Slammer worm (2003), and the Crackaddr bug (2005). Darpa called them rematch challenges.
The competition was divided into rounds—96 in all. Each round, Darpa launched a new set of services for the bots to both defend and attack. In the earliest rounds, Mayhem, the bot created by the team from Carnegie Mellon, edged into the lead, trailed closely by Rubeus, built by defense contractor Raytheon.
Rubeus played a particularly aggressive game. It seemed intent on exploiting holes in the other six machines. “It’s throwing against absolutely everything,” Visi said at one point. And this seemed rather successful. But its competitor, Mayhem, had a certain knack for protecting its own services and, crucially, for keeping them up and running. As the game progressed, the two bots took turns at the top of the leader board.
But then, several rounds in, Rubeus stumbled and dropped in the rankings. In patching a hole in its own machine, it accidentally hampered the machine’s performance. That’s the danger of applying a patch—both during a hacking contest and in the real world. In this case, the patch didn’t just slow down the service that needed patching; it slowed down all other services running on the machine. As Visi put it, the bot had launched a denial-of-service attack against its own system.
By contrast, Mayhem seemed to take a more conservative and considered approach. As team leader Alex Rebert later told me, if the bot found a hole in its own machine, it wouldn’t necessarily decide to patch, in part because patches can slow a service down, but also because it can’t patch without temporarily taking the service offline. Through a kind of statistical analysis, the bot weighed the costs and the benefits of patching and the likelihood that another bot would actually exploit the hole, and only then would it decide whether the patch made sense and would give it more points than it would lose.
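The kind of trade-off Rebert describes can be sketched as a simple expected-value comparison. This is only an illustration, not ForAllSecure's actual model: the `PatchOption` fields, the scoring weights, and the example numbers are all hypothetical, and the real contest scoring was more involved.

```python
from dataclasses import dataclass


@dataclass
class PatchOption:
    """Hypothetical inputs for one vulnerable service (all values are assumptions)."""
    exploit_probability: float   # estimated chance per round that a rival exploits the hole
    exploit_penalty: float       # points lost in a round when the hole is exploited
    downtime_rounds: int         # rounds the service is offline while the patch deploys
    availability_points: float   # points earned per round by a service running at full speed
    slowdown_fraction: float     # share of availability points lost once the patch is in place
    rounds_remaining: int        # rounds left in the game


def expected_loss_without_patch(option: PatchOption) -> float:
    """Expected points lost to exploitation if the hole is left open."""
    return option.exploit_probability * option.exploit_penalty * option.rounds_remaining


def expected_loss_with_patch(option: PatchOption) -> float:
    """Points lost to downtime plus the ongoing slowdown after patching."""
    downtime = min(option.downtime_rounds, option.rounds_remaining)
    downtime_cost = downtime * option.availability_points
    patched_rounds = option.rounds_remaining - downtime
    slowdown_cost = patched_rounds * option.availability_points * option.slowdown_fraction
    return downtime_cost + slowdown_cost


def should_patch(option: PatchOption) -> bool:
    """Patch only when patching is expected to cost fewer points than staying exposed."""
    return expected_loss_with_patch(option) < expected_loss_without_patch(option)


# A hole rivals are likely to hit, with a cheap patch: patching pays off.
likely_target = PatchOption(0.8, 10.0, 1, 5.0, 0.05, 20)
# A hole rivals are unlikely to find, with a patch that slows the service a lot.
unlikely_target = PatchOption(0.01, 10.0, 1, 5.0, 0.3, 20)

print(should_patch(likely_target))
print(should_patch(unlikely_target))
```

Under these made-up numbers, the first hole is worth patching and the second is not, which is the same logic that let a bot leave some holes open and still come out ahead on points.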
In round 30, Rubeus was smart enough to remove the patch that was causing its own machine so much trouble, and its performance rebounded. But it continued to trail Mayhem as well as Mech.Phish, a bot designed by a team from the University of California, Santa Barbara.
Mech.Phish sat in last place for the early rounds—probably because it patched every hole it found. Unlike Mayhem, it was light on game theory, as team member Yan Shoshitaishvili later told me. But as the game continued, Mech.Phish started climbing the leader board. It seemed to have a knack for finding particularly complex or subtle bugs. Certainly, it was the only bot that proved it could exploit the bug modeled on Crackaddr.
This exploit was so impressive because it targeted a bug that isn’t always there. Before exploiting the hole, the bot must first send a series of commands to create the hole. Basically, it must find the right route among an enormous array of possibilities. That number is so large, the bot can’t try them all. It must somehow home in on a method that will actually work. It must operate with a certain subtlety, mimicking a very human talent.
But despite Mech.Phish’s human flair, Mayhem remained in the lead.
The Unintended Bug
Then, in round 52, Mayhem quit working. For some reason, it could no longer submit patches or attempt exploits against other machines. And it remained dormant through round 60. And round 70.
As the game continued, other bots showed a surprising knack for the task at hand. At one point, Xandra, a bot designed by a team from the University of Virginia and a company called GrammaTech, exploited a bug that Darpa didn’t even know was there. And a second bot, Jima, designed by a two-person team from Idaho, successfully patched the bug.
And yet, Mayhem stayed atop the leader board. It was still top after round 80. And it was top after round 90—even though it remained dormant. And then just as suddenly, in round 95, it started working again. In round 96, it won the contest—at least according to preliminary results.
Its play in the first 50 rounds was so good, its game theory so successful, that the other bots couldn’t catch up. Over the remaining rounds, Mayhem’s patches continued to provide defense, and though it wasn’t able to patch additional holes or exploit new holes in other machines, enough of its services continued to run as they should—in part because it had often decided not to patch. Mayhem didn’t just patch and exploit security holes. It weighed the benefits of patching and exploiting against the costs. It was smart.