These are the days of miracle and wonder
This is a long distance call—Paul Simon
As a person who occasionally sends e-mail and talks on the telephone, I’ve been following with interest and curiosity all the recent press reports about alleged eavesdropping and data-mining by U.S. government agencies. Mathematically, the most intriguing part of this story has to do with the analysis of the call-detail graph, the big database of phone-company records that shows who calls whom (without revealing who said what). For now, though, I want to set that topic aside and focus on the rumors that someone might actually be listening in on my telephone conversations or reading over my shoulder when I’m online.
The current round of controversy began last December with a story by Eric Lichtblau and James Risen in The New York Times, claiming that the National Security Agency, with the cooperation of telephone companies, had “traced and analyzed large volumes of telephone and Internet communications flowing into and out of the United States.” Of course there has been speculation for many years that the NSA surreptitiously monitors international communications; in a fabled program known as Echelon, the agency supposedly intercepted satellite signals and deployed divers or submarines to secretly tap undersea cables. The new allegations describe a much less arduous procedure. According to Lichtblau and Risen, “senior government officials arranged with officials of some of the nation’s largest telecommunications companies to gain access to switches that act as gateways at the borders between the United States’ communications networks and international networks.”
The Lichtblau and Risen account was short on specifics, offering no hint of which switching centers had been tapped, or even which companies might be participating. But in April a more-concrete assertion surfaced. Mark Klein, a retired technician for AT&T, described the installation of equipment that he identified as NSA surveillance gear at an AT&T facility in San Francisco. He said the monitoring hardware was installed late in 2002 or early in 2003 in room 641A at 611 Folsom Street, a building then owned by SBC Communications, where three floors were occupied by AT&T. (SBC has since merged with AT&T.) Furthermore, Klein said, a coworker had told him of similar surveillance outposts in Seattle, San Jose, Los Angeles and San Diego. Since there’s no obvious reason to single out the West Coast, it’s a reasonable inference that if any of these reports are true, then similar eavesdropping facilities may have been built in other major cities as well.
Klein is a witness in a lawsuit brought against AT&T by the Electronic Frontier Foundation. Documents that Klein submitted as evidence have been put under seal by the court, but a few items have apparently leaked out, and on Monday Wired News published what purports to be the complete collection of documents—or at least that’s the suggestion conveyed by the headline, “Whistle-Blower’s Evidence, Uncut.” An accompanying story says the material came from “an anonymous source close to the litigation.” The documents include several memos, tables and diagrams labeled “AT&T Proprietary.”
Much about the case remains murky. Still, if we accept Klein’s interpretation of the documents, there’s enough information to begin asking some quantitative questions.
Drinking from the Firehose
From Klein’s account, it seems this particular installation monitors only the packet-switched network (roughly speaking, the Internet) and not the circuit-switched part of the communications infrastructure (including most telephone calls). We are told that fiber-optic cables carrying traffic to and from 16 other Internet providers and exchanges had “splitters” installed, so that a replica of the signals could be diverted into room 641A. The 16 fiber circuits operate at various data rates: four at OC-3 (roughly 150 megabits per second), eight at OC-12 (600 megabits per second), and four at OC-48 (2,400 megabits per second). The total bandwidth is thus about 15,000 megabits per second. Dividing by 8 bits per byte, and converting from the bits-per-second convention of the communications industry (where a million is 10^6) to the bytes-per-second world of computing (where a million is 2^20), we arrive at a data rate of a little under 2 gigabytes per second. That’s the maximum capacity of the fibers, but they surely do not run full all the time. If we suppose that the load factor is 50 percent (a number I’ve just plucked out of thin air), then room 641A is taking in roughly 1 gigabyte per second, or 86 terabytes per day.
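The arithmetic above is easy to check with a few lines of Python. (The 50 percent load factor is, as noted, pulled out of thin air, and the SONET line rates are rounded as in the text.)

```python
# Aggregate bandwidth of the 16 tapped fiber circuits, in megabits per second.
# Rounded line rates, as in the text: OC-3 ~ 150, OC-12 ~ 600, OC-48 ~ 2,400.
circuits = 4 * [150] + 8 * [600] + 4 * [2400]
total_mbps = sum(circuits)                  # 15,000 megabits per second

bits_per_sec = total_mbps * 10**6           # telecom convention: mega = 10^6
bytes_per_sec = bits_per_sec / 8            # 8 bits per byte
gib_per_sec = bytes_per_sec / 2**30         # computing convention: binary gigabytes

load_factor = 0.5                           # assumed; plucked from thin air
tb_per_day = bytes_per_sec * load_factor * 86400 / 10**12

# About 81 TB/day at exactly 50 percent load; rounding the half-load rate
# up to a full gigabyte per second gives the 86 TB/day cited in the text.
print(f"{total_mbps} Mbit/s = {gib_per_sec:.2f} GiB/s at full capacity")
print(f"at {load_factor:.0%} load: about {tb_per_day:.0f} TB per day")
```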
Coping with such a flood of data is a significant challenge, but it’s not totally beyond imagining. In high-energy physics and astronomy a few experiments will soon be generating data volumes of the same order of magnitude. If the NSA wanted to record the entire data stream for later perusal, it could probably be done, at least for brief bursts. One item of equipment in room 641A, according to the Klein documents, is a Sun StorEdge T3 disk array; the exact model is not indicated, but the largest version available in 2003 held 168 terabytes of storage space—enough to last a couple of days. Much more capacious storage arrays—with capacities beyond a petabyte (1,000 terabytes)—are readily available now. (But the bandwidth of the connections being monitored has probably also grown in the past three years.)
It’s been said that if Google can index the entire World Wide Web, and even store many of the pages in its cache files, then surely the NSA can do the same. But the task is not, in fact, the same. For one thing, the Internet is not just the Web; there are many other streams of data passing over the network, including everything from e-mail and “instant messages” to FTP transfers. What’s more important, Google sees the Web mainly as a quasi-static structure, made up of pages that it can visit on its own schedule, and which it needs to re-index only after some change is noted. A sniffing device installed on an Internet backbone circuit has a very different view of the Web. If 100,000 people visit yahoo.com, then the monitor sees the Yahoo home page streaming by 100,000 times. This redundancy complicates the eavesdropper’s task; on the other hand, it also offers the potential of capturing additional information. You not only learn what’s on every Web page but also how many people are visiting that page, and maybe even who they are.
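The difference between the crawler’s view and the wiretapper’s view can be sketched in a few lines. This toy model assumes the tap has already reduced each observed HTTP request to a (requesting address, page fetched) pair; all the addresses and pages are invented.

```python
from collections import Counter

# Toy model of what a backbone tap observes: one record per page view.
# A crawler would see each page once; the tap sees it once per visit.
observed = [
    ("10.0.0.7", "yahoo.com/"),
    ("10.0.0.9", "yahoo.com/"),
    ("10.0.0.7", "yahoo.com/"),
    ("10.0.0.4", "example.org/news"),
]

# How many times each page streamed by...
page_views = Counter(url for _, url in observed)

# ...and, as a side effect, who the visitors were.
visitors = {url: {addr for addr, u in observed if u == url}
            for url in page_views}

print(page_views["yahoo.com/"])        # 3 views
print(sorted(visitors["yahoo.com/"]))  # by 2 distinct addresses
```

The redundancy that makes the eavesdropper’s storage problem harder is exactly what yields the extra information: visit counts and visitor identities, neither of which a crawler can see.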
Frankly, trying to squirrel away a copy of everything is a brute-force strategy that seems unlikely, unnecessary and a bit dull-witted. It would merely defer the main problem: At some point you have to digest all that information. There’s no point in collecting things that you’ll never have time to read. Better to be more selective up front, scanning the bit stream as it rushes by, and only saving items that look like they might be worth closer scrutiny. The Klein documents support the hypothesis that room 641A is set up for such sifting. On the manifest of equipment to be installed in the room, the most distinctive item is a machine called a Narus STA 6400. Narus, Inc., is a company headquartered in Mountain View, Calif., whose name is said to derive from the Latin gnarus, meaning all-knowing. The STA in STA 6400 stands for Semantic Traffic Analysis, which is presumably meant to give the impression that the device can classify Internet transmissions by their meaning.
According to the Narus Web site, the company got its start helping Internet service providers monitor, measure (and in some cases police) traffic on their own networks. For example, some carriers prohibit voice-over-internet-protocol (VoIP) telephone calls. The STA 6400—like the new model that has supplanted it, called the NarusInsight—is said to distinguish VoIP packets from other kinds of traffic. Other network operators might want to detect or regulate peer-to-peer file sharing; again, the Narus software claims to recognize such activity. These tasks of discrimination are not easy, but in my opinion they don’t quite rise to the level of “semantic” analysis. Recognizing a packet as belonging to a Skype conversation or a Kazaa download doesn’t penetrate very deeply into the packet’s meaning. And merely classifying packets according to their type or protocol doesn’t look like a very promising approach for detecting terrorist activity. But we can’t know what the NSA is doing with this equipment—if indeed it exists and if they are operating it.
Grepping the Net
For the sake of argument, suppose it’s all true, and the NSA is equipped to intercept and evaluate every bit that passes through the global Internet. If you were an analyst expected to retrieve useful information from this data stream, what kinds of patterns would you look for? Here are some possibilities that occur to me. I’d be interested to hear other ideas.
- Tracking specified individuals. If we knew the IP number of Osama bin Laden’s computer, we could slurp up every packet traveling to or from that machine, and reconstruct all the details of his Internet activity, just as if we were looking over his shoulder. (Then again, if we knew his IP number, we would also have a pretty good idea of his physical whereabouts, and we could go have an offline chat.) Domain names, e-mail addresses, account names and other kinds of identifiers could also provide a key to tracking known individuals. However, if surveillance of targeted individuals is the main aim of the program, wiretapping the entire Internet seems like a grotesquely wasteful way to go about it. There are more direct and efficient means of doing the same thing.
- Monitoring specified organizations. Once upon a time, this was the raison d’être of intelligence services—planting bugs in embassies, breaking the codes of the KGB or the GRU. Comprehensive Internet surveillance is presumably useful for such purposes too, although Al Qaeda doesn’t have an embassy.
- Getting inside the envelope. Some of the people charged in the train bombings in Madrid in 2004 are said to have communicated through a shared Yahoo e-mail account, from which they never actually sent any e-mail. According to the Spanish courts, one conspirator would log onto the account and write a draft of a message, leaving it unsent; then the rest would log on, using the same name and password, and read the draft. Perhaps they thought that an unsent message would never be exposed to eavesdroppers, but that is a misconception; in fact the text does flow over the network every time someone views it. (How else could it be displayed on the screen?) There’s no evidence that the NSA actually spotted any such messages, but in principle they could, if they knew what to look for, and if the packets passed through a node of the network where they had a listening post.
- Staking out the watering hole. When we think of wiretapping the Internet, what comes to mind first is reading people’s e-mail, or maybe their instant-message traffic. These are person-to-person modes of communication that seem to offer some modicum of privacy—which may be why we suppose that an intelligence agency would want to pry them open. Other Internet forums, such as Usenet newsgroups, are so public that there’s no need for skullduggery to read their contents. Nevertheless, it’s possible that e-mail would not be the main target of surveillance, and that public areas of the Internet would attract considerable attention. For example, the NSA might find it worthwhile to compile lists of visitors to certain Web sites (maybe even this one).
- All of the above. A government agency with legal clout and insider knowledge could gather all this information by more-conventional and less-intrusive means—but what a pain to have to pursue each case individually. If I were suspected of terrorist activity, I’m pretty sure my Internet service provider would give up the goods on me, but there’d be so much paperwork to do, and then the same rigmarole all over again with the hosting service that provides space for this Web site, and with Google for my gmail account, and so on. A tap on the Internet backbone offers one-stop shopping. No need to ask amazon.com what books I bought. No need to ask Google what terms I searched for. You just scoop it out of the river as it floats by.
- Grepping the Net. Spying on people who are already known or suspected malefactors may be the routine business of the intelligence services, but the mythic promise of the all-seeing surveillance device is the possibility of uncovering a conspiracy de novo, breaking up the plot before it’s even hatched. Legend has it that the heart of the Echelon program was blind scanning of intercepted communications traffic for interesting words or phrases. And so now we can imagine the NSA programming its Narus STA 6400s to grep “bomb \(plot\|conspiracy\)” against the entire Internet. Maybe this works. I’m skeptical.
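One reason for skepticism is easy to demonstrate. Here is a minimal sketch of blind keyword scanning in the spirit of that grep pattern; the sample lines are invented, and a real system would presumably be far more elaborate.

```python
import re

# Naive keyword scan over a stream of text, in the spirit of
# grep 'bomb \(plot\|conspiracy\)'. Sample traffic is invented.
pattern = re.compile(r"bomb (plot|conspiracy)")

stream = [
    "the movie was a total bomb plot-wise",
    "new photos of the bombe at Bletchley Park",
    "see you at the game tonight",
]

hits = [line for line in stream if pattern.search(line)]
print(hits)  # flags only the innocuous movie review: a false positive
```

At Internet scale, even a tiny false-positive rate buries the analysts, while anyone who matters can defeat the filter by choosing different words.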
- Retrospective analysis. Here’s a reason for trying to save a copy of the entire Internet data stream, at least for a period of days. On September 10, 2001, an e-mail mentioning AA11, UA175, AA77 and UA93 would not have attracted the slightest attention, but a day later those flight numbers were notorious. I have no idea whether such an e-mail was ever sent, but I’m sure that investigators would have welcomed an opportunity to find out.
- Social network analysis. The reported interest of the NSA in telephone call-detail graphs hints at an attempt to discover communities of people with shared interests. The most clear-cut case is to identify cliques in the graph: sets of people who have all called one another. In the case of the long-distance telephone network, the databases of call records already exist, compiled by the telephone companies for their own purposes. Similar kinds of social-network analysis—possibly even more interesting—could be done with Internet data, but as far as I know, no one keeps the necessary records. Compiling such a database strikes me as a plausible motivation for the kind of monitoring the NSA is said to be doing. The closest analogy to the telephone case would be looking for groups of people who have all exchanged e-mail (or other kinds of direct messages), but the technique is not limited to that. Someone might also take an interest in finding groups of people who have all visited the same set of Web sites within a certain period of time, or downloaded the same files. And it’s notable that this kind of information is particularly easy to acquire; it mostly depends on addresses of data packets, not their content.
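The clique idea can be made concrete on a toy call graph. This sketch brute-forces all three-person cliques; the names and calls are invented, and real call-detail graphs, with hundreds of millions of nodes, would need far cleverer algorithms than exhaustive search.

```python
from itertools import combinations

# Toy call-detail graph: an edge means the two parties have called
# each other. All names are invented.
edges = {("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
         ("carol", "dave")}

def connected(a, b):
    return (a, b) in edges or (b, a) in edges

people = sorted({p for edge in edges for p in edge})

# A clique of size 3: three people who have all called one another.
cliques = [trio for trio in combinations(people, 3)
           if all(connected(a, b) for a, b in combinations(trio, 2))]
print(cliques)  # [('alice', 'bob', 'carol')] -- dave is left out
```

Note that the input here is purely who-called-whom; no conversation content is needed, which is what makes this sort of analysis so cheap to run against packet addresses.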
- Jiggery-pokery. It would be unfair to conclude this list without mentioning some of the more cynical speculations about the likely uses of an Internet spinal tap. According to the Bush administration, the program of warrantless wiretapping was approved in the weeks after 9/11 as part of an effort to find the people responsible for the attacks and prevent any recurrence. But the search for the 9/11 conspirators led mainly to shadowy figures in Afghanistan. In 2001, Afghanistan was possibly the least-wired place on earth; the Taliban had banned the Internet, and there was only one computer in the country with a sanctioned connection. Thus the Internet would not seem to be the most obvious place to look for “chatter” among those hiding out in the caves of Tora Bora. On the other hand, the Internet is an excellent place for keeping track of political opponents or watching out for signs of disloyalty within your own ranks. (But that would be wrong.)
Occam’s Other Razor
Does any of this make sense? Set aside all questions of what’s legal or moral or even prudent, and simply consider what’s feasible. Can you listen in to the simultaneous clatter of several hundred million computers and extract anything of value from all the noise?
It’s easy to find people who aren’t hiding, but they’re probably not the ones you want to catch. And for those who do want to conceal their presence and their intentions, the Net offers many, many dark corners, despite room 641A. For example, it’s well known that 70 or 80 percent of all e-mail is spam, but the NSA dare not ignore it. If I were sending signals to the members of a covert cell, I would be tempted to embed my message in an offer for V1@gra or replica wristwatches. Indeed, much spam comes with an addendum of apparently random text, meant to fool spam-blocking filters. How do we know that’s all it fools? Or one could ride piggyback on the high-bandwidth BitTorrent, hiding out among the kids trading music tracks and warez. Then there’s cryptography; more on that below.
The story told by Mark Klein is highly plausible, and the published documents make a strong case, yet there are a few doubtful details. None of the documents that come from AT&T mention the NSA or any other government agency. The NSA connection comes solely from Klein’s own testimony. In statements to the press he has said that the NSA interviewed AT&T employees for a job that would involve working in room 641A, and eventually hired someone. Similarly, in a statement published by Wired News, Klein says, “only people with security clearance from the National Security Agency can enter this room.” Although I’ve personally never had any dealings with the NSA (as far as I know), this story sounds fishy. Would the NSA—long known as “No Such Agency”—identify itself so openly when interviewing telephone company employees? Do they hand out cards that say “NSA Security Clearance”?
But if not an NSA snooping roost, what is in room 641A? One possibility is a traffic-monitoring facility that AT&T might have built for its own use. Some years ago AT&T developed a system called PacketScope, not too different from the modern Narus devices; papers in the open literature [see update below] discuss the installation of PacketScope equipment at two AT&T Internet hubs. Other network operators do the same kind of monitoring. They want to know where their traffic comes from and goes to, and what their customers are doing with the bandwidth. But if the 16 splitters and the Narus STA 6400 and the Sun StorEdge T3 all have such an innocent explanation, why doesn’t AT&T say so? Not everyone would believe them, but the denial would certainly muddy the water. Instead AT&T has released a bland statement about their commitment to privacy and their obligation to assist law enforcement. And then the federal government intervened to dismiss the suit against AT&T, claiming that the issue cannot be adjudicated without revealing state secrets and endangering national security. It’s almost as if they want us to believe the worst.
Okay, I’ll go along. And I’ll keep in mind that this is the looking-glass world, ruled by Occam’s other razor, the one that favors whatever explanation is most convoluted and counterintuitive. Why would the government want the whole world to know that we’re listening? Here’s my best shot at an answer: Maybe to encourage people to encrypt their most-sensitive communications. How would that benefit the eavesdroppers? Packets carrying encrypted text should be easy to recognize because of their “flat” statistics; they label themselves suspicious. But, you say, it’s no use recognizing an encrypted message if you can’t read what’s inside. Who can’t read what’s inside?
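Those “flat” statistics are straightforward to measure. The sketch below computes the Shannon entropy of a byte stream; well-encrypted data is close to uniform over all 256 byte values, near the 8-bit maximum, while English text falls well short. (Random bytes from os.urandom stand in for ciphertext here; they are a statistical proxy, not actual encryption.)

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, from 0 to 8 bits."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

english = b"the quick brown fox jumps over the lazy dog " * 50

# Proxy for ciphertext: statistically flat, near-uniform bytes.
flat = os.urandom(len(english))

print(f"English text: {entropy_bits_per_byte(english):.2f} bits/byte")
print(f"'flat' data:  {entropy_bits_per_byte(flat):.2f} bits/byte")
```

A monitor need not decrypt anything to apply this test; a packet whose payload scores near 8 bits per byte advertises itself as encrypted (or compressed) and can be set aside for special attention.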
Update 2008-09-07: The “papers in the open literature” mentioned above are no longer quite so open. The link above has gone dead; the paper is still listed in the bibliography of Ramon Cáceres, but links to PDF and Postscript versions have been removed. As of today a PDF can still be downloaded through Citeseer.