Archive for the ‘modern life’ Category

The demon in the dryer

Wednesday, January 31st, 2007

Doing some laundry last night, I threw a duvet cover and nine pairs of socks into the dryer together. (Household hint: Don’t.) The duvet cover is a giant fabric pouch with a slit along one side; think of a queen-size pita pocket. Initially, all the socks were outside the pouch. When I pulled the load out of the dryer, all but three of the socks were inside the cover.

There’s nothing obvious about the geometry of this big, floppy bag that would suggest it has any special ability to capture socks. It’s not shaped like a fish trap with a funnel opening. In the random tumbling of the dryer, I would have thought that socks would move into or out of the opening with the same probability. But if that’s the case, then finding a 15–3 distribution is quite a fluke. There are 218 = 262,144 ways of arranging the socks in two groups, and only 5,220 of them have three or fewer socks in one of the groups. That’s less than 2 percent and a little beyond the 2σ level of unlikelihood. Does Maxwell’s sock demon live in my dryer?

The arXiv rolls over

Friday, December 8th, 2006

The mathematics section of the arXiv archived 989 preprints in October. Why is that fact worth noting? Because arXiv papers are identified by numbers of the format YYMMNNN, with two digits for the year, two digits for the month, and a three-digit sequence number. Ten more papers and all the world’s mathematicians would have been put on involuntary furlough, forbidden to sum another series or solve another equation until month’s end. It would have been rather like the state of New Jersey shutting down when the legislature fails to pass a budget bill. (I realize there are people who would find neither event regrettable.)

The arXiv is now introducing a new numbering scheme, with room for 9,999 papers per month. “If current growth rates continue,” says the announcment, “we expect to change the sequence number to 5-digits NNNNN in 10 to 15 years.” The new format continues to allocate two digits to the year. The arXivists seem to have no worries about disambiguating centuries. They weren’t panicked by Y2K, and apparently they’re not afraid of 9108 either, when the arXiv will have its own centenary.

Trivia note: Curiosity moved me to look up the very first paper issue by the arXiv (although the site was not then arxiv.org but rather xxx.lanl.gov). Paper 9108001 is “Exact Black String Solutions in Three Dimensions,” by James H. Horne and Gary T. Horowitz; the abstract begins, “A family of exact conformal field theories is constructed which describe charged black strings in three dimensions.” I can all too easily imagine that paper 9108.00001 in the new series will be on the same topic.

Back to school

Wednesday, December 6th, 2006

The U.S. Supreme Court heard arguments the other day on programs meant to maintain racial diversity in the public schools of Seattle and Louisville. Listening to accounts of the debate put me in mind of Thomas C. Schelling’s elegant mathematical model of race relations. The model suggests that extreme segregation can arise spontaneously even when no one in the population desires that outcome. I wish I could submit a friend-of-the-court brief (although I’m not feeling very friendly toward this particular court).

Thomas Schelling segregation model The panel at left shows four stages in the evolution of a simple version of Schelling’s model. Initially, equal numbers of individuals from two groups—they’re shown here as blue and tan, but I’m going to refer to them as black and white—are dispersed at random. Neither race seeks to escape or exclude the other; the only racial preference is of the mildest sort—a desire not to be entirely cut off from people of one’s own group. But that desire is enough to cause the two populations to separate like oil and water.

The simulation takes place on a two-dimensional lattice with N sites, occupied by M agents, where M < N, so that at least a few sites remain vacant. At each time step each agent examines the eight sites surrounding its position, calculating the proportion of neighboring agents that match its own color. If the proportion of same-race neighbors exceeds some given threshold β, the agent is contented and stays put. On the other hand, if the fraction of like-colored neighbors falls short of the threshold, the agent moves, choosing its new home at random from among the N – M vacant sites; in the process of moving, of course, the agent creates a new vacancy in its former position. The process is then repeated, stopping only if and when there are no more discontented agents.

The model has two parameters: the threshold of contentment β and the filling factor (the fraction of occupied sites, M/N). As each of these parameters increases from 0 toward 1, agents have a harder time finding a satisfactory home. The interesting region in parameter space is where the threshold β lies between about 0.25 and 0.5, and the filling factor is close to 1, so that there are few vacancies on the lattice. The high filling factor excludes solutions in which people get along by living in isolation. As for the threshold, it’s easy to find a pattern acceptable to all if no one minds being in the minority, but the challenge naturally gets harder as the agents become more discriminatory. Still, it’s important to note that even at a threshold of 0.5 (i.e., each agent wants to have at least half of its neighbors match its own color, and no one is willing to be in a minority) there is an obvious solution that makes everyone happy. That solution is a checkerboard pattern, which happens to represent a maximum of racial integration. But the checkerboard is not the pattern that typically emerges when the model is run from a random initial state. Instead, within a few hundred time steps, the two populations segregate themselves, forming large, amoeboid areas of racial uniformity. The sobering lesson of this simple model is that it doesn’t take deep-seated and vitriolic racism to produce a stark pattern of segregation; it’s enough that each of us feels uncomfortable when outnumbered. And the sharp color lines develop even though those who move don’t take race into account when choosing a new location.

For anyone who might care to experiment with the Schelling model, there are Java applet versions here and here (and probably more). The model is also among the demo programs included with several versions of the StarLogo and NetLogo programming languages.

Schelling’s model deals with residential racial patterns, but it’s not hard to adapt the same basic ideas to the study of segregation and integration in school systems, which was the issue before the Supreme Court this week. The main difference is that in the school case we need to look at a few large subsets of the population (the schools) rather than many small neighborhoods. In the scholastic variant of the Schelling model, a student attending a school where the proportion of his or her own race falls below the threshold β would transfer to a randomly chosen vacant place elsewhere in the system. Although I have not yet tried to build and run a computational version of this model, I think it’s possible to see what kind of behavior to expect, simply by considering the smallest nontrivial model: two schools of equal size, shared by black and white populations that are also equal. In this situation, if the fraction of white students in one school is r, then the fraction of black students in the same school must be 1–r, while in the other school the black fraction is r and the white fraction 1–r. Suppose r ≤ 1/2. Then any allocation of students for which r ≥ β will be agreeable to everyone. At the same time, there is another possible solution, namely r = 0, or total segregation, with one all-white school and one all-black school. As β approaches 1/2, the integrated solution becomes increasingly fragile and hard to maintain; at β = 1/2, the equilibrium is instantly destroyed by the slightest perturbation (such as someone moving into or out of the district); beyond β = 1/2 the shared-schools solution disappears, and only total segregation is feasible.

Does this simple-minded model have anything to do with the real-world plight of kids going to school in Seattle and Louisville? Both of those cities once had highly segregated schools and achieved a better racial balance only by means of deliberate incentives and policies, applied steadily over more than 30 years. In the case of Louisville, the desegregation plan was mandated and supervised by the courts from 1973 until 2000; since then, aspects of the program have been continued voluntarily as a means of maintaining racial diversity. In Seattle the measures were adopted by the school board without outside compulsion. Currently, Seattle’s 10 high schools have a citywide open-enrollment policy, so that any student can choose any of the schools. Those schools that have more applicants than places use several factors to determine which students will be admitted, including geography and the presence of siblings already attending the school. Racial balance is the final tiebreaker.

The students and parents who have challenged these policies (supported by an amicus brief from the Bush administration) argue that race should not be a factor at all in assigning children to schools—that it’s no more legimate to consider race as a way of achieving integration that it would be in maintaining segregation. In other words, policies must be colorblind, even though people aren’t.

We won’t know for months how the supremes will rule on this issue, but savvy court-watchers say the outlook is dim for the diversity policies; and if they are overturned in Seattle and Lousiville, similar programs will have to be dismantled in many other cities as well. What’s the appropriate response? One might well try to out-Herod Herod, taking the position that all school-assignment decisions (and analogous choices both within the schools and everywhere else in American life) should be conducted as a pure lottery. After all, if race is a forbidden criterion, how can we allow accidents of residence or wealth or personal history to intrude? This strategy has the virtue that it would make pretty much everyone fiercely unhappy—and so it probably wouldn’t last long.

For another approach to the problem, I would like to ask this: What could we change about Schelling’s model in order to get a somewhat brighter vision of the future? The obvious candidate is the model’s one-sided choice function. In the world of the Schelling model, we all want to avoid being too much in the minority, but no one worries about being too much in the majority. I’d like to think, on the contrary, that we might have at least a few people who go out of their way to seek diversity, who pull up stakes and move not only if they have too many neighbors of another race but also if they have too few. It would be interesting to explore the phase diagram of a model that included such a subpopulation.

A few references:

Schelling, Thomas C. 1971. Dynamic models of segregation. Journal of Mathematical Sociology 1:143–186.

Schelling, Thomas. C. 1978. Micromotives and Macrobehavior. New York: W.W. Norton.

Granovetter, Mark S., and Roland Soong. 1988. Threshold models of diversity: Chinese restaurants, residential segregation and the spiral of silence. In Sociological Methodology (edited by Clifford Clogg), pp. 69–104. PDF

Clark, W. A. V. 1991. Residential preferences and neighborhood racial segregation: a test of the Schelling segregation model. Demography 28:1–19.

Bruch, Elizabeth E., and Robert D. Mare. 2004. Neighborhood choice and neighborhood change. PDF

Aguilera, Antonio, and Edgardo Ugalde. 2006. A spatially extended model for residential segregation. http://arxiv.org/abs/nlin/0607026

Radio alert

Saturday, November 25th, 2006

Chris Joyce and crew \"getting ambi\"

Christopher Joyce of National Public Radio is about to launch a series of reports on the “Industriosphere,” in which I am taking part. As in my book Infrastructure: A Field Guide to the Industrial Landscape, the subject matter is all the manmade bits of the environment we live in. (What is that stuff? What does it do? Why is it there?) Two weeks ago I spent a couple of days exploring and recording in and around Washington, DC, with Joyce and his NPR crew. The first report is scheduled for broadcast tomorrow afternoon (November 26) on Weekend All Things Considered. Check your local NPR station for broadcast times, or listen on the Internet. After the broadcast, some additional material (including photographs) will be posted on the NPR web site. And there is already an Industriosphere Flickr group.

In the photo above Christopher Joyce (foreground), producer Tina Tennessen (right) and audio engineer Drew Reynolds are “getting ambi” (recording ambient sounds) near Thomas Circle in Washington, where we all spent some time playing in traffic.

Update 2006-12-03: A brief followup interview, discussing the wooden rooftop water tanks of New York City, was broadcast on Saturday 2 December 2006. The interview is online here.

Running on empty

Friday, November 24th, 2006

MINI CooperDriving over the river and through the woods yesterday, I was running low on fuel. My car has two kinds of instruments to tell me that I’ll soon be standing by the side of the road feeling foolish. A conventional gas gauge shows the fraction of a tankful remaining, presumably based on readings from some sort of float mechanism inside the tank. The second instrument measures the rate of fuel flow to the engine, showing the result on a digital display that can be set to any of three modes, labeled “consumption,” “average consumption” and “range.” The two “consumption” modes are calibrated in miles per gallon; the “range” mode gives an estimate of the distance remaining until the tank is empty, in miles. When you’re nervous about whether or not you can make it to the next gas station, the range is clearly of interest.

But keeping an eye on the range estimate is also somewhat disconcerting. If the meter says you can last another 23 miles, and then you drive a mile, it seems reasonable to expect that the meter will report a remaining range of 22 miles. In fact it may well say 19 miles, or 23, or even 26. It’s particularly bizarre to see the range increase as you continue driving.

What’s going on here? It’s not hard to guess. The estimated range is simply the number of gallons remaining in the tank multiplied by the fuel-use rate in miles per gallon. Both measurements doubtless have some noise in them, but variations in the fuel flow rate are the major cause of fluctuations in the range estimate. For my car, the instantaneous fuel economy dips down below 10 miles per gallon under hard acceleration, and it appears to go well above 100 miles per gallon when coasting downhill. (The meter tops out at 99.9 mpg.) These variations could alter the range estimate by a factor of 10 or more. Strictly speaking, the fluctuating estimates are not wrong—they indicate the actual range if you were to continue driving exactly as you were at the moment of measurement—but some averaging or filtering would seem sensible.

In fact, I think the range readings I see on my dashboard instrument are smoothed to some extent. The number is updated at intervals of about 30 seconds, and it may reflect an average calculated over a somewhat longer period. The question I want to ask is this: What is the optimum averaging interval—optimal in the sense that it minimizes some measure of error in the estimates? I doubt there can be any definitive answer without making some assumptions about the nature of the fluctuations, but I have a heuristic proposal that seems pretty good to me.

Here’s how I was thinking about the problem during my Thanksgiving pilgrimmage. Suppose you keep driving until the tank runs dry, all the while recording your distance and rate of fuel consumption, moment by moment. Retrospectively, then, it’s easy to determine the number that the range meter should have been displaying at any point during the trip: You just measure the distance backward from the point where the engine died, and by definition that’s the range remaining. But note that for every point along the route, this “retrodicted” range should be equal to the number of gallons left in the tank at that point multiplied by the fuel-use rate (in miles per gallon) averaged over the remainder of the distance. This fact suggests a perfect estimation strategy: You should always average the fuel-use rate over the remaining range. Unfortunately, the remaining range is exactly what you’re trying to calculate, so this algorithm is not very practical. But perhaps we can approximate it.

To reiterate: If the remaining range is n miles, then the ideal is to estimate this range by averaging the fuel consumption over the next n miles. We can’t quite do that, for two reasons. First, we don’t know what the average consumption will be in the n miles to come; it will depend on terrain, speed, traffic conditions, and many other imponderables. Second, we don’t even known what n is; that’s what we’re trying to estimate. But don’t despair. To cope with the first problem, we can choose some other interval of n miles as a surrogate for the n miles just ahead; the obvious choice is the n miles just behind us. As for the unknown quantity n, we calculate it iteratively. Make some initial guess r0 about the fuel consumption rate, perhaps using the long-term average since the car was manufactured. Multiply r0 by the gallons in the tank to get a first estimate n0 of the remaining range. Then take the average fuel consumption over the preceding n0 miles and again multiply by the number of gallons to get a better range estimate, n1. Only a few repetitions of this process ought to be needed to converge on a pretty good estimate of n. That estimate, of course, is what the dashboard meter will report.

With this scheme, as the estimated range gets smaller, it will also get more volatile, because the consumption rate will be averaged over a smaller interval. I argue that this tendency to wider fluctuations is not a failure of the algorithm. In the last few miles before the tank runs dry, the range really does depend sensitively on whether you’re descending a hill on the open highway or stopping and starting at a series of city traffic lights.

Something tells me I’m not the first person to think about this problem or the first to propose this solution. The same issues arise in lots of other contexts, such as predicting how long the battery will last in a laptop computer. If anyone has a plan that can beat mine, I’d be pleased to hear about it.

By the way, I arrived on time for Thanksgiving dinner, with a few drops left in the tank.

Conspiracy theories

Thursday, August 10th, 2006

The new issue (September-October) of American Scientist is available on the Web, and subscribers to the ink-and-paper edition should receive copies soon. My “Computing Science” column is more topical than I would like, given today’s headlines. I examine some ideas from graph theory and social-networking studies that intelligence agencies may (or may not—who knows?) be using to detect terrorist conspiracies.

Haµte cuisine

Sunday, August 6th, 2006

I can cook anything, as long as the recipe starts with “Take it out of the freezer” and ends with “Put it in the microwave.”

Go ahead and scoff, but I’m proud of my culinary accomplishments. Furthermore, I submit that the art of microwaving frozen foods is not without intellectual challenge. Inferior technique could leave some bits of your burrito still frozen while other parts are overcooked. The underlying cause of this well-known problem has lately been explored in depth and detail through computer simulations done by Motohiko Tanaka and Motoyasu Sato of the National Institute for Fusion Science in Toki, Japan. They present their results in an arXiv preprint.

The first mystery of microwave cookery is that it works at all. To heat a substance, you have to agitate the molecules, augmenting their random motions. Radiation at visible or infrared wavelengths does a good job of this: A visible or infrared photon is absorbed by a single molecule, which then goes bouncing off in some unpredictable direction. But microwaves are orders of magnitude too large and slow and weak to stimulate individual molecules. (In this respect the prefix micro is misleading; the wavelengths range from about a centimeter up to 10 or 20 centimeters.) A microwave’s energy is spread out over many millions of molecules—a situation that doesn’t seem like a very promising way to get them all moving in different directions.

One clue to how microwaves induce heating is what you’re not supposed to put in the microwave oven: aluminum foil. Metals have an abundance of free electrons, which are accelerated by the electric field of the microwaves. This field changes polarity a few billion times per second, and so the electrons slosh back and forth through the foil at that frequency. This motion of the electrons does not in itself constitute heating because it is not random; on the contrary, it is a highly organized oscillation—an alternating electric current. But the oscillating electrons collide with imperfections in the metal, and soon the orderly movement degrades into heat.

So much for tinfoil; most of us eat little in the way of metallic food. And nonmetals do not have an abundance of electrons (or other charged particles) free to move under the influence of an alternating electric field. What foods have is water. Although the H2O molecule is electrically neutral, it has positive and negative poles, where opposite charges congregate (most of the positive charge on the hydrogen atoms, most of the negative charge on the oxygen). When this dipole structure is immersed in an electric field, it takes up a preferred orientation, antiparallel to the applied potential. Since the microwave field changes direction at a frequency of a few gigahertz, the water molecules must continually flip or spin to keep pace. As with the sloshing of the electrons in tinfoil, this rotation of the water molecules can’t be considered heat because it’s too orderly; all the molecules are twirling in synchrony. But in liquid water the molecules are tightly packed, and as they spin they bump into one another like dancers on a crowded floor. These collisions randomize the motion, and so the temperature of the water rises.

Tanaka and Sato study this process in a simulated volume of water measuring about 40 Ångstroms on a side and containing about 2,700 water molecules. The molecules are attracted to one another through electrical forces, but there’s also a “hard core” repulsion that prevents them from getting too close. In the simulation, all of the forces acting on each molecule are recalculated every femtosecond (10–15 second), after which the positions and orientations of the molecules are updated. The system is allowed to come to equilibrium at a specified temperature during an initialization phase that lasts for a simulated time of 100 picoseconds (100 × 10–12 second), and then the microwave field is turned on for 500 picoseconds. Tanaka and Sato use an unrealistically intense field—about a million volts per centimeter—in order to speed up the process. (Although the simulated time is only about half a nanosecond, the actual running time on a cluster of four-processor Pentium workstations is 48 hours.)

In liquid water at 300 Kelvins (roughly room temperature), Tanaka and Sato find that microwave heating is quite efficient: The colliding, spinning molecules raise the temperature to about 350 Kelvins. But here’s the problem: In ice, unlike liquid water, Tanaka and Sato see almost no heating. The reason is that water molecules in an ice crystal are immobilized by strong electrostatic bonds, and microwaves have too little energy to break them free. In the oscillating microwave field, the ice molecules wobble back and forth a bit, but they cannot twirl, and thus they cannot collide. Tanaka and Sato don’t explicitly discuss the culinary implications of their work, but the inference is obvious: It’s because of this icy recalcitrance to microwaves that nuking a frozen burrito takes as much skill as baking a perfect soufflé or whipping up a sauce Bearnaise.

Interestingly, microwaves also lose much of their sizzle when water is superheated to 400 Kelvins. Under these conditions, the water molecules are easily set to spinning, but the bonds between them are so feeble that the rotation is not converted into random, thermal motion. I am tempted to see a certain philosophical significance in this curious behavior. There’s been much written about the specialness of the water molecule—most notably the geometric quirk that makes solid ice lighter than liquid water. If it were otherwise—if rivers and lakes and oceans could freeze from the bottom up—life would have had a hard time getting established on planet Earth. Now we know that modern slacker civilization also depends on a peculiarity of the water molecule. If we didn’t have this glorious interval of susceptibility to microwaves in a narrow window of temperatures near 300 Kelvins, I’d have to survive on Poptarts in the toaster oven.

Scheduled procrastination

Tuesday, June 27th, 2006

Forgive me if this story is slightly stale. I meant to write it two weeks ago, but for some reason I kept putting it off.

A paper titled “Scheduling Algorithms for Procrastinators” has been posted on the computer-science section of the arXiv. The authors are Michael A. Bender (Stony Brook University), Raphaël Clifford (University of Bristol) and Kostas Tsichlas (University of Patras). They confess that they got a late start on writing the article, and they finished at the last minute.

Their theme brings to mind another well-known paper: “The Effects of Moore’s Law and Slacking on Large Computations,” by Chris Gottbrath, Jeremy Bailin, Casey Meakin, Todd Thompson and J. J. Charfman (all of the University of Arizona). Gottbrath et al. considered the options that you face when you are about to begin a long-running computational project. Suppose the job would take three years of continuous work on the best computer available today. The obvious, no-nonsense approach to the problem is to get to work immediately with a state-of-the-art machine, and finish three years hence. But Moore’s Law—the widely noted observation that computer performance doubles every 18 months—suggests an alternative. You could go to the beach for a year and a half; then, returning with an enviable tan, you could buy a machine twice as fast as anything on the market now, so that you would still finish after a total of three years.

Bender, Clifford and Tsichlas consider a slightly different and more-general class of tasks—not necessarily computations—whose shared trait is that the rate of progress toward a conclusion is nonconstant; instead, the pace of work accelerates as a deadline approaches. This is a familiar phenomenon, at least for the dawdlers among us; personally, I find that my productivity gets a big boost on the night before a major project is due, especially if the penalty for missing the deadline is being flunked or fired. Bender and his colleagues focus on tasks with a simple linear speed law, where the rate of work increases steadily as the deadline nears. In particular, they assume that the initial speed is zero and the acceleration is constant until the job is done.

Given a set of such tasks, each with its own start time, deadline and linear speed function, can we find an optimum schedule? The answer is a curiously qualified maybe. If there is a feasible schedule (one where all tasks are completed on time), then there’s an algorithm for finding an optimal schedule (one where the total processing time is minimal). The trouble is, deciding whether or not a feasible schedule exists looks to be a hard problem. No known algorithm can settle the question in polynomial time, and it hasn’t even been established that the problem is in the class NP (nondeterministic polynomial time). For a problem in NP, you may or not be able to find a solution efficiently, but if a candidate solution is dropped in your lap, you can quickly test its correctness; even this limited ability is not guaranteed for the procrastinating scheduler (or the scheduling procrastinator). The root of the difficulty comes from an unexpected quarter. It’s a matter of numerical computation: Scheduled procrastination is hard because evaluating sums of square roots is hard. (Quick: Which is larger, √7 + √26 or √10 + √21?)

There’s more on the sums-of-square-roots problem at The Open Problems Project. For more on how the square-root problem relates to procrastination, see the paper by Bender, Clifford and Tsichlas.

As for me, I’ve got to run; I have a pressing deadline.

DCLXVI

Tuesday, June 6th, 2006

The ABC Evening News ended today’s broadcast with a riff on the date—06/06/06. We heard about a 6-pound, 6-ounce newborn, and a 66-year-old man celebrating his birthday. I’ve always been a little perplexed by this fascination with 666. The book that started it all (Revelation, or The Apocalypse of John) was written a full millennium before the Hindu-Arabic numeral “6″ appeared on the scene. The idea of place-value notation, which is essential if we want to give any meaning to “666,” was still centuries in the future. If we’re worried about ‘the number of the beast,” shouldn’t we be looking out for the dread sequence of symbols DCLXVI?

Room 641A

Wednesday, May 24th, 2006

These are the days of miracle and wonder
This is a long distance call

—Paul Simon

As a person who occasionally sends e-mail and talks on the telephone, I’ve been following with interest and curiosity all the recent press reports about alleged eavesdropping and data-mining by U.S. government agencies. Mathematically, the most intriguing part of this story has to do with the analysis of the call-detail graph, the big database of phone-company records that shows who calls whom (without revealing who said what). For now, though, I want to set that topic aside and focus on the rumors that someone might actually be listening in on my telephone conversations or reading over my shoulder when I’m online.

The current round of controversy began last December with a story by Eric Lichtblau and James Risen in The New York Times, claiming that the National Security Agency, with the cooperation of telephone companies, had “traced and analyzed large volumes of telephone and Internet communications flowing into and out of the United States.” Of course there has been speculation for many years that the NSA surreptitiously monitors international communications; in a fabled program known as Echelon, the agency supposedly intercepted satellite signals and deployed divers or submarines to secretly tap undersea cables. The new allegations describe a much less arduous procedure. According to Lichtblau and Risen, “senior government officials arranged with officials of some of the nation’s largest telecommunications companies to gain access to switches that act as gateways at the borders between the United States’ communications networks and international networks.”

The Lichtblau and Risen account was short on specifics, offering no hint of which switching centers had been tapped, or even which companies might be participating. But in April a more-concrete assertion surfaced. Mark Klein, a retired technician for AT&T, described the installation of equipment that he identified as NSA surveillance gear at an AT&T facility in San Francisco. He said the monitoring hardware was installed late in 2002 or early in 2003 in room 641A at 611 Folsom Street, a building then owned by SBC Communications, where three floors were occupied by AT&T. (SBC has since merged with AT&T.) Furthermore, Klein said, a coworker had told him of similar surveillance outposts in Seattle, San Jose, Los Angeles and San Diego. Since there’s no obvious reason to single out the West Coast, it’s a reasonable inference that if any of these reports are true, then similar eavesdropping facilities may have been built in other major cities as well.

Klein is a witness in a law suit brought against AT&T by the Electronic Frontier Foundation. Documents that Klein submitted as evidence have been put under seal by the court, but a few items have apparently leaked out, and on Monday Wired News published what purports to be the complete collection of documents—or at least that’s the suggestion conveyed by the headline, “Whistle-Blower’s Evidence, Uncut.” An accompanying story says the material came from “an anonymous source close to the litigation.” The documents include several memos, tables and diagrams labeled “AT&T Proprietary.”

Much about the case remains murky. Still, if we accept Klein’s interpretation of the documents, there’s enough information to begin asking some quantitative questions.

Drinking from the Firehose

From Klein’s account, it seems this particular installation monitors only the packet-switched network (roughly speaking, the Internet) and not the circuit-switched part of the communications infrastructure (including most telephone calls). We are told that fiber-optic cables carrying traffic to and from 16 other Internet providers and exchanges had “splitters” installed, so that a replica of the signals could be diverted into room 641A. The 16 fiber circuits operate at various data rates: four at OC-3 (roughly 150 megabits per second), eight at OC-12 (600 megabits per second), and four at OC-48 (2400 megabits per second). The total bandwidth is thus about 15,000 megabits per second. Converting this figure from the bits-per-second convention of the communications industry (where a million is 106) to the bytes-per-second world of computing (where a million is 220), we arrive at a data rate of a little under 2 gigabytes per second. That’s the maximum capacity of the fibers, but they surely do not run full all the time. If we suppose that the load factor is 50 percent (a number I’ve just plucked out of thin air), then room 641A is taking in 1 gigabyte per second, or 86 terabytes per day.

Coping with such a flood of data is a significant challenge, but it’s not totally beyond imagining. In high-energy physics and astronomy a few experiments will soon be generating data volumes of the same order of magnitude. If the NSA wanted to record the entire data stream for later perusal, it could probably be done, at least for brief bursts. One item of equipment in room 641A, according to the Klein documents, is a Sun StorEdge T3 disk array; the exact model is not indicated, but the largest version available in 2003 held 168 terabytes of storage space—enough to last a couple of days. Much more capacious storage arrays—with capacities beyond a petabyte (1,000 terabytes) are readily available now. (But the bandwidth of the connections being monitored has probably also grown in the past three years.)

It’s been said that if Google can index the entire World Wide Web, and even store many of the pages in its cache files, then surely the NSA can do the same. But the task is not, in fact, the same. For one thing, the Internet is not just the Web; there are many other streams of data passing over the network, including everything from e-mail and “instant messages” to FTP transfers. What’s more important, Google sees the Web mainly as a quasi-static structure, made up of pages that it can visit on its own schedule, and which it needs to re-index only after some change is noted. A sniffing device installed on an Internet backbone circuit has a very different view of the Web. If 100,000 people visit yahoo.com, then the monitor sees the Yahoo home page streaming by 100,000 times. This redundancy complicates the eavesdropper’s task; on the other hand, it also offers the potential of capturing additional information. You not only learn what’s on every Web page but also how many people are visiting that page, and maybe even who they are.

Frankly, trying to squirrel away a copy of everything is a brute-force strategy that seems unlikely, unnecessary and a bit dull-witted. It would merely defer the main problem: At some point you have to digest all that information. There’s no point in collecting things that you’ll never have time to read. Better to be more selective up front, scanning the bit stream as it rushes by, and only saving items that look like they might be worth closer scrutiny. The Klein documents support the hypothesis that room 641A is set up for such sifting. On the manifest of equipment to be installed in the room, the most distinctive item is a machine called a Narus STA 6400. Narus, Inc., is a company headquartered in Mountain View, Calif., whose name is said to derive from the Latin gnarus, meaning all-knowing. The STA in STA 6400 stands for Semantic Traffic Analysis, which is presumably meant to give the impression that the device can classify Internet transmissions by their meaning.

According to the Narus Web site, the company got its start helping Internet service providers monitor, measure (and in some cases police) traffic on their own networks. For example, some carriers prohibit voice-over-internet-protocl (VoIP) telephone calls. The STA 6400—like the new model that has supplanted it, called the NarusInsight—is said to distinguish VoIP packets from other kinds of traffic. Other network operators might want to detect or regulate peer-to-peer file sharing; again, the Narus software claims to recognize such activity. These tasks of discrimination are not easy, but in my opinion they don’t quite rise to the level of “semantic” analysis. Recognizing a packet as belonging to a Skype conversation or a Kazaa download doesn’t penetrate very deeply into the packet’s meaning. And merely classifying packets according to their type or protocol doesn’t look like a very promising approach for detecting terrorist activity. But we can’t know what the NSA is doing with this equipment—if indeed it exists and if they are operating it.

Grepping the Net

For the sake of argument, suppose it’s all true, and the NSA is equipped to intercept and evaluate every bit that passes through the global Internet. If you were an analyst expected to retrieve useful information from this data stream, what kinds of patterns would you look for? Here are some possibilities that occur to me. I’d be interested to hear other ideas.

  • Tracking specified individuals. If we knew the IP number of Osama bin Laden’s computer, we could slurp up every packet traveling to or from that machine, and reconstruct all the details of his Internet activity, just as if we were looking over his shoulder. (Then again, if we knew his IP number, we would also have a pretty good idea of his physical whereabouts, and we could go have an offline chat.) Domain names, e-mail addresses, account names and other kinds of identifiers could also provide a key to tracking known individuals. However, if surveillance of targeted individuals is the main aim of the program, wiretapping the entire Internet seems like a grotesquely wasteful way to go about it. There are more direct and efficient means of doing the same thing.
  • Monitoring specified organizations. Once upon a time, this was the raison d’être of intelligence services—planting bugs in embassies, breaking the codes of the KGB or the GRU. Comprehensive Internet surveillance is presumably useful for such purposes too, although Al Qaeda doesn’t have an embassy.
  • Getting inside the envelope. Some of the people charged in the train bombings in Madrid in 2004 are said to have communicated through a shared Yahoo e-mail account, from which they never actually sent any e-mail. According to the Spanish courts, one conspirator would log onto the account and write a draft of a message, leaving it unsent; then the rest would log on, using the same name and password, and read the draft. Perhaps they thought that an unsent message would never be exposed to eavesdroppers, but that is a misconception; in fact the text does flow over the network every time someone views it. (How else could it be displayed on the screen?) There’s no evidence that the NSA actually spotted any such messages, but in principle they could, if they knew what to look for, and if the packets passed through a node of the network where they had a listening post.
  • Staking out the watering hole. When we think of wiretapping the Internet, what comes to mind first is reading people’s e-mail, or maybe their instant-message traffic. These are person-to-person modes of communication that seem to offer some modicum of privacy—which may be why we suppose that an intelligence agency would want to pry them open. Other Internet forums, such as usenet news groups, are so public that there’s no need for skullduggery to read their contents. Nevertheless, it’s possible that e-mail would not be the main target of surveillance, and that public areas of the Internet would attract considerable attention. For example, the NSA might find it worthwhile to compile lists of visitors to certain web sites (maybe even this one).
  • All of the above. A government agency with legal clout and insider knowledge could gather all this information by more-conventional and less-intrusive means—but what a pain to have to pursue each case individually. If I were suspected of terrorist activity, I’m pretty sure my Internet service provider would give up the goods on me, but there’d be so much paperwork to do, and then the same rigmarole all over again with the hosting service that provides space for this Web site, and with Google for my gmail account, and so on. A tap on the Internet backbone offers one-stop shopping. No need to ask amazon.com what books I bought. No need to ask Google what terms I searched for. You just scoop it out of the river as it floats by.
  • Grepping the Net. Spying on people who are already known or suspected malefactors may be the routine business of the intelligence services, but the mythic promise of the all-seeing surveillance device is the possibility of uncovering a conspiracy de novo, breaking up the plot before it’s even hatched. Legend has it that the heart of the Echelon program was blind scanning of intercepted communications traffic for interesting words or phrases. And so now we can imagine the NSA programming its Narus STA 6400s to grep “bomb \(plot \| conspiracy)” against the entire Internet. Maybe this works. I’m skeptical.
  • Retrospective analysis. Here’s a reason for trying to save a copy of the entire Internet data stream, at least for a period of days. On September 10, 2001, an e-mail mentioning AA11, UA175, AA77 and UA93 would not have attracted the slightest attention, but a day later those flight numbers were notorious. I have no idea whether such an e-mail was ever sent, but I’m sure that investigators would have welcomed an opportunity to find out.
  • Social network analysis. The reported interest of the NSA in telephone call-detail graphs hints at an attempt to discover communities of people with shared interests. The most clear-cut case is to identify cliques in the graph: sets of people who have all called one another. In the case of the long-distance telephone network, the databases of call records already exist, compiled by the telephone companies for their own purposes. Similar kinds of social-network analysis—possibly even more interesting—could be done with Internet data, but as far as I know, no one keeps the necessary records. Compiling such a database strikes me as a plausible motivation for the kind of monitoring the NSA is said to be doing. The closest analogy to the telephone case would be looking for groups of people who have all exchanged e-mail (or other kinds of direct messages), but the technique is not limited to that. Someone might also take an interest in finding groups of people who have all visited the same set of Web sites within a certain period of a time, or downloaded the same files. And it’s notable that this kind of information is particularly easy to acquire; it mostly depends on addresses of data packets, not their content.
  • Jiggery-pokery. It would be unfair to conclude this list without mentioning some of the more cynical speculations about the likely uses of an Internet spinal tap. According to the Bush administration, the program of warrantless wiretapping was approved in the weeks after 9/11 as part of an effort to find the people responsible for the attacks and prevent any recurrence. But the search for the 9/11 conspirators led mainly to shadowy figures in Afghanistan. In 2001, Afghanistan was possibly the least-wired place on earth; the Taliban had banned the Internet, and there was only one computer in the country with a sanctioned connection. Thus the Internet would not seem to be the most obvious place to look for “chatter” among those hiding out in the caves of Tora Bora. On the other hand, the Internet is an excellent place for keeping track of political opponents or watching out for signs of disloyalty within your own ranks. (But that would be wrong.)

Occam’s Other Razor

Does any of this make sense? Set aside all questions of what’s legal or moral or even prudent, and simply consider what’s feasible. Can you listen in to the simultaneous clatter of several hundred million computers and extract anything of value from all the noise?

It’s easy to find people who aren’t hiding, but they’re probably not the ones you want to catch. And for those who do want to conceal their presence and their intentions, the Net offers many, many dark corners, despite room 641A. For example, it’s well known that 70 or 80 percent of all e-mail is spam, but the NSA dare not ignore it. If I were sending signals to the members of a covert cell, I would be tempted to embed my message in an offer for V1@gra or replica wristwatches. Indeed, much spam comes with an addendum of apparently random text, meant to fool spam-blocking filters. How do we know that’s all it fools? Or one could ride piggyback on the high-bandwidth BitTorrent, hiding out among the kids trading music tracks and warez. Then there’s cryptography; more on that below.

The story told by Mark Klein is highly plausible, and the published documents make a strong case, yet there are a few doubtful details. None of the documents that come from AT&T mention the NSA or any other government agency. The NSA connection comes solely from Klein’s own testimony. In statements to the press he has said that the NSA interviewed AT&T employees for a job that would involve working in room 641A, and eventually hired someone. Similarly, in a statement published by Wired News, Klein says, “only people with security clearance from the National Security Agency can enter this room.” Although I’ve personally never had any dealings with the NSA (as far as I know), this story sounds fishy. Would the NSA—long known as “No Such Agency”—identify itself so openly when interviewing telephone company employees? Do they hand out cards that say “NSA Security Clearance”?

But if not an NSA snooping roost, what is in room 641A? One possibility is a traffic-monitoring facility that AT&T might have built for its own use. Some years ago AT&T developed a system called PacketScope, not too different from the modern Narus devices; papers in the open literature [see update below] discuss the installation of PacketScope equipment at two AT&T Internet hubs. Other network operators do the same kind of monitoring. They want to know where their traffic comes from and goes to, and what their customers are doing with the bandwidth. But if the 16 splitters and the Narus STA 6400 and the Sun StorEdge T3 all have such an innocent explanation, why doesn’t AT&T say so? Not everyone would believe them, but the denial would certainly muddy the water. Instead AT&T has released a bland statement about their commitment to privacy and their obligation to assist law enforcement. And then the federal government intervened to dismiss the suit against AT&T, claiming that the issue cannot be adjudicated without revealing state secrets and endangering national security. It’s almost as if they want us to believe the worst.

Okay, I’ll go along. And I’ll keep in mind that this is the looking-glass world, ruled by Occam’s other razor, the one that favors whatever explanation is most convoluted and counterintuitive. Why would the government want the whole world to know that we’re listening? Here’s my best shot at an answer: Maybe to encourage people to encrypt their most-sensitive communications. How would that benefit the eavesdroppers? Packets carrying encyrpted text should be easy to recognize because of their “flat” statistics; they label themselves suspicious. But, you say, it’s no use recognizing an encrypted message if you can’t read what’s inside. Who can’t read what’s inside?

Update 2008-09-07: The “papers in the open literature” mentioned above are no longer quite so open. The link above has gone dead; the paper is still listed in the bibliography of Ramon Cáceres, but links to PDF and Postscript versions have been removed. As of today a PDF can still be downloaded through Citeseer.