Archive for the ‘modern life’ Category

The Right Click

Saturday, January 21st, 2012

For a few hours yesterday the front page of the New York Times was stealing right clicks. If I right-clicked on a hyperlinked headline (or option-clicked, or made a two-fingered tap on the trackpad), I did not get the usual context menu; instead, I was taken directly to the target of the link. This is the proper behavior for an ordinary mouse click—or a left click with a two-button mouse—but not for a right click.

The first time this happened, I thought it was just a slip-of-the-finger, but the error was consistently repeatable across two different machines and three different browsers (Firefox, Chrome, Safari). Furthermore, it affected only the New York Times. Indeed, it was only the front page of the Times that was misbehaving; right clicks elsewhere in the paper worked normally.

The cause of this problem may have been an innocent goof, but I’m skeptical. When the Times first put up a paywall, not quite a year ago, readers quickly found holes in it. One of those holes involves right-clicking a link to get a copy of the URL, pasting it in the browser address bar, and removing the referrer cruft following the question mark. My guess is that someone at the Times decided it was time to close the hole.

I hasten to add that freeloading is not my reason for right-clicking on Times headlines. I pay my $15 per doublefortnight. But my newsreading habit is to peruse the entire front page, opening each article that interests me in a separate tab. The “open in new tab” command lives in the right-click contextual menu.

Regardless of why the Times was interfering with my Second Amendment right to bear mouse buttons, I was curious about how they were doing it. They weren’t just disabling the contextual menu entirely. (You can read a scornful account of that nefarious practice at About.com, which identifies itself as “A part of The New York Times Company.” (Not, in my view, the best part.)) On the NYT front page, right clicks worked as usual in ordinary text; they were only hijacking right clicks on links.

Regrettably, I’m not going to be able to answer the how’d-they-do-it question. Before I could find the offending code, some grownup at the Times called off the whole crazy experiment, and normal right-clickery was restored.

Although I couldn’t find the click-stealer, I found plenty else. The Times, it seems, prints all the JavaScript that fits. Some of it is unsurprising. jQuery is loaded. There are scripts to run slide shows and videos, to manage cookies, to serve ads, to provide menus and other navigation aids. But there’s lots more:

  • beacon.js This may have something to do with all those little files named 1px.gif floating around like packing peanuts.
  • revenuescience.js Apparently a product of an outfit called Audience Science. “AudienceScience is processing trillions of behaviors per day and over 270 billion attributes at any given moment.” You don’t say.
  • krux-4.7.2.js The web site of Krux (which I had never heard of before) says: “Krux helps large and small websites control, energize, and responsibly monetize consumer data across screens and sources.” Reading further, I get the impression they are in the business of preventing snoopers from snooping on the snoopers who snoop on us. I’m certainly not having much luck snooping on their code. It looks like this:

    function(a){e(a)||A(b,c(a))}),h(b,c(a[1]),e(f)?f:function({o.js.apply(null,j)})):h(b,c(a[1]));

  • gw.js Even deeper obfuscation. I believe this is a JavaScript program whose function is to write another JavaScript program into the page header. It seems to be one of the tools that Audience Science uses to process those trillions of “behaviors” per day.

Phooey on them, I say.

Pretirement

Wednesday, November 23rd, 2011

As a high school kid in the 1960s, I wrote a snarky term paper arguing that retirement is wasted on old people. By the time you get your promised years of leisure, you’re too worn out to enjoy them. So I proposed a new order of working life: Everybody gets five or ten years off at the start, when they’re still full of spunk, in exchange for a promise to keep trudging away on the treadmill right up to the end.

I wasn’t able to arrange such a pretirement for myself, but the world now seems to be coming around to my way of thinking. Here’s some evidence, with data courtesy of the Bureau of Labor Statistics:

employment-to-population ratio for age groups 16-24 and 65+

The proportion of Americans who stay on the job after age 65 was falling steadily for many years and got down to about 10 percent in the 1980s; but it has been rising since then, and the rate of increase accelerated after 2000. Today almost 17 percent of the 65+ cohort are still working. Meanwhile, the analogous curve for youths aged 16 to 24 is pretty much a mirror image. The employment rate peaked in the 1980s and has been declining since then. In the years between 2000 and 2010 it fell from just under 60 percent to 45 percent.

My pretirement hypothesis—the notion that we’re giving people an opportunity to waste their youth on the golf course rather than their old age—is just about the most benevolent interpretation one could possibly put on these trends. A less-rosy reading of the same data puts the blame on old geezers like me who just won’t get out of the way and give the youngsters their turn. For some reason, this view seems to be prevalent among recent grads who expected a job offer at the end of their studies but instead got only a bill from Sallie Mae.

An op-ed piece in the Sunday New York Times takes issue with this sour diagnosis. Edward L. Glaeser, an economist at Harvard, argues that what motivates the elders who linger in the work force is not greed, selfishness or indifference to their children’s aspirations; it’s economic necessity. Their houses are underwater; their 401k’s have swooned; they can’t afford to retire. Furthermore, the kids should be grateful that grandmom and grandpop have hung on to the family business:

It’s counterintuitive, but the forever work life of older Americans may turn out to be a good thing for young workers…. Recent studies in Britain and Germany find a positive correlation between labor-force participation among the elderly and youth employment. It’s not that older workers never crowd out younger workers, but there are myriad ways in which older workers also increase employment among the young. As older workers earn more, they can afford to buy more products produced by the young. Older workers may be entrepreneurs who employ younger workers, and they may pass along valuable skills to the young.

America has a terrible youth unemployment problem…. We have reason to worry that the current economic slowdown will create a lost generation of Americans who are now in their 20s. But it’s a mistake to imagine we can fix the problem of youth unemployment by encouraging older workers to retire.

According to Wikipedia, Glaeser is 44 years old, right in the middle between the involuntary pretirees and the never-gonna-retirees.

Glaeser doesn’t discuss the demographic context of these changes, and neither did I in my high school term paper. Looking back on it now, I see a serious flaw in my proposal. Retirement plans, such as the Social Security system, work best with a pointy population pyramid, so that a wide base of young earners supports a smaller number of pensioners. My plan called for reversing the flow of resources, which would not have worked out well given the age structure of the U.S. population in the 1960s, with my own generation of Baby Boomers fattening the base of the pyramid. But the situation is different now; the pyramid is slimming down, and citizens in their 60s may soon outnumber those in their 20s. Maybe pretirement is worth a second look.

Driving the dreamboat

Wednesday, August 17th, 2011

RCA electronic car of tomorrow ad

Slide behind the wheel of this dreamboat. Push the electronic control button. Then sit back and let transistors take over.

There’s something curiously tentative about this vision of the future of motoring, as seen from 1964. You’re invited to push the button and let the transistors take over. But you’ve still got your hands on the wheel; apparently you’re still responsible for driving the dreamboat.

Other early discussions of automatic automobiles are also fuzzy about exactly who or what is in charge. A notable example is the General Motors Futurama exhibit at the 1939 World’s Fair in New York. “Safe distance between cars is maintained by automatic radio control,” intones the narrator, above creepy organ music. This certainly suggests something other than seat-of-the-pants driving. But the next sentence narrows the scope of that automatic control: “Curved sides assist the driver in keeping his car within the proper lane under all circumstances.” Thus the technology is merely assistive, not autonomous. And what’s that about “curved sides”? Norman Bel Geddes, the designer of the exhibit, explains all in Magic Motorways, published in 1940. It’s very low-tech. Freeway lanes are to be separated by high curbs of concave cross-section, which deflect a straying car back into its lane. (Later in the book Bel Geddes also discusses more elaborate guidance systems, involving buried conductors.)

The reprise of Futurama at the 1964 World’s Fair—an exhibit that I attended, along with 29 million other people—was even vaguer about the question of autonomous vehicles. We saw lots of miniature automobiles moving in close order along gleaming freeways, and personally I came away with the impression that all those vehicles were under computer control. But the transcript of the narration includes only a single sentence on the topic, and it’s open to almost any interpretation: “Vehicles electronically paced, travel routes remarkably safe, swift and efficient.”

Why so coy about the prospect of cars that would drive themselves without human intervention? Maybe the concept was just too outlandish for credibility, particularly in 1939. Or maybe GM recognized that their natural audience is made up of car enthusiasts, who want to drive their dreamboats, not just be carried along as electronically paced, radio-controlled passengers.

In any case, the coyness has now evaporated, and these days everybody is talking about truly autonomous vehicles. DARPA runs contests for them; an Italian group has driven them across Europe and Asia; Google has a “secret” fleet of them. And I too am talking about autonomous vehicles: “Leave the Driving to It” is my latest American Scientist column.

Note: The artwork above is from an RCA advertisement in the September 1964 issue of Scientific American. Stylistically, the painting owes something to the Futurama exhibits, but I’d like to make a wild guess that the (uncredited) artist who created this rendering lived in Minneapolis. That brightly lighted, colonnaded building to the right of center looks to me very much like a building at Hennepin and Washington (now owned by ING) that was completed in 1964, just as this ad appeared. The architect was Minoru Yamasaki, the designer of the World Trade Center.

Only correlate!

Saturday, May 28th, 2011

I’m not actually a shill for Google Labs, although it may seem that way from all my recent (and ongoing) attention to the Google Ngram Viewer: four posts (1, 2, 3, 4) and an American Scientist column, so far. What I particularly like about Google Labs is that they share their toys. They create Big Data projects that everybody can play with. For those of us without a server farm on the back 40, that’s a rare opportunity.

The latest Labs release is Google Correlate. If you have a time series (data expressed as a function of date, for any subinterval of the period since 2003), Correlate will try to identify Google search queries that exhibit a similar temporal pattern of activity. All this is easier to understand with an example. For a specimen time series, consider the interest-rate index known as the 1-year CMT, which is published every week. I scraped seven years of CMT data from this web site, and uploaded the file to Correlate. I got back a list of 100 phrases whose popularity as Google search terms has followed a trajectory more or less similar to that of the interest rates. As it happens, none of those highly correlated terms has an obvious connection to financial affairs. Roughly half of them are related to cell phones (“cingular” and “treo” turn up over and over). But the term with the strongest correlation (r=0.9751) is the phrase “pill identification”:

graph of time-series correlation between 1-year CMT interest rate data and Google searches for 'pill identification'

In other words, the gradual rise in interest rates during the early 2000s was paralleled by a steady growth in the number of people seeking help in identifying the contents of mysterious unlabeled vials in the medicine cabinet. Then, sometime in 2007, both trends reversed direction. Why should these particular variables be so closely correlated? If there is a reason, I have no idea what it is. And I must immediately insert the obligatory disclaimer: Correlation is not causation. Emphatically so in this case. If you are trying to predict the future course of interest rates, I do not recommend tracking popular interest in pill identification. Or vice versa.

At a more personal level, there’s a time series I have been tracking since 2007: the volume of spam arriving in my email inbox. My records are monthly, whereas Google Correlate wants weekly data, so I did some resampling and smoothing, and came up with this:

graph of correlation between Brian Hayes's spam receipts and the Google query 'ashford blackboard login'

The best match, shown in the graph, is the mildly enigmatic query “ashford blackboard login.” Many of the other correlated series suggest a seasonal theme that I can understand in retrospect but that I did not see coming before looking at the results: “honda accord 2009,” “celica 2009,” “rav4 2009,” “2009 altima coupe,” “new cars 2009,” “2009 ranger,” etc. The most distinctive features of the spam curve are a peak in the fall of 2008, a deep dip the following winter, and an even stronger surge in the summer of 2009. Evidently shoppers for cars in the 2009 model year followed a similar trend line. (But again I would caution that spam volume is unlikely to be a good predictor of automobile sales.)

These results might be taken to suggest that every conceivable time series must be correlated with some set of Google queries, however farfetched the association. I tried submitting a few random walks, covering the same time span as the spam series, and they too fetched up matching queries from the Google database:

correlation graph for a random walk and the query 'att tilt software'

At the opposite end of the spectrum from a random walk, I tried some rigidly artificial probes, such as a series with nonzero entries only in the month of May. Sure enough, there are search-engine queries that follow the same recurrent annual pattern:

correlation of a time series with nonzero entries only in the month of May and the Google query 'j labs'

A time series that has all of its energy concentrated in a single pulse elicits from the database a variety of flash-in-the-pan topics—queries that came and went and were never heard of again.

correlation of a time series with a pulse in September 2005 and the query 'wolframtones'

Without too much work we could enumerate all such one-month wonders.
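The three kinds of artificial probe described above are easy to manufacture. What follows is a minimal Python sketch, not the code actually used for these experiments; the function names, the Sunday-based weekly grid, and the 2004 to 2010 span are all assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta

def weekly_dates(start, end):
    """Dates one week apart, from start up to and including end."""
    d = start
    while d <= end:
        yield d
        d += timedelta(weeks=1)

def random_walk(dates, seed=0):
    """Each week's value is last week's value plus a Gaussian step."""
    rng = random.Random(seed)
    level, series = 0.0, []
    for d in dates:
        level += rng.gauss(0, 1)
        series.append((d, level))
    return series

def may_only(dates):
    """Nonzero entries only in the month of May, every year."""
    return [(d, 1.0 if d.month == 5 else 0.0) for d in dates]

def single_pulse(dates, year, month):
    """All of the energy concentrated in one month of one year."""
    return [(d, 1.0 if (d.year, d.month) == (year, month) else 0.0)
            for d in dates]

# Hypothetical grid: every Sunday from early 2004 to the end of 2010.
dates = list(weekly_dates(date(2004, 1, 4), date(2010, 12, 26)))

walk = random_walk(dates)
annual = may_only(dates)
pulse = single_pulse(dates, 2005, 9)
```

Enumerating the one-month wonders would then be a matter of calling single_pulse once for each month in the span and submitting each result in turn.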

It is not the case, however, that every possible time series has a close correlate somewhere in the Google collection. Here is an example of a series for which Correlate finds no query that matches closely enough to bother reporting:

weekly driving mileage, late 2003 to late 2010

This is a weekly record of miles driven in the family car. Should we be surprised that not a single series among the tens of millions of queries in the Google database comes close to matching this pattern? One approach to this question is to ask just how many series of this kind might exist. The mileage record covers 364 weeks. As a lower bound, suppose the mileage associated with each week could have just two possible values: either we drove the car or we didn’t, so the mileage is either zero or greater than zero. Then there are 2^364 (or about 10^110) possible time series, many orders of magnitude greater than the total number of Google searches since the company was founded. Thus the set of queries in the Google archive must be an extremely sparse subset of all possible time series. Most of the series we could construct would necessarily come up empty. (I note in passing that there’s interesting structure in that mileage log of mine, which I never knew about until I graphed it, but that’s a story for another day.)
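The size of that lower bound is easy to check. The short Python sketch below counts the series for a 364-week record with two possible values per week and expresses the count as a power of ten; the variable names are my own.

```python
import math

WEEKS = 364  # length of the mileage record, from the text

# Lower bound: each week is either "drove" or "did not drive".
count = 2 ** WEEKS

# The same number written as a power of ten.
exponent = WEEKS * math.log10(2)

print(f"2^{WEEKS} has {len(str(count))} decimal digits")
print(f"2^{WEEKS} is about 10^{exponent:.1f}")
```

The count is a 110-digit number, roughly 10^109.6, which is the "about 10 to the 110th" order of magnitude quoted above.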

A really interesting question is how Google Correlate does it. Even with “only” tens of millions of queries in the database, comparing a submitted series with all the candidates would be impossibly expensive. A white paper explains:

In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.

Thus the basic strategy is precomputation: Spend a lot of time in advance computing a succinct signature or hash associated with each time series in the database; then quickly compare hash values when looking to match a submitted time series.
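To make the two-pass idea concrete, here is a toy Python sketch. It is emphatically not Google's algorithm: the white paper's hash is replaced by a much cruder stand-in (the averages of eight equal segments of the normalized series), and the database is a couple of hundred made-up random walks rather than tens of millions of queries. All function names and parameters are assumptions.

```python
import math
import random

def normalize(series):
    """Shift to zero mean and scale to unit length, so that the dot
    product of two normalized series is their Pearson correlation."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        return [0.0] * len(series)  # constant series: correlation undefined
    return [x / norm for x in centered]

def signature(series, chunks=8):
    """Stand-in for the hash: the mean of each of `chunks` equal segments.
    Any leftover points at the end of the series are ignored."""
    size = len(series) // chunks
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(chunks)]

def correlation(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def build_index(database):
    """Precomputation: done once, ahead of any query."""
    return {name: signature(normalize(s)) for name, s in database.items()}

def search(target, database, index, shortlist=10, results=3):
    # Pass 1: cheap approximate distance, computed on the short signatures.
    t_sig = signature(normalize(target))
    def approx(name):
        return sum((p - q) ** 2 for p, q in zip(t_sig, index[name]))
    candidates = sorted(index, key=approx)[:shortlist]
    # Pass 2: exact correlation, computed for the shortlist only.
    scored = [(correlation(target, database[name]), name) for name in candidates]
    return sorted(scored, reverse=True)[:results]

def random_walk(n):
    total, walk = 0.0, []
    for _ in range(n):
        total += random.gauss(0, 1)
        walk.append(total)
    return walk

random.seed(1)
database = {f"query {i}": random_walk(364) for i in range(200)}
target = random_walk(364)
# Plant one series that really does track the target, plus a little noise.
database["planted"] = [x + random.gauss(0, 0.1) for x in target]

index = build_index(database)
best = search(target, database, index)
print(best[0])
```

The point of the design is that the expensive full-length comparison runs only on the shortlist; everything else is judged by its eight-number signature, which was computed before the query arrived.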

A few further miscellaneous notes:

Google Correlate evolved from earlier work on tracking influenza outbreaks by monitoring search-engine queries. Initially this required a batch computation lasting hours, even when run on hundreds of computers. The new hash-based search takes less than a second. (Algorithms and data structures still count for more than hardware.)

Google Correlate includes a geographic component alongside the temporal database. If you have data distributed over the 50 U.S. states, you can retrieve Google queries that exhibit a similar spatial pattern. (I have not experimented with this system.)

Even if you don’t have a time series or a geographic data set of your own, you can play with the new service by cross-correlating one search query against others. For example, enter the term “solstice” in the search box, and you’ll see a graph with exactly the pattern of twice-a-year spikes that you might expect. You also get a list of other search terms whose temporal pattern has similar features. One of those correlated terms is “italian seafood salad.” A glance at the corresponding graph suggests there’s only half a correlation in this case:

correlation of 'solstice' and 'italian seafood salad'

I didn’t know until just a few minutes ago that frutti di mare was a dish to be eaten at the winter solstice.

Oh, the places I’ve been!

Wednesday, May 11th, 2011

When I heard the rumors that my iPhone was tracking my movements and keeping a log of location data, I was annoyed. Now that Apple has fixed this bug/feature, I’m even more annoyed. There’s just no pleasing me.

iPhone location map for a trip through northern California and southern Oregon

The fuss began three weeks ago, when Alasdair Allan and Pete Warden announced at a conference that iPhones record a location fix every few seconds, based on position with respect to cell-phone towers and wifi access points. The log file is saved on the phone, they said, and also transferred to any computer the phone syncs with, in the form of an SQLite database named “consolidated.db.” Allan and Warden wrote a handy Macintosh application, iPhoneTracker, that extracts the geographical markers and displays them on a map. At left is the record of a trip I took last summer through northern California and southern Oregon, as traced by cell phone towers along the way. There are lots of mysteries and spurious details among those dots, but the broad outline of the route is displayed clearly enough: a counterclockwise loop up I-5, across the Cascades to Coos Bay, and back down to San Francisco on U.S. 101.

Here’s another map-pin travel diary, a memento of a brief visit to Pittsburgh for a meeting at Carnegie Mellon:

Pittsburgh

In this case the dots represent wifi hotspots rather than cell-phone towers. Lots of hotspots! They look like a swarm of bees. The cluster at lower left is the CMU campus, where the meeting was held. Moving northeast, the nearest clump of dots is the vicinity of my hotel. The rest of the dots, further north and east, trace the routes of a couple of walks I took, out looking for dinner. Curiously, a third long walk doesn’t show up at all, even though I had the phone with me, and indeed used the Maps app to get unlost a couple of times.

(I should mention that the version of iPhoneTracker distributed by Allan and Warden does not show wifi locations, and it plots cell towers on a rather coarse grid. But the software is open-source, and those limitations are undone with a couple of easy edits.)

Allan and Warden’s discovery of the iPhone location database wasn’t exactly new. Alex Levinson reported some months ago on an earlier version of the location log. And last July Apple explained its “location services” privacy policies in considerable detail in response to an inquiry from two Senators. No one took much notice of those earlier reports, but the new one caused a ruckus. It soon emerged that Android devices are collecting similar information and sharing it with Google. Yesterday, both Apple and Google were grilled in Congress.

The ruckus has mostly been about quaint 20th-century notions like personal privacy. I have my own worries on that score, but what irks me most is not that my phone is storing this information but that Apple gives me no access to it. If I’m going to help them build an immense database of cell towers and wifi beacons, then surely I should at least be able to retrieve and display my own coordinates, no?

The Allan and Warden program fills this need to some extent. And last week the New York Times bits blog announced a cloud-based approach that might be even better. They are inviting iPhone users to upload their location information to a service called OpenPaths, where you can build animated maps of your own peregrinations and, perhaps, if you choose, share the data for research purposes.

There’s just one problem. Apple’s response to the invasion-of-privacy complaints was to issue an operating system update that will make it even harder (probably impossible) for me to get access to my own data. After I install the update, my phone will not stop collecting geographic information, nor will it stop reporting location fixes to Cupertino, but it will encrypt the file so that I can’t read it. Maximally annoying. Geolocation without representation.

As far as I can tell, Apple is telling the truth about the nature and source of the information in consolidated.db. When the story first broke, I assumed—along with many others—that the database was recording the cell sites and wifi networks that my phone detected as I wandered around, carrying the device in my pocket. In other words, the database was a local copy of a location log that was also, presumably, being uploaded to Apple. An Apple press release from April 27 insists that I had it backwards. This is not information gathered by my phone. Instead it is a “crowd-sourced database” downloaded from Apple to my phone.

3. Why is my iPhone logging my location?

The iPhone is not logging your location. Rather, it’s maintaining a database of Wi-Fi hotspots and cell towers around your current location, some of which may be located more than one hundred miles away from your iPhone, to help your iPhone rapidly and accurately calculate its location when requested….

4. Is this crowd-sourced database stored on the iPhone?

The entire crowd-sourced database is too big to store on an iPhone, so we download an appropriate subset (cache) onto each iPhone….

6. People have identified up to a year’s worth of location data being stored on the iPhone. Why does my iPhone need so much data in order to assist it in finding my location today?

This data is not the iPhone’s location data—it is a subset (cache) of the crowd-sourced Wi-Fi hotspot and cell tower database which is downloaded from Apple into the iPhone to assist the iPhone in rapidly and accurately calculating location. The reason the iPhone stores so much data is a bug we uncovered and plan to fix shortly….

Why do I believe this self-serving story? Basically because a true log of the phone’s trajectory through time and space would look rather different from the list of entries I find in consolidated.db. My phone spends much of its time in one place, talking all day to the same wifi links and the same cell towers. A log that recorded my moment-by-moment position over a period of months would include many, many repeated contacts with these few nearby sites. But in fact consolidated.db has exactly one entry for each such site. (The structure of the database guarantees this: The wifi MAC address and the set of identifiers for cell towers are primary keys in the data tables, which means they must be unique.) Another clue: Clumps of sites in the same neighborhood all have exactly the same time stamp. It appears they were all downloaded to the phone at the same time. That’s not the way I would have encountered the sites while walking the streets of the Shadyside neighborhood in Pittsburgh.
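Both clues can be demonstrated with a small stand-in database. The Python sketch below builds an in-memory SQLite table; the table name, the column names, and every MAC address, timestamp and coordinate in it are invented for illustration and are not copied from a real consolidated.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE WifiLocation (
        MAC       TEXT PRIMARY KEY,   -- so at most one row per access point
        Timestamp REAL,
        Latitude  REAL,
        Longitude REAL
    )
""")

# Invented rows: three neighbors sharing one time stamp, plus one straggler.
rows = [
    ("00:11:22:33:44:01", 1000.0, 40.4520, -79.9350),
    ("00:11:22:33:44:02", 1000.0, 40.4523, -79.9347),
    ("00:11:22:33:44:03", 1000.0, 40.4518, -79.9352),
    ("00:11:22:33:44:04", 5000.0, 40.4440, -79.9430),
]
conn.executemany("INSERT INTO WifiLocation VALUES (?, ?, ?, ?)", rows)

# Clue 1: a second sighting of a known access point cannot become a new row,
# so the table cannot be a moment-by-moment log of repeated contacts.
try:
    conn.execute("INSERT INTO WifiLocation VALUES (?, ?, ?, ?)",
                 ("00:11:22:33:44:01", 2000.0, 40.4520, -79.9350))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Clue 2: count how many sites share each time stamp, largest clump first.
clumps = conn.execute("""
    SELECT Timestamp, COUNT(*) FROM WifiLocation
    GROUP BY Timestamp
    ORDER BY COUNT(*) DESC
""").fetchall()

print(duplicate_rejected)
print(clumps)
```

A genuine travel log would need either a table without this uniqueness constraint or a separate row per sighting; large clumps sharing a single time stamp look like batch downloads instead.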

All the same, I still insist that I have an ownership stake in this database. I’m part of the crowd that sourced it. Without the unwitting participation of millions of iPhone owners, Apple’s database wouldn’t exist. And a piece of it is stored on my phone—some 24 megabytes’ worth (24,209 cell phone towers, 177,103 wifi routers). Finally, even if the database is not constructed as a direct tabulation of my movements, it provides a remarkably accurate record of the places I’ve been. That too makes the data mine.

Addendum 2011-05-12: Apple and Google argue that if you want the benefits of location-based services, then you have to be willing to share information about your whereabouts. Is this trade-off actually necessary? I think not. If the entire database were resident on the phone, then the phone itself could calculate its position, without any need to reveal that position to the outside world. If the global database is too large to put a copy on every phone, then installing larger pieces of it would at least coarsen the granularity of the information being leaked. There’s a difference between knowing I was in Pennsylvania and knowing I was at the intersection of Fifth Avenue and South Aiken Avenue in the Shadyside neighborhood of Pittsburgh.

The real need for sending my position information to Apple or Google is not so that I can get the benefit of the cell/wifi database but rather so that I can help them build that database. When Google first set out to compile this kind of information, they did so at their own expense, equipping their Streetview photography cars with wifi and cell antennas. The Skyhook database was created in a similar way. Using cell-phone customers to do the same work changes the terms of the transaction in a way I find unpleasant. I’m contributing to a proprietary database; I’m doing the work of drivers who would otherwise have to be hired to cruise the streets; but I’m not being compensated; on the contrary, I’m paying for the privilege. I can see an argument for a scheme in which we all voluntarily contribute data for a public good, but that’s not the nature of this transaction.

I should point out that there are efforts to build publicly accessible databases of cell and wifi coordinates. There’s Cellspotting, which looks interesting but works only with a few kinds of mobile phones. And there’s OpenBmap, which has some rough edges but even so provides an impressive amount of information. It’s the place to go if you want to learn about the cell towers in your neighborhood and figure out the numbering scheme that identifies them in the consolidated.db file.

Finally, a question: Can we imagine a “zero-knowledge” internet location service? GPS works this way: I can get a fix on my position simply by receiving signals from GPS satellites and doing some arithmetic on them; I don’t have to transmit anything at all. What the satellites are broadcasting is merely a time signal, and the arithmetic I have to do consists in finding a consistent time-of-flight solution for signals from three or four of the satellites. If we had internet beacons of known location broadcasting a continual stream of high-resolution time signals, we could do something similar. A complication is that the internet is a very inhomogeneous medium, where signals move at very different speeds. On the other hand, it would be easy to collect input from hundreds of beacons, rather than three or four satellites. Even without the beacons, the art of inferring latitude and longitude from IP number seems to be pretty highly developed; there’s a fascinating recent paper (PDF) on how it’s done, by Yong Wang and colleagues at Northwestern and Microsoft Research. (The GPS-beacons-on-the-internet idea must have been proposed a zillion times by now, but I don’t have a reference ready at hand.)
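The arithmetic in question can be sketched in a few lines of Python. This is a deliberately simplified version, under several assumptions of my own: the world is flat and two-dimensional, there are exactly three beacons, signals travel at the speed of light, and the receiver's clock is perfectly synchronized with the beacons. Real GPS works in three dimensions and must also solve for the receiver's clock error, which is why a fourth satellite is needed. The function name and the beacon coordinates are made up.

```python
import math

SPEED = 299_792.458  # signal speed in km/s (speed of light, an assumption)

def locate(beacons, flight_times):
    """Find (x, y) from three beacon positions and three times of flight.

    Each beacon i gives a circle (x - xi)^2 + (y - yi)^2 = di^2.
    Subtracting the first circle from the other two cancels the squared
    unknowns and leaves two linear equations in x and y."""
    (x0, y0), (x1, y1), (x2, y2) = beacons
    d0, d1, d2 = (SPEED * t for t in flight_times)

    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = d0**2 - d1**2 + x1**2 - x0**2 + y1**2 - y0**2
    b2 = d0**2 - d2**2 + x2**2 - x0**2 + y2**2 - y0**2

    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("beacons lie on a line; position is ambiguous")
    # Cramer's rule for the 2-by-2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# Made-up beacons (coordinates in km) and a made-up true position.
beacons = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
true_position = (300.0, 400.0)
flight_times = [math.dist(true_position, b) / SPEED for b in beacons]

x, y = locate(beacons, flight_times)
print(round(x, 3), round(y, 3))
```

Note that nothing in locate transmits anything: the receiver only listens. On the internet the constant SPEED is the weak link, since signals there move at very different speeds over different paths, which is where hundreds of beacons and some statistics would have to take over.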

Give me that good old-fashioned AI

Saturday, February 19th, 2011

It is said that to explain is to explain away. This maxim is nowhere so well fulfilled as in the area of computer programming, especially in what is called heuristic programming and artificial intelligence. For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself “I could have written that”. With that thought he moves the program in question from the shelf marked “intelligent”, to that reserved for curios, fit to be discussed only with people less enlightened than he.

These words were written in 1966 by Joseph Weizenbaum, in a paper that explained—and explained away—his famous program ELIZA, the mock psychotherapist. ELIZA’s conversations could occasionally suggest a glimmer of understanding:

I need some help, that much seems certain.

WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP?

Perhaps I could learn to get along with my mother.

TELL ME MORE ABOUT YOUR FAMILY.

But Weizenbaum revealed that ELIZA was merely picking out a few keywords from the text and applying simple syntactic transformations, along with a dose of randomness. The program was manipulating symbols, but the symbols had no meaning attached to them.
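A few lines of Python are enough to reproduce the flavor of the exchange above. This is a toy in the spirit of ELIZA, not Weizenbaum's program or script: the two rules, the reflection table and the fallback replies are all invented for this sketch.

```python
import random
import re

# Swap first and second person so a fragment can be echoed back.
REFLECTIONS = {"i": "you", "me": "you", "my": "your",
               "am": "are", "you": "I", "your": "my"}

# Each rule pairs a keyword pattern with one or more reply templates.
RULES = [
    (re.compile(r"\bi need ([^,.!?]*)", re.IGNORECASE),
     ["WHAT WOULD IT MEAN TO YOU IF YOU GOT {0}?",
      "WHY DO YOU NEED {0}?"]),
    (re.compile(r"\b(mother|father|sister|brother)\b", re.IGNORECASE),
     ["TELL ME MORE ABOUT YOUR FAMILY."]),
]

DEFAULTS = ["PLEASE GO ON.", "I SEE."]

def reflect(fragment):
    """Apply the pronoun swaps word by word and shout the result."""
    words = re.findall(r"[a-z']+", fragment.lower())
    return " ".join(REFLECTIONS.get(w, w) for w in words).upper()

def respond(sentence, rng=random):
    """Return the reply for the first rule whose keyword appears."""
    for pattern, templates in RULES:
        match = pattern.search(sentence)
        if match:
            template = rng.choice(templates)  # the dose of randomness
            return template.format(*(reflect(g) for g in match.groups()))
    return rng.choice(DEFAULTS)

print(respond("I need some help, that much seems certain."))
print(respond("Perhaps I could learn to get along with my mother."))
```

Nothing here knows what help is, or what a mother is. The word "mother" is a trigger for a canned sentence, and "some help" is a string copied from the input into a template.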

What about Watson, the new Jeopardy champion? Watson gave a dazzling performance this past week, decisively winning a two-game match against the best human players in the history of the quiz show. But will the magic crumble if we look closely at how it works? Does the program really understand those quirky Jeopardy clues, or is it just pushing symbols around, in the manner of ELIZA?

The most detailed account of Watson’s innards that I’ve been able to find is an article published in AI Magazine last fall by David Ferrucci of IBM, the project’s lead engineer, and a dozen colleagues from IBM and Carnegie Mellon. (“Building Watson: An overview of the DeepQA project,” AI Magazine 31(3):59–79. The article is behind a paywall at the AI Magazine web site, but resourceful internauts may find it elsewhere.)

Here’s the overview:

The system we have built and are continuing to develop, called DeepQA, is a massively parallel probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses.

Thus we learn that behind Watson’s calm, metallic voice is a clamor of 100+ agents doing their massively parallel probabilistic evidence-based thing. This is not one big brain but a society of mind. (By the way, I think “probabilistic” means simply that potential answers are scored by assigning them probabilities; as far as I can tell there is no randomness or indeterminacy in the algorithms, but I could be wrong about that.)

The first step in the run-time question-answering process is question analysis. During question analysis the system attempts to understand what the question is asking and performs the initial analyses that determine how the question will be processed by the rest of the system.

This description implies that the system is indeed making an effort to dig down into the semantics of natural language. But how does it attempt to understand the clue? The rest of the paragraph is more of a shopping list than an explanation:

The DeepQA approach encourages a mixture of experts at this stage, and in the Watson system we produce shallow parses, deep parses (McCord 1990), logical forms, semantic role labels, coreference, relations, named entities, and so on, as well as specific kinds of analysis for question answering.

The reference to (McCord 1990) is perhaps the most illuminating item in this list. The author is Michael C. McCord, who is at IBM’s Yorktown Heights lab where Watson was built. The phrase “deep parses” apparently refers to McCord’s idea of a slot grammar, which provides a single framework for combining the syntactic analysis of sentences (subject, predicate, object, etc.) with semantic features (word senses, logical relations, predicates). Unfortunately, the AI Magazine article gives no further hints about how slot grammars are used in the analysis of Jeopardy clues. (For some useful recent accounts of slot grammars, see the links in McCord’s publications list.)

The part of the question-analysis phase that Ferrucci et al. discuss at greatest length is a process they call “LAT detection.” LAT is “lexical answer type”: a word or phrase in the clue that specifies what kind of response is wanted—a person, a city, a book, a substance, and so on. Consider this clue, in a category titled “Oooh…. chess”:

Invented in the 1500s to speed up the game, this maneuver involves two pieces of the same color.

The LAT in the clue is “maneuver”: Whatever the answer is, it must be something that can plausibly be described as a maneuver. If you were to fixate on the wrong LAT—say “the game” or “two pieces”—you’d have no hope of coming up with the correct answer. Naming the two pieces “king and rook” would not score any points, even though that particular choice of pieces suggests you have the right idea in mind; to get credit for the answer, you need to give the name of the maneuver: “castling.”

Identifying the correct LAT is clearly important. It’s also clearly difficult. What’s not so clear is how Watson does it. In the chess example, does “maneuver” stand out from the rest of the words in the clue for grammatical reasons (it’s the subject of the main clause), or because it’s pointed to by the demonstrative adjective “this,” or for some other reason? How would you write a program to identify the LAT of an arbitrary Jeopardy clue?
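One naive way to attack that last question is to follow the "this" hint and nothing else. The Python sketch below is my own guess at a first attempt, not a description of what Watson does: it takes the word right after "this" or "these" and, failing that, falls back on a bare pronoun. The function name and both patterns are assumptions.

```python
import re

# Heuristic 1: the word immediately following a demonstrative.
DEMONSTRATIVE = re.compile(r"\b(?:this|these)\s+([A-Za-z]+)", re.IGNORECASE)
# Heuristic 2: a standalone pronoun standing in for the answer.
PRONOUN = re.compile(r"\b(he|she|it|they)\b", re.IGNORECASE)


def guess_lat(clue):
    """Return a crude guess at the lexical answer type, or None."""
    match = DEMONSTRATIVE.search(clue)
    if match:
        return match.group(1).lower()
    match = PRONOUN.search(clue)
    if match:
        return match.group(1).lower()
    return None


chess = ("Invented in the 1500s to speed up the game, "
         "this maneuver involves two pieces of the same color.")
print(guess_lat(chess))
```

The weaknesses show up quickly. A clue that says "this famous maneuver" would yield "famous," and a clue with no demonstrative or pronoun yields nothing at all, so a serious system would presumably need a real parse.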

Moving on from the analysis of questions to the finding of answers, the algorithmic details remain a little fuzzy.

Watson has access to various sources of “structured” knowledge: relational databases, taxonomies, ontologies. With such resources, retrieval is straightforward. Yet it turns out that few clues can be reformulated as database queries. Ferrucci writes: “Watson’s current ability to effectively use curated databases to simply ‘look up’ the answers is limited to fewer than 2 percent of the clues.” I suppose this is not really surprising. If the game could be reduced to database lookup, it wouldn’t be much fun.

For the other 98 percent of the queries, I gather that the retrieval process is more like Googling for the answer. The machine has no live internet connection during the Jeopardy contest, so it can’t actually search the web. But lots of free-form textual data was loaded into the Watson servers ahead of time, including all of Wikipedia and many other reference works. Using these documents as seeds, the system then trawled the web for other sources that might be useful, and cached copies of them for use offline. About four terabytes of material was available for query answering.

As for the search methods applied to this archive, the article by Ferrucci et al. offers another shopping list:

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.

In a long series of training runs the system was tuned to balance the competing demands of coverage, accuracy and speed.

The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top 250 candidates; that is, the system generates the correct answer as a candidate answer for 85 percent of the questions somewhere within the top 250 ranked candidates.
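In case "binary recall" is unfamiliar, here is how I read the definition, as a small Python sketch. The function name and the toy data are mine; the only thing taken from the article is the idea of asking, for each question, whether the correct answer appears anywhere among the top k candidates.

```python
def binary_recall_at_k(ranked_candidates, correct_answers, k=250):
    """Fraction of questions whose correct answer is among the top k candidates.

    ranked_candidates: one ranked list of candidate answers per question.
    correct_answers:   the correct answer for each question, in the same order.
    """
    if not correct_answers:
        raise ValueError("need at least one question")
    hits = sum(
        1
        for candidates, answer in zip(ranked_candidates, correct_answers)
        if answer in candidates[:k]
    )
    return hits / len(correct_answers)


# Made-up example: three questions, two candidates each.
candidates = [["castling", "en passant"], ["Ford", "Nixon"], ["Toronto", "Boston"]]
answers = ["castling", "Nixon", "Chicago"]
print(binary_recall_at_k(candidates, answers, k=2))
```

The measure is "binary" because a question scores either a hit or a miss; the rank of the correct answer within the top k makes no difference.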

The trouble with free-form textual search is that you may very well identify relevant snippets of text but still have a hard time extracting the correct answer. Indeed, the same kind of analysis that goes into figuring out the question also has to be applied to candidate answers. For example, Ferrucci et al. discuss this clue: “He was presidentially pardoned on September 8, 1974.” Among the materials retrieved by the search algorithm was the text fragment: “Ford pardoned Nixon on Sept. 8, 1974.” For a human player with a little knowledge of U.S. history, this result would be more than enough to settle the matter, but a computer program still has some work to do. Suppose the program has correctly identified the LAT of the clue as “He,” and suppose further that it knows that both “Ford” and “Nixon” refer to male persons, perhaps even that they were presidents. Which of the two names is the right choice? Several of the tests that Watson applies are essentially string-matching algorithms, similar to those that search DNA sequences for genetic patterns. Those algorithms might count how often each name occurs in association with the given date, but that result will not resolve the ambiguity in this case. The correct answer comes from a program module that undertakes a deeper logical analysis and recognizes the difference between subject and object in the two statements.
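The subject-versus-object distinction can be illustrated with a deliberately tiny Python sketch. It is not Watson's module, which presumably relies on a real parser; the two regular expressions below are hand-written assumptions that cover only a passive clue and an active passage sharing the same past-tense verb.

```python
import re

# Active voice: "<Agent> <verb>ed <Patient> ..."
ACTIVE = re.compile(r"(?P<agent>[A-Z]\w+) (?P<verb>\w+ed) (?P<patient>[A-Z]\w+)")
# Passive voice: "<Patient> was [<adverb>ly] <verb>ed ..."
PASSIVE = re.compile(r"(?P<patient>\w+) was (?:\w+ly )?(?P<verb>\w+ed)\b")


def answer_from_passage(clue, passage):
    """If the clue's passive subject plays the patient role, return the passage's patient."""
    clue_match = PASSIVE.search(clue)
    passage_match = ACTIVE.search(passage)
    if not clue_match or not passage_match:
        return None
    if clue_match.group("verb") != passage_match.group("verb"):
        return None
    # The clue asks who received the action, so take the object of the active sentence.
    return passage_match.group("patient")


clue = "He was presidentially pardoned on September 8, 1974."
passage = "Ford pardoned Nixon on Sept. 8, 1974."
print(answer_from_passage(clue, passage))
```

A frequency count of names near the date cannot choose between the two men; lining up the grammatical roles can.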

•     •     •

Given this glimpse into how Watson works, do we deem its intelligence to be explained, or explained away? Personally, I have mixed feelings.

I admit to a sentimental fondness for what John Haugeland called Good Old-Fashioned AI, or GOFAI—the ambitious kind of artificial intelligence that aspired to build a true thinking machine, a system with some deep internal representation (a mental model) of the world in which it functions. The outstanding example of this style is Terry Winograd’s SHRDLU program, written in 1970, which conversed about objects in a world of toy blocks on a tabletop. At the time, Winograd firmly asserted that the program was able to “understand discourse,” and he meant by this that the program understood not only the words but also the objects and relations the words referred to.

The promise of SHRDLU was that we could extend the same methods to broader domains of discourse, steadily building toward a general-purpose, human-like intelligence, with the same kind of carefully planned knowledge representation. But that never happened. Later in the 1970s, AI entered a time of troubles. When it came back in the 80s, the emphasis had shifted, and the technology had diversified. The new AI focused on expert systems, on data mining, on statistical rather than deductive methods; another branch of AI turned away from the human cerebral cortex in favor of the motor neurons of the cockroach. Overall, the field took a more pragmatic turn, with less concern for understanding the ultimate nature of intelligence and more energy invested in getting useful results, whatever the methodology.

Watson is in this latter-day pragmatic tradition, with its 100+ agents and its massively parallel probabilistic evidence-based architecture. Compared with SHRDLU, it’s all so messy, so ad hoc, so opaque. But it works, doesn’t it.

And I suppose my own mind is not quite as tidy as I would like to believe.

•     •     •

Although Watson won its Jeopardy match by a wide margin and made very few mistakes along the way, the moment everyone will remember is the program’s spectacular flub of a Final Jeopardy question on the second night. The category was “U.S. Cities,” and the clue was:

Its largest airport is named for a WWII hero, its second largest for a WWII battle.

Watson replied “Toronto.” As it happens, I got that question right; just seconds after the clue was revealed, I called out “Chicago.” Later, though, when I thought about the mental process that led to my answer, I realized that this was not at all a product of well-focused deductive reasoning. I was doing the same kind of scattershot, parallel, probabilistic groping in the dark that I frown on in a machine.

My “reasoning” went something like this: If it has two airports, it must be a pretty big city…. New York has three airports…. There’s Dallas, with DFW and Love—but no heroes or battles there. Chicago has two. Oh! Midway—that must be the battle of Midway.

That’s when I pressed the buzzer.

Note how sketchy my thinking was. I had no idea O’Hare was named for a war hero. As a matter of fact, I had no idea that Midway was named for the naval battle. If I had been asked in a more straightforward way, “Why is Chicago’s second airport named ‘Midway’?”, I would have guessed that it lies halfway between Point A and Point B. The Pacific island would not have entered my consciousness. And I never bothered to dig any deeper into the catalogue of multi-airport cities—Washington, San Francisco, Houston (isn’t G. H. W. Bush a WWII hero?).

So messy and ad hoc.

Goooooogle

Wednesday, February 16th, 2011

Two weeks ago my wife told me about her new Googling strategy: She ignores the top-ranked items and immediately clicks through to page six of the results. All the earlier pages, she says, are larded with SEO spam—links whose ranking has been artificially inflated by some nefarious form of search-engine optimization.

When I heard this, my first thought was: Well, there’s a business opportunity! Let’s set up a search engine (we’ll call it page6.com, or maybe goooooogle.com) that will pass each query along to Google, collect the results, discard the first five pages, and return the rest. But that daydream didn’t last long. It’s not just that Google would slap a cease-and-desist order on me. More important, there’s no need for anything as elaborate as a pass-through search engine. It can all be done with some simple scripting within the browser. The following XML, when loaded into Firefox, creates a search-bar plugin that returns Google results starting with page six.

<OpenSearchDescription
xmlns="http://a9.com/-/spec/opensearch/1.1/"
xmlns:moz="http://www.mozilla.org/2006/browser/search/">
<ShortName>Goooooogle</ShortName>
<Description>Get page 6 from Google</Description>
<InputEncoding>UTF-8</InputEncoding>
<Url type="text/html" method="GET"
template="http://www.google.com/search?q={searchTerms}&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;start=50"></Url>
</OpenSearchDescription>

The crucial bit is the parameter “start=50” at the end of the URL template, which skips over the first five pages of the results.
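The arithmetic behind that parameter is simple enough to sketch in Python. I'm assuming ten results per page, which is what makes start=50 the top of page six; the function name and its defaults are mine, not anything Google documents.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: the default page size


def search_url(terms, first_page=6):
    """Build a Google search URL whose results begin at the given page."""
    start = (first_page - 1) * RESULTS_PER_PAGE  # page 6 -> skip 50 results
    query = urlencode({"q": terms, "ie": "utf-8", "oe": "utf-8", "start": start})
    return "http://www.google.com/search?" + query


print(search_url("area rugs"))
```

Changing first_page is all it takes to start deeper in the results, if page six turns out not to be deep enough.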

Problem solved! Mission accomplished! But then over the weekend the business section of the Times ran a long story by David Segal that both validated and undermined the page-six strategy. The validation came from the revelation that SEO manipulation of search results is blatant and widespread and all too effective. Segal revealed that J. C. Penney, the retailer, has been finagling their way to the top of the Google lists for dozens of search terms, such as “dresses,” “area rugs,” “home decor” and “furniture.” Penney’s method is to buy lots of inconspicuous links on “innocent” sites, all pointing to Penney’s pages and thereby raising their Google PageRank. I shouldn’t be surprised to learn of this practice. I get offers every week or two to place such paid links on bit-player.org. “Who couldn’t use an extra $100, $2,000, $10,000/month or more in passive advertising income?” asked one recent enticement. The surprise, I guess, is that such a clumsy and cloddish manipulation of the search engines actually works.

So that explains why Google’s page-one results are all crap, and we’re better off skipping to page six. Unfortunately, toward the end of Segal’s story we learn that page six isn’t safe either. When Segal went to Google with the evidence of these shenanigans, Google took corrective action.

On Wednesday evening, Google began what it calls a “manual action” against Penney, essentially demotions specifically aimed at the company.

At 7 p.m. Eastern time on Wednesday, J. C. Penney was still the No. 1 result for “Samsonite carry on luggage.”

Two hours later, it was at No. 71.

At 7 p.m. on Wednesday, Penney was No. 1 in searches for “living room furniture.”

By 9 p.m., it had sunk to No. 68.

In other words, all the cruft that used to be on page one is now on page six or seven or eight. My Goooooogle trick is hosed.

The prime twins conjecture

Tuesday, January 18th, 2011

Over the weekend, identical twin sisters Inez Harries and Venice Shaw both celebrated their 100th birthday in California. I heard about this on the TV news, where it was the human-interest teaser story. “What are the odds of that?” the anchorman asked. Then he promised: “We’ll do the math.”

Inez Harries and Venice Shaw at 100; photo credit James Davis, Lockheed Federal Credit Union

I stayed through the whole broadcast just to see them do the math, but in the end all they gave was an answer, without showing where the number came from: The odds are 1 in 700 million, they said. On the web I found other accounts of the Harries-Shaw birthday party that quoted the same figure, but they were no more helpful about the details of the calculation. One version cited “family members who researched the question.” Another story, discussing another pair of centenarian twins, attributed the number to a spokeswoman for Guinness World Records. I found no supporting information on the Guinness web site. Nevertheless, I suspect that Guinness is indeed the source of the number. An Amazon page for a 2002 edition of Guinness World Records includes the following statement:

The chance of identical twins both reaching and surpassing the age of 100 is about one in 700 million.

What does that mean, exactly? What the words seem to say is this: In a population of 700 million pairs of identical twins, we should expect to find approximately one pair in which both members survive to age 100 or more. (Or should it be 700 million twins, and thus just 350 million pairs?)

I suspect that the author of the sentence actually meant something different: In a human population of 700 million, we should expect to find about one pair of individuals who are identical twin siblings and who also both live to be 100 or more.

Even under the latter interpretation, I was skeptical of this number. My back-of-the-envelope estimates of the probability differed from the Guinness value by orders of magnitude. However, the back of my envelope is notoriously unreliable, especially when it comes to calculating probabilities. I might well have blundered.

Here are some numbers pertinent to the calculation:

  • According to twins.com, one out of 250 live births produces monozygotic twins. (I think that means that the proportion of people who have an identical twin sibling is 2/251. Gotta watch out for those pesky factors of 2-ish.)
  • According to the Centers for Disease Control, there were 2,809,000 births in the U.S. in 1911. (It is the survivors of that cohort who are turning 100 this year.)
  • According to the U.S. Social Security Administration, the survival rate for reaching age 100 or more in the U.S. is 657/100,000 for men and 2,223/100,000 for women.
  • According to Wikipedia, which in turn cites the Bureau of the Census, the U.S. had some 70,490 centenarians (age 100 or more) in September 2010.

And here’s how I figured it. Among the 2.8 million births in 1911, there should have been about 11,000 pairs of monozygotic twins. We want to know how many pairs have survived to 2011. We can get an individual survival rate either from the Social Security actuarial table or from the Census Bureau’s count of surviving centenarians. The two coefficients differ substantially—0.014 vs. 0.025—probably because the Census count includes the effect of net immigration. Let’s split the difference and say that 0.02 of the cohort has survived to age 100. Since we are tracking the simultaneous survival of pairs of individuals, we want the square of this number—0.0004. Multiplying the initial number of twin pairs by this factor suggests there should be four or five pairs surviving today. The odds I calculate are roughly 1 in 600,000, not 700 million.
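Here is that arithmetic spelled out as a short Python sketch, reproducing the steps exactly as described in the paragraph above. The inputs are the figures from the list; the names are mine, and averaging the male and female survival rates assumes an even sex ratio, which is my simplification.

```python
def twin_estimate():
    """Reproduce the back-of-the-envelope estimate for the 1911 U.S. cohort."""
    births_1911 = 2_809_000
    people_per_twin_pair = 251  # 250 births yield 251 people, one twin pair among them
    twin_pairs = births_1911 / people_per_twin_pair

    # Two independent estimates of individual survival to age 100.
    ssa_rate = (657 + 2_223) / 2 / 100_000  # Social Security actuarial table
    census_rate = 70_490 / births_1911      # Census count of centenarians

    survival = 0.02                  # splitting the difference
    both_survive = survival ** 2     # treats the twins' survival as independent
    surviving_pairs = twin_pairs * both_survive
    odds = births_1911 / surviving_pairs

    return {
        "twin_pairs": twin_pairs,
        "ssa_rate": ssa_rate,
        "census_rate": census_rate,
        "surviving_pairs": surviving_pairs,
        "odds": odds,
    }


for name, value in twin_estimate().items():
    print(f"{name}: {value:,.4f}")
```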

Do you see my error? Yes, I’ve goofed again. But as far as I can tell I’m off only by a factor of 2, not a factor of 1,000. [Please see the comments.]

Lacking all faith in my own competence to do the math, I checked my calculation with a simple-minded computer run. I set up a vector of 2,809,000 bits, designated 11,191 pairs of them as identical twins, killed off bits at random until only 2 percent remained, and finally counted the pairs left standing. In a thousand runs I came up with a mean of 2.23 pairs (and a standard deviation of 1.45). The result of my earlier calculation was 4.48 pairs—just double the simulation outcome. Where did I go wrong? I had been careful to count pairs rather than individual twins in the hope of avoiding just this kind of confusion. But then I went and counted each pair twice! In effect, I was counting Venice and Inez as well as Inez and Venice. Sigh.

Even after correcting my flub, this computation should not be taken too seriously. It neglects a bunch of not-so-subtleties. In particular, I am pretending that the probabilities of all events are independent, when in fact the longevity of identical twin siblings is doubtless very highly correlated. But taking those correlations into account would make the discrepancy between my result and Guinness’s even wider.

My arithmetic—if I have finally done it right—suggests twin-survival odds of roughly 1 in 1.2 million. How did Guinness (or whoever?) come up with 1 in 700 million? My best guess is that they based their calculation on global demographic estimates rather than regional or national statistics. By tweaking the numbers, I can come fairly close to their result.

A U.N. report on aging estimates there are 455,000 centenarians worldwide. The global population in 1911 was apparently somewhere near 1.76 billion. To estimate the cohort size in 1911 we’d need to know the crude birth rate, which I have not been able to ascertain, but extrapolating backward from estimates for later years suggests something in the range of 50 births per thousand people per year. Running these numbers through the mill yields an estimate of 4.7 twin pairs worldwide, or odds of 1 in 375 million.

If I were to boost the 1911 birth rate a little higher, I could arrive at the curious prediction that there are more centenarian twins living in the U.S. than there are on the entire planet. This result does not inspire confidence in the methodology, but it’s also not really surprising. We’re dealing with events far out in the upper tail of a normal distribution. Including populations where the mean age at death is 48 rather than 78 adds more noise than information.

In any event, I toast Inez and Venice on this occasion. And I am happy to note that they are not alone. In rooting around on the web, I’ve found another way to put a lower bound on the twin survival rate. News reports published within the past few months mention other pairs of twins celebrating 100th birthdays in Alabama, Florida and Rhode Island. That makes at least four 700-million-to-1 events in a country of 300 million. Farther afield, there are also recent stories about pairs of centenarian twin sisters in Belgium and the U.K. And Wikipedia has a list of 16 such pairs (not all monozygotic) thought to be still living. Double congratulations to all of them.

CAPTCHA arbitrage

Tuesday, November 23rd, 2010

What a world we live in. It seems there are places on this planet that are wired well enough to support internet commerce, yet where people are poor enough that solving CAPTCHAs for 50 cents per thousand is an economically appealing proposition. That’s roughly three hours of work for half a buck—minus whatever it costs the worker for internet access.

I have learned this from a fascinating paper (PDF) by Stefan Savage and his colleagues at the University of California, San Diego. The Savage group studied the CAPTCHA-solving economy in a very direct way—by participating in it, both as customers and as workers.

reCAPTCHA example image

CAPTCHAs are meant to thwart computer programs that sign up for bogus email accounts in order to send spam, or that post spammy comments on forums or blogs like this one. Transcribing the distorted and obscured text is a task that’s supposed to be easier for people than for machines. The spammers’ first response to CAPTCHAs was to write programs that solve them algorithmically, but in this arms race the advantage is with the white hats. Deploying a new style of CAPTCHA is quick and cheap; developing a new solver for that CAPTCHA is slow and costly. And so the spammers have turned to human solvers.

A CAPTCHA that gets caught up in this illicit trade is likely to take quite a globe-girdling journey in a matter of seconds. The images are typically generated by a service such as reCAPTCHA (now owned by Google), and embedded in the sign-up form or comment form of a web site. The spammer’s software (such as GYC Automator) scrapes the image from the page and forwards it to a front-end system, which aggregates CAPTCHA-solving requests and collects payment from the spammers. The image is then passed on to a back-end operation, which marshals the services of individual solvers and distributes payments to them. The solver sees the image on a simple web form and types the transcription into a text box. If all goes well (from the spammer’s point of view), a correct solution comes back within the 30 seconds or so that most pages allow for solving the puzzle.

Some other findings of the Savage study:

  • The piece-work rate offered to solvers has fallen steadily, from $10 per thousand in 2007 to the present level of $0.50 to $0.75.
  • The price paid by spammers is higher, of course, and also variable. Typical current rates are in the range of $1 to $2 per thousand, but some services charge as much as $20.
  • Most of the services tested by Savage et al. were fast and accurate, solving 85 to 90 percent of the CAPTCHAs correctly, with a median response time of 14 seconds.
  • By gradually raising the rate of CAPTCHA submissions until responses slowed and additional work was refused, Savage et al. estimated the size of the workforce. The largest outfits seemed to have at least 400 to 500 workers online at once.
  • In an attempt to learn where the solvers live, Savage et al. sent out specially fabricated CAPTCHAs with images of words in various languages. They reasoned that accuracy would be highest in the worker’s native language. If this hypothesis is correct, many of the CAPTCHA solvers are fluent in Chinese, Russian or Hindi. But one organization showed exceptional linguistic versatility, even solving challenges in Klingon.
  • By offering their services as solvers, the Savage group were able to gather some statistics on what kinds of CAPTCHAs are flowing through the dark side of the internet. Microsoft CAPTCHAs (used on the Hotmail sign-up form) were the most popular. Others seen frequently included reCAPTCHA images and products of several Russian-language services.

The paper includes a discussion of the ethics and legality of this project. Is it acceptable, even for research purposes, to abet the unsavory activities of spammers? The Savage group concluded that acting as buyers of the service caused little harm because the purchased solutions were never used to register fraudulent accounts or post messages. But working as solvers was more troubling, because the solutions they provided would indeed be used to further the aims of spammers.

To sidestep this concern, we chose not to solve these CAPTCHAs ourselves. Instead, for each CAPTCHA one of our worker agents was asked to solve, we proxied the image back into the same service via the associated retail interface. Since each CAPTCHA is then solved by the same set of solvers who would have solved it anyway, we argue that our activities do not impact the gross outcome.

It’s an ingenious dodge, even if it doesn’t put one’s mind totally at ease about the ethical question. (Suppose we were studying murder-for-hire instead of CAPTCHAs-for-hire—would the same reasoning be acceptable?) Ethics aside, however, the tactic of recirculating work requests back into the same system raises other curious issues.

For one thing, it suggests a way of measuring the size of the CAPTCHA-solving enterprise. We could run a capture-recapture experiment (or should I say CAPTCHA-reCAPTCHA?). Suppose we never solve a CAPTCHA ourselves, but we make a record of each image as it arrives, before we dump it back into the work stream. From the fraction of recirculated CAPTCHAs that come back to us at least once more we could estimate the total size of the work flow. Of course this assumes that the stream of CAPTCHAs is well-mixed. It also assumes there is no one else out there recirculating CAPTCHAs.
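For the record, the textbook version of capture-recapture is the Lincoln-Petersen estimator, and it fits in a few lines of Python. Applying it here is my own analogy: it treats the CAPTCHA stream as a closed, well-mixed pool, which a flowing work queue only approximates, and the numbers in the example are invented.

```python
def lincoln_petersen(marked, sampled, recaptured):
    """Estimate pool size N from N ~ marked * sampled / recaptured.

    marked:     images we recorded and then recirculated
    sampled:    images we received afterward
    recaptured: how many of those later images we had already recorded
    """
    if recaptured == 0:
        raise ValueError("no recaptures, so the estimate is unbounded")
    return marked * sampled / recaptured


# Invented numbers: record 1,000 images, then see 4 of them again among the next 2,000.
print(lincoln_petersen(marked=1_000, sampled=2_000, recaptured=4))
```

With so few recaptures the estimate would be very noisy, which is one more reason to treat the whole exercise as a thought experiment.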

This last caveat leads to an interesting economic question. As noted above, retail prices for CAPTCHA-solving vary over a wide range, from about $1 per thousand to $20 per thousand. This price spread, and the fact that it’s technically feasible to route a CAPTCHA through the system more than once, suggests a major arbitrage opportunity. We can set up a high-price CAPTCHA service and farm out all the actual work to low-price competitors. In a free economy—and what economy could be freer of regulation than a criminal one?—that situation is not supposed to endure.

•     •     •

While I’m on the subject of spam, I’ll take the opportunity to update my own running tally.

personal spam receipts, Jan 2007 through Oct 2010

Activity in my inbox has been unexciting since my last report. Spam volume is still well below the peaks of mid-2009, but on the other hand there’s not much support for the fond notion that spam is on the verge of extinction. News reports have suggested that the shutdown of a Russian web site called SpamIt.com, allegedly run by Igor Gusev, caused a sharp dropoff in the global spam rate in September. And indeed my September intake was the smallest since June of 2007. But the chronology isn’t quite right. SpamIt was closed on September 27, so that event should have depressed the October numbers more than those for September. My spam receipts rebounded in October. About 40 percent of the October spam messages use a Russian-language encoding.

Update 2010-11-28: Peter G. Neumann’s RISKS list has an item about criminal charges against the operators of Wiseguy Tickets, whose business model involved solving CAPTCHAs in bulk. By circumventing the CAPTCHAs, they supposedly jumped to the head of the line at Ticketmaster and scooped up 11,984 Hannah Montana concert tickets.

The RISKS item cites a Wired article by Kim Zetter, which seems to be based mainly on the indictment (PDF) filed in the U.S. District Court for the District of New Jersey. Here’s how the Wiseguys did it, according to the indictment:

11. It was further part of the conspiracy that to enable the CAPTCHA Bots to purchase tickets automatically, Wiseguys:

a. Downloaded hundreds of thousands of possible CAPTCHA Challenges from reCAPTCHA. To obtain these CAPTCHA Challenges anonymously, Wiseguys wrote a computer script that disguised the origin of the download requests by impersonating would-be users of Facebook, which also subscribed to reCAPTCHA.

b. Created an “Answer Database” by having its employees and agents read tens of thousands of CAPTCHA Challenges (or listen to audio CAPTCHA Challenges) and enter the answers into a database of File IDs and corresponding answers.

12. It was further part of the conspiracy that, during the ticket-buying process, instead of “reading” a CAPTCHA Challenge, the CAPTCHA Bots identified the CAPTCHA Challenge’s File ID. The CAPTCHA Bots then instantly compared the CAPTCHA Challenge’s File ID against the Answer Database, looking for a matching File ID. If the CAPTCHA Bots found a matching File ID, it immediately and automatically transmitted the pre-typed answer to that CAPTCHA Challenge to the Online Ticket Vendors’ website. This process took place in a fraction of a second, much faster than a human user could respond to a typical CAPTCHA Challenge.

I’m having trouble making sense of this. The scheme would work only if reCAPTCHA is repeatedly sending out the same image, linked to the same file ID, and reusing images often enough that if you save a few hundred thousand of them, you’ll have a good chance of finding any new challenge already present in your database. Surely it’s not that easy?
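To see what the scheme requires, here is the lookup reduced to a Python sketch, along with the hit rate it implies. Everything in it is hypothetical: the file IDs, the answers, and above all the assumption that challenges are drawn uniformly from a fixed pool of reusable images.

```python
# Hypothetical answer database: file ID -> pre-typed answer.
answer_db = {
    "file-0001": "example words",
    "file-0002": "another answer",
}


def lookup(file_id):
    """Return the stored answer for a challenge, or None if it has not been seen."""
    return answer_db.get(file_id)


def hit_probability(db_size, pool_size):
    """Chance that a fresh challenge is already stored, assuming uniform reuse."""
    return min(1.0, db_size / pool_size)


print(lookup("file-0001"))
print(hit_probability(300_000, 1_000_000))
```

Under those assumptions, a few hundred thousand stored answers pay off only if the pool of distinct challenges is not much larger than that.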

It’s true that reCAPTCHA (unlike other CAPTCHA services) must gather multiple solutions for some images. That’s because of their aim of using the labor of solvers to proofread scanned texts. Each reCAPTCHA includes two words. The solution for the “control word” is known in advance and is used to authenticate the solver; the solution offered by the solver for the “unknown word” becomes a candidate reading of that word in the OCR process. Multiple solvers must agree on the same reading of the unknown word before the transcription is accepted. Thus each unknown word is presented more than once. But is it always paired with the same control word? And with the same file ID?
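The two-word mechanism can be sketched in Python as follows. This is my reconstruction from the description above, not reCAPTCHA's code; the class name, the image IDs and especially the agreement threshold are assumptions, since I don't know how many matching readings the real service demands.

```python
from collections import Counter, defaultdict


class ReadingPool:
    """Check the control word, and tally solvers' readings of the unknown word."""

    def __init__(self, threshold=3):
        self.threshold = threshold          # assumed number of agreeing solvers
        self.votes = defaultdict(Counter)   # unknown-word image ID -> reading tallies
        self.accepted = {}                  # unknown-word image ID -> accepted reading

    def submit(self, control_answer, control_solution, unknown_id, unknown_answer):
        """Return True if the solver passes; count their reading only in that case."""
        if control_answer.strip().lower() != control_solution.lower():
            return False
        reading = unknown_answer.strip().lower()
        self.votes[unknown_id][reading] += 1
        if self.votes[unknown_id][reading] >= self.threshold:
            self.accepted[unknown_id] = reading
        return True


pool = ReadingPool(threshold=2)
pool.submit("morning", "morning", "img1", "upon")
pool.submit("Morning", "morning", "img1", "Upon")
print(pool.accepted)
```

In this sketch the unknown word's image ID has to be reused so that votes can accumulate, but nothing forces it to travel with the same control word each time.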

The reCAPTCHA folks are pretty savvy about such weaknesses. A list of guidelines on their web site includes this paragraph:

Script Security. Building a secure CAPTCHA is not easy. In addition to making the images unreadable by computers, the system should ensure that there are no easy ways around it at the script level. Common examples of insecurities in this respect include: (1) Systems that pass the answer to the CAPTCHA in plain text as part of the web form. (2) Systems where a solution to the same CAPTCHA can be used multiple times (this makes the CAPTCHA vulnerable to so-called “replay attacks”).

So if the reCAPTCHA programmers are taking their own advice, the Wiseguys should have been out of luck. And yet we have the evidence of those 11,984 Hannah Montana tickets. What’s the story?

Dotted lines

Tuesday, October 5th, 2010

Where I grew up, a dotted line ran through the neighborhood, just beyond my back yard. On maps, that line marked the boundary between the city of Philadelphia and its inner-ring suburbs. On the ground, it was a racial divide—absolute and knife-edge sharp. Our side was all white. The public schools I attended had an enrollment of roughly 8,000, with just three black students. The community on the other side of the line was our racial mirror image, almost entirely black.[*]

Revisiting the old neighborhood 50 years later, I have been pleased to find the boundary softened and blurred somewhat. Families have drifted across the line in both directions. However, it’s not yet time to celebrate the end of residential segregation in American cities.

I’ve recently learned about a remarkable set of maps showing population distribution by race and ethnicity in more than 100 metropolitan areas. The maps were created by Eric Fischer, a Bay Area programmer with an interest in cartography and urban life (and also, incidentally, the author of a wonderfully detailed history of ASCII). Here’s Fischer’s map of the Philadelphia area, based on block-level data from the 2000 U.S. Census:

Eric Fischer map of race and ethnicity in Philadelphia

Each dot represents 25 people, coded according to the color key at right. The image is at reduced resolution, and I’ve had to crop it slightly to fit this space. For a clearer view I recommend looking at the full-size and full-resolution images (3,000 × 3,000 pixels), which are all available on Fischer’s Flickr stream under a Creative Commons license.

Below is a detail of Philadelphia’s western boundary. The Schuylkill River winds along the right edge of the frame; I’ve added a black circle to mark my childhood turf.

close up of Philadelphia's western boundary

And here’s Fischer’s map of Detroit, the most extreme case in the whole collection, with the city’s northern boundary sharply delineated along Eight Mile Road:

Eric Fischer's map of race and ethnicity in Detroit

I suppose no one will be shocked to learn that racial divisions persist in the U.S., but I do think these maps offer particularly vivid evidence. Fischer was inspired to create the maps by earlier work of Bill Rankin, a historian and cartographer currently at Harvard. Using Census data, Rankin mapped the distribution of income as well as the geography of race and ethnicity in neighborhoods of Chicago and its suburbs. Rankin writes:

Any city-dweller knows that most neighborhoods don’t have stark boundaries. Yet on maps, neighborhoods are almost always drawn as perfectly bounded areas, miniature territorial states of ethnicity or class.

An apt example of those “miniature territorial states” is on exhibit in the map of Philadelphia reproduced below, which was prepared in 1936 by the Home Owners’ Loan Corporation:

1936 Philadelphia redlining map

The boundary lines drawn here determined where home mortgages were available; the red “hazardous” areas were effectively off-limits to lenders. It’s well known that there was a strong correlation between race and the “redlined” areas of such maps. (At the time, the West Philadelphia neighborhood near where I would later live was rated “still desirable,” or in other words mostly white. The change came after World War II.)

The process of creating a “miniature states” map from distribution data involves at least two levels of abstraction. First you have to carve the mapped area into distinct regions, choosing where to draw the boundaries either by eye or by some algorithmic method. Then you flatten the data within each region, turning what is surely a heterogeneous area into a uniformly pink or blue or yellow district.
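The flattening step is easy to state in code. The sketch below assumes the boundaries have already been drawn, so that a region is just a list of census blocks. The function name and the counts are hypothetical, and coloring by plurality group is only one possible flattening rule.

```python
from collections import Counter

def flatten_region(blocks):
    """Sum group counts over a region's blocks; return (plurality group, its share)."""
    totals = Counter()
    for counts in blocks:
        totals.update(counts)  # Counter.update adds counts rather than replacing them
    group, n = totals.most_common(1)[0]
    return group, n / sum(totals.values())

# Two made-up blocks with quite different compositions.
region = [
    {"white": 60, "black": 30},
    {"white": 20, "black": 40, "asian": 10},
]
print(flatten_region(region))
```

The region would be painted in a single color even though the plurality group makes up only half of its residents, which is exactly the information a “miniature states” map throws away.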

For his Chicago maps Rankin adopted a more direct alternative: In each census block he drew a dot of the appropriate color for each 25 people of a given racial group or income category. The dots were randomly placed within the blocks. For example, my boyhood block near Philadelphia is listed in the 2000 census as having a total population of 95, of whom 66 are white, 25 are black, and 4 are Asian. Thus there ought to be 2.64 red dots, one blue dot and 0.16 green dots in the map area corresponding to that block. (How best to deal with fractional dots is an interesting methodological question.) Rankin drew the maps with the ArcGIS geographic information system, which has a built-in function for random-dot mapping. When Fischer undertook his 100-city mapping project, he wrote his own code for dot placement, based on a simple approximation. Instead of choosing random coordinates within the polygons that define the census blocks, Fischer placed the dots at random within disks of equivalent area, centered on an “internal point” that the Census Bureau specifies for each block. With this scheme some of the dots may stray outside the bounds of a census-block polygon, but the inaccuracy is probably minor at the scale of a metropolitan area. If one were to refine the technique, it might be helpful to replace the arbitrary constant of 25 persons per dot with a parameter that depends on the scale of the map. Thus close-up views of neighborhoods would have finer resolution.
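The arithmetic and the disk approximation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Fischer’s actual code. The function names are hypothetical, and fractional dots are handled here by stochastic rounding (2.64 dots becomes 3 with probability 0.64 and 2 otherwise), which is just one plausible answer to the methodological question. The `people_per_dot` parameter stands in for the constant of 25 and could be made to depend on map scale.

```python
import math
import random

def dot_count(population, people_per_dot=25, rng=random):
    """Number of dots for a group: the whole part, plus one extra dot
    with probability equal to the fractional part (stochastic rounding)."""
    expected = population / people_per_dot
    whole = math.floor(expected)
    extra = 1 if rng.random() < expected - whole else 0
    return whole + extra

def dots_in_disk(cx, cy, area, n, rng=random):
    """n points uniformly distributed over a disk of the given area,
    centered on the block's internal point (cx, cy)."""
    radius = math.sqrt(area / math.pi)
    points = []
    for _ in range(n):
        # Taking the square root of the random number keeps the dots
        # uniform over area instead of crowding them near the center.
        r = radius * math.sqrt(rng.random())
        theta = 2 * math.pi * rng.random()
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

# The block from the text: 66 white, 25 black and 4 Asian residents.
# Coordinates and area are placeholders, not real census values.
rng = random.Random(2000)
for group, count in [("white", 66), ("black", 25), ("asian", 4)]:
    n = dot_count(count, rng=rng)
    print(group, n, dots_in_disk(0.0, 0.0, area=1.0, n=n, rng=rng))
```

Averaged over many blocks, stochastic rounding yields the right number of dots for each group, although any single small block may show one dot too many or too few.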

To my taste, the dotty style of mapping has at least two major advantages over the “miniature states” approach. First, a single graphic device successfully conveys two kinds of information; it shows overall population density as well as racial/ethnic composition. (The empty areas of these maps are sometimes as intriguing as the populated ones—it’s fascinating to see how much land we are willing to cede for golf courses, airports and cemeteries.) Second, in the dotted maps, boundary lines are not imposed on the data but rather emerge from the data. Moreover, we can see at a glance just how hard-edged or fuzzy each boundary is.

•     •     •

Beyond matters of cartographic technique, there is the question of what social meaning we should attribute to these maps. Why are so many cities divided into large monochrome domains? In the 1950s, the whites-only status of some neighborhoods was enforced by coercive means—legal and illegal, and occasionally violent. That has changed, and yet the boundaries persist. Why? This is a huge question, the subject of learned dissertations, and I don’t pretend to have an answer. But I would like to say a word about one well-known mathematical model that seems to offer hope of a benign explanation.

In the late 1960s Thomas C. Schelling, an economist now at the University of Maryland, devised a simple lattice model of residential segregation. Quoting myself:

Black and white residents, initially scattered at random over the nodes of the lattice, were assumed to prefer living among neighbors of the same race; those who were unhappy with their current surroundings could move. Schelling’s most provocative finding was that it doesn’t take vicious bigotry to produce a sharply segregated housing pattern; even the mildest preference for neighbors of the same race leads to a phase separation.

Thus we are invited to believe that our social landscape is a product of congregation rather than segregation.

Do the Rankin and Fischer maps lend any support to this notion? Well, the maps don’t look much like computer simulations of the Schelling model (there are dozens on the web), which tend to yield sinuous, pulsing blobs of population, like zebra stripes or leopard spots, and not at all like Eight Mile Road. But maybe that’s just because the simulations are run on a perfectly uniform background, whereas real cities have rivers and freeways and other physical barriers, as well as political and administrative boundaries, not to mention gradations in the size and price of houses. I suppose it’s appropriate to say that further research is needed.
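For readers who want to experiment, a Schelling-style simulation on a uniform background fits in a short Python program. What follows is a simplified sketch, not a reproduction of Schelling’s 1971 procedure. The lattice wraps at the edges, unhappy residents jump to a randomly chosen empty cell rather than searching for a satisfactory one, and all names and parameter values (lattice size, 10 percent vacancy, a threshold of 0.3) are assumptions chosen for illustration.

```python
import random

def make_grid(n, empty_fraction, rng):
    """n x n lattice with equal numbers of 'A' and 'B' residents; None marks an empty cell."""
    cells = n * n
    residents = int(cells * (1 - empty_fraction)) // 2
    flat = ["A"] * residents + ["B"] * residents + [None] * (cells - 2 * residents)
    rng.shuffle(flat)
    return [flat[i * n:(i + 1) * n] for i in range(n)]

def like_fraction(grid, r, c):
    """Share of the occupied cells among the eight neighbors that match the resident at (r, c)."""
    n = len(grid)
    same = other = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) == (0, 0):
                continue
            x = grid[(r + dr) % n][(c + dc) % n]
            if x is None:
                continue
            if x == grid[r][c]:
                same += 1
            else:
                other += 1
    total = same + other
    return same / total if total else 1.0  # a resident with no neighbors counts as content

def step(grid, threshold, rng):
    """Move each resident who is unhappy at the start of the step to a random empty cell."""
    n = len(grid)
    occupied = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is not None]
    empties = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    movers = [(r, c) for r, c in occupied if like_fraction(grid, r, c) < threshold]
    rng.shuffle(movers)
    for r, c in movers:
        i = rng.randrange(len(empties))
        er, ec = empties[i]
        grid[er][ec], grid[r][c] = grid[r][c], None
        empties[i] = (r, c)  # the vacated cell becomes available to later movers
    return len(movers)

def mean_like_fraction(grid):
    """Average share of like neighbors over all residents: a crude index of segregation."""
    n = len(grid)
    values = [like_fraction(grid, r, c)
              for r in range(n) for c in range(n) if grid[r][c] is not None]
    return sum(values) / len(values)

rng = random.Random(1971)
grid = make_grid(30, 0.1, rng)
before = mean_like_fraction(grid)
for _ in range(50):
    if step(grid, threshold=0.3, rng=rng) == 0:
        break
after = mean_like_fraction(grid)
print(f"mean share of like neighbors: {before:.2f} -> {after:.2f}")
```

A threshold of 0.3 means a resident is content even when 70 percent of the neighbors belong to the other group, yet in runs of this kind the index generally climbs well above its starting value of about one half. Adding barriers or uneven housing costs to the lattice would be the natural next experiment.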

It will probably be a few years before we have block-level results from the 2010 census. When those numbers start coming in, I look forward to revised versions of these maps. I’m hoping they’ll look a little fuzzier.

[Note: Schelling's main paper on the segregation model does not seem to be available online. The journal reference is: Schelling, Thomas C. 1971. Dynamic models of segregation. Journal of Mathematical Sociology 1:143–186. Dietrich Stauffer and Christian Schulze have written a lucid, somewhat critical, description and evaluation of Schelling's model, available at arXiv:0710.5237.]

Update 2010-10-24: Bill Rankin writes to let me know that he has a Philadelphia map prepared with his more-precise technique of placing dots at random within the bounds of census-block polygons. Below is a detail of West Philadelphia and some of the adjacent suburbs. The complete map is available here.

race distribution in West Philadelphia and suburbs, map prepared by Bill Rankin