Shut up and calculate!

12 August 2008

In my latest American Scientist column I advert to a famous passage in Leibniz (translation by Robert Latta, 1898):

When controversies arise, there will be no more necessity of disputation between two philosophers than between two accountants.  Nothing will be needed but that they should take pen in hand, sit down with their counting tables and (having summoned a friend, if they like) say to one another: Let us calculate.

That final exhortation, “Let us calculate,” is a single word in Leibniz’s Latin text, lending me a title for my column: Calculemus! I might have expressed the same thought more forcefully, quoting somebody or other, as: “Shut up and calculate!”

The column is a manifesto in support of computer programming as a tool for exploring, experimenting and problem-solving—what I like to call inquisitive computing. This is an activity that shouldn’t need defending or promoting, and yet I feel it is becoming a neglected art. These days, computer programming is often viewed mainly as an aspect of software development—but that’s not the kind of programming I have in mind. In software development, the program is the product; it’s what you hand over to the end user. Inquisitive computing is different: The program is just a device for getting an answer. The ultimate goal is not the program itself but the result of running the program. Indeed, once the answer is in hand, the program is often of no further use or interest and can be thrown away.

My gripe is that tools for inquisitive computing are not getting as much attention as they once did. The world of software development offers luxurious, richly appointed programming environments, systems such as Xcode on the Macintosh, or the Eclipse editor favored by many Java programmers. But these systems are not well-adapted to the needs of inquisitive computing, where the emphasis is on low overhead and incremental, trial-and-error methods.

Am I alone in feeling aggrieved about this situation? I’d be interested in knowing what others think. Do you do the kind of programming that I’m describing as inquisitive computing? What tools do you use? Are you happy with them? (Let me hasten to add that I don’t see this as another food fight over the virtues of various programming languages. I’d welcome well-made environments for inquisitive computing based on any language whatever.)

In any case, I don’t want to sound too grumpy about this. The new column is also supposed to be an anniversary celebration: I’ve been writing these essays for 25 years. To mark the occasion I am trying to make some of my earlier work available online. There are no machine-readable copies of the early columns, so my only recourse is to run paper pages through the scanner. The result is a bloated PDF of marginal quality; sorry, it’s the best I can manage. So far I have scanned about a dozen of the columns I wrote for Scientific American and Computer Language in the 80s and for The Sciences in the 90s. I’ll be adding more over the next few weeks. There are links in my publications list.

Big Money

3 August 2008

Zimbabwean bank notes, including a ZW$50,000,000,000 Special Agro-Check

(Photo courtesy ZeroOne.)

It’s a cruel irony: As the citizens of Zimbabwe sink into bitter poverty, they are becoming millionaires and billionaires. Inflation is eroding the value of the Zimbabwean dollar so rapidly that everyday transactions turn into lessons in the arithmetic of large numbers. When the photo above was made on July 17, the largest currency denomination in circulation was a note for ZW$50,000,000,000. Last week the nation’s central bank issued a ZW$100,000,000,000 bill. (I’ll spare you the trouble of counting zeroes: That’s 1011, or 100 billion by American reckoning.)

The Zimbabwean inflation is the worst in the world at the moment, but it is not (yet) setting all-time records. Probably the most famous episode of extreme inflation was that of the German Weimar Republic (a story told vividly in Erich Maria Remarque’s novel The Black Obelisk.) In 1921, German marks traded at about 60 to the U.S. dollar; two years later, in December of 1923, the exchange rate was 4.2×1012 per dollar. The Hungarian inflation following World War II reached even greater numerical heights. In a single year the exchange rate for the Hungarian pengo went from 100 per U.S. dollar to 4×1029. As Feynman said, astronomical numbers are dwarfed by economical ones.

Takayuki Mizuno, Misako Takayasu and Hideki Takayasu have analyzed the German and Hungarian episodes of “hyperinflation.” (Citation: Physica A 308 (2002) 411; there’s also an arXiv preprint.) Inflation at its worst, they find, proceeds at a doubly exponential rate. In other words, prices rise not just as an exponential function of time—exp(t)—but as an exponentiated exponential—exp(exp(t))—or:

doubleexpt.png

This growth law has a simple meaning in terms of everyday experience. With “ordinary,” single-exponential inflation, prices have a constant doubling time. If bus fare was 1 million last month and 2 million this month, it will be 4 million next month. Under double-exponential growth, the doubling time itself decreases exponentially. In the last months of the Hungarian inflation the doubling time fell from about 20 days to 15 hours.

On a logarithmic scale, a simple exponential function yields a straight-line graph. Here is the Mizuno-Takayasu evidence that the final phase of the Hungarian inflation was superexponential:

Mizunofg1.jpg

And here are the data for the final six months plotted as log(log(p(t))), showing a simple linear trend:

Mizunofg2.jpg

How does the Zimbabwean economy look when submitted to this kind of scrutiny? I don’t know of a reliable source of data on prices in Zimbabwe, but foreign exchange rates can serve as a rough proxy. Until three months ago, the official ZW$ rate was pegged at roughly 30,000 per US$, but on May 10 the currency was allowed to float free, and the rate immediately jumped to 190,000,000 ZW$ per US$. By July 31 the rate had reached 57,381,544,140. Thus the 50 billion ZW$ note in the photo above was worth a little less than a 1 US$ by the end of last month. And that’s at the official rate of exchange; the street value is reportedly about a tenth of the official quote.

Here’s how the official exchange rate has varied in the 84 days between May 10 and August 1, as plotted on a linear scale:

ZW-rates.png

And here’s the same data after a logarithmic transformation:

ZW-log-and-fit.png

Although there’s more bumpiness here than in the Mizuno-Takayasu data, the trend looks reasonably linear to me. The fitted line has slope 0.03358, which yields a doubling time of about nine days. I see no hint of superexponential growth. I’d like to think this is an encouraging sign, a glimmer of hope that Zimbabwe will be spared an even more pernicious phase, when even inflation has inflation.

Runaway inflation is usually blamed on the incompetence or malevolence of governments and the central banks that implement their policies. In the case of Zimbabwe, the government of Robert Mugabe certainly has a lot to answer for. The country was once the shining success story of southern Africa—I have friends who migrated across the continent to go to school there—but the nation is now a basket case, and inflation is only one of many urgent crises. (The unemployment rate is reported to be 80 percent.) The Mugabe regime can’t escape blame for this situation. Still, it seems that hyperinflation is not to be explained purely in terms of fundamental economic imbalances—too many dollars and not enough goods. Sometimes it seems there is also a psychological component. When you believe that prices will double next week, you raise your own prices in anticipation. It’s a self-reinforcing process.

One sign of such a feedback loop in the inflationary spiral is that inflation sometimes stops even though the underlying economic situation hasn’t really changed. The Weimar hyperinflation ended with the introduction of the Rentenmark, which was set equal to 1012 old marks but really had no firmer backing than the earlier Papiermark. The change in currency did nothing to solve Germany’s problems of debt and unemployment, but the inflation ended anyway. Evidently, people chose to believe that the value of the Rentenmark would remain stable, and it did.

The central bank of Zimbabwe has just announced a similar effort at currency reform, devaluing the ZW$ by a factor of 1010. In other words, the ZW$100,000,000,000 note introduced a week ago is equal in value to a new ZW$10 bill. According to press reports, the main motive for the change was simply logistical convenience:

Gideon Gono, the Central Bank governor, … acted because the high rate of inflation was hampering the country’s computer systems. Computers, electronic calculators and automated teller machines at Zimbabwe’s banks cannot handle basic transactions in billions and trillions of dollars. (AP/Baltimore Sun)

But perhaps one can hope that the newly denominated currency will bring more than numerical benefits. Over the weekend, the official exchange rate has held at 6.569 new Zimbabwe dollars to the U.S. dollar. We’ll have to wait a few more days to see if the curve has really flattened out.

Update 2008-09-04: With another month of exchange-rate data, here’s what the situation looks like:

ZW-rates-904.png

ZW-log-and-fit-904.png

The blue line in the semilog graph is the same as the one in the corresponding earlier graph—that is to say, it is fitted to the first 80 days of data. It appears that the inflation rate has diminished slightly since the revaluation at the end of July. But that slightly lower rate is still formidable; in a little more than a month the value of the new Zimbabwe dollar has fallen from about 15 cents (U.S.) to about 2 cents.

Update 2008-10-02: After another month, what passes for good news is that the rate of exponential growth does not seem to be growing:

On the other hand, news reports suggest that the situation in Harare is bleaker than ever. Money is scarce as well as nearly worthless; people stand in line all night for the privilege of withdrawing the equivalent of a dollar or two from their own bank accounts. (Note that the equivalent of $1 U.S. is $ZW137 in the devalued currency issued in August. In pre-devaluation Zimbabwe dollars, it comes to $ZW1.37 trillion.)

Isn’t it curious that both here in the U.S. and in Zimbabwe, the financial pages are filled with such enormous numbers.

Update 2008-11-02: One more month of data:

Still no sign of “hyperinflation”—if that term is taken to mean doubly exponential growth—but that can’t be much solace to the Zimbabweans whose currency has yet again lost three-fourths of its value over the course of a month. Adjusting for the August devaluation, one U.S. dollar now buys 5.6 trillion Zimbabwean dollars.

Rosalind

25 July 2008

This morning I’m about to get on the T and go to Cambridge City Hall for a significant personal event. I’d like to celebrate the occasion with a few lines of verse by the great mathematician (and lesser poet) James Joseph Sylvester. (A tip of the top hat to Jerry Alexanderson for bringing Sylvester’s versifying to my attention.)

Fairest, O ! of lily-kind,
Perfect pearl and priceless find !
Pure as poet’s milk-white hind,
Spirit ! from all dross refined,
Hearts to ravish Heav’n-designed ;
Fresh as rills that sparkling wind
On their way the sea to find,
[O'er smooth pebble-bed patíned.]
Dew-drop ! on which sun hath shined,
Shedding glamour undefined,
And a light that doth remind
of the cloudless heav’n behind.
Whose light laugh, like bells sweet-chimed,
[Or hushed trill of bow refined
Touched with fingers taper-tined,]
Piercing feeling’s inmost rind,
Tone-spray tossing on the wind
Leaves no smart unmedicined.

  With fond folly all untined,
  To each duteous law resigned ;
    (Round her finger she can wind
    All to obey her disinclined :
    Who so wilful or so blind
    To say nay to Rosalind !)

F. Fortesque Fingerhut

16 July 2008

Riffling through some file folders last night, I happened upon an item that I evidently clipped out of Datamation years ago. It’s titled “Magic Moments in Software,” by Deborah Sojka and Philip H. Dorn. I can’t find a date on any of the pages, but internal evidence suggests it’s from the early 1980s.

The piece is organized as a timeline, listing various notable events in the history of software, such as the first successful run of a “user-written, meaningful Fortran program” (20 April 1957), the release of Lisp (1962) and BASIC (1964), and the publication of The Mythical Man-Month (1975). One of the early entries caught my eye:

1949—F. Fortesque Fingerhut, while trying to debug his first program on the ACE computer at the National Physical Laboratory, cannot find the problem. He cracks under the strain, disappears, and is not seen until 1981 when he reemerges as the net court judge at Wimbledon.

All the other events recorded in this chronology seem to be genuine, and there’s no April Fool disclaimer at the end. Is it possible that the story of F. Fortesque Fingerhut is not a joke?

arXival mysteries

2 July 2008

Catching up on new submissions to the arXiv, I came across a paper by Robert Baillie, “Summing the Curious Series of Kempner and Irwin,” which is item 0806.4410 in the mathematics listings. Here’s the abstract, exactly as it appears at http://lanl.arxiv.org/abs/0806.4410v1:

In 1914, Kempner proved that the series 1/1 + 1/2 + … + 1/8 + 1/10 + 1/11 +… + 1/18 + 1/20 + 1/21 + …, where the denominators are the positive integers that do not contain the digit 9, converges to a sum less than 90. (The actual sum is about 22.92068.) In 1916, Irwin proved that the sum of 1/n where n has at most a finite number of 9’s is also a convergent series. We show how to compute sums of Irwins’ series to high precision. For example, the sum of the series 1/9 + 1/19 + 1/29 + 1/39+ 1/49 + … where the denominators have exactly one 9, is about 23.04428708074784831968. Note that this is larger than the sum of Kempner’s “no 9″ series. We also show how to construct nontrivial subseries of the harmonic series that have arbitrarily large, but computable, sums. For example, the sum of 1/n where n has at most 434 occurrences of the digit 0 is about 10016.32364577640186109739.

Baillie’s article is full of really interesting mathematics and algorithmics, which ought to be reason enough to mention the paper here. But it was something stranger that caught my attention. Look closely at the large number in the last line of the abstract. The HTML source of the arXiv page looks like this:

  1<a href="abs/0016.3236">0016.3236</a>4577640186109739.

It seems the sequence 0016.3236, embedded in a larger string of digits, has been interpreted as an arXiv identifier. I can only guess that some program at arXiv.org is scanning abstracts looking for strings of the form nnnn.nnnn, where n is any decimal digit. It’s a little like those rogue search-and-replace scripts that do amusing things like turning every “gay” into a “homosexual.”

As it happens, there is no arXiv paper with the identifier 0016.3236. There can’t be because the identifier format is actually yymm.nnnn, encoding the year and month of submission in the first four digits. Obviously there is no month 16, and the identifier scheme had not yet been introduced in the year 00. Thus, as far as I know, the competition is still open for the first perfectly self-referential arXiv preprint—one that finds a legitimate reason to embed its own identifier in its abstract. (Part of the challenge is that you can’t know in advance—at least not with high precision—what number will be assigned to a paper when it is submitted.)

[More on the Kempner series: Baillie and Thomas Schmelzer also have an article on related work in the latest American Mathematical Monthly. The article is available (pdf) from Schmelzer's web site.]

Update 2008-09-07: Because this post makes fun of programs that automatically scan and “fix” text, it inevitably suffers that very fate. Although the long quotation above was pasted into my editor window exactly as it appeared in the original arXiv abstract, that’s not how the quotation will arrive on your screen. If you try to follow the link embedded in that long number, you’ll find that the address is not “abs/0016.3236″ but “bit-player.org/2008/abs/0016.3236″. Somewhere in the WordPress software is a routine that “corrects” any URL lacking a top-level domain. I have resisted the urge to correct the correction. I note that arXiv is following the same policy: The latest version of the Baillie abstract (posted 17 August) still includes the curious embedded link.

Update 2009-07-31: In the 2008-09-07 update I unjustly impugned WordPress. It is not WordPress that is munging the URL; it is my very own browser (and probably yours too). A URL without a fully qualified domain name is interpreted by the browser as a link relative to the current site. Viewing the page source shows that the “abs/0016.3236″ is not changed within the HTML. Many thanks to Keith Beckman for setting me straight on this.

Unscrabbled

1 July 2008

I’ve been Scrabbling by email lately. In today’s game my partner started out by playing

                    H
                    E
                    X

and I responded with

                   A H
                   W E
                   E X

At this point my opponent might well have continued with another three-letter word to make a tidy square block such as:

                  Y A H
                  E W E
                  S E X

In actuality she did something quite different. She played a seven-letter “bingo,” using all her letters to earn a 50-point bonus; as a result I’m hopelessly far behind in the scoring. But let us say no more about the tawdry details of winning and losing; there’s a puzzle here. Looking at that three-by-three block of letters and words, it occurs to me there must surely be legally reachable configurations of a Scrabble board that have no legal continuation. Scrabble rules say that, except for the first move, letters can be added to the board only on squares adjacent to existing letters, and all sequences of two or more letters (both vertically and horizontally) must be dictionary words. The rules say nothing about the situation where continued play is impossible.

I’m sure there must be many stymied positions, where no words can be formed, regardless of what letters you have on your rack. Or so I assert; but, the fact is, I haven’t been able to find even one such configuration. A cursory examination of the list of all allowed two-letter words argues that no two-by-two block of letters can be stymied. What about two-by-three or three-by-three blocks? Somebody must have settled these questions, but my Googling has failed to find the answer. What is the smallest stymied position? (I don’t require that a solution be a rectangular block of letters, but having stray letters dangling off the edges of a block makes it easier rather than harder to form words.)

Sleight of handle

23 June 2008

As I mentioned, the American Scientist web site is undergoing an overhaul. One aspect of the transition that’s still in transition is redirecting http requests so that old links and bookmarks will retrieve the correct document on the new site. I wish I could snap my fingers and fix this problem globally, but that seems to be beyond my power. Over the weekend, however, I decided I could at least try to reduce the entropy of my own little corner of the WWW by repairing all the links at bit-player.org that point to American Scientist articles. It wasn’t quite as much fun as I had expected.

The first problem is that the old links are completely opaque. They look like this:

     http://www.americanscientist.org/AssetDetail/assetid/48550

The identifier “assetid/48550″ offers no clue to what it is identifying. It could be anything the magazine has published in the past 10 years. The second problem is that you can’t just follow the link to find out where it leads. None of these links are working anymore—that’s the whole point.

Does this situation suggest a certain lack of foresight on my part? Oh well. I’ll just sit here in the corner until the paint on the floor dries.

For links that point to my own columns, I’m generally able to infer from their context what the proper target is. And, if I get stuck, the Wayback Machine is there to rescue me (thank you Brewster Kahle). Still, reviving a batch of dead links is a dreary exercise. I find a link; I figure out which column it refers to; I run a search on the new web site to find the updated URL; I record the mapping from old link to new, so that this information won’t be lost; I correct the link in the bit-player posting. Wash, rinse, repeat.

After fixing only about a dozen links in this way, I came to a moment of self-knowledge: During my remaining years on this planet, however few or many they may be, I never want to go through this process again. Which raises the question: What’s the best way to create permanent pointers to objects that may not stay where you put them?

The answer to this question has been known for decades. What’s needed is double indirection: a pointer to a pointer. Also known as a handle.

  • Zeroth-order indirection: I send you the document itself.
  • First-order indirection: I send you an address where you can find the document.
  • Second-order indirection: I send you an address where you can find the address of the document.

At order zero—with no indirection at all—the document is beyond my control from the moment I send it. With a single level of indirection, I can update or correct the document and you’ll see the new version, but if I move it, you’ll lose access. With double indirection I can change either the content of the document or its location whenever I please, as long as I take care to update the intermediary pointer—the forwarding address.

Double indirection is already a well-established technology in Internet operations. It is the basis of the Domain Name System. When you follow a link to “bit-player.org,” a nameserver looks up that string of characters and returns an IP number, such as 69.89.21.70, which specifies the real whereabouts of the page you’re now reading. I can move bit-player.org to a new host with a new IP number and you’ll still be able to find me, provided I let the world’s nameservers know the new address.

The same kind of mechanism can be made to work at a finer level of detail. In particular, Digital Object Identifiers offer an infrastructure for attaching permanent names to documents or other online resources. For example,

    http://dx.doi.org/10.1511/1998.5.410

is the DOI irrevocably assigned to one of my old columns, titled “Bit Rot.” In that column, written 10 years ago, I discussed the sad tendency of digital information to go stale, to decay, to become inaccessible. Alas. As you have doubtless guessed, the DOI has lost touch with its target; following the link above will get you nothing but a “page not found” error. And fixing errant DOIs looks no easier than fixing raw, broken links. I’m afraid that “dx.doi.org/10.1511/1998.5.410″ is almost as opaque as “assetid/15568.”

In practice, the world seems to have settled on a different mechanism of double indirection for keeping track of stuff on the web. We don’t try to remember or record URLs; we simply go to Google and run a search. The trouble is, though, Google itself relies on the whole distributed network of links to trace and rank documents. If everyone counts on Google to know where everything is, Google will have no way of finding anything.

According to legends of yore, the Internet was designed to survive a nuclear attack. And at the hardware level, the Net is indeed extraordinarily resilient. But the superstructure of linked documents we’ve built atop that foundation is not so unshakeable. If we were to wake up one morning and find that all the links on all web pages were broken, it wouldn’t be much consolation that the underlying documents still existed. Much of their value lies in the connections between them.

Tim Berners-Lee, the first weaver of this web, offers some excellent advice: URLs should never change. If only we’d thought of that sooner.

Back in 1994 I wrote a column titled simply “The World Wide Web.” It was all so new then. My elaborate definitions and explanations seem very quaint now, like someone describing in meticulous detail how to dial a telephone or flush a toilet:

What is the Web? Is it a place? A program? A protocol? One of the <A HREF=”http://info.cern.ch/hypertext/WWW/TheProject”> documents</A> in which the Web describes itself offers this assessment: “The World Wide Web (W3) is the universe of network-accessible information, an embodiment of human knowledge.” That about covers it.

For something as vast as a universe, the Web is surprisingly easy to find your way around in. It works like this. On a computer connected to the Internet you start up a program called a browser; the browser goes out over the network and retrieves a document, which we can assume for the moment is simply a page of text. Within the text are some highlighted phrases, displayed in color or underlined. When you select one of the highlighted words by clicking on it with a mouse, a new document appears, with new highlighted “links.” Clicking on one of these links takes you to still another document. Each time you follow a link, you may be visiting another network site, perhaps quite distant from your original destination as well as from your own location.

Is it any surprise that the link embedded in this passage—which I have formatted so that the URL remains visible—summons a “404 Not Found” message?

Jottings on .js

22 June 2008

Theorists and theologians of programming languages give a lot of thought to issues like referential transparency, lexical scope rules and idempotency. More often than not, though, programming languages live or die for reasons that have nothing to do with such syntactic and semantic virtues. In the early 1980s everyone wrote programs in Microsoft BASIC because that’s what shipped with the IBM PC. A little later we all switched to C because…. Well, I’m actually not sure why, but I know it had nothing to do with referential transparency. Turbo Pascal was enormously popular for a very simple reason: It cost 50 bucks at a time when a C compiler would set you back $500.

Now we have JavaScript. It’s not Microsoft BASIC or Turbo Pascal. As a matter of fact, it’s a language with some decidedly highbrow features. And yet I’m pretty sure the root of its popularity does not lie in its lexical closures or its first-class functions. JavaScript is thriving because it’s the way to make nifty stuff happen inside a web page. Those Google maps that glide around in their box when you push them with the little hand—that’s JavaScript magic, and I was thunderstruck the first time I saw it. And the drag-and-drop pictures on Flickr, the fill-in forms that won’t let you schedule a delivery on February 30th, the online surveys, the clickable seating chart on the airline-reservation page, the everybody’s-an-editor features of Wikipedia, virtually everything on Facebook—none of it would happen without JavaScript. (It’s only fair to mention that most of the obnoxious things on web pages, such as popup ads, also depend on JavaScript.)

So here are a few miscellaneous JavaScript developments that I’ve been noticing lately.

John Resig has ported the Processing graphics language to JavaScript. The original Processing implementation of Ben Fry and Casey Rees produces Java applets. Java programs can also be embedded in web pages (here’s an example ready at hand), so what’s the point of switching to JavaScript? Both languages have their pluses and minuses, but Java has just enough startup overhead that many folks in our impatient age are reluctant to run it. JavaScript avoids that delay.

(Note: This is where I’m supposed to explain that Java and JavaScript are not dialects of the same language, or even closely related. On the other hand, JavaScript is the same as LiveScript, JScript and ECMAScript, and it’s almost indistinguishable from ActionScript. I don’t know what we did to deserve this confusion of names. Maybe it had something to do with excess consumption of froufrou coffee drinks. For more on the sorry history of the two languages see Steve Champeon or Douglas Crockford.)

Cat is another graphics language with an interpreter that emits JavaScript.

Looking beyond all the web-gui-glitz, and more in the direction of referential transparency and the like, Sjoerd Visscher has released some interesting code that makes JavaScript look more like Haskell.

There’s a new JavaScript (excuse me, ECMAScript) standard on the way, someday. (It’s been coming for at least seven or eight years.)

A couple of things I should have known long ago but just discovered: The dashboard “widgets” introduced in Apple OS X 10.4 a few years ago are coded in JavaScript. And several big Adobe applications (InDesign, Photoshop, Illustrator) have JavaScript interpreters built in.

There’s more news from Apple: WebKit, the open-source core of the Safari browser, is getting a new JavaScript interpreter called SquirrelFish. Here’s what the announcement says:

SquirrelFish is a register-based, direct-threaded, high-level bytecode engine, with a sliding register window calling convention. It lazily generates bytecodes from a syntax tree, using a simple one-pass compiler with built-in copy propagation.

As for the deep mystery of the name “SquirelFish,” if there is there a connection to Holocentrus adscensionis, I don’t get the joke.

Still another technology connected with Apple—and still another goofy name—is SproutCore:

SproutCore is a framework for building applications in JavaScript with remarkably little amounts of code. It can help you build full “thick” client applications in the web browser that can create and modify data, often completely independent of your web server, communicating with your server via Ajax only when they need to save or load data.

The provenance and current status of this system are somewhat murky to me. SproutCore was created by Charles Jolley, an independent software developer, but apparently it was embraced and promoted at the recent Apple Worldwide Developer Conference as a technology for creating web applications that resemble the OS X Cocoa framework (see here and here). But so far Apple has said nothing openly to confirm their adoption of SproutCore. It’s worth noting that there are lots of other JavaScript libraries and frameworks (for example, the Yahoo User Interface Library YUI, qooxdoo, John Resig’s JQuery).

When I look upon all these somewhat baffling developments, my role is that of a not-quite-innocent bystander. I don’t have the expertise or experience to speak authoritatively on the best choice of programming language and development environment. On the other hand, I have an interest in the outcome of the language competition. I want to be able to write small, illustrative programs (like the Processing sketch mentioned above, or another done in ActionScript) and make them available to the world at large via web pages. JavaScript and its variants are likely to be the vehicle for that activity. Could we make it a smooth-running and reliable vehicle, please?

I close with an interesting comment from Hackerdashery:

As the correct metaphor for a web page moves farther from “document” and closer to “application”, maybe it makes sense for browsers to act more like operating systems.

Bloom-filtered Britney

9 June 2008

Imagine an unending stream of names:

Britney, Brad, Angelina, Britney, Pamela, Jessica, Jessica, Britney, Clay, Brad, Britney, Britney, Pamela, Clay, Brad….

Your job is to keep a running tally of the number of unique names. (The snippet above, with 15 names altogether, has 6 unique names.)

There’s a straightforward method of solution: Keep a record of all the unique names seen so far, then match each name received against all those already on record. If a new name matches one in your database, ignore it. If there is no match, add the new name to the database and increment the count of unique names.

The trouble with this scheme is that the memory requirement is potentially unbounded. You’ll need to store one copy of each unique name. In the worst case (where no name is ever repeated), your database has to be capable of storing the entire stream. If that stream consists of, say, hundreds of millions of queries submitted to an Internet search engine—with more pouring in every second—you may have a hard time keeping up.

Tasks like this one are the subject of my new “Computing Science” column, titled “The Britney Spears Problem.” It’s now available online, and paper-and-ink magazines should be reaching subscribers and newsstands soon. (Note: American Scientist has just revamped its web site. A lot of effort has gone into this migration, but it’s still a work in progress, so please be patient. Links to the old site are not yet being redirected to the corresponding page of the new site. If you discover anything else that’s not working or not to your taste, I’d be grateful if you’d drop me a note or leave a comment below. I’ll pass suggestions along.)

But back to Britney and her friends.

The basic assumption in the study of stream algorithms is that you get only one look at the stream. You are standing on a bridge, watching the river flow by below. When something interesting floats along in the current, you can choose to grapple it out and store it on shore. But if you let it float past, you never get another chance at it. And you have only a fixed amount of storage space for things you choose to save. There’s also a time constraint: You have to be finished with one object before the next comes along. (In some cases the time constraint can be relaxed to an averaged or “amortized” bound.)

With a fixed and finite amount of storage and a stream of infinite variety, there’s no way to count the unique elements exactly. In my American Scientist column I describe some approximation methods. Here I want to briefly mention another approach that I didn’t have room to include in the column. It’s called a Bloom filter, after Burton H. Bloom, who described the idea in 1970. (See Bloom, Burton H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7):422–426. Link (subscription needed).)

The term “filter” is a little misleading in this context; what Bloom defined is really a specialized memory. The underlying data structure is an array of m bits, all initially set to zero. Also needed are k different “hash” functions, each of which can be applied to any item being stored in the memory. The hash functions map a data element to a number between 0 and m–1; this number can then be interpreted as an index into the array of bits. (An ideal hash function is both perfectly random and perfectly deterministic—a hard combination to achieve in practice. It’s random in the sense that any data item can hash to any of the m possible values with equal probability. It’s deterministic in the sense that the same data item will always hash to the same value.)

There are two operations defined on the Bloom filter:

  • To store a data element, set each of the bits specified by the k hash functions to 1. (If any of the bits are already set to 1, leave them as they are.)
  • To find out if a data element is already present, calculate the k hash functions for that element and examine the corresponding bits; if any of the bits is 0, then the element cannot be present in the memory.

For the name-list problem, the Bloom filter offers a way to check every name as it comes along to see whether or not it has appeared before. The checking can be done in a fixed amount of space and a fixed amount of time per name. Problem solved, no? Not quite. Although the Bloom filter can never produce a false negative (claiming that a name is not present when it really is) there is a risk of false positives (reporting that a name has been seen before when in fact it is novel). For example, with k=3 and m=10, suppose the name Britney hashes to {0,3,6} and Brad yields bits {3,4,9}. Now, with those two names already stored, Angelina comes along and hashes to the values {4,6,9}. All of those bits are set, and so the Bloom filter will falsely report that Angelina has been encountered previously.

The probability of a false positive obviously depends on the proportion of bits that are set to 1 in the memory array. You might want to try working out just how this probability grows as a function of m, k and n (the number of items stored). Or take the easy way out and read Andrei Broder and Michael Mitzenmacher’s survey. I also recommend the Wikipedia article on Bloom filters.

The Bloom filter strikes me as one of those ideas that seems thunderingly obvious after you’ve read about it. But it’s really very elegant and clever.

Does anyone know who Burton H. Bloom is or was? In 1963 he wrote an MIT AI memo (No. 47) on heuristics in chess. The ACM Guide to Computing Literature lists just one other publication, from 1969. By then he was working for the Computer Usage Company in Newton Upper Falls, Mass., which apparently closed up shop in 1986. I’m not the first (see here and here) to go looking for him.

Spam stats

5 June 2008

Hormel Foods, the Minnesota meatpacker, reports a surge in sales of Spam. News accounts attribute the rising popularity of the pink meat-in-a-can to higher prices for other commodities. Or maybe it’s the Spam musubi fad.

Meanwhile, the other kind of spam seems to be surging as well. I’ve been keeping track of my personal spam consumption for the past five years. (I first wrote about this in 2003, with a follow-up in 2007.) Here’s a record of the total number of messages landing in my spam bin each month since the start of 2007:

spamvolume.png

The lull last spring gave me some hope that spam was finally in decline; the monthly intake even fell below 1,000 messages in March and April. But the respite didn’t last. There was steady growth through last summer and fall, and now another spike in volume has brought the rate to nearly 3,000 messages per month.

The message counts charted above lump together spam sent to several email addresses. Here’s a breakdown by address, covering the entire 17-month period:

mailboxes.png

The two addresses that attract the most unwanted traffic—namely, my address here at bit-player.org and another at amsci.org—are both published openly on the web, without any form of obfuscation. So are the addresses identified in the pie chart as “il-perms” and “il-prints”; they appear on my industrial-landscape.org web site. I’m certainly not surprised that spammers have discovered these addresses; they are fair game to anyone who knows how to scrape a web site. But there are still some puzzles in the data. I have several more email addresses that are equally vulnerable—they are published in the same places—but they receive nary a spam. Why not? And my earthlink.net and acm.org addresses are not published (or even much used), yet they get a healthy share of junk mail.

The content of the spam remains much the same—replica watches, blue pills, pirate software, phishing expeditions. Numbingly repetitious. In one week I got 25 messages with the same subject line: “eBay New Unpaid Item Message from snorelax67.” Then there were the 34 messages with subject lines such as “Viadzgra - $1.20,” “Viabqgra - $1.75,” “Viafmgra - $1.09″ and “Viategra - $1.38.” (Evidently someone has written a little program to insert random letter pairs in the middle of the word. My spam filter was not fooled. Nor did it fall for “Hihg - qualiyt repliacs of the ebst lcock of the wrold!!”) In “How Many Ways Can You Spell V1@gra?” I argued that most of the world’s spam is coming from a relatively small number of senders—tens or possibly hundreds, but not thousands—and I think the evidence continues to support that conjecture.

One interesting trend in my spam is that it seems to be growing more cosmopolitan. Back in 2003, about 18 percent of the spam I received was written in languages other than English; the figure now is 34 percent. The distribution of languages is curious. Here are the data for May 2008, when I received a total of 933 non-English spams:

spamlangs.png

Does everybody get gobs of spam in Russian, or is it just me? Is there something about my Internet activity that leads mailing-list compilers to believe I read Russian? Well, here’s the sad truth: My knowledge of Russian is so totally lacking that I’m not even sure all those messages are really Russian. They come with a Cyrillic character encoding, but for all I know some of them could be Bulgarian or Ukrainian. I’m equally in the dark about the 153 messages that appear to be written in various Asian languages (Chinese, Japanese, Korean). As for the German messages, they are something of a novelty. Until a few weeks ago, I almost never saw spam in German, and now there’s a sudden spate. It’s pretty clear that all of it comes from the same source. I’m seeing no French spam, nor Portuguese, nor Hindi, Urdu, Arabic, Hebrew.

Linguistic diversity is laudable, and in general I’m pleased to see challenges to Anglophone hegemony. I’m always flattered when someone addresses me in another language—even if I can’t respond in kind. But in this case I’m afraid there’s no reason to be congratulating myself. The spammers are not sending me these multilingual documents because they take me for an accomplished and urbane polyglot. They’re sending them to me (and to millions of others) because selectivity just isn’t worth the bother. Addressees like you and me are too cheap to count. Spam is becoming something like the cosmic microwave background radiation. It’s everywhere, it’s meaningless, it can be mistaken for birdshit.

Update 2008-07-01. More pink meat. I’ve tallied up the receipts for June, and my personal spam volume has set a new record: 3,354 messages, an increase of 20 percent over the previous high of 2,794 in May. The updated graph now covers 18 months:

spamvolume701.png

It’s worrisome to see the quantity growing so fast, but let me try to put the matter in perspective. Alongside the 3,354 spams I received in June, I also received 1,245 nonspam messages. Thus the proportion of spam is about 73 percent—well under the figure of 90 percent that’s often bandied about by companies that sell anti-spam products and services. Moreover, the spam causes me very little actual bother; almost all of it goes directly into the junk folder without need for human intervention. The nonspam messages, on the other hand, demand to be read and responded to. Perhaps I’d get more accomplished if more of my mail were spam.

I have not done a language analysis of the new batch, but I can tell at a glance that I’m still attracting a bizarre glut of Russian spam. A subject line that caught my eye reads:

programspam.png

I can sound out just enough Russian to guess the transliteration “programme spam.” Inside the message is an image of an advertisement (also in Russian) for various warez. But the decoy text that’s meant to get the message through the spam filters is a sports story written in German. Thus even individual messages are now becoming multilingual.

Update 2008-09-01: When I started this thread back in the spring, I thought I was taking note of a step function in the spam rate—a sudden jump from 2,000 a month to a new plateau at 2,500 a month. The trend looks different now: not a series of steps but sustained steady growth, with an increment of roughly 500 a month:

spamvolume901.png

Total spams received in my various inboxes came to 3,886 for July and 4,489 for August.

And I continue to be amazed and baffled by the quantity of Russian-language spam. The proportion of my spam written in a Cyrillic alphabet is now above 40 percent. The growth in Russian-language messages accounts for about two-thirds of the overall increase in the past few months. Should I read some geopolitical meaning into this trend?