Archive for 2006

Radio alert

Saturday, November 25th, 2006

Chris Joyce and crew \"getting ambi\"

Christopher Joyce of National Public Radio is about to launch a series of reports on the “Industriosphere,” in which I am taking part. As in my book Infrastructure: A Field Guide to the Industrial Landscape, the subject matter is all the manmade bits of the environment we live in. (What is that stuff? What does it do? Why is it there?) Two weeks ago I spent a couple of days exploring and recording in and around Washington, DC, with Joyce and his NPR crew. The first report is scheduled for broadcast tomorrow afternoon (November 26) on Weekend All Things Considered. Check your local NPR station for broadcast times, or listen on the Internet. After the broadcast, some additional material (including photographs) will be posted on the NPR web site. And there is already an Industriosphere Flickr group.

In the photo above Christopher Joyce (foreground), producer Tina Tennessen (right) and audio engineer Drew Reynolds are “getting ambi” (recording ambient sounds) near Thomas Circle in Washington, where we all spent some time playing in traffic.

Update 2006-12-03: A brief followup interview, discussing the wooden rooftop water tanks of New York City, was broadcast on Saturday 2 December 2006. The interview is online here.

Running on empty

Friday, November 24th, 2006

MINI CooperDriving over the river and through the woods yesterday, I was running low on fuel. My car has two kinds of instruments to tell me that I’ll soon be standing by the side of the road feeling foolish. A conventional gas gauge shows the fraction of a tankful remaining, presumably based on readings from some sort of float mechanism inside the tank. The second instrument measures the rate of fuel flow to the engine, showing the result on a digital display that can be set to any of three modes, labeled “consumption,” “average consumption” and “range.” The two “consumption” modes are calibrated in miles per gallon; the “range” mode gives an estimate of the distance remaining until the tank is empty, in miles. When you’re nervous about whether or not you can make it to the next gas station, the range is clearly of interest.

But keeping an eye on the range estimate is also somewhat disconcerting. If the meter says you can last another 23 miles, and then you drive a mile, it seems reasonable to expect that the meter will report a remaining range of 22 miles. In fact it may well say 19 miles, or 23, or even 26. It’s particularly bizarre to see the range increase as you continue driving.

What’s going on here? It’s not hard to guess. The estimated range is simply the number of gallons remaining in the tank multiplied by the fuel-use rate in miles per gallon. Both measurements doubtless have some noise in them, but variations in the fuel flow rate are the major cause of fluctuations in the range estimate. For my car, the instantaneous fuel economy dips down below 10 miles per gallon under hard acceleration, and it appears to go well above 100 miles per gallon when coasting downhill. (The meter tops out at 99.9 mpg.) These variations could alter the range estimate by a factor of 10 or more. Strictly speaking, the fluctuating estimates are not wrong—they indicate the actual range if you were to continue driving exactly as you were at the moment of measurement—but some averaging or filtering would seem sensible.

In fact, I think the range readings I see on my dashboard instrument are smoothed to some extent. The number is updated at intervals of about 30 seconds, and it may reflect an average calculated over a somewhat longer period. The question I want to ask is this: What is the optimum averaging interval—optimal in the sense that it minimizes some measure of error in the estimates? I doubt there can be any definitive answer without making some assumptions about the nature of the fluctuations, but I have a heuristic proposal that seems pretty good to me.

Here’s how I was thinking about the problem during my Thanksgiving pilgrimmage. Suppose you keep driving until the tank runs dry, all the while recording your distance and rate of fuel consumption, moment by moment. Retrospectively, then, it’s easy to determine the number that the range meter should have been displaying at any point during the trip: You just measure the distance backward from the point where the engine died, and by definition that’s the range remaining. But note that for every point along the route, this “retrodicted” range should be equal to the number of gallons left in the tank at that point multiplied by the fuel-use rate (in miles per gallon) averaged over the remainder of the distance. This fact suggests a perfect estimation strategy: You should always average the fuel-use rate over the remaining range. Unfortunately, the remaining range is exactly what you’re trying to calculate, so this algorithm is not very practical. But perhaps we can approximate it.

To reiterate: If the remaining range is n miles, then the ideal is to estimate this range by averaging the fuel consumption over the next n miles. We can’t quite do that, for two reasons. First, we don’t know what the average consumption will be in the n miles to come; it will depend on terrain, speed, traffic conditions, and many other imponderables. Second, we don’t even known what n is; that’s what we’re trying to estimate. But don’t despair. To cope with the first problem, we can choose some other interval of n miles as a surrogate for the n miles just ahead; the obvious choice is the n miles just behind us. As for the unknown quantity n, we calculate it iteratively. Make some initial guess r0 about the fuel consumption rate, perhaps using the long-term average since the car was manufactured. Multiply r0 by the gallons in the tank to get a first estimate n0 of the remaining range. Then take the average fuel consumption over the preceding n0 miles and again multiply by the number of gallons to get a better range estimate, n1. Only a few repetitions of this process ought to be needed to converge on a pretty good estimate of n. That estimate, of course, is what the dashboard meter will report.

With this scheme, as the estimated range gets smaller, it will also get more volatile, because the consumption rate will be averaged over a smaller interval. I argue that this tendency to wider fluctuations is not a failure of the algorithm. In the last few miles before the tank runs dry, the range really does depend sensitively on whether you’re descending a hill on the open highway or stopping and starting at a series of city traffic lights.

Something tells me I’m not the first person to think about this problem or the first to propose this solution. The same issues arise in lots of other contexts, such as predicting how long the battery will last in a laptop computer. If anyone has a plan that can beat mine, I’d be pleased to hear about it.

By the way, I arrived on time for Thanksgiving dinner, with a few drops left in the tank.

Jacobsthal numbers

Thursday, November 23rd, 2006

In an item published here last May I stumbled across the sequence

1 3 5 11 21 43 85 171 341 683 1365 2731 5461 10923 21845 43691
87381 174763 349525 699051

which I dubbed “the oddest numbers” but which Neil J. A. Sloane’s Encyclopedia of Integer Sequences calls the Jacobsthal sequence. I asked: Who is or was Jacobsthal? Keith Matthews, creator and maintainer of the Number Theory Web, has stepped forward with an answer. Ernst Jacobsthal (1882-1965) was a German-born mathematician who fled to Norway when the Nazis came to power and eventually became a Norwegian citizen and professor at Trondheim. The story is told in an obituary that Matthews has made available on the Number Theory Web both in the original Norwegian and in an English translation by Jan Kristian Haugland. Matthews also mentions a brief biography in German. And the Mathematics Genealogy Project offers the further information that Jacobsthal was a student of Georg Frobenius and Issai Schur.

A bit more poking around reveals that Jacobsthal’s doctoral dissertation is available online, published by the Humboldt-Universität zu Berlin. In this work Jacobsthal gives a proof that every prime of the form 4n+1 can be written as a sum of two squares.

I’m very grateful to Matthews for answering my idle question about Jacobsthal—but there’s no end to questions. What I’d like to know now is when and where and in what context Jacobsthal discussed his sequence. The numbers are most readily defined by the recurrence Jn = Jn-1 + 2Jn-2, with J0 = J1 = 1. How did this idea come up in his work? Several journal articles and web sites mention the Jacobsthal numbers, but I’ve yet to find a reference to any specific publication.

Stupid questions

Wednesday, November 22nd, 2006

Whether or not anyone else is reading the stuff I’m posting here, the spammers have discovered me. And I’m afraid that “comment spam” is not as amusing as the e-mail variety. I have therefore installed one of those annoying sentry programs that will ask for the answer to a trivial question before allowing you to post a comment. I’m choosing questions that I’m sure every reader of this blog will be able to answer instantly — things like, “Who published the first proof of the impossibility of trisecting an angle?” and “True or false: The present king of France is bald?” Please give it a try. If you have any trouble — and you’re not a spambot — e-mail me at brian@bit-player.org.

Update 2006-12-14: A few days ago I thought I had solved the comment-spam problem by other means, and so I turned off the interrogation. But the spammers are back, and so I have put the sentry on duty again. This time the questions all take the same form: a sequence of positive integers. You’ll be asked to name the next element of the sequence. None of them are tricky (no lists of subway stops), or even interesting. If the spambots learn to solve the problems in this set, well, good for them; at that point we can think about escalating to the next level.

What’s so special about {0,2,3,4,7,11,12,14}?

Tuesday, November 21st, 2006

Three postings here (1, 2, 3) have discussed what happens when you form all pairwise sums and differences from a finite set of integers. The number of differences almost always exceeds the number of sums—a fact that lends special interest to the occasional exceptional sets, with More Sums Than Differences (MSTD).

A question left up in the air when I last wrote on this subject was the size of the smallest MSTD set. The smallest known set was {0, 2, 3, 4, 7, 11, 12, 14}, which has 26 sums but only 25 differences. This set has eight elements. Could there be an MSTD set with seven or fewer elements? The question has now been answered in the negative by Peter V. Hegarty of the Chalmers University of Technology and Göteborg University. He proves there is no smaller set, and furthermore that {0, 2, 3, 4, 7, 11, 12, 14} is the only MSTD set of size eight. (Apart from other eight-element sets generated from {0, 2, 3, 4, 7, 11, 12, 14} by affine transformations.)

Read all about it at the arXiv.

The backblog

Tuesday, November 21st, 2006

I want to thank those of my friends who sent me links to Jenn Shreve’s amusing collection of bloggers’ “haven’t posted in awhile” excuses and apologies. What has kept me away all this time is the search for an adequate excuse.

Thanks, too, to those of my friends who didn’t send me the links.

R6RS

Friday, September 15th, 2006

Scheme, the dialect of Lisp invented by Guy Lewis Steele, Jr., and Gerald Jay Sussmann, is my first love among programming languages, though I haven’t always been faithful. There’s a new draft standard for Scheme now circulating at http://www.r6rs.org/. The proposal is open for comments until next March; on final approval it will become “The Revised6 Report on the Algorithmic Language Scheme.”

Scheme has always been a language under both tension and pressure. It started out (more than 30 years ago) as a sandbox for playing with some new ideas in programming-language semantics, and also (I think) as an attempt to “do Lisp right.” Soon it became an important teaching language, and it still is. But there’s always been pressure to make more of it. Shouldn’t a language for grownups have modules, libraries, foreign-function interfaces—not to mention classes, methods, inheritance and all the other apparatus of OO? That’s the pressure part. The tension has come from unceasing debates over issues such as the ideal syntax and semantics of macros, and the morality of creating a version of Lisp without an eval procedure.

I’ve only just begun to read the new draft, which is officially the Revised5.91 Report. But I think one can draw some preliminary conclusions just from its heft. Here are the page counts of the principal versions of the standard:

Date Pages
The Revised Report 1978 34
The Revised Revised Report 1985 76
The Revised3 Report 1986 42
The Revised4 Report 1991 55
The Revised5 Report 1998 76
The Revised5.91 Report 2006 142

An introductory essay that first appeared in the Revised3 Report begins:

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.

The same text is still present in the new document. I’ll be interested to learn whether the same spirit has survived. It’s not to be taken for granted that just because the standard has doubled and then redoubled in size, the language has also grown obese. But it’s an obvious worry.

Softer infrastructure

Thursday, September 14th, 2006

For those who’ve been waiting patiently… My book Infrastructure: A Field Guide to the Industrial Landscape is just out in paperback. Except for the cover and the price, it’s identical to the hard-cover edition (errors and all!). Statistics:

  • characters: 1,302,389
  • words: 219,748
  • sentences: 10,940
  • pages: 541
  • photographs: 735
  • weight: 4.6 pounds
  • dimensions: 10.3 x 10.3 x 1.2 inches
  • list price: $49.95 (US) hardbound, $35 paperbound
  • ISBN: 0-393-05997-9 hardbound, 0-393-32959-3 paperbound

Links for the paperback edition:

More info at http://industrial-landscape.com/index.html

End of sales pitch.

Only connect!

Thursday, September 14th, 2006

Euclid famously said, “There is no royal road to geometry.” Among the non-royal roads, the computational pathway is notably muddy, rutty and potholed. Over the weekend I needed to write a program for a simple geometric task—finding the intersection of two line segments in the plane. It’s Wednesday Thursday now, and I finally have my program. Let me tell you some of my adventures along the way.

First steps. This is a low-key, off-the-cuff, back-of-the-envelope project. The program doesn’t have to be blazingly fast; it isn’t going to be doing real-time animation in a video game. Nor is it going to trace the trajectories of aircraft in the holding pattern at JFK; no lives will be lost if I make a goof. Still, I would like to get correct answers, and not waste cpu cycles too extravagantly.

In Euclid’s way of doing geometry—with straightedge and compass—there’s not much to be said about algorithms for finding intersections. You construct the lines and see where they cross. Most computers, however, are not adept with straightedge and compass; the problem has to be encoded somehow in a more-algebraic language. Descartes knew how to do this. The cartesian equivalent of the Euclidean procedure goes something like this: Given two segments specified by the x and y coordinates of their end points, write the equation y = mx + b for each of the infinite lines that pass through these pairs of points. Then try to solve the two simultaneous equations to find a point (x,y) that lies on both lines; this point, if it exists, must be the intersection of the lines. Now go back to check whether the intersection lies within each of the segments.

Seems straightforward—but watch out for those potholes. For starters, in the equation y = mx + b the slope m is defined as Δy/ Δx. What if the segment is vertical? Then Δx = 0, and the slope is undefined. More insidiously, what if the two end points of a segment are actually the same point, so that the segment has zero length? Again the slope is undefined; now it’s 0/0. Then there’s the matter of parallel line segments. In Euclid’s world, parallel lines just don’t intersect; that’s pretty much how “parallel” is defined. But in dealing with segments, it does makes sense to ask about the intersection of two segments lying along the same line. In this case the intersection could be empty, or it could be a single point of overlap, or it could be a segment. The method of slopes and intercepts and simultaneous equations won’t work in this situation.

Down the garden path. I like a good puzzle, and I try not to cheat, but sometimes it’s just crazy to pretend you’re the first person on Earth who ever faced a problem. So I did some scouting around on the Net, as well as in filing cabinets and on bookshelves.

If you do a Web search for “segment intersection algorithm,” you’ll find lots of references to works by eminent computer scientists: Jon Louis Bentley, Bernard Chazelle, Herbert Edelsbrunner, and others. Following those leads, you’ll find some eminently clever algorithms, the result of a long series of inventions and refinements in a sort of algorithmic arms race that has gone on for more than 20 years. Unfortunately, these powerful tools solve the wrong problem—or at least they don’t solve my problem. They deal with the task of efficiently identifying all the intersections among a large set of line segments; the subtask of finding the specific point of intersection between two given segments is a minor detail generally left as an exercise for the reader.

Furthermore, most of the papers describing these algorithms take a fairly cavalier attitude to the various singularities and degeneracies mentioned above. I quote from Chazelle and Edelsbrunner 1992 (An optimal algorithm for intersecting line segments in the plane, Journal of the Association for Computing Machinery 39:1–54):

For the ease of exposition, we shall assume that no two endpoints have the same x or y coordinates. This, in particular, applies to the two endpoints of the same segment, and thus rules out the presence of vertical or horizontal segments…. Our rationale is that the key ideas of the algorithm are best explained without having to worry about special cases every step of the way. Relaxing the assumptions is very easy (no new ideas are required) but tedious. That’s for the theory. Implementing the algorithm so that the program works in all cases, however, is a daunting task. There are also numerical problems that alone would justify writing another paper. Following a venerable tradition, however, we shall try not to worry too much about it.

Thanks, guys.

Out on the Web I did find a few odd bits of code that address the specifics of the two-segments problem. One author finesses the inconvenience of vertically oriented segments by setting Δy/0 equal to 1010, with the comment, “close enough to infinity.” I wasn’t seriously tempted to follow this example. It’s not that I want a better approximation to infinity; I’d agree that 1010 is probably close enough. But this is one of those cases where infinity itself is not a very good approximation to the result of dividing by zero. Even if you believe that parallel lines meet at infinity, single isolated points don’t get there even in the limit. Also, despite venerable tradition, I do worry that a finite infinity may invite numerical trouble somewhere down the road.

The road not taken. What’s most annoying about the whole vertical-line mess is that it’s not a fundamental geometric constraint but just an artifact of how lines are represented. It’s only by convention that we measure slope with respect to the y axis; we could choose another orientation. Which suggests a way around the problem. If one of the segments turns out to be vertical, rotate the whole coordinate frame before attempting to test for intersections, and then rotate it back afterwards. The rotation is just a matrix multiplication, so it’s not a big computational burden, and you can do it only when needed. I seriously considered this strategy, and even now I wonder if it isn’t the right way to go about it. I finally decided against it because it doesn’t solve the problem of a segment that’s a single point: That kind of degenerate line has no well-defined slope in any coordinate frame.

Perhaps this fact is an argument for outlawing single-point segments altogether. I considered that too, but it seemed a bit pusillanimous, legislating a problem out of existence just because I found it inconvenient to solve. Mathematically, a one-point line segment may be pathological, but computationally it’s just an instance of aliasing—a very common occurrence, where two things turn out to be the same thing. Besides, it would be handy to have an intersection routine that can also test whether a given point is on a line.

On the road again. Lacking a brilliant Gordian-knot insight that would solve the problem with a single stroke, I had no choice but to push on into the tangled maze of if-then-else case analysis. Is either segment vertical? Is either segment a single point? Are the segments parallel (i.e., identical slopes)? If they are parallel, are they also collinear (identical y intercepts)? But be careful: If they’re parallel, they could both be vertical, with undefined y intercepts. Each of these cases could require a separate calculation of the intersection point. I’m not even sure how many cases there are—at least a dozen, but it depends on how you count.

The procedure I eventually wrote (after exploring several more blind alleys) is not quite as ugly as I feared it would be, but I still have the firm sense that there’s a better way. If not a royal road, then at least a trail that doesn’t require four-wheel drive. I was able to get the number of cases down to five, though only by exploiting a little Lisp hocus-pocus. For convenience of reference in what follows I label the segments red and blue.

  • Case 1. Both slopes exist, and they are not equal. This is the general case of two lines that are not vertical and not parallel. We can solve the simultaneous equations.
  • Case 2. The red slope is well-defined but the blue slope does not exist. The blue segment could be either a vertical segment or a single point. In either case, if the intersection exists, its x coordinate must be that of the blue segment.
  • Case 3. The blue slope is well-defined but the red slope does not exist. Ibid, mutatis mutandis.
  • Case 4. Both segments have identical slopes and also identical y intercepts; hence they are parallel and collinear. This is where the hocus-pocus comes in: The Lisp predicate
           (equalp (slope red-segment)
                   (slope blue-segment))
    

    returns true if both slopes are numeric and are equal, and it also yields true if both slopes are nil, the Lisp idiom for things that don’t exist. By this trickery, with a similar test on the y intercepts, I pack several cases into a single cond clause: two collinear vertical segments, two collinear non-vertical segments, and various combinations of vertical segments and single points.

  • Case 5. None of the above. The only possibility remaining—unless I am mistaken—is equal slopes and different y intercepts: The lines are parallel but not collinear, and so there is no intersection.

All that analysis merely determines whether or not the lines intersect; we still have to check whether or not the intersection lies within the segments. That involves still more case analysis, since the process is a little different for parallel segments than for others. I found a way of doing it that’s not grotesquely ugly.

The road back. Belatedly, after I had a working procedure, I did some more scouting in the literature and discovered a few paths worth exploring.

Robert Sedgewick’s Algorithms textbook suggests a quite different and ingenious method of detecting intersections of segments. Suppose you walk along the red segment, and when you come to the end, you have to turn counterclockwise to reach one end point of the blue segment, and clockwise to reach the other blue end point. Now try the same experiment walking on the blue segment; if again you must turn counterclockwise in one case and clockwise in the other to see the red end points, then the two segments intersect. Conversely, if in either case both end points are reached by turning in the same direction, there can be no intersection. (To be comprehensive, we also have to consider the case where an end point is straight ahead or directly behind.) Nifty! Unfortunately, the procedure only detects the existence of an intersection; I see no easy way to make it yield the coordinates.

Joseph O’Rourke’s Computational Geometry text (I looked at the C version) also has a discussion of intersection-finding. O’Rourke suggests working with parametric equations for the two segments—defining x and y as functions of distance along the segment. This avoids the problem with undefined slopes but encounters another singularity with parallel lines. Overall, the computation does not appear to be notably simpler.

Stuck in the mud. The animus behind this entire rant is a feeling that I must be missing something obvious, that a problem like finding the intersection of two line segments shouldn’t be this hard. The difficulty I’m talking about is not computational but conceptual. There are lots of hard computational problems—graph coloring, say, or factoring integers—for which one can write a very tidy and perspicuous program. True, the program may have a running time that exceeds your lifespan, but it’s easy to describe what needs to be done. Programs for geometric problems, in contrast, seem often to be efficient but hideous, with tangled logic, an abundance of special cases, and hidden numerical perils. Why is that? Nature seems to have no trouble at all detecting intersections or collisions. If two wires cross on a circuit board, you can count on blowing a fuse no matter what the slopes of the conductors. Why can’t we compute the same result so effortlessly?

Or maybe I have indeed missed something obvious—or even something subtle. If anyone has a better intersection algorithm, please pass it on.

Update 2006-09-15. My thanks to all of the readers who so promptly weighed in with good ideas. Over the weekend I’m going to try turning a couple of those suggestions into working code, and I’ll report back.

Sums, differences, and surprises

Friday, August 25th, 2006

I’ve received the following note from Barry Cipra, bit-player’s Bureau Chief in Northfield, Minnesota, (where all hail broke loose yesterday):

Your latest postings [here and here] have motivated me to idle away some time with a variant of the problem(s) you discussed. To wit, I got to wondering what happens if you restrict yourself to taking sums of distinct pairs and positive differences of numbers from a (finite) set S — i.e., x+y and x–y with x > y in S.

It was easy, after a while, to see that the sum set is usually larger than the difference set. In particular, for the set {0, 1, 2, …, m}, the sum set is {1, 2, …, 2m–1}, while the difference set is {1, 2, …, m–1}. It seems, or at least seemed, obvious that the sum set is never smaller than the difference set, but a proof keeps slipping through my (fumbling) fingers, which suggests that either it isn’t true, or the proof is complicated, or the slow decline of my mathematical abilities has steepened and accelerated. Intuitively, each duplicate sum is associated with two duplicate differences: If a+d = b+c with a < b < c < d, then d–b = c–a and d–c = b–a. The problem, of course, is that you can have duplicate duplicates.

It seemed worth looking at how the sum/difference ratio tends toward 2 as one “grows” a set S by randomly adding numbers between 1 and some large upper bound, say m = 10,000. I wrote, ran, debugged, ran, debugged,…., and finally ran successfully a little (Basic) program to do this, and tried it with various values of m, printing out the ratio as each new (random) element got added. The number of elements required to reach any given ratio r (less than 2) seems to be proportional to sqrt(m). In particular, the number of elements to reach r=1.5 seems to be 2*sqrt(m). It would be most gratifying if this turned out to be asymptotically exact.

Barry also points out a connection with the birthday “paradox.” Here on Earth, any gathering of 23 or more people has a better-than-even chance of including at least two people with the same birthday. On Mars, where the year has 687 days, you need to assemble 32 people to have a fifty-fifty chance of such a coincidence. On Saturn, where people can have any of 1,055 birthdays, the minimum size of the group is 39. (I would add that on Pluto, with a year of 14,158 Plutonian days, the minimum likely size for a double-birthday party is 141 people. However, there can’t be that many people on Pluto, since it’s only a dwarf planet.) Anyway, the point of all this is that the number of people needed for a birthday coincidence grows in proportion to the square root of the number of possible birthdays. The calculation is similar when it comes to the number of elements in a set drawn at random from {0, 1, 2, …, m}, except that the probability we’re interested in is not that of finding two identical elements but rather that of having two pairs of elements with the same sum or difference.

Barry’s version of the problem has a satisfying symmetry. For both sums and differences, we’re looking at the upper triangle of a matrix, excluding the diagonal and everything below it. If there are no duplicates, or coincidences, then the sum set and the difference set are necessarily of the same size.

Inspired by Barry’s example, I have idled away a few of my own declining hours, and gone off on a bit of a tangent.

In an arbitrary finite set of non-negative integers, suppose there are n elements, and the largest element is m. If m is much larger than n, coincidences are unlikely among both the sums and the differences; in the limiting case, they have probability zero. If m = n–1, then the set must be the arithmetic progression {0, 1, 2, …, m}, and the number of coincidences is maximized. Barry’s algorithm for adding randomly chosen numbers to a set, one at a time, can be viewed as a way of exploring the transition between these two extremes. We can ask, at each stage of the process: What is the probability that a randomly chosen number, distinct from all the existing elements of the set, can be adjoined to the set without generating a coincidence?

Here is a somewhat different question. Instead of choosing numbers at random to add to the set, suppose we always insert the smallest number that can be adjoined to the set without creating a coincidence. In other words, we build the set from the bottom up. What does the resulting set look like?

First consider just the constraints imposed by the sum set, ignoring coincidences among differences. Starting from the empty set {}, we can obviously go on to {0} and then to {0, 1}. Next we can insert 2, because the three sums 0+1 = 1, 0+2 = 2 and 1+2 = 3 are all distinct. But we cannot add 3 to the set, because that would introduce the coincidence 0+3 = 1+2. The next permissible element is 4; the set {0, 1, 2, 4} produces the sum set {1, 2, 3, 4, 5, 6} without coincidences. Going on from there, both 5 and 6 are excluded, but 7 is okay. Working with pencil and paper, I extended this set as far as {0, 1, 2, 4, 7, 12, 20, 29}. This gave me enough to go consult the Oracle of AT&T. I found the sequence listed as No. A010067 in Neil Sloane’s Encyclopedia of Integer Sequences. The sequence is described as follows: “A B2 sequence: a(n) = least value such that sequence increases and pairwise sums of distinct elements are all distinct.” This was just what I had been looking for, and the success boosted my confidence that at least I had done the arithmetic right.

Now I tried the same method with the differences. Again the first few sets that yield no coincidences are {], {0} and {0, 1}, but in the subtractive case {0, 1, 2} fails, because 2–1 = 1–0. On the other hand, {0, 1, 3} is without conflicts. The series continues {0, 1, 3, 7, 12, 20, 30}. It was time to visit the Oracle again. This sequence is No. A025582, but the identifying tag line is not quite what I was expecting: “A B2 sequence: a(n) = least value such that sequence increases and pairwise sums of elements are all distinct.” I had generated the series by looking at differences of distinct elements, but it turns out that sums of not-necessarily-distinct elements yield the same series. Curious.

Let’s consider the element-by-element differences between these two sequences, which I’m going to call S(n) for the sums of distinct pairs and D(n) for the differences of distinct pairs:

   n        D(n)       S(n)      D(n)-S(n)
   0          0          0            0
   1          1          1            0
   2          3          2            1
   3          7          4            3
   4         12          7            5
   5         20         12            8
   6         30         20           10
   7         44         29           15
   8         65         38           27
   9         80         52           28
  10         96         73           23
  11        122         94           28
  12        147        127           20
  13        181        151           30
  14        203        181           22
  15        251        211           40
  16        289        257           32
  17        360        315           45
  18        400        373           27
  19        474        412           62
  20        564        475           89

From the first few entries in this table one might well guess that the sums will always be lagging behind the differences, so that the D(n) – S(n) sequence grows monotonically. But then there’s a reversal at n = 10, and after that it’s not clear exactly what the trend will be. The obvious question is: Can S(n) ever be greater than D(n)? This time the Oracle can’t help us; a Sequence Server search for {0, 0, 1, 3, 5, 8, 10, 15, 27, 28} comes up empty.

I tried producing a few more terms of S(n) and D(n) and their differences. Struggling with a grotesquely inefficient program, I invested an hour of cpu time, which got me as far as n = 60:

   n        D(n)       S(n)      D(n)-S(n)
   0          0          0            0
   1          1          1            0
   2          3          2            1
   3          7          4            3
   4         12          7            5
   5         20         12            8
   6         30         20           10
   7         44         29           15
   8         65         38           27
   9         80         52           28
  10         96         73           23
  11        122         94           28
  12        147        127           20
  13        181        151           30
  14        203        181           22
  15        251        211           40
  16        289        257           32
  17        360        315           45
  18        400        373           27
  19        474        412           62
  20        564        475           89
  21        592        530           62
  22        661        545          116
  23        774        607          167
  24        821        716          105
  25        915        797          118
  26        969        861          108
  27       1015        964           51
  28       1158       1059           99
  29       1311       1160          151
  30       1394       1306           88
  31       1522       1385          137
  32       1571       1434          137
  33       1820       1555          265
  34       1895       1721          174
  35       2028       1833          195
  36       2253       1933          320
  37       2378       2057          321
  38       2509       2260          249
  39       2779       2496          283
  40       2924       2698          226
  41       3154       2873          281
  42       3353       3060          293
  43       3590       3196          394
  44       3796       3331          465
  45       3997       3628          369
  46       4296       3711          585
  47       4432       3867          565
  48       4778       4139          639
  49       4850       4446          404
  50       5122       4639          483
  51       5242       5021          221
  52       5297       5064          233
  53       5750       5322          428
  54       5997       5613          384
  55       6373       6003          370
  56       6800       6273          527
  57       6924       6493          431
  58       7459       6641          818
  59       7546       6979          567
  60       7788       7275          513

Would you care to place a bet on the further behavior of this series? At n = 60, D(n) – S(n) looks like it’s still growing, on the average, but it’s also still fluctuating. Will any of the dips in the curve ever plunge deep enough to bring the balance below zero? Or will the overall upward trend prevail in the long run? Maybe it’s like π(x) – Li(x) in number theory. This function—the difference between the actual number of primes up to x and an estimate proposed by C. F. Gauss—is negative for all values of x anyone has ever calculated explicitly. Nevertheless, there’s a proof that π(x) – Li(x) does eventually become positive; indeed, it changes sign infinitely often. But the first such zero crossing is somewhere out in the remotest exurbs of the number line.

I debated: Is it worth the bother of writing a better program to add a few more lines to the table? If D(n) – S(n) is positive for all values of n up to n = 60, what is the probability of discovering a negative value at n = 61, or even n = 600? I have no idea how to answer that question mathematically, but psychologically the prospects seemed pretty dim. Nevertheless, I wrote the program. Here are the next four entries in the table:

   n        D(n)       S(n)      D(n)-S(n)
  61       8219       7587          632
  62       8502       8017          485
  63       8729       8373          356
  64       8941       9071         -130

There it is: S(64) is greater than D(64). And I soon learned that this excursion into negative territory is not a one-of-a-kind freak occurrence. In the interval between n = 0 and n = 500, D(n) – S(n) is positive for 301 values, negative for 198 values and 0 for 2 values. It seems a reasonable conjecture that positive and negative values occur equally often. Here’s a graph of D(n) – S(n) for n = 0 to n = 250:

D(n)-S(n) for n=0 to n=250

Final note: Although I have just today blundered down this interesting path, I am assuredly not the first to pass this way. I gather that I need to do some reading on Sidon sequences. Are there also connections with Golomb rulers?