The Zune Bug

The last day of 2008 was the day the world stood still for Microsoft Zune music players. First-generation Zunes—those with 30-gigabyte disk drives—went silent everywhere on December 31. The cause was soon traced to calendrical code in the device’s firmware. The bug is an interesting one, if only because all the details, including the source code, immediately came to light. Explanations were posted in the Zuneboards forum, at Ars Technica, at ProgramPhases.com and elsewhere. A copy of the C source module (which bears copyright notices from both Microsoft and Freescale Semiconductor) was posted here.

When a high-profile bug like this one comes along, it’s tempting to heap abuse and scorn on the hapless programmers who released the faulty code into the wild. And I do think this is an error that should have been caught in testing or during a code review. On the other hand, I’ve made so many mistakes myself over the years that I’m reluctant to give others a hard time. And I think the main lesson from this incident is not the usual refrain that Microsoft is lame. Rather it’s the even more commonplace observation that software is hard. Subtle traps lie in wait even in what seems a very simple algorithm.

Here’s the code that caused all the trouble:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

The purpose of this routine is to calculate the current year. On entry, the variable days is equal to the number of days since January 1, 1980, which is taken as the beginning of time. On exit, the variable year is supposed to hold the correct year number. (The global constant ORIGINYEAR is set to 1980. The IsLeapYear predicate does just what you would guess.)

The intent of the code is clear enough: We’ll repeatedly decrement the days count—by 365 in common years and by 366 in leap years—and each time we do that, we’ll increment the year number by 1. At the end of the process, the year count should have reached the correct current value. But of course I wouldn’t be writing this if the program always gave the right answer. Can you spot the problem?

Consider what happens when the initial value of days is 366. Since this number is greater than 365, we pass the test at the head of the while loop and enter that block. Since the initial year is 1980, the IsLeapYear predicate returns true, and we enter the consequent clause. Now we try the test (days > 366), which is false, and so we do not decrement days or increment year. In fact, we do nothing at all, because there’s no else clause attached to this if statement. We simply return to the head of the while loop, where we find that days is still greater than 365, and so we go through the same motions again, and again, ad infinitum. This is just what happened on December 31, 2008, which was day 10,593 in Zune time. The bug would be triggered on the last day of any leap year; it wasn’t observed before now simply because the Zune didn’t yet exist the last time February had 29 days.

A reader at Zuneboards.com quickly suggested a fix: Replace the test (days > 366) with (days >= 366). As another reader pointed out, this is incorrect. The change eliminates the infinite loop, but the altered routine returns the wrong year on December 31 of every leap year. The reader who spotted this error suggested a different solution, adding an else clause with a break command:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else
            {
                break;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

Still another reader offered this option:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else if (days == 366)
                days -= 366;
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

As far as I can tell from running a few test cases, both of these versions yield correct results, but their logic is anything but perspicuous. Trying to verify correctness by ure reason—as opposed to trial-and-error testing—looks almost hopeless. And then there’s the matter of aesthetics: In my opinion, all three programs are just plain ugly. A task this simple shouldn’t need so many ifs and elses.

Or so I thought. After spending some hours futzing about with the problem, and finding awful bugs in several of my own attempts, I’m not so sure there’s a simple and satisfying solution. Here’s my best effort so far:

    year = ORIGINYEAR - 1;
    while (days > 0)
    {
        year += 1;
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
    }

I suppose I consider this an improvement over the alternatives given above, but I don’t really like it. What annoys me most is the trick of initially setting year to ORIGINYEAR - 1. This violates an implicit assumption that year will always be equal to or greater than 1980. In some programming environments, we could even make that assumption into an enforceable assertion, in which case this code would break.

To avoid introducing the phantom year 1979, we could do it this way:

    year = ORIGINYEAR;
    while (days > 0)
    {
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
        if (days > 0)        
            year += 1;
    }

But, to my taste, repeating the (days > 0) test is even more offensive than starting the program before the beginning of time. And both of these routines can leave the variable days with a negative value, which is also nonsensical.

So I put the question: Is there a better way? Can anyone come up with the One True Algorithm for calculating years from elapsed days?

Three further notes:

1. In many counting problems there’s doubt about how to begin. Is the first day of January in 1980 to be interpreted as day 0 or as day 1? Nothing I could find in the original code resolves this ambiguity. When I first looked into the matter, I guessed that counting began with day 0 because the Zune routine correctly returns year 1980 for day values from 0 through 365. But if the zero-based assumption were correct, the Zunes would have died on January 1, not December 31. Evidently, we’re counting from 1.

2. For the record, I haven’t actually tested any of the C code above. I’ve been translated to and from Lisp. Here’s the Lisp version of the 1979 routine:

    (defun my-zune-year (day-number)
      (let ((year 1979) (day day-number))
        (loop while (> day 0)
              do (incf year)
                 (if (leap-year-p year)
                   (decf day 366)
                   (decf day 365)))
        year))

3. Elsewhere in the Zune firmware, there’s another routine that raises questions about programming style and practice. Here’s the implementation of the IsLeapYear function:

    static int IsLeapYear(int Year)
    {
        int Leap;
        Leap = 0;
        if ((Year % 4) == 0) {
            Leap = 1;
            if ((Year % 100) == 0) {
                Leap = (Year%400) ? 0 : 1;
            }
        }
        return (Leap);
    }

The code implements the rules of the Gregorian calendar: A year is a leap year if it is divisible by 4, unless it is divisible by 100 but not by 400. So far so good. Elsewhere in the program, however, we find these declarations:

#define ORIGINYEAR 1980               // the begin year
#define MAXYEAR    (ORIGINYEAR + 100) // the maxium year

Thus it turns out the Zune has a fixed lifespan of 101 years, from January 1, 1980, through December 31, 2080. Throughout that interval, leap years can be detected by the simple test ((Year % 4) == 0), without worrying about centuries. (Many systems have been designed to work from 1901 through 2099 precisely in order to take advantage of this shortcut.) Given the MAXYEAR declaration, should the Zune have adopted a simpler leap year algorithm? Or have the authors done the right thing by putting the full logic into the code, just in case someone later decided to extend the MAXYEAR deadline past 2100?

Posted in computing | 33 Comments