The Zune Bug

The last day of 2008 was the day the world stood still for Microsoft Zune music players. First-generation Zunes—those with 30-gigabyte disk drives—went silent everywhere on December 31. The cause was soon traced to calendrical code in the device’s firmware. The bug is an interesting one, if only because all the details, including the source code, immediately came to light. Explanations were posted in the Zuneboards forum, at Ars Technica, at ProgramPhases.com and elsewhere. A copy of the C source module (which bears copyright notices from both Microsoft and Freescale Semiconductor) was posted here.

When a high-profile bug like this one comes along, it’s tempting to heap abuse and scorn on the hapless programmers who released the faulty code into the wild. And I do think this is an error that should have been caught in testing or during a code review. On the other hand, I’ve made so many mistakes myself over the years that I’m reluctant to give others a hard time. And I think the main lesson from this incident is not the usual refrain that Microsoft is lame. Rather it’s the even more commonplace observation that software is hard. Subtle traps lie in wait even in what seems a very simple algorithm.

Here’s the code that caused all the trouble:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

The purpose of this routine is to calculate the current year. On entry, the variable days is equal to the number of days since January 1, 1980, which is taken as the beginning of time. On exit, the variable year is supposed to hold the correct year number. (The global constant ORIGINYEAR is set to 1980. The IsLeapYear predicate does just what you would guess.)

The intent of the code is clear enough: We’ll repeatedly decrement the days count—by 365 in common years and by 366 in leap years—and each time we do that, we’ll increment the year number by 1. At the end of the process, the year count should have reached the correct current value. But of course I wouldn’t be writing this if the program always gave the right answer. Can you spot the problem?

Consider what happens when the initial value of days is 366. Since this number is greater than 365, we pass the test at the head of the while loop and enter that block. Since the initial year is 1980, the IsLeapYear predicate returns true, and we enter the consequent clause. Now we try the test (days > 366), which is false, and so we do not decrement days or increment year. In fact, we do nothing at all, because there’s no else clause attached to this if statement. We simply return to the head of the while loop, where we find that days is still greater than 365, and so we go through the same motions again, and again, ad infinitum. This is just what happened on December 31, 2008, which was day 10,593 in Zune time. The bug would be triggered on the last day of any leap year; it wasn’t observed before now simply because the Zune didn’t yet exist the last time February had 29 days.

A reader at Zuneboards.com quickly suggested a fix: Replace the test (days > 366) with (days >= 366). As another reader pointed out, this is incorrect. The change eliminates the infinite loop, but the altered routine returns the wrong year on December 31 of every leap year. The reader who spotted this error suggested a different solution, adding an else clause with a break command:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else
            {
                break;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

Still another reader offered this option:

    year = ORIGINYEAR;
    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
            else if (days == 366)
                days -= 366;
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

As far as I can tell from running a few test cases, both of these versions yield correct results, but their logic is anything but perspicuous. Trying to verify correctness by ure reason—as opposed to trial-and-error testing—looks almost hopeless. And then there’s the matter of aesthetics: In my opinion, all three programs are just plain ugly. A task this simple shouldn’t need so many ifs and elses.

Or so I thought. After spending some hours futzing about with the problem, and finding awful bugs in several of my own attempts, I’m not so sure there’s a simple and satisfying solution. Here’s my best effort so far:

    year = ORIGINYEAR - 1;
    while (days > 0)
    {
        year += 1;
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
    }

I suppose I consider this an improvement over the alternatives given above, but I don’t really like it. What annoys me most is the trick of initially setting year to ORIGINYEAR - 1. This violates an implicit assumption that year will always be equal to or greater than 1980. In some programming environments, we could even make that assumption into an enforceable assertion, in which case this code would break.

To avoid introducing the phantom year 1979, we could do it this way:

    year = ORIGINYEAR;
    while (days > 0)
    {
        if (IsLeapYear(year))
            days -= 366;
        else
            days -= 365;
        if (days > 0)        
            year += 1;
    }

But, to my taste, repeating the (days > 0) test is even more offensive than starting the program before the beginning of time. And both of these routines can leave the variable days with a negative value, which is also nonsensical.

So I put the question: Is there a better way? Can anyone come up with the One True Algorithm for calculating years from elapsed days?

Three further notes:

1. In many counting problems there’s doubt about how to begin. Is the first day of January in 1980 to be interpreted as day 0 or as day 1? Nothing I could find in the original code resolves this ambiguity. When I first looked into the matter, I guessed that counting began with day 0 because the Zune routine correctly returns year 1980 for day values from 0 through 365. But if the zero-based assumption were correct, the Zunes would have died on January 1, not December 31. Evidently, we’re counting from 1.

2. For the record, I haven’t actually tested any of the C code above. I’ve been translated to and from Lisp. Here’s the Lisp version of the 1979 routine:

    (defun my-zune-year (day-number)
      (let ((year 1979) (day day-number))
        (loop while (> day 0)
              do (incf year)
                 (if (leap-year-p year)
                   (decf day 366)
                   (decf day 365)))
        year))

3. Elsewhere in the Zune firmware, there’s another routine that raises questions about programming style and practice. Here’s the implementation of the IsLeapYear function:

    static int IsLeapYear(int Year)
    {
        int Leap;
        Leap = 0;
        if ((Year % 4) == 0) {
            Leap = 1;
            if ((Year % 100) == 0) {
                Leap = (Year%400) ? 0 : 1;
            }
        }
        return (Leap);
    }

The code implements the rules of the Gregorian calendar: A year is a leap year if it is divisible by 4, unless it is divisible by 100 but not by 400. So far so good. Elsewhere in the program, however, we find these declarations:

#define ORIGINYEAR 1980               // the begin year
#define MAXYEAR    (ORIGINYEAR + 100) // the maxium year

Thus it turns out the Zune has a fixed lifespan of 101 years, from January 1, 1980, through December 31, 2080. Throughout that interval, leap years can be detected by the simple test ((Year % 4) == 0), without worrying about centuries. (Many systems have been designed to work from 1901 through 2099 precisely in order to take advantage of this shortcut.) Given the MAXYEAR declaration, should the Zune have adopted a simpler leap year algorithm? Or have the authors done the right thing by putting the full logic into the code, just in case someone later decided to extend the MAXYEAR deadline past 2100?

This entry was posted in computing.

33 Responses to The Zune Bug

  1. Stefan Ciobaca says:

    The correct & DRY way to write the code is the following:

    #define DaysInYear(year) (IsLeapYear(year) ? 366 : 365)

    year = ORIGIN_YEAR;
    while (days > DaysInYear(year)) {
    days -= DaysInYear(year);
    year++;
    }

    Of course, for the anti C macro guys, you can always turn DaysInYear into an inlined function :D

  2. Lucas says:

    Shouldn’t this code have worked correctly on January 2nd? Then days==367, so the year would be properly incremented, and everything would be fine.

  3. brian says:

    @Stefan Ciobaca: That was fast!

    @Lucas: Yes it’s self-correcting if we assume there’s some independent process that can update the value of days while the year calculation twirls in its tight light loop.

  4. Craig says:

    It’s not pretty, but I think the most straightforward solution would be…

    int days_per_400_years = 365 * 400 + 97; // 97 leap years every 400 years
    int year_table[] = { 0, 0, 0, 0, 0, 0 … } // array of length as above, with numbers from 0 to 399.

    int get_year( int days ) {
    return 1980 + (400*(days / days_per_400_years) +
    year_table[days % days_per_400_years]);
    }

    Yes, year_table needs to be an array of length 146097 short integers (or 146098 if your first day is day 1), but who cares? Plus, you do away with yucky iteration. You can push more of this to integer arithmetic and away from tables, but it’s annoying because the anomalous leap years are in the middles of the cycles by virtue of starting at 1980.

  5. svat says:

    Why start at 1980, BTW? It’s 10 years after the Unix epoch starts at 1970, so maybe the Zune’s “fixed lifespan” is not 101 years but only until 2048 :-)

  6. Darius Bacon says:

    I see someone already made a similar reply, but for what it’s worth:
    int year = ORIGINYEAR;
    for (;;) {
    days -= year_length (year);
    if (days <= 0) break;
    ++year;
    }

    In the past when I’ve written this function for real I think it went along the lines of (days + days/4) / 365, but I don’t care to work out just how to do that correctly for a blog comment…

    Maybe the best thing Microsoft could have done was divide their system into processes Erlang-style so this one function wouldn’t lock up the whole device.

  7. Darius Bacon says:

    Well, my expression was so wrong I had to work out a correct one — it’s quite ugly with the Zune’s conventions, though:

    enum { four_years = 4 * 365 + 1 };
    int fours = (days - 1) / four_years;
    days = (days - 1) % four_years;
    year = ORIGINYEAR + 4 * fours + (days < 366 ? 0 : 1 + (days - 366) / 365);

  8. Darius Bacon says:

    Re your last question, I would use the simple code and leave a comment at the definition of ORIGINYEAR/MAXYEAR.

  9. Stefan Ciobaca says:

    @brian: days is an argument, not a global, not volatile, therefore there is no (reasonable) way it is updated in the background. In fact users apparently have to reboot their devices the next day to get them working.

    Regarding the MAXYEAR question: MAXYEAR’s only use seems to be when setting the date (to validate it’s a reasonable date). I imagine that if people keep their Zune for +100 years and do not touch the date, the chip will probably happily increment it well past MAXYEAR.

    That said, and while it is unrealistic anyone will be using a Zune 100 years in the future, I think it is better for IsLeapYear to be written this way for consistency (this way meaning checking for % 400).

  10. Fred says:

    @svat: 1980 is the Epoch of Microsoft :) It’s in DOS and Windows…

  11. Peter says:

    year = ORIGINYEAR;
    int daysinyear = IsLeapYear(year) ? 366 : 365;
    while (days > daysinyear) {
    year += 1
    days -= daysinyear;
    daysinyear = IsLeapYear(year) ? 366 : 365;
    }

  12. asdfjlasjgklasfjgjasfhgk says:

    “Is the first day of January in 1980 to be interpreted as day 0 or as day 1? Nothing I could find in the original code resolves this ambiguity.”

    this is quite clear from the comparsion ‘days > 365′. year must be incremented when days is greater then the index of the last day in year. if the last (365th) index is 365 then the first one is 1.

    “The Zune routine correctly returns year 1980 for day values from 0 through 365.”
    in this case it should return 1981 for days=366. just try it :)

  13. Ghost Writer says:

    Sorry about this off topic comment, but today’s the day this pet peeve of mine is forcing me to speak up. It’s very annoying to follow a link from another website to an article such as this one and be left wondering who wrote it. Yes, in this case one can deduce from the comments that someone named Brian wrote it. Or, one can click through to the site’s home page and learn that it’s Brian Hayes. Brian, be proud of your writing and let it be known it’s yours! Every article should have your name, the date, and any copyright/left/whatever notice you prefer. The latter can be at the bottom, but the rest should appear at the top in some form so readers know whose voice we’re “hearing” and from when. You do have the date in small type at the end of your posting, so that’s better than nothing.

  14. Klutometis says:

    “I think the main lesson from this incident is not the usual refrain that Microsoft is lame. Rather it’s the even more commonplace observation that software is hard.”

    “Software is hard” implies you should not rewrite software that works; e.g. glibc locales, which have correctly handled leap years since, oh, the 1980s.

    Since Microsoft rewrote working software poorly, it’s natural to infer that “Microsoft is lame.”

  15. All the above code snippets would get mediocre grades
    in my class!

    One should use logic constructs for expressing logic.
    Year y is a leap year if it is a multiple of 400, or if
    it is a multiple of 4 but not of 100.

    static int IsLeapYear(int y)
    {
    return !(y%400) || ( !(y%4) && (y%100) );
    }

    To be sure we have the “counting the years” loop right,
    we could construct a loop invariant:

    original_days = days + sum_{y : ORIGINYEAR <= y DaysInYear(year))
    {
    days -= DaysInYear(year);
    year += 1;
    }
    // 1 <= days <= DaysInYear(year)

  16. All the above code snippets would get mediocre grades
    in my class!

    One should use logic constructs for expressing logic.
    Year y is a leap year if it is a multiple of 400, or if
    it is a multiple of 4 but not of 100.

    static int IsLeapYear(int y)
    {
    return !(y%400) || ( !(y%4) && (y%100) );
    }

    To be sure we have the “counting the years” loop right,
    we could construct a loop invariant:

    original_days = days + sum_{y : ORIGINYEAR <= y < year} DaysInYear(y)

    But if we rewrite the code to be clear it is so simple
    there is little need.

    static int DaysInYear(int Year)
    {
    return IsLeapYear(Year) ? 366 : 365;
    }

    // …

    int year = ORIGINYEAR;
    while (days > DaysInYear(year))
    {
    days -= DaysInYear(year);
    year += 1;
    }
    // 1 <= final days <= DaysInYear(year)

  17. I’m going to post three separate replies so as to keep three issues separate; my apologies if that is annoying.

    Reply 1 of 3: Brian says that the version containing

    else if (days == 366)
    days -= 366;

    seems to work and that he prefers a version based around

    while (days > 0)

    In fact, neither of these would work if inserted into the original context, as opposed to Brian’s days-to-years conversion procedure, because the original was used not only to calculate the year, but also the day number within that year. (The day number is subsequently converted into a month number and the date within the month). As such, it matters not only that the loop terminates, and what the value of years is upon termination, but also what the value of days is upon termination — which both of these variants get wrong.

  18. Reply 2 of 3:

    The version that adds an else clause containing just a break statement is interesting not only because it is a small change that makes the procedure correct, but also because essentially the same code appears elsewhere in the buggy source file.

    As background, the Zune 30 apparently uses a pair of chips from Freescale Semiconductor, each of which contains its own real-time clock. One is the i.MX31, the main heavy-duty processor which is powered off when not in use. The other is the MC13783, a lower-power device that stays on and provides power management and some user-interface functionality.

    If the date is available from the i.MX31, that is used, but if the i.MX31 has not had its real time clock reset since it was last powered back up, then the code gets the date from the MC13783 and sets that into the i.MX31 for future use.

    The source file containing the now infamous bug has two different copies of the date conversion code, once for each of the two chips. The conversion logic is identical — just the code for getting the raw number of days is different — and yet the conversion code is duplicated, or rather, is nearly duplicated.

    The version for the i.MX31 contains the missing else clause:

    else
    {
    OALMSG(OAL_ERROR, (TEXT(“ERROR calculate day\r\n”)));
    break;
    }

    Now, this isn’t very confidence inspiring, because the OALMSG macro suggests that the programmer though this was an error condition. But in point of fact, in a production build, the OALMSG macro would not generate any code, and so this would be completely equivalent to just a break — which is correct.

    The really weird thing is that there are two pieces of evidence that all lead me to suspect that the correct i.MX31 version of the code is older than the incorrect MC13783 version. First, the i.MX31 hardware slightly predates the MC13783, and systems can be built with only i.MX31. Second, the code to for the two chips is broken into procedures differently: in the i.MX31 version, the code for accessing the raw number of days out of the hardware is in the same procedure with the conversion, whereas in the MC13783 verion, the hardware access is separate from the conversion. If the MC13783 version was the earlier one, there would have been no reason to duplicate the conversion code — the same procedure could simply have been called. Whereas if the i.MX31 code was the original one, the conversion code wouldn’t have been directly available for reuse without some kind of refactoring or duplication.

    It looks like someone when adding in the code for the MC13783 saw the conversion code they wanted, but that it was packaged into the same procedure as the lower-level aspects of getting the date from the clock hardware. Rather than factoring it out — so that one copy of the conversion code would work in both cases — they left it alone and made a new copy. (Programmers of hardware drivers are loath to touch any working code.) But rather than copying it verbatim, they simplified it a bit, leaving out all of the OALMSG calls (not just the “error” one I showed above — also a bunch of tracing ones). And, seemingly, in the case in question, they simplified a bit too much — missing that the else clause was not *really* just an error report, but actually was a functionality-critical break statement.

    Of course, this reconstructed history is just speculation. But if anyone from Microsoft and Freescale ever wants to do the software-development community a service by reporting the history of how the bug slipped through, it would be particularly interesting to hear about the relationship between the two nearly-identical copies of the conversion code. And, until then, I think any of us who use the buggy while loop as an example in our teaching really ought to also keep in mind the presence of the other version — which perhaps changes the range of lessons we ought to draw.

  19. Reply 3 of 3:

    Another plausible replacement for the buggy loop would use a test-in-the-middle loop. Like Brian, I tested this in a different programming language, so apologize if the C is a bit off:

    while(TRUE){

    int daysInYear = IsLeapYear(year) ? 366 : 365;

    if(days <= daysInYear) break;

    days -= daysInYear;

    year += 1;

    }

  20. Jim Ward says:

    Software is a garden, ever generation must dig in and turn over the soil. Sometimes you kill the flowers, but you learn. I don’t fault Microsoft for rewriting glibc, I work with many coders who aren’t curious at all about how things work, they call all these black boxes, like glibc. They’re plumbers, not programmers.

  21. Dave Minter says:

    If clarity is the aim, why not just split the logic into its constituent procedures?

    1. Determine if we have passed the end of the year

    2. Determine if a given year is a leap year

    3. Adjust the day and year counts.

    I admit that I may be missing some subtlety as I don’t really follow why a loop was used at all – but I’d expect something like the following three functions:

    // (1)
    if( isLeapYear(year) ) {
    return days > 366;
    } else {
    return days > 365;
    }

    // We’ve already got (2) from the original implementation.

    // (3)
    if(isAfterEndOfYear(days,year)) {
    days = 1;
    year += 1;
    }

    Even if I’m off base somewhere the problem of clarity seems to be more to do with trying to combine too many logical tests into a single

  22. Anthony says:

    To those that claim “dealing with dates is not easy”, a really simple unit test would have caught that bug – like feeding the function integers from 1 to 1000 and testing that it did not crash for any of them. Bonus points for testing that it returned something sensible each time.

    Sadly, this code probably wasn’t unit tested at all.

  23. krick says:

    Here’s a more concise leap year function…

    int isLeapYear(int year)
    {
    return !(year % 100 ? year % 4 : year % 400);
    }


    Krick
    http://www.tankadin.com

  24. dargor17 says:

    About your last question, I don’t think there’s ever a reason to substitute a general code with a code working under some assumptions unless there is a huge difference in performance or maintainability…
    I don’t really know the hardware specifics of the zune, but I guess leaving one more “if” should not affect the time it takes to execute the function that much…

    @Barak A. Pearlmutter (or any one who can help me):
    what does this construct mean? I don’t think I’ve ever seen it… Is it C? Or is just some formatting problem of the blog software?
    original_days = days + sum_{y : ORIGINYEAR <= y < year}

  25. @dagor17: The text you quoted from Barak Pearlmutter’s comment wasn’t from the C code he included within his comment, but rather from the explanatory text talking about that code. It is a mathematical equation, not an assignment statement. You didn’t quote it in full: you missed the DaysInYear(y) that appears at the end of the equation. The equation contains a term that would be written with a big sigma if it were in conventional mathematical notation rather than this typed form. The equation is saying that after any number of complete iterations through the loop (including 0, i.e., before it starts), the original value of days is equal to the current value of days plus the sum of DaysInYear(y) for y ranging from ORIGINYEAR up to but not including the current value of year. (Note that before the first iteration of the loop, there are no such values of y, so the sum would be 0.)

  26. Michael Mather says:

    The following needs to be tested, but has the advantage of not having any loops. It should be good until the year 2100.

    // calculate the year from 1980 = 0
    year = (days * 4 – 4) / 1461;

    // calculate the day within that year, from 1 jan = 1
    days -= (year * 1461 + 3) / 4;

    // adjust the year to start from 1980
    year += 1980;

  27. Alex says:

    just change the condition for the while loop and the code can be that easy:
    year = ORIGINYEAR;
    while ((days > 365 && !IsLeapYear(year)) || (days > 366 && IsLeapYear(year))
    {
    if (IsLeapYear(year))
    {
    days -= 366;
    }
    else
    {
    days -= 365;
    }
    year += 1;
    }

  28. roland says:

    Max Hailperin:

    Great comment. Maybe the person who deleted the error message thought the code works, so this clause will never be executed and is therefore unnecessary.

  29. jagbot says:

    #include

    int is_leap (int y) {
    if ( (y % 4 == 0) && (y % 100 == 0) && (y % 400 != 0)) return 1; else return 0;
    }

    int main () {
    int days;
    int year = 1980;
    scanf (“%d”, &days);
    while (days > 364) {
    if (is_leap (year) == 1) days -= 366; else days -= 365;
    ++year;
    }
    printf (“year is %d”, year);
    }

  30. jagbot says:

    // add include stdio.h to above

    using 364 solves the problem cleanly.
    why all the fuss?

  31. jagbot says:

    oh my is_leap is wrong!
    replace with:

    int is_leap (int year) {
    if (year % 4 == 0) {
    if ((year % 100 == 0) {
    if (year % 400 != 0) return 1; else return 0;
    }
    return 1;
    }
    return 0;
    }

  32. One word.. Microsoft. Is it any wonder that such a simple error could occur on one of their products?!?

  33. Palm Pre says:

    I have had a Zune for the past year now and I am loving every bit of it. Although it is currently in the rapair shop I still use the software. Hopefully when the Zune HD comes out this fall I will be able to get it. Would really like to see how it compares to the iPod Touch.