## 737: The MAX Mess

Controlled Flight into Terrain is the aviation industry’s term for what happens when a properly functioning airplane plows into the ground because the pilots are distracted or disoriented. What a nightmare. Even worse, in my estimation, is Automated Flight into Terrain, when an aircraft’s control system forces it into a fatal nose dive despite the frantic efforts of the crew to save it. That is the conjectured cause of two recent crashes of new Boeing 737 MAX 8 airplanes. I’ve been trying to reason my way through to an understanding of how those accidents could have happened.

Disclaimer: The investigations of the MAX 8 disasters are in an early stage, so much of what follows is based on secondary sources—in other words, on leaks and rumors and the speculations of people who may or may not know what they’re talking about. As for my own speculations: I’m not an aeronautical engineer, or an airframe mechanic, or a control theorist. I’m not even a pilot. Please keep that in mind if you choose to read on.

### The accidents

Early on the morning of October 29, 2018, Lion Air Flight 610 departed Jakarta, Indonesia, with 189 people on board. The airplane was a four-month-old 737 MAX 8—the latest model in a line of Boeing aircraft that goes back to the 1960s. Takeoff and climb were normal to about 1,600 feet, where the pilots retracted the flaps (wing extensions that increase lift at low speed). At that point the aircraft unexpectedly descended to 900 feet. In radio conversations with air traffic controllers, the pilots reported a “flight control problem” and asked about their altitude and speed as displayed on the controllers’ radar screens. Cockpit instruments were giving inconsistent readings. The pilots then redeployed the flaps and climbed to 5,000 feet, but when the flaps were stowed again, the nose dipped and the plane began to lose altitude. Over the next six or seven minutes the pilots engaged in a tug of war with their own aircraft, as they struggled to keep the nose level but the flight control system repeatedly pushed it down. In the end the machine won. The airplane plunged into the sea at high speed, killing everyone aboard.

The second crash happened March 10, when Ethiopian Airlines Flight 302 went down six minutes after taking off from Addis Ababa, killing 157. The aircraft was another MAX 8, just two months old. The pilots reported control problems, and data from a satellite tracking service showed sharp fluctuations in altitude. The similarities to the Lion Air crash set off alarm bells: If the same malfunction or design flaw caused both accidents, it might also cause more. Within days, the worldwide fleet of 737 MAX aircraft was grounded. Data recovered since then from the Flight 302 wreckage has reinforced the suspicion that the two accidents are closely related.

The grim fate of Lion Air 610 can be traced in brightly colored squiggles extracted from the flight data recorder. (The chart was published in November in a preliminary report from the Indonesian National Committee on Transportation Safety.)

The outline of the story is given in the altitude traces at the bottom of the chart. The initial climb is interrupted by a sharp dip; then a further climb is followed by a long, erratic roller coaster ride. At the end comes the dive, as the aircraft plunges 5,000 feet in a little more than 10 seconds. (Why are there two altitude curves, separated by a few hundred feet? I’ll come back to that question at the end of this long screed.)

All those ups and downs were caused by movements of the horizontal stabilizer, the small winglike control surface at the rear of the fuselage. The stabilizer controls the airplane’s pitch attitude—nose-up vs. nose-down. On the 737 it does so in two ways. A mechanism for pitch trim tilts the entire stabilizer, whereas pushing or pulling on the pilot’s control yoke moves the elevator, a hinged tab at the rear of the stabilizer. In either case, moving the trailing edge of the surface upward tends to force the nose of the airplane up, and vice versa. Here we’re mainly concerned with trim changes rather than elevator movements.

Commands to the pitch-trim system and their effect on the airplane are shown in three traces from the flight data, which I reproduce here for convenience:

The line labeled “trim manual” (light blue) reflects the pilots’ inputs, “trim automatic” (orange) shows commands from the airplane’s electronic systems, and “pitch trim position” (dark blue) represents the tilt of the stabilizer, with higher position on the scale denoting a nose-up command. This is where the tug of war between man and machine is clearly evident. In the latter half of the flight, the automatic trim system repeatedly commands nose down, at intervals of roughly 10 seconds. In the breaks between those automated commands, the pilots dial in nose-up trim, using buttons on the control yoke. In response to these conflicting commands, the position of the horizontal stabilizer oscillates with a period of 15 or 20 seconds. The see-sawing motion continues for at least 20 cycles, but toward the end the unrelenting automatic nose-down adjustments prevail over the briefer nose-up commands from the pilots. The stabilizer finally reaches its limiting nose-down deflection and stays there as the airplane plummets into the sea.

### Angle of attack

What’s to blame for the perverse behavior of the automatic pitch trim system? The accusatory finger is pointing at something called MCAS, a new feature of the 737 MAX series. MCAS stands for Maneuvering Characteristics Augmentation System—an impressively polysyllabic name that tells you nothing about what the thing is or what it does. As I understand it, MCAS is not a piece of hardware; there’s no box labeled MCAS in the airplane’s electronic equipment bays. MCAS consists entirely of software. It’s a program running on a computer.

MCAS has just one function. It is designed to help prevent an aerodynamic stall, a situation in which an airplane has its nose pointed up so high with respect to the surrounding airflow that the wings can’t keep it aloft. A stall is a little like what happens to a bicyclist climbing a hill that keeps getting steeper and steeper: Eventually the rider runs out of oomph, wobbles a bit, and then rolls back to the bottom. Pilots are taught to recover from stalls, but it’s not a skill they routinely practice with a planeful of passengers. In commercial aviation the emphasis is on avoiding stalls—forestalling them, so to speak. Airliners have mechanisms to detect an imminent stall and warn the pilot with lights and horns and a “stick shaker” that vibrates the control yoke. On Flight 610, the captain’s stick was shaking almost from start to finish.

Some aircraft go beyond mere warnings when a stall threatens. If the aircraft’s nose continues to pitch upward, an automated system intervenes to push it back down—if necessary overriding the manual control inputs of the pilot. MCAS is designed to do exactly this. It is armed and ready whenever two criteria are met: The flaps are up (generally true except during takeoff and landing) and the airplane is under manual control (not autopilot). Under these conditions the system is triggered whenever an aerodynamic quantity called angle of attack, or AoA, rises into a dangerous range.
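The arming and triggering conditions just described can be written as a small predicate. This is only a sketch of the logic as reported in press accounts, not Boeing's code. The function name, the parameter names, and especially the value of `AOA_THRESHOLD_DEG` are my own assumptions; the actual trigger angle is not given in anything I have read.

```python
# Hypothetical threshold: the real trigger angle is not stated in the sources.
AOA_THRESHOLD_DEG = 12.0


def mcas_should_activate(flaps_up: bool, autopilot_on: bool, aoa_deg: float,
                         threshold_deg: float = AOA_THRESHOLD_DEG) -> bool:
    """Return True when all of the reported MCAS conditions are met."""
    # Armed only with flaps up and the airplane under manual control.
    armed = flaps_up and not autopilot_on
    # Triggered when angle of attack rises above the (assumed) threshold.
    return armed and aoa_deg > threshold_deg


print(mcas_should_activate(flaps_up=True, autopilot_on=False, aoa_deg=20.0))   # True
print(mcas_should_activate(flaps_up=False, autopilot_on=False, aoa_deg=20.0))  # False
```

Note that nothing in this predicate asks whether the AoA reading is believable. That omission is the crux of what follows.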

Angle of attack is a concept subtle enough to merit a diagram:

Adapted from Lisa R. Le Vie, Review of Research on Angle-of-Attack Indicator Effectiveness.

The various angles at issue are rotations of the aircraft body around the pitch axis, a line parallel to the wings, perpendicular to the fuselage, and passing through the airplane’s center of gravity. If you’re sitting in an exit row, the pitch axis might run right under your seat. Rotation about the pitch axis tilts the nose up or down. Pitch attitude is defined as the angle of the fuselage with respect to a horizontal plane. The flight-path angle is measured between the horizontal plane and the aircraft’s velocity vector, thus showing how steeply it is climbing or descending. Angle of attack is the difference between pitch attitude and flight-path angle. It is the angle at which the aircraft is moving through the surrounding air (assuming the air itself is motionless, i.e., no wind).
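The relation among the three angles is simple enough to state in one line of code. The sketch below assumes still air, as in the definition above; the function name and the sample numbers are mine, chosen only for illustration.

```python
def angle_of_attack(pitch_attitude_deg: float, flight_path_angle_deg: float) -> float:
    """Angle of attack in still air: pitch attitude minus flight-path angle."""
    return pitch_attitude_deg - flight_path_angle_deg


# Nose 10 degrees above the horizon while climbing along a 7-degree path.
print(angle_of_attack(10.0, 7.0))   # 3.0
# Nose 2 degrees up while descending along a 3-degree path.
print(angle_of_attack(2.0, -3.0))   # 5.0
```

The second case shows why pitch attitude alone is a poor guide: an airplane can have its nose only slightly above the horizon and still meet the air at a substantial angle if it is sinking.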

AoA affects both lift (the upward force opposing the downward tug of gravity) and drag (the dissipative force opposing forward motion and the thrust of the engines). As AoA increases from zero, lift is enhanced because of air impinging on the underside of the wings and fuselage. For the same reason, however, drag also increases. As the angle of attack grows even steeper, the flow of air over the wings becomes turbulent; beyond that point lift diminishes but drag continues increasing. That’s where the stall sets in. The critical angle for a stall depends on speed, weight, and other factors, but usually it’s no more than 15 degrees.

Neither the Lion Air nor the Ethiopian flight was ever in danger of stalling, so if MCAS was activated, it must have been by mistake. The working hypothesis mentioned in many press accounts is that the system received and acted upon erroneous input from a failed AoA sensor.

A sensor to measure angle of attack is conceptually simple. It’s essentially a weathervane poking out into the airstream. In the photo below, the angle-of-attack sensor is the small black vane just forward of the “737 MAX” legend. Hinged at the front, the vane rotates to align itself with the local airflow and generates an electrical signal that represents the vane’s angle with respect to the axis of the fuselage. The 737 MAX has two angle-of-attack vanes, one on each side of the nose. (The protruding devices above the AoA vane are pitot tubes, used to measure air speed. Another device below the word MAX is probably a temperature sensor.)

Angle of attack was not among the variables displayed to the pilots of the Lion Air 737, but the flight data recorder did capture signals derived from the two AoA sensors:

There’s something dreadfully wrong here. The left sensor is indicating an angle of attack about 20 degrees steeper than the right sensor. That’s a huge discrepancy. There’s no plausible way those disparate readings could reflect the true state of the airplane’s motion through the air, with the left side of the nose pointing sky-high and the right side near level. One of the measurements must be wrong, and the higher reading is the suspect one. If the true angle of attack ever reached 20 degrees, the airplane would already be in a deep stall. Unfortunately, on Flight 610 MCAS was taking data only from the left-side AoA sensor. It interpreted the nonsensical measurement as a valid indicator of aircraft attitude, and worked relentlessly to correct it, up to the very moment the airplane hit the sea.

### Cockpit automation

The tragedies in Jakarta and Addis Ababa are being framed as a cautionary tale of automation run amok, with computers usurping the authority of pilots. The Washington Post editorialized:

A second fatal airplane accident involving a Boeing 737 MAX 8 may have been a case of man vs. machine…. The debacle shows that regulators should apply extra review to systems that take control away from humans when safety is at stake.

Tom Dieusaert, a Belgian journalist who writes often on aviation and computation, offered this opinion:

What can’t be denied is that the Boeing of Flight JT610 had serious computer problems. And in the hi-tech, fly-by-wire world of aircraft manufacturers, where pilots are reduced to button pushers and passive observers, these accidents are prone to happen more in the future.

The button-pushing pilots are particularly irate. Gregory Travis, who is both a pilot and software developer, summed up his feelings in this acerbic comment:

“Raise the nose, HAL.”

“I’m sorry, Dave, I can’t do that.”

Even Donald Trump tweeted on the issue:

Airplanes are becoming far too complex to fly. Pilots are no longer needed, but rather computer scientists from MIT. I see it all the time in many products. Always seeking to go one unnecessary step further, when often old and simpler is far better. Split second decisions are….

….needed, and the complexity creates danger. All of this for great cost yet very little gain. I don’t know about you, but I don’t want Albert Einstein to be my pilot. I want great flying professionals that are allowed to easily and quickly take control of a plane!

There’s considerable irony in the complaint that the 737 is too automated; in many respects the aircraft is in fact quaintly old-fashioned. The basic design goes back more than 50 years, and even in the latest MAX models quite a lot of 1960s technology survives. The primary flight controls are hydraulic, with a spider web of high-pressure tubing running directly from the control yokes in the cockpit to the ailerons, elevator, and rudder. If the hydraulic systems should fail, there’s a purely mechanical backup, with cables and pulleys to operate the various control surfaces. For stabilizer trim the primary actuator is an electric motor, but again there’s a mechanical fallback, with crank wheels near the pilots’ knees pulling on cables that run all the way back to the tail.

Other aircraft are much more dependent on computers and electronics. The 737’s principal competitor, the Airbus A320, is a thoroughgoing fly-by-wire vehicle. The pilot flies the computer, and the computer flies the airplane. Specifically, the pilot decides where to go—up, down, left, right—but the computer decides how to get there, choosing which control surfaces to deflect and by how much. Boeing’s own more recent designs, the 777 and 787, also rely on digital controls. Indeed, the latest models from both companies go a step beyond fly-by-wire to fly-by-network. Most of the communication from sensors to computers and onward to control surfaces consists of digital packets flowing through a variant of Ethernet. The airplane is a computer peripheral.

Thus if you want to gripe about the dangers and indignities of automation on the flight deck, the 737 is not the most obvious place to start. And a Luddite campaign to smash all the avionics and put pilots back in the seat of their pants would be a dangerously off-target response to the current predicament. There’s no question the 737 MAX has a critical problem. It’s a matter of life and death for those who would fly in it and possibly also for the Boeing Company. But the problem didn’t start with MCAS. It started with earlier decisions that made MCAS necessary. Furthermore, the problem may not end with the remedy that Boeing has proposed—a software update that will hobble MCAS and leave more to the discretion of pilots.

### Maxing out the 737

The 737 flew its first passengers in 1968. It was (and still is) the smallest member of the Boeing family of jet airliners, and it is also the most popular by far. More than 10,000 have been sold, and Boeing has orders for another 4,600. Of course there have been changes over the years, especially to engines and instruments. A 1980s update came to be known as 737 Classic, and a 1997 model was called 737 NG, for “next generation.” (Now, with the MAX, the NG has become the previous generation.) Through all these revisions, however, the basic structure of the airframe has hardly changed.

Ten years ago, it looked like the 737 had finally come to the end of its life. Boeing announced it would develop an all-new design as a replacement, with a hull built of lightweight composite materials rather than aluminum. Competitive pressures forced a change of course. Airbus had a head start on the A320neo, an update that would bring more efficient engines to their entry in the same market segment. The revised Airbus would be ready around 2015, whereas Boeing’s clean-slate project would take a decade. Customers were threatening to defect. In particular, American Airlines—long a Boeing loyalist—was negotiating a large order of A320neos.

In 2011 Boeing scrapped the plan for an all-new design and elected to do the same thing Airbus was doing: bolt new engines onto an old airframe. This would eliminate most of the up-front design work, as well as the need to build tooling and manufacturing facilities. Testing and certification by the FAA would also go quicker, so that the first deliveries might be made in five or six years, not too far behind Airbus.

The new engines mated to the 737 promised a 14 percent gain in fuel efficiency, which might save an airline a million dollars a year in operating costs. (A 737-800, a pre-MAX model, burns about 800 gallons of jet fuel per hour aloft. That comes to $2,000 at $2.50 per gallon. If the airplane flies 10 hours a day, the annual fuel bill is $7.3 million. Fourteen percent of that is just over $1 million.) The better fuel economy would also increase the airplane’s range. And to sweeten the deal Boeing proposed to keep enough of the airframe unchanged that the new model would operate under the same “type certificate” as the old one. A pilot qualified to fly the 737 NG could step into the MAX without extensive retraining.
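For anyone who wants to check the million-dollar figure, here is the back-of-the-envelope arithmetic as a few lines of Python. The inputs are the rough numbers quoted above; the variable names are mine, and the calculation follows the same simplification of treating a 14 percent efficiency gain as a 14 percent cut in the fuel bill.

```python
gallons_per_hour = 800      # approximate fuel burn of a 737-800 aloft
price_per_gallon = 2.50     # dollars
hours_per_day = 10
days_per_year = 365
efficiency_gain = 0.14      # promised improvement for the MAX

hourly_cost = gallons_per_hour * price_per_gallon
annual_cost = hourly_cost * hours_per_day * days_per_year
annual_saving = annual_cost * efficiency_gain

print(f"Hourly fuel cost: ${hourly_cost:,.0f}")    # $2,000
print(f"Annual fuel bill: ${annual_cost:,.0f}")    # $7,300,000
print(f"Annual saving:    ${annual_saving:,.0f}")  # $1,022,000
```

Multiply that saving by a fleet of a few hundred airplanes and it is easy to see why customers were pressing for new engines.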

Sources: (left) Bryan via Wikimedia, CC BY 2.0; (right) Steve Lynes via Wikimedia, CC BY 2.0.

The original 1960s 737 had two cigar-shaped engines, long and skinny, tucked up under the wings (left photo above). Since then, jet engines have grown fat and stubby. They derive much of their thrust not from the jet exhaust coming out of the tailpipe but from “bypass” air moved by a large-diameter fan. Such engines would scrape on the ground if they were mounted under the wings of the 737; instead they are perched on pylons that extend forward from the leading edge of the wing. The engines on the MAX models (right photo) are the fattest yet, with a fan 69 inches in diameter. Compared with the NG series, the MAX engines are pushed a few inches farther forward and hang a few inches lower.

A New York Times article by David Gelles, Natalie Kitroeff, Jack Nicas, and Rebecca R. Ruiz describes the plane’s development as hurried and hectic.

Months behind Airbus, Boeing had to play catch-up. The pace of the work on the 737 Max was frenetic, according to current and former employees who spoke with The New York Times…. Engineers were pushed to submit technical drawings and designs at roughly double the normal pace, former employees said.

The Times article also notes: “Although the project had been hectic, current and former employees said they had finished it feeling confident in the safety of the plane.”

### Pitch instability

Sometime during the development of the MAX series, Boeing got an unpleasant surprise. The new engines were causing unwanted pitch-up movements under certain flight conditions. When I first read about this problem, soon after the Lion Air crash, I found the following explanation in an article by Sean Broderick and Guy Norris in Aviation Week and Space Technology (Nov. 26–Dec. 9, 2018, pp. 56–57):

Like all turbofan-powered airliners in which the thrust lines of the engines pass below the center of gravity (CG), any change in thrust on the 737 will result in a change of flight path angle caused by the vertical component of thrust.

In other words, the low-slung engines not only push the airplane forward but also tend to twirl it around the pitch axis. It’s like a motorcycle doing wheelies. Because the MAX engines are mounted farther below and in front of the center of gravity, they act through a longer lever arm and cause more severe pitch-up motions.

I found more detail on this effect in an earlier Aviation Week article, a 2017 pilot report by Fred George, describing his first flight at the controls of the new MAX 8.

The aircraft has sufficient natural speed stability through much of its flight envelope. But with as much as 58,000 lb. of thrust available from engines mounted well below the center of gravity, there is pronounced thrust-versus-pitch coupling at low speeds, especially with aft center of gravity (CG) and at light gross weights. Boeing equips the aircraft with a speed-stability augmentation function that helps to compensate for the coupling by automatically trimming the horizontal stabilizer according to indicated speed, thrust lever position and CG. Pilots still must be aware of the effect of thrust changes on pitching moment and make purposeful control-wheel and pitch-trim inputs to counter it.

The reference to an “augmentation function” that works by “automatically trimming the horizontal stabilizer” sounded awfully familiar, but it turns out this is not MCAS. The system that compensates for thrust-pitch coupling is known as speed-trim. Like MCAS, it works “behind the pilot’s back,” making adjustments to control surfaces that were not directly commanded. There’s yet another system of this kind called mach-trim that silently corrects a different pitch anomaly when the aircraft reaches transonic speeds, at about Mach 0.6. Neither of these systems is new to the MAX series of aircraft; they have been part of the control algorithm at least since the NG came out in 1997. MCAS runs on the same computer as speed-trim and mach-trim and is part of the same software system, but it is a distinct function. And according to what I’ve been reading in the past few weeks, it addresses a different problem—one that seems more sinister.

Most aircraft have the pleasant property of static stability. When an airplane is properly trimmed for level flight, you can let go of the controls—at least briefly—and it will continue on a stable path. Moreover, if you pull back on the control yoke to point the nose up, then let go again, the pitch angle should return to neutral. The layout of the airplane’s various airfoil surfaces accounts for this behavior. When the nose goes up, the tail goes down, pushing the underside of the horizontal stabilizer into the airstream. The pressure of the air against this tail surface provides a restoring force that brings the tail back up and the nose back down. (That’s why it’s called a stabilizer!) This negative feedback loop is built in to the structure of the airplane, so that any departure from equilibrium creates a force that opposes the disturbance.

However, the tail surface, with its helpful stabilizing influence, is not the only structure that affects the balance of aerodynamic forces. Jet engines are not designed to contribute lift to the airplane, but at high angles of attack they can do so, as the airstream impinges on the lower surface of each engine’s outer covering, or nacelle. When the engines are well forward of the center of gravity, the lift creates a pitch-up turning moment. If this moment exceeds the counterbalancing force from the tail, the aircraft is unstable. A nose-up attitude generates forces that raise the nose still higher, and positive feedback takes over.
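The difference between negative and positive feedback is easy to see in a toy model. The sketch below is emphatically not a model of the 737 or of any real airplane: it treats the pitch disturbance as a single number that changes at a rate proportional to the net of two opposing influences, a restoring "tail" term and a destabilizing "nacelle" term. The function name, the gains, and the time step are all invented for illustration, and real pitch dynamics involve inertia and damping that are ignored here.

```python
from typing import List


def simulate_pitch(tail_gain: float, nacelle_gain: float,
                   alpha0_deg: float = 1.0, dt: float = 0.1,
                   steps: int = 50) -> List[float]:
    """Euler-integrate d(alpha)/dt = (nacelle_gain - tail_gain) * alpha.

    alpha is a small nose-up disturbance in degrees. A negative net gain
    means the tail wins (negative feedback); a positive net gain means
    the nacelle lift wins (positive feedback).
    """
    alpha = alpha0_deg
    history = [alpha]
    for _ in range(steps):
        alpha += (nacelle_gain - tail_gain) * alpha * dt
        history.append(alpha)
    return history


stable = simulate_pitch(tail_gain=1.0, nacelle_gain=0.2)
unstable = simulate_pitch(tail_gain=1.0, nacelle_gain=1.5)

print(f"tail dominates:    {stable[0]:.2f} -> {stable[-1]:.2f} degrees")
print(f"nacelle dominates: {unstable[0]:.2f} -> {unstable[-1]:.2f} degrees")
```

With the tail term dominant, a one-degree disturbance dies away; with the nacelle term dominant, the same disturbance keeps growing. The only thing that changes between the two runs is the sign of the net gain.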

Is the 737 MAX vulnerable to such runaway pitch excursions? The possibility had not occurred to me until I read a commentary on MCAS on the Boeing 737 Technical Site, a web publication produced by Chris Brady, a former 737 pilot and flight instructor. He writes:

MCAS is a longitudinal stability enhancement. It is not for stall prevention or to make the MAX handle like the NG; it was introduced to counteract the non-linear lift of the LEAP-1B engine nacelles and give a steady increase in stick force as AoA increases. The LEAP engines are both larger and relocated slightly up and forward from the previous NG CFM56-7 engines to accommodate their larger fan diameter. This new location and size of the nacelle cause the vortex flow off the nacelle body to produce lift at high AoA; as the nacelle is ahead of the CofG this lift causes a slight pitch-up effect (ie a reducing stick force) which could lead the pilot to further increase the back pressure on the yoke and send the aircraft closer towards the stall. This non-linear/reducing stick force is not allowable under FAR §25.173 “Static longitudinal stability”. MCAS was therefore introduced to give an automatic nose down stabilizer input during steep turns with elevated load factors (high AoA) and during flaps up flight at airspeeds approaching stall.

(FAR = Federal Aviation Regulations. Part 25 deals with airworthiness standards for transport category airplanes.)

Brady cites no sources for this statement, and as far as I know Boeing has neither confirmed nor denied. But Aviation Week, which earlier mentioned the thrust-pitch linkage, has more recently (issue of March 20) gotten behind the nacelle-lift instability hypothesis:

The MAX’s larger CFM Leap 1 engines create more lift at high AOA and give the aircraft a greater pitch-up moment than the CFM56-7-equipped NG. The MCAS was added as a certification requirement to minimize the handling difference between the MAX and NG.

Assuming the Brady account is correct, an interesting question is when Boeing noticed the instability. Were the designers aware of this hazard from the outset? Did it emerge during early computer simulations, or in wind tunnel testing of scale models? A story by Dominic Gates in the Seattle Times hints that Boeing may not have recognized the severity of the problem until flight tests of the first completed aircraft began in 2015.

According to Gates, the safety analysis that Boeing submitted to the FAA specified that MCAS would be allowed to move the horizontal stabilizer by no more than 0.6 degree. In the airplane ultimately released to the market, MCAS can go as far as 2.5 degrees, and it can act repeatedly until reaching the mechanical limit of motion at about 5 degrees. Gates writes:

That limit was later increased after flight tests showed that a more powerful movement of the tail was required to avert a high-speed stall, when the plane is in danger of losing lift and spiraling down.

The behavior of a plane in a high angle-of-attack stall is difficult to model in advance purely by analysis and so, as test pilots work through stall-recovery routines during flight tests on a new airplane, it’s not uncommon to tweak the control software to refine the jet’s performance.

The high-AoA instability of the MAX appears to be a property of the aerodynamic form of the entire aircraft, and so a direct way to suppress it would be to alter that form. For example, enlarging the tail surface might restore static stability. But such airframe modifications would have delayed the delivery of the airplane, especially if the need for them was discovered only after the first prototypes were already flying. Structural changes might also jeopardize inclusion of the new model under the old type certificate. Modifying software instead of aluminum must have looked like an attractive alternative. Someday, perhaps, we’ll learn how the decision was made.

By the way, according to Gates, the safety document filed with the FAA specifying a 0.6 degree limit has yet to be amended to reflect the true range of MCAS commands.

### Flying while unstable

Instability is not necessarily the kiss of death in an airplane. There have been at least a few successful unstable designs, starting with the 1903 Wright Flyer. The Wright brothers deliberately put the horizontal stabilizer in front of the wing rather than behind it because their earlier experiments with kites and gliders had shown that what we call stability can also be described as sluggishness. The Flyer’s forward control surfaces (known as canards) tended to amplify any slight nose-up or nose-down motions. Maintaining a steady pitch attitude demanded high alertness from the pilot, but it also allowed the airplane to respond more quickly when the pilot wanted to pitch up or down. (The pros and cons of the design are reviewed in a 1984 paper by Fred E. C. Culick and Henry R. Jex.)

Orville at the controls, Wilbur running alongside, at Kitty Hawk on December 17, 1903. In this view we are seeing the airplane from the stern. The canards—dual adjustable horizontal surfaces at the front—seem to be calling for nose-up pitch. (Photo from Wikimedia.)

Another dramatically unstable aircraft was the Grumman X-29, a research platform designed in the 1980s. The X-29 had its wings on backwards; to make matters worse, the primary surfaces for pitch control were canards mounted in front of the wings, as in the Wright Flyer. The aim of this quirky project was to explore designs with exceptional agility, sacrificing static stability for tighter maneuvering. No unaided human pilot could have mastered such a twitchy vehicle. It required a digital fly-by-wire system that sampled the state of the airplane and adjusted the control surfaces up to 80 times per second. The controller was successful—perhaps too much so. It allowed the airplane to be flown safely, but in taming the instability it also left the plane with rather tame handling characteristics.

I have a glancing personal connection with the X-29 project. In the 1980s I briefly worked as an editor with members of the group at Honeywell who designed and built the X-29 control system. I helped prepare publications on the control laws and on their implementation in hardware and software. That experience taught me just enough to recognize something odd about MCAS: It is way too slow to be suppressing aerodynamic instability in a jet aircraft. Whereas the X-29 controller had a response time of 25 milliseconds, MCAS takes 10 seconds to move the 737 stabilizer through a 2.5-degree adjustment. At that pace, it cannot possibly keep up with forces that tend to flip the nose upward in a positive feedback loop.
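To put numbers on the mismatch, here is the comparison worked out from the two figures just quoted. The variable names are mine, and the comparison is deliberately crude: a controller's response time and an actuator's travel time are not quite the same kind of quantity, so treat the ratio as an order-of-magnitude indication only.

```python
x29_response_s = 0.025      # X-29 controller response time (25 milliseconds)
mcas_travel_deg = 2.5       # one full MCAS stabilizer adjustment
mcas_travel_time_s = 10.0   # time MCAS takes to make that adjustment

mcas_rate_deg_per_s = mcas_travel_deg / mcas_travel_time_s
slowdown = mcas_travel_time_s / x29_response_s

print(f"MCAS stabilizer rate: {mcas_rate_deg_per_s:.2f} degrees per second")  # 0.25
print(f"MCAS is roughly {slowdown:.0f} times slower than the X-29 controller")  # 400
```

A gap of more than two orders of magnitude is not something that tuning could close.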

There’s a simple explanation. MCAS is not meant to control an unstable aircraft. It is meant to restrain the aircraft from entering the regime where it becomes unstable. This is the same strategy used by other mechanisms of stall prevention—intervening before the angle of attack reaches the critical point. However, if Brady is correct about the instability of the 737 MAX, the task is more urgent for MCAS. Instability implies a steep and slippery slope. MCAS is a guard rail that bounces you back onto the road when you’re about to drive over the cliff.

Which brings up the question of Boeing’s announced plan to fix the MCAS problem. Reportedly, the revised system will not keep reactivating itself so persistently, and it will automatically disengage if it detects a large difference between the two AoA sensors. These changes should prevent a recurrence of the recent crashes. But do they provide adequate protection against the kind of mishap that MCAS was designed to prevent in the first place? With MCAS shut down, either manually or automatically, there’s nothing to stop an unwary or misguided pilot from wandering into the corner of the flight envelope where the MAX becomes unstable.
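The reported cross-check between the two sensors amounts to a few lines of logic. The sketch below is my own guess at its shape, not Boeing's design. The names are invented, and `MAX_AOA_DISAGREE_DEG` is a placeholder: the reports I have seen say only that the difference must be "large."

```python
# Hypothetical value: the reports say only "a large difference".
MAX_AOA_DISAGREE_DEG = 5.0


def aoa_sensors_agree(left_deg: float, right_deg: float,
                      max_diff_deg: float = MAX_AOA_DISAGREE_DEG) -> bool:
    """Return True if the two vane readings are close enough to be trusted."""
    return abs(left_deg - right_deg) <= max_diff_deg


def mcas_permitted(left_deg: float, right_deg: float) -> bool:
    """Under the revised scheme, MCAS stands down when the sensors disagree."""
    return aoa_sensors_agree(left_deg, right_deg)


# Illustrative readings roughly 20 degrees apart, like those on Flight 610.
print(mcas_permitted(left_deg=21.0, right_deg=1.0))  # False
# Two sensors telling nearly the same story.
print(mcas_permitted(left_deg=4.0, right_deg=3.5))   # True
```

With only two sensors, a check like this can detect a disagreement but cannot say which vane is lying, so the only safe response is to switch the function off.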

Without further information from Boeing, there’s no telling how severe the instability might be—if indeed it exists at all. The Brady article at the Boeing 737 Technical Site implies the problem is partly pilot-induced. Normally, to make the nose go higher and higher you have to pull harder and harder on the control yoke. In the unstable region, however, the resistance to pulling suddenly fades, and so the pilot may unwittingly pull the yoke to a more extreme position.

Is this human interaction a necessary part of the instability, or is it just an exacerbating factor? In other words, without the pilot in the loop, would there still be positive feedback causing runaway nose-up pitch? I have yet to find answers.

Another question: If the root of the problem is a deceptive change in the force resisting nose-up movements of the control yoke, why not address that issue directly? In the 737 (and most other large aircraft) the forces that the pilot “feels” through the control yoke are not simple reflections of the aerodynamic forces acting on the elevator and other control surfaces. The feedback forces are largely synthetic, generated by an “elevator feel computer” and an “elevator feel and centering unit,” devices that monitor the state of the airplane and generate appropriate hydraulic pressures pushing the yoke one way or another. Those systems could have been given the additional task of maintaining or increasing back force on the yoke when the angle of attack approaches the instability. Artificially enhanced resistance is already part of the stall warning system. Why not extend it to MCAS? (There may be a good answer; I just don’t know it.)

The elevator feel computer and the elevator feel and centering unit provide “fake” forces to the pilot’s control yoke. Figure borrowed from B737 NG Flight controls, a presentation by theoryce. The presentation is for the 737 NG, not the MAX series; it’s possible the architecture has changed.

### Where’s the off switch?

Even after the spurious activation of MCAS on Lion Air 610, the crash and the casualties would have been avoided if the pilots had simply turned the damn thing off. Why didn’t they? Apparently because they had never heard of MCAS, and didn’t know it was installed on the airplane they were flying, and had not received any instruction on how to disable it. There’s no switch or knob in the cockpit labeled “MCAS ON/OFF.” The Flight Crew Operation Manual does not mention it (except in a list of abbreviations), and neither did the transitional training program the pilots had completed before switching from the 737 NG to the MAX. The training consisted of either one or two hours (reports differ) with an iPad app.

Boeing’s explanation of these omissions was captured in a Wall Street Journal story:

One high-ranking Boeing official said the company had decided against disclosing more details to cockpit crews due to concerns about inundating average pilots with too much information—and significantly more technical data—than they needed or could digest.

To call this statement disingenuous would be disingenuous. What it is is preposterous. In the first place, Boeing did not withhold “more details”; they failed to mention the very existence of MCAS. And the too-much-information argument is silly. I don’t have access to the Flight Crew Operation Manual for the MAX, but the NG edition runs to more than 1,300 pages, plus another 800 for the Quick Reference Handbook. A few paragraphs on MCAS would not have sunk any pilot who wasn’t already drowning in TMI. Moreover, the manual carefully documents the speed-trim and mach-trim features, which seem to fall in the same category as MCAS: They act autonomously, and offer the pilot no direct interface for monitoring or adjusting them.

In the aftermath of the Lion Air accident, Boeing stated that the procedure for disabling MCAS was spelled out in the manual, even though MCAS itself wasn’t mentioned. That procedure is given in a checklist for “runaway stabilizer trim.” It is not complicated: Hang onto the control yoke, switch off the autopilot and autothrottles if they’re on; then, if the problem persists, flip two switches labeled “STAB TRIM” to the “CUTOUT” position. Only the last step will actually matter in the case of an MCAS malfunction.

This checklist is considered a “memory item”; pilots must be able to execute the steps without looking it up in the handbook. The Lion Air crew should certainly have been familiar with it. But could they recognize that it was the right checklist to apply in an airplane whose behavior was unlike anything they had seen in their training or previous 737 flying experience? According to the handbook, the condition that triggers use of the runaway checklist is “Uncommanded stabilizer trim movement occurs continuously.” The MCAS commands were not continuous but repetitive, so some leap of inference would have been needed to make this diagnosis.

By the time of the Ethiopian crash, 737 pilots everywhere knew all about MCAS and the procedure for disabling it. A preliminary report issued last week by Ethiopian Airlines indicates that after a few minutes of wrestling with the control yoke, the pilots on Flight 302 did invoke the checklist procedure, and moved the STAB TRIM switches to CUTOUT. The stabilizer then stopped responding to MCAS nose-down commands, but the pilots were unable to regain control of the airplane.

It’s not entirely clear why they failed or what was going on in the cockpit in those last minutes. One factor may be that the cutout switch disables not only automatic pitch trim movements but also manual ones requested through the buttons on the control yoke. The switch cuts all power to the electric motor that moves the stabilizer. In this situation the only way to adjust the trim is to turn the hand crank wheels near the pilots’ knees. During the crisis on Flight 302 that mechanism may have been too slow to correct the trim in time, or the pilots may have been so fixated on pulling the control yoke back with maximum force that they did not try the manual wheels. It’s also possible that they flipped the switches back to the NORMAL setting, restoring power to the stabilizer motor. The report’s narrative doesn’t mention this possibility, but the graph from the flight data recorder suggests it (see below).

### The single point of failure

There’s room for debate on whether the MCAS system is a good idea when it is operating correctly, but when it activates mistakenly and sends an airplane diving into the sea, no one would defend it. By all appearances, the rogue behavior in both the Lion Air and the Ethiopian accidents was triggered by a malfunction in a single sensor. That’s not supposed to happen in aviation. It’s unfathomable that any aircraft manufacturer would knowingly build a vehicle in which the failure of a single part would lead to a fatal accident.

Protection against single failures comes from redundancy, and the 737 is so committed to this principle that it almost amounts to two airplanes wrapped up in a single skin. (Aircraft that rely more heavily on automation generally have three of everything—sensors, computers, and actuators.) The cockpit has stations for two pilots, who look at separate sets of instruments and operate separate sets of controls. The left and right instrument panels receive signals from separate sets of sensors, and those signals are processed by separate computers. Each side of the cockpit has its own inertial guidance system, its own navigation computer, its own autopilot. There are two electric power supplies and two hydraulic systems—plus mechanical backups in case of a dual hydraulic failure. The two control yokes normally move in unison—they are linked under the floor—but if one yoke should get stuck, the connection can be broken, allowing the other pilot to continue flying the airplane.

There’s one asterisk in this roster of redundancy: A device called the flight control computer, or FCC, apparently gets special treatment. There are two FCCs, but according to the Boeing 737 Technical Site only one of them operates during any given flight. All the other duplicated components run in parallel, receiving independent inputs, doing independent computations, emitting independent control actions. But for each flight just one FCC does all the work, and the other is put on standby. The scheme for choosing the active computer seems strangely arbitrary. Each day when the airplane is powered up, the left side FCC gets control for the first flight, then the right side unit takes over for the second flight of the day, and the two sides alternate until the power is shut off. After a restart, the alternation begins again with the left FCC.
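To make the alternation rule concrete, here is a minimal sketch of the scheme as I understand it from the description above. The function name and the event labels (`"power_up"`, `"flight"`) are my own inventions for illustration; nothing here is taken from Boeing documentation.

```python
def active_fcc_by_flight(events):
    """Return which FCC ("left" or "right") is active on each flight.

    `events` is a sequence of the hypothetical labels "power_up" and
    "flight". Assumed rule: the left FCC takes the first flight after
    every power-up, and the two sides alternate until the next power-up.
    """
    active = []
    next_side = "left"
    for event in events:
        if event == "power_up":
            next_side = "left"  # a restart resets the alternation
        elif event == "flight":
            active.append(next_side)
            next_side = "right" if next_side == "left" else "left"
        else:
            raise ValueError(f"unknown event: {event!r}")
    return active


# Three flights, a full shutdown, then one more flight.
print(active_fcc_by_flight(["power_up", "flight", "flight", "flight",
                            "power_up", "flight"]))
```

Under this rule, knowing which FCC was active on a given flight requires knowing every power cycle since the start of the day, including any shutdowns during maintenance.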

Aspects of this scheme puzzle me. I don’t understand why redundant FCC units are treated differently from other components. If one FCC dies, does the other automatically take over? Can the pilots switch between them in flight? If so, would that be an effective way to combat MCAS misbehavior? I’ve tried to find answers in the manuals, but I don’t trust my interpretation of what I read.

I’ve also had a hard time learning anything about the FCC itself. I don’t know who makes it, or what it looks like, or how it is programmed. (A website called Airframer lists many suppliers of parts and materials for the 737, but there’s no entry for a flight control computer.) On a website called Closet Wonderfuls an item identified as a 737 flight control computer is on offer for \$43.82, with free shipping. It has a Honeywell label. I’m tempted, but I’m pretty sure this is not the unit installed in the latest MAX models. I’ve learned that the FCC was once the FCE, for flight control electronics, suggesting it was an analog device, doing its integrations and differentiations with capacitors and resistors. By now I’m sure the FCC has caught up with the digital age, but it might still be special-purpose, custom-built hardware. Or it might be an off-the-shelf Intel CPU in a fancy box, maybe even running Linux or Windows. I just don’t know.

In the context of the MAX crashes, the flight control computer is important for two reasons. First, it’s where MCAS lives; this is the computer on which the MCAS software runs. Second, the curious procedure for choosing a different FCC on alternating flights also winds up choosing which AoA sensor is providing input to MCAS. The left and right sensors are connected to the corresponding FCCs.

If the two FCCs are used in alternation, that raises an interesting question about the history of the aircraft that crashed in Indonesia. The preliminary crash report describes trouble with various instruments and controls on five flights over four days (including the fatal flight). All of the problems were on the left side of the aircraft or involved a disagreement between the left and right sides.
The flight marked with question marks is not mentioned in the preliminary report, but the airplane had to get from Manado to Denpasar for the following day’s flight.

| Date | Route | Trouble reports | Maintenance |
|---|---|---|---|
| Oct 26 | Tianjin → Manado | left side: no airspeed or altitude indications | test left Stall Management and Yaw Damper computer; passed |
| ? | Manado → Denpasar | ? | ? |
| Oct 27 | Denpasar → Manado | left side: no airspeed or altitude indications<br>speed trim and mach trim warning lights | test left Stall Management and Yaw Damper computer; failed<br>reset left Air Data and Inertial Reference Unit<br>retest left Stall Management and Yaw Damper computer; passed<br>clean electrical connections |
| Oct 27 | Manado → Denpasar | left side: no airspeed or altitude indications<br>speed trim and mach trim warning lights<br>autothrottle disconnect | test left Stall Management and Yaw Damper computer; failed<br>reset left Air Data and Inertial Reference Unit<br>replace left AoA sensor |
| Oct 28 | Denpasar → Jakarta | left/right disagree warning on airspeed and altitude<br>stick shaker<br>[MCAS activation] | flush left pitot tube and static port<br>clean electrical connectors on elevator “feel” computer |
| Oct 29 | Jakarta → Pangkal Pinang | stick shaker<br>[MCAS activation] | |

Which of the five flights had the left-side FCC as active computer? The final two flights, where MCAS activated, were both first-of-the-day flights and so presumably under control of the left FCC. For the rest it’s hard to tell, especially since maintenance operations may have entailed full shutdowns of the aircraft, which would have reset the alternation sequence.

The revised MCAS software will reportedly consult signals from both AoA sensors. What will it do with the additional information? Only one clue has been published so far: If the readings differ by more than 5.5 degrees, MCAS will shut down. What if the readings differ by 4 or 5 degrees? Which sensor will MCAS choose to believe? Conservative (or pessimistic) engineering practice would seem to favor the higher reading, in order to provide better protection against instability and a stall. But that choice also raises the risk of dangerous “corrections” mandated by a faulty sensor. (A recent paper by Daniel Ossmann of the German Aerospace Center discusses algorithmic detection of failures in AoA sensors.)

The present MCAS system, with its alternating choice of left and right, has a 50 percent chance of disaster when a single random failure causes an AoA sensor to spew out falsely high data. With the same one-sided random failure, the updated MCAS will have a 100 percent chance of ignoring a pilot’s excursion into stall territory. Is that an improvement?
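The disagreement rule is simple enough to sketch. The code below is my own guess at the logic, not Boeing's. The 5.5-degree threshold is the reported figure, but the function name, the strict "more than" comparison, and the choice of the higher reading when the sensors agree are all assumptions made for illustration.

```python
AOA_DISAGREE_LIMIT_DEG = 5.5  # threshold reported for the revised software


def mcas_aoa_input(left_deg, right_deg, limit_deg=AOA_DISAGREE_LIMIT_DEG):
    """Return the AoA value a revised MCAS might act on, or None if inhibited.

    Assumption: when the two vanes agree to within the limit, use the
    higher reading (the "conservative" choice). The real selection
    policy has not been published.
    """
    if abs(left_deg - right_deg) > limit_deg:
        return None  # sensors disagree: MCAS stays out of the loop
    return max(left_deg, right_deg)


# A left vane reading 20 degrees high inhibits MCAS whatever the true
# angle of attack is, including a genuinely high one.
for true_aoa in (2.0, 16.0):
    print(true_aoa, mcas_aoa_input(true_aoa + 20.0, true_aoa))
```

Both cases return `None`. With a one-sided offset of this size, the check cannot tell a false alarm from a real excursion toward a stall.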

### The broken sensor

Although a faulty sensor should not bring down an airplane, I would still like to know what went wrong with the AoA vane.

It’s no surprise that AoA sensors can fail. They are mechanical devices operating in a harsh environment: winds exceeding 500 miles per hour and temperatures below –40. A common failure mode is a stuck vane, often caused by ice (despite a built-in de-icing heater). But a seized vane would produce a constant output, regardless of the real angle of attack, which is not the symptom seen in Flight 610. The flight data recorder shows small fluctuations in the signals from both the left and the right instruments. Furthermore, the jiggles in the two curves are closely aligned, suggesting they are both tracking the same movements of the aircraft. In other words, the left-hand sensor appears to be functioning; it’s just giving measurements offset by a constant deviation of roughly 20 degrees.

Is there some other failure mode that might produce the observed offset? Sure: Just bend the vane by 20 degrees. Maybe a catering truck or an airport jetway blundered into it. Another creative thought is that the sensor might have been installed wrong, with the entire unit rotated by 20 degrees. Several writers on a website called the Professional Pilots Rumour Network explored this possibility, but they ultimately concluded it was impossible. The manufacturer, doubtless aware of the risk, placed the mounting screws and locator pins asymmetrically, so the unit will only go into the hull opening one way.

You might get the same effect through an assembly error during the manufacture of the sensor. The vane could be incorrectly attached to the shaft, or else the internal transducer that converts angular position into an electrical signal might be mounted wrong. Did the designers also ensure that such mistakes are impossible? I don’t know; I haven’t been able to find any drawings or photographs of the sensor’s innards.

Looking for other ideas about what might have gone wrong, I made a quick, scattershot survey of FAA airworthiness directives that call for servicing or replacing AoA sensors. I found dozens of them, including several that discuss the same sensor installed on the 737 MAX (the Rosemount 0861). But none of the reports I read describes a malfunction that could cause a consistent 20-degree error.

For a while I thought that the fault might lie not in the sensor itself but farther along the data path. It could be something as simple as a bad cable or connector. Signals from the AoA sensor go to the Air Data and Inertial Reference Unit (ADIRU), where the sine and cosine components are combined and digitized to yield a number representing the measured angle of attack. The ADIRU also receives inputs from other sensors, including the pitot tubes for measuring airspeed and the static ports for air pressure. And it houses the gyroscopes and accelerometers of an inertial guidance system, which can keep track of aircraft motion without reference to external cues. (There’s a separate ADIRU for each side of the airplane.) Maybe there was a problem with the digitizer—a stuck bit rather than a stuck vane.
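As an aside on the sine and cosine components: the natural way to turn such a pair of signals into an angle is the two-argument arctangent. This is a minimal sketch of that step only. I don't know how the ADIRU actually performs the conversion, and the function name and voltage values are hypothetical.

```python
import math


def resolver_angle_deg(sin_signal, cos_signal):
    """Recover an angle in degrees from sine and cosine signal components.

    atan2 depends only on the ratio and signs of the two inputs, so a
    common amplitude factor (for example, the excitation voltage)
    cancels out.
    """
    return math.degrees(math.atan2(sin_signal, cos_signal))


amplitude = 3.7  # arbitrary illustrative amplitude, in volts
angle = math.radians(12.5)
print(resolver_angle_deg(amplitude * math.sin(angle),
                         amplitude * math.cos(angle)))
```

Because the result depends only on the relation between the two components, the same angle comes out at any signal amplitude.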

Further information has undermined this idea. For one thing, the AoA sensor removed by the Lion Air maintenance crew on October 27 is now in the hands of investigators. According to news reports, it was “deemed to be defective,” though I’ve heard no hint of what the defect might be. Also, it turns out that one element of the control system, the Stall Management and Yaw Damper (SMYD) computer, receives the raw sine and cosine voltages directly from the sensor, not a digitized angle calculated by the ADIRU. It is the SMYD that controls the stick-shaker function. On both the Lion Air and the Ethiopian flights the stick shaker was active almost continuously, so those undigitized sine and cosine voltages must have been indicating a high angle of attack. In other words the error already existed before the signals reached the ADIRU.

I’m still stumped by the fixed angular offset in the Lion Air data, but the question now seems a little less important. The release of the preliminary report on Ethiopian Flight 302 shows that the left-side AoA sensor on that aircraft also failed badly, but in a way that looks totally different. Here are the relevant traces from the flight data recorder:

The readings from the AoA sensors are the uppermost lines, red for the left sensor and blue for the right. At the left edge of the graph they differ somewhat when the airplane has just begun to move, but they fall into close coincidence once the roll down the runway has built up some speed. At takeoff, however, they suddenly diverge dramatically, as the left vane begins reading an utterly implausible 75 degrees nose up. Later it comes down a few degrees but otherwise shows no sign of the ripples that would suggest a response to airflow. At the very end of the flight there are some more unexplained excursions.

By the way, in this graph the light blue trace of automatic trim commands offers another clue to what might have happened in the last moments of Flight 302. Around the middle of the graph, the STAB TRIM switches were pulled, with the result that an automatic nose-down command had no effect on the stabilizer position. But at the far right, another automatic nose-down command does register in the trim-position trace, suggesting that the cutout switches may have been turned on again.

### Still more stumpers

There’s so much I still don’t understand.

Puzzle 1. If the Lion Air and Ethiopian accidents were both caused by faulty AoA sensors, then there were three parts with similar defects in brand new aircraft (including the replacement sensor installed by Lion Air on October 27). A recent news item says the replacement was not a new part but one that had been refurbished by a Florida shop called XTRA Aerospace. This fact offers us somewhere else to point the accusatory finger, but presumably the two sensors installed by Boeing were not retreads, so XTRA can’t be blamed for all of them.

There are roughly 400 MAX aircraft in service, with 800 AoA sensors. Is a failure rate of 3 out of 800 unusual or unacceptable? Does that judgment depend on whether or not it’s the same defect in all three cases?

Puzzle 2. Let’s look again at the traces for pitch trim and angle of attack in the Lion Air 610 data. The conflicting manual and automatic commands in the second half of the flight have gotten lots of attention, but I’m also baffled by what was going on in the first few minutes.

During the roll down the runway, the pitch trim system was set near its maximum pitch-up position (dark blue line). Immediately after takeoff, the automatic trim system began calling for further pitch-up movement, and the stabilizer probably reached its mechanical limit. At that point the pilots manually trimmed it in the pitch-down direction, and the automatic system replied with a rapid sequence of up adjustments. In other words, there was already a tug-of-war underway, but the pilots and the automated controls were pulling in directions opposite to those they would choose later on. All this happened while the flaps were still deployed, which means that MCAS could not have been active. Some other element of the control system must have been issuing those automatic pitch-up orders. Deepening the mystery, the left side AoA sensor was already feeding its spurious high readings to the left-side flight control computer. If the FCC was acting on that data, it should not have been commanding nose-up trim.

Puzzle 3. The AoA readings are not the only peculiar data in the chart from the Lion Air preliminary report. Here are the altitude and speed traces:

The left-side altitude readings (red) are low by at least a few hundred feet. The error looks like it might be multiplicative rather than additive, perhaps 10 percent. The left and right computed airspeeds also disagree, although the chart is too squished to allow a quantitative comparison. It was these discrepancies that initially upset the pilots of Flight 610; they could see them on their instruments. (They had no angle of attack indicators in the cockpit, so that conflict was invisible to them.)

Altitude, airspeed, and angle of attack are all measured by different sensors. Could they all have gone haywire at the same time? Or is there some common point of failure that might explain all the weird behavior? In particular, is it possible a single wonky AoA sensor caused all of this havoc? My guess is yes. The sensors for altitude and airspeed and even temperature are influenced by angle of attack. The measured speed and pressure are therefore adjusted to compensate for this confounding variable, using the output of the AoA sensor. That output was wrong, and so the adjustments allowed one bad data stream to infect all of the air data measurements.

### Man or machine

Six months ago, I was writing about another disaster caused by an out-of-control control system. In that case the trouble spot was a natural gas distribution network in Massachusetts, where a misconfigured pressure-regulating station caused fires and explosions in more than 100 buildings, with one fatality and 20 serious injuries. I lamented: “The special pathos of technological tragedies is that the engines of our destruction are machines that we ourselves design and build.”

In a world where defective automatic controls are blowing up houses and dropping aircraft out of the sky, it’s hard to argue for more automation, for adding further layers of complexity to control systems, for endowing machines with greater autonomy. Public sentiment leans the other way. Like President Trump, most of us trust pilots more than we trust computer scientists. We don’t want MCAS on the flight deck. We want Chesley Sullenberger III, the hero of US Airways Flight 1549, who guided his crippled A320 to a dead-stick landing in the Hudson River and saved all 155 souls on board. No amount of cockpit automation could have pulled off that feat.

Nevertheless, a cold, analytical view of the statistics suggests a different reaction. The human touch doesn’t always save the day. On the contrary, pilot error is responsible for more fatal crashes than any other cause. One survey lists pilot error as the initiating event in 40 percent of fatal accidents, with equipment failure accounting for 23 percent. No one is (yet) advocating a pilotless cockpit, but at this point in the history of aviation technology that’s a nearer prospect than a computer-free cockpit.

The MCAS system of the 737 MAX represents a particularly awkward compromise between fully manual and fully automatic control. The software is given a large measure of responsibility for flight safety and is even allowed to override the decisions of the pilot. And yet when the system malfunctions, it’s entirely up to the pilot to figure out what went wrong and how to fix it—and the fix had better be quick, before MCAS can drive the plane into the ground.

Two lost aircraft and 346 deaths are strong evidence that this design was not a good idea. But what to do about it? Boeing’s plan is a retreat from automatic control, returning more responsibility and authority to the pilots:

• Flight control system will now compare inputs from both AOA sensors. If the sensors disagree by 5.5 degrees or more with the flaps retracted, MCAS will not activate. An indicator on the flight deck display will alert the pilots.
• If MCAS is activated in non-normal conditions, it will only provide one input for each elevated AOA event. There are no known or envisioned failure conditions where MCAS will provide multiple inputs.
• MCAS can never command more stabilizer input than can be counteracted by the flight crew pulling back on the column. The pilots will continue to always have the ability to override MCAS and manually control the airplane.

A statement from Dennis Muilenburg, Boeing’s CEO, says the software update “will ensure accidents like that of Lion Air Flight 610 and Ethiopian Airlines Flight 302 never happen again.” I hope that’s true, but what about the accidents that MCAS was designed to prevent? I also hope we will not be reading about a 737 MAX that stalled and crashed because the pilots, believing MCAS was misbehaving, kept hauling back on the control yokes.

If Boeing were to take the opposite approach—not curtailing MCAS but enhancing it with still more algorithms that fiddle with the flight controls—the plan would be greeted with hoots of outrage and derision. Indeed, it seems like a terrible idea. MCAS was installed to prevent pilots from wandering into hazardous territory. A new supervisory system would keep an eye on MCAS, stepping in if it began acting suspiciously. Wouldn’t we then need another custodian to guard the custodians, ad infinitum? Moreover, with each extra layer of complexity we get new side effects and unintended consequences and opportunities for something to break. The system becomes harder to test, and impossible to prove correct.

Those are serious objections, but the problem being addressed is also serious.

Suppose the 737 MAX didn’t have MCAS but did have a cockpit indicator of angle of attack. On the Lion Air flight, the captain would have felt the stick-shaker warning him of an incipient stall and would have seen an alarmingly high angle of attack on his instrument panel. His training would have impelled him to do the same thing MCAS did: Push the nose down to get the wings working again. Would he have continued pushing it down until the plane crashed? Surely not. He would have looked out the window, he would have cross-checked the instruments on the other side of the cockpit, and after some scary moments he would have realized it was a false alarm. (In darkness or low visibility, where the pilot can lose track of the horizon, the outcome might be worse.)

I see two lessons in this hypothetical exercise. First, erroneous sensor data is dangerous, whether the airplane is being flown by a computer or by Chesley Sullenberger. A prudently designed instrument and control system would take steps to detect (and ideally correct) such errors. At the moment, redundancy is the only defense against these failures—and in the unpatched version of MCAS even that protection is compromised. It’s not enough. One key to the superiority of human pilots is that they exercise judgment and sometimes skepticism about what the instruments tell them. That kind of reasoning is not beyond the reach of automated systems. There’s plenty of information to be exploited. For example, inconsistencies between AoA sensors, pitot tubes, static pressure ports, and air temperature probes not only signal that something’s wrong but can offer clues about which sensor has failed. The inertial reference unit provides an independent check on aircraft attitude; even GPS signals might be brought to bear. Admittedly, making sense of all this data and drawing a valid conclusion from it—a problem known as sensor fusion—is a major challenge.
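Here is one toy version of such a cross-check, offered as a sketch and not a proposal. In wings-level flight through still air, angle of attack is approximately pitch angle minus flight-path angle, and both of those can be estimated without the AoA vanes. The function names, the 5-degree tolerance, and the sample numbers are all my own assumptions. The speed input is assumed to come from a source that does not depend on the vanes.

```python
import math


def inertial_aoa_estimate_deg(pitch_deg, vertical_speed, speed):
    """Estimate angle of attack as pitch minus flight-path angle.

    Assumes wings level and still air. `vertical_speed` and `speed`
    (speed along the flight path) must be in the same units.
    """
    flight_path_deg = math.degrees(math.asin(vertical_speed / speed))
    return pitch_deg - flight_path_deg


def suspect_vanes(left_deg, right_deg, estimate_deg, tolerance_deg=5.0):
    """Return the names of vanes whose readings are far from the estimate."""
    readings = {"left": left_deg, "right": right_deg}
    return [name for name, value in readings.items()
            if abs(value - estimate_deg) > tolerance_deg]


# Hypothetical climb: 8 degrees of pitch, climbing 10 m/s at 100 m/s.
estimate = inertial_aoa_estimate_deg(8.0, 10.0, 100.0)
print(round(estimate, 2), suspect_vanes(22.3, 2.5, estimate))
```

In this made-up case the estimate is a little over 2 degrees, so the left vane's 22.3 is flagged and the right vane's 2.5 is not. A real implementation would have to cope with wind, turns, sensor noise, and the possibility that the reference data is itself corrupted.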

Second, a closed-loop controller has yet another source of information: an implicit model of the system being controlled. If you change the angle of the horizontal stabilizer, the state of the airplane is expected to change in known ways—in angle of attack, pitch angle, airspeed, altitude, and in the rate of change in all these parameters. If the result of the control action is not consistent with the model, something’s not right. To persist in issuing the same commands when they don’t produce the expected results is not reasonable behavior. Autopilots include rules to deal with such situations; the lower-level control laws that run in manual-mode flight could incorporate such sanity checks as well.
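A sanity check of this kind might look something like the sketch below. It compares the change the model predicts from each command with the change actually observed, and withdraws permission to keep commanding after too many mismatches. The class name, the thresholds, and the idea of counting consecutive misses are all hypothetical. I am not suggesting this particular check would have caught the failures in either accident.

```python
class TrimSanityCheck:
    """Stop issuing commands when the airplane doesn't respond as modeled."""

    def __init__(self, min_fraction=0.25, max_misses=2):
        self.min_fraction = min_fraction  # required share of predicted change
        self.max_misses = max_misses      # consecutive mismatches tolerated
        self.misses = 0

    def record(self, expected_change_deg, observed_change_deg):
        """Log one command/response pair; return True if commands may continue."""
        if expected_change_deg != 0:
            ratio = observed_change_deg / expected_change_deg
            if ratio >= self.min_fraction:
                self.misses = 0           # response roughly matches the model
            else:
                self.misses += 1          # too small, or in the wrong direction
        return self.misses < self.max_misses


check = TrimSanityCheck()
print(check.record(-2.0, -1.5))  # responded as expected
print(check.record(-2.0, 0.1))   # first mismatch, still allowed
print(check.record(-2.0, 0.0))   # second mismatch in a row, stop
```

The three calls print `True`, `True`, `False`: the controller gives up after two commands in a row fail to produce the predicted effect.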

I don’t claim to have the answer to the MCAS problem. And I don’t want to fly in an airplane I designed. (Neither do you.) But there’s a general principle here that I believe should be taken to heart: If an autonomous system makes life-or-death decisions based on sensor data, it ought to verify the validity of the data.

### Update 2019-04-11

Boeing continues to insist that MCAS is “not a stall-protection function and not a stall-prevention function. It is a handling-qualities function. There’s a misconception it is something other than that.” This statement comes from Mike Sinnett, who is vice president of product development and future airplane development at Boeing; it appears in an Aviation Week article by Guy Norris published online April 9.

I don’t know exactly what “handling qualities” means in this context. To me the phrase connotes something that might affect comfort or aesthetics or pleasure more than safety. An airplane with different handling qualities would feel different to the pilot but could still be flown without risk of serious mishap. Is Sinnett implying something along those lines? If so—if MCAS is not critical to the safety of flight—I’m surprised that Boeing wouldn’t simply disable it temporarily, as a way of getting the fleet back in the air while they work out a permanent solution.

The Norris article also quotes Sinnett as saying: “The thing you are trying to avoid is a situation where you are pulling back and all of a sudden it gets easier, and you wind up overshooting and making the nose higher than you want it to be.” That situation, with the nose higher than you want it to be, sounds to me like an airplane that might be approaching a stall.

A story by Jack Nicas, David Gelles, and James Glanz in today’s New York Times offers a quite different account, suggesting that “handling qualities” may have motivated the first version of MCAS, but stall risks were part of the rationale for later beefing it up.

The system was initially designed to engage only in rare circumstances, namely high-speed maneuvers, in order to make the plane handle more smoothly and predictably for pilots used to flying older 737s, according to two former Boeing employees who spoke on the condition of anonymity because of the open investigations.

For those situations, MCAS was limited to moving the stabilizer—the part of the plane that changes the vertical direction of the jet—about 0.6 degrees in about 10 seconds.

It was around that design stage that the F.A.A. reviewed the initial MCAS design. The planes hadn’t yet gone through their first test flights.

After the test flights began in early 2016, Boeing pilots found that just before a stall at various speeds, the Max handled less predictably than they wanted. So they suggested using MCAS for those scenarios, too, according to one former employee with direct knowledge of the conversations.

Finally, another Aviation Week story by Guy Norris, published yesterday, gives a convincing account of what happened to the angle of attack sensor on Ethiopian Airlines Flight 302. According to Norris’s sources, the AoA vane was sheared off moments after takeoff, probably by a bird strike. This hypothesis is consistent with the traces extracted from the flight data recorder, including the strange-looking wiggles at the very end of the flight. I wonder if there’s hope of finding the lost vane, which shouldn’t be far from the end of the runway.


## Divisive factorials!

The other day I was derailed by this tweet from Fermat’s Library:

The moment I saw it, I had to stop in my tracks, grab a scratch pad, and check out the formula. The result made sense in a rough-and-ready sort of way. Since the multiplicative version of $$n!$$ goes to infinity as $$n$$ increases, the “divisive” version should go to zero. And $$\frac{n^2}{n!}$$ does exactly that; the polynomial function $$n^2$$ grows slower than the factorial function $$n!$$ (which itself outgrows any exponential) for large enough $$n$$:

$\frac{1}{1}, \frac{4}{2}, \frac{9}{6}, \frac{16}{24}, \frac{25}{120}, \frac{36}{720}, \frac{49}{5040}, \frac{64}{40320}, \frac{81}{362880}, \frac{100}{3628800}.$

But why does the quotient take the particular form $$\frac{n^2}{n!}$$? Where does the $$n^2$$ come from?

To answer that question, I had to revisit the long-ago trauma of learning to divide fractions, but I pushed through the pain. Proceeding from left to right through the formula in the tweet, we first get $$\frac{n}{n-1}$$. Then, dividing that quantity by $$n-2$$ yields

$\cfrac{\frac{n}{n-1}}{n-2} = \frac{n}{(n-1)(n-2)}.$

Continuing in the same way, we ultimately arrive at:

$n \mathbin{/} (n-1) \mathbin{/} (n-2) \mathbin{/} (n-3) \mathbin{/} \cdots \mathbin{/} 1 = \frac{n}{(n-1) (n-2) (n-3) \cdots 1} = \frac{n}{(n-1)!}$

To recover the tweet’s stated result of $$\frac{n^2}{n!}$$, just multiply numerator and denominator by $$n$$. (To my taste, however, $$\frac{n}{(n-1)!}$$ is the more perspicuous expression.)
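The left-to-right reading is easy to check mechanically. Here is a minimal sketch that folds Julia's exact rational operator `//` over the descending factors and compares the result with both closed forms. The helper name `descending_quotient` is my own label for illustration, not anything from the tweet.

```julia
# Divide left to right through n, n-1, ..., 1 using exact rationals.
# `descending_quotient` is a hypothetical helper name used only for illustration.
descending_quotient(n) = foldl(//, n:-1:1)

for n in 1:10
    q = descending_quotient(n)
    @assert q == n // factorial(n - 1)   # n / (n-1)!
    @assert q == n^2 // factorial(n)     # n^2 / n!
end

descending_quotient(5)   # 5//24
```

Because `foldl` associates to the left, this is the same order of operations as the pencil-and-paper derivation.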

I am a card-carrying factorial fanboy. You can keep your fancy Fibonaccis; this is my favorite function. Every time I try out a new programming language, my first exercise is to write a few routines for calculating factorials. Over the years I have pondered several variations on the theme, such as replacing $$\times$$ with $$+$$ in the definition (which produces triangular numbers). But I don’t think I’ve ever before considered substituting $$\mathbin{/}$$ for $$\times$$. It’s messy. Because multiplication is commutative and associative, you can define $$n!$$ simply as the product of all the integers from $$1$$ through $$n$$, without worrying about the order of the operations. With division, order can’t be ignored. In general, $$x \mathbin{/} y \ne y \mathbin{/}x$$, and $$(x \mathbin{/} y) \mathbin{/} z \ne x \mathbin{/} (y \mathbin{/} z)$$.

The Fermat’s Library tweet puts the factors in descending order: $$n, n-1, n-2, \ldots, 1$$. The most obvious alternative is the ascending sequence $$1, 2, 3, \ldots, n$$. What happens if we define the divisive factorial as $$1 \mathbin{/} 2 \mathbin{/} 3 \mathbin{/} \cdots \mathbin{/} n$$? Another visit to the schoolroom algorithm for dividing fractions yields this simple answer:

$1 \mathbin{/} 2 \mathbin{/} 3 \mathbin{/} \cdots \mathbin{/} n = \frac{1}{2 \times 3 \times 4 \times \cdots \times n} = \frac{1}{n!}.$

In other words, when we repeatedly divide while counting up from $$1$$ to $$n$$, the final quotient is the reciprocal of $$n!$$. (I wish I could put an exclamation point at the end of that sentence!) If you’re looking for a canonical answer to the question, “What do you get if you divide instead of multiplying in $$n!$$?” I would argue that $$\frac{1}{n!}$$ is a better candidate than $$\frac{n}{(n - 1)!}$$. Why not embrace the symmetry between $$n!$$ and its inverse?

Of course there are many other ways to arrange the n integers in the set $$\{1 \ldots n\}$$. How many ways? As it happens, $$n!$$ of them! Thus it would seem there are $$n!$$ distinct ways to define the divisive $$n!$$ function. However, looking at the answers for the two permutations discussed above suggests there’s a simpler pattern at work. Whatever element of the sequence happens to come first winds up in the numerator of a big fraction, and the denominator is the product of all the other elements. As a result, there are really only $$n$$ different outcomes—assuming we stick to performing the division operations from left to right. For any integer $$k$$ between $$1$$ and $$n$$, putting $$k$$ at the head of the queue creates a divisive $$n!$$ equal to $$k$$ divided by all the other factors. We can write this out as:

$\cfrac{k}{\frac{n!}{k}}, \text{ which can be rearranged as } \frac{k^2}{n!}.$

And thus we also solve the minor mystery of how $$\frac{n}{(n-1)!}$$ became $$\frac{n^2}{n!}$$ in the tweet.
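For small $$n$$ the claim that only $$n$$ outcomes are possible can be tested exhaustively. The following sketch generates every ordering of the factors with a simple recursive routine, divides left to right, and collects the distinct quotients. The names `orderings` and `outcomes` are illustrative inventions, and the approach is only practical for small $$n$$ because the number of orderings is $$n!$$.

```julia
# All orderings of a vector, generated recursively (fine for small inputs).
# `orderings` and `outcomes` are illustrative names, not library functions.
function orderings(v)
    length(v) <= 1 && return [copy(v)]
    result = Vector{typeof(v)}()
    for i in eachindex(v)
        rest = vcat(v[1:i-1], v[i+1:end])
        for tail in orderings(rest)
            push!(result, vcat(v[i], tail))
        end
    end
    return result
end

# Distinct left-to-right quotients over every ordering of 1..n.
outcomes(n) = Set(foldl(//, p) for p in orderings(collect(1:n)))

n = 5
@assert length(orderings(collect(1:n))) == factorial(n)        # 120 orderings
@assert outcomes(n) == Set(k^2 // factorial(n) for k in 1:n)   # only n results
```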

It’s worth noting that all of these functions converge to zero as $$n$$ goes to infinity. Asymptotically speaking, $$\frac{1^2}{n!}, \frac{2^2}{n!}, \ldots, \frac{n^2}{n!}$$ are all alike.

Ta dah! Mission accomplished. Problem solved. Done and dusted. Now we know everything there is to know about divisive factorials, right?

Well, maybe there’s one more question. What does the computer say? If you take your favorite factorial algorithm, and do as the tweet suggests, replacing any appearance of the $$\times$$ (or *) operator with /, what happens? Which of the $$n$$ variants of divisive $$n!$$ does the program produce?

Here’s my favorite algorithm for computing factorials, in the form of a Julia program:

function mul!(n)
    if n == 1
        return 1
    else
        return n * mul!(n - 1)
    end
end


This is the algorithm that has introduced generations of nerds to the concept of recursion. In narrative form it says: If $$n$$ is $$1$$, then $$mul!(n)$$ is $$1$$. Otherwise, evaluate the function $$mul!(n-1)$$, then multiply the result by $$n$$. You might ask what happens if $$n$$ is zero or negative. You might ask, but please don’t. For present purposes, $$n \in \mathbb{N}$$. Starting with any positive $$n$$, the sequence of recursive calls must eventually bottom out with $$n = 1$$.

The function can be written more tersely using Julia’s one-liner style of definition:

mul!(n)  =  n == 1 ? 1 : n * mul!(n - 1)

The right side of the assignment statement is a conditional expression, or ternary operator, which has the form a ? b : c. Here a is a boolean test clause, which must return a value of either true or false. If a is true, clause b is evaluated, and the result becomes the value of the entire expression. Otherwise clause c is evaluated.

Just to be sure I’ve got this right, here are the first 10 factorials, as calculated by this program:

[mul!(n) for n in 1:10]
10-element Array{Int64,1}:
1
2
6
24
120
720
5040
40320
362880
3628800

Now let’s edit that definition and convert the single occurrence of * to a /, leaving everything else (except the name of the function) unchanged.

div!(n)  =  n == 1 ? 1 : n / div!(n - 1)

And here’s what comes back when we run the program for values of $$n$$ from $$1$$ through $$20$$:

[div!(n) for n in 1:20]
20-element Array{Real,1}:
1
2.0
1.5
2.6666666666666665
1.875
3.2
2.1875
3.657142857142857
2.4609375
4.063492063492063
2.70703125
4.432900432900433
2.9326171875
4.773892773892774
3.14208984375
5.092152292152292
3.338470458984375
5.391690662278897
3.523941040039063
5.675463855030418 

Huh? That sure doesn’t look like it’s converging to zero, either as $$\frac{1}{n!}$$ or as $$\frac{n}{(n - 1)!}$$. As a matter of fact, it doesn’t look like it’s going to converge at all. The graph below suggests the sequence is made up of two alternating components, both of which appear to be slowly growing toward infinity as well as diverging from one another.

In trying to make sense of what we’re seeing here, it helps to change the output type of the div! function. Instead of applying the division operator /, which returns the quotient as a floating-point number, we can substitute the // operator, which returns an exact rational quotient, reduced to lowest terms.

div!(n)  =  n == 1 ? 1 : n // div!(n - 1)

Here’s the sequence of values for n in 1:20:

20-element Array{Real,1}:
1
2//1
3//2
8//3
15//8
16//5
35//16
128//35
315//128
256//63
693//256
1024//231
3003//1024
2048//429
6435//2048
32768//6435
109395//32768
65536//12155
230945//65536
262144//46189 

The list is full of curious patterns. It’s a double helix, with even numbers and odd numbers zigzagging in complementary strands. The even numbers are not just even; they are all powers of $$2$$. Also, they appear in pairs (first in the numerator, then in the denominator) and their sequence is nondecreasing. But there are gaps; not all powers of $$2$$ are present. The odd strand looks even more complicated, with various small prime factors flitting in and out of the numbers. (The primes have to be small: no larger than $$n$$, anyway.)

This outcome took me by surprise. I had really expected to see a much tamer sequence, like those I worked out with pencil and paper. All those jagged, jitterbuggy ups and downs made no sense. Nor did the overall trend of unbounded growth in the ratio. How could you keep dividing and dividing, and wind up with bigger and bigger numbers?

At this point you may want to pause before reading on, and try to work out your own theory of where these zigzag numbers are coming from. If you need a hint, you can get a strong one—almost a spoiler—by looking up the sequence of numerators or the sequence of denominators in the Online Encyclopedia of Integer Sequences.

Here’s another hint. A small edit to the div! program completely transforms the output. Just flip the final clause, changing n // div!(n - 1) into div!(n - 1) // n.

div!(n)  =  n == 1 ? 1 : div!(n - 1) // n

Now the results look like this:

10-element Array{Real,1}:
1
1//2
1//6
1//24
1//120
1//720
1//5040
1//40320
1//362880
1//3628800

This is the inverse factorial function we’ve already seen, the series of quotients generated when you march left to right through an ascending sequence of divisors $$1 \mathbin{/} 2 \mathbin{/} 3 \mathbin{/} \cdots \mathbin{/} n$$.

It’s no surprise that flipping the final clause in the procedure alters the outcome. After all, we know that division is not commutative or associative. What’s not so easy to see is why the sequence of quotients generated by the original program takes that weird zigzag form. What mechanism is giving rise to those paired powers of 2 and the alternation of odd and even?

I have found that it’s easier to explain what’s going on in the zigzag sequence when I describe an iterative version of the procedure, rather than the recursive one. (This is an embarrassing admission for someone who has argued that recursive definitions are easier to reason about, but there you have it.) Here’s the program:

function div!_iter(n)
    q = 1
    for i in 1:n
        q = i // q
    end
    return q
end

I submit that this looping procedure is operationally identical to the recursive function, in the sense that if div!(n) and div!_iter(n) both return a result for some positive integer n, it will always be the same result. Here’s my evidence:

[div!(n) for n in 1:20]    [div!_iter(n) for n in 1:20]
1                         1//1
2//1                       2//1
3//2                       3//2
8//3                       8//3
15//8                      15//8
16//5                      16//5
35//16                     35//16
128//35                    128//35
315//128                   315//128
256//63                    256//63
693//256                   693//256
1024//231                  1024//231
3003//1024                 3003//1024
2048//429                  2048//429
6435//2048                 6435//2048
32768//6435                32768//6435
109395//32768              109395//32768
65536//12155               65536//12155
230945//65536              230945//65536
262144//46189              262144//46189

To understand the process that gives rise to these numbers, consider the successive values of the variables $$i$$ and $$q$$ each time the loop is executed. Initially, $$i$$ and $$q$$ are both set to $$1$$; hence, after the first passage through the loop, the statement q = i // q gives $$q$$ the value $$\frac{1}{1}$$. Next time around, $$i = 2$$ and $$q = \frac{1}{1}$$, so $$q$$’s new value is $$\frac{2}{1}$$. On the third iteration, $$i = 3$$ and $$q = \frac{2}{1}$$, yielding $$\frac{i}{q} \rightarrow \frac{3}{2}$$. If this is still confusing, try thinking of $$\frac{i}{q}$$ as $$i \times \frac{1}{q}$$. The crucial observation is that on every passage through the loop, $$q$$ is inverted, becoming $$\frac{1}{q}$$.
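To watch the inversion happen, it may help to record $$q$$ after every pass instead of returning only the final value. This is a small sketch along the lines of `div!_iter`; the name `zigzag_trace` is made up for the occasion.

```julia
# Record q after each pass of the loop in div!_iter.
# `zigzag_trace` is an illustrative name, not part of the original program.
function zigzag_trace(n)
    q = 1 // 1
    trace = Rational{Int}[]
    for i in 1:n
        q = i // q          # same as i * (1 / q): q is inverted, then scaled by i
        push!(trace, q)
    end
    return trace
end

zigzag_trace(6)   # 1//1, 2//1, 3//2, 8//3, 15//8, 16//5
```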

If you unwind these operations, and look at the multiplications and divisions that go into each element of the series, a pattern emerges:

$\frac{1}{1}, \quad \frac{2}{1}, \quad \frac{1 \cdot 3}{2}, \quad \frac{2 \cdot 4}{1 \cdot 3}, \quad \frac{1 \cdot 3 \cdot 5}{2 \cdot 4}, \quad \frac{2 \cdot 4 \cdot 6}{1 \cdot 3 \cdot 5}$

The general form is:

$\frac{1 \cdot 3 \cdot 5 \cdot \cdots \cdot n}{2 \cdot 4 \cdot \cdots \cdot (n-1)} \quad (\text{odd } n) \qquad \frac{2 \cdot 4 \cdot 6 \cdot \cdots \cdot n}{1 \cdot 3 \cdot 5 \cdot \cdots \cdot (n-1)} \quad (\text{even } n).$

The functions $$1 \cdot 3 \cdot 5 \cdot \cdots \cdot n$$ for odd $$n$$ and $$2 \cdot 4 \cdot 6 \cdot \cdots \cdot n$$ for even $$n$$ have a name! They are known as double factorials, with the notation $$n!!$$. Terrible terminology, no? Better to have named them “semi-factorials.” And if I didn’t know better, I would read $$n!!$$ as “the factorial of the factorial.” The double factorial of n is defined as the product of n and all smaller positive integers of the same parity. Thus our peculiar sequence of zigzag quotients is simply $$\frac{n!!}{(n-1)!!}$$.
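The identification with double factorials can be confirmed directly for the first 20 terms. In this sketch `doublefact` is my own helper (Julia's Base library has no double factorial that I know of), it uses `BigInt` to sidestep overflow, and it adopts the usual convention that $$0!! = 1$$.

```julia
# Double factorial: product of n and all smaller positive integers of the
# same parity. `doublefact` is an illustrative helper; 0!! is taken to be 1.
function doublefact(n)
    p = big(1)
    for k in n:-2:1
        p *= k
    end
    return p
end

# The recursive definition from the text.
div!(n) = n == 1 ? 1 : n // div!(n - 1)

for n in 1:20
    @assert div!(n) == doublefact(n) // doublefact(n - 1)
end

doublefact(7), doublefact(8)   # (105, 384)
```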

A 2012 article by Henry W. Gould and Jocelyn Quaintance (behind a paywall, regrettably) surveys the applications of double factorials. They turn up more often than you might guess. In the middle of the 17th century John Wallis came up with this identity:

$\frac{\pi}{2} = \frac{2 \cdot 2 \cdot 4 \cdot 4 \cdot 6 \cdot 6 \cdots}{1 \cdot 3 \cdot 3 \cdot 5 \cdot 5 \cdot 7 \cdots} = \lim_{n \rightarrow \infty} \frac{((2n)!!)^2}{(2n + 1)!!(2n - 1)!!}$
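A numerical check of the Wallis identity is straightforward, though convergence is slow. The sketch below multiplies the terms $$\frac{(2k)^2}{(2k-1)(2k+1)}$$ for $$k = 1 \ldots n$$, which is the same quantity as the double-factorial ratio on the right-hand side; the function name `wallis` is mine.

```julia
# Partial Wallis product: prod_{k=1}^{n} (2k)^2 / ((2k-1)(2k+1)),
# which equals ((2n)!!)^2 / ((2n+1)!! * (2n-1)!!).
# `wallis` is an illustrative name.
function wallis(n)
    w = 1.0
    for k in 1:n
        w *= (2k)^2 / ((2k - 1) * (2k + 1))
    end
    return w
end

wallis(100_000)   # approaches pi/2 from below
```

Every term exceeds $$1$$, so the partial products climb steadily toward $$\frac{\pi}{2}$$ without overshooting.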

An even weirder series, involving the cube of a quotient of double factorials, sums to $$\frac{2}{\pi}$$. That one was discovered by (who else?) Srinivasa Ramanujan.

Gould and Quaintance also discuss the double factorial counterpart of binomial coefficients. The standard binomial coefficient is defined as:

$\binom{n}{k} = \frac{n!}{k! (n-k)!}.$

The double version is:

$\left(\!\binom{n}{k}\!\right) = \frac{n!!}{k!! (n-k)!!}.$

Note that our zigzag numbers fit this description and therefore qualify as double factorial binomial coefficients. Specifically, they are the numbers:

$\left(\!\binom{n}{1}\!\right) = \left(\!\binom{n}{n - 1}\!\right) = \frac{n!!}{1!! (n-1)!!}.$

The regular binomial $$\binom{n}{1}$$ is not very interesting; it is simply equal to $$n$$. But the doubled version $$\left(\!\binom{n}{1}\!\right)$$, as we’ve seen, dances a livelier jig. And, unlike the single binomial, it is not always an integer. (The only integer values are $$1$$ and $$2$$.)

Seeing the zigzag numbers as ratios of double factorials explains quite a few of their properties, starting with the alternation of evens and odds. We can also see why all the even numbers in the sequence are powers of 2. Consider the case of $$n = 6$$. The numerator of this fraction is $$2 \cdot 4 \cdot 6 = 48$$, which acquires a factor of $$3$$ from the $$6$$. But the denominator is $$1 \cdot 3 \cdot 5 = 15$$. The $$3$$s above and below cancel, leaving $$\frac{16}{5}$$. Such cancelations will happen in every case. Whenever an odd factor $$m$$ enters the even sequence, it must do so in the form $$2 \cdot m$$, but at that point $$m$$ itself must already be present in the odd sequence.
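The cancellation argument predicts that the even member of every fraction is a power of $$2$$, and that is simple to test. This sketch relies on Julia's built-in `ispow2`, `numerator`, and `denominator`; the names `zigzag` and `even_strand` are illustrative. I stop at $$n = 30$$ to stay comfortably within 64-bit integers.

```julia
# Iterative zigzag quotient, following div!_iter.
function zigzag(n)
    q = 1 // 1
    for i in 1:n
        q = i // q
    end
    return q
end

# The even strand sits in the numerator for even n, the denominator for odd n.
even_strand(n) = iseven(n) ? numerator(zigzag(n)) : denominator(zigzag(n))

@assert all(ispow2(even_strand(n)) for n in 1:30)

even_strand.(1:10)   # 1, 2, 2, 8, 8, 16, 16, 128, 128, 256
```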

Is the sequence of zigzag numbers a reasonable answer to the question, “What happens when you divide instead of multiply in $$n!$$?” Or is the computer program that generates them just a buggy algorithm? My personal judgment is that $$\frac{1}{n!}$$ is a more intuitive answer, but $$\frac{n!!}{(n - 1)!!}$$ is more interesting.

Furthermore, the mere existence of the zigzag sequence broadens our horizons. As noted above, if you insist that the division algorithm must always chug along the list of $$n$$ factors in order, at each stop dividing the number on the left by the number on the right, then there are only $$n$$ possible outcomes, and they all look much alike. But the zigzag solution suggests wilder possibilities. We can formulate the task as follows. Take the set of factors $$\{1 \dots n\}$$, select a subset, and invert all the elements of that subset; now multiply all the factors, both the inverted and the upright ones. If the inverted subset is empty, the result is the ordinary factorial $$n!$$. If all of the factors are inverted, we get the inverse $$\frac{1}{n!}$$. And if every second factor is inverted, starting with $$n - 1$$, the result is an element of the zigzag sequence.
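The subset formulation translates almost word for word into code. In this sketch the caller supplies a predicate saying which factors to invert; `signed_factorial` is a name I made up, and `BigInt` rationals keep the arithmetic exact.

```julia
# Multiply the factors 1..n, inverting those for which `inverted(k)` is true.
# `signed_factorial` is an illustrative name; BigInt avoids overflow.
signed_factorial(n, inverted) =
    prod(inverted(k) ? 1 // big(k) : big(k) // 1 for k in 1:n)

n = 6
@assert signed_factorial(n, k -> false) == factorial(big(n))        # nothing inverted: n!
@assert signed_factorial(n, k -> true)  == 1 // factorial(big(n))   # everything inverted: 1/n!
@assert signed_factorial(n, k -> isodd(n - k)) == 16 // 5           # n-1, n-3, ... inverted: zigzag
```

The predicate `k -> isodd(n - k)` picks out $$n - 1, n - 3, \ldots$$, which is the "every second factor" rule described above.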

These are only a few among the many possible choices; in total there are $$2^n$$ subsets of $$n$$ items. For example, you might invert every number that is prime or a power of a prime $$(2, 3, 4, 5, 7, 8, 9, 11, \dots)$$. For small $$n$$, the result jumps around but remains consistently less than $$1$$:

If I were to continue this plot to larger $$n$$, however, it would take off for the stratosphere. Prime powers get sparse farther out on the number line.
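Anyone who wants to explore the prime-power variant can start from a sketch like this one. The test `isprimepower` is a naive trial-division routine written for the purpose (not a library call), and `prime_power_quotient` is likewise an invented name.

```julia
# True if k = p^m for a prime p and m >= 1. Naive trial division.
function isprimepower(k)
    k < 2 && return false
    p = 2
    while k % p != 0      # the smallest divisor greater than 1 is prime
        p += 1
    end
    while k % p == 0      # strip out every copy of that prime
        k ÷= p
    end
    return k == 1
end

# Invert the prime powers among 1..n and multiply everything together.
prime_power_quotient(n) =
    prod(isprimepower(k) ? 1 // big(k) : big(k) // 1 for k in 1:n)

prime_power_quotient(10)   # 1//1008
```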

Here’s a question. We’ve seen factorial variants that go to zero as $$n$$ goes to infinity, such as $$1/n!$$. We’ve seen other variants grow without bound as $$n$$ increases, including $$n!$$ itself, and the zigzag numbers. Are there any versions of the factorial process that converge to a finite bound other than zero?

function greedy_balance(n)
    q = 1
    while n > 0
        # Divide when q is above the target of 1; otherwise multiply.
        q = q > 1 ? q / n : q * n
        n -= 1
    end
    return q
end

We loop through the integers from $$n$$ down to $$1$$, calculating the running product/quotient $$q$$ as we go. At each step, if the current value of $$q$$ is greater than $$1$$, we divide by the next factor; otherwise, we multiply. This scheme implements a kind of feedback control or target-seeking behavior. If $$q$$ gets too large, we reduce it; too small and we increase it. I conjectured that as $$n$$ goes to infinity, $$q$$ would settle into an ever-narrower range of values near $$1$$.

Running the experiment gave me another surprise:

That sawtooth wave is not quite what I expected. One minor peculiarity is that the curve is not symmetric around $$1$$; the excursions above have higher amplitude than those below. But this distortion is more visual than mathematical. Because $$q$$ is a ratio, the distance from $$1$$ to $$10$$ is the same as the distance from $$1$$ to $$\frac{1}{10}$$, but it doesn’t look that way on a linear scale. The remedy is to plot the log of the ratio:

Now the graph is symmetric, or at least approximately so, centered on $$0$$, which is the logarithm of $$1$$. But a larger mystery remains. The sawtooth waveform is very regular, with a period of $$4$$, and it shows no obvious signs of shrinking toward the expected limiting value of $$\log q = 0$$. Numerical evidence suggests that as $$n$$ goes to infinity the peaks of this curve converge on a value just above $$q = \frac{5}{3}$$, and the troughs approach a value just below $$q = \frac{3}{5}$$. (The corresponding base-$$10$$ logarithms are roughly $$\pm0.222$$.) I have not worked out why this should be so. Perhaps someone will explain it to me.

The failure of this greedy algorithm doesn’t mean we can’t find a divisive factorial that converges to $$q = 1$$. If we work with the logarithms of the factors, this procedure becomes an instance of a well-known computational problem called the number partitioning problem. You are given a set of real numbers and asked to divide it into two sets whose sums are equal, or as close to equal as possible. It’s a certifiably hard problem, but it has also been called “the easiest hard problem.” For any given $$n$$, we might find that inverting some other subset of the factors gives a better approximation to $$n! = 1$$. For small $$n$$, we can solve the problem by brute force: Just look at all $$2^n$$ subsets and pick the best one.

I have computed the optimal partitionings up to $$n = 30$$, where there are a billion possibilities to choose from.
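The brute-force search can be sketched in a few lines. This is not the program used for the computation described here, just a minimal illustration under my own assumptions: each subset is encoded as the bits of an integer `mask`, the score is the absolute value of the signed sum of logarithms, and the name `best_balance` is invented. The running time doubles with each increment of $$n$$, so it is best tried with small values.

```julia
# Try all 2^n choices of which factors to invert and keep the one whose
# product is closest to 1 on a log scale. Bit k-1 of `mask` set means
# factor k is inverted. `best_balance` is an illustrative name.
function best_balance(n)
    logs = log.(1:n)
    best_mask, best_err = 0, Inf
    for mask in 0:(2^n - 1)
        s = sum(((mask >> (k - 1)) & 1) == 1 ? -logs[k] : logs[k] for k in 1:n)
        if abs(s) < best_err
            best_mask, best_err = mask, abs(s)
        end
    end
    inverted = [k for k in 1:n if ((best_mask >> (k - 1)) & 1) == 1]
    q = prod(k in inverted ? 1 // big(k) : big(k) // 1 for k in 1:n)
    return inverted, q
end

inverted, q = best_balance(10)
```

Since inverting the complementary subset turns $$q$$ into $$\frac{1}{q}$$, the optimum always comes in mirror-image pairs; the loop simply keeps the first one it meets.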

The graph is clearly flatlining. You could use the same method to force convergence to any other value between $$0$$ and $$n!$$.

And thus we have yet another answer to the question in the tweet that launched this adventure. What happens when you divide instead of multiply in n!? Anything you want.


## A Room with a View

On my visit to Baltimore for the Joint Mathematics Meetings a couple of weeks ago, I managed to score a hotel room with a spectacular scenic view. My seventh-floor perch overlooked the Greene Street substation of the Baltimore Gas and Electric Company, just around the corner from the Camden Yards baseball stadium.

Some years ago, writing about such technological landscapes, I argued that you can understand what you’re looking at if you’re willing to invest a little effort:

At first glance, a substation is a bewildering array of hulking steel machines whose function is far from obvious. Ponderous tanklike or boxlike objects are lined up in rows. Some of them have cooling fins or fans; many have fluted porcelain insulators poking out in all directions…. If you look closer, you will find there is a logic to this mélange of equipment. You can make sense of it. The substation has inputs and outputs, and with a little study you can trace the pathways between them.

If I were writing that passage now, I would hedge or soften my claim that an electrical substation will yield its secrets to casual observation. Each morning in Baltimore I spent a few minutes peering into the Greene Street enclosure. I was able to identify all the major pieces of equipment in the open-air part of the station, and I know their basic functions. But making sense of the circuitry, finding the logic in the arrangement of devices, tracing the pathways from inputs to outputs—I have to confess, with a generous measure of chagrin, that I failed to solve the puzzle. I think I have the answers now, but finding them took more than eyeballing the hardware.

Basics first. A substation is not a generating plant. BGE does not “make” electricity here. The substation receives electric power in bulk from distant plants and repackages it for retail delivery. At Greene Street the incoming supply is at 115,000 volts (or 115 kV). The output voltage is about a tenth of that: 13.8 kV. How do I know the voltages? Not through some ingenious calculation based on the size of the insulators or the spacing between conductors. In an enlargement of one of my photos I found an identifying plate with the blurry and partially obscured but still legible notation “115/13.8 KV.”

The biggest hunks of machinery in the yard are the transformers (photo below), which do the voltage conversion. Each transformer is housed in a steel tank filled with oil, which serves as both insulator and coolant. Immersed in the oil bath are coils of wire wrapped around a massive iron core. Stacks of radiator panels, with fans mounted underneath, help cool the oil when the system is under heavy load. A bed of crushed stone under the transformer is meant to soak up any oil leaks and reduce fire hazards.

Electricity enters and leaves the transformer through the ribbed gray posts, called bushings, mounted atop the casing. A bushing is an insulator with a conducting path through the middle. It works like the rubber grommet that protects the power cord of an appliance where it passes through the steel chassis. The high-voltage inputs attach to the three tallest bushings, with red caps; the low-voltage bushings, with dark gray caps, are shorter and more closely spaced. Notice that each high-voltage input travels over a single slender wire, whereas each low-voltage output has three stout conductors. That’s because reducing the voltage to one-tenth increases the current tenfold.
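The arithmetic behind the fatter low-voltage conductors is worth a quick sketch. It assumes an ideal, lossless transformer, so that power in equals power out and the current scales inversely with the voltage. The 100-ampere input is a made-up figure for illustration, not a measurement from Greene Street, and `secondary_current` is my own name.

```julia
# Ideal transformer assumption: V_hi * I_hi = V_lo * I_lo (no losses).
# `secondary_current` is an illustrative helper.
secondary_current(i_hi, v_hi, v_lo) = i_hi * v_hi / v_lo

v_hi, v_lo = 115.0, 13.8      # kV, from the identifying plate
ratio = v_hi / v_lo           # about 8.3, so "tenfold" is a round number

secondary_current(100.0, v_hi, v_lo)   # hypothetical 100 A in gives roughly 833 A out
```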

What about the three slender gray posts just to the left of the high-voltage bushings? They are lightning arresters, shunting sudden voltage surges into the earth to protect the transformer from damage.

Perhaps the most distinctive feature of this particular substation is what’s not to be seen. There are no tall towers carrying high-voltage transmission lines to the station. Clearing a right of way for overhead lines would be difficult and destructive in an urban center, so the high-voltage “feeders” run underground. In the photo at right, near the bottom left corner, a bundle of three metal-sheathed cables emerges from the earth. Each cable, about as thick as a human forearm, has a copper or aluminum conductor running down the middle, surrounded by insulation. I suspect these cables are insulated with layers of paper impregnated with oil under pressure; some of the other feeders entering the station may be of a newer design, with solid plastic insulation. Each cable plugs into the bottom of a ceramic bushing, which carries the current to a copper wire at the top. (You can tell the wire is copper because of the green patina.)

Connecting the feeder input to the transformer is a set of three hollow aluminum conductors called bus bars, held high overhead on steel stanchions and ceramic insulators. At both ends of the bus bars are mechanical switches that open like hinged doors to break the circuit. I don’t know whether these switches can be opened when the system is under power or whether they are just used to isolate components for maintenance after a feeder has been shut down. Beyond the bus bars, and hidden behind a concrete barrier, we can glimpse the bushings atop a different kind of switch, which I’ll return to below.

At this point you might be asking, why does everything come in sets of three—the bus bars, the feeder cables, the terminals on the transformer? It’s because electric power is distributed as three-phase alternating current. Each conductor carries a voltage oscillating at 60 Hertz, with the three waves offset by one-third of a cycle. If you recorded the voltage between each of the three pairs of conductors (AB, AC, BC), you’d see a waveform like the one above at left.
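The three-phase arrangement is easy to model numerically. This sketch assumes idealized sine waves of unit amplitude (real voltages are scaled versions) and uses invented names throughout. It checks two standard properties: the three phase voltages sum to zero at every instant, and the voltage between any pair of conductors has $$\sqrt{3}$$ times the amplitude of a single phase.

```julia
# Three phase voltages of unit amplitude at 60 Hz, offset by one-third of a cycle.
# Idealized model; `phase` and the variable names are illustrative.
f = 60.0
phase(t, k) = sin(2π * f * t - k * 2π / 3)   # k = 0, 1, 2 for conductors A, B, C

ts = range(0, 1 / f; length = 601)           # sample one full cycle

# At every instant the three phases cancel.
@assert all(abs(phase(t, 0) + phase(t, 1) + phase(t, 2)) < 1e-9 for t in ts)

# Voltage between conductors A and B over the cycle.
ab = [phase(t, 0) - phase(t, 1) for t in ts]
maximum(ab)    # close to sqrt(3), about 1.732
```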

At the other end of the conducting pathway, connected to three more bus bars on the low-voltage side of the transformer, is an odd-looking stack of three large drums. These are current-limiting reactors (no connection with nuclear reactors). They are coils of thick conductors wound on a stout concrete armature. Under normal operating conditions they have little effect on the transmission of power, but in the milliseconds following a short circuit, the sudden rush of current generates a strong magnetic field in the coils, absorbing the energy of the fault current and preventing damage to other equipment.

So those are the main elements of the substation I was able to spot from my hotel window. They all made sense individually, and yet I realized over the course of a few days that I didn’t really understand how it all works together. My doubts are easiest to explain with the help of a bird’s eye view of the substation layout, cribbed from Google Maps:

My window vista was from off to the right, beyond the eastern edge of the compound. In the Google Maps view, the underground 115 kV feeders enter at the bottom or southern edge, and power flows northward through the transformers and the reactor coils, finally entering the building that occupies the northeast corner of the lot. Neither Google nor I can see inside this windowless building, but I know what’s in there, in a general way. That’s where the low-voltage (13.8 kV) distribution lines go underground and fan out to their various destinations in the neighborhood.

Let’s look more closely at the outdoor equipment. There are four high-voltage feeders, four transformers, and four sets of reactor coils. Apart from minor differences in geometry (and one newer-looking, less rusty transformer), these four parallel pathways all look alike. It’s a symmetric four-lane highway. Thus my first hypothesis was that four independent 115 kV feeders supply power to the station, presumably bringing it from larger substations and higher-voltage transmission lines outside the city.

However, something about the layout continued to bother me. If we label the four lanes of the highway from left to right, then on the high-voltage side, toward the bottom of the map view, it looks like there’s something connecting lanes 1 and 2, and there’s a similar link between lanes 3 and 4. From my hotel window the view of this device is blocked by a concrete barricade, and unfortunately the Google Maps image does not show it clearly either. (If you zoom in for a closer view, the goofy Google compression algorithm will turn the scene into a dreamscape where all the components have been draped in Saran Wrap.) Nevertheless, I’m quite sure of what I’m looking at. The device connecting the pairs of feeders is a high-voltage three-phase switch, or circuit breaker, something like the ones seen in the image at right (photographed at another substation, in Missouri). The function of this device is essentially the same as that of a circuit breaker in your home electrical panel. You can turn it off manually to shut down a circuit, but it may also “trip” automatically in response to an overload or a short circuit. The concrete barriers flanking the two high-voltage breakers at Greene Street hint at one of the problems with such switches. Interrupting a current of hundreds of amperes at more than 100,000 volts is like stopping a runaway truck: It requires absorbing a lot of energy. The switch does not always survive the experience.

When I first looked into the Greene Street substation, I was puzzled by the absence of breakers at the input end of each main circuit. I expected to see them there to protect the transformers and other components from overloads or lightning strikes. I think there are breakers on the low-voltage side, tucked in just behind the transformers and thus not clearly visible from my window. But there’s nothing on the high side. I could only guess that such protection is provided by breakers near the output of the next substation upstream, the one that sends the 115 kV feeders into Greene Street.

That leaves the question of why pairs of circuits within the substation are cross-linked by breakers. I drew a simplified diagram of how things are wired up:

Two adjacent 115 kV circuits run from bottom to top; the breaker between them connects corresponding conductors—left to left, middle to middle, right to right. But what’s the point of doing so?

I had some ideas. If one transformer were out of commission, the pathway through the breaker could allow power to be rerouted through the remaining transformer (assuming it could handle the extra load). Indeed, maybe the entire design simply reflects a high level of redundancy. There are four incoming feeders and four transformers, but perhaps only two are expected to operate at any given time. The breaker provides a means of switching between them, so that you could lose a circuit (or maybe two) and still keep all the lights on. After all, this is a substation supplying power to many large facilities—the convention center (where the math meetings were held), a major hospital, large hotels, the ball park, theaters, museums, high-rise office buildings. Reliability is important here.

After further thought, however, this scheme seemed highly implausible. There are other substation layouts that would allow any of the four feeders to power any of the four transformers, allowing much greater flexibility in handling failures and making more efficient use of all the equipment. Linking the incoming feeders in pairs made no sense.

I would love to be able to say that I solved this puzzle on my own, just by dint of analysis and deduction, but it’s not true. When I got home and began looking at the photographs, I was still baffled. The answer eventually came via Google, though it wasn’t easy to find. Before revealing where I went wrong, I’ll give a couple of hints, which might be enough for you to guess the answer.

Hint 1. I was led astray by a biased sample. I am much more familiar with substations out in the suburbs or the countryside, partly because they’re easier to see into. Most of them are surrounded by a chain-link fence rather than a brick wall. But country infrastructure differs from the urban stuff.

Hint 2. I was also fooled by geometry when I should have been thinking about topology. To understand what you’re seeing in the Greene Street compound, you have to get beyond individual components and think about how it’s all connected to the rest of the network.

The web offers marvelous resources for the student of infrastructure, but finding them can be a challenge. You might suppose that the BGE website would have a list of the company’s facilities, and maybe a basic tutorial on where Baltimore’s electricity comes from. There’s nothing of the sort (although the utility’s parent company does offer thumbnail descriptions of some of their generating plants). Baltimore City websites were a little more helpful—not that they explained any details of substation operation, but they did report various legal and regulatory filings concerned with proposals for new or updated facilities. From those reports I learned the names of several BGE installations, which I could take back to Google to use as search terms.

One avenue I pursued was figuring out where the high-voltage feeders entering Greene Street come from. I discovered a substation called Pumphrey about five miles south of the city, near BWI airport, which seemed to be a major nexus of transmission lines. In particular, four 115 kV feeders travel north from Pumphrey to a substation in the Westport neighborhood, which is about a mile south of downtown. The Pumphrey-Westport feeders are overhead lines, and I had seen them already. Their right of way parallels the light rail route I had taken into town from the airport. Beyond the Westport substation, which is next to a light rail stop of the same name, the towers disappear. An obvious hypothesis is that the four feeders dive underground at Westport and come up at Greene Street. This guess was partly correct: Power does reach Greene Street from Westport, but not exclusively.

At Westport BGE has recently built a small, gas-fired generating plant, to help meet peak demands. The substation is also near the Baltimore RESCO waste-to-energy power plant (photo above), which has become a local landmark. (It’s the only garbage burner I know that turns up on postcards sold in tourist shops.) Power from both of these sources could also make its way to the Greene Street substation, via Westport.

I finally began to make sense of the city’s wiring diagram when I stumbled upon some documents published by the PJM Interconnection, the administrator and coordinator of the power “pool” in the mid-Atlantic region. PJM stands for Pennsylvania–New Jersey–Maryland, but it covers a broader territory, including Delaware, Ohio, West Virginia, most of Virginia, and parts of Kentucky, Indiana, Michigan, and Illinois. Connecting to such a pool has important advantages for a utility. If an equipment failure means you can’t meet your customers’ demands for electricity, you can import power from elsewhere in the pool to make up the shortage; conversely, if you have excess generation, you can sell the power to another utility. The PJM supervises the market for such exchanges.

The idea behind power pooling is that neighbors can prop each other up in times of trouble; however, they can also knock each other down. As a condition of membership in the pool, utilities have to maintain various standards for engineering and reliability. PJM committees review plans for changes or additions to a utility’s network. It was a set of PowerPoint slides prepared for one such committee that first alerted me to my fundamental misconception. One of the slides included the map below, tracing the routes of 115 kV feeders (green lines) in and around downtown Baltimore.

I had been assuming—even though I should have known better—that the distribution network is essentially treelike, with lines radiating from each node to other nodes but never coming back together. For low-voltage distribution lines in sparsely settled areas, this assumption is generally correct. If you live in the suburbs or in a small town, there is one power line that runs from the local substation to your neighborhood; if a tree falls on it, you’re in the dark until the problem is fixed. There is no alternative route of supply. But that is not the topology of higher-voltage circuits. The Baltimore network consists of rings, where power can reach most nodes by following either of two pathways.

In the map we can see the four 115 kV feeders linking Pumphrey to Westport. From Westport, two lines run due north to Greene Street, then make a right turn to another station named Concord Street. (As far as I can tell, there is no Concord Street in Baltimore. There’s a Concord Road, but it’s miles away in the northwest corner of the city. The substation is actually at 750 East Pratt Street, occupying the lower floors of an 18-story office tower.) The lines continue east to Monument Street, then north again to Erdman, where the ring receives additional power from other lines coming down from the north. The ring then continues west to Center Street and returns to Westport, closing the loop. The arrangement has some clear advantages for reliability. You can break any one link in a ring without cutting power to any of the substations; the power simply flows around the ring in the other direction.
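The reliability claim is easy to check with a toy graph. What follows is a minimal sketch, assuming one ring can be treated as a simple closed loop of the six stations named above; the real network has two parallel rings plus extra feeds, and the helper names `ring_links` and `still_connected` are invented here for illustration.

```python
from collections import deque

# Stations on one ring, in the order described in the text. This is a
# simplification: it ignores the second ring and the outside feeds.
RING = ["Westport", "Greene Street", "Concord Street",
        "Monument Street", "Erdman", "Center Street"]


def ring_links(stations):
    """Links of a closed ring: each station to the next, last back to first."""
    n = len(stations)
    return [(stations[i], stations[(i + 1) % n]) for i in range(n)]


def still_connected(stations, links):
    """Breadth-first search: can every station be reached from the first one?"""
    neighbors = {s: set() for s in stations}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {stations[0]}
    queue = deque([stations[0]])
    while queue:
        here = queue.popleft()
        for nxt in neighbors[here]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(stations)


links = ring_links(RING)

# Break each link in turn; the ring should survive every single break.
ring_survives = all(
    still_connected(RING, [link for link in links if link != broken])
    for broken in links
)

# A treelike chain (the same stations without the closing link) should not.
chain = links[:-1]
chain_survives = all(
    still_connected(RING, [link for link in chain if link != broken])
    for broken in chain
)

print(ring_survives, chain_survives)
```

The ring tolerates any single break, whereas the chain is split in two by every break. That is the difference between the ring topology and the treelike picture.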

This double-ring architecture calls for a total reinterpretation of how the Greene Street substation works. I had imagined the four 115 kV inputs as four lanes of one-way traffic, all pouring into the substation and dead-ending in the four transformers. In reality we have just two roadways, both of which enter the substation and then leave again, continuing on to further destinations. And they are not one-way; they can both carry traffic in either direction. The transformers are like exit ramps that siphon off a portion of the traffic while the main stream passes by.

At Greene Street, two of the underground lines entering the compound come from Westport, but the other two proceed to Concord Street, the next station around the ring. What about the breakers that sit between the incoming and outgoing branches of each circuit? They open up the ring to isolate any section that experiences a serious failure. For example, a short circuit in one of the cables running between Greene Street and Concord Street would cause breakers at both of those stations to open up, but both stations would continue to receive power coming around the other branch of the loop.

This revised interpretation was confirmed by another document made available by PJM, this one written by BGE engineers as an account of their engineering practices for transmission lines and substations. It includes a schematic diagram of a typical downtown Baltimore substation. The diagram makes no attempt to reproduce the geometric layout of the components; it rearranges them to make the topology clearer.

The two 115 kV feeders that run through the substation are shown as horizontal lines; the solid black squares in the middle are the breakers that join the pairs of feeders and thereby close the two rings that run through all the downtown substations. The transformers are the W-shaped symbols at the ends of the branch lines.

A mystery remains. One symbol in the schematic represents a disconnect switch, a rather simple mechanical device that generally cannot be operated when the power line is under load. A second symbol is identified in the BGE document as a circuit switcher, a more elaborate device capable of interrupting a heavy current. In the Greene Street photos, however, the switches at the two ends of the high-voltage bus bars appear almost identical. I’m not seeing any circuit switchers there. But, as should be obvious by now, I’m capable of misinterpreting what I see.


## Glauber’s dynamics

Roy J. Glauber, Harvard physics professor for 65 years, longtime Keeper of the Broom at the annual Ig Nobel ceremony, and winner of a non-Ig Nobel, has died at age 93. Glauber is known for his work in quantum optics; roughly speaking, he developed a mathematical theory of the laser at about the same time that device was invented, circa 1960. His two main papers on the subject, published in Physical Review in 1963, did not meet with instant acclaim; the Nobel committee’s recognition of their worth came more than 40 years later, in 2005. A third paper from 1963, titled “Time-dependent statistics of the Ising model,” also had a delayed impact. It is the basis of a modeling algorithm now called Glauber dynamics, which is well known in the cloistered community of statistical mechanics but deserves wider recognition.

Before digging into the dynamics, however, let us pause for a few words about the man himself, drawn largely from the obituaries in the New York Times and the Harvard Crimson.

Glauber was a member of the first class to graduate from the Bronx High School of Science, in 1941. From there he went to Harvard, but left in his sophomore year, at age 18, to work in the theory division at Los Alamos, where he helped calculate the critical mass of fissile material needed for a bomb. After the war he finished his degree at Harvard and went on to complete a PhD under Julian Schwinger. After a few brief adventures in Princeton and Pasadena, he was back at Harvard in 1952 and never left. A poignant aspect of his life is mentioned briefly in a 2009 interview, where Glauber discusses the challenge of sustaining an academic career while raising two children as a single parent.

Here’s a glimpse of Glauber dynamics in action. Click the Go button, then try fiddling with the slider.

In the computer program that drives this animation, the slider controls a variable representing temperature. At high temperature (slide the control all the way to the right), you’ll see a roiling, seething mass of colored squares, switching rapidly and randomly between light and dark shades. There are no large-scale or long-lived structures. At low temperature (slide to the left), the tableau congeals into a few writhing blobs of contrasting color. Then the minority blobs are likely to evaporate, and you’ll be left with an unchanging, monochromatic panel. (Occasionally the end point is not a monochromatic field. Instead the panel is divided into broad stripes, which may be horizontal, vertical, or diagonal. This is an artifact of the finite size of the lattice and the use of wraparound boundary conditions. On an infinite lattice, the stripes would not occur.) Between these extremes there’s some interesting behavior. Adjust the slider to a temperature near 2.27 and you can expect to see persistent fluctuations at all possible scales, from isolated individual blocks to patterns that span the entire array.

What we’re looking at here is a simulation of a model of a ferromagnet—the kind of magnet that sticks to the refrigerator. The model was introduced almost 100 years ago by Wilhelm Lenz and his student Ernst Ising. They were trying to understand the thermal behavior of ferromagnetic materials such as iron. If you heat a block of magnetized iron above a certain temperature, called the Curie point, it loses all traces of magnetization. Slow cooling below the Curie point allows it to spontaneously magnetize again, perhaps with the poles in a different orientation. The onset of ferromagnetism at the Curie point is an abrupt phase transition.

Lenz and Ising created a stripped-down model of a ferromagnet. In the two-dimensional version shown here, each of the small squares represents the spin vector of an unpaired electron in an iron atom. The vector can point in either of two directions, conventionally called up and down, which for graphic convenience are represented by two contrasting colors. There are $$100 \times 100 = 10{,}000$$ spins in the array. This would be a minute sample of a real ferromagnet. On the other hand, the system has $$2^{10{,}000}$$ possible states—quite an enormous number.

The essence of ferromagnetism is that adjacent spins “prefer” to point in the same direction. To put that more formally: The energy of neighboring spins is lower when they are parallel, rather than antiparallel. For the array as a whole, the energy is minimized if all the spins point the same way, either up or down. Each spin contributes a tiny magnetic moment. When the spins are parallel, all the moments add up and the system is fully magnetized.

If energy were the only consideration, the Ising model would always settle into a magnetized configuration, but there is a countervailing influence: Heat tends to randomize the spin directions. At infinite temperature, thermal fluctuations completely overwhelm the spins’ tendency to align, and all states are equally likely. Because the vast majority of those $$2^{10{,}000}$$ configurations have nearly equal numbers of up and down spins, the magnetization is negligible. At zero temperature, nothing prevents the system from condensing into the fully magnetized state. The interval between these limits is a battleground where energy and entropy contend for supremacy. Clearly, there must be a transition of some kind. For Lenz and Ising in the 1920s, the crucial question was whether the transition comes at a sharply defined critical temperature, as it does in real ferromagnets. A more gradual progression from one regime to the other would signal the model’s failure to capture important aspects of ferromagnet physics.
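To put a rough number on “negligible,” here is a back-of-the-envelope estimate. It assumes each spin is independently up or down with equal probability, which is what equally likely states amounts to. The notation $$\sigma_i = \pm 1$$ for the $$i$$th spin and $$M$$ for the net magnetization is introduced here only for illustration.

$$M = \sum_{i=1}^{N} \sigma_i, \qquad \langle M \rangle = 0, \qquad \langle M^2 \rangle = \sum_{i=1}^{N} \langle \sigma_i^2 \rangle = N$$

The cross terms $$\langle \sigma_i \sigma_j \rangle$$ vanish for $$i \ne j$$ because the spins are independent. The typical imbalance is therefore about $$\sqrt{N} = 100$$ spins out of $$10{,}000$$, or roughly 1 percent of full magnetization, and that fraction shrinks as the lattice grows.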

In his doctoral dissertation Ising investigated the one-dimensional version of the model—a chain or ring of spins, each one holding hands with its two nearest neighbors. The result was a disappointment: He found no abrupt phase transition. And he speculated that the negative result would also hold in higher dimensions. The Ising model seemed to be dead on arrival.

It was revived a decade later by Rudolf Peierls, who gave suggestive evidence for a sharp transition in the two-dimensional lattice. Then in 1944 Lars Onsager “solved” the two-dimensional model, showing that the phase transition does exist. The phase diagram looks like this:

As the system cools, the salt-and-pepper chaos of infinite temperature evolves into a structure with larger blobs of color, but the up and down spins remain balanced on average (implying zero magnetization) down to the critical temperature $$T_C$$. At that point there is a sudden bifurcation, and the system will follow one branch or the other to full magnetization at zero temperature.

If a model is classified as solved, is there anything more to say about it? In this case, I believe the answer is yes. The solution to the two-dimensional Ising model gives us a prescription for calculating the probability of seeing any given configuration at any given temperature. That’s a major accomplishment, and yet it leaves much of the model’s behavior unspecified. The solution defines the probability distribution at equilibrium—after the system has had time to settle into a statistically stable configuration. It doesn’t tell us anything about how the lattice of spins reaches that equilibrium when it starts from an arbitrary initial state, or how the system evolves when the temperature changes rapidly.

It’s not just the solution to the model that has a few vague spots. When you look at the finer details of how spins interact, the model itself leaves much to the imagination. When a spin reacts to the influence of its nearest neighbors, and those neighbors are also reacting to one another, does everything happen all at once? Suppose two antiparallel spins both decide to flip at the same time; they will be left in a configuration that is still antiparallel. It’s hard to see how they’ll escape repeating the same dance over and over, like people who meet head-on in a corridor and keep making mirror-image evasive maneuvers. This kind of standoff can be avoided if the spins act sequentially rather than simultaneously. But if they take turns, how do they decide who goes first?

Within the intellectual traditions of physics and mathematics, these questions can be dismissed as foolish or misguided. After all, when we look at the procession of the planets orbiting the sun, or at the colliding molecules in a gas, we don’t ask who takes the first step; the bodies are all in continuous and simultaneous motion. Newton gave us a tool, calculus, for understanding such situations. If you make the steps small enough, you don’t have to worry so much about the sequence of marching orders.

However, if you want to write a computer program simulating a ferromagnet (or simulating planetary motions, for that matter), questions of sequence and synchrony cannot be swept aside. With conventional computer hardware, “let everything happen at once” is not an option. The program must consider each spin, one at a time, survey the surrounding neighborhood, apply an update rule that’s based on both the state of the neighbors and the temperature, and then decide whether or not to flip. Thus the program must choose a sequence in which to visit the lattice sites, as well as a sequence in which to visit the neighbors of each site, and those choices can make a difference in the outcome of the simulation. So can other details of implementation. Do we look at all the sites, calculate their new spin states, and then update all those that need to be flipped? Or do we update each spin as we go along, so that spins later in the sequence will see an array already modified by earlier actions? The original definition of the Ising model is silent on such matters, but the programmer must make a commitment one way or another.

This is where Glauber dynamics enters the story. Glauber presented a version of the Ising model that’s somewhat more explicit about how spins interact with one another and with the “heat bath” that represents the influence of temperature. It’s a theory of Ising dynamics because he describes the spin system not just at equilibrium but also during transitional stages. I don’t know if Glauber was the first to offer an account of Ising dynamics, but the notion was certainly not commonplace in 1963.

There’s no evidence Glauber was thinking of his method as an algorithm suitable for computer implementation. The subject of simulation doesn’t come up in his 1963 paper, where his primary aim is to find analytic expressions for the distribution of up and down spins as a function of time. (He did this only for the one-dimensional model.) Nevertheless, Glauber dynamics offers an elegant approach to programming an interactive version of the Ising model. Assume we have a lattice of $$N$$ spins. Each spin $$\sigma$$ is indexed by its coordinates $$x, y$$ and takes on one of the two values $$+1$$ and $$-1$$. Thus flipping a spin is a matter of multiplying $$\sigma$$ by $$-1$$. The algorithm for updating the lattice looks like this:

Repeat $$N$$ times:

1. Choose a spin $$\sigma_{x, y}$$ at random.
2. Sum the values of the four neighboring spins, $$S = \sigma_{x+1, y} + \sigma_{x-1, y} + \sigma_{x, y+1} + \sigma_{x, y-1}$$. The possible values of $$S$$ are $$\{-4, -2, 0, +2, +4\}$$.
3. Calculate $$\Delta E = 2 \, \sigma_{x, y} \, S$$, the change in interaction energy if $$\sigma_{x, y}$$ were to flip.
4. If $$\Delta E \lt 0$$, set $$\sigma_{x, y} = -\sigma_{x, y}$$.
5. Otherwise, set $$\sigma_{x, y} = -\sigma_{x, y}$$ with probability $$\exp(-\Delta E/T)$$, where $$T$$ is the temperature.

Display the updated lattice.

Step 4 says: If flipping a spin will reduce the overall energy of the system, flip it. Step 5 says: Even if flipping a spin raises the energy, go ahead and flip it in a randomly selected fraction of the cases. The probability of such spin flips is the Boltzmann factor $$\exp(-\Delta E/T)$$. For an energy-raising flip ($$\Delta E \gt 0$$), this quantity goes to $$0$$ as the temperature $$T$$ falls to $$0$$, so that energetically unfavorable flips are unlikely in a cold lattice. The probability approaches $$1$$ as $$T$$ goes to infinity, which is why the model is such a seething mass of fluctuations at high temperature.
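For readers who think more easily in code, the numbered steps can be transcribed almost line for line. This is a minimal Python sketch, not the program behind the animation. The function names (`make_lattice`, `sweep`, `magnetization`) are made up for illustration, the edges wrap around, and the temperature is assumed to be strictly positive so that the division in step 5 is defined.

```python
import math
import random


def make_lattice(size, rng):
    """A size-by-size grid of spins, each +1 or -1 at random."""
    return [[rng.choice((-1, 1)) for _ in range(size)] for _ in range(size)]


def sweep(lattice, temperature, rng):
    """One pass of the numbered steps: N random single-spin updates, in place."""
    size = len(lattice)
    for _ in range(size * size):
        # Step 1: choose a spin at random.
        x = rng.randrange(size)
        y = rng.randrange(size)
        spin = lattice[x][y]
        # Step 2: sum the four neighbors, wrapping around at the edges.
        s = (lattice[(x + 1) % size][y] + lattice[(x - 1) % size][y]
             + lattice[x][(y + 1) % size] + lattice[x][(y - 1) % size])
        # Step 3: energy change if this spin were to flip.
        delta_e = 2 * spin * s
        # Steps 4 and 5: always flip when the energy drops; otherwise
        # flip with probability exp(-delta_e / T).
        if delta_e < 0 or rng.random() < math.exp(-delta_e / temperature):
            lattice[x][y] = -spin


def magnetization(lattice):
    """Mean spin value, between -1 (all down) and +1 (all up)."""
    size = len(lattice)
    return sum(map(sum, lattice)) / (size * size)


# Usage: a small lattice, cooled well below the critical temperature.
rng = random.Random(1)
lattice = make_lattice(20, rng)
for _ in range(200):
    sweep(lattice, 1.0, rng)
print(magnetization(lattice))
```

Checking `delta_e < 0` first is more than a transcription of step 4. It also keeps `math.exp` from being called with a large positive argument at low temperature, where it could overflow.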

(If you’d like to take a look at real code rather than pseudocode—namely the JavaScript program running the simulation above—it’s on GitHub.)

Glauber dynamics belongs to a family of methods called Markov chain Monte Carlo algorithms (MCMC). The idea of Markov chains was an innovation in probability theory in the early years of the 20th century, extending classical probability to situations where the next event depends on the current state of the system. Monte Carlo algorithms emerged at post-war Los Alamos, not long after Glauber left there to resume his undergraduate curriculum. He clearly kept up with the work of Stanislaw Ulam and other former colleagues in the Manhattan Project.

Within the MCMC family, the distinctive feature of Glauber dynamics is choosing spins at random. The obvious alternative is to march methodically through the lattice by columns and rows, examining every spin in turn. That procedure can certainly be made to work, but it requires care in implementation. At low temperature the Ising process is very nearly deterministic, since unfavorable flips are extremely rare. When you combine a deterministic flip rule with a deterministic path through the lattice, it’s easy to get trapped in recurrent patterns. For example, a subtle bug yields the same configuration of spins on every step, shifted left by a single lattice site, so that the pattern seems to slide across the screen. Another spectacular failure gives rise to a blinking checkerboard, where every spin is surrounded by four opposite spins and flips on every time step. Avoiding these errors requires much fussy attention to algorithmic details. (My personal experience is that the first attempt is never right.)
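One way a blinking checkerboard can arise (not necessarily the bug described above) is a fully synchronous update, where every flip decision is made from the old lattice and all flips are applied together. The sketch below assumes a simplified cold-lattice rule that flips a spin only when the energy drops, an even lattice size, and wraparound edges. The function names are hypothetical.

```python
def checkerboard(size):
    """Spins alternate +1 and -1 like chessboard squares (size must be even)."""
    return [[1 if (x + y) % 2 == 0 else -1 for y in range(size)]
            for x in range(size)]


def synchronous_cold_step(lattice):
    """Decide every flip from the OLD lattice, then apply them all at once.

    Simplified cold rule (an assumption): flip only when delta_e < 0.
    """
    size = len(lattice)
    new = [row[:] for row in lattice]
    for x in range(size):
        for y in range(size):
            s = (lattice[(x + 1) % size][y] + lattice[(x - 1) % size][y]
                 + lattice[x][(y + 1) % size] + lattice[x][(y - 1) % size])
            if 2 * lattice[x][y] * s < 0:
                new[x][y] = -lattice[x][y]
    return new


board = checkerboard(6)
once = synchronous_cold_step(board)
twice = synchronous_cold_step(once)

inverted = [[-spin for spin in row] for row in board]
print(once == inverted, twice == board)  # prints: True True
```

Every spin sees four opposite neighbors, so every spin flips, and the result is the same checkerboard with the colors swapped. Two steps bring back the starting pattern, and the lattice blinks forever without lowering its energy.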

Choosing spins by throwing random darts at the lattice turns out to be less susceptible to clumsy mistakes. Yet, at first glance, the random procedure seems to have hazards of its own. In particular, choosing 10,000 spins at random from a lattice of 10,000 sites does not guarantee that every site will be visited once. On the contrary, a few sites will be sampled six or seven times, and you can expect that about 3,679 sites (that’s $$1/e \times 10{,}000$$) will not be visited at all. Doesn’t that bias distort the outcome of the simulation? No, it doesn’t. After many iterations, all the sites will get equal attention.
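The $$1/e$$ figure comes from a one-line calculation: a given site is missed by a single random pick with probability $$1 - 1/N$$, so it is missed by all $$N$$ picks with probability $$(1 - 1/N)^N$$, which is close to $$1/e$$ for large $$N$$. Here is a small Python check, with one simulated round of picks for comparison. The seed and variable names are arbitrary choices for illustration.

```python
import math
import random

N = 10_000

# Probability that a given site is missed by all N independent picks.
p_missed = (1 - 1 / N) ** N
expected_unvisited = N * p_missed

# One simulated round of N random picks, for comparison.
rng = random.Random(2024)
visits = [0] * N
for _ in range(N):
    visits[rng.randrange(N)] += 1
simulated_unvisited = visits.count(0)

print(round(expected_unvisited), round(N / math.e), simulated_unvisited)
```

The first two numbers both round to 3,679. The simulated count varies from run to run but should land within a few dozen of that value.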

The nasty bit in all Ising simulation algorithms is updating pairs of adjacent sites, where each spin is the neighbor of the other. Which one goes first, or do you try to handle them simultaneously? The column-and-row ordering maximizes exposure to this problem: Every spin is a member of such a pair. Other sequential algorithms—for example, visiting all the black squares of a checkerboard followed by all the white squares—avoid these confrontations altogether, never considering two adjacent spins in succession. Glauber dynamics is the Goldilocks solution. Pairs of adjacent spins do turn up as successive elements in the random sequence, but they are rare events. Decisions about how to handle them have no discernible influence on the outcome.

Years ago, I had several opportunities to meet Roy Glauber. Regrettably, I failed to take advantage of them. Glauber’s office at Harvard was in the Lyman Laboratory of Physics, a small isthmus building connecting two larger halls. In the 1970s I was a frequent visitor there, pestering people to write articles for Scientific American. It was fertile territory; for a few years, the magazine found more authors per square meter in Lyman Lab than anywhere else in the world. But I never knocked on Glauber’s door. Perhaps it’s just as well. I was not yet equipped to appreciate what he had to say.

Now I can let him have the last word. This is from the introduction to the paper that introduced Glauber dynamics:

If the mathematical problems of equilibrium statistical mechanics are great, they are at least relatively well-defined. The situation is quite otherwise in dealing with systems which undergo large-scale changes with time. The principles of nonequilibrium statistical mechanics remain in largest measure unformulated. While this lack persists, it may be useful to have in hand whatever precise statements can be made about the time-dependent behavior of statistical systems, however simple they may be.


## Empty nest season

I am past the stage of life when the kids go off to school in the fall, but nonetheless the house is feeling a bit desolate at this time of year. My companions of the summer have gone to earth or flown away.

Last spring a pair of robins built two-and-a-half nests on a sheltered beam just outside my office door. They raised two chicks that fledged by the end of June, and then two more in August. Both clutches of eggs were incubated in the same nest (middle photo below), which was pretty grimy by the end of the season. A second nest (upper photo) served as a hangout for the nonbrooding parent. I came to think of it as the man-cave, although I’m not at all sure about the sex of those birds. As for the half nest, I don’t know why that project was abandoned, or why it was started in the first place.

Elsewhere, a light fixture in the carport has served as a nesting platform for a phoebe each summer we’ve lived here. Is it the same bird every year? I like to think so, but if I can’t even identify a bird’s sex I have little hope of recognizing individuals. This year, after the tenant decamped, I discovered an egg that failed to hatch.

We also had house wrens in residence—noisy neighbors, constantly partying or quarreling, I can never tell the difference. It was like living next to a frat house. I have no photo of their dwelling: It fell apart in my hands.

Under the eaves above our front door we hosted several small colonies of paper wasps. All summer I watched the slow growth of these structures with their appealing symmetries and their equally interesting imperfections. (Skilled labor shortage? Experiments in noneuclidean geometry?) I waited until after the first frost to cut down the nests, thinking they were abandoned, but I discovered a dozen moribund wasps still huddling behind the largest apartment block. They were probably fertile females looking for a place to overwinter. If they survive, they’ll likely come back to the same spot next year—or so I’ve learned from Howard E. Evans, my go-to source of wasp wisdom.

Another mysterious dwelling unit clung to the side of a rafter in the carport. It was a smooth, fist-size hunk of mud with no visible entrances or exits. When I cracked it open, I found several hollow chambers, some empty, some occupied by decomposing larvae or prey. Last year in the same place we had a few delicate tubes built by mud-dauber wasps, but this one is an industrial-strength creation I can’t identify. Any ideas?

The friends I’ll miss most are not builders but squatters. All summer we have shared our back deck with a population of minifrogs—often six or eight at a time—who took up residence in tunnel-like spaces under flowerpots. In nice weather they would join us for lunch alfresco.

As of today two frogs are still hanging on, and I worry they will freeze in place. I should move the flowerpots, I think, but it seems so inhospitable.

May everyone return next year.