Not Up to Norm

by Brian Hayes

Published 20 June 2009

I have a new column out in American Scientist, on “compressive sensing”:

When you take a photograph with a digital camera, the sensor behind the lens has just a few milliseconds to gather in a huge array of data. A 10-megapixel camera captures some 30 megabytes—one byte each for the red, green and blue channels in each of the 10 million pixels. Yet the image you download from the camera is often only about 3 megabytes. A compression algorithm based on the JPEG standard squeezes the file down to a tenth of its original size. This saving of storage space is welcome, but it provokes a question: Why go to the trouble of capturing 30 megabytes of data if you’re going to throw away 90 percent of it before anyone even sees the picture? Why not design the sensor to select and retain just the 3 megabytes that are worth keeping?

The answer to this question leads to some ingenious mathematics and technology—but I’ll leave all that for the column itself. Here, regrettably, I need to try to patch up a few soft spots in my telling of the story. To make the best of the situation, perhaps I can find something interesting to say about my mistakes.

Here’s the first trouble spot. I wrote:

If we have N equations in N unknowns (and if the equations are all distinct) the system is certain to have a unique solution.

The equations in question are linear, and they have coefficients restricted to the set {0,1}. In my first draft, the parenthesis read “(and if the equations are all linearly independent),” but I wanted to avoid that bit of jargon, and I persuaded myself that because of the restriction on the values of the coefficients, distinct equations would necessarily be linearly independent. That’s wrong. Take the system:

\[ \begin{split}
0x + 0y + 1z& = 2 \\
1x + 1y + 0z& = 3 \\
1x + 1y + 1z& = 5 \\
\end{split}\]

We have three distinct equations in three unknowns with all coefficients in {0,1}, but the system is not linearly independent because the third equation is the sum of the first two. And the system has infinitely many solutions (namely all those that satisfy \(x + y = 3).\)

So I’m left wondering: Is there some way I could have explained the point correctly without a long digression on linear independence, the rank of a matrix and other niceties of linear algebra? Peter Renz, who pointed out my error in the first place, suggested this phrasing: “What is needed is that each equation adds information to the whole, so that no equation can be derived from the others.” I like that way of putting it, but on the other hand it doesn’t tell you how to look at a system of equations and determine whether each one adds information. Still, maybe it’s the best we can do.

The second expository pothole reads as follows:

For each vector element \(x\), the least-squares rule calculates \(x^2\) and then sums all the results. The search for a sparsest vector can be framed in similar terms, the only change being that \(x^2\) is replaced by \(x^0\). The zeroth power of \(0\) is \(0\), but for any other value of \(x\), \(x^0\) is equal to \(1\).

A reader noted that he had been taught—and had gone on to teach others—that \(0^0\) is an “indeterminate form,” without a definite value. And indeed it’s true: \(0^0 = 1\) if we take the expession as the limit of \(x^0\) as \(x\) approaches \(0\), but \(0^0 = 0\) if we view it as the limit of \(0^x\) as \(x\) goes to \(0\) from above. Given this discontinuity, I should not have declared so glibly that \(0^0 = 0\), as if there could be no controversy about it. Instead I could have phrased it along the lines of, “In this context it’s convenient to adopt the convention that \(0^0\) is equal to \(0\).”

But, on looking at the situation a little more closely, the whole business seems really murky. Here’s a passage from Concrete Mathematics, by Graham, Knuth and Patashnik:

Some textbooks leave the quantity \(0^0\) undefined, because the functions \(x^0\) and \(0^x\) have different limiting values when \(x\) decreases to \(0\). But this is a mistake. We must define

\(x^0 = 1\) for all \(x\),

if the binomial theorem is to be valid when \(x=0, y=0\), and/or \(x=-y\). The theorem is too important to be arbitrarily restricted! By contrast, the function \(0^x\) is quite unimportant.

In spite of this authoritative pronouncement, however, workers in certain other fields adopt the opposite convention. The tangled sentences that got me into this pickle were written an attempt to explain—again without introducing some hard-core terminology—the difference between the \(L_2\) and the \(L_0\) vector norms. And it appears that the usual definition of the \(L_0\) norm just doesn’t work unless \(0^0 = 0\). But I had never bothered to think clearly about this until my reader raised the question.

Suppose we’re given the four-dimensional vector \(\mathbf{x} = (2,0,1,5)\), and we’re asked to define its length \(\|\mathbf{x}\|\). The most familiar definition of length is the Euclidean “square root of the sum of the squares” formula, which corresponds to the \(L_2\) norm:

\[\|\mathbf{x}\|_2 = (2^2 + 0^2 + 1^2 + 5^2)^\frac{1}{2} \approx 5.477.\]

The \(L_1\) norm is defined in the same way but with first powers rather than second powers:

\[\|\mathbf{x}\|_1 = (2^1 + 0^1 + 1^1 + 5^1)^1 = 8.\]

The \(L_1\) norm is really just the simple sum of the vector components. In the same way that the \(L_2\) norm measures Euclidean distance, the \(L_1\) norm implements the Manhattan, or taxicab, metric.

Following the same scheme, we can invent lots of higher norms. Here’s the length of our example vector in the \(L_5\) norm:

\[\|\mathbf{x}\|_5 = (2^5 + 0^5 + 1^5 + 5^5)^\frac{1}{5} \approx 5.011.\]

We can generalize this idea in a formula that defines the \(L_p\) norm for any positive integer \(p\):

\[\|\mathbf{x}\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^\frac{1}{p}.\]

As \(p\) increases, the \(L_p\) calculation gives greater and greater weight to the larger components of the vector. By taking the limit as \(p \to \infty\), we can even define the \(L_\infty\) norm, which is essentially a max operation—only the largest element contributes. For our example vector, \(\|(2,0,1,5)\|_\infty = 5\).

What about going in the other direction, to \(p = 0\)? In a fuzzy-headed, hand-waving way, it’s clear that what we want in this case is the sum of the zeroth-powers of the vector components. In other words, we simply want to count the components, ignoring differences in their magnitudes. Unfortunately, though, we can’t just plug \(p=0\) into the formula. That would give:

\[\|\mathbf{x}\|_0 = \left( \sum_{i=1}^n |x_i|^0 \right)^\frac{1}{0},\]

which just won’t do. The most serious problem, of course, is the \(\frac{1}{0}\) exponent, but there’s also the question of what meaning to assign to \(0^0\). As I understand it, the communities of workers who care about using the \(L_0\) norm solve these problems by introducing a special-case definition such as this:

\[\|\mathbf{x}\|_0 = \sum_{i=1}^n |x_i|^0,\]

along with the further proviso that \(0^0 = 0\). Accordingly, \(\|\mathbf{x}\|_0\) is a count of the nonzero components of the vector \(\mathbf{x}\). For our running example,

\[\|\mathbf{x}\|_0 = (2^0 + 0^0 + 1^0 + 5^0) = (1 + 0 + 1 + 1) = 3.\]

I guess this works, but if we need a special case anyway, wouldn’t it be easier just to define the \(L_0\) norm directly as the count of the nonzero elements? In formal terms this could be done using some sort of delta function instead of \(x^0\). Taking this approach, the \(0^0\) issue would never enter the case—and maybe I’d have a better chance of staying out of trouble.

Responses from readers:

A comment from Igor Carron, 21 June 2009 at 2:12 am
Brian,

You really want to deal with underdetermined systems (i.e systems where you have more unknowns than equations). In this case, you have an infinity of solutions. The point of the solvers used in compressive sensing is to find the sparsest solution of all these solutions. In the L2 sense, you are looking with the one with the smallest norm, here you are looking for the one with the smaller number of coefficients.

With regards to L_0, it really is just a semi-norm, an extension of what is seen for p >=1. The definition of l_0 is specifically that the cardinal of the set (i.e. number of elements not equal to zero). By the way, L_1 is really the sum of the absolute values of the elements in the vector.

Since trying to compute the problem using l_0 is NP-Hard, i.e. takes forever, the neat trick was to find that replacing l_0 by l_1 worked.

Hope it helps,

Cheers,

Igor.
A comment from JK, 21 June 2009 at 7:18 am
The L_p norm can be defined for any positive real p > 0.

Doesn’t defining L_0(v) = lim_{p -> 0}(L_p(v)) give what we want?
A comment from Jim Ward, 23 June 2009 at 9:10 am
If you type y = x^x into Wolfram alpha, the real part has a nice continuous graph at 0.
A comment from David S. Mazel, 23 June 2009 at 9:34 pm
Kudos to you, Brian, for writing about your experiences in trying to express mathematical concepts as clearly as possible. It takes a certain amount of humility and comfort with oneself to express the thoughts you have shown here. Bravo!

That said, I have to quote, as best I can, Albert Einstein:

“Make everything as simple as possible but not simpler.”

I think you may have simplified just a little too much. :-)

Please note: The bit-player website is no longer equipped to accept and publish comments from readers, but the author is still eager to hear from you. Send comments, criticism, compliments, or corrections to brian@bit-player.org.

Tags for this article: mathematics.

Publication history

First publication: 20 June 2009

Converted to Eleventy framework: 22 April 2025