bit-player | An amateur's outlook on computation and mathematics

four images from my personal files, all with filenames of IMG_1134

Lots of digital cameras come with a default file-naming scheme of IMG_nnnn, where nnnn is a four-digit number assigned sequentially, starting with 0001. The four images above are photos I made with four different cameras over a span of more than a decade, but they all have the same filename: IMG_1134.JPG. As you might guess, name collisions cause occasional trouble when I decide to merge folders. I could easily fix the problem at the source—all of the cameras allow for an override of the default naming scheme—but I’ve never bothered. And I’m not the only one.

The other day it occurred to me that the prevalence of IMG_nnnn filenames offers a way to “randomly” select a sample of images from public archives such as Flickr. Here’s a collage of results produced by searching Google Images with the query string “IMG_1134″:

IMG 1134 Google images collage

Below is a similar collage generated by running the same search query on Flickr:

IMG 1134 Flickr collage

And finally I offer a specimen of what Bing Images delivers for “IMG_1134″:

a screen grab from Bing Images search results for the query IMG_1134

In what sense are these selections random? The rationale is this: If we choose the 1,134th picture taken with many cameras by many photographers, I wouldn’t expect much correlation in the subject matter of those images. And indeed the samples above do seem to cover a pretty wide range.

On the other hand, I wouldn’t try to argue that the method produces a totally unbiased sample. The most serious problem is that the list of images returned by a search engine is not itself randomized. Google and Bing rank images according to the estimated prominence of the web pages they appear on. Flickr offers three sorting orders—”relevant,” “recent” and “interesting”—but does not have a “random” option.

In the samples above I dealt with this issue in various ad hoc ways. For the Google Images sample I made the selections based on computer-generated pseudorandom numbers. In the case of Flickr I took every seventh image. On Bing I scrolled down past the first few thousand thumbnail images and then copied an arbitrary rectangular region of the screen. These are feeble attempts at randomization; no doubt the selections still favor higher-ranked or more popular images.

Another source of bias is that we’re only seeing images that people chose to save and to share. The pictures you take with your thumb on the lens are likely to be deleted; unflattering portraits of your spouse will not be uploaded.

Still another bias is subtler and more interesting: Not all nnnn‘s are the same. A camera that gets used for a few weeks when it’s fresh out of the box and then gathers dust in the closet over subsequent months and years will never reach the IMG_9000s, and maybe not even the IMG_1000s. To test this hypothesis I ran a search on Google Images for each string of the form “IMG_nn00,” from “IMG_0100″ through “IMG_9900.” For each search string I recorded the number of hits reported by Google:

Image count by nnnn

Over much of the range, the number of images decays in a way that looks vaguely logarithmic—but then something weird happens at about nnnn = 6000. I suspect that the weirdness comes from a glitch in Google’s classification and counting algorithms rather than from the habits of the worldwide community of photographers. To test this idea I reran the same script on Flickr, with these results:

Flickr count by nnnn

The scale is different but the shape of the distribution is very similar—without any spiky weirdness at the high end. A logarithmic curve fits quite well.

As the number of IMG_nnnn images falls off with increasing nnnn, the nature of the images may also change. In particular, one might suppose that photographers who practice their craft enough to make several thousand images would produce more accomplished and more interesting results. In other words, by looking at the higher-numbered images, we may weed out the dilettantes.

I grew curious about IMG_0001. What kinds of pictures do people take when they first unpack a new camera, charge up the battery, and begin clicking the shutter? I expected a lot of self-conscious and self-referential images like these:

IMG_0001 samples: new camera taking its own picture

Obviously I found a few, but I had to sift through several hundred Flickr IMG_0001 thumbnails to come up with these four. (There were also a few pictures of camera boxes and packaging.) To my surprise, the vast majority of IMG_0001 photos did not look at all like test images made with a new toy. Not unless people unpack and test their new cameras inside the Hagia Sophia or on the south rim of the Grand Canyon or at a basketball game.

So what’s going on with the 0001s? Perhaps through some metadata mixup, multiple images are all getting labeled “IMG_0001,” even though that’s not the filename coming out of the camera. But if that’s the case, “IMG_0001″ ought to be an outlier, with substantially more exemplars than neighboring file names IMG_0002, IMG_0003, etc. The data indicate otherwise:

counts of images from IMG_0000 through IMG_0100 on Flickr

An alternative explanation is that photographers frequently reset the camera’s file counter to 0001. Here’s still another possibility: Maybe some of those IMG_0001 pictures are not the camera’s first image but its 10,000th. I’ve never gone all the way around with a camera, so I don’t know exactly what happens when the counter turns over. Is IMG_9999 followed by IMG_0000? (I note that a Flickr search for “IMG_0000″ returns 2,430 results. It’s the tiny leftmost lollipop in the graph above.)

• • •

Even if searching for IMG_nnnn offers only a crude level of randomization, I think it may still hold some promise of showing us what average or typical photographs look like. How many are portraits and how many are landscapes? Taken indoors or out? What fraction of all photos show children? Pets? What are the proportions of men and women?

Based on a casual perusal of a few thousand photos, I was led to make some tentative observations:

The genre of family snapshots—children’s birthday parties, vacation trips to Disney World—is not nearly as prominent in these collections as I would have expected. I guess those pictures are all on Facebook.
Food, on the other hand, is a much more popular subject than I ever would have guessed. In a sample of 100 Bing images, 21 showed comestibles of some kind. Is it really true that a fifth of all the world’s pixels are being used to show what we ate for lunch?
Vehicles are not quite as everpresent as meals, but there sure are a lot of cars, bikes, trains, trams and aircraft to be seen. It seems a lot of people want to show off their ride.

Addendum: Here’s the data behind the three graphs above.