If I had a hammer

Since I’m a bit-player rather than a bit-worker, I generally stick to toy-sized problems. However, in recent weeks I’ve been fooling around with multimegabyte elevation maps, and I’ve had trouble scaling up. What I’ve found most challenging is not writing programs to digest lots of megabytes; instead, it’s the trivial-seeming data-preparation tasks that gave me fits.

At one point I had five plain text files of about 30 megabytes each that I needed to edit in minor ways (e.g., removing headers) and then concatenate. What is the right tool for that chore? I tried opening the files in various text editors, but the programs weren’t up to the job. Some quit when I attempted to load a big file. Others opened the file but then quit when I tried to scroll through the text. Some programs didn’t quit, but they became so lethargic and unresponsive that eventually I quit.

Note that I’m talking about big files but not huge ones. At most they run to a few hundred megabytes, a volume that ought to fit in memory on a machine with a few gigabytes of RAM.

Surely this is a common problem. Is there some obvious answer that everyone but I has always known?

Eventually I did manage to finish what needed doing. I discovered that a very modest Macintosh hex editor called 0xED—meant for editing binary files more than text files—would do the trick instantly. 0xED opened the largest files, let me scroll through and make changes, and saved the new version back to disk—all without fuss. But I still have the feeling that I’m pounding nails with a monkey wrench, and I’d like to know if there’s a hammer designed just for this task.

This entry was posted in computing.

15 Responses to If I had a hammer

  1. Zac says:

    I don’t know if I’ve ever loaded a file *quite* that big, but I would expect that vim or vi would work fine … never had a problem with vim on big files so far.

  2. MCH says:

    I’ve always used hex editors to handle big/huge files (as they typically only read the portion of the file you’re working on and not the whole thing) but that works best if you’re willing to make minor changes. I think among text editors, Emacs is by far the best choice if one is concerned about efficiency.

  3. Tophe says:

    For things like removing headers from many large files, I prefer to just use the unix commands tail and cat. They perform not much worse than a file copy, which is essentially what you are doing.

    For what you describe, a command similar to this would work:

    tail --lines=+5 *.dat > output.result

    It just copies all dat files in the current directory into the output.result file, leaving out the first 4 lines of each file. (+5 means to start outputting at the fifth line.)

  4. Tophe says:

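    Oops, I missed the -q on the command line. Without it, the command above will prepend a header containing the original filename to each section in the output. The corrected command is:

    tail -q --lines=+5 *.dat > output.result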
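
    For readers without GNU tail, the same header-stripping concatenation can be sketched in Python. This is only an illustration: the function name concat_without_headers and the default of 4 header lines are assumptions taken from the example above, not from the post's actual files.

```python
import shutil
from itertools import islice


def concat_without_headers(sources, dest, header_lines=4):
    """Append each source file to dest, skipping its first header_lines lines.

    header_lines=4 mirrors the tail example above; it is an assumed value.
    """
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                # Consume the header lines; islice simply stops early
                # if the file has fewer lines than header_lines.
                for _ in islice(f, header_lines):
                    pass
                # Stream the rest in fixed-size chunks, so memory use does
                # not depend on the file size or on the line length.
                shutil.copyfileobj(f, out)
```

    Calling concat_without_headers(sorted(glob.glob("*.dat")), "output.result") would roughly correspond to the tail command, with the file order made explicit.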

  5. Derek R says:

    I agree with Tophe. In general, you can use the unix text utilities to process gigantic files. They’re mostly line oriented, so they don’t load the entire file into memory. Some useful apps are grep, cut, tail, head, sed, tr, cat, tac, etc.

  6. I. J. Kennedy says:

    I work with large text files and use Large Text File Viewer from swiftgear.com.
    Find it here: http://www.swiftgear.com/ltfviewer/features.html

  7. David F. says:

    I’ve had emacs baulk at opening files of this size (I seem to have had better results with vi, but haven’t tried any serious comparisons). I’m usually trying to pull something out from near the end, rather than editing it, so I tend to use tail to cut off a smaller section to open in emacs.

    I second the use of traditional unix text tools. There’s an awful lot of power in their simplicity.

  8. Jess says:

    Yeah, no matter how “powerful” the visual editor, there will be some file size that makes it puke (especially if it’s loading the whole thing in memory). In addition, no matter how great the macro system, there will be some repetition required.

    That’s why everyone is recommending unix’s line-editing tools. head, tail, cat, grep, cut, paste, sort, uniq, tr, etc. are all useful piped together in particular situations. But if you want a hammer, familiarize yourself with sed. Every file will become a nail. (This sort of assumes a familiarity with regular expressions.)

    Alternatively you can use your preferred scripting language, although that process typically ends up being “heavier” than tools and pipes on the command line. I’d suggest python, but perl and awk have also been popular.

  9. brian says:

    Thanks for all the helpful suggestions. It’s interesting that so many of the recommended solutions come from the Software Antiques Roadshow. Interesting but not entirely surprising. Programs like sed and vi come from an era when memory was a scarce resource, and so it had to be used efficiently.

    About emacs: It’s what I tried first. I was able to load large files into a buffer without much fuss, but insertion, deletion and scrolling were excruciatingly slow (many minutes per character). I think the problem may be that emacs expects a file to be broken into lines of reasonable length. The files I received had only a few thousand lines, but roughly 50,000 characters per line.

  10. Jess says:

    Oooh… 50k characters per line might disqualify the line-oriented command line tools. Is this truly binary data, or just character data that doesn’t care about line breaks? If the latter, you can probably just insert a bunch of line breaks and use that.

    If it’s really binary data and line breaks are an artifact of interpreting it as character data (adding breaks would change the data in that case), then you might want to memory-map the files. See mmap() on POSIX systems, or in python. I think java has something similar as well.
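
    A minimal sketch of the memory-mapping idea, using Python's standard mmap module. The helper name last_line is hypothetical, and the task (fetching the final line) is chosen only as an example of reaching the end of a big file without loading all of it.

```python
import mmap


def last_line(path):
    """Return the final line of a file as bytes, without its newline.

    The file is memory-mapped, so only the pages actually touched are
    read from disk. Note that mmap raises ValueError for an empty file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            # Ignore a single trailing newline, if there is one.
            if mm[end - 1:end] == b"\n":
                end -= 1
            # rfind returns -1 when no earlier newline exists, which
            # makes start 0, i.e. the file is a single line.
            start = mm.rfind(b"\n", 0, end) + 1
            return mm[start:end]
```

    The search runs backward from the end of the mapping, so the cost depends on the length of the last line rather than on the size of the whole file.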

  11. Carl Witty says:

    Hmm… “30 megabytes”, “50K characters per line”, and “a few thousand lines” are not mutually consistent.

    I just did an experiment with Emacs on a file of the described size (600 lines of 50K characters per line == 30 megabytes, assuming that “a few thousand lines” was the incorrect part), and it seemed fine to me. Typing occurs at normal speed; scrolling is occasionally a little slow, but never took more than a couple of seconds.

    This is the Debian emacs package 22.2+2-5, running on x86 Linux.

    My file consisted mostly of lots of ‘a’ characters; maybe the contents of your file matter? Or maybe you ended up in some major mode that was trying to do syntax highlighting, or some other “clever” functionality? My file was in Fundamental mode.

    In fact, editing that big file in emacs was faster and more responsive than editing this comment. Part of the problem here seems to be that it’s doing an HTTP request to http://www.gravatar.com after every keystroke; is that intentional?

  12. brian says:

    @Carl Witty: Sorry for the confusion. Yes, I started out with ~30MB files, but there were five of them to be concatenated. This is image data in ASCII format. Each pixel consists of one to five decimal digits, possibly with a minus sign, padded with spaces to fill a field of seven characters. 7200 pixels per row, 3,000 rows in the assembled file.
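
    The layout described above can be checked with a little arithmetic and a short Python sketch. The constant and function names are illustrative. The sketch assumes one newline byte per row and that format_row should right-align values (the description says only "padded with spaces"); parse_row itself works with padding on either side.

```python
FIELD_WIDTH = 7        # characters per pixel field, as described above
PIXELS_PER_ROW = 7200
ROWS = 3000

# 7 * 7200 = 50,400 characters per row, matching "roughly 50,000".
chars_per_row = FIELD_WIDTH * PIXELS_PER_ROW
# About 151 MB in total, assuming a single newline byte per row.
total_bytes = ROWS * (chars_per_row + 1)


def parse_row(line, width=FIELD_WIDTH):
    """Split one fixed-width row into a list of ints."""
    line = line.rstrip("\r\n")
    # int() ignores the surrounding space padding and accepts a minus sign.
    return [int(line[i:i + width]) for i in range(0, len(line), width)]


def format_row(values, width=FIELD_WIDTH):
    """Write ints back as right-aligned fixed-width fields (an assumption)."""
    return "".join(f"{v:>{width}d}" for v in values)
```

    The total of roughly 151 MB agrees with five files of about 30 MB each.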

    I was trying this with Aquamacs, an OS X port of Emacs.

    Version info: “GNU Emacs 22.3.2 (i386-apple-darwin9.5.0, Carbon Version 1.6.0) of 2009-01-11 on plume.sr.unh.edu – Aquamacs Distribution 1.6”

    I tried several times, with results that weren’t entirely consistent. In general, I could do anything I wanted near the beginning of the file, but getting to the end of the buffer (either by scrolling or by M->) was very slow. Mode was “Text” (also tried “Text Wrap”, but that was worse).

    As for http://www.gravatar.com, thanks for alerting me. I had no idea. I’ll see if I can figure out what’s up, and fix it.

  13. Dave In Tucson says:

    When you get to documents that large, the editing tool you want is called #!/usr/bin/perl.

    D∈T

  14. Matt says:

    Hi,
    I’m not sure what you’re trying to do exactly, but if it involves doing it lots of times, the first tool I’d turn to is Python.