Comments on: If I had a hammer

By: Matt

Matt — Thu, 26 Feb 2009 00:41:24 +0000

Hi,
I’m not sure what you’re trying to do exactly, but if it involves doing it lots of times, the first tool I’d turn to is Python.

By: Dave In Tucson

Dave In Tucson — Sat, 14 Feb 2009 02:55:03 +0000

When you get to documents that large, the editing tool you want is called #!/usr/bin/perl.

D∈T

By: brian

brian — Fri, 13 Feb 2009 19:26:09 +0000

@Carl Witty: Sorry for the confusion. Yes, I started out with ~30MB files, but there were five of them to be concatenated. This is image data in ASCII format. Each pixel consists of one to five decimal digits, possibly with a minus sign, padded with spaces to fill a field of seven characters. 7200 pixels per row, 3,000 rows in the assembled file.
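
For anyone who wants to script against this layout, here is a minimal Python sketch of reading and writing one row. It assumes the padding spaces come before the digits (right-justified values), which the description above does not actually state; `parse_row` and `format_row` are hypothetical names chosen for illustration. At 7 characters per pixel, 7200 pixels come to 50,400 characters per row.

```python
FIELD_WIDTH = 7        # characters per pixel field, as described above
PIXELS_PER_ROW = 7200  # pixels per row, as described above


def parse_row(line, field_width=FIELD_WIDTH):
    """Split one row of fixed-width ASCII pixels into a list of ints."""
    line = line.rstrip("\r\n")
    if len(line) % field_width != 0:
        raise ValueError("row length is not a multiple of the field width")
    # int() ignores surrounding spaces, so the padding needs no special handling.
    return [int(line[i:i + field_width])
            for i in range(0, len(line), field_width)]


def format_row(pixels, field_width=FIELD_WIDTH):
    """Inverse of parse_row. ASSUMPTION: values are right-justified."""
    return "".join(str(p).rjust(field_width) for p in pixels)
```

Because each field has a fixed width, a pixel's position in the row can be computed directly as column × 7, with no need to scan for delimiters.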

I was trying this with Aquamacs, an OS X port of Emacs.

Version info: “GNU Emacs 22.3.2 (i386-apple-darwin9.5.0, Carbon Version 1.6.0) of 2009-01-11 on plume.sr.unh.edu - Aquamacs Distribution 1.6”

I tried several times, with results that weren’t entirely consistent. In general, I could do anything I wanted near the beginning of the file, but getting to the end of the buffer (either by scrolling or by M->) was very slow. Mode was “Text” (also tried “Text Wrap”, but that was worse).

As for http://www.gravatar.com, thanks for alerting me. I had no idea. I’ll see if I can figure out what’s up, and fix it.

By: Carl Witty

Carl Witty — Fri, 13 Feb 2009 17:28:22 +0000

Hmm… “30 megabytes”, “50K characters per line”, and “a few thousand lines” are not mutually consistent.

I just did an experiment with Emacs on a file of the described size (600 lines of 50K characters per line == 30 megabytes, assuming that “a few thousand lines” was the incorrect part), and it seemed fine to me. Typing occurs at normal speed; scrolling is occasionally a little slow, but never took more than a couple of seconds.

This is the Debian emacs package 22.2+2-5, running on x86 Linux.

My file consisted mostly of lots of ‘a’ characters; maybe the contents of your file matter? Or maybe you ended up in some major mode that was trying to do syntax highlighting, or some other “clever” functionality? My file was in Fundamental mode.

In fact, editing that big file in emacs was faster and more responsive than editing this comment. Part of the problem here seems to be that it’s doing an HTTP request to http://www.gravatar.com after every keystroke; is that intentional?

By: Jess

Jess — Fri, 13 Feb 2009 15:49:24 +0000

Oooh… 50k characters per line might disqualify the line-oriented command line tools. Is this truly binary data, or just character data that doesn’t care about line breaks? If the latter, you can probably just insert a bunch of line breaks and use that.

If it’s really binary data and line breaks are an artifact of interpreting it as character data (adding breaks would change the data in that case), then you might want to memory-map the files. See mmap() on POSIX systems, or in python. I think java has something similar as well.
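
A rough sketch of the memory-mapping idea in Python, using the standard `mmap` module. The helper name `read_slice` is made up for illustration, and the sketch assumes a non-empty file opened read-only.

```python
import mmap


def read_slice(path, offset, length):
    """Return `length` bytes starting at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # A length of 0 maps the entire file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]
```

Only the pages that are actually touched get read from disk, so pulling a few bytes out of the middle of a very large file should stay cheap.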

By: brian

brian — Fri, 13 Feb 2009 14:22:45 +0000

Thanks for all the helpful suggestions. It’s interesting that so many of the recommended solutions come from the Software Antiques Roadshow. Interesting but not entirely surprising. Programs like sed and vi come from an era when memory was a scarce resource, and so it had to be used efficiently.

About emacs: It’s what I tried first. I was able to load large files into a buffer without much fuss, but insertion, deletion and scrolling were excruciatingly slow (many minutes per character). I think the problem may be that emacs expects a file to be broken into lines of reasonable length. The files I received had only a few thousand lines, but roughly 50,000 characters per line.

By: Jess

Jess — Fri, 13 Feb 2009 03:08:58 +0000

Yeah, no matter how “powerful” the visual editor, there will be some file size that makes it puke (especially if it’s loading the whole thing in memory). In addition, no matter how great the macro system, there will be some repetition required.

That’s why everyone is recommending unix’s line-editing tools. head, tail, cat, grep, cut, paste, sort, uniq, tr, etc. are all useful piped together in particular situations. But if you want a hammer, familiarize yourself with sed. Every file will become a nail. (This sort of assumes a familiarity with regular expressions.)

Alternatively you can use your preferred scripting language, although that process typically ends up being “heavier” than tools and pipes on the command line. I’d suggest python, but perl and awk have also been popular.
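
As a sketch of the scripting-language route, the following Python function does roughly what a global sed substitution does: it streams the input one line at a time and applies a regular expression to each line. `substitute_stream` is a hypothetical helper, and note that Python's regex syntax is not identical to sed's.

```python
import re


def substitute_stream(src, dst, pattern, repl):
    """Copy text from src to dst line by line, applying a regex substitution."""
    regex = re.compile(pattern)
    for line in src:
        # Only the current line is held in memory, however large the file is.
        dst.write(regex.sub(repl, line))
```

It could be called with two open file objects, for example `substitute_stream(open("in.txt"), open("out.txt", "w"), r"\s+$", "\n")` to trim trailing whitespace; the file names here are placeholders. With 50,000-character lines, each line still fits in memory easily.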

By: David F.

David F. — Fri, 13 Feb 2009 00:00:10 +0000

I’ve had emacs baulk at opening files of this size (I seem to have had better results with vi, but haven’t tried any serious comparisons). I’m usually trying to pull something out from near the end, rather than editing it, so I tend to use tail to cut off a smaller section to open in emacs.
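
The same "grab the end of the file" trick can be sketched in Python by seeking relative to the end, which avoids reading everything that comes before. `tail_bytes` is an illustrative name, not an existing library function.

```python
import os


def tail_bytes(path, nbytes):
    """Return the last `nbytes` bytes of a file (or the whole file if shorter)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)         # jump to the end to learn the size
        size = f.tell()
        f.seek(max(size - nbytes, 0))  # back up, but never before the start
        return f.read()
```

The returned chunk may begin in the middle of a line, so the first partial line usually has to be discarded.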

I second the use of traditional unix text tools. There’s an awful lot of power in their simplicity.

By: I. J. Kennedy

I. J. Kennedy — Thu, 12 Feb 2009 16:20:46 +0000

I work with large text files and use Large Text File Viewer from swiftgear.com.
Find it here: http://www.swiftgear.com/ltfviewer/features.html

By: Derek R

Derek R — Thu, 12 Feb 2009 15:51:04 +0000

I agree with Tophe. In general, you can use the unix text utilities to process gigantic files. They’re mostly line oriented, so they don’t load the entire file into memory. Some useful apps are grep, cut, tail, head, sed, tr, cat, tac, etc.

By: Tophe

Tophe — Thu, 12 Feb 2009 12:44:10 +0000

Oops, I missed the -q on the command line. Without it, the command in my previous comment will prepend the original filename to each section in the output.

By: Tophe

Tophe — Thu, 12 Feb 2009 12:40:34 +0000

For things like removing headers from many large files, I prefer to just use the unix commands tail and cat. They perform not much worse than a file copy, which is essentially what you are doing. For what you describe, a command similar to this would work:

tail --lines=+5 *.dat > output.result

It just copies all dat files in the current directory into the output.result file, leaving out the first 4 lines of each file. (+5 means to start outputting at the fifth line.)
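
For comparison, here is a minimal Python sketch of the same operation: drop the first four lines of each input file and append the rest to a single output file. The function name `concat_without_headers` is invented for this example, and files are handled as bytes so that no particular text encoding is assumed.

```python
import itertools


def concat_without_headers(paths, out_path, header_lines=4):
    """Append each file to out_path, skipping its first `header_lines` lines."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as f:
                # islice skips the header lines, then yields the rest lazily.
                for line in itertools.islice(f, header_lines, None):
                    out.write(line)
```

Something like `concat_without_headers(sorted(glob.glob("*.dat")), "output.result")` (after `import glob`) would approximate the shell glob. Unlike tail without -q, this writes no filename banners between the sections.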

By: MCH

MCH — Thu, 12 Feb 2009 12:13:38 +0000

I’ve always used hex editors to handle big/huge files (as they typically only read the portion of the file you’re working on and not the whole thing) but that works best if you’re willing to make minor changes. I think among text editors, Emacs is by far the best choice if one is concerned about efficiency.

By: Zac

Zac — Thu, 12 Feb 2009 07:07:04 +0000

I don’t know if I’ve ever loaded a file *quite* that big, but I would expect that vim or vi would work fine … never had a problem with vim on big files so far.

By: John Cowan

John Cowan — Thu, 12 Feb 2009 05:58:07 +0000

Emacs.