Munging CSV Files With Standard UNIX Tools
Munging CSV Files With Standard UNIX Tools
Why CSV? And why dusty, old tools?
Join the DZone community and get the full member experience.Join For Free
This post briefly discusses working with CSV (comma-separated value) files using command-line tools that are usually available on any Unix-like system. This will raise two objections: Why CSV, and why dusty, old tools?
You may also like: 10 Tips on Working Fast With UNIX or Linux
In theory, and occasionally in practice, CSV can be a mess. But CSV is the de-facto standard format for exchanging data. Some people like this, some lament this, but that's the way it is.
A minor variation on comma-separated values is tab-separated values .
Why Standard Utilities?
Why use standard Unix utilities? I'll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don't want to use something else.
Maybe you already know the standard utilities and don't think that learning more specialized tools is worth the effort?
Maybe you're already at the command line and in a command-line state of mind, and don't want to interrupt your workflow by doing something else?
Maybe you're on a computer where you don't have the ability to install any software, and so, you need to work with what's there?
Whatever your reasons, we'll go with the assumption that we're committed to using commands that have been around for decades.
cut, sort, and awk
The tools I want to look at are
awk. I wrote about cut the other day, and apparently, the post struck a chord with some readers. This post is a follow-up to that one.
These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of
sort will be named
gsort in order to not conflict with the native Windows
sort function. There's no need to rename the other two utilities since they don't have counterparts that ship with Windows.
sort command is simple and useful. There are just a few options you'll need to know about. The utility sorts the fields as text by default, but the
-n tells it to sort numerically.
Since we're talking about CSV files, you'll need to know that
-t, is the option to tell
sort that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the
The last utility,
awk, is more than a utility. It's a small programming language. But it works so well from the command line that you can almost think of it as a command-line utility. It's very common to pipe output to an awk program that's only a few characters long.
You can get started quickly with
awk by reading Greg Grothous' article: Why you should learn just a little awk.
Now for the bad news: These programs are inconsistent in their options. The two most common things you'll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab. Unfortunately, this is done differently in every utility.
--delimiter to specify the field delimiter and
--fields to specify fields. Makes sense.
--field-separator to specify the field delimiter and
--key to specify the field. When you're talking about sorting things, it's common to call the fields keys, and so, the way
sort specifies fields makes sense in context. I see no reason for
-t other than
-f was already taken. (In sorting, you talk about folding upper case to lower case, so
-f stands for fold.)
--field-separator to specify the field delimiter. At least the verbose option is consistent with
-F for the short option instead of
-f? The latter was already taken for file. To tell
awk to read a program from a file rather than the command line, you use the
awk handles fields differently than
sort. Because it is a programming language designed to parse delimited text files, each field has a built-in variable:
$1 holds the content of the first field,
$2 the second, etc.
The following compact table summarizes how you tell each utility that you're working with comma-separated files and that you're interested in the second field.
|------+-----+-----| | cut | -d, | -f2 | | sort | -t, | -k2 | | awk | -F, | $2 | |------+-----+-----|
Some will object that the inconsistencies documented above are a good example of why you shouldn't work with CSV files using
awk. You could use other command-line utilities designed for working with CSV files, pull your CSV file into R or Pandas, or import it somewhere to work with it in SQL, etc.
The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.
 Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.
Published at DZone with permission of John Cook , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.