Unix commands for dealing with structured text

The power of the shell and of the Unix philosophy lies in providing many optimized tools that each do one thing well, leaving the power user free to tie them together in documented but also new ways.

Here is a list of commands and options that I use when dealing with medium amounts of data on the command line: not something that would require a database and indexing, but enough that working through it by hand is impractical (for example, datasets on the order of thousands of entities).

The basics, with options

head and tail are fundamental for taking a look at the output of a process without inspecting all of it:

$ cat numbers_1_1000.txt
$ cat numbers_1_1000.txt | head -n 2
$ cat numbers_1_1000.txt | tail -n 2

You can use head and tail in isolation over a file, but I show them here at the end of a pipe because that's how you would use them to check the output of a series of pipes that you have built.
wc -l instead gives you a quick overview of how many elements you are dealing with:

$ cat numbers_1_1000.txt | wc -l
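
As a self-contained sketch of the three commands above, the following generates its own input with seq rather than relying on numbers_1_1000.txt. The variable names are only for illustration.

```shell
# Produce the numbers 1..1000, one per line, instead of reading a file.
first_two=$(seq 1 1000 | head -n 2)   # first two lines of the stream
last_two=$(seq 1 1000 | tail -n 2)    # last two lines of the stream
count=$(seq 1 1000 | wc -l)           # number of lines in the stream

printf '%s\n' "$first_two" "$last_two" "$count"
```

Capturing each result in a variable is just a convenience here; in day-to-day use you would simply append head, tail or wc -l to whatever pipeline you are building.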

grep is one of the most basic tools for filtering a stream down to the lines that match a particular expression (like most of these tools, it is line-oriented). However, grep -o lets you not only select lines, but also trim each of them to just the expression you were searching for:

$ cat json.txt
$ grep -o '"ip":"[0-9.]\+"' json.txt
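
To make the effect of -o concrete, here is a minimal sketch on two made-up JSON lines (the addresses come from the ranges reserved for documentation, and the "ua" key is an assumption for the example). It uses grep -E so that + works unescaped, and adds a cut step to strip the key and keep the bare address.

```shell
# Two hypothetical JSON log lines.
json='{"ip":"192.0.2.1","ua":"Firefox"}
{"ip":"198.51.100.7","ua":"Internet Explorer"}'

# -o prints only the matched text; -E enables the unescaped + quantifier.
matches=$(printf '%s\n' "$json" | grep -oE '"ip":"[0-9.]+"')

# Split each match on double quotes and keep the fourth field, the address.
ips=$(printf '%s\n' "$matches" | cut -d '"' -f 4)

printf '%s\n' "$matches" "$ips"
```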


sort -u instead lets you reduce the output of the previous commands to its unique occurrences:

$ cat ips.txt
$ cat ips.txt | sort -u

It first sorts the data via string comparison and then removes the duplicated, now contiguous lines (sorting is a necessary step: it places identical lines next to each other, so the duplicates can be dropped in a single pass).
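
The role of sorting can be checked with a small sketch. The input is invented for the example, and plain uniq is used here only as a contrast, since it compares adjacent lines only.

```shell
# Hypothetical input in which the duplicate lines are not adjacent.
ips='192.0.2.1
198.51.100.7
192.0.2.1'

plain_uniq=$(printf '%s\n' "$ips" | uniq)      # keeps all 3 lines: no adjacent duplicates
sorted_uniq=$(printf '%s\n' "$ips" | sort -u)  # sorts first, so the duplicate is dropped

printf '%s\n' "$plain_uniq" "$sorted_uniq"
```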

If you're interested in how many duplicates there actually are in the dataset, you should instead resort to uniq -c:

$ cat ips.txt | sort | uniq -c
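
A common follow-up, sketched here on invented data, is to rank the entries by how often they occur by feeding the counts into a second, numeric sort. The exact padding that uniq -c puts before each count varies between implementations, so the sketch does not depend on it.

```shell
# Hypothetical input: one address seen 3 times, one twice, one once.
ips='192.0.2.1
198.51.100.7
192.0.2.1
203.0.113.9
192.0.2.1
198.51.100.7'

# Count each distinct line, then sort numerically on the count, largest first.
ranking=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn)

# The first line of the ranking is the most frequent entry.
top=$(printf '%s\n' "$ranking" | head -n 1)

printf '%s\n' "$ranking"
```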

CSV files

When you start representing data in a standard format such as comma-separated values, you can leverage the fact that the tools understand this format as if it were a tabular data structure, while still processing one line at a time.

The cut command is useful to select only some columns of a .csv file (or anything that is organized in columns with a fixed separator):

$ cut -d ',' -f 2 ips_ua.txt
"Internet Explorer"

In this example, I am using the same ips_ua.txt file that also appears in the join examples below.
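
One caveat worth keeping in mind: cut splits on every occurrence of the delimiter and knows nothing about CSV quoting. The sketch below, on made-up rows, selects one column and then several, and shows what happens when a quoted field itself contains a comma.

```shell
# Hypothetical rows; the second has a comma inside a quoted field.
rows='192.0.2.1,Firefox,IT
198.51.100.7,"Explorer, Internet",UK'

second=$(printf '%s\n' "$rows" | cut -d ',' -f 2)         # a single column
first_third=$(printf '%s\n' "$rows" | cut -d ',' -f 1,3)  # several columns at once

# The quoted field of the second row is split in two, so its columns shift.
printf '%s\n' "$second" "$first_third"
```

For data where quoted commas can occur, a tool that parses CSV properly is a safer choice than cut.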

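The join command instead works on two files at once, matching their rows on a common field much as an SQL join would. Both inputs must be sorted on that field, which is why each file goes through sort first. By default only the rows whose key appears in both files are printed; -v 1 or -v 2 prints instead the rows of the first or second file that have no match in the other: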

$ cat ips_ua.txt
…,Firefox
…,"Internet Explorer"
$ cat ips_country.txt
…,IT
…,UK
$ join -t , -1 1 -2 1 <(sort ips_ua.txt) <(sort ips_country.txt)
…,Firefox,IT
$ join -t , -1 1 -2 1 -v 2 <(sort ips_ua.txt) <(sort ips_country.txt)
…,UK
$ join -t , -1 1 -2 1 -v 1 <(sort ips_ua.txt) <(sort ips_country.txt)
…,"Internet Explorer"
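
The three join invocations above can be reproduced with the following sketch. The file names and addresses are invented for the example, and it writes sorted temporary files instead of using process substitution so that it also runs in a plain POSIX shell.

```shell
# Work in a throwaway directory so nothing is left behind.
dir=$(mktemp -d)

# Hypothetical data: only 192.0.2.1 appears in both files.
printf '%s\n' '192.0.2.1,Firefox' '192.0.2.2,"Internet Explorer"' > "$dir/ua.txt"
printf '%s\n' '192.0.2.1,IT' '192.0.2.3,UK' > "$dir/country.txt"

# join expects both inputs to be sorted on the join field (field 1 by default).
sort "$dir/ua.txt" > "$dir/ua.sorted"
sort "$dir/country.txt" > "$dir/country.sorted"

inner=$(join -t , "$dir/ua.sorted" "$dir/country.sorted")              # keys in both files
only_ua=$(join -t , -v 1 "$dir/ua.sorted" "$dir/country.sorted")       # keys only in file 1
only_country=$(join -t , -v 2 "$dir/ua.sorted" "$dir/country.sorted")  # keys only in file 2

printf '%s\n' "$inner" "$only_ua" "$only_country"
rm -r "$dir"
```

Taken together, the three results partition the keys much like an inner join plus the two unmatched sides of an outer join.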
