Unix commands for dealing with structured text

The power of the shell, and of the Unix philosophy, is to provide many optimized tools that each do one thing well, leaving the user free to tie them together in documented but also novel ways.

Here is a list of commands, and the options I use, when dealing with medium amounts of data through the command line: not so much that it would require a database and indexing, but enough that processing it by hand is impossible (for example, datasets on the order of thousands of entries).

The basics, with options

head and tail are fundamental for taking a look at the output of a process without inspecting all of it:

$ cat numbers_1_1000.txt
1
2
3
...
$ cat numbers_1_1000.txt | head -n 2
1
2
$ cat numbers_1_1000.txt | tail -n 2
999
1000

You can use head and tail in isolation over a file, but I show them here receiving a pipe because that's how you would use them to check the output of a series of pipes that you have built.
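
For example, here is a quick sketch of the same check done directly on the file, with no pipe involved:

$ head -n 2 numbers_1_1000.txt
1
2
$ tail -n 2 numbers_1_1000.txt
999
1000
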
wc -l instead gives you a quick overview of how many elements you are dealing with:

$ cat numbers_1_1000.txt | wc -l
1000
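
These tools compose naturally: as an illustrative sketch on the same file, you can truncate a pipeline with head and count what actually comes out of it:

$ cat numbers_1_1000.txt | head -n 10 | wc -l
10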

grep is one of the most basic tools for filtering the lines that match a particular expression out of a stream (like most of these tools, it is line-oriented). However, grep -o lets you not only select lines, but also cut them vertically to keep just the expressions you were searching for:

$ cat json.txt
{"key":"value","ip":"127.0.0.1","another":"anotherValue"}
$ grep -o '"ip":"[0-9.]\+"' json.txt
"ip":"127.0.0.1"

Duplicates

sort -u instead lets you process the output of the previous commands to keep only unique occurrences:

$ cat ips.txt
127.0.0.3
127.0.0.1
127.0.0.2
127.0.0.1
$ cat ips.txt | sort -u
127.0.0.1
127.0.0.2
127.0.0.3

It first sorts the data via string comparison, and then removes the duplicated, contiguous lines (sorting is a necessary step, since only adjacent duplicates are collapsed, and it also makes the deduplication efficient).
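
To see why the sorting step matters, here is a sketch of what happens if you pass the unsorted file straight to uniq (introduced just below): the two occurrences of 127.0.0.1 are not adjacent, so neither is removed:

$ cat ips.txt | uniq
127.0.0.3
127.0.0.1
127.0.0.2
127.0.0.1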

If you're interested in how many actual duplicates are in the dataset, you should instead resort to uniq -c:

$ cat ips.txt | sort | uniq -c
  2 127.0.0.1
  1 127.0.0.2
  1 127.0.0.3
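
A common follow-up, sketched here on the same file, is to sort those counts numerically so that the most frequent entry bubbles to the top:

$ cat ips.txt | sort | uniq -c | sort -rn | head -n 1
  2 127.0.0.1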

CSV files

When you start representing data with a standard format such as comma-separated values, you can leverage the fact that the tools can understand this format as if it were a tabular data structure, while still processing one line at a time.

The cut command is useful for selecting only some columns of a .csv file (or anything that is organized in columns with a fixed separator):

$ cut -d ',' -f 2 ips_ua.txt
Firefox
"Internet Explorer"

In this example, I am using the ips_ua.txt file whose contents are shown in full just below.
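
cut also composes with the commands from the previous sections; as a small sketch on the same file, selecting the first column gives you the list of IPs, ready to be deduplicated or counted:

$ cut -d ',' -f 1 ips_ua.txt | sort -u
127.0.0.1
127.0.0.2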

The join command, instead, is capable of processing multiple files, joining them row by row as an SQL query would:

$ cat ips_ua.txt
127.0.0.1,Firefox
127.0.0.2,"Internet Explorer"
$ cat ips_country.txt
127.0.0.1,IT
127.0.0.3,UK
$ join -t , -1 1 -2 1 <(sort ips_ua.txt) <(sort ips_country.txt)
127.0.0.1,Firefox,IT
$ join -t , -1 1 -2 1 -v 2 <(sort ips_ua.txt) <(sort ips_country.txt)
127.0.0.3,UK
$ join -t , -1 1 -2 1 -v 1 <(sort ips_ua.txt) <(sort ips_country.txt)
127.0.0.2,"Internet Explorer"
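
To close the loop on the Unix philosophy from the introduction, here is an illustrative sketch that chains several of these tools together, counting how many joined rows fall in each country:

$ join -t , -1 1 -2 1 <(sort ips_ua.txt) <(sort ips_country.txt) | cut -d ',' -f 3 | sort | uniq -c
  1 IT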
