
Lexicographically Sorting Large Files in Linux

When I hear the word “sort,” my first thought is usually “Hadoop”! Yes, sorting is one thing that Hadoop does well, but if you’re working with large files in Linux, the built-in sort command is often all you need.

Let’s say you have a large file on a host with 2 GB or more of main memory free. The following sort command is an efficient way to lexicographically order large files.

LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt

Let’s break this command down and examine each part in detail.
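
As a rough sketch, the parts of the command can be annotated as follows. The sample data and the scratch directory are hypothetical stand-ins for a real bigfile.txt, and the comments describe GNU sort behavior as I understand it. Two precautions are my additions, not part of the original command: clearing LC_ALL (which, when set, overrides LC_COLLATE) and creating ./tmp up front (sort needs the directory to exist once it starts writing temporary files).

```shell
# Hypothetical sample data standing in for a real bigfile.txt.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'banana\nApple\ncherry\nbanana\napple\n' > bigfile.txt

# Assumption: make sure the temporary directory exists before sorting.
mkdir -p ./tmp

# Assumption: LC_ALL takes precedence over LC_COLLATE, so clear it here.
unset LC_ALL

# LC_COLLATE=C                  compare lines byte by byte, ignoring locale rules
# --buffer-size=1G              use up to about 1 GB of memory for the sort buffer
# --temporary-directory=./tmp   put temporary (spill) files in ./tmp
# --unique                      output only one line from each set of equal lines
result=$(LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt)
printf '%s\n' "$result"
```

With byte-order comparison, uppercase letters sort before lowercase ones, so "Apple" comes out ahead of "apple", and the duplicate "banana" line appears only once.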

[Image: sort command breakdown]

Published at DZone with permission of the author, a DZone MVB.

