When I hear the word “sort” my first thought is usually “Hadoop”! Yes, sorting is one thing that Hadoop does well, but if you’re working with large files in Linux, the built-in sort command is often all you need.
Let’s say you have a large file on a host with 2GB or more of main memory free. The following sort command is an efficient way to put the lines of a large file in lexicographic order.
LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt
Let’s break this command down and examine each part in detail.
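As a rough sketch of what each part does, the script below runs the same command on a tiny sample file. The scratch directory and the sample contents are made up for illustration, and the long option names assume GNU sort. Two details are worth noting: sort will not create the temporary directory for you, and an exported LC_ALL takes precedence over LC_COLLATE, so the script clears it first.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sort does not create the temporary directory itself, so make it first.
mkdir -p ./tmp

# Mixed-case sample with a duplicated line ("apple" appears twice).
printf '%s\n' banana Apple cherry apple Banana apple > bigfile.txt

# LC_ALL, if set, overrides LC_COLLATE, so clear it for this demo.
unset LC_ALL

# LC_COLLATE=C                  compare lines byte by byte, skipping locale collation rules
# --buffer-size=1G              size of the main-memory buffer sort may use
# --temporary-directory=./tmp   where sort writes intermediate files if the buffer fills
# --unique                      keep only the first line of each run of equal lines
sorted=$(LC_COLLATE=C sort --buffer-size=1G --temporary-directory=./tmp --unique bigfile.txt)

printf '%s\n' "$sorted"
```

With byte-wise comparison, every uppercase ASCII letter sorts before every lowercase one, so the capitalized lines come out first and the duplicate "apple" is dropped. On a file this small the temporary directory is never used; it only matters once the input outgrows the buffer.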
Published at DZone with permission of
, DZone MVB