Data Analysis with the Unix Shell

By Comsysto Gmbh · Apr. 26, 13

Currently, the Hadoop-based software company Cloudera is creating a new certification called the Data Science Essentials exam (DS-200). One goal of the certification is to teach tools, techniques, and utilities for evaluating data from the command line. That is why I am writing this blog post. The Unix shell provides a huge set of commands that can be used for data analysis. A good introduction to Unix commands can be found in this tutorial.

The analyst-friendly commands are: cat, find, grep, wc, cut, sort, and uniq.
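
For a first feel for the data, wc counts lines (here, one per city) and grep -c counts matching lines. A quick sketch, assuming the city file from the example dataset; the counts depend on the dataset version, so the output is omitted:

bz@cs ~/data $ wc -l city
bz@cs ~/data $ grep -c ",deu," city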

These commands are called filters. Data passes through a filter, and a filter can modify the data a bit on the way through. All filters read from standard input and write to standard output. A filter can use the standard output of another filter as its standard input via the pipe ("|") operator. For example, the cat command writes a file to standard output, and the grep command uses this output of cat as its standard input to check whether the city 'munich' is in the city file. The example dataset is available on GitHub.

bz@cs ~/data $ cat city | grep munich
    3070,munich [münchen],deu,bavaria,1194560

In the example above you can see the structure of the sample dataset. The dataset is a comma-separated list. The first number is the ID of an entry, followed by the name of the city, the country code, and the district; the last number is the population of the city.
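
As a quick check of that structure, cut can extract a single field; here the city name (field 2) of the first two rows:

bz@cs ~/data $ cut -d , -f 2 city | head -n 2
    kabul
    qandahar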

Now, let's answer an analytical question: which city in the dataset has the biggest population? The second and the fifth columns can be selected with the help of awk. awk creates a list where the population is in the first position and the city name is in the second position. The sort command can then be used for sorting, which makes it possible to find the city with the biggest population.

bz@cs ~/data/ $ awk -v OFS="  " -F"," '{print $5, $2}' city | sort -n -r | head -n 1
    10500000  mumbai (bombay)

It is also possible to do joins in the Unix shell with the command called join. The join command assumes that the input data is sorted on the key on which the join is going to take place. You can find another dataset on GitHub which contains countries. This dataset is a comma-separated list as well. The 14th column in the country dataset is the capital ID, which corresponds to the ID in the city dataset. This makes it possible to create a list of countries with their capitals.

bz@cs ~/data/ $ cat city | head -n 2
    1,kabul,afg,kabol,1780000
    2,qandahar,afg,qandahar,237500
bz@cs ~/data/ $ cat country | head -n 2
    afg,afghanistan,asia,southern and central asia,652090,1919,22720000,45.9,5976.00,,afganistan/afqanestan,islamic emirate,mohammad omar,1,af
    nld,netherlands,europe,western europe,41526,1581,15864000,78.3,371362.00,360478.00,nederland,constitutional monarchy,beatrix,5,nl
bz@cs ~/data/ $ join -t "," -1 1 -2 14 -o '1.2,2.2' city country | head -n 2
    kabul,afghanistan
    amsterdam,netherlands
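
Note that this works because both files happen to be ordered suitably. Since join expects its inputs sorted lexicographically on the join key, a safer variant sorts both files on their join fields first (city.sorted and country.sorted are just scratch file names used here for illustration):

bz@cs ~/data/ $ sort -t , -k 1,1 city > city.sorted
bz@cs ~/data/ $ sort -t , -k 14,14 country > country.sorted
bz@cs ~/data/ $ join -t "," -1 1 -2 14 -o '1.2,2.2' city.sorted country.sorted | head -n 2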

Finally, let's take a deeper look into the city dataset. The question for this example is: how are the cities in the dataset distributed across countries? A combination of the sort and uniq commands allows us to create data for a density plot. This data can be redirected (>) to a file.

bz@cs ~/data/ $ cat city | cut -d , -f 3 | uniq -c | sort -r | head -n 4
    363 chn
    341 ind
    274 usa
    250 bra
bz@cs ~/data/ $ cat city | cut -d , -f 3 | uniq -c | sort -r > count_vs_country
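
One caveat: uniq -c only collapses adjacent duplicates, so the pipeline above relies on the city file already being grouped by country code. If that cannot be assumed, an extra sort before uniq makes the pipeline robust, and sort -rn sorts the counts numerically:

bz@cs ~/data/ $ cut -d , -f 3 city | sort | uniq -c | sort -rn > count_vs_country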

gnuplot is a command which allows us to visualize the density data file. We have to tell gnuplot what to print and how it should be printed. You can use gnuplot during a telnet or SSH session as well, because plots can be printed in ASCII characters; for that, the terminal type has to be set to 'dumb'.

[Figure: density plot of the distribution of cities per country]
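
The exact gnuplot commands from the original screenshot are not recoverable, but a minimal sketch along these lines produces such an ASCII plot from the count_vs_country file, plotting the count column against the row index:

bz@cs ~/data/ $ gnuplot <<'EOF'
set terminal dumb
plot "count_vs_country" using 1 with lines title "cities per country"
EOF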

I hope you enjoyed this little excursion into data analysis with the Unix shell. It is useful for students who are currently working through the study guide of the Data Science Essentials (DS-200) beta. Furthermore, I demonstrated how powerful the Unix shell can be for basic analytics. The Unix shell is able to do the basic things an analyst would normally do in statistical software such as R.


Published at DZone with permission of Comsysto Gmbh, DZone MVB. See the original article here.
