
Spark SQL in 10 Steps

Use the DataFrame API with Spark SQL to filter rows in a table, join two DataFrames to create a third, and save the new DataFrame to a Hive table.


This example demonstrates how to use sqlContext.sql to create and load two tables and select rows from the tables into two DataFrames.

The following steps use the DataFrame API to filter the rows with salaries greater than 150,000 from one of the tables and show the resulting DataFrame. Then, the two DataFrames are joined to create a third DataFrame. Finally, the new DataFrame is saved to a Hive table.

  1. At the command line, copy the Hue sample_07 and sample_08 CSV files to HDFS:

     $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
     $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_08.csv /user/hdfs

     ...where HUE_HOME defaults to /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or /usr/lib/hue (package installation).

  2. Start spark-shell, which creates the sqlContext used in the remaining steps:

     $ spark-shell

  3. Create the Hive tables sample_07 and sample_08:

     scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
     scala> sqlContext.sql("CREATE TABLE sample_08 (code string,description string,total_emp int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")

  4. In Beeline, show the Hive tables:

     0: jdbc:hive2://hostname.com:> show tables;
     +------------+--+
     |  tab_name  |
     +------------+--+
     | sample_07  |
     | sample_08  |
     +------------+--+

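     If Beeline isn't handy, the same check should also work from within spark-shell (a sketch; the output formatting differs from Beeline's):

     scala> sqlContext.sql("SHOW TABLES").show()
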
  5. Load the data in the CSV files into the tables:

     scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE sample_07")
     scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_08.csv' OVERWRITE INTO TABLE sample_08")

  6. Create DataFrames containing the contents of the sample_07 and sample_08 tables:

     scala> val df_07 = sqlContext.sql("SELECT * from sample_07")
     scala> val df_08 = sqlContext.sql("SELECT * from sample_08")
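
     Optionally, confirm that the DataFrames picked up the schema defined in step 3. Given the CREATE TABLE statements above, the output should look something like:

     scala> df_07.printSchema()
     root
      |-- code: string (nullable = true)
      |-- description: string (nullable = true)
      |-- total_emp: integer (nullable = true)
      |-- salary: integer (nullable = true)
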
  7. Show all rows in df_07 with salary greater than 150,000:

     scala> df_07.filter(df_07("salary") > 150000).show()

     The output should be:

     +-------+--------------------+---------+------+
     |   code|         description|total_emp|salary|
     +-------+--------------------+---------+------+
     |11-1011|    Chief executives|   299160|151370|
     |29-1022|Oral and maxillof...|     5040|178440|
     |29-1023|       Orthodontists|     5350|185340|
     |29-1024|     Prosthodontists|      380|169360|
     |29-1061|   Anesthesiologists|    31030|192780|
     |29-1062|Family and genera...|   113250|153640|
     |29-1063| Internists, general|    46260|167270|
     |29-1064|Obstetricians and...|    21340|183600|
     |29-1067|            Surgeons|    50260|191410|
     |29-1069|Physicians and su...|   237400|155150|
     +-------+--------------------+---------+------+

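     The same filter can also be expressed in plain SQL instead of the DataFrame API; this sketch should return the same rows:

     scala> sqlContext.sql("SELECT * FROM sample_07 WHERE salary > 150000").show()
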
  8. Create the DataFrame df_09 by joining df_07 and df_08, retaining only the code and description columns:

     scala> val df_09 = df_07.join(df_08, df_07("code") === df_08("code")).select(df_07.col("code"), df_07.col("description"))
     scala> df_09.show()

     The new DataFrame looks like:

     +-------+--------------------+
     |   code|         description|
     +-------+--------------------+
     |00-0000|     All Occupations|
     |11-0000|Management occupa...|
     |11-1011|    Chief executives|
     |11-1021|General and opera...|
     |11-1031|         Legislators|
     |11-2011|Advertising and p...|
     |11-2021|  Marketing managers|
     |11-2022|      Sales managers|
     |11-2031|Public relations ...|
     |11-3011|Administrative se...|
     |11-3021|Computer and info...|
     |11-3031|  Financial managers|
     |11-3041|Compensation and ...|
     |11-3042|Training and deve...|
     |11-3049|Human resources m...|
     |11-3051|Industrial produc...|
     |11-3061| Purchasing managers|
     |11-3071|Transportation, s...|
     |11-9011|Farm, ranch, and ...|
     +-------+--------------------+

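     Note that join() defaults to an inner join, and the select() keeps only df_07's columns, which avoids carrying two identically named code columns into the result. If you instead wanted to compare the salary columns from both tables side by side, a sketch like the following should work (df_compare and the salary_07/salary_08 aliases are illustrative names, not part of the original example):

     scala> val df_compare = df_07.join(df_08, df_07("code") === df_08("code")).select(df_07("code"), df_07("salary").alias("salary_07"), df_08("salary").alias("salary_08"))
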
  9. Save DataFrame df_09 as the Hive table sample_09:

     scala> df_09.write.saveAsTable("sample_09")

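     With the default save mode, saveAsTable fails if the table already exists, so re-running this step errors out on sample_09. One way to make the step re-runnable (a sketch) is to overwrite explicitly:

     scala> df_09.write.mode("overwrite").saveAsTable("sample_09")
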
  10. In Beeline, show the Hive tables again; sample_09 now appears:

      0: jdbc:hive2://hostname.com:> show tables;
      +------------+--+
      |  tab_name  |
      +------------+--+
      | sample_07  |
      | sample_08  |
      | sample_09  |
      +------------+--+

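      As a final sanity check, the saved table can be read back from spark-shell (a sketch; it should show the same code and description columns as df_09):

      scala> sqlContext.table("sample_09").show(5)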

And you're done! You now know how to work with Hive tables and DataFrames using Spark SQL in just 10 steps.


Topics: spark sql, tutorial, dataframe, hive, big data
