Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Spark SQL in 10 Steps

DZone's Guide to

Spark SQL in 10 Steps

Use the DataFrame API with Spark SQL to filter rows in a table, join two DataFrames to a third DataFrame, and save the new DataFrame to a Hive table.

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

This example demonstrates how to use sqlContext.sql to create and load two tables and select rows from the tables into two DataFrames.

The following steps use the DataFrame API to filter the rows for salaries greater than 150,000 from one of the tables and shows the resulting DataFrame. Then, the two DataFrames are joined to create a third DataFrame. Finally, the new DataFrame is saved to a Hive table.

  1. At the command line, copy the Hue sample_07 and sample_08 CSV files to HDFS.

  2. $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_07.csv /user/hdfs
    $ hdfs dfs -put HUE_HOME/apps/beeswax/data/sample_08.csv /user/hdfs

    ...where HUE_HOME defaults to /opt/cloudera/parcels/CDH/lib/hue (parcel installation) or to /usr/lib/hue (package installation).

  3. Start spark-shell.

  4. $ spark-shell
  5. Create Hive tables sample_07 and sample_08.

  6. scala> sqlContext.sql("CREATE TABLE sample_07 (code string,description string,total_emp
     int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
    scala> sqlContext.sql("CREATE TABLE sample_08 (code string,description string,total_emp
     int,salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TextFile")
  7. In Beeline, show the Hive tables.

  8. [0: jdbc:hive2://hostname.com:> show tables;
    +------------+--+
    | tab_name |
    +------------+--+
    16 | Spark Guide
    Developing Spark Applications
    | sample_07 |
    | sample_08 |
    +------------+--+
  9. Load the data in the CSV files into the tables.

  10. scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_07.csv' OVERWRITE INTO TABLE
     sample_07")
    scala> sqlContext.sql("LOAD DATA INPATH '/user/hdfs/sample_08.csv' OVERWRITE INTO TABLE
     sample_08")
  11. Create DataFrames containing the contents of the sample_07 and sample_08 tables.

  12. scala> val df_07 = sqlContext.sql("SELECT * from sample_07")
    scala> val df_08 = sqlContext.sql("SELECT * from sample_08")
  13. Show all rows in df_07 with salary greater than 150,000.

  14. scala> df_07.filter(df_07("salary") > 150000).show()

    The output should be:

    +-------+--------------------+---------+------+
    | code| description|total_emp|salary|
    +-------+--------------------+---------+------+
    |11-1011| Chief executives| 299160|151370|
    |29-1022|Oral and maxillof...| 5040|178440|
    |29-1023| Orthodontists| 5350|185340|
    |29-1024| Prosthodontists| 380|169360|
    |29-1061| Anesthesiologists| 31030|192780|
    |29-1062|Family and genera...| 113250|153640|
    |29-1063| Internists, general| 46260|167270|
    |29-1064|Obstetricians and...| 21340|183600|
    |29-1067| Surgeons| 50260|191410|
    |29-1069|Physicians and su...| 237400|155150|
    +-------+--------------------+---------+------+
  15. Create the DataFrame df_09 by joining df_07 and df_08, retaining only the code and description columns.

  16. scala> val df_09 = df_07.join(df_08, df_07("code") ===
    df_08("code")).select(df_07.col("code"),df_07.col("description"))
    scala> df_09.show()

    The new DataFrame looks like:

    +-------+--------------------+
    | code| description|
    +-------+--------------------+
    |00-0000| All Occupations|
    |11-0000|Management occupa...|
    |11-1011| Chief executives|
    |11-1021|General and opera...|
    |11-1031| Legislators|
    |11-2011|Advertising and p...|
    |11-2021| Marketing managers|
    |11-2022| Sales managers|
    |11-2031|Public relations ...|
    |11-3011|Administrative se...|
    |11-3021|Computer and info...|
    |11-3031| Financial managers|
    |11-3041|Compensation and ...|
    |11-3042|Training and deve...|
    |11-3049|Human resources m...|
    |11-3051|Industrial produc...|
    |11-3061| Purchasing managers|
    |11-3071|Transportation, s...|
    |11-9011|Farm, ranch, and ...|
    +-------+--------------------+
  17. Save DataFrame df_09 as the Hive table sample_09.

  18. scala> df_09.write.saveAsTable("sample_09")
  19. In Beeline, show the Hive tables.

  20. [0: jdbc:hive2://hostname.com:> show tables;
    +------------+--+
    | tab_name |
    +------------+--+
    | sample_07 |
    | sample_08 |
    | sample_09 |
    +------------+--+

And you're done! You now know how to deal with tables using Spark SQL in just 10 steps.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
spark sql ,tutorial ,dataframe ,hive ,big data

Published at DZone with permission of Srini Pesala. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}