With our ongoing commitment in supporting “Big Data”, we at Pentaho have beefed up our Data Integration tool by adding a bulk loading step to support Greenplum, the massively parallel processing database from EMC. The new “Greenplum Load” step provides a graphical user interface to Greenplum's “gpload” utility. The “gpload” utility provides an interface to the database's external table parallel loading feature.
I am going to show how to use gpload from the command line and then from Pentaho Data Integration. Running gpload from the command line is straightforward. Here I've placed the “customers-100.txt” sample file we distribute with PDI into a folder on our Greenplum server:
[gpadmin@greenplum test]$ ls -l
-rw-r--r-- 1 gpadmin gpadmin 556 Aug 3 20:31 control.cfg
-rw-r--r-- 1 gpadmin gpadmin 8754 Aug 3 20:24 customers-100.txt
Along with the data file is a control file. The contents:
- FORMAT: TEXT
- DELIMITER: ';'
- QUOTE: ''
- HEADER: FALSE
- ERROR_LIMIT: 50
- ERROR_TABLE: customer_100_load_errors
- TABLE: customers_100
- MODE: insert
A configuration file follows the YAML standard. From Wikipedia, “YAML is a human-readable data serialization format.” Wikipedia can describe it a bit more. The Greenplum Admin guide has excellent examples.
To load the customers_100 table I simply execute gpload:
[gpadmin@greenplum test]$ gpload -f control.cfg
2011-08-03 20:59:10|INFO|gpload session started 2011-08-03 20:59:10
2011-08-03 20:59:10|INFO|started gpfdist -p 8000 -P 9000 -f "customers-100.txt" -t 30
2011-08-03 20:59:10|WARN|1 bad row
2011-08-03 20:59:10|INFO|running time: 0.20 seconds
2011-08-03 20:59:10|INFO|rows Inserted = 100
2011-08-03 20:59:10|INFO|rows Updated = 0
2011-08-03 20:59:10|INFO|data formatting errors = 0
2011-08-03 20:59:10|INFO|gpload succeeded with warnings
We did have one error: our data file has a header row and I did not remove it. That's ok, all the other rows made it into the table.
So, to get all that done I created the table in the database using DB Visualizer, created the YAML file using gedit and then ran gpload a few times to iron out some data mapping issues. I shared the final successful run with you.
No, it didn't really go that way, but that is a realistic scenario for database and ETL developers when using bulk loading from the command line. I'll fill you in on how I set up the command line sample a little later on.
Now I'll bulkload with Pentaho Data Integration and the “Greenplum load” step. I'll fire up Spoon, the graphical user interface:
Displayed is a two step transformation I created. The “Read Customers 100” step will read the customers-100.txt file and parse the data into fields. The data that read from the files is passed to the “Greenplum Load” step that will:
- Create a temporary data file from the input stream
- Write the YAML control file based on the fields in the input stream and the step's configuration
- Execute gpload in the same manner as I showed in the command line example
Here is the new “Greenplum Load” step dialog with its “Field” pane enabled:
The list of fields was populated by clicking “Get fields”. There's no need to enter this manually as the step can read it from the input stream provided by the “Read Customers 100” step.
To create the “customers_100” table (if it does not exist) just click the “SQL” button to display the DDL in a dialog. Clicking “Execute” will create the table.
On to the “GP Configuration” tab:
Here I specify where the step can find the gpload program, where to write the control and data files, and what to place in the control file.
And now back to the entire transformation for an explanation on how it is executed.
We have a CentOS server running Greenplum 188.8.131.52. It is on this server that gpload is found. Being a “backend database” server we do not run the client tool, Spoon, from that server. We do have a distribution of PDI on that server so we can run a carte server. This server will run the transformation.
I specified “gp8084” as the carte server to use and clicked launch to execute the transformation.
The result of the execution:
We do not have gpload throwing out the header row as we did in the command line example. The “Reed Customers 100” step was configured to ignore the first row of the data file.
I admit it: I did not use gedit to create the control file for the command line example. I configured the step to be “Manual load (only creation of files)”. This generated YAML. While I was there I generated the table using the SQL button.
Thanks for reading! I intend to be back with more on the parallel load features of gpload and how clustering in Pentaho Data Integration can aid in parallel loading.