
Importing Data Into Splice Machine


Dive into importing and upserting data into Splice Machine. This guide covers the process as well as handling challenges like timestamps and special characters.


This topic describes how to import data into your Splice Machine database using built-in procedures for importing and upserting data.

About Bulk Loading of Data Into Splice Machine

There are several ways of inserting data into your Splice Machine database. You can:

  • Use an ETL tool that can connect with Splice Machine via ODBC or JDBC
  • Insert data into existing tables using SQL statements such as INSERT

Those methods work, but they do not leverage the parallel processing power of your cluster. Many of our customers need to import large amounts of data in a bulk load operation: for example, when moving data from an existing database into Splice Machine after the initial installation, or when a financial trading firm needs to load the past 24 hours of stock trading transactions.

The Splice Machine SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures allow you to perform bulk imports that utilize the parallel processing power of Hadoop’s scale-out architecture while distributing the data evenly into different region servers to prevent “hot spots.” We utilize a custom-built, asynchronous write pipeline, which is an extension to the native HBase pipeline, to accomplish this effort.

Our bulk import process provides a number of significant advantages in comparison to traditional Hadoop ETL processes:

  • Error and consistency checking of incoming data allows for rejection of incorrect records; for example, primary key violations and invalid data such as a wrong data type.
  • You can monitor progress using the Splice Machine Management Console.
  • Provides a consistent view of data even when numerous concurrent updates are occurring.
  • Our write processes do not block read processes.
  • Incremental updates and upserts are allowed.

Importing a Subset of Data From a File

When you import data from a file into a table, not all of the data in the file is necessarily imported. This can happen in either of these circumstances:

  • If the table into which you’re importing contains fewer columns than the data file, the “extra” columns of data at the end of each line are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file’s d and e columns will be ignored. See Example 2: Importing Data Into a Table With Fewer Columns Than in the File at the end of this topic.
  • If the insertColumnList in your import call specifies only a subset of the columns in the data file, then only those columns are imported; however, intervening columns in your table are then overwritten with default values. For example, if your table contains columns (a, b, c) and you only want to import columns (a, c), then the data in your table’s b column will be replaced with the default value for that column. See Example 3: Importing a Subset of Data From a File Into a Table at the end of this topic.

Note that this means that you cannot import multiple subsets into different tables from the same input file.

Using Our Import and Upsert Procedures

The SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures are nearly identical in syntax and function. The difference is that the SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedure first determines whether the database already contains a record that matches an incoming record:

  • If a matching record is found in the database, that record is updated with column values from the incoming record.
  • If no matching record is found in the database, the incoming record is added to the database as a new record, exactly as it would be if you had called SYSCS_UTIL.IMPORT_DATA.

The SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures both import data to a subset of columns in a table. You choose the subset of columns by specifying insert columns.

Note: The SYSCS_UTIL.SYSCS_IMPORT_DATA system procedure, which operates similarly to SYSCS_UTIL.IMPORT_DATA and was available in earlier versions of Splice Machine, is targeted for eventual deprecation and is no longer documented.

After a successful import completes, a simple report shows how many files were imported or checked, and how many record imports or checks succeeded or failed. (Note that you can ignore the numtasks value shown in this report).

IMPORTANT: On a cluster, the files to be imported MUST be on HDFS (or MapR-FS), and must be readable by the HBase user.

Syntax

call SYSCS_UTIL.IMPORT_DATA (
schemaName,
tableName,
insertColumnList | null,
fileOrDirectoryName,
columnDelimiter | null,
characterDelimiter | null,
timestampFormat | null,
dateFormat | null,
timeFormat | null,
badRecordsAllowed,
badRecordDirectory | null,
oneLineRecords | null,
charset | null
);

 

call SYSCS_UTIL.UPSERT_DATA_FROM_FILE (
schemaName,
tableName,
insertColumnList | null,
fileOrDirectoryName,
columnDelimiter | null,
characterDelimiter | null,
timestampFormat | null,
dateFormat | null,
timeFormat | null,
badRecordsAllowed,
badRecordDirectory | null,
oneLineRecords | null,
charset | null
);


schemaName

The name of the schema of the table into which you are importing.

tableName

The name of the table into which you are importing.

insertColumnList

The names, in single quotes, of the columns to import. If this is null, all columns are imported.

If you don’t specify an insertColumnList and your input file contains more columns than are in the table, then the extra columns at the end of each line in the input file are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file’s d and e columns will be ignored.

If you do specify an insertColumnList and the number of columns doesn’t match your table, then any table columns not named in the list are filled with their default value (or NULL if there is no default for the column). For example, if your table contains columns (a, b, c) and you only want to import columns (a, c), then the data in your table’s b column will be replaced with the default value for that column.
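As an illustration of that default-fill rule, here is a minimal Python sketch (not Splice Machine code; the function name and table values are hypothetical):

```python
# Hypothetical sketch (not Splice Machine code) of the default-fill rule:
# table columns missing from the insertColumnList receive the column's
# default value, or None (NULL) when no default is defined.

def apply_insert_column_list(table_columns, defaults, insert_list, file_row):
    """Map file values onto table columns, filling the rest with defaults."""
    values = dict(zip(insert_list, file_row))
    return [values.get(col, defaults.get(col)) for col in table_columns]

# Table (a, b, c) where b has a default; we import only columns (a, c).
row = apply_insert_column_list(
    table_columns=["a", "b", "c"],
    defaults={"b": "myDefaultVal"},
    insert_list=["a", "c"],
    file_row=[1, 2],
)
print(row)  # [1, 'myDefaultVal', 2]
```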

fileOrDirectoryName

Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. Note that files can be compressed or uncompressed.

columnDelimiter

The character used to separate columns. Specify null to use the comma (,) character as your delimiter.

In addition to using single characters, you can specify the following special characters as delimiters:

Special character Display
\t Tab
\f Formfeed
\b Backspace
\\ Backslash
^a (or ^A) Control-a

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

characterDelimiter

Specifies which character is used to delimit strings in the imported data. You can specify null or the empty string to use the default string delimiter, which is the double-quote (").

In addition to using single characters, you can specify the following special characters as delimiters:

Special character Display
\t Tab
\f Formfeed
\b Backspace
\\ Backslash
^a (or ^A) Control-a


If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.

If your input contains control characters such as newline characters, make sure that those characters are embedded within delimited strings.

To use the single quote (') character as your string delimiter, you need to escape that character. This means that you specify four single quotes ('''') as the value of this parameter. This is standard SQL syntax.

The Examples section below contains an example that uses the single quote as the string delimiter character.

timestampFormat

The format of timestamps stored in the file. You can set this to null if there are no timestamps in the file, or if the format of any timestamps in the file matches the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss". See the About Timestamp Formats section below for more information about timestamps.

All of the timestamps in the file you are importing must use the same format.
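Since all timestamps must share one format, a pre-flight check can save a failed import. The sketch below is illustrative Python, not part of Splice Machine; the Java pattern "yyyy-MM-dd HH:mm:ss" corresponds to Python's "%Y-%m-%d %H:%M:%S":

```python
from datetime import datetime

# Pre-flight check (illustrative, not part of Splice Machine): confirm that
# every timestamp in a column parses with one format before importing.
# The Java pattern "yyyy-MM-dd HH:mm:ss" maps to Python's "%Y-%m-%d %H:%M:%S".

def all_timestamps_match(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Return True when every value parses with the given strptime format."""
    try:
        for value in values:
            datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

print(all_timestamps_match(["2013-03-23 09:45:00", "2013-04-21 09:15:32"]))  # True
print(all_timestamps_match(["03/23/2013 09:45"]))                            # False
```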

dateFormat

The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if the format of any dates in the file matches the pattern: "yyyy-MM-dd".

timeFormat

The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of any times in the file matches the pattern: "HH:mm:ss".

badRecordsAllowed

The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back.

  • If you specify -1 as the value of this parameter, all record import failures are tolerated and logged.
  • If you specify 0 as the value of this parameter, the import will fail if even one record is bad.

In previous releases of Splice Machine, this parameter was named failBadRecordCount, and a value of 0 meant that all record import failures were tolerated. This has been changed, and you now must specify a value of -1 for badRecordsAllowed to tolerate all bad records.

badRecordDirectory

The directory in which bad record information is logged. Splice Machine logs this information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.

The default value is the directory in which the import files are found.

oneLineRecords

A Boolean value that specifies whether each line in the import file contains one complete record:

  • If you specify true or null, then each record is expected to be found on a single line in the file.
  • If you specify false, records can span multiple lines in the file.
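To see why this setting matters, here is a small illustration using Python's csv module (outside Splice Machine): a quoted field that spans two physical lines is one logical record, which is the case requiring oneLineRecords to be false:

```python
import csv
import io

# Illustration (Python's csv module, outside Splice Machine): a quoted field
# that spans two physical lines still forms a single logical record, which is
# the situation that requires oneLineRecords = false.
data = '1,"first line\nsecond line",Able\n2,one line,Baker\n'
rows = list(csv.reader(io.StringIO(data)))

print(len(rows))   # 2 records, even though the file has three physical lines
print(rows[0][1])  # the two-line field, reassembled into one value
```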

charset

The character encoding of the import file. The default value is UTF-8. Currently, any other value is ignored and UTF-8 is used.

Handling Generated Column Values in Imported Files

If you’re importing data into a table with generated columns (see generated-column-spec in the SQL Reference Manual), you should know that imported records are handled in exactly the same manner as records inserted using the INSERT statement.

Here’s a simple summary of what happens for generated columns, including DEFAULT values, in imported records:

  • If your insertColumnList includes the column name and the imported column value is empty, NULL is inserted into the database table column.
  • If your insertColumnList includes the column name and the imported column value is not empty, the column value is imported, unless the value is not valid for the table column.
  • If your insertColumnList does not include the column name, the generated value is inserted into the database table column.

Generated Column Import Examples

To illustrate what happens with generated column values in imported records, we’ll use this simple database table created with this statement:

CREATE TABLE myTable (
   colA INT,
   colB CHAR(12) DEFAULT 'myDefaultVal',
   colC INT);


insertColumnList Values in import record Values inserted into database Notes
“A,B,C”          1,,2                    [1,NULL,2]
“A,B,C”          3,de,4                  [3,de,4]
“A,B,C”          1,2                     Error: column B wrong type
“A,B,C”          1,DEFAULT,2             [1,”DEFAULT”,2]               DEFAULT is imported as a literal value
Empty            1,,2                    [1,NULL,2]
Empty            3,de,4                  [3,de,4]
Empty            1,2                     Error: column B wrong type
“A,C”            1,2                     [1,myDefaultVal,2]
“A,C”            3,4                     [3,myDefaultVal,4]

Note that the value DEFAULT in the imported file is not interpreted to mean that the default value should be applied to that column; instead:

  • If the target column in your database has a string data type, such as CHAR or VARCHAR, the literal value “DEFAULT” is inserted into your database.
  • If the target column is not a string data type, an error will occur.

How to Use Generated Values for a Column in Some (But Not All) Imported Records

If you are importing a file into a table with a generated column, and you want to import some records with actual values and apply generated values to other records, you need to split your import file into two files and import each:

  • Import the file containing records with non-default values with the column name included in the insertColumnList.
  • Import the file containing records with default values with the column name excluded from the insertColumnList.
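A minimal sketch of that split (a hypothetical Python helper, not a Splice Machine utility): rows carrying a value for the generated column keep it, while rows with an empty value have the column dropped entirely, so the second file can be imported without that column in its insertColumnList:

```python
# Hypothetical helper (not a Splice Machine utility) for the split described
# above: rows that carry a value for the generated column (at col_idx) keep
# it; rows with an empty value have the column removed entirely, so that the
# second file can be imported without the column in its insertColumnList.

def split_rows(rows, col_idx):
    with_value, without_value = [], []
    for row in rows:
        if row[col_idx] != "":
            with_value.append(row)
        else:
            without_value.append(row[:col_idx] + row[col_idx + 1:])
    return with_value, without_value

rows = [["1", "abcd", "2"], ["3", "", "4"]]
have, need = split_rows(rows, 1)
print(have)  # [['1', 'abcd', '2']] -> import with the column listed
print(need)  # [['3', '4']]         -> import with the column omitted
```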

When you export a table with generated columns to a file, the actual column values are exported, so importing that same file into a different database will accurately replicate the original table values.

Record Import Failure Reasons

When upserting data from a file, the input file you generate must contain:

  • the columns to be changed
  • all NOT NULL columns

Typical reasons for a row (record) import to fail include:

  • Improper data expected for a column.
  • Improper number of columns of data.
  • A primary key violation: SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE will only work correctly if the table into which you are inserting/updating has primary keys.

Compacting Tables After Bulk Imports

We recommend that you run a full compaction on tables into which you have imported a large amount of data, using the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure.

About Timestamp Formats

Splice Machine uses the following Java date and time pattern letters to construct timestamps:

Pattern Letter Description             Format(s)
y              year                    yy or yyyy
M              month                   MM
d              day in month            dd
h              hour (0-12)             hh
H              hour (0-23)             HH
m              minute in hour          mm
s              seconds                 ss
S              fractional seconds      SSS (up to 6 decimal digits: SSSSSS)
z              time zone, text         e.g. Pacific Standard Time
Z              time zone, time offset  e.g. -0800
The default timestamp format for Splice Machine imports is: yyyy-MM-dd HH:mm:ss, which uses a 24-hour clock, does not allow for decimal digits of seconds, and does not allow for time zone specification.

Note: The standard Java library does not support microsecond precision, so you cannot specify microsecond (SSSSSS) values in a custom timestamp format and import such values with the desired precision.

Timestamps and Importing Data at Different Locations

Note that timestamp values are relative to the geographic location at which they are imported, or more specifically, relative to the timezone setting and daylight saving time status where the data is imported.

This means that timestamp values from the same data file may appear differently after being imported in different time zones.
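A quick illustration of this caveat (Python, assuming the zoneinfo timezone database is available): the same zone-less timestamp string denotes instants three hours apart when interpreted in New York versus Los Angeles:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # assumes the IANA timezone database is available

# Illustration of the caveat above: a timestamp string with no zone denotes
# different instants depending on the local timezone where it is interpreted.
naive = datetime.strptime("2013-03-23 09:45:00", "%Y-%m-%d %H:%M:%S")
as_new_york = naive.replace(tzinfo=ZoneInfo("America/New_York"))
as_los_angeles = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))

diff_hours = (as_los_angeles - as_new_york).total_seconds() / 3600
print(diff_hours)  # 3.0 -- the same wall-clock string, three hours apart
```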

Examples

The following table shows valid examples of timestamps and their corresponding format (parsing) patterns:

Timestamp value              Format Pattern            Notes
2013-03-23 09:45:00          yyyy-MM-dd HH:mm:ss       This is the default pattern.
2013-03-23 19:45:00.98-05    yyyy-MM-dd HH:mm:ss.SSZ   This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.
2013-03-23 09:45:00-07       yyyy-MM-dd HH:mm:ssZ      This pattern requires a time zone specification, but does not allow for decimal digits of seconds.
2013-03-23 19:45:00.98-0530  yyyy-MM-dd HH:mm:ss.SSZ   This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification.
2013-03-23 19:45:00.123      yyyy-MM-dd HH:mm:ss.SSS   This pattern allows up to 3 decimal digits of seconds, but does not allow a time zone specification. If your data specifies more than 3 decimal digits of seconds, an error occurs.
2013-03-23 19:45:00.12       yyyy-MM-dd HH:mm:ss.SSS   (same pattern as the previous row)
2013-03-23 19:45:00.1298     yyyy-MM-dd HH:mm:ss.SSSS  This pattern allows up to 4 decimal digits of seconds, but does not allow a time zone specification.

Please see Working With Date and Time Values in the Splice Machine documentation for information about working with timestamps, dates, and times.

Examples

The examples in this section illustrate using different timestamp formats and different string delimiter characters.

Note that these examples work for either importing or upserting data: you can simply substitute UPSERT_DATA_FROM_FILE in place of IMPORT_DATA in any of the system procedure calls below.

Example 1: Using the Default Timestamp Formats

You can use up to four default timestamp formats on any column in the same table without needing to specify the format string explicitly in the import command.

For example, given the CSV file below which contains different timestamp formats across
different columns:

Mike,2013-04-21 09:21:24.98-05,2013-03-23 16:24:37.651-05
Mike,2013-04-21 09:15:32,2013-04-21 09:15:32.005-05
Mike,2013-03-23 09:45:00-05,2013-03-23 09:45:00+07


Create your sample table with the following statement:

create table tabx (c1 char(30),c2 timestamp, c3 timestamp,
primary key(c2));


You can then import the data with the following call:

call SYSCS_UTIL.IMPORT_DATA('splice', 'tabx', null,
'/path/to/ts.csv',
',', '''',
null, null, null, 0, null, true, null);


In this example, we specify null for the timestamp formats. Then, when each row is imported, the format of any timestamp in the data is automatically detected, since they all conform to the four default formats.
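A rough analogue of this detection in Python (illustrative only; this is NOT Splice Machine's implementation, and Python's %z needs four-digit offsets like -0500 rather than the -05 form shown in the sample file):

```python
from datetime import datetime

# Rough analogue of format auto-detection (illustrative; this is NOT Splice
# Machine's implementation). Note that Python's %z requires offsets such as
# -0500, rather than the two-digit -05 form shown in the sample CSV file.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",       # 2013-04-21 09:15:32
    "%Y-%m-%d %H:%M:%S%z",     # 2013-03-23 09:45:00-0500
    "%Y-%m-%d %H:%M:%S.%f%z",  # 2013-04-21 09:21:24.98-0500
]

def detect_format(value):
    """Return the first candidate format that parses the value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None

print(detect_format("2013-04-21 09:15:32"))       # %Y-%m-%d %H:%M:%S
print(detect_format("2013-03-23 09:45:00-0500"))  # %Y-%m-%d %H:%M:%S%z
```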

Example 2: Importing Data Into a Table With Fewer Columns Than in the File

If the table into which you’re importing data has fewer columns than the data file that you’re importing, the “extra” data columns in the file are ignored. For example, if you create a table with this statement:

CREATE TABLE playerTeams(ID int primary key, Team VARCHAR(32));


And your data file looks like this:

1,Cards,Molina,Catcher
2,Giants,Posey,Catcher
3,Royals,Perez,Catcher


When you import the file into playerTeams, only the first two columns are imported:

call SYSCS_UTIL.IMPORT_DATA('SPLICE','playerTeams',null, 'myData.csv',
null, null, null, null, null, 0, 'importErrsDir', true, null);

SELECT * FROM playerTeams ORDER by ID;
ID   |TEAM    
--------------
1    |Cards
2    |Giants
3    |Royals

3 rows selected

Example 3: Importing a Subset of Data From a File Into a Table

This example uses the same table and import file as the previous example, and it produces the same results. The difference between these two examples is that this one explicitly imports only the first two columns of the file, which are named ID and TEAM:

call SYSCS_UTIL.IMPORT_DATA('SPLICE','playerTeams', 'ID, TEAM', 'myData.csv',
null, null, null, null, null, 0, 'importErrsDir', true, null);

SELECT * FROM playerTeams ORDER by ID;
ID   |TEAM    
--------------
1    |Cards
2    |Giants
3    |Royals

3 rows selected

Example 4: Specifying a Timestamp Format for an Entire Table

You can use a single timestamp format for the entire table by explicitly specifying a single timestampFormat. For example, given this CSV file:

Mike,2013-04-21 09:21:24.98-05
Mike,2013-04-21 09:15:32.78-04
Mike,2013-03-23 09:45:00.68-05


You can then import the data with the following call:

call SYSCS_UTIL.IMPORT_DATA('app','tabx','c1,c2',
'/path/to/ts3.csv',
',', '''', 
'yyyy-MM-dd HH:mm:ss.SSZ',
null, null, 0, null, true, null);


Note that for any of the import use cases shown above, the time shown in the imported table depends on the timezone setting where the data is imported. In other words, given the same CSV file, the values shown in the table will differ if the file is imported at locations in different time zones. Additionally, daylight saving time may account for a 1-hour difference if a time zone is specified.

Example 5: Importing Strings With Embedded Special Characters

This example imports a CSV file that includes embedded newline characters in some of the input strings. We use the default double-quote as our character delimiter to import data such as the following:

1,This field is one line,Able
2,"This field has two lines
This is the second line of the field",Baker
3,This field is also just one line,Charlie


We then use the following call to import the data:

call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, null, null, null, null, 0, 'importErrsDir', true, null);


We can also explicitly specify double quotes (or any other character) as our delimiter character for strings:

call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, '"', null, null, null, 0, 'importErrsDir', true, null);

Example 6: Using Single Quotes to Delimit Strings

This example performs the same import as the previous example, simply substituting single quotes for double quotes as the character delimiter in the input:

1,This field is one line,Able
2,'This field has two lines
This is the second line of the field',Baker
3,This field is also just one line,Charlie
  


Note that you must escape single quotes in SQL, which means that you actually define the character delimiter parameter with four single quotes, as follows:

call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, '''', null, null, null, 0, 'importErrsDir', true, null);
  




Published at DZone with permission of Erin Driggers, DZone MVB. See the original article here.
