Migrate From Any Relational Engine to Couchbase Using N1QL and SQSL


Using Couchbase's N1QL and the client-side scripting language SQSL, you can migrate from just about any relational database under the sun to Couchbase.


Yes, you read that right: I have not misspelled SQL; there is indeed an extra "S" in the middle!

A quarter of a century ago, when I was in charge of designing and developing the ERP system at my first employer, people would often come to me and ask me to implement a simple report, which would consist at most of a couple of SELECT statements, minimal glue code, and some formatting.

Simple task, you would think, except that the publishing cycle for the application, considering the powerful technological means at our disposal (our development machine at the time was an NCR Tower32 200 with a 5 ¼" 100MB hard disk, 4MB of RAM and AT&T SYSV R2 shipped on a DC600 cartridge¹), would be at the very least 24 hours, if not 48.

Wouldn't it be nice, I said to myself (yes, being a geek can be a very lonely affair!), if I could just store the SQL and formatting somewhere in the database, and have a scripting language runner do the rest of the work for me, or, rather, my users?

A few late nights later a prototype implementation had been deployed, with minimal control syntax, an expansion facility, and formatting capabilities. The Structured Query Scripting Language was born.

People have a nasty habit of changing employers, and I was no different, so fast-forward a few years: I had switched country and industry.

SQSL, still without a name and part of a larger application framework, had very much fallen to the bottom of my to-do list.

Still, all was not lost: When my new employer was acquired by a larger corporation, in a spontaneous fight for survival, the language gained some new major features like database engine vendor independence (each database engine would get its own driver in the form of a shared library), concurrent engine access, data redirection, and User Defined Functions.

A few years later still, the language was finally separated from the application framework it had begun life conjoined to, found a Windows port (in fact, two) and finally gained a name, albeit not a very imaginative one.

Now SQSL is an ETL scripting environment that supports numerous databases, user-defined functions, and a rich set of control structures:


But it was only in late 2015, having joined Couchbase, that it occurred to me that the language could be a solid framework for ETL operations between SQL and NoSQL: N1QL fitted nicely with SQSL's syntax, and all the latter needed was a driver for Couchbase Server and a few JSON UDFs.

This article shows how to put the two to good use.

¹ Some of my first employer's old machines have rightfully found their place in a local computer science history museum: https://www.flickr.com/photos/31231773@N02/albums/72157624950108347 and https://www.flickr.com/photos/31231773@N02/sets/72157624949903613/

Language Overview

SQSL is a client-side SQL-like scripting language geared to developing sophisticated scripts that run in a single place, or, put differently, to avoiding the ugly mess of mixing and matching various database shells and separate scripting languages in order to accomplish otherwise straightforward tasks.

It offers a wide range of features, sometimes borrowed from other languages, among which you will find:

  • database independence

  • control and simple statements

  • simple control constructs for parallel operation (Perl, Bourne, and Korn shells)

  • statement redirection

  • aggregates, data-driven operation (awk)

  • formatting (x4GL, Informix SQL, Oracle UFI / SQL*Plus...)

  • expansion facility (Bourne and Korn shells)

The aim is simple: if you know SQL and server-side scripting languages on the one hand, and traditional scripting languages on the other, then an SQSL script should be immediately obvious in meaning to you, even if you have never before come across 'S', 'Q', 'S' and 'L', in that particular order.

By way of example, consider this rather contrived "hello world!" script, which selects some data using a database shell and passes it to awk to transform and format it.

dbaccess <<"EOF" 2>/dev/null | awk '($1=="hello") { $1 = $1 FS "world" } ($1 != "") { printf "%s!\n", $1 }'
database stores_demo;
output to pipe 'cat' without headings
  select 'hello' from systables where tabid=1
EOF


The end result is


 hello world! 


The script has to first pipe the data to cat to remove the projection list labels and then to awk in order to remove blank lines, conditionally alter text (append world when it sees hello), and format it (it just appends an exclamation mark).

Compare it to this SQSL script, which does everything using one binary and within a single statement:

./cli - <<"EOF" 2>/dev/null
database stores_demo;
select "hello" from systables where tabid=1
  aggregate when column(1)="hello" then let column(1) ||= " world"  
  format full "%s!"
EOF

Or, if you prefer a more awk-like syntax:

./cli - <<"EOF" 2>/dev/null
database stores_demo;
select "hello" from systables where tabid=1
  aggregate ($1="hello") $1 ||= " world"
  format full "%s!"
EOF

The difference is even more apparent when more than one database engine is involved in the script. Say we have to take different actions depending on data coming from two sources; this:


connect to "stores_demo" source ifmx;
connect to "sample" source db2cli;
select count(*) from systables connection "stores_demo" into ifx;
select count(*) from sysibm.tables connection "sample" into db2;
if (db2<ifx) then
    display db2 format full "db2 has %i tables";
else
    display ifx format full "informix has %i tables";
end if


Looks a damned sight more immediate than two database shells, the Bourne shell and awk all chipping in.

But I'm getting sidetracked.

The point of this article is not to dissect the syntax of SQSL, so I will not be providing a detailed feature discussion. If you are interested, you can find the whole syntax reference here.

Instead, we will be covering increasingly elaborate data movement examples, introducing the relevant language features as we encounter them.

For now, the only point I'll make is that SQSL is implemented as a precompiler.

If the engines you are connecting to understand anything that resembles SQL, then the SQSL parser is able to extract the relevant dialects from the actual script, pass the extracted statements to the engines, and make sense of the rest:


This is precisely what we are doing with N1QL in this particular article.

Migrating Rows to Documents: A Simple Strategy

In general terms, migrating a database from a relational engine to a document store entails the following steps (see the sketch right after this list):

  • scan the tables system catalog. For each table:

    • determine the primary key columns

    • describe each column (gather name and type)

    • scan the table. For each row:

      • assemble the document key from the primary key column values and the table name

      • assemble the output document from each column name, type and value

      • insert into the target bucket using the generated key and document
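
Here is a minimal Python sketch of that per-table, per-row flow. It is purely illustrative: fetch_tables(), fetch_rows() and bucket.insert() are hypothetical helpers standing in for whatever relational driver and Couchbase SDK you would actually use.

# Illustrative sketch of the row-to-document strategy described above.
# fetch_tables(), fetch_rows() and bucket.insert() are hypothetical helpers.
def migrate(bucket, fetch_tables, fetch_rows):
    for table in fetch_tables():                 # scan the tables system catalog
        pk_cols = table.primary_key              # primary key column names
        col_names = table.column_names           # column descriptions
        for row in fetch_rows(table.name):       # scan the table
            # document key: table name plus primary key values, "::"-separated
            key = "::".join([table.name] + [str(row[c]) for c in pk_cols])
            # document: one field per column
            doc = {name: row[name] for name in col_names}
            bucket.insert(key, doc)              # insert into the target bucket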

Let's Get Down to It

Now that we have an idea of the end result, we'll meander through a series of examples, each demonstrating individual steps of the process, starting with...

Migrating a Single Table

let fromconn="stores_demo";
connect to fromconn source ifmx;
connect to "couchbase://192.168.1.104:8091" source cb;
select * from customer
  connection fromconn
  insert into default (key, value)
    values($1, $2)
    using json:key("::", columns),
          json:row2doc(displaylabels, columns);


That's all there is to it: three statements (well, OK, four, counting the convenience LET).

This example moves the contents of the 'customer' table in Informix's standard stores demo database to Couchbase's default bucket using:

  • concurrent connections

  • statement redirection

  • placeholders and

  • UDFs

The first three statements are pretty straightforward: they just connect to an Informix database engine on one side and a Couchbase cluster on the other.

All the action happens in the last statement. The SQSL parser splits the statement into several parts:

  • a SELECT statement which it prepares against the connection specified in the connection clause (line 5)

  • an INSERT statement (line 6) which, lacking a connection clause in this example, is prepared against the current connection (in this case, our Couchbase cluster)

  • data transformation logic contained in the using clause (line 8)

Once the statement is parsed, the binary then proceeds to

  • retrieving rows from one connection using the implicitly prepared SELECT statement

  • executing the logic in the using clause (line 8) to assemble a document key and a document from each row examined

  • piping the transformed data to the N1QL INSERT statement which is being executed against the second connection

A few points are worth expanding:

  • In general terms, with statement redirection, data flows from one side to the other without any need for syntactic sugar, as long as the output INSERT clause expects the same number of items as the input SELECT clause provides.

  • In this example, this is clearly not the case: the SELECT clause is providing a not-yet-described tuple, while the INSERT clause expects a document key and a JSON document, which we have to construct from each incoming tuple. Thus, the data flow has to be defined explicitly by means of the using clause.

  • For statement redirection to work, the input clause must be cursor based and the output clause must have placeholders. SQSL will enforce this.

  • The using clause transforms the incoming row values using two User Defined Functions: json:key and json:row2doc (json is just the name of the containing library); a sketch of what they compute follows this list.

    • key() (line 8) in particular is a variadic function that generates a key using the first argument as a separator, and the other arguments as values.

      • the built-in function columns (line 8) returns a hash (AKA associative array) containing all the values for the current input row

    • row2doc() (line 9) is a variadic function that takes a list of field names and a list of values and constructs a JSON document out of them
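
If it helps to see them outside SQSL, here is roughly the shape of what the two UDFs compute, written as a Python sketch. The real implementations are the library's own; treat this as a conceptual equivalent only.

# Conceptual Python equivalents of the json:key and json:row2doc UDFs,
# only meant to show the shape of their output.
import json

def key(separator, *values):
    # variadic: join all argument values into a single document key string
    return separator.join(str(v) for v in values)

def row2doc(names, values):
    # pair up column names and values into a JSON document
    return json.dumps(dict(zip(names, values)))

# Illustrative calls:
# key("::", "customer", 101)                       -> "customer::101"
# row2doc(["fname", "lname"], ["Ludwig", "Pauli"]) -> '{"fname": "Ludwig", "lname": "Pauli"}'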

And here's the finished result: we migrated from

to

And we even did all that without any prior knowledge of the source table schema!

We did say that we would be creating document keys from each table's primary key columns, but if you pay close attention to the image above, clearly, we did not.

Bear with me — we'll get there!

Table Move, Take Two

let fromconn="stores_demo";
connect to fromconn source ifmx;
connect to "couchbase://192.168.1.104:8091" source cb;
let key=0;
select * from customer
  aggregate key+=1
  connection fromconn
  insert into default (key, value)
    values($1, $2)
    using key::string, json:row2doc(displaylabels, columns);


An alternative to using document keys generated from the individual column values is to use surrogate keys, that is, keys that have no relation to the data.

In this case, we are using an incrementing integer value held in the key variable.

The aggregate clause (line 6) can be used to take action on each incoming row, possibly based on some conditions (which may or may not involve the incoming data values).

In this case, we simply use it to increment the key variable on every incoming row, and then use it as the document key in the using clause (line 10).

Couchbase document keys can only be strings, hence the cast on the key variable.
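
Reduced to its essentials, the surrogate-key variant is just a counter cast to a string. A hedged Python sketch, with a hypothetical bucket.insert() helper, might look like this:

# Sketch of the surrogate-key variant: an incrementing counter, cast to a
# string because Couchbase document keys must be strings.
def migrate_with_surrogate_keys(bucket, rows, col_names):
    key = 0
    for row in rows:                  # ~ "aggregate key+=1" in the SQSL script
        key += 1
        bucket.insert(str(key), dict(zip(col_names, row)))   # str() ~ key::string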

Migrate a Database

let fromconn="stores_demo";
connect to fromconn source ifmx;
connect to "couchbase://192.168.1.104:8091" source cb;

foreach select tabname
  from systables
  where tabid>=100 and tabtype="T"
  connection fromconn
  into tab;
    display tab;
    select * from <+get tab+>
      connection fromconn
      insert into default (key, value)
        values($1, $2)
        using json:key("::", tab, columns),
              json:row2doc("tabtype", displaylabels, tab, columns);
end foreach;


We can do one better, and move an entire database in one go.

This can be achieved by augmenting the previous example: we just scan the tables system catalog and reuse the same strategy as before for each table found.

Enter the FOREACH loop at line 5: this prepares and executes a SELECT statement against the source engine, and for each table name retrieved (stored in the tab variable via the storage clause at line 9) it proceeds to execute the code in the FOREACH block.

The rest of the code looks pretty much the same as the previous example, with the notable exception of line 11: we need a mechanism to prepare the input SELECT statement using the correct table name, and this we do by means of the expansion facility (the bit in the <+ +> block in line 11).

This, among other things, evaluates the arguments within and injects the result in the statement text — in this case, it places the contents of the tab variable in the text of the prepared statement.

The last thing worthy of note is that the using clause in line 15 now places the table name in both the document key and the document. This is because the contents of multiple source tables are being placed in the same bucket, and we need to differentiate between document types in order to be able to run meaningful queries.
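
Conceptually, the expansion is textual substitution into the statement before it is prepared, and the table name then travels into both key and document. A rough Python analogy, with table_names, run_query() and bucket.insert() as hypothetical stand-ins, would be:

# Loose analogy of the <+get tab+> expansion: the table name is spliced into
# the statement text before it is prepared, and is also added to the key and
# to the document so document types can be told apart later.
def migrate_database(bucket, table_names, run_query):
    for tab in table_names:                      # from the FOREACH over systables
        select_text = "select * from " + tab     # ~ select * from <+get tab+>
        for row in run_query(select_text):       # row assumed to be a dict of column -> value
            key = "::".join([tab] + [str(v) for v in row.values()])
            doc = {"tabtype": tab, **row}        # tabtype marks the document type
            bucket.insert(key, doc)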

Here's a sample output from the previous script.

At first sight, not very interesting, but I do have a point, I promise.

Database Move, Different Strategy

let fromconn="stores_demo";
connect to fromconn source ifmx;
connect to "couchbase://192.168.1.104:8091" source cb;

prepare insst from insert into default (key, value)
          values($1, $2)
          using json:key("::", tab, columns),
                json:row2doc("tabtype", displaylabels, tab, columns);

foreach select tabname
  from systables
  where tabid>=100 and tabtype="T"
  connection fromconn
  into tab;
    display tab;
    select * from <+get tab+>
      connection fromconn
      execute insst;
end foreach;
free insst;


The statement redirection clause can sport prepared statements, too.

In this example, we PREPARE an INSERT statement beforehand (line 5) and just EXECUTE it (line 18) as a target for the SELECT clause in line 16.

Clearly, at the time the INSERT is prepared, the tab variable is undefined, and there isn't even a source statement to which the displaylabels and columns built-in functions (in lines 7 and 8, in the USING clause) could apply: as such, you would rightfully expect the PREPARE statement to fail.

Lazy evaluation to the rescue!

The insst prepared statement will actually fail if executed on its own, but execute it as part of a redirected statement, and everything suddenly makes semantic sense: the using clause will use the current tab value and the values current for each individual row.

Explaining why using a prepared statement for the SELECT part of the redirected statement, although feasible, offers no advantage over the code above is left as an exercise for the reader.
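
If a Python analogy helps, the prepared statement behaves somewhat like a closure whose free variables are only resolved when it is finally executed inside the redirected statement. This is an analogy only, not a description of SQSL internals:

# Analogy of lazy evaluation: the "prepared" transformation refers to names
# (table, labels, values) that only exist at execution time, much like a
# closure whose free variables are looked up when it is called.
def prepare_transform():
    # at "prepare" time nothing is bound yet, and that is fine
    return lambda tab, labels, values: (
        "::".join([tab] + [str(v) for v in values]),      # document key
        {"tabtype": tab, **dict(zip(labels, values))},    # document body
    )

insst = prepare_transform()     # preparing succeeds even with nothing bound
# ...later, for each incoming row, the redirected statement supplies the context:
# key, doc = insst(tab, displaylabels, column_values)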

Database Move, Different Source

let fromconn="sample";
let schema="DB2INST1";
connect to fromconn source db2cli;
connect to "couchbase://192.168.1.104:8091" source cb;

foreach select name
  from sysibm.systables
  where creator=? and type='T'
  using schema
  connection fromconn
  into tab;
    display tab;
    select * from <+get schema+>.<+get tab+>
      connection fromconn
      insert into default (key, value)
        values($1, $2)
        using json:key("::", tab, columns),
              json:row2doc("tabtype", displaylabels, tab, columns);
end foreach;

The previous example moved the whole of Informix's stores demo database to Couchbase.

This one does the same with DB2's sample demo database.

Note that the process is exactly the same as before. Aside from connecting to a different engine, the changes needed are a different SELECT statement (specific to the source database engine) that the FOREACH uses to find table names and, in this case, the use of schema names when selecting from source tables.

Note how DB2 system catalogs store column names in uppercase:


Parallel Migration

let fromconn="stores_demo";
connect to fromconn source ifmx;

let tabcount=0;
select tabname
  from systables
  where tabid>=100 and tabtype="T"
  aggregate tabcount+=1
  into tabs(tabcount);

clone tabcount into child, children(child);
    connect to fromconn source ifmx;
    connect to "couchbase://192.168.1.104:8091" source cb;
    select * from <+get tabs(child)+>
      connection fromconn
      insert into default (key, value)
        values($1, $2)
        using json:key("::", tabs(child), columns),
              json:row2doc("tabtype", displaylabels, tabs(child), columns);
  parent;
    display tabs(child) format full "moving %s";
    let pid2tab(children(child))=tabs(child);
end clone;
wait for children.* into p, r;
    display pid2tab(p), r format full "completed %s with code %d";
end wait;


We have seen how to migrate whole databases from different engines without knowing the database schema, but we can do better still: we can migrate databases in parallel.

The strategy used is simple — rather than looping through the tables handling each in turn, we'll fork a child for each table involved in the migration.

This could be achieved using a FOREACH loop, as in the earlier examples, and using the fork function to start each child, but we use a more elegant method instead: a CLONE loop.

This essentially forks the required number of children, each executing the code inside the block, or up to the PARENT clause, if present.

The PARENT clause can be used by the forking process to take action after each child has been forked.

In this case (line 20), all it does is just print the table being moved by each child.

After all children have been forked, the parent process just waits for them to complete by means of the WAIT loop at line 23.

Each child behaves as in each of the previous examples.

The only thing left to be explained is how each child determines which table to move: simples!

Note how we had earlier quickly built a list of tables in the tabs hash, using the SELECT statement at line 5 (a neat trick here: the aggregate clause at line 8 advances the index used by the storage clause at line 9, thereby skipping a less efficient FOREACH loop).

The tabs hash is indexed via an integer key. Conveniently, each child is identified by an integer key stored in the child variable, so all each child has to do is pick the table matching its child id, via the expansion at line 14.
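
For comparison, the same fan-out could be sketched with Python's multiprocessing, one child per table. Here migrate_table() is a hypothetical helper doing what the earlier single-table examples did; the CLONE and WAIT loops above do all of this bookkeeping for you:

# Sketch of the same fan-out with multiprocessing: one child per table,
# parent prints progress and waits for all of them, mirroring CLONE / WAIT.
from multiprocessing import Process

def migrate_table(tab):
    # hypothetical helper: connect to both engines and move one table,
    # exactly as in the earlier single-table examples
    pass

def parallel_migrate(tabs):
    children = []
    for tab in tabs:                                   # ~ clone tabcount into child
        p = Process(target=migrate_table, args=(tab,))
        p.start()
        print("moving %s" % tab)                       # ~ the PARENT clause display
        children.append((p, tab))
    for p, tab in children:                            # ~ wait for children.* into p, r
        p.join()
        print("completed %s with code %d" % (tab, p.exitcode))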

As for the earlier examples, here's a sample output:


Remember the output from the earlier database migration example?

Here's the point that I was trying to make — I do realize we didn't really have a lot to migrate, but even with the limited amount of work involved, parallel operations do save time!

...And for My Last Trick: Parallel Migration, Indexes and All!

let fromconn="stores_demo";
connect to fromconn source ifmx;

{ gather table list }
let tabcount=0;
select tabname, tabid
  from systables
  where tabid>=100 and tabtype="T"
  aggregate tabcount+=1
  into tabs(tabcount), tabids(tabcount);

{ fork children }
clone tabcount into child, children(child);
    connect to fromconn source ifmx;

    { gather table primary key column numbers }
    let primcount=0;
    select ikeyextractcolno(indexkeys, idx)
      from sysindices, sysconstraints, table (list{0, 1, 2, 3, 4, 5, 6, 7}) colnos(idx)
      where sysconstraints.constrtype="P"
        and sysconstraints.idxname=sysindices.idxname
        and sysconstraints.tabid=?
        and ikeyextractcolno(indexkeys, idx)>0
        using tabids(child)
        aggregate primcount+=1
        into primkeys(primcount);

    { do the data migration }
    connect to "couchbase://192.168.1.104:8091" source cb;
    select * from <+get tabs(child)+> connection fromconn
      insert into default (key, value)
    values($1, $2)
    using json:key("::", tabs(child),

          { only use primary key columns }
          columns<+ get case when primcount>0
                             then ".(
                                  <+ separator "," get primkeys.(1 to primcount) +>
                                  )"
                             else ""
                        end case +>),
          json:row2doc("tabtype", displaylabels, tabs(child), columns);
  parent;
    display tabs(child) format full "moving %s";
    let pid2tab(children(child))=tabs(child);
end clone;

{ wait for children }
wait for children.* into p, r;
    display pid2tab(p), r format full "completed %s with code %d";
end wait;

{ create indexes }
connect to "couchbase://192.168.1.104:8091" source cb as "cb";
set connection fromconn;

{ loop through tables }
for tabnum in 1 to tabcount do

    {loop through indexes, getting name and columns }
    foreach select idxname, part1, part2, part3, part4, part5, part6, part7, part8,
                   part9, part10, part11, part12, part13, part14, part15, part16
      from sysindexes
      where tabid=?
      using tabids(tabnum)

      { make sure that the index name has no spaces }
      aggregate $1=replace($1, " ", "_")
      into idxname, ikeys.(1 to 16);
        display idxname, tabs(tabnum) format full "creating %s on %s";
        create index `<+get idxname+>`
          on default(<+silent multi quotes "``" separator ","

                 <* get column names from catalog using column numbers *>
                 select colname from syscolumns
                   where tabid=<+get tabids(tabnum)+>
                     and colno in (<+multi separator "," get ikeys.(1 to 16)+>)
                   order by field(colno, list{<+multi separator "," get ikeys.(1 to 16)+>})+>)
          where tabtype="<+get tabs(tabnum)+>"
          connection "cb";
    end foreach;
end for


The examples that we have considered so far have a major issue - you can't join different types of documents because the keys contained in individual documents do not match the document keys they should be referring to (remember we were creating document keys out of the contents of the entire document), and even if they did, we hadn't yet created the supporting indexes.

This is where we put things right.

The first thing we have to do is to determine which columns make up the primary key of each table, so that we can generate the correct document keys.

This is done by the weird-looking SELECT at line 18, which populates the primkeys hash with the list of column numbers of the primary key.

Having the column list, we have to extract the relevant values from the columns hash.

For this, we use a combination of hash expansion and multi-value expansions.

A quick shorthand to generate a partial list of elements of a hash is the following dot notation:

 identifier.(expression list)  

This will generate an expression list with just the hash elements whose keys are in the list following the identifier.

For instance, we can get the values of the first 16 columns with

 columns.(1 to 16) 

So the trick here is to expand the primkeys hash to a list of its elements, and then apply that list to the columns hash.

The expansion at line 38 does exactly that: it generates a comma-separated list of values, which is then applied to the columns hash by the expansion at line 36, if the table in question has a primary key.
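
Stripped of the expansion syntax, the key construction amounts to picking the primary key values out of the row, or falling back to the whole row when there is no primary key. A small illustrative Python sketch:

# Sketch of the key construction with primary-key columns only: pick the
# values at the primary key column positions (1-based, as in SQSL) and fall
# back to all columns when the table has no primary key.
def make_key(tab, column_values, primkeys):
    if primkeys:                                         # ~ case when primcount>0
        picked = [column_values[i - 1] for i in primkeys]
    else:
        picked = column_values                           # ~ else use the whole row
    return "::".join([tab] + [str(v) for v in picked])

# make_key("orders", [1001, "2008-05-20", 104], [1]) -> "orders::1001"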

We now move to index creation.

I could create the indexes relevant to each table in each child; however, this is not a strategy that pays off if, like me, you are working on a single node: the sole indexing service would not be very happy being swamped with index creation requests.

For this reason, in this example, we wait for the children to finish loading and then loop through the tables (line 58), and for each table, we loop through the indexes (line 61).

In line 68 we use an aggregate clause to make sure that the index names we fetch don't have spaces. The indexing service doesn't like them.

After that, we just assemble a N1QL CREATE INDEX statement (line 71) with similar techniques to what we have already seen.

An interesting point to note is that, at line 72, we get the index field names through an expansion directly SELECTing a list of column names.
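
In other words, each iteration ends up assembling a N1QL CREATE INDEX statement of roughly this shape. The Python below only illustrates the string being built; the index and column names are made up:

# Sketch of the statement each iteration assembles: index name and
# backtick-quoted column names are spliced into the N1QL text, and the WHERE
# clause restricts the index to one document type.
def build_create_index(idxname, tabname, colnames):
    quoted = ",".join("`%s`" % c for c in colnames)        # ~ quotes "``" separator ","
    return 'create index `%s` on default(%s) where tabtype="%s"' % (
        idxname.replace(" ", "_"), quoted, tabname)        # ~ replace($1, " ", "_")

# build_create_index(" 102_4", "customer", ["customer_num"]) ->
#   'create index `_102_4` on default(`customer_num`) where tabtype="customer"'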

After the migration, we can easily port an original relational query

select customer.*, orders.*
  from customer, orders
  where customer.customer_num=orders.customer_num

to N1QL:

select customer.*, orders.*, meta(customer).id cid, meta(orders).id oid
   from default orders
     join default customer
       on keys "customer::"||to_string(orders.customer_num)
   where orders.tabtype="orders"
     and customer.tabtype="customer"
     and orders.customer_num is not missing

Which, just to prove the point, uses the correct index:

Conclusion

We have quickly gone through several features of a very old project of mine, originally geared to migrating between relational databases, and we have used it to migrate databases from the relational world to Couchbase Server, using N1QL.

This we have done without having any knowledge of the schema of the source database.

Having covered Extract, enough Transform to create documents out of tuples, and Load, the next logical step is to explore more Transform, in order to move the data that we have just loaded to a denormalized model.

For this, let me point you to Manuel Hurtado's excellent ELT with Couchbase and N1QL article.

Author's note

You can find the complete stable source of the SQSL project at http://www.sqsl.org/sqslsource.htm.

The current development tree can be found at http://github.com/marcogrecopriolo/sqsl

SQSL comes in the form of libraries, since the original project aimed to embed the runner in larger applications.

The source shows how to put the libraries to good use with several sample applications, in the form, not very imaginatively, of command line shells: you can choose from a command line interface, to curses, to NW.js, to NPAPI.

The examples above have been run using the CLI shell.

The runner connects to individual database engines using ad hoc data sources, in the form of shared libraries.

The number of available data sources is currently fairly limited.

The current implementation relies on Informix ESQL/C for decimal, date, datetime, and interval arithmetic. I have yet to write my own library.

The Couchbase data source and the JSON UDFs are alpha at best. The Couchbase data source does not yet parse incoming JSON documents.

There isn't an OSX port yet — that's just about the only port missing.
