While traditional analytics databases exist, Apache Hadoop is becoming the de facto data store for big data. It's an open-source software framework for distributed storage and distributed processing of very large data sets. This creates a need to transfer data from a MariaDB/MySQL operational data store into Hadoop. Tools such as Apache Sqoop can export data out of MariaDB/MySQL into Hadoop, but Sqoop operates as a batch application, so its performance is not suitable for streaming or real-time data transfer.
To address this need, the MariaDB MaxScale team has designed a modular solution with MariaDB MaxScale that streams binlog events coming from the Master database to the data lake via messaging systems such as Kafka’s distributed broker. The binlog events for inserts, updates and deletes are converted to AVRO or JSON format before they are forwarded to the data lake. Kafka is used as a data ingestion pipeline for distributed data processing environments. MariaDB MaxScale will be the Kafka producer, whereas big data platforms such as Hadoop, Cassandra, Spark or any other analytic database will be the consumer applications, consuming the data through the Kafka broker.
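To make the producer side of this pipeline a little more concrete, here is a minimal Python sketch of the forwarding step. It is not MaxScale code: the event fields (`database`, `table`, `event_type`, `row`), the one-topic-per-table naming and the `publish` callable are all assumptions made for illustration. In a real deployment, `publish` would wrap the send call of whichever Kafka client you use.

```python
import json


def forward_events(events, publish):
    """Serialize change events as JSON and hand them to a publisher.

    `events` is an iterable of dicts. `publish(topic, payload)` is a
    hypothetical stand-in for a Kafka producer's send call.
    Returns the number of events forwarded.
    """
    count = 0
    for event in events:
        # Assumed convention: one topic per source table.
        topic = "{}.{}".format(event["database"], event["table"])
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        publish(topic, payload)
        count += 1
    return count


# Usage: an in-memory list stands in for the Kafka broker.
sent = []
events = [
    {"database": "shop", "table": "orders", "event_type": "insert",
     "row": {"id": 1, "total": 9.5}},
    {"database": "shop", "table": "orders", "event_type": "delete",
     "row": {"id": 1, "total": 9.5}},
]
forwarded = forward_events(
    events, lambda topic, payload: sent.append((topic, payload))
)
```

Passing the publisher in as a callable keeps the sketch independent of any particular Kafka client library, and makes it easy to try out without a broker.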
MariaDB MaxScale Plugins for Data Streaming
The current MariaDB MaxScale binlog router provides change data capture and flow from the MariaDB Master database towards the MariaDB Slave database, while caching binlog events on the MaxScale server itself. By extending this approach, two new plugins are introduced in MariaDB MaxScale:
- Avro Router: To convert the change data events from binlog events to AVRO and JSON events.
- Change Data Protocol Plugin: To publish AVRO or JSON change data events to registered clients via the CDC Client API.
The avrorouter is a new MaxScale component that has been added to convert MySQL binary log events into AVRO records: it is essentially a binary log to AVRO file converter compatible with MariaDB 10.0 and 10.1. It consumes binary logs from a local directory and transforms them into a set of AVRO files. These files can then be queried by clients for various purposes.
This router is intended to be used in tandem with the Binlog Server. The Binlog Server can connect to a master server and request binlog records. These records can then be consumed by the “avrorouter” directly from the binlog cache of the Binlog Server. This allows MariaDB MaxScale to automatically transform binlog events on the master to local Avro format files.
The converted AVRO files can be requested at any time with the new CDC protocol plugin. This protocol should be used to communicate with the avrorouter. Clients can request either AVRO or JSON format data streams from a database table.
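As a rough illustration of the client side, the sketch below reads change records from a JSON format stream. It assumes the stream carries one JSON object per line, which is an assumption about the wire format rather than something stated above. The connection, authentication and request steps of the CDC protocol are deliberately left out, and the field names in the sample data are hypothetical.

```python
import io
import json


def read_json_records(stream):
    """Yield one dict per non-empty line of a binary stream.

    Assumes newline-delimited JSON. `stream` can be any object that
    iterates over bytes lines, for example the file object returned by
    socket.makefile("rb").
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        yield json.loads(line.decode("utf-8"))


# Usage: an in-memory buffer stands in for the network stream.
sample = io.BytesIO(
    b'{"event_type": "insert", "id": 1}\n'
    b'\n'
    b'{"event_type": "update_after", "id": 1}\n'
)
records = list(read_json_records(sample))
```

Because the function is a generator, a consumer can process each change record as it arrives instead of waiting for the stream to end.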
An AVRO file is a binary Object Container File that consists of a file header and one or more file data blocks. The header contains the JSON version of the schema.
Note: Each AVRO file contains data related to only ONE table.
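The header layout described above can be explored with a small Python sketch. It follows the Object Container File layout from the Apache Avro specification (a four-byte magic, a metadata map that holds the schema under the `avro.schema` key, then a 16-byte sync marker) and uses no Avro library. The sample header is built by hand, and the `orders` table and its single field are hypothetical; this is not how MaxScale itself reads or writes the files.

```python
import io
import json

MAGIC = b"Obj\x01"  # four-byte magic that starts an Avro container file


def write_long(value):
    """Encode an integer as an Avro long (zig-zag, variable-length)."""
    raw = (value << 1) ^ (value >> 63)  # zig-zag encoding
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        raw >>= 7
    out.append(raw)
    return bytes(out)


def read_long(buf):
    """Decode one Avro long from a binary stream."""
    shift = 0
    raw = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zig-zag encoding


def read_bytes(buf):
    """Read a length-prefixed byte string."""
    return buf.read(read_long(buf))


def read_header(buf):
    """Return (schema, metadata, sync_marker) from a container file header."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break  # a zero count ends the metadata map
        if count < 0:
            count = -count
            read_long(buf)  # a negative count is followed by a byte size
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    sync = buf.read(16)
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    return schema, meta, sync


def encode_bytes(data):
    return write_long(len(data)) + data


# Usage: build a header by hand for a hypothetical "orders" table.
schema_json = json.dumps(
    {"type": "record", "name": "orders",
     "fields": [{"name": "id", "type": "int"}]}
).encode("utf-8")
header = (
    MAGIC
    + write_long(2)  # two metadata entries follow
    + encode_bytes(b"avro.schema") + encode_bytes(schema_json)
    + encode_bytes(b"avro.codec") + encode_bytes(b"null")
    + write_long(0)  # end of the metadata map
    + bytes(range(16))  # 16-byte sync marker
)
schema, meta, sync = read_header(io.BytesIO(header))
```

For real files, an Avro library is the better choice; the point here is only to show that the schema travels inside the file header as JSON.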
AVRO relies on schemas. When AVRO data is read, the schema used when writing it is always present. AVRO schemas are defined with JSON. In the context of MariaDB MaxScale binlog-to-AVRO conversion, each AVRO file contains data related to one table, so for each Master database table there is a corresponding AVRO schema file on MariaDB MaxScale. A cdc-schema utility is provided to generate these AVRO schemas from the MariaDB database tables.
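To give a feel for what a per-table schema looks like, here is a minimal Python sketch that builds an AVRO record schema from a list of columns. It is not the cdc-schema utility and does not reproduce its output: the SQL-to-AVRO type mapping, the function name and the `orders` table are assumptions chosen for illustration.

```python
import json

# Illustrative SQL-to-AVRO type mapping. This is an assumption, not the
# mapping used by any MaxScale utility.
SQL_TO_AVRO = {
    "int": "int",
    "bigint": "long",
    "float": "float",
    "double": "double",
    "varchar": "string",
    "text": "string",
    "blob": "bytes",
}


def table_to_avro_schema(table, columns):
    """Build an AVRO record schema (as a dict) for one table.

    `columns` is a list of (column_name, sql_type) pairs. Unknown SQL
    types fall back to "string".
    """
    fields = [
        {"name": name, "type": SQL_TO_AVRO.get(sql_type.lower(), "string")}
        for name, sql_type in columns
    ]
    return {"type": "record", "name": table, "fields": fields}


# Usage: a hypothetical "orders" table with three columns.
schema = table_to_avro_schema(
    "orders", [("id", "INT"), ("customer", "VARCHAR"), ("total", "DOUBLE")]
)
schema_text = json.dumps(schema, indent=2)
```

Since AVRO schemas are plain JSON, `schema_text` could be written straight to a schema file, one per table.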
Next up, we’ll publish blogs on how to use MariaDB MaxScale for data streaming, including:
- MariaDB MaxScale 2.0 Configuring MariaDB Master and MariaDB MaxScale for Data Streaming Service.
- How to Stream Change Data through MariaDB MaxScale using CDC API.
- Real-time Data Streaming to Kafka with MaxScale CDC.