Ingest BTC.com and Blockchain.com Data via Apache NiFi
How to ingest Bitcoin and other cryptocurrency data from blockchain.com and btc.com.
Blockchain exchanges generate a large amount of data that you need to ingest, process, store, and analyze if you wish to make well-informed trades and find the right time to buy and sell various currencies. Bitcoin, being the first and most valuable cryptocurrency, is the currency I wish to track. Two sites provide great APIs for this data: btc.com and blockchain.com. Both offer fast, easy-to-ingest REST APIs that return clean JSON. Some responses, such as the current blocks from the Blockchain Blocks endpoint, arrive as multiple JSON records in a single payload; splitting them into individual records is easy with Apache NiFi. Once split, I apply a prebuilt JSON-format Avro schema (derived with NiFi's InferAvroSchema processor). I can then run live queries on my blockchain data as it enters the system, before putting it into Apache Hive for permanent storage.
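In NiFi, the SplitJson processor handles breaking a multi-record payload into individual records. As a rough stand-in outside of NiFi, here is what that split looks like in Python. The response envelope used here (a `data` array of block objects) is an assumption modeled on the btc.com v3 API shape, not a captured response:

```python
import json

# Sample response shaped like a btc.com blocks-by-date reply
# (the envelope with "data" as a list is an assumption here).
response_text = json.dumps({
    "err_no": 0,
    "data": [
        {"height": 527784, "tx_count": 486},
        {"height": 527783, "tx_count": 1200},
    ],
})

payload = json.loads(response_text)

# NiFi's SplitJson with a JsonPath of $.data would emit one
# flowfile per array element; here each record becomes its
# own standalone JSON string.
records = [json.dumps(block) for block in payload["data"]]

for r in records:
    print(r)
```

Each resulting record can then be schema-validated and queried on its own, which is exactly what the NiFi record processors do downstream.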
API Calls
blockchain.com
https://blockchain.info/latestblock
https://api.blockchain.info/charts/transactions-per-second?timespan=5weeks&rollingAverage=8hours&format=json
https://blockchain.info/blocks/BTC.com?format=json
https://api.blockchain.info/pools?timespan=5days
https://api.blockchain.info/stats
https://blockchain.info/ticker
https://blockchain.info/tobtc?currency=USD&value=10000
Blocks For Today
https://chain.api.btc.com/v3/block/date/${now():format('yyyyMMdd')}
btc.com
https://chain.api.btc.com/v3/block/latest
https://chain.api.btc.com/v3/tx/unconfirmed
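The "Blocks For Today" URL above uses NiFi Expression Language, `${now():format('yyyyMMdd')}`, to inject the current date. A minimal Python sketch of the same URL construction:

```python
from datetime import datetime, timezone

# Mirror NiFi's ${now():format('yyyyMMdd')} expression:
# today's date as an 8-digit yyyyMMdd string.
date_part = datetime.now(timezone.utc).strftime("%Y%m%d")

url = f"https://chain.api.btc.com/v3/block/date/{date_part}"
print(url)
```

In the NiFi flow itself, this expression lives directly in the Remote URL property of an InvokeHTTP (or GetHTTP) processor, so each run automatically targets the current day.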
Deeper Use Case
Today's Blocks
This API returns records for all of today's blocks. We gather important information such as nonce, version, timestamp, hash, pool, and the previous/next hashes that link the blocks together.
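The previous/next hash fields are what chain the blocks together. A small Python sketch of that linkage, using hand-built sample records whose hash values are copied from the example data elsewhere in this article:

```python
# Two consecutive block records carrying the linking fields
# (field names match the API; values are sample data).
blocks = [
    {"height": 527783,
     "hash": "00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87"},
    {"height": 527784,
     "hash": "00000000000000000017ca1d74bdee575dd48d4b3513eea2f7e06b313883d73d",
     "prev_block_hash": "00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87"},
]

# Index blocks by hash, then follow prev_block_hash to find a
# block's parent, which is the same linkage described above.
by_hash = {b["hash"]: b for b in blocks}
tip = blocks[-1]
parent = by_hash[tip["prev_block_hash"]]
print(parent["height"])  # height of the block before the tip
```

This is also why the ingested rows keep `prev_block_hash` and `next_block_hash`: they let you walk the chain in either direction once the data is in Hive.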
Apache Avro Schema in JSON Format Stored in Hortonworks Schema Registry
{
  "type": "record",
  "name": "blocksfortoday",
  "fields": [
    {
      "name": "height",
      "type": "int",
      "doc": "Type inferred from '527784'"
    },
    {
      "name": "version",
      "type": "int",
      "doc": "Type inferred from '536870912'"
    },
    {
      "name": "mrkl_root",
      "type": "string",
      "doc": "Type inferred from '\"c8f658ad595854f4c8c510b672447e838a2746e8724bfb26d0d127e5a4421385\"'"
    },
    {
      "name": "timestamp",
      "type": "int",
      "doc": "Type inferred from '1529180826'"
    },
    {
      "name": "bits",
      "type": "int",
      "doc": "Type inferred from '389609537'"
    },
    {
      "name": "nonce",
      "type": "int",
      "doc": "Type inferred from '236046944'"
    },
    {
      "name": "hash",
      "type": "string",
      "doc": "Type inferred from '\"00000000000000000017ca1d74bdee575dd48d4b3513eea2f7e06b313883d73d\"'"
    },
    {
      "name": "prev_block_hash",
      "type": "string",
      "doc": "Type inferred from '\"00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87\"'"
    },
    {
      "name": "next_block_hash",
      "type": "string",
      "doc": "Type inferred from '\"0000000000000000000000000000000000000000000000000000000000000000\"'"
    },
    {
      "name": "size",
      "type": "int",
      "doc": "Type inferred from '189026'"
    },
    {
      "name": "pool_difficulty",
      "type": "long",
      "doc": "Type inferred from '11831716619811'"
    },
    {
      "name": "difficulty",
      "type": "double",
      "doc": "Type inferred from '4.940704885521827E12'"
    },
    {
      "name": "tx_count",
      "type": "int",
      "doc": "Type inferred from '486'"
    },
    {
      "name": "reward_block",
      "type": "int",
      "doc": "Type inferred from '1250000000'"
    },
    {
      "name": "reward_fees",
      "type": "int",
      "doc": "Type inferred from '15691427'"
    },
    {
      "name": "created_at",
      "type": "int",
      "doc": "Type inferred from '1529180835'"
    },
    {
      "name": "confirmations",
      "type": "int",
      "doc": "Type inferred from '1'"
    },
    {
      "name": "is_orphan",
      "type": "boolean",
      "doc": "Type inferred from 'false'"
    },
    {
      "name": "curr_max_timestamp",
      "type": "int",
      "doc": "Type inferred from '1529180826'"
    },
    {
      "name": "is_sw_block",
      "type": "boolean",
      "doc": "Type inferred from 'true'"
    },
    {
      "name": "stripped_size",
      "type": "int",
      "doc": "Type inferred from '153817'"
    },
    {
      "name": "weight",
      "type": "int",
      "doc": "Type inferred from '650477'"
    },
    {
      "name": "extras",
      "type": {
        "type": "record",
        "name": "extras",
        "fields": [
          {
            "name": "pool_name",
            "type": "string",
            "doc": "Type inferred from '\"BTC.com\"'"
          },
          {
            "name": "pool_link",
            "type": "string",
            "doc": "Type inferred from '\"https://pool.btc.com\"'"
          }
        ]
      },
      "doc": "Type inferred from '{\"pool_name\":\"BTC.com\",\"pool_link\":\"https://pool.btc.com\"}'"
    }
  ]
}
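Because the Avro schema is plain JSON, it is easy to inspect programmatically before registering it in the Schema Registry. A small standard-library sketch (using an abbreviated copy of the schema above) that maps field names to their types:

```python
import json

# Abbreviated copy of the blocksfortoday schema shown above.
schema_json = """
{"type": "record", "name": "blocksfortoday",
 "fields": [
   {"name": "height", "type": "int"},
   {"name": "tx_count", "type": "int"},
   {"name": "extras", "type": {"type": "record", "name": "extras",
     "fields": [{"name": "pool_name", "type": "string"}]}}
 ]}
"""

schema = json.loads(schema_json)

# Map top-level field names to a type label; nested records
# (like "extras") use their inner record name as the label.
def type_label(t):
    return t["name"] if isinstance(t, dict) else t

fields = {f["name"]: type_label(f["type"]) for f in schema["fields"]}
print(fields)
```

This kind of check is handy for confirming that what InferAvroSchema derived matches the column types you plan to use in the Hive DDL later on.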
QueryRecord Query
SELECT * FROM FLOWFILE WHERE CAST(tx_count AS INT) > 0
With this query in Apache NiFi's QueryRecord processor, I am only interested in records that have a nonzero transaction count, so we filter out records with zero transactions.
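Outside of NiFi, the same filter is just a predicate over the records. A minimal Python equivalent, using sample records shaped like the blocksfortoday schema:

```python
# Sample records shaped like the blocksfortoday schema.
records = [
    {"height": 527784, "tx_count": 486},
    {"height": 527785, "tx_count": 0},
]

# Equivalent of: SELECT * FROM FLOWFILE WHERE tx_count > 0
nonzero = [r for r in records if r["tx_count"] > 0]
print(nonzero)
```

In QueryRecord, the matching records continue down the flow on the relationship named after the query, while non-matching records are simply dropped from that path.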
Once we have ingested the data in real time with Apache NiFi, I can store it in the external table whose DDL NiFi generated. As you can see, it is an external table, since our Apache ORC files are stored outside of the Hive data warehouse directory. To add more rows to the dataset, we just have to write new Apache ORC files into that LOCATION directory.
Create an Apache Hive Table
CREATE EXTERNAL TABLE IF NOT EXISTS blocksfortoday1 (height INT, version INT, mrkl_root STRING, `timestamp` INT, bits INT, nonce INT, hash STRING, prev_block_hash STRING, next_block_hash STRING, size INT, pool_difficulty BIGINT, difficulty DOUBLE, tx_count INT, reward_block INT, reward_fees INT, created_at INT, confirmations INT, is_orphan BOOLEAN, curr_max_timestamp INT, is_sw_block BOOLEAN, stripped_size INT, weight INT, extras STRUCT<pool_name:STRING, pool_link:STRING>)
STORED AS ORC
LOCATION '/blocksfortoday1';
Example Apache Hive Query Run in Apache Zeppelin
SELECT * FROM blocksfortoday1 WHERE CAST(tx_count AS INT) > 500 ORDER BY created_at DESC;
Apache Zeppelin lets us query our continuously ingesting Blocks For Today Hive data, as well as create charts and graphs. In the same environment, I can run Spark batch code, Spark machine learning, and Python deep learning libraries such as Apache MXNet and TensorFlow.