
Ingest BTC.com and Blockchain.com Data via Apache NiFi


How to ingest Bitcoin and other cryptocurrency data from blockchain.com and btc.com.

· Integration Zone ·

Ingesting Blockchain Data From btc.com and blockchain.com (Formerly blockchain.info)

Blockchain exchanges contain a large amount of data that you need to ingest, process, store, and analyze if you wish to make informed trades and find the right time to buy and sell various currencies. Bitcoin, being the first and most valuable cryptocurrency, is the one I wish to track. Two sites provide great APIs for this data: btc.com and blockchain.com. Both offer fast, easy-to-ingest REST APIs that return clean JSON. Some endpoints, such as the blockchain.com blocks feed, return multiple JSON records in a single response, so I split them, which is easy with Apache NiFi. Once the data is flowing, I apply a prebuilt Avro schema (derived with the InferAvroSchema processor). I can then run live queries on the blockchain data as it enters the system, before storing it permanently in Apache Hive.
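The splitting step mentioned above is handled in NiFi by the SplitJson processor, which emits one flowfile per array element. The idea in plain Python is simply this (the payload below is illustrative, not live data):

```python
import json

# A response body holding multiple block records (illustrative values).
body = '{"blocks": [{"height": 527784}, {"height": 527785}, {"height": 527786}]}'

# NiFi's SplitJson with a JsonPath such as $.blocks[*] would emit one
# flowfile per element; here each element becomes its own JSON string.
records = [json.dumps(rec) for rec in json.loads(body)["blocks"]]
print(len(records))
```

Each resulting record can then be validated against the Avro schema and queried individually.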


API Calls

blockchain.com

  • https://blockchain.info/latestblock

  • https://api.blockchain.info/charts/transactions-per-second?timespan=5weeks&rollingAverage=8hours&format=json

  • https://blockchain.info/blocks/BTC.com?format=json

  • https://api.blockchain.info/pools?timespan=5days

  • https://api.blockchain.info/stats

  • https://blockchain.info/ticker

  • https://blockchain.info/tobtc?currency=USD&value=10000

Blocks For Today

  • https://chain.api.btc.com/v3/block/date/${now():format('yyyyMMdd')}
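The `${now():format('yyyyMMdd')}` fragment is NiFi Expression Language; it stamps today's date into the URL at fetch time. The same URL construction can be sketched in Python (note that NiFi's `now()` uses the server's local time zone, while this sketch assumes UTC):

```python
from datetime import datetime, timezone

def blocks_for_today_url() -> str:
    """Build the btc.com blocks-by-date URL, mirroring the NiFi
    expression ${now():format('yyyyMMdd')}; UTC is assumed here."""
    today = datetime.now(timezone.utc).strftime("%Y%m%d")
    return f"https://chain.api.btc.com/v3/block/date/{today}"

print(blocks_for_today_url())
```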

btc.com

  • https://chain.api.btc.com/v3/block/latest

  • https://chain.api.btc.com/v3/tx/unconfirmed

Deeper Use Case

Today's Blocks

This API returns records for all of today's blocks. We gather important information such as nonce, version, timestamp, hash, and pool, plus the previous/next hashes that link the blocks together.
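To illustrate the fields being gathered, here is a minimal Python sketch that pulls the linking fields out of one block record. The sample payload is abbreviated and its values are illustrative, not live data:

```python
import json

# Abbreviated sample of one btc.com block record (illustrative values).
sample = json.loads("""
{
  "height": 527784,
  "version": 536870912,
  "timestamp": 1529180826,
  "nonce": 236046944,
  "hash": "00000000000000000017ca1d74bdee575dd48d4b3513eea2f7e06b313883d73d",
  "prev_block_hash": "00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87",
  "next_block_hash": "0000000000000000000000000000000000000000000000000000000000000000",
  "tx_count": 486,
  "extras": {"pool_name": "BTC.com", "pool_link": "https://pool.btc.com"}
}
""")

def block_link_info(block: dict) -> dict:
    """Keep only the fields used to link blocks into a chain."""
    return {
        "height": block["height"],
        "hash": block["hash"],
        "prev_block_hash": block["prev_block_hash"],
        "next_block_hash": block["next_block_hash"],
        "pool": block["extras"]["pool_name"],
    }

print(block_link_info(sample))
```

A chain can be reconstructed by matching each record's `prev_block_hash` to another record's `hash`.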

Apache Avro Schema in JSON Format Stored in Hortonworks Schema Registry

{
 "type": "record",
 "name": "blocksfortoday",
 "fields": [
  {
   "name": "height",
   "type": "int",
   "doc": "Type inferred from '527784'"
  },
  {
   "name": "version",
   "type": "int",
   "doc": "Type inferred from '536870912'"
  },
  {
   "name": "mrkl_root",
   "type": "string",
   "doc": "Type inferred from '\"c8f658ad595854f4c8c510b672447e838a2746e8724bfb26d0d127e5a4421385\"'"
  },
  {
   "name": "timestamp",
   "type": "int",
   "doc": "Type inferred from '1529180826'"
  },
  {
   "name": "bits",
   "type": "int",
   "doc": "Type inferred from '389609537'"
  },
  {
   "name": "nonce",
   "type": "int",
   "doc": "Type inferred from '236046944'"
  },
  {
   "name": "hash",
   "type": "string",
   "doc": "Type inferred from '\"00000000000000000017ca1d74bdee575dd48d4b3513eea2f7e06b313883d73d\"'"
  },
  {
   "name": "prev_block_hash",
   "type": "string",
   "doc": "Type inferred from '\"00000000000000000000acf6259fffe63d36623c324f756faaf995a9e2896b87\"'"
  },
  {
   "name": "next_block_hash",
   "type": "string",
   "doc": "Type inferred from '\"0000000000000000000000000000000000000000000000000000000000000000\"'"
  },
  {
   "name": "size",
   "type": "int",
   "doc": "Type inferred from '189026'"
  },
  {
   "name": "pool_difficulty",
   "type": "long",
   "doc": "Type inferred from '11831716619811'"
  },
  {
   "name": "difficulty",
   "type": "double",
   "doc": "Type inferred from '4.940704885521827E12'"
  },
  {
   "name": "tx_count",
   "type": "int",
   "doc": "Type inferred from '486'"
  },
  {
   "name": "reward_block",
   "type": "int",
   "doc": "Type inferred from '1250000000'"
  },
  {
   "name": "reward_fees",
   "type": "int",
   "doc": "Type inferred from '15691427'"
  },
  {
   "name": "created_at",
   "type": "int",
   "doc": "Type inferred from '1529180835'"
  },
  {
   "name": "confirmations",
   "type": "int",
   "doc": "Type inferred from '1'"
  },
  {
   "name": "is_orphan",
   "type": "boolean",
   "doc": "Type inferred from 'false'"
  },
  {
   "name": "curr_max_timestamp",
   "type": "int",
   "doc": "Type inferred from '1529180826'"
  },
  {
   "name": "is_sw_block",
   "type": "boolean",
   "doc": "Type inferred from 'true'"
  },
  {
   "name": "stripped_size",
   "type": "int",
   "doc": "Type inferred from '153817'"
  },
  {
   "name": "weight",
   "type": "int",
   "doc": "Type inferred from '650477'"
  },
  {
   "name": "extras",
   "type": {
    "type": "record",
    "name": "extras",
    "fields": [
     {
      "name": "pool_name",
      "type": "string",
      "doc": "Type inferred from '\"BTC.com\"'"
     },
     {
      "name": "pool_link",
      "type": "string",
      "doc": "Type inferred from '\"https://pool.btc.com\"'"
     }
    ]
   },
   "doc": "Type inferred from '{\"pool_name\":\"BTC.com\",\"pool_link\":\"https://pool.btc.com\"}'"
  }
 ]
}

QueryRecord Query

 SELECT * FROM FLOWFILE WHERE CAST(tx_count AS INT) > 0 

With this QueryRecord query in Apache NiFi, I am only interested in records that have a non-zero transaction count, so we filter out records where tx_count is zero.
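The same filter, sketched in plain Python over a list of parsed records (the heights and counts below are illustrative):

```python
records = [
    {"height": 527784, "tx_count": 486},
    {"height": 527785, "tx_count": 0},    # filtered out
    {"height": 527786, "tx_count": 1203},
]

# Equivalent of: SELECT * FROM FLOWFILE WHERE CAST(tx_count AS INT) > 0
kept = [r for r in records if int(r["tx_count"]) > 0]
print([r["height"] for r in kept])
```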

Once we have ingested the data in real time with Apache NiFi, I can store it in the external table for which NiFi built the DDL. As you can see, it's an external table, since our Apache ORC files are stored outside the Hive data warehouse directory. To add more rows to the dataset, we simply create new Apache ORC files in that LOCATION directory.

Create an Apache Hive Table

CREATE EXTERNAL TABLE IF NOT EXISTS blocksfortoday1 (height INT, version INT, mrkl_root STRING, timestamp INT, bits INT, nonce INT, hash STRING, prev_block_hash STRING, next_block_hash STRING, size INT, pool_difficulty BIGINT, difficulty DOUBLE, tx_count INT, reward_block INT, reward_fees INT, created_at INT, confirmations INT, is_orphan BOOLEAN, curr_max_timestamp INT, is_sw_block BOOLEAN, stripped_size INT, weight INT, extras STRUCT<pool_name:STRING, pool_link:STRING>)
STORED AS ORC
LOCATION '/blocksfortoday1';

Example Apache Hive Query Run in Apache Zeppelin

SELECT * FROM blocksfortoday1 WHERE CAST(tx_count AS INT) > 500 ORDER BY created_at DESC;



Apache Zeppelin lets us query our current and continuously ingested Blocks For Today Hive data, as well as create charts and graphs. In the same environment, I can run Spark batch code, Spark machine learning, and Python deep learning libraries like Apache MXNet and TensorFlow.

