The Physics of Big Data
Data in motion needs to stay in motion; data at rest needs to rest forever, constantly growing in size, variety, and value.
Physics of Big Data
Big Data has all the properties of real-world objects and is subject to real-world physics. Inertia applies to the owners of data silos, crushed by the gravity of limited platforms constraining business functionality to a small subset of what is available, required, and needed.
With massive datasets at rest, you can use the deep toolkit available to easily process terabytes of data with the same tools for machine learning, streaming, and SQL.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer

// Quiet Spark's own logging so application output stays readable
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)
val logger: Logger = Logger.getLogger("com.dataflowdeveloper.sentiment.TwitterSentimentAnalysis")

// Application settings: backpressure, Kryo serialization, compression, event logging
val sparkConf = new SparkConf().setAppName("TwitterSentimentAnalysis")
sparkConf.set("spark.streaming.backpressure.enabled", "true")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.app.id", "Sentiment")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.eventLog.dir", "hdfs://tspannserver:8020/spark-logs")

val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Load the HBase-backed "tweets" table as a DataFrame through the Phoenix-Spark connector
val tweets = sqlContext.read.format("org.apache.phoenix.spark").options(
  Map("table" -> "tweets", "zkUrl" -> "tspannserver:2181:/hbase-unsecure")).load()

// Inspect the schema, count the rows, and print a small sample
tweets.printSchema()
tweets.count
tweets.take(10).foreach(println)
In our short Scala/Spark example, we are processing HBase data using the Phoenix-Spark interface. It's very easy to use a SQL metaphor to process this data.
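As a minimal sketch of that SQL metaphor, the snippet below registers a DataFrame as a temporary table and queries it with plain SQL. It assumes the Spark 1.x SQLContext API used above. The Tweet case class, its lang and sentiment columns, the tweets_sample table name, and the in-memory sample rows are all hypothetical stand-ins, since the schema of the real Phoenix tweets table is not shown here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TweetSqlSketch {
  // Hypothetical row shape; the real "tweets" table schema is not given in the article.
  case class Tweet(id: Long, lang: String, sentiment: Double)

  // Registers the rows as a temporary table and aggregates them with SQL.
  def averageSentimentByLang(sqlContext: SQLContext, rows: Seq[Tweet]): Map[String, Double] = {
    import sqlContext.implicits._
    val tweets = sqlContext.sparkContext.parallelize(rows).toDF()
    // "tweets_sample" is an illustrative name, not a table from the article.
    tweets.registerTempTable("tweets_sample")
    sqlContext
      .sql("SELECT lang, AVG(sentiment) AS avg_sentiment FROM tweets_sample GROUP BY lang")
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is assumed here only so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("TweetSqlSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val sample = Seq(Tweet(1L, "en", 0.5), Tweet(2L, "en", 1.0), Tweet(3L, "es", -0.5))
    averageSentimentByLang(sqlContext, sample).foreach(println)
    sc.stop()
  }
}
```

A DataFrame loaded through the Phoenix-Spark connector could be registered and queried the same way, with the column names swapped for whatever the real table defines.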
You need to have Data in Motion entering your Connected Data Platform from internal and external sources, in hundreds of formats from JSON to XML to Avro, with constantly changing schemas and fields. While data is being ingested, there are many valuable insights that can be queried in near real time with Spark Streaming and Storm, with machine learning models applied in transit and intelligent routing and transformation performed directly in-stream with Apache NiFi. Without a steady stream of different types of data, your system will grow cold, fewer users will query it, and it will gain inertia until it loses all use, agility, and function.
Petabytes of valuable data sit cold without energy, as business value is lost in the vacuum of inactivity.
How big does data have to be to reach a critical mass that demands action, simply by its massive volume and its effect on other systems, data, business users, and information technologists? Can you ignore gigabytes of data? Is any data too big to keep cheap, scalable, SQL-queryable, and readily available in your existing legacy vendor solutions, in your frame of reference, Big Data?
Is data in the yottabytes not Big Data if your Connected Data Platform allows your business users to easily query and extract value from it in real time with Hive LLAP? Is Big Data relative to absolute time and space? On my first computer with 4-bit bytes, 64K was Big Data because it was too big for me to store.
If my platform scales elastically and continuously ingests more data while keeping query times constant, is my data Big Data yet?
Is Big Data absolute or relative? If it's relative, then the frame of reference is usability and timeliness of delivery.
Wikipedia frames it in terms of traditional systems: "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them." A good reason to move to a modern connected data platform like Hadoop 2.7 is to set a new tradition. If Hadoop is the new standard and tradition for data processing applications, and this platform finds no data set too large or complex to deal with, is all data now just data? Is data that does not quickly yield insights of real business value just garbage, digital waste that serves no purpose? If you have petabytes of log files sitting on tapes, unanalyzed, inaccessible, and forgotten, does that data exist at all?
It's time to beat inertia and get your data in motion.
Examples of Data in Motion