Big Data Resources

The Latest Big Data Topics

Streaming Big Data: Storm, Spark and Samza

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline. Apache Spark Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations). Apache Samza Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka. Common Ground All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations. The three frameworks use different vocabularies for similar concepts: Comparison Matrix A few of the differences are summarized in the table below: There are three general categories of delivery patterns: At-most-once: messages may be lost. This is usually the least desirable outcome. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases. Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident. Use Cases All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines. If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel... Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time. A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu… If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale… Conclusion We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

February 28, 2015

by Tony Siciliani

· 32,792 Views · 5 Likes

How-To: Setup Development Environment for Hadoop MapReduce

This post is intended for folks who are looking out for a quick start on developing a basic Hadoop MapReduce application. We will see how to set up a basic MR application for WordCount using Java, Maven and Eclipse and run a basic MR program in local mode , which is easy for debugging at an early stage. Assuming JDK 1.6+ is already installed and Eclipse has a setup for Maven plugin and download from default maven repository is not restriced. Problem Statement : To count the occurrence of each word appearing in an input file using MapReduce. Step 1 : Adding Dependency Create a maven project in eclipse and use following code in your pom.xml. 4.0.0 com.saurzcode.hadoop MapReduce 0.0.1-SNAPSHOT jar org.apache.hadoop hadoop-client 2.2.0 Upon saving it should download all required dependencies for running a basic Hadoop MapReduce program. Step 2 : Mapper Program Map step involves tokenizing the file, traversing the words, and emitting a count of one for each word that is found. Our mapper class should extend Mapper class and override it’s map method. When this method is called the value parameter of the method will contain a chunk of the lines of file to be processed and the output parameter is used to emit word instances. In real world clustered setup, this code will run on multiple nodes which will be consumed by set of reducers to process further. public class WordCountMapper extends Mapper { private final IntWritable ONE = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while(tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, ONE); } } } Step 3 : Reducer Program Our reducer extends the Reducer class and implement logic to sum up each occurrence of word token received from mappers.Output from Reducers will go to the output folder as a text file ( default or as configured in Driver program for Output format) named as part-r-00000 along with a _SUCCESS file. public class WordCountReducer extends Reducer { public void reduce(Text text, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) { sum += value.get(); } context.write(text, new IntWritable(sum)); } } Step 4 : Driver Program Our driver program will configure the job by supplying the map and reduce program we just wrote along with various input , output parameters. public class WordCount { public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Path inputPath = new Path(args[0]); Path outputDir = new Path(args[1]); // Create configuration Configuration conf = new Configuration(true); // Create job Job job = new Job(conf, "WordCount"); job.setJarByClass(WordCountMapper.class); // Setup MapReduce job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setNumReduceTasks(1); // Specify key / value job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); // Input FileInputFormat.addInputPath(job, inputPath); job.setInputFormatClass(TextInputFormat.class); // Output FileOutputFormat.setOutputPath(job, outputDir); job.setOutputFormatClass(TextOutputFormat.class); // Delete output if exists FileSystem hdfs = FileSystem.get(conf); if (hdfs.exists(outputDir)) hdfs.delete(outputDir, true); // Execute job int code = job.waitForCompletion(true) ? 0 : 1; System.exit(code); } } That’s It !! We are all set to execute our first MapReduce Program in eclipse in local mode. Let’s assume there is an input text file called input.txt in folder input which contains following text : foo bar is foo count count foo for saurzcode Expected output : foo 3 bar 1 is 1 count 2 for 1 saurzcode 1 Let’s run this program in eclipse as Java Application :- We need to give path to input and output folder/file to the program as argument.Also, note output folder shouldn’t exist before running this program else program will fail. java com.saurzcode.mapreduce.WordCount input/inputfile.txt output If this program runs successfully emitting set of lines while it is executing mappers and reducers, we should see a output folder and with following files : output/ _SUCCESS part-r-00000

January 30, 2015

by Saurabh Chhajed

· 13,106 Views · 1 Like

Lambda Architecture for Big Data

An increasing number of systems are being built to handle the Volume, Velocity and Variety of Big Data, and hopefully help gain new insights and make better business decisions. Here, we will look at ways to deal with Big Data’s Volume and Velocity simultaneously, within a single architecture solution. Volume + Velocity Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor that is targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high-latency of batch systems, and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. Therefore we need both batch and real-time to run in parallel, and add a real-time computational system (e.g. Apache Storm) to our batch framework. This architectural combination of batch and real-time computation is referred to as a Lambda Architecture (λ). Generic Lambda λ has three layers: The Batch Layer manages the master data and precomputes the batch views The Speed Layer serves recent data only and increments the real-time views The Serving Layer is responsible for indexing and exposing the views so that they can be queried. The three layers are outlined in the below diagram along with a sample choice of technology stacks: Incoming data is dispatched to both Batch and Speed layers for processing. At the other end, queries are answered by merging both batch and real-time views. Note that real-time views are transient by nature and their data is discarded (making room for newer data) once propagated through the Batch and Serving layers. Most of the complexity is pushed onto the much smaller Speed layer where the results are only temporary, a process known as “complexity isolation“. We are indeed isolating the complexity of concurrent data updates in a layer that is regularly purged and kept small in size. λ is technology agnostic. The data pipeline is broken down into layers with clear demarcation of responsibilities, and at each layer, we can choose from a number of technologies. The Speed layer for instance could use either Apache Storm, or Apache Spark Streaming, or Spring “XD” ( eXtreme Data) etc. How do we recover from mistakes in λ ? Basically, we recompute the views. If that takes too long, we just revert to the previous, non-corrupted versions of our data. We can do that because of data immutability in the master dataset: data is never updated, only appended to (time-based ordering). The system is therefore Human Fault-Tolerant: if we write bad data, we can just remove that data altogether and recompute. Unified Lambda The downside of λ is its inherent complexity. Keeping in sync two already complex distributed systems is quite an implementation and maintenance challenge. People have started to look for simpler alternatives that would bring just about the same benefits and handle the full problem set. There are basically three approaches: 1) Adopt a pure streaming approach, and use a flexible framework such as Apache Samza to provide some type of batch processing. Although its distributed streaming layer is pluggable, Samza typically relies on Apache Kafka. Samza’s streams are replayable, ordered partitions. Samza can be configured for batching, i.e. consume several messages from the same stream partition in sequence. 2) Take the opposite approach, and choose a flexible Batch framework that would also allow micro-batches, small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Spark streaming is essentially a sequence of small batch processes that can reach latency as low as one second.Trident is a high-level abstraction on top of Storm that can process streams as small batches as well as do batch aggregation. 3) Use a technology stack already combining batch and real-time, such as Spring “XD”, Summingbird or Lambdoop. Summingbird (“Streaming MapReduce”) is a hybrid system where both batch/real-time workflows can be run at the same time and the results merged automatically.The Speed layer runs on Storm and the Batch layer on Hadoop, Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing paradigms: The integrated approach (unified λ) seeks to handle Big Data’s Volume and Velocity by featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. And with a unified framework, there would be only one system to learn, and one system to maintain.

January 17, 2015

by Tony Siciliani

· 40,545 Views · 6 Likes

Mutliple Table Insert Using a Single POST - Common Coding Examples

This is part of a series of blogs from Espresso Logic’s Application Engineering team on common coding tasks and how they can be accomplished more efficiently using Espresso Logic. The purpose of these blogs is two fold: Provide code examples of how Espresso system can be used to solve certain use cases Provide code samples that you can use within your programs while developing apps This blog deals with a design pattern all web and mobile transaction applications have to deal with – how to create a multiple table self registration using a single POST. This is both a performance issue, reducing the number of calls to and from the server, as well as a demonstration of how Espresso Logic advanced design handles complex transactions without coding. In the prior blog we dealt with rules used to validate credit cards. Self Registration Design Pattern User self registration is a standard design pattern for some web and mobile applications. This becomes a bit more challenging when the database model is using multiple tables. Using Espresso Logic , you can create a multiple table insert using a single POST. The back-end server can be run in the cloud or on-premise to connect to your database and quickly expose RESTful endpoints for each of your SQL tables, views, and stored procedures. Creating a compound nested document (called a Resource) will allow us to design an API that will join related tables into a single REST endpoint. Developers can design a Mobile front-end using a drag-and-drop tools that bind each of the text fields with the JSON nested document fields. Using a single POST, this data is sent to the Espresso Logic REST server to handle the details. Self Registration Account Setup Data Model The model below shows tables that hold each of the relevant parts of the self-registration entry. Each of these tables has an auto increment primary key and a foreign key relationship to the Person table (e.g. Password, PersonPhone, Address, EmailAddress) to support the one-to-many relationship cardinality. So how can we create a single POST to insert into multiple tables and propagate the primary key? Connecting to any SQL database, Espresso Logic will instantly create REST API definitions for every table, view, and stored procedure. Multiple table data model Create new Resource Next, we us the Espresso Logic Design Studio to select one or more tables from a point-and-click interface to create this new compound document endpoint (see Resource documentation), which is a combination of each of the child tables. The ‘Join’ for each child is be automatically completed using the existing relationships defined in the schema. The unused Attributes for the POST can be excluded including the primary key (auto increment) and the foreign parent key PersonID (these are handled by the server) in each child table. Multiple Table Resource Setup POST JSON Sample Data The REST Lab in Espresso Logic Design Studio can be used to test this new resource. We enter all the values of the JSON self registration (shown below) and use the POST command. { "Title": "Mr", "FirstName": "Test", "MiddleName": "M", "LastName": "Record5", "Phone": [ { "PhoneNumber": "(407) 555-1216", "PhoneNumberTypeID": 1 } ], "Address": [ { "AddressTypeID": 2, "AddressLine1": "555 Main St", "AddressLine2": "Apt 6", "City": "Maitland", "StateProvinceID": 15, "PostalCode": "32751" }, { "AddressTypeID": 1, "AddressLine1": "555 Main St", "AddressLine2": "6", "City": "Maitland", "StateProvinceID": 15, "PostalCode": "32751" } ], "Email": [ { "EmailAddress": "[email protected]" } ], "Password": [ { "PasswordHash": " password6", "Username": "user" } ] } How it works The Espresso Logic REST Server will take this JSON and perform the necessary validations, derivations, and event processing (using Reactive Logic Programming) and then insert the Person first (returning the primary key) and then propagate this down to each of the related children in a single transaction. All the logic, validations, and events must succeed or the entire transaction is rolled back. This is not an easy trick to perform for any REST API – so do not try this at home without a deep understanding of SQL, transactions, the data model and relationships between parent and child. Espresso Logic does this out-of-the-box with no code required. Live Browser Espresso Logic offers Live Browser which is an Instant HTML5/Angular view of your data using the active schema. This will lets us view the one-to-many relationships of the self registration process. Try it yourself and see the power of Espresso Logic. View the samples on GitHub. Espresso Logic ‘Mutliple Table insert using a Single POST’ examples are part of the extended demo. Summary The mobile and web developer want a simple and single REST endpoint to populate and update user information. The server should be able to handle the complexity and hide the underlying data model. Using the Espresso Logic back-end server eliminates all the code required to connect to the database, create a nested document endpoint, and propagate the primary key to all the related child tables. Live Browser gives you an instant view to see and test your results – again, no code. This is the fastest way to build and deliver mobile solutions on the market today.

December 30, 2014

by Val Huber

CORE

· 19,058 Views · 1 Like

Learn R: How to Extract Rows and Columns From Data Frame

This article represents command set in R programming language, which could be used to extract rows and columns from a given data frame.

December 8, 2014

by Ajitesh Kumar

· 1,105,212 Views · 5 Likes

How Does Elasticsearch Real-time Search?

Compared to other features, real-time search capability is undoubtedly one of the most important features in Elasticsearch. Today we’ll look closely how is provided real-time search by Elasticsearch. Real time First of all, if we need to explain the concept of real-time, in general, we can say that the delay between input and out time in the information is small at real-time systems. This means, data is taken without data accumulation, processed in real time. Today, the best solution Elasticsearch known for real-time search, when a record is added to it for storage makes it searchable in 1 second. How? As is known, the disks are able to create a risk of bottleneck for I/O operations at the data persistence step. Also some mechanisms used for prevent any loss of data increases cost of time. At this point Elasticsearch uses the file-system cache that sitting between itself and the disk for overcome the risk of bottleneck and ensure the a new document can be searched in real time. A new segment is written to the file-system cache first and only later it flushed to disk by Elasticsearch. This lightweight process of writing and opening a new segment is called a refresh in Elasticsearch. By default, all shards is refreshed automatically once every second. In this way, Elasticsearch support real-time search. Test time Above digression about the time of refresh of the shards you can bring to mind the following questions: What happens, when a new document is requested in less than 1 second time? Can be documents requested, without having to depend of the refresh period shards of managed by Elasticsearch? Short answers. Elasticsearch does not return the document. Yes. Now let’s get clarity on this issue is a simple example. hakdogan$ curl -XPUT localhost:9200/kodcucom/document/1 -d'{ > "title": "Document A" > }' We sent a document to Elasticsearch. The index name is kodcucom, type document, id value 1. The title field is only field in the document and the value of "Document A". Let’s take this document from Elasticsearch. hakdogan$ curl -XGET localhost:9200/kodcucom/document/1?pretty { "_index" : "kodcucom", "_type" : "document", "_id" : "1", "_version" : 1, "found" : true, "_source":{ "title": "Document A" } } As expected, the document was returned to us. Well, if we keep short the time between document recording and get request than default shard refresh time what will happen? Let’s see. hakdogan$ curl -XPUT localhost:9200/kodcucom/document/2 -d'{"title": "Document B"}'; curl -XGET localhost:9200/kodcucom/_search?pretty {"_index":"kodcucom","_type":"document","_id":"2","_version":1,"created":true}{ "took" : 38, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "kodcucom", "_type" : "document", "_id" : "1", "_score" : 1.0, "_source":{ "title": "Document A" } } ] } } As can be seen, only the previous document was returned to us by Elasticsearch when we do concurrently create and get request. Well, how can I get the document concurrently? Let’s see. hakdogan$ curl -XPUT localhost:9200/kodcucom/document/3 -d'{"title": "Document C"}'; curl -XGET localhost:9200/kodcucom/_refresh; curl -XGET localhost:9200/kodcucom/_search?pretty {"_index":"kodcucom","_type":"document","_id":"3","_version":1,"created":true}{"_shards":{"total":10,"successful":5,"failed":0}{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 1.0, "hits" : [ { "_index" : "kodcucom", "_type" : "document", "_id" : "1", "_score" : 1.0, "_source":{ "title": "Document A" } }, { "_index" : "kodcucom", "_type" : "document", "_id" : "2", "_score" : 1.0, "_source":{"title": "Document B"} }, { "_index" : "kodcucom", "_type" : "document", "_id" : "3", "_score" : 1.0, "_source":{"title": "Document C"} } ] } } In this command, we perform to refresh operation on kodcucom index before the search request. In this way, the document was returned to us. Auto refresh time can be changed. By setting the index.refresh_interval parameter in the configuration file. Applies to all indices in the cluster. A per-index basis by updated index setting. In addition to these, you can turn off automatic refresh. An important point to keep in mind about the refresh time of the shards, the refresh operation is costly in terms of system resources. If you wished to make changes to the auto-refresh time, this situation should be taken into account. Extension of the automatic refresh time, enables faster indexing but new documents and changes made to the existing documents will not appear in searches during specified period of time.

November 25, 2014

by Hüseyin Akdoğan

CORE

· 17,489 Views

NewTypes Aren't As Cool As You Think

My last post talked about what’s wrong with type classes (in general, but also specifically in Haskell). This post generated some great feedback on Reddit, including some valid criticism that I didn’t explain why I hated on newtypes so much. I took some of that feedback and incorporated it into a revised version of the post, but I have even more to say about “newtypes," so I decided to write another blog post. What’s in a Newtype Newtypes are a feature of Haskell that let you define a new type in terms of an existing type. In the following example, I create a newtype for Email, which “holds” a String. newtype Email = Email String They are similar to type synonyms (type Email = String), except that type synonyms don’t create new types, they just allow you to refer to existing types by other names. Every newtype can be easily translated into a data declaration. In fact, only the keyword changes: data Email = Email String There’s a slight semantic difference between the two, but for purposes of this blog post, any criticism I have against newtypes apply equally to similar constructs modeled with type or data. The Promise of Newtypes Newtypes are used to provide and select between alternate implementations of type classes for some base types. I think that’s a hack (albeit a necessary one), but I’ve already talked about this so I won’t belabor it here. The other promise of newtypes is that we can use them to make our code more type safe. Instead of passing around String as an email, for example, we can create a super lightweight “wrapper” around String called Email, and make it an error to use a String wherever an Email is expected. This practice isn’t restricted to Haskell. Even in Java, it’s considered good coding practice to wrap primitives with classes whose names denote the meaning of the wrapper (Email, SSN, Address, etc.). There’s a part of this promise that’s certainly true. If I have to define a function accepting four parameters, and three of them are strings, but one of those strings denotes an email, then I have two choices: Model the email parameter with a String. In this case, I may accidentally use the email where I intended to use the other two string parameters, or I may use one of the other two string parameters where I intended to use the email. Considering just these choices, there are five ways my program may go wrong if I use the wrong name in the wrong position. Model the email parameter with a newtype. In this case, I cannot use the email where I intended to use the other two string parameters, because the compiler may stop me. Similarly, I cannot use the other two string parameters where I intended to use the email, for the same reason. Looking at just these choices, there are 0 ways my program may go wrong. Thus, newtypes, like all good FP practices, reduce the number of ways my program can go wrong. Unfortunately, in my opinion, they don’t go nearly far enough. False Security For most intent and purposes, newtypes are isomorphic to the single value they hold. In my preceding example, given a String, I can get an email (Email "foo"). Given an Email, I can also get a String, e.g. by pattern matching on the Email constructor. Stated differently, and also approximately because I’m ignoring bottom: the String and Email types are isomorphic; they contain the same inhabitants, for any useful definition of “same”. The only substantive difference between the preceding String and Email is the name of the data constructor (call Email an AbergrackleFoozyWatzit, and what has changed?). Hence, my previous criticism of newtypes as “programming by name”. By themselves, newtypes don’t really reduce the number of ways my program can go wrong. They just make it a bit harder to go wrong. But any newtype is isomorphic to the value it holds, and it’s trivial to convert between the two. In fact, if my code doesn’t need to convert between the two (either directly or indirectly), then it’s better off generic. That is, if I never need to convert an Email to a String, or a String to an Email, then I should really write the code generically to work with any value (even if that means making data structures or functions more polymorphic). Parametricity provides a massive reduction in the number of ways my program can go wrong. Newtypes, on the other hand, just make it a bit harder to go wrong, by adding one layer of indirection. In this example, as with many newtypes, I’ve created a bad isomorphism. The domain model of an email is not isomorphic to the data model of a string. But by using a newtype, I have implicitly declared that they are isomorphic. Calling a string an email may make me feel better, because of the different name, but fundamentally, with a newtype, it’s still a string, and I’m only ever one more step away from going wrong. In my experience, too many newtypes create an isomorphism between things that, properly modeled, are not isomorphic. Fortunately, there’s a well-worn workaround that lets us get more mileage out of newtypes. Smart Constructors If I define Email in a module, I can make its data constructor private, and export a helper function to construct an Email. Such helper functions are called smart constructors. They can be used to break the natural isomorphisms created by newtyping. An example is shown below: newtype Email = MkEmail String mkEmail :: String -> Maybe Email mkEmail s = ... In this example, I create a smart constructor which does not promise that it can turn every string into an email. It promises only that it might be able to turn a string into an email, by returning a Maybe Email. With the smart constructor approach, I’ve modeled the fact that while every email has a string representation, not every string has an email representation. Going back to my earlier example of passing same-typed parameters to a function, if I use a smart constructor, then while I can still use an email anywhere a string is expected (by converting), I can’t use a string anywhere an email is expected. (Well, ignoring the fromJust abomination!) Smart Constructors, Dumb Data Smart constructors take us one step closer toward modeling data in a type safe fashion. Unfortunately, I still don’t think it’s far enough. With smart constructors, our data model is fundamentally underconstrained, so we patch that up by restricting who can create the data. That’s putting a band-aid on the real problem. Why not just solve the root issue — viz., that our data model is underconstrained? Dumb Constructors, Smart Data The best solution to a great many newtype problems, I believe, is creating a data model where there is a true isomorphism between the entity modeled by our data and the values passed to the data constructor. That is, creating a data model such that there exists no regions in our data’s state space which correspond to invalid states. Email is a simple example, because there are well-defined models for what constitutes a data model, which can be translated into data declarations in straightforward, if tedious fashion. (To some extent, it’s a failure of most languages I know that such specifications cannot be easily translated into code without tedious boilerplate!) When our data declaration precisely fits our data model specification, there’s no need for smart constructors, and no need for newtypes. There’s far fewer ways that code can go wrong, and because our domain model is captured precisely by our data model, we can transform that data model in ways that make semantic sense (e.g. transforming just the name part of an email, since we’re now in the realm of structured data). Summary As I’ve explained in this blog post, I don’t really hate newtypes. I think they’re very useful, and I do use them, because they make it more difficult for my programs to go wrong. Ultimately, however, I think a lot of problems solved with newtypes (modeling coordinates, positions, emails, etc.) are better solved by more precise data modeling. That is, by making our programs stop lying about isomorphisms. Precision may be tedious due to limitations of the languages we work in, but honestly, what’s more tedious than debugging broken code?

November 20, 2014

by John De Goes

· 10,787 Views

The No Fluff Introduction to Big Data

big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. this original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (nosql databases, mapreduce, hadoop, newsql, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. they just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data. the term quickly became a buzzword for every sort of data processing product’s marketing team. big data became a catchall term for anything that handled non-trivial sizes of data. sean owen, a data scientist at cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. but this is true for a 400 person survey as well, and most people wouldn’t consider that very big. the key part missing from that definition is the transformation of unstructured data batches into structured datasets. it doesn’t matter if the database is relational or non-relational. big data is not defined by a number of terabytes, it’s rooted in the push to discoverhidden insights in data that companies used to disregard or throw away. due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and two, business intelligence and insights should be acquired from analysis of the data. acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting. what are the characteristics of big data? tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, mp3s, social media communication, server traffic logs, purchase history, and demographics. and it’s not just the data itself, but data about the data (metadata). it is a barrage of information on every level. what is it that makes this mountain of data big data? one of the most helpful models for understanding the nature of big data is “the three vs:” volume, velocity, and variety. data volume volumeis the sheer size of the data being collected. there was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like google and facebook handling petabytes of information about users’ digital activities. the size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. one of the most popular models for big data architecture comes from google’s mapreduce concept, which was the basis for apache hadoop, a popular data management solution. data velocity velocityis a problem that flows naturally from the volume characteristics of big data. data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. certain types of high-velocity data, such as streaming data, needs to be moved into storage and processed on the fly. this is often referred to as complex event processing (cep). the ability to intercept and analyze data that has a lifespan of milliseconds is a widely sought after. this kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds. data variety variety refers to the source and type of data that is being collected. this data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. the challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. this kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a raw form. sometimes a better option is to use a schema-less, non-relational database. how do you manage big data? the three vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. however, big data is not just about the data itself, but the way that it is handled. a popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data. · store: can you store the vast amounts of data being collected? · process: can you organize, clean, and analyze the data collected? · access: can you search and query this data in an organized manner? the store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “big” is relative. the data batches that challenge some companies could be moved through a single google datacenter in under a minute. the only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. there are several high level approaches that companies have turned to in the last few years. the traditional approach the traditional method for handling most data is to use relational databases. data warehouses are then used to integrate and analyze data from many sources. these databases are structured according to the concept of “early structure binding”—essentially, the database has predetermined “questions” that can be asked based on a schema. relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. when you perform functions such as join on many large data sets, the volume can be a problem as well. instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand. the newer approach - mapreduce, hadoop, and nosql databases in the early 2000s, web giant google released two helpful web technologies: google file system (gfs) and mapreduce. both were new and unique approaches to the growing problem of big data, but mapreduce was chief among them, especially when it comes to its role as a major influencer of later solution models. mapreduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. mapreduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as hadoop. hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace. besides hadoop, there was a host of non-relational (nosql) databases that emerged around 2009 to meet a different set of demands for processing big data. whereas hadoop is used for its massive scalability and parallel processing, nosql databases are especially useful for handling data stored within large multi-structured datasets. this kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that hadoop is running. the solution for many businesses ends up being a combination of these approaches to data management. finding hidden data insights once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (bi) from the datasets you’ve collected. this problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. the best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. the thing is, there’s so many types of analytic models and different ways of providing infrastructure for this process. your analytics solution should scale, but to what degree? scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm. ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can’t then create actionable intelligence. part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. but even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. as organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers. [1] http://www.quora.com/what-is-big-data 2014 guide to big data this guide explores the meaning of big data, how businesses use it, and uncovers new tools and techniques for the future of big data. this guide includes: detailed profiles on 43 big data vendor solutions in-depth articles written by industry experts results from our survey of 850 it professionals "finding the database for your use case" download now

September 25, 2014

by Benjamin Ball

· 10,686 Views · 1 Like

The Near Future of IoT

[This article was written by Sean Lorenz.] Pundits within the technology sphere have been calling 2014 the year of the Internet of Things (IoT). The market revenue potentials are forecasted into the trillions and it’s a Fortune 500 land grab with major companies moving quickly to stake their claims [1]. If this sounds a bit like pages from an American Wild West history book, a frontier analogy isn’t too far off. This is an exciting turning point in technology that—thanks to advances in plummeting sensor costs, wireless communication, and chip size reduction—will soon make today’s futuristic IoT concepts seem humorous in retrospect. While it’s difficult to see where the market is going, given the exponential rate of change in IoT technology, I have noticed several key trends emerging. As a fellow IoT prospector on the frontier, this is my account of the most evident trends as well as some educated predictions for the future. Trends 1. Business Value Over Technology Focus Like any promising new technology still in its infancy stage, the true innovation stems from tech-savvy researchers and tinkerers that build fascinating devices that sometimes have no consumer base–I’m looking at you, robotics market. We have all heard about the smart toothbrushes and smart egg trays coming to market and thought: “Interesting! I wouldn’t buy one, but… sure!” Perhaps the biggest trend is a shift from thinking, “let’s build it because we can” to “what business problem are we solving here?” IoT developers are getting wise to this mentality and building user-focused MVPs (Minimum Viable Products) that will begin hitting the market in late 2014 and early 2015. 2. Keeping It Real At my company, Xively, we often get asked what are the real use cases for the IoT. Many times our customers walk in the door with a vague idea of how connecting their product or service to the Internet would be potentially interesting, but need a little help with seeing how an IoT-enabled product can transform their business—internally and externally. The reason for this is that most of the exciting, transformative elements happen under the hood. Right now, the true “wow” moments in the industry are far from sexy: energy savings in enterprise complexes, CRM & ERP integration, service and support, supply chain efficiencies, product part failure and alert, and so on… you get the idea. Smart homes that respond to our every whim are really great ideas, but these products aren’t integral to our lives yet. Large manufacturing companies and enterprises are using the Internet of Things to manage internal operations and efficiency while also engaging their customers more fully with new IoT data sources aggregated in existing services like Salesforce1 or SAP. 3. Publish-Subscribe The IoT protocol wars are heating up, but allegiances aside, publish-subscribe messaging is what the bulk of implemented models use for connecting devices to the cloud. Pub-sub protocols such as MQTT, CoAP, and AMQP are attractive for connected product development thanks to their ease of scalability and many-to-one/one-to-many possibilities. Given the massive variance of the IoT market, there is bound to be more than one protocol that wins in the end; yet before we get to that point, there are plenty of bugs and vulnerabilities to patch across all of the thriving, open IoT protocols out there. 4. Security Panic! Hacked refrigerators, big box stores, and security cameras… oh my! There has been no shortage of concern for privacy, security, and compliance in the Internet of Things space. Like any news story, some of this attention is warranted and some overblown. Just like your pre-IoT old-fashioned Internet, creating specific application keys and advanced permissioning systems for hardware connecting to the cloud is essential. The amount of nodes at the edge connecting to services across the Internet will be far larger than anything we see now, but IoT platforms are already addressing these complex device lifecycle management issues that are crucial for protecting personal and enterprise information in a connected world. Near-Term Predictions Now lets hop in the DeLorean and look into the future. Rather than focus on five, ten, or twenty years into the future, let’s focus only on the next few years. Why? As I mentioned in the beginning, the IoT landscape changes on a day-to-day basis, so even a prediction looking forward six-months from now can be unreliable. This list contains no self-driving cars or sentient AIs. Instead, it makes some pretty sure bets for what to expect over the horizon. 1. A Household Name Usually the second question after “what’s your name?” at a dinner party is the inevitable “so what do you do?” Mentioning the Internet of Things to non-techies still draws blank stares and looks of confusion. Those looks are justified given the not-so-great marketing name of IoT and the myriad definitions trying to explain what it actually is. Whether it’s called the Internet of Things, Internet of Everything, or just the good ol’ Internet, the concept of connecting any and everything to the Internet will begin to make sense for everyday consumers. 2. Consumers Slow to Adopt Many IoT products are still just toys in many people’s minds. Startups are building products that address problems which most consumers don’t see as a problem yet. This isn’t to say the consumer IoT market will evaporate. It just means we need to get smarter about what customers actually want from smart devices. Today’s wearable products remind me of the Newton—Apple’s infamous PDA. The problem wasn’t the idea, but rather the timing. The Apple Newton seemed clunky, not very powerful, and low on the usability scale. Years later, the iPhone and iPad came along with a set of features and a form factor that customers were looking for. The same feels true of wearables right now—they may need a few more years to incubate before the general public gives two thumbs up. Other consumer IoT markets such as the smart home or driverless cars seem to be in the same situation as the wearables market, but this is changing quickly with major players like Apple and Google moving into these arenas. For example, in the home automation space, frameworks like Apple HomeKit will be essential for unifying disparate protocols and clouds into one application that can handle various products’ data, automating much of the technology and pushing it into the background. I am sure there is a brilliant developer learning Swift and building the first killer smart home app as we speak. 3. Analytics and Automation This prediction probably comes as no surprise, but it is worth stating. Most companies willing to foray into the IoT unknown are, for now, happy with connecting their devices to an external application or cloud service. Having a place to send the data is usually the first step in constructing an IoT system. So what do you do with all this data once you have it? Reporting tools for IoT are just starting to become available, but this is just the tip of the iceberg. The real magic lies in the ability to use exploratory and predictive algorithms to make actionable intelligence a reality. These insights are beneficial to both businesses understanding their customers and to the customers themselves. One could imagine closing the feedback loop between sensor, cloud, and actuator by adding some beautiful supervised machine learning code into the cloud platform at some point in the chain. There are currently a handful of analytics startups focusing on IoT specifically, but this market is about to explode from both platform and application perspectives. 4. IoT Startups Galore For any developers out there interested in the IoT with a real customer pain that needs solving, now is the time to get coding and building that pitch deck. With hardware back en vogue, venture capital funding of IoT-centric companies ison the rise [2]. Having been to a number of IoT events, the amount of enthusiasm by VC and angel investors is palpable. There’s a definite need for developers with great, connected product and service ideas; so, if you haven’t already, I strongly suggest putting on your favorite prospecting gear and exploring the untamed wild west of the Internet of Things. [1] https://internetofeverything.cisco.com/sites/default/files/docs/en/ioe_public_sector_vas_white%20paper_121913final.pdf [2] http://www.cbinsights.com/blog/internet-of-things-investing-snapshot 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW

August 28, 2014

by Benjamin Ball

· 10,546 Views

How to Setup Realtime Analytics over Logs with ELK Stack

Once we know something, we find it hard to imagine what it was like not to know it. - Chip & Dan Heath, Authors of Made to Stick, Switch Update: I have recently published a book on ELK stack titled - Learning ELK Stack , more details can be found here. What is the ELK stack ? The ELK stack is ElasticSearch, Logstash and Kibana. These three provide a fully working real-time data analytics tool for getting wonderful information sitting on your data. ElasticSearch ElasticSearch,built on top of Apache Lucene, is a search engine with focus on real-time analysis of the data, and is based on the RESTful architecture. It provides standard full text search functionality and powerful search based on query. ElasticSearch is document-oriented/based and you can store everything you want as JSON. This makes it powerful, simple and flexible. Logstash Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use.In ELK Stack logstash plays an important role in shipping the log and indexing them later which can be supplied to Elastic Search. Kibana Kibana is a user friendly way to view, search and visualize your log data, which will present the data stored from Logstash into ElasticSearch, in a very customizable interface with histogram and other panels which provides real-time analysis and search of data you have parsed into ElasticSearch. How Do I Get It ? http://www.elasticsearch.org/overview/elkdownloads/ How Do They Work Together ? Logstash is essentially a pipelining tool. In a basic, centralized installation a logstash agent, known as the shipper, will read input from one to many input sources and output that text wrapped in a JSON message to a broker. Typically Redis, the broker, caches the messages until another logstash agent, known as the collector, picks them up, and sends them to another output. In the common example this output is Elasticsearch, where the messages will be indexed and stored for searching. The Elasticsearch store is accessed via the Kibana web application which allows you to visualize and search through the logs. The entire system is scalable. Many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers. Then many collectors can be reading those messages and writing them to an Elasticsearch cluster. (E)lasticSearch (L)ogstash (K)ibana (The ELK Stack) How Do I Fetch Useful Information Out of Logs? Fetching useful information from logs is one of the most important part of this stack and is being done in logstash using its grok filters and a set of input , filter and output plugins which helps to scale this functionality for taking various kinds of inputs ( file,tcp, udp, gemfire, stdin, unix, web sockets and even IRC and twitter and many more) , filter them using (groks,grep,date filters etc.) and finally write ouput to ElasticSearch,redis,email,HTTP,MongoDB,Gemfire , Jira , Google Cloud Storage etc. A Bit More About Log Stash Filters Transforming the logs as they go through the pipeline is possible as well using filters. Either on the shipper or collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc) parsed out into individual fields so they can be searched on more seamlessly. Information can be dropped if it isn’t important. Sensitive data can be masked. Messages can be tagged. The list goes on. e.g. input { file { path => ["var/log/apache.log"] type => "saurzcode_apache_logs" } } filter { grok { match => ["message","%{COMBINEDAPACHELOG}"] } } output{ stdout{} } Above example takes input from an apache log file applies a grok filter with %{COMBINEDAPACHELOG}, which will index apache logs information on fields and finally output to Standard Output Console. Writing Grok Filters Writing grok filters and fetching information is the only task that requires some serious efforts and if done properly will give you great insights in to your data like Number of Transations performed over time, Which type of products have most hits etc. Below links will help you a lot in writing grok filters and test them with ease - Grok Debugger http://grokdebug.herokuapp.com/ Grok Patterns Lookup https://github.com/elasticsearch/logstash/tree/v1.4.2/patterns References http://www.elasticsearch.org/overview/ http://logstash.net/ http://rashidkpc.github.io/Kibana/about.html

August 26, 2014

by Saurabh Chhajed

· 47,910 Views · 4 Likes

The Programming Challenges of IoT

Pragmatic developers can look at the Internet of Things in two ways: This is amazing. I can only begin to imagine how I can directly improve the world outside the set of networked computer boxes. This is terrifying. If something goes wrong, then it’s on me—and this time the system affected extends outside the set of networked computer boxes. IoT is amazing in the way it bridges physical and virtual environments, but even the phrase “Internet of Things” should give a developer pause. Computers are pretty smart. Things are stupid. IoT tries to put Things online and tries to make them into inter-networked computers. That’s pop-philosophy, but you want to develop in the real world. So what real-world challenges will you face when you shoot for the IoT moon? Two Types of Challenges It seems there are two types of programming challenges for the Internet of Things: Data and control (the comp-sci and networking stuff) Information and business logic (the info-sci and human-computer interaction stuff) For this article, we’re going to talk about the programming problems we can solve around IoT. We’ll start at the bottom (data and control) and work our way up to the big picture (information and business logic). Type 1: Data and Control Challenge 1.1: Power This one is pretty obvious. Many IoT devices are wireless, and no one has invented thumbnail fusion reactors yet. One solution is equally obvious: pick your algorithms carefully. If you can save cycles to perform a given task, then do it. Libraries for implementing power-optimized algorithms will presumably spring up in greater numbers, but even so, you may need to inject some heavy-duty comp-sci know-how into IoT app development. The second solution is more complex than the first. Higher-level developers will have to think more about Dynamic Power Management (DPM), which just means: shutting down devices when they don’t need to be on and starting them up when they do. Normally the operating system worries about this, but an IoT application that orchestrates wearables and phones, for example, will know things that each device’s OS won’t—and therefore will be able to switch things on and off more intelligently than each device’s individual OS. Another option is to write or customize an embedded OS. Challenge 1.2: Latency Latency on IoT sits in two places: at the source and in the pipes. The basic problem is a physical one. Thing-chips often have to be small, which means that the chip can only be as powerful as current transistor technology allows. Another problem is power. Many small devices transmit and receive data in discrete active/sleep cycles (think TDMA) in order to save bandwidth and power, but this increases latency inversely to power saved. Another tradeoff is that network topologies optimized for IoT can involve more hops over slower devices. Mesh networks, for example, are immune to the failure of a few nodes. Similarly, “fog” and “edge” computing paradigms relieve Internet infrastructure by doing as much as possible without hub-nodes. The downside is that each node (a) can’t do very much on its own and (b) can only talk to neighboring nodes. The problem in the pipes is a matter of network infrastructure. Simply: the more Things, the less available bandwidth. Infrastructure technology will get faster, but cell networks won’t catch up overnight. And Things, unlike fancier computers, are often supposed to transmit blindly—that is, without anyone necessarily asking them to. This means there’s a massive potential for wasted bandwidth. Challenge 1.3: Unreliability The third challenge flows from the first two. Devices are unreliable–“Things” even more so. The distributed and decentralized virtues of IoT bring their own reliability problems. Here are just a few: Ubiquitous devices are cheap, so they fail more often. Truly ad-hoc connectivity implies ephemeral SLA, so uptime and recovery time may be unclear. Loosely controlled devices may have better things to do than give you their data (or computing resources), so concurrency may grow very complex. Less-reliable hardware generates less-reliable information (‘does my outlying datapoint just signify device failure?’), so you may need to chew your data more thoroughly at the application level. In a sense, IoT decouples low-level (the sub-session layer) from high-level channel capacity, because the distribution of error-sources on IoT is more heavily weighted toward originating or remote nodes. This means more error-correcting for application developers. Type 2: Information and Business Logic Challenge 2.1: Vast & Thin Data Sensors on smartphones are already generating oceans of raw data. These sensors are pretty sophisticated. Every major mobile OS provides a unified, simple API to access clean sensor and geo data. But start grabbing this data and it’s not immediately clear what to do with it. Try to think of killer applications for barometric data—besides weather and elevation (with GPS)—off the top of your head. Raw sensor data is extremely thin. It doesn’t explain itself, and we haven’t yet produced a complete mapping from physical measurements to business logic—let alone software design. Even if you know what to do with sensor/geo data eventually, you may have to learn new algorithms and data structures to process immediately. Geo-graphs aren’t CS101 graph data structures (for one thing, edge length is a first-class citizen of geo-graphs). The size of data over IoT is itself a problem. Wireless sensors beget tons of data. All the problems (and opportunities) of Big Data cascade naturally from IoT. Massively distributed computing on IoT devices is an exciting thought, but the toolchain for splitting calculations over a thousand idle Fitbits just isn’t here yet. 2. Context-Sensitivity Consider the term “ubiquitous computing,” defined as: what happens when wirelessly connected sensors and actuators, placed more or less everywhere, allow software to interact with much larger swaths of the physical world than just hardware or bare metal. Put ubiquitous computing on the Internet, and IoT makes the software context much larger. This has implications at two basic levels. At a high computer-architectural level: IoT extends the concept of computing environment well outside the von Neumann machine and weakens the concept of peripheral I/O. In an IoT-world interface, sensors are input and actuators are output. As IoT devices process increasingly at the edge (within individual nodes), the devices that appear peripheral to other nodes are actually doing an awful lot of computation. At a high business-logic level: the more stuff outside the computer-box affects the program, the less predictable the program behavior becomes at runtime. The same bizarrely-birthed memory leak might slow down the UI in a smartphone context but contribute to a cascading electrical grid failure in an IoT context. This means that IoT demands more self-monitoring and self-repairing code. Two Types of Solutions Plenty of researchers are working on ambitious solutions to the programming challenges presented by IoT. Two of the more exciting examples include: Abstract Task Graph—a data-driven model that maps the network graph to an application graph [1] Computational REST—replaces content resources with computation resources [2] There are also a few more strategies you can use right now to solve some of the IoT programming challenges mentioned above. Reactive ProgrammingThis general purpose paradigm responds to all major application-level challenges and embraces opportunities presented by IoT. The four definitive attributes of a reactive application are: event-driven, scalable, resilient, and responsive [3]. These four are excellent guiding principles for IoT applications at a high, cross-stack level. Flow-based Programming and the Actor ModelBoth present strongly independent components where only messages can affect processes. Both are deeply amenable to concurrency (for example, shared state is discouraged), nondeterminism, and scaling without exponential complexity growth, because components are black boxes. FBP is a bit more pragmatic and restrictive while the actor model is less restrictive and a bit harder to implement. FBP has already been implemented in Javascript (NoFlo), and the actor model has been implemented in Java (Akka) [4][5][6]. What’s important to remember is that there are already tools and techniques that can help you build IoT applications. FBP, actors, and reactive programming all have key attributes for creating applications that leverage the strengths of IoT to overcome its challenges. [1] https://www.usenix.org/legacy/event/mobisys05/eesr05/tech/full_papers/bakshi/bakshi.pdf [2] http://isr.uci.edu/tech_reports/UCI-ISR-10-3.pdf [3] http://www.reactivemanifesto.org/ [4] http://jpaulmorrison.com/fbp/ [5] http://arxiv.org/ftp/arxiv/papers/1008/1008.1459.pdf [6] http://noflojs.org/ [7] http://akka.io/ 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW

August 14, 2014

by John Esposito

· 16,390 Views

An Early Mover's Guide to the Internet of Things

[This article was written by Andreea Borca, developer of patient-empowering solutions for the healthcare industry, co-host of Farstuff: The IoT Podcast, and featured author in DZone's 2014 Guide to Internet of Things]. The creation of the Internet was a significant shift in the way people acquire information, interact with each other, and make decisions. Now, the Internet is expanding its reach to a range of devices that can gather and analyze physical data and react to that data in a variety of applications that we’ve never seen before. This “Internet of Things” marks another dynamic shift in the history of technology. This new stage in the Internet’s evolution is changing it from a tool that we actively need to engage with—deliberately using a browser to access it—to one that passively endows the world around us with a “mind” of its own. We are developing a world where things interact intelligently and cooperate to achieve goals without explicit guidance from human operators. Defining the Internet of Things First, we need to define the Internet of Things (also called “The Internet of Everything” by Cisco). A system falls under the Internet of Things definition if it meets the following criteria, known as the “3 Cs”: 1. It must Connect – to the physical world around itself collecting information, to other things in order to interact with them effectively, to the internet or a network, etc. 2. It must Compute – by processing the inputs it receives in some way and making them meaningful to other systems. 3. It must Communicate – with the network, with other things, and with the user if necessary (more often than not, as you’ll see, communicating to the user may be an unnecessary burden). Challenges for the Internet of Things Efficiency Devices within the Internet of Things (IoT) only need to do the bare minimum necessary to effectively work within the existing ecosystem. Many of the newest products rely heavily on the power of your smartphone to connect to the Internet and orchestrate devices, but there is also extensive pressure to reduce the size, energy consumption, and cost of the processing entities within IoT devices. In order to reduce power consumption and manage node outages, there is a concept of daisy-chaining across a network of devices into a more powerful central hub. This is known as mesh networking, and it’s becoming quite popular for IoT systems. Security, Privacy & the need to Share A core requirement of a well-functioning IoT device is to collect, transfer, and store data from a wide variety of sources. As more sensors arrive in cities and healthcare institutions, that increasingly connected information will unavoidably lead to more concern about security and privacy. The debate is still raging over balancing the clear benefits of new discoveries from processing Big Data with the strong personal fear of losing privacy. With IoT now in the picture, there is concern about devices that continuously and passively collect information on users. One recent clash over always-on sensors came with the release of Microsoft’s Xbox One Kinect console, which has a camera that is constantly pointed at your living room. Although the camera itself is not always on, the backlash over that possibility was fierce [1]. Finding this balance will quickly become a requirement for continued progress. Furthermore, the very nature of IoT and the connectivity network necessary for its success does make it particularly vulnerable in certain instances. Devices are especially vulnerable when connected over WiFi, because low tech sensor nodes with minimal computing power tend to be less secure, making them the ideal point of entry for infiltrators. Standards As with all new technologies, the battle over standards is always a struggle. Nest, the company that developed the most popular smart home thermostat, and its new owner, Google, are now making significant strides trying to establish the Nest platform as the foundation for all consumer-based IoT devices and their software counterparts [2]. Cisco, Qualcomm, IBM, Microsoft, and most other major players have a similar strategy for creating standard models for approaching the Internet of Things. The pressure to standardize is especially clear when new devices are appearing weekly. ZigBee already has extensive reach as an established standard for many household IoT devices. However, as a preferred codebase has yet to emerge as the standard of choice, it is recommended to connect with major standardization organizations like the IEEE, IETF, and the ZigBee Alliance [3]. Currently, the most common sensor networks use protocols such as Bluetooth Low Energy (BLE), RFID tags, ZigBee, and Wi-Fi. There are also iBeacons, which allow devices like smartphones to better identify their location and potential needs with NFC-powered micro-location and GPS technology. Opportunities for the Internet of Things There are numerous prospects to consider when looking to develop IoT products. Given the multi-trillion dollar projections for the future IoT economy, we should take a look at these emerging markets for IoT tech [4]. Consumers The consumer IoT space has bred a small but growing segment of followers that have invested early into “smart” tech. At this year’s CES, we saw everything from the Babolat Tennis Racket that becomes your personal tennis coach to the Kolibree Toothbrush that monitors your gum health while you brush. The fastest growing consumer IoT segment seems to be in smart home technology, with products such as self-managing refrigerators and resident-sensing door locks. Commercial Retailers have already proven adept at collecting a consumer’s shopping history. With the functionality of NFC-powered beacons, these retailers are eager to personalize your shopping experience in a whole new way. Essentially, each physical shopping trip can now be as littered with targeted ads as any typical online search, much like a scene from the 2002 sci-fi film Minority Report. Walk into a store and instantly the advertising screens on the wall change to address your particular demographic, income level, and shopping preferences. If you’ve connected your Google calendar to certain applications, these screens would show outfits targeting your next big event. Signs on clothing racks sense you coming near and change prices, fully leveraging a custom pricing model that would have economists drooling. And as you try on outfits, the smart mirror in the dressing room recommends accessories or comments on alternatives that might be a better fit for your body type. After all of these IoT events have helped you with your purchase, there’s no need to checkout. You’ve registered with the store and there’s a beacon at the exit that registers what you picked up and charges your card automatically as you leave. Healthcare With the recent U.S. mandate that all health records must be digital, there has been an explosion in the marketplace of new, patient-centered, smart health devices. The excitement of a healthcare revolution among top innovative companies, incubators, and startups predicts that this trend is not likely to taper off anytime soon. The key areas of focus so far have been: monitoring technologies like wearables (especially passive monitoring), function improving technologies, education, and notification technologies. Wearables are generally the first consumer touch point in the IoT health sphere. With the popularity of Fitbit pedometers and Withings scales, the market is starting to experiment with internal monitoring and potentially replacing some organs completely in the near future. A study at Boston University has had incredibly positive results creating an artificial pancreas for Type 1 diabetics by inserting an insulin and glucagon pump that responds when an attached glucometer goes below a certain level, just like an actual pancreas. Proteus, a promising startup out of San Francisco, has created an all-natural microchip in a pill that the patient swallows in order to monitor whether they are remembering to take their medication. The pill sends data to an armband that the user is wearing, which then can send notifications to family members regarding the patient’s status. The most impressive feature is the fact that these chips are powered by the energy in the patient’s digestive system. Cities, Infrastructure, and Industry The long-term vision of the future includes technology such as self-driving cars and city lights that alert police when there’s been an accident. In this stage of development, the majority of value is coming from technologies that monitor and collect data in urban settings. From an evolutionary perspective, the IoT city as a whole is still in what many would consider a learning phase. The main objective is to collect as much data as possible, make it available via open APIs, and encourage motivated data analysts to find opportunities for improvement in utility usage, environmental impacts, and service management for larger populations. This is one area where being an industrial country like the U.S. may actually impede the ability to progress as quickly as our less established counterparts. Third world countries that haven’t yet built a solid infrastructure allow for the creativity and flexibility to implement sophisticated solutions unfettered by generations of previous development. Silicon Valley powerhouses like Facebook and Google are actively engaged in projects to create a free global Wi-Fi network, and key locations in Africa have allowed them to experiment with these projects. Being an Early Mover In the very near future, as more and more things connect to the internet, internet connectivity from IoT devices will dwarf the amount of traditional web browsing. The core standards and assumptions that will drive this next revolution in computing technology are still being established and, as a result, building anything that can add value to this exploding industry (software, hardware, devices, sensors, beacons etc.) is a remarkable opportunity for the right developer. Right now is the time to start contributing to the development of these technologies if you want to be an early mover in IoT. 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW

August 12, 2014

by Benjamin Ball

· 15,121 Views

JBoss Data Grid: Installation and Development

In this blog, we will discuss one particular data grid platform from Redhat namely JBoss Data Grid (JDG). We will firstly cover how to access and install this data grid platform and then we will demonstrate how to develop and deploy a simple remote client/server data grid application which utilises the HotRod protocol. We will be using the latest release JDG 6.2 from Redhat in this article. Installation Overview To start using JDG, firstly log on to the redhat site https://access.redhat.com/home and download the software from the Downloads section of the site. We wish to download JDG 6.2 server by clicking on the appropriate links in the Downloads section. For future reference, it is also useful to download the quickstart and maven repository zip files. To install JDG, we simply unzip the JDG server package into an appropriate directory in your environment. JDG Overview In this section, we will provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options either to run JDG in standalone or clustered mode. We can start JDG in either mode by invoking the stanadalone or clustered start up scripts in the / bin directory. To configure the JDG in either mode we need to configure the files standalone.xml and clustered.xml. In our case we will creating a distributed cache which will run on 3 node JDG cluster so we will be utilizing the clustered startup script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate xml configuration file above. We should also note the principal difference between the standalone and clustered configuration file is that in the clustered configuration file there is a JGroups subsystem configured element which allows for communication and messaging between configured cache instances running in a JDG cluster. Development Environment Setup and Configuration In this section, we will detail how to develop and configure a simple datagrid application which will be deployed to a 3 node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache. We will firstly discuss setting a new distributed cache on a 3 node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. Firstly, we will create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping each JDG installation into each directory. We will now configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory. We will name each of the cluster configuration files as "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively. We will now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each clustered.xml file. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To setup a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml" ......... ...... .............. ...... ...... /socket-binding-group> We will discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we will configure hotrod clients to connect to the JDG server instance on socket 11222. The name of the cache container to host each of the cache instances will be held in the container named "clusteredcache". We have configured the infinispan core subsystem to the default cache container named "clusteredcacahe" whereby we will allow for jmx statistics to be collected relating the configured cache entries i.e statistics="true" We have created a new distributed cache named "directory-dist-cache" whereby there will be two copies of each cache entry held on two of the 3 cluster nodes. We have also set up an eviction policy whereby should there be more than 20 entries in our cache then cache entries will be removed using the LRU algorithm We should have configured nodes "server2" and "server3" to start up with a port offset of 100 and 200 respectively by configuring the socketing binding group element appropriately. Please view the socket bindings noted below. To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry: ...... ...... /socket-binding-group> To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry: ...... ...... /socket-binding-group> Before discussing the setup and configuration of our Hotrod client which will be used to interact with our JDG clustered HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly. Open up 3 Windows or Linux consoles and execute the following start up commands: Console 1: 1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the first instance of our JDG cluster denoted "server1": clustered -c=clustered1.xml -Djboss.node.name=server1 Console 2: 1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the second instance of our JDG cluster denoted "server2": clustered -c=clustered2.xml -Djboss.node.name=server2 Console 3: 1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the third instance of our JDG cluster denoted "server3": clustered -c=clustered3.xml -Djboss.node.name=server3 Providing all 3 JDG instances have started up correctly, you should see output in the console window whereby we can see there are 3 JDG instances in the JGroups view: HotRod Client Development Setup Now that the Hotrod server is up and running, we need to develop a Hotrod Java client which will interact with the clustered server application. The development environment consists of the following tools. 1) JDK Hotspot 1.7.0_45 2) IDE - Eclipse Kepler Build id: 20130919-0819 The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions: a) add new cinema objects. b) add and remove shows to each cinema object. c) print the list of all cinemas and shows stored in our distributed cache. The source code can be downloaded from github @ https://github.com/davewinters/JDG. We could use maven here to build and execute our application by configuring the maven settings.xml to point to the maven repository files we downloaded earlier and set up a maven project file (pom.xml) to build and execute the client application. In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, one should complete the following steps: 1) Create a new Java Project in Eclipse 2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code that has been downloaded from Github mentioned previously. 3) Now we need to configure the build path in Eclipse to contain the appropriate JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example on my machine these jar files are located in the directory: \server1\jboss-datagrid-6.2.0-server\client\hotrod\java 4. Providing the Eclipse build path has been configured appropriately, the application source should compile without issue. 5. We will need to execute the Hotrod application by opening the console window and executing the following command. Note the path specified here will differ depending on where the JDG client jar files and application class files are located in your environment: java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory 6. The Hotrod client at runtime provides the end user with a number of different options to interact with the distributed cache as we can view from the console window below. Client Application Principal API Details We will not provide a detailed overview of the Hotrod application code however we will describe the principal API and code details briefly. In order to interact with the distributed cache on the JDG cluster using the Hotrod protocol, we will use the RemoteCacheManager Object which will allow us to retrieve a remote reference to the distributed cache. We have initialised a Properties object with the list of JDG instances and the associated with HotRod server port on each instance. We can add Cinema objects into the distributed cache using the RemoteCache.put() method. private RemoteCacheManager cacheManager; private RemoteCache cache; ..... Properties properties = new Properties(); properties.setProperty(ConfigurationProperties.SERVER_LIST, "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422"); cacheManager = new RemoteCacheManager(properties); cache = cacheManager.getCache("directory-dist-cache"); ..... cache.put(cinemaKey, cinemalist); In the webinar below, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/ Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development In this webinar we will look at the basics of setting up JBoss Data Grid covering installation, configuration and development. We will look at practical examples of storing data, viewing the data in the cache and removing it. We will also take a look at the different clustered modes and what effect these have on the storage of your data:

July 25, 2014

by David Winters

· 16,103 Views

5 Reasons to Use a Java Data Grid in Your Application

In this post we explore 5 reasons to use a Java Data Grid for caching Java objects in-memory in your applications. In a later post we will explore some of the other data grid capabilities, beyond data storage, that can revolutionize your Java architectures, like on-grid computation and events. Memory is Fast Java Data Grids store Java objects in memory. Memory access is fast with low latency. So if access to data storage either disk or database is the primary bottleneck in your application then using a data grid as an in-memory cache in front of your storage tier will give you a performance boost. Scale out your Application Shared State If you need to share state across JVMs to scale out your application then using a Java Data Grid rather than a database will increase your scalability. A typical shared state architecture is shown below, the application server tier stores shared Java objects in the data grid and these objects are available to all application server nodes in your architecture. Separating the data grid tier from the application server tier has a number of advantages; Applications can be redeployed and restarted without losing the shared state Data Grid JVMs and Application JVMs can be tuned separately State can be shared across multiple different applications. Each tier can be scaled horizontally separately depending on work load Typical use cases for shared state include; PCI compliant storage of card security codes; In-game state in online games; web session data; prices and catalogues in ecommerce. Anything that needs low latency access can be stored in the shared data grid. High Availability for In-Memory Data As well as low latency access and scaling out shared state. Java Data Grids also provide high availability for your in-memory data. When storing Java objects in a data grid a primary object is stored in one of the Data Grid JVMs and secondary back up copies of the object are stored in different Data Grid JVM node, ensuring that if you lose a node then you don't lose any data. Clients of the data grid do not need to know where data is to access it so high availability is transparent to your application. Scale Out In-Memory Data Volumes Java objects, in data grids, aren't fully replicated across all Data Grid JVMs but are stored as a primary object and a secondary object. This means the more Data Grid JVM nodes we add the more JVM heap we have for storing Java objects in-memory (and remember memory is fast). For example if we build a Data Grid with 20 JVMs each with 4Gb free heap (after per JVM overhead) we could theoretically store 80Gb (4 times 20) of shared Java objects. If we assume we have 1 duplicate for high availability this cuts our storage in half so we can store 40Gb (.5 time 4 times 20 ) of Java Objects in memory. Native Integration with JPA Java Data Grids have native integration with JPA frameworks like TopLink and Hibernate whereby the Data Grid can act as a second level cache between JPA and the database. This can give a large performance boost to your database driven application if latency associated with database access is a key performance bottleneck.

July 22, 2014

by Steve Millidge

· 7,437 Views

Designing a Data Architecture to Support both Fast and Big Data

Originally written by Scott Jarr for VoltDB. In post one of this series, we introduced the ideas that a Corporate Data Architecture was taking shape and that working with Fast Data is different from working with Big Data. In the second post we looked at examples of Fast Data and what is required of applications that interact with Fast Data. In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big. The following diagram depicts a basic view of how the “Big” side of the picture is starting to fill out. At the center is a Data Lake, or pool or reservoir or…. there is no shortage of clever names and debate over what to call it. What is clear is this is the spot in which the enterprise will dump ALL of its data. This component is not necessarily unique because of its design or functionality, but because it is an enormously cost effective system to store everything. Essentially, it is a distributed file system on cheap commodity machines. There may or may not be a single winning technology here. It may be HDFS or some other store (maybe S3 if you’re on Amazon), but the point is, this is where all data will go. This platform will: 1. Store data that will be sent to other data management products, and 2. Support frameworks for executing jobs directly against the data in the file system. Moving around the outside of our Data Lake are the complementary pieces of technology that allow people to gain insight and value from the data stored in the Data Lake. Starting at 12 o’clock in the diagram above and moving clockwise: BI – Reporting: Data warehouses do an excellent job of reporting, and will continue to offer this capability. Some data will be exported to those systems and temporarily stored there, while other data will be accessed directly from the Data Lake in a hybrid fashion. These data warehouse systems were specifically designed to run complex report analytics, and do this well. SQL on Hadoop: There is a lot of innovation here. The goal of many of these products is to displace the data warehouse. Advances have been made with the likes of Hawq and Impala. But make no mistake, there is a long way to go for these systems to get near the speed and efficiency of the data warehouses, especially those with columnar designs. SQL-on-Hadoop systems exist for a couple of important reasons: 1) SQL is still the best way to get at data, and 2) Processing can occur without moving big chunks of data around. Exploratory Analytics: This is the realm of the data scientist. These tools offer the ability to “find” things in data – patterns, obscure relationships, statistical rules, etc. Mahout and R are popular tools in this category. MapReduce: This is a lazily-named group of all the job scheduling and management tasks that often occur on Hadoop (I really should come up with something more accurate). Many Hadoop use cases today involve pre-processing or cleaning data prior to the use of the analytics tools described above. These are the tools and interfaces that allow that to happen. ETL of Enterprise Apps: Last at 6 o’clock is the ETL process that will help get all the legacy data from our trusty enterprise applications into our data lake that stores everything. These applications will slowly migrate to full-fledged Fast+Big Data apps in time, which I will discuss in a future post. But suffice it to say: once I add sensors to a manufacturing line, I have a Fast+Big Data problem. OK, we now have analytics … so what? Why do we do analytics in the first place? Simple. We want: Better decisions Better personalization Better detection Better …. Interaction. Interaction is what the application is responsible for, and the most valuable improvements come when you can do these interactions accurately and in real-time. This brings us to the second half of the architecture where we deal with Fast Data to make better, faster real-time applications, depicted in the diagram below. The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold Petabytes of data and generate extensive reports. The nature of Fast Data produces a number of critical requirements to get the most out of it. These include the ability to: Ingest / interact with the data feed Make decisions on each event in the feed Provide visibility into fast-moving data with real-time analytics Seamlessly integrate into the systems designed to store Big Data Ability to serve analytic results and knowledge from the Big Data systems quickly to users and applications, closing the data loop. There is no better technology to meet these requirements than an operational database. The challenge we have faced is that there hasn’t been an operational database that can manage this kind of throughput. As a result, there have been a number of Band-Aids people have used to attempt to meet their needs, often giving up capabilities and always adding complexity. In a next post, I will detail the capabilities I see customers looking for to support their Fast Data applications. Then we will take a look at the results of attempting this solution with a popular alternative, stream processing. Originally written by Scott Jarr for VoltDB.

July 9, 2014

by John Piekos

· 14,148 Views

What is Scalable Machine Learning?

scalability has become one of those core concept slash buzzwords of big data. it’s all about scaling out, web scale, and so on. in principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. the terms “scalable” and “large scale” have been used in machine learning circles long before there was big data. there had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. so finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the big data kind of sense. part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently. to illustrate this, let’s search for nips papers (the annual advances in neural information processing systems, short nips, conference is one of the big ml community meetings) for papers which have the term “scalable” in the title. here are some examples: scalable inference for logistic-normal topic models … this paper presents a partially collapsed gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation … partially collapsed gibbs sampling is a kind of estimation algorithm for certain graphical models. a scalable approach to probabilistic latent space inference of large-scale networks … with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours … stochastic variational inference algorithm is both an approximation and an estimation algorithm. scalable kernels for graphs with continuous attributes … in this paper, we present a class of path kernels with computational complexity $o(n^2(m + \delta^2 ))$ … and this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could. usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. still, the actual “scalability” comes from the algorithmic side. so how do scalable ml algorithms look like? a typical example are the stochastic gradient descent (sgd) class of algorithms. these algorithms can be used, for example, to train classifiers like linear svms or logistic regression. one data point is considered at each iteration. the prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller. vowpal wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning: there are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. this project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation. so “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. for sgd type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. the main problem to speed this kind of computation up is how to stream the data by fast enough. to put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily. i know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of spark , but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. while this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set. if you want to know more, this large scale learning challenge sören sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets. of course, there are things which can be easily scaled well using hadoop or spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine . i’ve just scratched the surface of this, but i hope you got the idea that scalability can mean quite different things. in big data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. in machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. often, this allows you to speed up your computations significantly by decoupling computations. parallelization is important, too, but alone it won’t get you very far. luckily, there are projects like spark and stratosphere/flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.

July 3, 2014

by Mikio Braun

· 18,580 Views · 1 Like

Top 10 Hadoop Shell Commands to Manage HDFS

So you already know what Hadoop is? Why it is used? What problems you can solve with it?

June 30, 2014

by Saurabh Chhajed

· 485,725 Views · 8 Likes

Internet of Things (IoT) Reference Architecture

to converge internet of thing devices with corporate it solutions, teams require a reference architecture for the internet of things (iot). the reference architecture must include devices, server-side capabilities, and cloud architecture required to interact with and manage the devices. a reference architecture should provide architects and developers of iot projects with an effective starting point that addresses major iot project and system requirements. a high-level iot reference architecture may include the following layers (see figure 1): external communications - web/portal, dashboard, apis event processing and analytics (including data storage) aggregation / bus layer – esb and message broker device communications devices cross-‐cutting layers include: device and application management identity and access management a more detailed architecture component description can be found in the iot reference architecture white paper .

June 18, 2014

by Chris Haddad

· 17,800 Views

Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel

Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.

June 3, 2014

by Yves Trudeau

· 460,755 Views · 86 Likes

How to Migrate from MySQL to MongoDB

In the last week I was working on a key project to migrate a BI platform from MySQL to MongoDB. We chose that database due to its support and scalability.

April 14, 2014

by Moshe Kaplan

· 115,919 Views · 6 Likes