Data Resources

The Latest Data Topics

The Near Future of IoT

[This article was written by Sean Lorenz.] Pundits within the technology sphere have been calling 2014 the year of the Internet of Things (IoT). The market revenue potentials are forecasted into the trillions and it’s a Fortune 500 land grab with major companies moving quickly to stake their claims [1]. If this sounds a bit like pages from an American Wild West history book, a frontier analogy isn’t too far off. This is an exciting turning point in technology that—thanks to advances in plummeting sensor costs, wireless communication, and chip size reduction—will soon make today’s futuristic IoT concepts seem humorous in retrospect. While it’s difficult to see where the market is going, given the exponential rate of change in IoT technology, I have noticed several key trends emerging. As a fellow IoT prospector on the frontier, this is my account of the most evident trends as well as some educated predictions for the future. Trends 1. Business Value Over Technology Focus Like any promising new technology still in its infancy stage, the true innovation stems from tech-savvy researchers and tinkerers that build fascinating devices that sometimes have no consumer base–I’m looking at you, robotics market. We have all heard about the smart toothbrushes and smart egg trays coming to market and thought: “Interesting! I wouldn’t buy one, but… sure!” Perhaps the biggest trend is a shift from thinking, “let’s build it because we can” to “what business problem are we solving here?” IoT developers are getting wise to this mentality and building user-focused MVPs (Minimum Viable Products) that will begin hitting the market in late 2014 and early 2015. 2. Keeping It Real At my company, Xively, we often get asked what are the real use cases for the IoT. 
Many times our customers walk in the door with a vague idea of how connecting their product or service to the Internet would be potentially interesting, but need a little help with seeing how an IoT-enabled product can transform their business—internally and externally. The reason for this is that most of the exciting, transformative elements happen under the hood. Right now, the true “wow” moments in the industry are far from sexy: energy savings in enterprise complexes, CRM & ERP integration, service and support, supply chain efficiencies, product part failure and alert, and so on… you get the idea. Smart homes that respond to our every whim are really great ideas, but these products aren’t integral to our lives yet. Large manufacturing companies and enterprises are using the Internet of Things to manage internal operations and efficiency while also engaging their customers more fully with new IoT data sources aggregated in existing services like Salesforce1 or SAP. 3. Publish-Subscribe The IoT protocol wars are heating up, but allegiances aside, publish-subscribe messaging is what the bulk of implemented models use for connecting devices to the cloud. Pub-sub protocols such as MQTT, CoAP, and AMQP are attractive for connected product development thanks to their ease of scalability and many-to-one/one-to-many possibilities. Given the massive variance of the IoT market, there is bound to be more than one protocol that wins in the end; yet before we get to that point, there are plenty of bugs and vulnerabilities to patch across all of the thriving, open IoT protocols out there. 4. Security Panic! Hacked refrigerators, big box stores, and security cameras… oh my! There has been no shortage of concern for privacy, security, and compliance in the Internet of Things space. Like any news story, some of this attention is warranted and some overblown. 
Just like your pre-IoT old-fashioned Internet, creating specific application keys and advanced permissioning systems for hardware connecting to the cloud is essential. The number of nodes at the edge connecting to services across the Internet will be far larger than anything we see now, but IoT platforms are already addressing the complex device lifecycle management issues that are crucial for protecting personal and enterprise information in a connected world.

Near-Term Predictions

Now let’s hop in the DeLorean and look into the future. Rather than focus on five, ten, or twenty years out, let’s focus only on the next few years. Why? As I mentioned in the beginning, the IoT landscape changes on a day-to-day basis, so even a prediction looking forward six months from now can be unreliable. This list contains no self-driving cars or sentient AIs. Instead, it makes some pretty sure bets for what to expect over the horizon.

1. A Household Name

Usually the second question after “what’s your name?” at a dinner party is the inevitable “so what do you do?” Mentioning the Internet of Things to non-techies still draws blank stares and looks of confusion. Those looks are justified given the not-so-great marketing name of IoT and the myriad definitions trying to explain what it actually is. Whether it’s called the Internet of Things, Internet of Everything, or just the good ol’ Internet, the concept of connecting anything and everything to the Internet will begin to make sense for everyday consumers.

2. Consumers Slow to Adopt

Many IoT products are still just toys in many people’s minds. Startups are building products that address problems which most consumers don’t see as problems yet. This isn’t to say the consumer IoT market will evaporate. It just means we need to get smarter about what customers actually want from smart devices. Today’s wearable products remind me of the Newton, Apple’s infamous PDA. The problem wasn’t the idea, but rather the timing.
The Apple Newton seemed clunky, not very powerful, and low on the usability scale. Years later, the iPhone and iPad came along with a set of features and a form factor that customers were looking for. The same feels true of wearables right now—they may need a few more years to incubate before the general public gives two thumbs up. Other consumer IoT markets such as the smart home or driverless cars seem to be in the same situation as the wearables market, but this is changing quickly with major players like Apple and Google moving into these arenas. For example, in the home automation space, frameworks like Apple HomeKit will be essential for unifying disparate protocols and clouds into one application that can handle various products’ data, automating much of the technology and pushing it into the background. I am sure there is a brilliant developer learning Swift and building the first killer smart home app as we speak. 3. Analytics and Automation This prediction probably comes as no surprise, but it is worth stating. Most companies willing to foray into the IoT unknown are, for now, happy with connecting their devices to an external application or cloud service. Having a place to send the data is usually the first step in constructing an IoT system. So what do you do with all this data once you have it? Reporting tools for IoT are just starting to become available, but this is just the tip of the iceberg. The real magic lies in the ability to use exploratory and predictive algorithms to make actionable intelligence a reality. These insights are beneficial to both businesses understanding their customers and to the customers themselves. One could imagine closing the feedback loop between sensor, cloud, and actuator by adding some beautiful supervised machine learning code into the cloud platform at some point in the chain. 
There are currently a handful of analytics startups focusing on IoT specifically, but this market is about to explode from both platform and application perspectives.

4. IoT Startups Galore

For any developers out there interested in the IoT with a real customer pain that needs solving, now is the time to get coding and building that pitch deck. With hardware back en vogue, venture capital funding of IoT-centric companies is on the rise [2]. Having been to a number of IoT events, I can say the enthusiasm among VC and angel investors is palpable. There’s a definite need for developers with great connected product and service ideas; so, if you haven’t already, I strongly suggest putting on your favorite prospecting gear and exploring the untamed wild west of the Internet of Things.

[1] https://internetofeverything.cisco.com/sites/default/files/docs/en/ioe_public_sector_vas_white%20paper_121913final.pdf
[2] http://www.cbinsights.com/blog/internet-of-things-investing-snapshot

August 28, 2014

by Benjamin Ball

· 10,537 Views

Deserializing Json to a Java Object Using Google’s Gson Library

JavaScript Object Notation (JSON) is fast becoming the de facto standard format for transferring, sharing, and passing around data, be it on the web, in a REST service, in a remote procedure call, or in an AJAX request. JSON is lightweight, with a small memory footprint compared to XML. In its raw form, the content of a JSON string looks like gibberish. To make the content usable it needs to be deserialized, that is, converted to a usable form: usually a Java object (POJO), or an array or list of objects, depending on the JSON content. A typical JSON string is shown below.

    {"city":"jos","country":"nigeria","housenumber":"13","lga":"jos south", "state":"plateau","streetname":"jonah jann","village":"bukuru","ward":"1"}

There are a lot of frameworks for deserializing JSON to a Java object, such as JSON-RPC, Gson, Flexjson, and a whole lot of other open source libraries. In this blog post I will demonstrate how to use the google-gson library to deserialize a JSON string to a Java object. You can download the Gson library from https://code.google.com/p/google-gson/ .

To have the JSON string deserialized, a Java class must be created whose field names match the field names in the JSON string. There is a website that provides a service for viewing the content of a JSON string in a tree-like manner: http://jsonviewer.stack.hu . Paste the JSON string in the text tab and view the fields and the content from the viewer tab.

I will deserialize a JSON string that contains address details to an Address POJO. The Address object follows the structure shown in the JSON tree view.
    // Plain data holder; field names match the keys of the JSON string exactly.
    public class Address {
        private String city;
        private String country;
        private String housenumber;
        private String lga;
        private String state;
        private String streetname;
        private String village;
        private String ward;

        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
        public String getCountry() { return country; }
        public void setCountry(String country) { this.country = country; }
        public String getHousenumber() { return housenumber; }
        public void setHousenumber(String housenumber) { this.housenumber = housenumber; }
        public String getLga() { return lga; }
        public void setLga(String lga) { this.lga = lga; }
        public String getState() { return state; }
        public void setState(String state) { this.state = state; }
        public String getStreetname() { return streetname; }
        public void setStreetname(String streetname) { this.streetname = streetname; }
        public String getVillage() { return village; }
        public void setVillage(String village) { this.village = village; }
        public String getWard() { return ward; }
        public void setWard(String ward) { this.ward = ward; }

        @Override
        public String toString() {
            return "Address [city=" + city + ", country=" + country
                    + ", housenumber=" + housenumber + ", lga=" + lga
                    + ", state=" + state + ", streetname=" + streetname
                    + ", village=" + village + ", ward=" + ward + "]";
        }
    }

Performing the deserialization with Gson is easy: create POJO classes to hold your data, import com.google.gson.Gson and com.google.gson.GsonBuilder into your project, then create an instance of the Gson class and perform the deserialization as shown below.

    Gson gson = new GsonBuilder().create();
    Address address = gson.fromJson(json, Address.class);

Voila, you have your JSON deserialized! The source code listing is below.
    package jsondeserializer;

    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;

    public class Tester {
        public static void main(String[] args) {
            // The JSON document to deserialize, escaped as a Java string literal.
            String json = "{\"city\":\"jos\",\"country\":\"nigeria\",\"housenumber\":\"13\",\"lga\":\"jos south\",\n"
                    + "\"state\":\"plateau\",\"streetname\":\"jonah jann\",\"village\":\"bukuru\",\"ward\":\"1\"}";

            Gson gson = new GsonBuilder().create();
            // Gson fills each Address field from the JSON key with the same name.
            Address address = gson.fromJson(json, Address.class);
            System.out.println(address.toString());
        }
    }
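The requirement that field names match the JSON keys can be illustrated without Gson at all. The sketch below is only an approximation of the idea, not Gson's actual implementation: it assumes the JSON has already been parsed into a map of strings, and the names FieldNameMappingSketch, MiniAddress, and populate are hypothetical, introduced purely for illustration. It copies each value into the field with the same name by reflection and skips keys that have no matching field.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldNameMappingSketch {

    // Hypothetical two-field POJO, a trimmed-down stand-in for the address class above.
    public static class MiniAddress {
        private String city;
        private String country;

        public String getCity() { return city; }
        public String getCountry() { return country; }
    }

    // Copies each map entry into the String field with the same name.
    // Keys without a matching field are skipped.
    public static <T> T populate(Map<String, String> values, Class<T> type) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> entry : values.entrySet()) {
                try {
                    Field field = type.getDeclaredField(entry.getKey());
                    field.setAccessible(true);
                    field.set(target, entry.getValue());
                } catch (NoSuchFieldException ignored) {
                    // No field with this name, so the key is dropped.
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("city", "jos");
        parsed.put("country", "nigeria");
        parsed.put("ward", "1"); // MiniAddress has no "ward" field
        MiniAddress address = populate(parsed, MiniAddress.class);
        System.out.println(address.getCity() + ", " + address.getCountry());
    }
}
```

A key that is misspelled, or that differs from the field name only in letter case, ends up unset in this sketch, which is one reason it helps to check the key names in a JSON viewer before writing the POJO.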

August 27, 2014

by Ayobami Adewole

· 90,872 Views

Microservices and PaaS (Part II)

[This article was written by John Wetherill.] This is a continuation of the Microservices and PaaS - Part I blog post I wrote last week, which was an attempt to distil the wealth of information presented at the microservices meetup hosted by Cisco, with Adrian Cockcroft and others presenting. Part I provided a brief background on microservices, with a summary of some lessons learned by microservices pioneers. In this installment I will cover a number of practices related to microservices that were discussed during the meetup. A followup article will dive into the advantages that Platform as a Service brings to microservice development. Microservices Practices I'm calling these "Microservice Practices," not "Microservices Best Practices" because microservices-based architectures are still evolving, with new practices, techniques, tools, and patterns emerging constantly. At the meetup a number of practices were highlighted that Netflix and other microservices pioneers have spearheaded in their efforts to adopt a microservices mentality across their organizations. Break Things Deliberately According to Netflix: "We have found that the best defense against major unexpected failures is to fail often." Netflix has brought us "Chaos Monkey" which is a powerful tool the sole purpose of which is to break things, often and randomly. They use this tool continuously on their production systems to bring down essential services, to ensure that doing so doesn't disrupt the user experience or their overall service. It's much better to deliberately break the system in the middle of the morning when all teams are assembled and sufficient caffeine has been consumed, than to be informed of a breakage by a page at 3am. No Manual "Anything" In a world where microservices come and go, grow and shrink, and migrate around racks and data centers in seconds - there's absolutely no room for manual intervention. 
All aspects of deployment, monitoring, testing, and recovery must be fully automated. For example, monitoring of a service should begin instantly and automatically by virtue of its being deployed, without requiring a separate manual step. Similarly, failure discovery and rerouting to old code, as described in Part I of this blog, must be fully automated, with no human intervention required.

Respect Human Attention Span

Speaking of humans, a typical human's attention span, say when filling out a shopping cart, is around 10 seconds. If a failure occurs when deploying an updated shopping cart microservice, it's important that the total time for failure detection, reporting, and rerouting to existing, working code stays within this 10-second range. Obviously this shouldn't happen too often, but the occasional 10-second gap in response will probably not lose the customer. A five-minute or five-hour lag resulting from manual intervention and rollback will.

Denormalize like Crazy

Refactor database schemas, and denormalize everything, to allow complete separation and partitioning of data. That is, no underlying table should serve more than one microservice, and no data should be shared directly. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or message service interface).

Polyglot Persistence

Each microservice can have its own persistence layer. Gone are the days of a single monolithic database instance that's shared across all parts of an application. Databases are getting cheaper and easier. As an example, Neo4J allows you to embed an industry-strength, self-contained graph database in your microservice at the cost of a few megabytes in a jarfile, with startup time on the order of milliseconds. That's essentially free.
Even better, any PaaS worth its salt will provide multiple database services that can be spawned and accessed at the drop of a hat. With technology like this at our disposal, it makes sense to use the persistence layer that fits both the problem being solved and the expertise, and passions, of the team that's solving the problem.

Avoid Trunk Conflicts

The old mindset had all code for a large project contained in a single source repository. This can be slightly easier to set up and manage, but it ties the microservices together and makes it much more difficult to evolve them independently. Instead, each microservice should have its own SCM repository so it can truly be updated and enhanced independently of other services.

One Service, One Manifest

Each microservice must have its own manifest and dependencies, instead of maintaining a global dependency list for all services. This allows, for example, one microservice to depend on Spring v3.2, while another can require Spring 4.1. The dependencies for one microservice can change over time with no effect on the dependencies of other microservices.

Contain Everything

All microservices should run in a container, such as Tomcat, Docker, or whatever container system is provided by the PaaS (you are running a PaaS, aren't you?). Do not run microservices on bare metal, or directly on a VM. Containerization brings countless advantages, particularly a consistent, isolated runtime environment that can easily migrate around the datacenter or around the globe. With Docker and other modern containerization approaches, there is very little overhead in running in a container, and considerable upside.

No State

Do not build stateful services. Instead, maintain state in a dedicated persistence service, or elsewhere. This is a well-known practice brought to us by the cloud. When an application instance maintains state, it can't easily be moved, scaling is more complex, and it's more likely to cause problems when it fails.
This practice applies even more to microservices, which in general should be lightweight, instantly replaceable on failure, and able to hop around data centers.

Don't Name your Chickens

People who raise chickens soon learn that naming chickens is a bad idea: after naming a chicken you get attached to it, at least the kids do, and it can be uncomfortable to have to explain at the dinner table that the chicken pot pie is really "Molly." Instead, number your chickens, so you can say "that was chicken #38" or even better, "that was chicken 586ec9bd." Makes for a much more enjoyable meal. The same can be said of computer systems. Do not name systems after planets, or animals, or philosophers, or prisons, as was common practice in the UNIX world for decades. Instead, assign them GUIDs, and don't attach any sort of significance to them, such as specific roles or purposes. Systems should be commodities, like McDonald's franchises. Each McDonald's is eerily similar, with the advantage that if one shuts down you can just walk an extra few blocks and be served the exact same burger at the same price in the same amount of time.

Create and Curate Access Libraries

Microservices are accessed by externally published APIs or protocols. This allows the microservice implementation to completely change with no effect on its consumers, as long as the API remains constant. But just publishing an API is not enough. The microservice provider should also be responsible for building and stewarding the client libraries used to access the service. If this is not done, the construction of these libraries will be left to third parties, and will likely result in fragmentation, where various implementations have slight differences, or implementors incorrectly interpret the spec and introduce inconsistencies which then stick.
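Returning to the chicken-numbering advice above, a minimal Java sketch of opaque instance identifiers might look like the following. The class and method names (InstanceIds, newInstanceId, shortForm) are hypothetical and chosen only for illustration. The eight-character short form mirrors the "chicken 586ec9bd" style and is meant for display only, since a short prefix is not guaranteed to be unique.

```java
import java.util.UUID;

public class InstanceIds {

    // Returns a random identifier that carries no role, location, or meaning.
    public static String newInstanceId() {
        return UUID.randomUUID().toString();
    }

    // Shortens an identifier to its first eight hex characters for logs and dashboards.
    public static String shortForm(String instanceId) {
        return instanceId.substring(0, 8);
    }

    public static void main(String[] args) {
        String id = newInstanceId();
        System.out.println("full id:  " + id);
        System.out.println("short id: " + shortForm(id));
    }
}
```

Because the identifier says nothing about what the instance does, any instance can be replaced by another without renaming or reconfiguring anything that refers to it.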
Optimize the Interaction One downside of a microservices architecture is the "fanout" problem where a single request to the overall application results in 10 or 20 requests bubbling throughout the various microservices the application relies on. This dramatic increase in network traffic calls for more optimal communication between microservices. Instead of transmitting the standard text/html REST content type, consider using something like Google Protocol Buffers, Simple Binary Encoding, or Apache Thrift, to decrease the size of the payload and optimize the inter-microservice communications. Release the Monkeys Netflix has released what they call the "Simian Army," a suite of tools including Chaos Monkey, mentioned above, whose purpose is to help an organization build resilient, scalable, fault-tolerant software. The suite includes such tools as Janitor Monkey, to reclaim unused resources, Security Monkey which looks for security vulnerabilities, Latency Monkey, which induces artificial delays in the REST layer to scare out latency issues, and many more. As Phil described last week in his blog Devops: Tools vs. Culture, most organizations don't have the resources or luxury of being able to build their own toolsets when evolving to a microservices and devops culture. Instead they must leverage existing tools, and fortunately lots of tools are constantly appearing. It's worth spending the effort searching and researching these tools, and incorporating them into your overall development process when they make sense. To be continued... Again I originally intended to cover last week's microservices meetup in a single blog post, which then expanded to two. I have yet to address the power of PaaS in microservices architectures, and I'm out of space already. So I will continue this Microservices and PaaS theme next week, finally getting into PaaS, and discuss how Platform as a Service can significantly streamline the microservices development process.

August 26, 2014

by John Wetherill

· 9,702 Views

How to Setup Realtime Analytics over Logs with ELK Stack

Once we know something, we find it hard to imagine what it was like not to know it. - Chip & Dan Heath, Authors of Made to Stick, Switch

Update: I have recently published a book on the ELK stack titled Learning ELK Stack; more details can be found here.

What is the ELK stack?

The ELK stack is Elasticsearch, Logstash, and Kibana. Together these three provide a fully working real-time data analytics tool for extracting valuable information from your data.

Elasticsearch

Elasticsearch, built on top of Apache Lucene, is a search engine with a focus on real-time analysis of data, and is based on a RESTful architecture. It provides standard full-text search functionality and powerful query-based search. Elasticsearch is document-oriented, and you can store everything you want as JSON. This makes it powerful, simple, and flexible.

Logstash

Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. In the ELK stack, Logstash plays an important role in shipping the logs and preparing them for indexing in Elasticsearch.

Kibana

Kibana is a user-friendly way to view, search, and visualize your log data. It presents the data that Logstash has stored in Elasticsearch in a very customizable interface, with histograms and other panels that provide real-time analysis and search of the data you have parsed into Elasticsearch.

How Do I Get It?

http://www.elasticsearch.org/overview/elkdownloads/

How Do They Work Together?

Logstash is essentially a pipelining tool. In a basic, centralized installation, a Logstash agent, known as the shipper, reads input from one or many input sources and outputs that text, wrapped in a JSON message, to a broker. The broker, typically Redis, caches the messages until another Logstash agent, known as the collector, picks them up and sends them to another output. In the common example this output is Elasticsearch, where the messages are indexed and stored for searching.
The Elasticsearch store is accessed via the Kibana web application, which allows you to visualize and search through the logs. The entire system is scalable. Many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers. Then many collectors can read those messages and write them to an Elasticsearch cluster.

(E)lasticSearch (L)ogstash (K)ibana (The ELK Stack)

How Do I Fetch Useful Information Out of Logs?

Fetching useful information from logs is one of the most important parts of this stack. It is done in Logstash using its grok filters and a set of input, filter, and output plugins. These plugins let Logstash take various kinds of input (file, tcp, udp, gemfire, stdin, unix, web sockets, and even IRC and Twitter, among many others), filter them (grok, grep, date filters, etc.), and finally write output to Elasticsearch, Redis, email, HTTP, MongoDB, Gemfire, Jira, Google Cloud Storage, etc.

A Bit More About Logstash Filters

Transforming the logs as they go through the pipeline is possible as well, using filters, either on the shipper or on the collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc.) parsed out into individual fields so they can be searched on more easily. Information can be dropped if it isn’t important. Sensitive data can be masked. Messages can be tagged. The list goes on. For example:

    input {
      file {
        path => ["/var/log/apache.log"]
        type => "saurzcode_apache_logs"
      }
    }
    filter {
      grok {
        match => ["message", "%{COMBINEDAPACHELOG}"]
      }
    }
    output {
      stdout {}
    }

The above example takes input from an Apache log file, applies a grok filter with %{COMBINEDAPACHELOG}, which splits each Apache log entry into named fields, and finally writes the result to standard output.
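To make the idea of parsing a log entry into individual fields more concrete, here is a small Java sketch that does something similar with a plain regular expression. It is an illustration only: the pattern is a simplified approximation of the Apache combined log format, not the definition of grok's COMBINEDAPACHELOG, and the class name ApacheLogFields and the field names are assumptions chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogFields {

    // Simplified pattern for the Apache "combined" log format, using named groups.
    private static final Pattern COMBINED = Pattern.compile(
            "^(?<clientip>\\S+) (?<ident>\\S+) (?<auth>\\S+) \\[(?<timestamp>[^\\]]+)\\] "
            + "\"(?<verb>\\S+) (?<request>\\S+) HTTP/(?<httpversion>[^\"]+)\" "
            + "(?<response>\\d{3}) (?<bytes>\\d+|-) "
            + "\"(?<referrer>[^\"]*)\" \"(?<agent>[^\"]*)\"$");

    private static final String[] FIELDS = {
        "clientip", "ident", "auth", "timestamp", "verb", "request",
        "httpversion", "response", "bytes", "referrer", "agent"
    };

    // Returns the named fields of one log line, or an empty map if the line does not match.
    public static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = COMBINED.matcher(line);
        if (!matcher.matches()) {
            return fields;
        }
        for (String name : FIELDS) {
            fields.put(name, matcher.group(name));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
                + "\"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        parse(line).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Once a line is broken into fields like these, questions such as "how many requests returned a 404?" become simple lookups instead of text searches, which is the benefit grok provides inside the Logstash pipeline.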
Writing Grok Filters

Writing grok filters to extract information is the only task that requires some serious effort, and if done properly it will give you great insights into your data, such as the number of transactions performed over time or which types of products get the most hits. The links below will help you a lot in writing grok filters and testing them with ease:

Grok Debugger http://grokdebug.herokuapp.com/
Grok Patterns Lookup https://github.com/elasticsearch/logstash/tree/v1.4.2/patterns

References

http://www.elasticsearch.org/overview/
http://logstash.net/
http://rashidkpc.github.io/Kibana/about.html

August 26, 2014

by Saurabh Chhajed

· 47,889 Views · 4 Likes

A Puppet Automation + MySQL Tutorial: Wordpress Install in 7 Short Steps

[This article was written by Koby Nachmany.] If you are familiar with configuration management (aka CM) and automation, you probably know a thing or two about Puppet, and the amazing and rich collection of modules it offers. Puppet Forge contains a wealth of third party modules that enable us to do some pretty nifty stuff with almost no effort. Puppet helps deal with the messy parts of CM, like installing binaries and running installation scripts that are tedious to do manually. Tools such as Puppet were originally created for IT operations people, that are for the most part infrastructure-centric, and are best suited for setup and maintenance of hosts in a physical data center. Dealing with applications and certainly managing applications on an elastic virtualized or even cloudified environment, brings a set of new challenges despite the agility and other benefits it provides. Now imagine we can have this goodness coupled with an intelligent orchestration framework for an entire deployment? In this blog post I'd like to demonstrate how a cloud application orchestrator can complement already existing automation processes powered by configuration management tools, in this case we will demonstrate with Puppet. I will use the nodecellar application and the popular WordPress content management framework as examples. This will hopefully provide a good introduction to Cloudify blueprints. Overview So we've seen how Cloudify 3 allows us to easily orchestrate the "nodecellar" application Read about it Cloudify blueprints here. With the "nodecellar" example, Cloudify deploys a complex application using workflows that map deployment lifecycle events to bash scripts using Cloudify's bash runner plugin. Cloudify's Puppet integration now makes this pretty easy. Cloudify 3.0 - Taking Puppet to the Next Level of Orchestration. Check it out. 
The synergy between Cloudify and Puppet not only allows you to enjoy the benefits of your Puppet environment, but also amplifies its usability by addressing the following common challenges involved with configuration management tools:

Agent Installation: Cloudify provisions your service VMs, installs a Puppet agent (if you like), and wires it up with the Puppet Master. Or, if you choose to run standalone, you can install the agent with the appropriate manifests needed for that service as well.

Order of Dependencies: Define the dependencies between application stacks, services, and infrastructure resources, which will then be launched in that order.

Remote Execution and Updates: Beyond the basic install/uninstall, Cloudify enables customized application workflows that allow you to execute tools like remote shell scripts on a group of instances that belong to a particular service, or on a specific instance in a group. This feature is useful for running maintenance operations, such as snapshots in the case of a database, or code pushes in a continuous deployment model. In addition, you can run puppet apply whenever you feel it's right for your service.

Post Deployment: Once your application is up, Cloudify can integrate your monitoring tool of choice, or you can choose to use the built-in one. A robust policy engine enables auto-healing and even auto-scaling according to your service's required SLA.

I'm now going to take a deep dive into my experience with a WordPress example that I feel is a very good representation of how Puppet and Cloudify work in sync. Let's say we want to deploy the popular WordPress application stack on two VMs.
Something as follows: The flow is quite simple: -server 3.5.1 with the following basic modules installed:

    |-- hunner-wordpress (v0.6.0)
    |-- puppetlabs-apache (v1.0.1) - with php mods enabled
    |-- puppetlabs-mysql (v2.1.0)

Your site.pp file should resemble something like this:

    node /^apache_web.*/ {
      include apache
      class { 'wordpress':
        create_db      => false,
        create_db_user => false,
      }
    }

    node /^mysql.*/ {
      class { '::mysql::server':
        root_password    => 'password',
        override_options => { 'mysqld' => { 'bind_address' => '0.0.0.0' } }
      }
      include mysql::client
      include wordpress
    }

As we can see, we have an Apache PHP application that will likely require a database connection string (IP, port, user, and password). This is where Cloudify facilitates the "gluing" of all the pieces together, by allowing us to inject dynamic or static custom facts into the dependent node (the Apache server). Cloudify supports both standalone agents and PuppetMaster environments.

Step 2: Tweaking the Original WordPress Module

Some minor adaptations to the wordpress init class of the WordPress module will allow us to embed these facts during Puppet agent invocation. Below is a code snippet (with defaults truncated):

    class wordpress (
      $db_host_ip  = $cloudify_related_host_ip,
      $db_user     = $cloudify_properties_db_user,
      $db_password = $cloudify_properties_db_pw,
      . .
    )

And some tweaking to templates/wp-config.php.erb:

    /** MySQL hostname */
    define('DB_HOST', '');

Let's add some tags for finer control of manifest execution. The MySQL node does not require the application part to run on it, so I've excluded it using a Puppet "tag" (read more about Puppet tags). Cloudify, of course, supports this and will provide the appropriate tags during agent invocation.

    -> class { 'wordpress::app':
      tag         => ['postconfigure'],
      install_dir => $install_dir,
      install_url => $install_url,
      version     => $version,
      db_name     => $db_name,
      .
.} Step 3: Creating the Blueprint In a similar way to the "nodecellar" blueprint, first lets create a folder with the name of "wp_puppet" and create a blueprint.yaml file within it. This file will then serve as the blueprint file. Now let's declare the name of this blueprint. blueprint: name: wp_puppet nodes: Now we can start creating the topology. Step 4: Creating VM Nodes Since, in this case I use the OpenStack provider to create the nodes, let's import the "OpenStack types" plugin. imports: - http://www.getcloudify.org/spec/openstack-plugin/1.0/plugin.yaml Since the VMs are the same, I declared a generic template for a VM host: vm_host: derived_from: cloudify.openstack.server properties: - install_agent: true - worker_config: user: ubuntu port: 22 # example for ssh key file (see `key_name` below) # this file matches the agent key configured during the bootstrap key: ~/.ssh/agent.key # Uncomment and update `management_network_name` when working a n neutron enabled openstack - management_network_name: cfy-mng-network - server: image: 8c096c29-a666-4b82-99c4-c77dc70cfb40 flavor: 102 key_name: cfy-agnt-kp security_groups: ['cfy-agent-default', 'wp_security_group'] # This is how we inject the puppet server's ip userdata: | #!/bin/bash -ex grep -q puppet /etc/hosts || echo "x.x.x.x puppet" | sudo -A tee -a /etc/hosts Create the MySQL and Apache VMs: - name: mysql_db_vm type: vm_host instances: deploy: 1 - name: apache_web_vm type: vm_host instances: deploy: 1 Step 5: Declaring Apache and MySQL Servers Since we are using the Puppet plugin to create those servers, first we have to import it: plugins: puppet_plugin: derived_from: cloudify.plugins.agent_plugin properties: url: https://github.com/cloudify-cosmo/cloudify-puppet-plugin/archive/nightly.zip The plugin defines server types as follows: middleware_server, app_server, db_server, web_server, message_bus_server, app_module. 
They are virtually the same, but exist to give the user better readability and GUI visualization. A Puppet server type is derived_from the cloudify.types.server type, but includes some Puppet-specific properties and lifecycle events. For documentation see: Puppet Types. So now we will go ahead and declare the server types:

    cloudify.types.puppet.web_server:
        derived_from: cloudify.types.web_server
        properties:
            # All Puppet related configuration goes inside
            # the "puppet_config" property.
            - puppet_config
        interfaces:
            cloudify.interfaces.lifecycle:
                # Specifically "start" operation. Otherwise tags must be
                # provided.
                - start: puppet_plugin.operations.operation

    cloudify.types.puppet.app_module:
        derived_from: cloudify.types.app_module
        properties:
            - puppet_config
        interfaces:
            cloudify.interfaces.lifecycle:
                - configure: puppet_plugin.operations.operation

    cloudify.types.puppet.db_server:
        derived_from: cloudify.types.db_server
        properties:
            - puppet_config
        interfaces:
            cloudify.interfaces.lifecycle:
                - start: puppet_plugin.operations.operation

Step 6: Instantiating the Apache and MySQL Nodes

Here we provide the Puppet configuration and tags and define the relationships between the nodes. Cloudify's agent will use those relationships to decide which facts to inject.
    - name: apache_web_server
      type: cloudify.types.puppet.web_server
      properties:
          port: 8080
          puppet_config:
              server: puppet
              environment: wordpress_env
      relationships:
          - type: cloudify.relationships.contained_in
            target: apache_web_vm

    - name: wordpress_app
      type: cloudify.types.puppet.app_module
      properties:
          db_user: wordpress
          db_pass: passwd
          puppet_config:
              server: puppet
              tags: ['postconfigure']
              environment: wordpress_env
      relationships:
          - type: cloudify.relationships.contained_in
            target: apache_web_server
          - type: wp_connected_to_mysql
            target: mysql_db_server

    - name: mysql_db_server
      type: cloudify.types.puppet.db_server
      properties:
          db_user: wordpress
          db_pass: passwd
          puppet_config:
              server: puppet
              environment: wordpress_env
      relationships:
          - type: cloudify.relationships.contained_in
            target: mysql_db_vm

Step 7: Upload the Blueprint and Create the Deployment (via CLI or GUI)

Then execute your deployment (via CLI or GUI).

    ubuntu@koby-n-cfy3-cli:~/cosmo_cli$ cfy blueprints upload -b wp9 wordpress/blueprint.yaml
    ubuntu@koby-n-cfy3-cli:~/cosmo_cli$ cfy deployments create -b wp9 -d WordPress_Deployment_1

Step 8: Take a Quick Coffee Break.

Step 9: Enjoy your Orchestrated WordPress Stack!
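To make the role of the blueprint relationships more concrete, here is a minimal Python sketch of how a launch order could be derived from them. It is only an illustration under stated assumptions: the `depends_on` mapping is transcribed by hand from the `relationships` entries above, `launch_order` is a hypothetical helper, and none of this is Cloudify's actual workflow engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each node maps to the nodes it depends on, read off the blueprint's
# `relationships` targets (contained_in and wp_connected_to_mysql).
# This mapping is a hand-made assumption for illustration only.
depends_on = {
    "mysql_db_vm": set(),
    "apache_web_vm": set(),
    "mysql_db_server": {"mysql_db_vm"},
    "apache_web_server": {"apache_web_vm"},
    "wordpress_app": {"apache_web_server", "mysql_db_server"},
}


def launch_order(graph):
    """Return one valid order in which every node follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = launch_order(depends_on)
print(order)
```

Several orders are valid for this graph; what matters is that both VMs come up before the servers they contain, and that `wordpress_app` comes last.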

August 21, 2014

by Sharone Zitzman

· 9,172 Views

Lambda Architecture Principles

"Lambda Architecture" (introduced by Nathan Marz) has gained a lot of traction recently. Fundamentally, it is a set of design patterns for dealing with the batch and real-time data processing workflows that fuel many organizations' business operations. Although I don't see any novel ideas being introduced, it is the first time these principles have been outlined in such a clear and unambiguous manner. In this post, I'd like to summarize the key principles of the Lambda architecture, focusing more on the underlying design principles and less on the choice of implementation technologies, where my preferences may differ from Nathan's.

One important distinction of the Lambda architecture is its clear separation between the batch processing pipeline (i.e., the Batch Layer) and the real-time processing pipeline (i.e., the Real-time Layer). This separation provides a means to localize and isolate the complexity of handling data updates. To handle real-time queries, the Lambda architecture provides a mechanism (i.e., the Serving Layer) to merge data from the Batch Layer and the Real-time Layer and return the latest information to the user.

Data Source Entry

At the very beginning, data flows into the Lambda architecture as follows:

- Transaction data starts streaming in from the OLTP system during business operations. Transaction data ingestion can take the form of records in OLTP systems, text lines in application log files, incoming API calls, or an event queue (e.g. Kafka).
- This transaction data stream is replicated and fed into both the Batch Layer and the Real-time Layer.

Here is an overall architecture diagram for Lambda.

Batch Layer

For storing the ground truth, the "master dataset" is the most fundamental DB; it captures every basic event that happens. It stores data in its most "raw" form (and hence at the finest granularity), which can be used to compute any perspective at any given point in time.
As long as we maintain the correctness of the master dataset, every view derived from it will automatically be correct. Because that correctness is crucial, and to avoid maintenance complexity, the master dataset is "immutable": data can only be appended, while updates and deletes are disallowed. Disallowing changes to existing data completely avoids the complexity of handling conflicting concurrent updates.

Here is a conceptual schema of how the master dataset can be structured. The center green table represents the old, traditional way of storing data in an RDBMS. The surrounding blue tables illustrate how the master dataset can be structured, with some key highlights:

- Data is partitioned by column and stored in different tables. Columns that are closely related can be stored in the same table.
- NULL values are not stored.
- Each data record is associated with a time stamp from which the record is valid.

Notice that every piece of data is tagged with the time stamp at which the data changed (or, more precisely, at which the change record representing the modification was created). The latest state of an object can be retrieved by extracting the version of the object with the largest time stamp.

Although the master dataset stores data at the finest granularity and can therefore be used to compute the result of any query, such computation usually takes a long time if processing starts from the raw form. To speed up query processing, data in various intermediate forms (called batch views) that align more closely with the queries is generated periodically. These batch views (instead of the original master dataset) are used to serve real-time queries. To generate these batch views, the Batch Layer uses a massively parallel, brute-force approach to process the original master dataset.
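The append-only, timestamped master dataset described above can be sketched in a few lines of Python. This is a toy in-memory illustration, not a storage design: the `Fact` record, the `MasterDataset` class, and the `state` method are hypothetical names introduced here, and a real master dataset would live in a distributed store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One immutable change record (hypothetical schema for illustration)."""
    entity_id: str
    attribute: str
    value: object
    timestamp: int  # time from which this value is valid


class MasterDataset:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self):
        self._facts = []

    def append(self, fact):
        self._facts.append(fact)

    def state(self, entity_id, attribute, as_of=None):
        """Return the value with the largest time stamp (optionally <= as_of)."""
        matches = [
            f for f in self._facts
            if f.entity_id == entity_id
            and f.attribute == attribute
            and (as_of is None or f.timestamp <= as_of)
        ]
        if not matches:
            return None
        return max(matches, key=lambda f: f.timestamp).value


master = MasterDataset()
master.append(Fact("user-1", "city", "Boston", timestamp=10))
master.append(Fact("user-1", "city", "Seattle", timestamp=20))
print(master.state("user-1", "city"))            # latest state
print(master.state("user-1", "city", as_of=15))  # state at an earlier time
```

Because nothing is ever overwritten, the same data answers both "what is the value now" and "what was the value at time 15", which is the property the batch views rely on.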
Notice that since data in the master dataset is timestamped, the candidate data can be identified simply as the records with a time stamp later than the last round of batch processing. Although it is less efficient, the Lambda architecture advocates that in each round of batch view generation the previous batch view simply be discarded and the new batch view computed from the master dataset. This simple-minded, compute-from-scratch approach stops error propagation (since errors cannot accumulate), but the processing is not optimized and may take longer to finish, which can increase the "staleness" of the batch view.

Real-time Layer

As discussed above, generating the batch view requires scanning a large volume of the master dataset, which takes a few hours. The batch view will therefore be stale for at least the duration of the processing (i.e., between the start and end of the batch processing), and the maximum staleness can be up to the period between the end of this batch processing and the end of the next (i.e., the batch cycle). The following diagram illustrates this staleness.

Even while the batch view is stale, business operates as usual and transaction data streams in continuously. To answer user queries with the latest, up-to-date information, the business transaction records need to be captured and merged into the real-time view. This is the responsibility of the Real-time Layer. To bring the latency of the latest information close to zero, the merge has to be done incrementally, so that no batching delay is introduced. This makes the real-time view update very different from the batch view update, which can tolerate high latency. The end goal is that the latest information not captured in the batch view is made available in the real-time view.
The logic of the incremental merge on the real-time view is application specific. As a common use case, let's say we want to compute a set of summary statistics (e.g. mean, count, max, min, sum, standard deviation, percentile) of the transaction data since the last batch view update. To compute the sum, we can simply add the new transaction value to the existing sum and write the new sum back to the real-time view. To compute the mean, we can multiply the existing count by the existing mean, add the new transaction value, and then divide by the existing count plus one. To implement this logic, we need to READ data from the real-time view, perform the merge, and WRITE the data back to the real-time view. This requires the real-time serving DB (which hosts the real-time view) to support both random READ and random WRITE.

Fortunately, since the real-time view only needs to store the data for at most one batch cycle, its scale is limited to some degree. Once the batch view update is completed, the Real-time Layer discards the data in the real-time serving DB whose time stamp is earlier than the batch processing. This not only limits the data volume of the real-time serving DB, but also allows any data inconsistency in the real-time view to be cleaned up eventually. This drastically reduces the need for a sophisticated, multi-user, large-scale DB. Many DB systems support multi-user random read/write and can be used for this purpose.

Serving Layer

The Serving Layer is responsible for hosting the batch view (in the batch serving database) as well as the real-time view (in the real-time serving database). Because of their very different access patterns, the batch serving DB has quite different characteristics from the real-time serving DB. As mentioned above, while it is required to support efficient random reads over a large data volume, the batch serving DB doesn't need to support random writes, because data is only bulk-loaded into it.
On the other hand, the real-time serving DB is updated incrementally (and continuously) by the Real-time Layer, and therefore needs to support both random reads and random writes.

To keep the batch serving DB up to date, the Serving Layer needs to check the Batch Layer's progress periodically to determine whether a later round of batch view generation has finished. If so, it bulk-loads the batch view into the batch serving DB. Once the bulk load completes, the batch serving DB contains the latest version of the batch view, and some data in the real-time view has expired and can therefore be deleted. The Serving Layer orchestrates these processes. This purge is especially important for keeping the real-time serving DB small, which limits the complexity of handling real-time, concurrent reads and writes.

To process a real-time query, the Serving Layer splits the incoming query into two sub-queries, forwards them to the batch serving DB and the real-time serving DB, applies application-specific logic to merge their results, and forms a single response to the query. Since the data in the real-time view and the batch view differ from a time stamp perspective, the merge is typically done by concatenating the results. In case of a conflict (same time stamp), the result from the batch view overrides the one from the real-time view.

Final Thoughts

By separating responsibilities into different layers, the Lambda architecture can apply optimization techniques specifically designed for each layer's constraints. For example, the Batch Layer focuses on large-scale data processing using a simple, start-from-scratch approach without worrying about processing latency. The Real-time Layer, on the other hand, picks up where the Batch Layer left off and focuses on low-latency merging of the latest information without worrying about large scale.
Finally, the Serving Layer is responsible for stitching together the batch view and the real-time view to provide the complete picture.

The clear demarcation of responsibilities also allows a different technology stack to be used at each layer, so the system can be tailored more closely to the organization's specific business needs. Nevertheless, using very different mechanisms to update the batch view (i.e., start-from-scratch) and the real-time view (i.e., incremental merge) requires two different algorithm implementations and code bases to handle the same type of data. This can increase the code maintenance effort and can be considered the price to pay for bridging the fundamental gap between the needs for "scalability" and "low latency".

Nathan's Lambda architecture also introduces a set of candidate technologies that he has developed and used in his past projects (e.g. Hadoop for storing the master dataset, Hadoop for generating the batch view, ElephantDB for the batch serving DB, Cassandra for the real-time serving DB, and Storm for generating the real-time view). The beauty of the Lambda architecture is that the choice of technologies is completely decoupled, so I intentionally do not describe any of their details in this post. I do have my own favorites, which are different and will be covered in future posts.
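The incremental merge in the Real-time Layer and the combining step in the Serving Layer can be sketched together in Python. This is a simplified illustration under assumptions: `RealtimeView` and `serve_mean` are hypothetical names, the views hold only aggregates rather than timestamped records, and the purge after each batch load is left out.

```python
class RealtimeView:
    """Count, sum and mean of transactions newer than the last batch view."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.mean = 0.0

    def merge(self, value):
        """READ the current statistics, fold in one transaction, WRITE back."""
        self.mean = (self.count * self.mean + value) / (self.count + 1)
        self.count += 1
        self.total += value


def serve_mean(batch_count, batch_mean, realtime):
    """Combine the batch view and the real-time view into a single answer."""
    n = batch_count + realtime.count
    if n == 0:
        return None
    return (batch_count * batch_mean + realtime.count * realtime.mean) / n


rt = RealtimeView()
for amount in (10.0, 20.0):
    rt.merge(amount)

# Batch view covers 3 older transactions with mean 5.0 (made-up numbers).
print(serve_mean(3, 5.0, rt))
```

Each `merge` touches only the stored aggregates, which is why the real-time serving DB needs random read and write but never has to rescan the raw transactions.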

August 20, 2014

by Ricky Ho

· 12,213 Views

The Programming Challenges of IoT

Pragmatic developers can look at the Internet of Things in two ways: This is amazing. I can only begin to imagine how I can directly improve the world outside the set of networked computer boxes. This is terrifying. If something goes wrong, then it’s on me—and this time the system affected extends outside the set of networked computer boxes. IoT is amazing in the way it bridges physical and virtual environments, but even the phrase “Internet of Things” should give a developer pause. Computers are pretty smart. Things are stupid. IoT tries to put Things online and tries to make them into inter-networked computers. That’s pop-philosophy, but you want to develop in the real world. So what real-world challenges will you face when you shoot for the IoT moon? Two Types of Challenges It seems there are two types of programming challenges for the Internet of Things: Data and control (the comp-sci and networking stuff) Information and business logic (the info-sci and human-computer interaction stuff) For this article, we’re going to talk about the programming problems we can solve around IoT. We’ll start at the bottom (data and control) and work our way up to the big picture (information and business logic). Type 1: Data and Control Challenge 1.1: Power This one is pretty obvious. Many IoT devices are wireless, and no one has invented thumbnail fusion reactors yet. One solution is equally obvious: pick your algorithms carefully. If you can save cycles to perform a given task, then do it. Libraries for implementing power-optimized algorithms will presumably spring up in greater numbers, but even so, you may need to inject some heavy-duty comp-sci know-how into IoT app development. The second solution is more complex than the first. Higher-level developers will have to think more about Dynamic Power Management (DPM), which just means: shutting down devices when they don’t need to be on and starting them up when they do. 
Normally the operating system worries about this, but an IoT application that orchestrates wearables and phones, for example, will know things that each device’s OS won’t—and therefore will be able to switch things on and off more intelligently than each device’s individual OS. Another option is to write or customize an embedded OS. Challenge 1.2: Latency Latency on IoT sits in two places: at the source and in the pipes. The basic problem is a physical one. Thing-chips often have to be small, which means that the chip can only be as powerful as current transistor technology allows. Another problem is power. Many small devices transmit and receive data in discrete active/sleep cycles (think TDMA) in order to save bandwidth and power, but this increases latency inversely to power saved. Another tradeoff is that network topologies optimized for IoT can involve more hops over slower devices. Mesh networks, for example, are immune to the failure of a few nodes. Similarly, “fog” and “edge” computing paradigms relieve Internet infrastructure by doing as much as possible without hub-nodes. The downside is that each node (a) can’t do very much on its own and (b) can only talk to neighboring nodes. The problem in the pipes is a matter of network infrastructure. Simply: the more Things, the less available bandwidth. Infrastructure technology will get faster, but cell networks won’t catch up overnight. And Things, unlike fancier computers, are often supposed to transmit blindly—that is, without anyone necessarily asking them to. This means there’s a massive potential for wasted bandwidth. Challenge 1.3: Unreliability The third challenge flows from the first two. Devices are unreliable–“Things” even more so. The distributed and decentralized virtues of IoT bring their own reliability problems. Here are just a few: Ubiquitous devices are cheap, so they fail more often. Truly ad-hoc connectivity implies ephemeral SLA, so uptime and recovery time may be unclear. 
Loosely controlled devices may have better things to do than give you their data (or computing resources), so concurrency may grow very complex. Less-reliable hardware generates less-reliable information (‘does my outlying datapoint just signify device failure?’), so you may need to chew your data more thoroughly at the application level.

In a sense, IoT decouples low-level (the sub-session layer) from high-level channel capacity, because the distribution of error-sources on IoT is more heavily weighted toward originating or remote nodes. This means more error-correcting for application developers.

Type 2: Information and Business Logic

Challenge 2.1: Vast & Thin Data

Sensors on smartphones are already generating oceans of raw data. These sensors are pretty sophisticated. Every major mobile OS provides a unified, simple API to access clean sensor and geo data. But start grabbing this data and it’s not immediately clear what to do with it. Try to think of killer applications for barometric data (besides weather and elevation with GPS) off the top of your head. Raw sensor data is extremely thin. It doesn’t explain itself, and we haven’t yet produced a complete mapping from physical measurements to business logic, let alone software design. Even if you know what to do with sensor/geo data eventually, you may have to learn new algorithms and data structures to process it immediately. Geo-graphs aren’t CS101 graph data structures (for one thing, edge length is a first-class citizen of geo-graphs). The size of data over IoT is itself a problem. Wireless sensors beget tons of data. All the problems (and opportunities) of Big Data cascade naturally from IoT. Massively distributed computing on IoT devices is an exciting thought, but the toolchain for splitting calculations over a thousand idle Fitbits just isn’t here yet.

Challenge 2.2: Context-Sensitivity

Consider the term “ubiquitous computing,” defined as: what happens when wirelessly connected sensors and actuators, placed more or less everywhere, allow software to interact with much larger swaths of the physical world than just hardware or bare metal. Put ubiquitous computing on the Internet, and IoT makes the software context much larger. This has implications at two basic levels.

At a high computer-architectural level: IoT extends the concept of computing environment well outside the von Neumann machine and weakens the concept of peripheral I/O. In an IoT-world interface, sensors are input and actuators are output. As IoT devices process increasingly at the edge (within individual nodes), the devices that appear peripheral to other nodes are actually doing an awful lot of computation.

At a high business-logic level: the more stuff outside the computer-box affects the program, the less predictable the program behavior becomes at runtime. The same bizarrely-birthed memory leak might slow down the UI in a smartphone context but contribute to a cascading electrical grid failure in an IoT context. This means that IoT demands more self-monitoring and self-repairing code.

Two Types of Solutions

Plenty of researchers are working on ambitious solutions to the programming challenges presented by IoT. Two of the more exciting examples include:

Abstract Task Graph: a data-driven model that maps the network graph to an application graph [1]
Computational REST: replaces content resources with computation resources [2]

There are also a few more strategies you can use right now to solve some of the IoT programming challenges mentioned above.

Reactive Programming

This general purpose paradigm responds to all major application-level challenges and embraces opportunities presented by IoT. The four definitive attributes of a reactive application are: event-driven, scalable, resilient, and responsive [3].
These four are excellent guiding principles for IoT applications at a high, cross-stack level.

Flow-based Programming and the Actor Model

Both present strongly independent components where only messages can affect processes. Both are deeply amenable to concurrency (for example, shared state is discouraged), nondeterminism, and scaling without exponential complexity growth, because components are black boxes. FBP is a bit more pragmatic and restrictive while the actor model is less restrictive and a bit harder to implement. FBP has already been implemented in JavaScript (NoFlo), and the actor model has been implemented in Java (Akka) [4][5][6][7].

What’s important to remember is that there are already tools and techniques that can help you build IoT applications. FBP, actors, and reactive programming all have key attributes for creating applications that leverage the strengths of IoT to overcome its challenges.

[1] https://www.usenix.org/legacy/event/mobisys05/eesr05/tech/full_papers/bakshi/bakshi.pdf
[2] http://isr.uci.edu/tech_reports/UCI-ISR-10-3.pdf
[3] http://www.reactivemanifesto.org/
[4] http://jpaulmorrison.com/fbp/
[5] http://arxiv.org/ftp/arxiv/papers/1008/1008.1459.pdf
[6] http://noflojs.org/
[7] http://akka.io/
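As a rough illustration of the actor model discussed in this article, here is a minimal Python sketch in which a component's state changes only in response to messages taken from its mailbox. It also applies a crude application-level check for the "is my outlying datapoint a device failure?" problem. Everything here is an assumption made for illustration: `SensorActor`, `max_jump`, and the readings are hypothetical, and production systems would use a real actor library such as Akka rather than a bare thread and queue.

```python
import queue
import threading


class SensorActor:
    """A tiny actor: private state, changed only by messages from its mailbox."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self, max_jump):
        self._mailbox = queue.Queue()
        self._max_jump = max_jump  # largest plausible change between readings
        self.accepted = []         # written only by the actor's own thread
        self.rejected = []
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, reading):
        """Asynchronously deliver one reading to the actor."""
        self._mailbox.put(reading)

    def stop(self):
        """Ask the actor to finish queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._handle(message)

    def _handle(self, reading):
        # A sudden jump is treated as a possible device fault, not real data.
        if self.accepted and abs(reading - self.accepted[-1]) > self._max_jump:
            self.rejected.append(reading)
        else:
            self.accepted.append(reading)


actor = SensorActor(max_jump=5.0)
for reading in (20.0, 20.5, 95.0, 21.0):
    actor.send(reading)
actor.stop()
print(actor.accepted, actor.rejected)
```

Because callers can only `send` messages, no locking is needed around the actor's lists; reading them is safe here only because `stop` has already joined the actor's thread.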

August 14, 2014

by John Esposito

· 16,368 Views

6 Rules of Thumb for MongoDB Schema Design: Part 3

By William Zola, Lead Technical Support Engineer at MongoDB

This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization. Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated. Read part two if you’ve missed it.

Whoa! Look at All These Choices!

So, to recap:

You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques.
You can denormalize as many fields as you like into the “one” side or the “N” side.

Denormalization, in particular, gives you a lot of choices: if there are 8 candidates for denormalization in a relationship, there are 2^8 (256) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have more than 750 different ways to model the relationship. Guess what? You now are stuck in the “paradox of choice”: because you have so many potential ways to model a “one-to-N” relationship, your choice on how to model it just got harder. Lots harder.

Rules of Thumb: Your Guide Through the Rainbow

Here are some “rules of thumb” to guide you through these indenumberable (but not infinite) choices:

One: Favor embedding unless there is a compelling reason not to.
Two: Needing to access an object on its own is a compelling reason not to embed it.
Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Four: Don’t be afraid of application-level joins. If you index correctly and use the projection specifier (as shown in part 2), then application-level joins are barely more expensive than server-side joins in a relational database.
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization; if you denormalize a field that is updated frequently, then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
Six: As always with MongoDB, how you model your data depends entirely on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your Guide To The Rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
Do you need to access the object on the “N” side separately, or only in the context of the parent object?
What is the ratio of updates to reads for a particular field?

Your main choices for structuring the data are:

For “one-to-few”, you can use an array of embedded documents.
For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side.

Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side.
You’d do this only for fields that are frequently read, that get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and not atomic.

Productivity and Flexibility

The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.
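The structural rules of thumb above can be condensed into a small decision helper. The following Python sketch is only a way of reading those rules, not an official MongoDB recommendation: the function names are hypothetical, and the numeric cut-offs (200 for "a couple of hundred", 2,000 for "a few thousand", and a 10:1 read/write ratio) are assumptions chosen for illustration.

```python
def choose_model(n_side_count, needs_standalone_access=False,
                 embed_limit=200, reference_limit=2000):
    """Suggest a One-to-N structure from cardinality and access pattern.

    embed_limit and reference_limit are assumed stand-ins for "a couple of
    hundred" and "a few thousand" documents on the "N" side.
    """
    if n_side_count > reference_limit:
        # "one-to-squillions": arrays must not grow without bound
        return "parent-reference"
    if n_side_count > embed_limit or needs_standalone_access:
        # "one-to-many", or the "N" side must stand alone
        return "array-of-references"
    # "one-to-few": favor embedding unless there is a reason not to
    return "embed"


def worth_denormalizing(reads, writes, min_ratio=10):
    """Denormalize only fields read far more often than they are written.

    min_ratio is an assumed threshold; pick it from your own workload.
    """
    if writes == 0:
        return reads > 0
    return reads / writes >= min_ratio


print(choose_model(12))
print(choose_model(12, needs_standalone_access=True))
print(choose_model(5_000_000))
print(worth_denormalizing(reads=10_000, writes=20))
```

Rule Six still applies: the right thresholds depend on your application’s actual query and update patterns, so treat the defaults as placeholders to be tuned.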

August 13, 2014

by Francesca Krihely

· 8,756 Views

An Early Mover's Guide to the Internet of Things

[This article was written by Andreea Borca, developer of patient-empowering solutions for the healthcare industry, co-host of Farstuff: The IoT Podcast, and featured author in DZone's 2014 Guide to Internet of Things]. The creation of the Internet was a significant shift in the way people acquire information, interact with each other, and make decisions. Now, the Internet is expanding its reach to a range of devices that can gather and analyze physical data and react to that data in a variety of applications that we’ve never seen before. This “Internet of Things” marks another dynamic shift in the history of technology. This new stage in the Internet’s evolution is changing it from a tool that we actively need to engage with—deliberately using a browser to access it—to one that passively endows the world around us with a “mind” of its own. We are developing a world where things interact intelligently and cooperate to achieve goals without explicit guidance from human operators. Defining the Internet of Things First, we need to define the Internet of Things (also called “The Internet of Everything” by Cisco). A system falls under the Internet of Things definition if it meets the following criteria, known as the “3 Cs”: 1. It must Connect – to the physical world around itself collecting information, to other things in order to interact with them effectively, to the internet or a network, etc. 2. It must Compute – by processing the inputs it receives in some way and making them meaningful to other systems. 3. It must Communicate – with the network, with other things, and with the user if necessary (more often than not, as you’ll see, communicating to the user may be an unnecessary burden). Challenges for the Internet of Things Efficiency Devices within the Internet of Things (IoT) only need to do the bare minimum necessary to effectively work within the existing ecosystem. 
Many of the newest products rely heavily on the power of your smartphone to connect to the Internet and orchestrate devices, but there is also extensive pressure to reduce the size, energy consumption, and cost of the processing entities within IoT devices. In order to reduce power consumption and manage node outages, there is a concept of daisy-chaining across a network of devices into a more powerful central hub. This is known as mesh networking, and it’s becoming quite popular for IoT systems. Security, Privacy & the need to Share A core requirement of a well-functioning IoT device is to collect, transfer, and store data from a wide variety of sources. As more sensors arrive in cities and healthcare institutions, that increasingly connected information will unavoidably lead to more concern about security and privacy. The debate is still raging over balancing the clear benefits of new discoveries from processing Big Data with the strong personal fear of losing privacy. With IoT now in the picture, there is concern about devices that continuously and passively collect information on users. One recent clash over always-on sensors came with the release of Microsoft’s Xbox One Kinect console, which has a camera that is constantly pointed at your living room. Although the camera itself is not always on, the backlash over that possibility was fierce [1]. Finding this balance will quickly become a requirement for continued progress. Furthermore, the very nature of IoT and the connectivity network necessary for its success does make it particularly vulnerable in certain instances. Devices are especially vulnerable when connected over WiFi, because low tech sensor nodes with minimal computing power tend to be less secure, making them the ideal point of entry for infiltrators. Standards As with all new technologies, the battle over standards is always a struggle. 
Nest, the company that developed the most popular smart home thermostat, and its new owner, Google, are now making significant strides trying to establish the Nest platform as the foundation for all consumer-based IoT devices and their software counterparts [2]. Cisco, Qualcomm, IBM, Microsoft, and most other major players have a similar strategy for creating standard models for approaching the Internet of Things. The pressure to standardize is especially clear when new devices are appearing weekly. ZigBee already has extensive reach as an established standard for many household IoT devices. However, as a preferred codebase has yet to emerge as the standard of choice, it is recommended to connect with major standardization organizations like the IEEE, IETF, and the ZigBee Alliance [3]. Currently, the most common sensor networks use protocols such as Bluetooth Low Energy (BLE), RFID tags, ZigBee, and Wi-Fi. There are also iBeacons, which broadcast over BLE and allow devices like smartphones to better identify their location and potential needs through micro-location that complements GPS.

Opportunities for the Internet of Things

There are numerous prospects to consider when looking to develop IoT products. Given the multi-trillion dollar projections for the future IoT economy, we should take a look at these emerging markets for IoT tech [4].

Consumers

The consumer IoT space has bred a small but growing segment of followers that have invested early into “smart” tech. At this year’s CES, we saw everything from the Babolat Tennis Racket that becomes your personal tennis coach to the Kolibree Toothbrush that monitors your gum health while you brush. The fastest growing consumer IoT segment seems to be in smart home technology, with products such as self-managing refrigerators and resident-sensing door locks.

Commercial

Retailers have already proven adept at collecting a consumer’s shopping history.
With the functionality of in-store beacons, these retailers are eager to personalize your shopping experience in a whole new way. Essentially, each physical shopping trip can now be as littered with targeted ads as any typical online search, much like a scene from the 2002 sci-fi film Minority Report. Walk into a store and instantly the advertising screens on the wall change to address your particular demographic, income level, and shopping preferences. If you’ve connected your Google calendar to certain applications, these screens would show outfits targeting your next big event. Signs on clothing racks sense you coming near and change prices, fully leveraging a custom pricing model that would have economists drooling. And as you try on outfits, the smart mirror in the dressing room recommends accessories or comments on alternatives that might be a better fit for your body type. After all of these IoT events have helped you with your purchase, there’s no need to check out. You’ve registered with the store and there’s a beacon at the exit that registers what you picked up and charges your card automatically as you leave.

Healthcare

With the recent U.S. mandate that all health records must be digital, there has been an explosion in the marketplace of new, patient-centered, smart health devices. The excitement about a healthcare revolution among top innovative companies, incubators, and startups suggests that this trend is not likely to taper off anytime soon. The key areas of focus so far have been: monitoring technologies like wearables (especially passive monitoring), function-improving technologies, education, and notification technologies. Wearables are generally the first consumer touch point in the IoT health sphere. With the popularity of Fitbit pedometers and Withings scales, the market is starting to experiment with internal monitoring and potentially replacing some organs completely in the near future.
A study at Boston University has had incredibly positive results creating an artificial pancreas for Type 1 diabetics by inserting an insulin and glucagon pump that responds when an attached glucometer goes below a certain level, just like an actual pancreas. Proteus, a promising startup out of San Francisco, has created an all-natural microchip in a pill that the patient swallows in order to monitor whether they are remembering to take their medication. The pill sends data to an armband that the user is wearing, which then can send notifications to family members regarding the patient’s status. The most impressive feature is the fact that these chips are powered by the energy in the patient’s digestive system.

Cities, Infrastructure, and Industry

The long-term vision of the future includes technology such as self-driving cars and city lights that alert police when there’s been an accident. In this stage of development, the majority of value is coming from technologies that monitor and collect data in urban settings. From an evolutionary perspective, the IoT city as a whole is still in what many would consider a learning phase. The main objective is to collect as much data as possible, make it available via open APIs, and encourage motivated data analysts to find opportunities for improvement in utility usage, environmental impacts, and service management for larger populations.

This is one area where being an industrialized country like the U.S. may actually impede the ability to progress as quickly as our less established counterparts. Developing countries that haven’t yet built a solid infrastructure allow for the creativity and flexibility to implement sophisticated solutions unfettered by generations of previous development. Silicon Valley powerhouses like Facebook and Google are actively engaged in projects to create a free global Wi-Fi network, and key locations in Africa have allowed them to experiment with these projects.
Being an Early Mover

In the very near future, as more and more things connect to the Internet, Internet connectivity from IoT devices will dwarf the amount of traditional web browsing. The core standards and assumptions that will drive this next revolution in computing technology are still being established and, as a result, building anything that can add value to this exploding industry (software, hardware, devices, sensors, beacons, etc.) is a remarkable opportunity for the right developer. Right now is the time to start contributing to the development of these technologies if you want to be an early mover in IoT.

August 12, 2014

by Benjamin Ball

· 15,100 Views

Distributed Big Balls of Mud

If you want evidence that the software development industry is susceptible to fashion, just go and take a look at all of the hype around microservices. It's everywhere! For some people microservices is "the next big thing", whereas for others it's simply a lightweight evolution of the big SOAP service-oriented architectures that we saw 10 years ago "done right". I do like a lot of what the current microservice architectures are doing, but it's by no means a silver bullet. Okay, I know that sounds obvious, but I think many people are jumping on them for the wrong reason.

I often show this slide in my conference talks, and I've blogged about this before, but basically there are different ways to build software systems. On the one side we have traditional monolithic systems, where everything is bundled up inside a single deployable unit. This is probably where most of the industry is. Caveats apply, but monoliths can be built quickly and are easy to deploy, but they provide limited agility because even tiny changes require a full redeployment. We also know that monoliths often end up looking like a big ball of mud because of the way that software often evolves over time. For example, many monolithic systems are built using a layered architecture, and it's relatively easy for layered architectures to be abused (e.g. skipping "around" a service to call the repository/data access layer directly).

On the other side we have service-based architectures, where a software system is made up of many separately deployable services. Again, caveats apply but, if done well, service-based architectures buy you a lot of flexibility and agility because each service can be developed, tested, deployed, scaled, upgraded and rewritten separately, especially if the services are decoupled via asynchronous messaging. The downside is increased complexity because your software system now has many more moving parts than a monolith.
As Robert says, the complexity is still there, you're just moving it somewhere else.

There is, of course, a mid-ground here. We can build monolithic systems that are made up of in-process components, each of which has an explicit well-defined interface and set of responsibilities. This is old-school component-based design that talks about high cohesion and low coupling, but I usually sense some hesitation when I talk about it. And this seems odd to me. Before I explain why, let me quote something from a blog post that I read earlier this morning about the rationale behind a team adopting a microservices approach.

"When we started building Karma, we decided to split the project into two main parts: the backend API, and the frontend application. The backend is responsible for handling orders from the store, usage accounting, user management, device management and so forth, while the frontend offers a dashboard for users which accesses this API. Along the way we noticed that if the whole backend API is monolithic it doesn't work very well because everything gets entangled."

The blog post also mentions scaling, versioning and multiple languages/frameworks as other reasons to choose microservices. Again, there are no silver bullets here, everything is a trade-off. Anyway, "everything getting entangled" is not a reason to switch from monoliths to microservices. If you're building a monolithic system and it's turning into a big ball of mud, perhaps you should consider whether you're taking enough care of your software architecture. Do you really understand what the core structural abstractions are in your software? Are their interfaces and responsibilities clear too? If not, why do you think moving to a microservices architecture will help? Sure, the physical separation of services will force you to not take some shortcuts, but you can achieve the same separation between components in a monolith.
A little design thinking and an architecturally-evident coding style will help to achieve this without the baggage of going distributed.

Many of the teams I've spoken to are building monolithic systems and don't want to look at component-based design. The mid-ground seems to be a hard sell. I ran a software architecture sketching workshop with a team earlier this year where we diagrammed one of their software systems. The diagram started as a strictly layered architecture (presentation, business services, data access) with all arrows pointing downwards and each layer only ever calling the layer directly beneath it. The code told a different story though and the eventual diagram didn't look so neat anymore. We discussed how adopting a package by component approach could fix some of these problems, but the response was, "meh, we like building software using layers".

It seems as if teams are jumping on microservices because they're sexy, but the design thinking and decomposition strategy required to create a good microservices architecture are the same as those needed to create a well structured monolith. If teams find it hard to create a well structured monolith, I don't rate their chances of creating a well structured microservices architecture. As Michael Feathers recently said, "There's a bit of overhead involved in implementing each microservice. If they ever become as easy to create as classes, people will have a freer hand to create trouble - hulking monoliths at a different scale." I agree. A world of distributed big balls of mud worries me.
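The mid-ground described above, an in-process component with an explicit interface and a hidden implementation, could look roughly like the following Java sketch. Everything in it (the OrdersComponent name, its methods, the in-memory storage) is a hypothetical illustration and is not taken from the Karma system or any other codebase mentioned here.

```java
import java.util.HashMap;
import java.util.Map;

public class ComponentSketch {

    /** The only type other parts of the monolith are meant to depend on. */
    public interface OrdersComponent {
        String placeOrder(String customerId, String item);
        int orderCount(String customerId);
    }

    /** Implementation detail: private, so callers cannot skip "around" the interface. */
    private static final class InMemoryOrdersComponent implements OrdersComponent {
        private final Map<String, Integer> countsByCustomer = new HashMap<>();
        private int nextId = 1;

        @Override
        public String placeOrder(String customerId, String item) {
            // Hypothetical storage; a real component would own its data access here.
            countsByCustomer.merge(customerId, 1, Integer::sum);
            return "order-" + nextId++;
        }

        @Override
        public int orderCount(String customerId) {
            return countsByCustomer.getOrDefault(customerId, 0);
        }
    }

    /** Single entry point for obtaining the component. */
    public static OrdersComponent newOrdersComponent() {
        return new InMemoryOrdersComponent();
    }

    public static void main(String[] args) {
        OrdersComponent orders = newOrdersComponent();
        System.out.println(orders.placeOrder("alice", "router"));
        System.out.println(orders.orderCount("alice"));
    }
}
```

In a package-by-component layout the same effect would come from making the implementation class package-private, so the compiler enforces the boundary that a network hop would otherwise enforce.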

August 4, 2014

by Simon Brown

· 9,297 Views

Glassfish 4 - Performance Tuning, Monitoring and Troubleshooting

This is the third blog in the C2B2 series looking at Glassfish 4. The previous two are available here:

- Part 1 - Getting started with Glassfish 4
- Part 2 - Glassfish 4 - Features For High Availability

In this blog I will look at 3 areas:

- Performance Tuning, where I will look at some of the areas to look at when setting up a system for production.
- Monitoring, where I will look at some of the tools we use for monitoring a system both during performance testing and tuning and once a system is up and running.
- Troubleshooting, where I will look at some of the tools you can use to help diagnose and detect performance issues.

Performance Tuning

Glassfish out of the box (as with most app servers) is optimised for development purposes. Developers want the ability to deploy and undeploy continuously, create and remove resources, debug, etc. However, this configuration is not suitable for a production system. When configuring any application server you have to take into account what you are trying to achieve and what is best suited for the applications you intend to run. One size does not fit all! It can be a long and complex process and I'm afraid I can't give you a one-stop solution. However, I can give you some pointers to some of the things you can do to prepare your system for production.

So, what kind of things do we look at when we are looking to performance tune a Glassfish system? Some of the most common things are:

- JVM Settings
- Garbage Collection
- Glassfish Settings
- Logging

JVM Settings

The standard JVM defaults are not suitable for a production system. One of the simplest changes that can be made is to use the -server flag, rather than the default -client. Although the Server and Client VMs are similar, the Server VM has been specially tuned to maximise peak operating speed. It is intended for executing long-running server applications, which need the fastest possible operating speed more than a fast start-up time or smaller runtime memory footprint.
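To confirm which VM and which start-up flags a process actually picked up, a small check along the following lines can help. This is only a sketch: the class name and the hasFlag helper are made up for illustration, and it reports on the JVM it runs in, so it would have to run inside the server (or be adapted to a remote JMX connection) to say anything about Glassfish itself.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmSettingsCheck {

    /** Returns true if any JVM input argument starts with the given prefix. */
    public static boolean hasFlag(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Flags passed to this JVM at start-up, e.g. -server or -Xmx values.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        // The VM name normally states whether the Server or Client VM is running.
        System.out.println("VM name: " + System.getProperty("java.vm.name"));
        System.out.println("-server passed explicitly: " + hasFlag(jvmArgs, "-server"));
        System.out.println("Max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        System.out.println("All JVM arguments: " + jvmArgs);
    }
}
```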
Allocate more memory to the JVM by modifying the value of the -Xmx flag. How much depends on the size and complexity of your enterprise application and how much memory you have available. In addition we also want to make sure we allocate all of the memory on startup. This is done with the -Xms flag. We set the minimum and maximum perm gen to the same value in order to avoid allocation failures and subsequent full garbage collections.

Garbage Collection

There are a number of settings that can be tweaked regarding Garbage Collection. I'm not going to cover GC tuning as that is a whole topic all of its own, but here are some of the settings we would always recommend regarding GC in a production environment.

Firstly we want to ensure we log all Garbage Collection information as this can prove extremely useful in diagnosing issues.

    -verbose:gc

Next we want to make sure we log GC information to a file. This will make it easier to separate the GC from other details in the log files.

    -Xloggc:/path_to_log_file/gc.log

We also want to ensure we have as much detail as possible

    -XX:+PrintGCDetails

and that the information is timestamped for easier diagnosis of long running errors and to be able to ascertain what normal levels are over time.

    -XX:+PrintGCDateStamps

Finally, we want to ensure that developers aren't making explicit calls to System.gc(). Hopefully they don’t anyway and if they are you need to look into why (doing so is a bad idea since this forces major collections) but this will disable it just in case.

    -XX:+DisableExplicitGC

Heap Dumps

Heap dumps can be extremely useful for diagnosing memory issues. There are two settings we would definitely recommend. These tell the JVM to generate a heap dump when an allocation from the Java heap or the permanent generation cannot be satisfied. There is no overhead in running with these options but they can be useful for production systems where OutOfMemoryErrors can take a long time to surface.
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/dumps/glassfish.hprof

Configuring Glassfish

There are three ways to configure Glassfish:

- Through the admin console
- By directly editing the config files
- Using the asadmin tool

Although making changes through the admin console can often be the easiest way to make changes, we’d recommend scripting all changes where possible so you have a repeatable production server build. You should also ensure copies of all config files are kept in Config Control so you know you have a working copy and can roll back to a previous version when needed.

Turn off development features

Turn off auto-deploy and dynamic application reloading. Both of these features are great for development, but can affect performance. Configure the JSP servlet not to check JSP files for changes on every request. Also, set the parameter genStrAsCharArray to true. This will ensure all String values are declared as static char arrays. One reason for this is that the array has less memory overhead than String. These changes will mean you cannot change JSP pages on your production server without redeploying the application, but on a production system this is generally what you want.

Acceptor Threads and Request Threads

There are two main thread values we would recommend setting, acceptor threads and request threads. Acceptor threads are used to accept new connections to the server and to schedule existing connections when a new request comes in. Set this value equal to the number of CPU cores in your server. So, if you have two quad core CPUs, this value should be set to eight. Request threads run HTTP requests. You want enough of these to keep the machine busy, but not so many that they compete for CPU resources, which would cause your throughput to suffer greatly.

Static resources

By default, GlassFish does not tell the client to cache static resources.
It is recommended to cache static resources, like CSS files and images, particularly if you have a lot of them.

Thread pools

Max thread pool and min pool size should be set to the same value. Specifying the same value will allow GlassFish to use a slightly more optimised thread pool. This configuration should be considered unless the load on the server varies significantly. Increasing this value will reduce HTTP response latency times. What to set these values to depends heavily on what your application is doing. In order to get this value right you should look to incrementally increase the thread count and to monitor performance after each incremental increase. When performance stops improving, stop increasing the thread count.

Logging

You should look to turn off as much logging as possible. In a production environment we would generally recommend logging at WARN and above. This includes the logging done by Glassfish as well as your own applications.

Monitoring

The fewer monitoring options that are enabled, the better the server's performance. All Glassfish monitoring is turned off by default. Switching monitoring on can be very useful when diagnosing issues, and when doing initial system testing and performance tuning to see the effect of each change.

What to monitor

- Used Heap Size - Compare this number with the maximum allowed heap size to see what portion of the heap is in use. If the used heap size nears the max heap size, the garbage collector urgently attempts to free memory and this is something that should be avoided where possible.
- Number of loaded classes - Useful for detecting performance and application development trends.
- JVM Threads - Important for performance tuning and for troubleshooting JVM crashes. Some of the most essential indicators are the current active JVM thread count and the peak values.
- Thread pools - You should compare a pool's current usage with the maximum number allowed.
  Problems can start to occur when the current count nears the max threads number.

JVM Tools for Monitoring

The following is a list of a few of the tools that come with the JDK that are useful for monitoring information from the JVM.

- jstat - This tool displays performance statistics regarding usage of the perm gen, new gen and old gen. It also provides class loading and compilation statistics.
- jmap - Gives you visibility of memory usage, can produce a class histogram and can dump the memory to a file.
- jconsole/jvisualvm - These tools can display all the previously mentioned monitoring indicators and graph them over time. This allows you to spot trends and to get a better overall picture of your normal performance levels and changes over time. Note - These should NOT be left running permanently on a production system!

Troubleshooting

Unfortunately, no matter how much tuning and testing you do, all systems WILL go wrong from time to time. So, what should you do when your production server bursts into flames? Well, in that situation you should call the fire service, but for more general problems:

- Gather data - Get as much data as you can, there is no such thing as too much!
- Analyse that data - Data is worthless when you don’t know what it means.
- Visualise where possible - Graphs and charts reveal trends and patterns over time.
- Make educated decisions - Only make decisions based on data. If you go with your “gut instinct” and what “feels right” you will probably make things worse.

Gathering data

First up, for most of the JVM tools you will need the process ID of the server. You can get this information in various ways. Two of the simplest are:

    jps -v

This will list all current running Java processes. The -v flag is for verbose output.

    ps aux | grep glassfish

The ps command with the options aux will show all processes from all users.
This will display a LOT of information, so pipe it through grep to filter for the glassfish process.

As mentioned earlier, the jstat tool can be used for gathering info on JVM performance. Other useful tools include:

- jstack - This will produce thread stack dumps for all threads running in the JVM. This can be very useful for discovering stuck threads or long running threads.
- jmap - This tool can be used to create a heap dump. It outputs to a file in .hprof format which can be read by a number of analysis tools.
- jrcmd and jrmc - These tools are only available with the jRockit JDK. I won't go into any detail here as I have previously blogged about jrcmd here: http://blog.c2b2.co.uk/2012/11/troubleshooting-jrockit-using-jrcmd.html and my colleague has blogged about jrmc here: http://blog.c2b2.co.uk/2012/10/weblogic-troubleshooting-with-jrockit.html

Glassfish asadmin

The Glassfish asadmin tool has a built in command which will provide similar functionality to the above tools but without the need for the PID.

    asadmin generate-jvm-report --type=[type]

Analysing the data

There are various tools available for analysing performance data. The following are some of the most useful:

- IBM Support Assistant is a free troubleshooting application that helps you research, analyze, and resolve problems using various support features and tools. It contains a Garbage Collection and Memory Visualiser as well as a Heap Analyser. It will also provide a report telling you where issues might exist, and listing red flags with advice on what to change in your applications.
- jRockit Mission Control is a very powerful tool which can be used to monitor live systems or analyse historical data in the form of flight recordings.
- JVisualVM GCViewer is an optional plugin for jVisualVM which can transform a tool which is already great for live monitoring into a powerful analysis tool.
- jhat is a Java Heap Analysis Tool. It processes heap dump files and produces HTML reports.
  There are better analysis tools, but it’s always freely available if you’re running a JDK.
- Others - There are many open source and freely available tools and projects to help you. Here we’ve covered some very common and widely used ones, but our list is by no means exhaustive!

Conclusion

Remember, Glassfish out of the box (or out of the zip file!) is not designed to be run 'as is'. You should also note that there is no ideal configuration that will work for all systems. It will take time and effort to get the best configuration for what you require. Hopefully in this blog I have given you some useful guidelines and pointers.

- You should take time to work out what you want in terms of services, then strip back your config to match that.
- You should test, test and test again to ensure that your configuration matches the requirements with regards to the applications you will be running on your server.
- You should tune your JVM to ensure you have the best settings for your particular configuration.
- You should ensure you have monitoring in place to keep a check on everything and ensure that if your server does crash you have as much information as possible at hand to diagnose what caused it.

The next blog in this series looks at Migrating to Glassfish 4: http://blog.c2b2.co.uk/2013/07/glassfish-4-migrating-to-glassfish.html
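As a rough illustration of the indicators listed under "What to monitor" above (used heap versus max heap, loaded classes, current and peak thread counts), the standard java.lang.management beans expose them directly. This is a sketch rather than a Glassfish feature: the class name and the heapUsedFraction helper are made up, and the values describe the JVM the code runs in.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmIndicators {

    /** Fraction of the maximum heap in use, or -1 when the maximum is undefined. */
    public static double heapUsedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() returns -1 if no maximum is defined
        }
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

        // Used heap size compared with the maximum allowed heap size.
        System.out.println("Heap used fraction: " + heapUsedFraction(heap.getUsed(), heap.getMax()));
        // Number of classes currently loaded.
        System.out.println("Loaded classes: " + classes.getLoadedClassCount());
        // Current live thread count and the peak since JVM start (or last reset).
        System.out.println("Live threads: " + threads.getThreadCount());
        System.out.println("Peak threads: " + threads.getPeakThreadCount());
    }
}
```

The same beans are what jconsole and jvisualvm read over JMX, so a periodic log line like this can stand in for them where leaving those tools attached is not an option.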

July 30, 2014

by Andy Overton

· 24,846 Views

Data-driven Unit Testing in Java

Data-driven testing is a powerful way of testing a given scenario with different combinations of values. In this article, we look at several ways to do data-driven unit testing in JUnit.

Suppose, for example, you are implementing a Frequent Flyer application that awards status levels (Bronze, Silver, Gold, Platinum) based on the number of status points you earn. The number of points needed for each level is shown here:

    level     minimum status points    result level
    Bronze    0                        Bronze
    Bronze    300                      Silver
    Bronze    700                      Gold
    Bronze    1500                     Platinum

Our unit tests need to check that we can correctly calculate the status level achieved when a frequent flyer earns a certain number of points. This is a classic problem where data-driven tests would provide an elegant, efficient solution. Data-driven testing is well-supported in modern JVM unit testing libraries such as Spock and Specs2. However, some teams don’t have the option of using a language other than Java, or are limited to using JUnit. In this article, we look at a few options for data-driven testing in plain old JUnit.

Parameterized Tests in JUnit

JUnit provides some support for data-driven tests, via the Parameterized test runner.
A simple data-driven test in JUnit using this approach might look like this:

    import java.util.Arrays;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Parameterized;
    import org.junit.runners.Parameterized.Parameters;
    // Also assumed: static imports for the Status values (Bronze, Silver, ...)
    // and for assertThat from a fluent assertion library.

    @RunWith(Parameterized.class)
    public class WhenEarningStatus {

        // One row per test case: initial status, initial points, earned points, expected status
        @Parameters(name = "{index}: {0} initially had {1} points, earns {2} points, should become {3} ")
        public static Iterable<Object[]> data() {
            return Arrays.asList(new Object[][]{
                    {Bronze, 0,   100,  Bronze},
                    {Bronze, 0,   300,  Silver},
                    {Bronze, 100, 200,  Silver},
                    {Bronze, 0,   700,  Gold},
                    {Bronze, 0,   1500, Platinum},
            });
        }

        private Status initialStatus;
        private int initialPoints;
        private int earnedPoints;
        private Status finalStatus;

        // JUnit calls this constructor once per row of test data
        public WhenEarningStatus(Status initialStatus, int initialPoints, int earnedPoints, Status finalStatus) {
            this.initialStatus = initialStatus;
            this.initialPoints = initialPoints;
            this.earnedPoints = earnedPoints;
            this.finalStatus = finalStatus;
        }

        @Test
        public void shouldUpgradeStatusBasedOnPointsEarned() {
            FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                                                .named("Joe", "Jones")
                                                .withStatusPoints(initialPoints)
                                                .withStatus(initialStatus);

            member.earns(earnedPoints).statusPoints();

            assertThat(member.getStatus()).isEqualTo(finalStatus);
        }
    }

You provide the test data in the form of a list of Object arrays, identified by the @Parameters annotation. These object arrays contain the rows of test data that you use for your data-driven test. Each row is used to instantiate member variables of the class, via the constructor. When you run the test, JUnit will instantiate and run a test for each row of data. You can use the name attribute of the @Parameters annotation to provide a more meaningful title for each test.

There are a few limitations to the JUnit parameterized tests. The most important is that, since the test data is defined at a class level and not at a test level, you can only have one set of test data per test class. Not to mention that the code is somewhat cluttered: you need to define member variables, a constructor, and so forth. Fortunately, there is a better option.
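The article never shows the code under test, so for orientation here is a minimal sketch of the rule the test rows exercise, assuming status depends only on the total number of points and using the thresholds from the table at the start of the article (300, 700, 1500). The StatusLevels class and its forPoints method are hypothetical and are not part of the FrequentFlyer API used in the tests.

```java
public class StatusLevels {

    public enum Status { Bronze, Silver, Gold, Platinum }

    /** Maps a total number of status points to a level, checking the highest threshold first. */
    public static Status forPoints(int totalPoints) {
        if (totalPoints >= 1500) {
            return Status.Platinum;
        }
        if (totalPoints >= 700) {
            return Status.Gold;
        }
        if (totalPoints >= 300) {
            return Status.Silver;
        }
        return Status.Bronze;
    }

    public static void main(String[] args) {
        // Mirrors the row "Bronze, 100, 200, Silver": 100 initial + 200 earned points.
        System.out.println(forPoints(100 + 200));
    }
}
```

Each row of test data then corresponds to one call of this function with initial plus earned points.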
Using JUnitParams

A more elegant way to do data-driven testing in JUnit is to use JUnitParams (https://code.google.com/p/junitparams/). JUnitParams (see Maven Central, http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22JUnitParams%22, to find the latest version) is an open source library that makes data-driven testing in JUnit easier and more explicit. A simple data-driven test using JUnitParams looks like this:

    import junitparams.JUnitParamsRunner;
    import junitparams.Parameters;
    import org.junit.Test;
    import org.junit.runner.RunWith;

    @RunWith(JUnitParamsRunner.class)
    public class WhenEarningStatusWithJUnitParams {

        // Each string is one row; values are converted to the method parameter types
        @Test
        @Parameters({
                "Bronze, 0,   100,  Bronze",
                "Bronze, 0,   300,  Silver",
                "Bronze, 100, 200,  Silver",
                "Bronze, 0,   700,  Gold",
                "Bronze, 0,   1500, Platinum"
        })
        public void shouldUpgradeStatusBasedOnPointsEarned(Status initialStatus, int initialPoints,
                                                           int earnedPoints, Status finalStatus) {
            FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                                                .named("Joe", "Jones")
                                                .withStatusPoints(initialPoints)
                                                .withStatus(initialStatus);

            member.earns(earnedPoints).statusPoints();

            assertThat(member.getStatus()).isEqualTo(finalStatus);
        }
    }

Test data is defined in the @Parameters annotation, which is associated with the test itself, not the class, and passed to the test via method parameters. This makes it possible to have different sets of test data for different tests in the same class, or to mix data-driven tests with normal tests in the same class, which is a much more logical way of organizing your classes.
JUnitParams also lets you get test data from other methods, as illustrated here:

    @Test
    @Parameters(method = "sampleData")
    public void shouldUpgradeStatusFromEarnedPoints(Status initialStatus, int initialPoints,
                                                    int earnedPoints, Status finalStatus) {
        FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                                            .named("Joe", "Jones")
                                            .withStatusPoints(initialPoints)
                                            .withStatus(initialStatus);

        member.earns(earnedPoints).statusPoints();

        assertThat(member.getStatus()).isEqualTo(finalStatus);
    }

    // Referenced by name from the @Parameters annotation above
    private Object[] sampleData() {
        return $(
                $(Bronze, 0,   100, Bronze),
                $(Bronze, 0,   300, Silver),
                $(Bronze, 100, 200, Silver)
        );
    }

The $ method provides a convenient short-hand to convert test data to the Object arrays that need to be returned. You can also externalize the test data into a separate class:

    @Test
    @Parameters(source = StatusTestData.class)
    public void shouldUpgradeStatusFromEarnedPoints(Status initialStatus, int initialPoints,
                                                    int earnedPoints, Status finalStatus) {
        ...
    }

The test data here comes from a method in the StatusTestData class:

    public class StatusTestData {
        public static Object[] provideEarnedPointsTable() {
            return $(
                    $(Bronze, 0,   100, Bronze),
                    $(Bronze, 0,   300, Silver),
                    $(Bronze, 100, 200, Silver)
            );
        }
    }

This method needs to be static, return an object array, and start with the word "provide". Getting test data from external methods or classes in this way opens the way to retrieving test data from external sources such as CSV or Excel files.

JUnitParams provides a simple and clean way to implement data-driven tests in JUnit, without the overhead and limitations of the traditional JUnit parameterized tests.

Testing with non-Java languages

If you are not constrained to Java and/or JUnit, more modern tools such as Spock (https://code.google.com/p/spock/) and Specs2 provide great ways of writing clean, expressive unit tests in Groovy and Scala respectively.
In Groovy, for example, you could write a test like the following:

    import spock.lang.Specification

    class WhenEarningStatus extends Specification {

        def "should earn status based on the number of points earned"() {
            given:
            def member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                                      .named("Joe", "Jones")
                                      .withStatusPoints(initialPoints)
                                      .withStatus(initialStatus)

            when:
            member.earns(earnedPoints).statusPoints()

            then:
            member.status == finalStatus

            // One row per test case
            where:
            initialStatus | initialPoints | earnedPoints | finalStatus
            Bronze        | 0             | 100          | Bronze
            Bronze        | 0             | 300          | Silver
            Bronze        | 100           | 200          | Silver
            Silver        | 0             | 700          | Gold
            Gold          | 0             | 1500         | Platinum
        }
    }

John Ferguson Smart is a specialist in BDD, automated testing, and software life cycle development optimization, and author of BDD in Action and other books. John runs regular courses in Australia, London and Europe on related topics such as Agile Requirements Gathering, Behaviour Driven Development, Test Driven Development, and Automated Acceptance Testing.

July 27, 2014

by John Ferguson Smart

· 24,707 Views · 1 Like

JBoss Data Grid: Installation and Development

In this blog, we will discuss one particular data grid platform from Red Hat, namely JBoss Data Grid (JDG). We will first cover how to access and install this data grid platform, and then demonstrate how to develop and deploy a simple remote client/server data grid application which uses the HotRod protocol. We will be using JDG 6.2, the latest release from Red Hat, in this article.

Installation Overview

To start using JDG, first log on to the Red Hat site https://access.redhat.com/home and download the JDG 6.2 server from the Downloads section of the site. For future reference, it is also useful to download the quickstart and Maven repository zip files. To install JDG, simply unzip the JDG server package into an appropriate directory in your environment.

JDG Overview

In this section, we provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options: running JDG in standalone mode or in clustered mode. We can start JDG in either mode by invoking the standalone or clustered startup scripts in the /bin directory. To configure JDG in either mode, we edit the files standalone.xml and clustered.xml respectively. In our case we will be creating a distributed cache which will run on a 3-node JDG cluster, so we will be using the clustered startup script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate XML configuration file above. Note that the principal difference between the standalone and clustered configuration files is that the clustered configuration file contains a JGroups subsystem element, which allows for communication and messaging between the cache instances running in a JDG cluster.
Development Environment Setup and Configuration

In this section, we will detail how to develop and configure a simple data grid application which will be deployed to a 3-node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG, and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache.

We will first discuss setting up a new distributed cache on a 3-node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. First, we create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping the JDG installation into each directory. We then configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory (and the equivalent directory for the other two servers). We name the cluster configuration files "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively.

We now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each of these files. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To set up a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml":

......... ...... .............. ...... ...... /socket-binding-group>

We will now discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we configure HotRod clients to connect to the JDG server instance on port 11222.
The cache instances are hosted in the cache container named "clusteredcache". We have configured the infinispan core subsystem to use "clusteredcache" as the default cache container, and we allow JMX statistics to be collected for the configured cache entries, i.e. statistics="true". We have created a new distributed cache named "directory-dist-cache" in which two copies of each cache entry are held, on two of the 3 cluster nodes. We have also set up an eviction policy: should there be more than 20 entries in our cache, cache entries will be removed using the LRU algorithm.

We should also configure nodes "server2" and "server3" to start up with a port offset of 100 and 200 respectively by configuring the socket-binding-group element appropriately. Please view the socket bindings noted below. To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry:

...... ...... /socket-binding-group>

To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry:

...... ...... /socket-binding-group>

Before discussing the setup and configuration of the HotRod client which will be used to interact with our clustered JDG HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly.
Open up 3 Windows or Linux consoles and execute the following startup commands:

Console 1:
1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin
2) Execute this command to start the first instance of our JDG cluster, denoted "server1":
    clustered -c=clustered1.xml -Djboss.node.name=server1

Console 2:
1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin
2) Execute this command to start the second instance of our JDG cluster, denoted "server2":
    clustered -c=clustered2.xml -Djboss.node.name=server2

Console 3:
1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin
2) Execute this command to start the third instance of our JDG cluster, denoted "server3":
    clustered -c=clustered3.xml -Djboss.node.name=server3

Provided all 3 JDG instances have started up correctly, you should see output in each console window showing 3 JDG instances in the JGroups view.

HotRod Client Development Setup

Now that the HotRod server is up and running, we need to develop a HotRod Java client which will interact with the clustered server application. The development environment consists of the following tools:

1) JDK HotSpot 1.7.0_45
2) IDE - Eclipse Kepler, Build id: 20130919-0819

The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions:

a) add new cinema objects.
b) add and remove shows for each cinema object.
c) print the list of all cinemas and shows stored in our distributed cache.

The source code can be downloaded from GitHub at https://github.com/davewinters/JDG. We could use Maven here to build and execute our application by configuring the Maven settings.xml to point to the Maven repository files we downloaded earlier and setting up a Maven project file (pom.xml) to build and execute the client application.
In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, complete the following steps:

1) Create a new Java Project in Eclipse.
2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code downloaded from GitHub as mentioned previously.
3) Configure the build path in Eclipse to contain the JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example, on my machine these jar files are located in the directory \server1\jboss-datagrid-6.2.0-server\client\hotrod\java
4) Provided the Eclipse build path has been configured appropriately, the application source should compile without issue.
5) Execute the HotRod application by opening a console window and running the following command. Note that the paths specified here will differ depending on where the JDG client jar files and application class files are located in your environment:

    java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory

6) At runtime, the HotRod client provides the end user with a number of different options to interact with the distributed cache, which are listed in the console window.

Client Application Principal API Details

We will not provide a detailed overview of the HotRod application code; however, we will briefly describe the principal API and code details.
In order to interact with the distributed cache on the JDG cluster using the HotRod protocol, we use the RemoteCacheManager object, which allows us to retrieve a remote reference to the distributed cache. We initialise a Properties object with the list of JDG instances and the associated HotRod server port on each instance. We can then add Cinema objects to the distributed cache using the RemoteCache.put() method.

    private RemoteCacheManager cacheManager;
    private RemoteCache cache;
    .....
    Properties properties = new Properties();
    properties.setProperty(ConfigurationProperties.SERVER_LIST,
            "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422");
    cacheManager = new RemoteCacheManager(properties);
    cache = cacheManager.getCache("directory-dist-cache");
    .....
    cache.put(cinemaKey, cinemalist);

In the accompanying webinar, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/

Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development

In this webinar we look at the basics of setting up JBoss Data Grid, covering installation, configuration and development. We look at practical examples of storing data, viewing the data in the cache and removing it. We also take a look at the different clustered modes and what effect these have on the storage of your data.
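As a small aside on the server list passed to the client above: the three ports follow from the HotRod port 11222 plus the port offsets 0, 100 and 200 configured for the three nodes. The sketch below only does that string arithmetic with the plain JDK; the class and method names are hypothetical and are not part of the JDG or Infinispan API.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of JDG: builds a "host:port;host:port" list
// from the base HotRod port and the per-node port offsets used in this article.
final class HotRodServerList {

    static final int BASE_HOTROD_PORT = 11222;

    static String serverList(String host, int... portOffsets) {
        StringJoiner joiner = new StringJoiner(";");
        for (int offset : portOffsets) {
            joiner.add(host + ":" + (BASE_HOTROD_PORT + offset));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // server1 has no offset, server2 uses 100, server3 uses 200
        System.out.println(serverList("127.0.0.1", 0, 100, 200));
    }
}
```

The resulting string could be used as the value of the SERVER_LIST property, which keeps the client configuration in step with the offsets chosen for each node.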

July 25, 2014

by David Winters

· 16,094 Views

DocFlex/XML - XML Schema Documentation Generator and Toolkit

A powerful multi-format XML schema (XSD) documentation generator and a tool for rapid development of custom XSD documentation generators according to user needs.

Contents: About DocFlex/XML; "XSDDoc" Template Set; Template Processor; Template Designer; Integrations (Generation of XSD Diagrams; Apache Ant & Maven); Links

About DocFlex/XML

DocFlex/XML is a Java-based software system for development and execution of high-performance template-driven documentation generators from any data stored in XML files. The actual doc/report generators are programmed in the form of special templates using a graphic Template Designer, which represents the templates visually in a form resembling the output they generate. The templates are then interpreted by a Template Processor, which takes the XML files as input and produces the resulting documentation from them. This article describes an application of DocFlex/XML to the task of generating high-quality XML schema documentation. That includes the following features of the DocFlex/XML system:

● "XSDDoc" template set, which implements the ready-to-use XML schema documentation generator itself.
● Template Processor, which makes the templates work. Currently, it provides three interchangeable output generators for HTML, RTF and TXT (plain text) formats.
● Template Designer, which provides a high-quality GUI to design/modify templates. If you need a special XML schema doc generator, the simplest way to create it is to modify the standard XSDDoc templates. The Template Designer enables you to do that.
● Integrations with Altova XMLSpy and Oxygen XML Editor. If you are a user of one of those popular XML editors, you can also turn it into a dynamically linked diagramming engine for DocFlex, so as to automatically include the XSD diagrams generated by XMLSpy/OxygenXML in the XML schema documentation generated by DocFlex (with full support of hyperlinks).

"XSDDoc" Template Set

It is the implementation of the XML schema documentation generator itself, which provides the following functionality:

● Generation of single documentation from any number of XML schema (XSD) files together, in particular:
    ● highly navigable framed (Javadoc-like) HTML documentation
    ● single-file HTML documentation
    ● RTF documentation (further convertible to PDF)
● Processing of any referenced XML schemas, in particular:
    ● correct processing of all import, include and redefine elements found across all involved XSD files
    ● automatic loading and processing (i.e. inclusion in the documentation scope) of all directly/indirectly referenced XSD files
● Sophisticated documenting of XSD components, including:
    ● component diagrams (with hyperlinks to everything depicted on them; see also Integrations)
    ● XML representation summary (a textual alternative to diagrams)
    ● lists of related components. For elements this also includes the list of possible containing elements. (Such a list is never present in the output generated by XSLT-based doc generators.)
    ● list of usage locations
● Support of any XML schema design patterns. This comes down mainly to the following:
    ● special treatment of local elements (see below)
    ● support and documenting of substitution groups
    ● support of importing, inclusion and redefinition of schema files
● Special documenting of local elements. Local elements are those components that are declared locally within other XSD components. The W3C XML Schema spec allows you to declare any number of local elements that may share the same name but have different content. That's because their meaning is local and there will be no collisions with other declarations. That, however, creates a problem for documenting, because in the documentation both global and local elements may appear simultaneously in various lists according to their common properties. If each element component is identified only by its name, you will get lists with multiple repeating names but little clue as to what they mean.
Moreover, some XML schemas may contain lots of identical local element declarations (that is, they have both the same name and the same content). So, in those lists you'll get a mess of repeating names, some of which refer to effectively the same entities, whereas others refer to completely different ones. In XSDDoc, those problems are solved in two ways:

● Adding extensions to local element names. The extension provides more information about the element (e.g. where it can be inserted, its global type, or where it is defined). That makes the whole string identifying the element unique. In the generated documentation, the name extension is shown as grey text.
● Unifying local elements by type. In the screenshots, on the left you can see documentation generated with such unification. On the right, all local elements are documented straight as they are. We believe the first documentation (on the left) is easier to understand and use.

● Processing of XHTML markup. You can format your XML schema annotations with XHTML tags, which will be recognized and rendered with the appropriate formatting in both HTML and RTF output, as shown in the screenshots: on the left you can see the XML source of an XML schema whose annotations are heavily laden with XHTML markup (including insertion of images). Next is the HTML documentation generated from that schema. On the right is a page of RTF documentation also generated from that schema.
● Possibility of unlimited customization: XSDDoc is controlled by more than 400 parameters, which allow you to adjust the generated documentation within a huge range of included details. Template parameters serve the same role as options in traditional doc generators. The difference is that the DocFlex template architecture makes the support/implementation of template parameters very cheap (typically, most of the effort goes into writing their descriptions).
So, there may be hundreds of parameters controlling a large template application. If parameters are not enough, you can modify the templates themselves using the Template Designer. In the case of HTML output, you can also apply your own CSS styles to change how the generated documentation looks.

Template Processor

The Template Processor (also called simply "generator") makes everything work. It consists of two logical parts:

1. Template interpreter
2. Output generator

The output generator actually has three different implementations, one for each currently supported output format: HTML, RTF, TXT (plain text). The plain-text output can be used to generate documentation in formats not supported directly by DocFlex. The Template Processor is started directly from the Java command line with the following arguments:

● main template
● template parameters
● initial XSD files to be processed (documented)
● XML catalogs (to redirect physical location of input files)
● destination directory/file
● output format (this selects which output generator will be used)
● output format options (specify settings to control the selected output generator)

Actually, the number of settings may be so large that the Template Processor provides a special GUI to specify everything interactively.

Template Designer

Although DocFlex templates are stored as plain-text files (with an XML-like format), they are not meant to be edited manually. Rather, a special graphic Template Designer must be used, which visualizes the templates in the form of the template components they are made of. Those components are the actual constructs of the template language (not some textual statements, operators, blocks etc.)
The screenshots show templates open in the Template Designer. That approach has a number of advantages, among them:

● The processing structures represented by template components may be displayed in a way that visually expresses what a component does (for instance, it may resemble the output it generates). That representation may be both expressive and compact (after all, it is not just text), which allows you to easily navigate a template, understand what it does and modify anything you need.
● As template components are visual and interactive, they may have a very complex internal structure, for instance, contain lots of properties and nested components. Even so, you don't need to scroll and navigate some kind of enormous text which encodes all of this (as would be the case with a script). Rather, you just need to invoke some property dialogs and expand/collapse some component sections.
● A template component may be easily copied, pasted and deleted as a whole. You don't need to worry about restoring the template syntax afterwards. The Template Designer will also ensure that each component is created, copied or moved only to an allowed place.
● The highly structured nature of templates eliminates the need for most named identifiers. Many connections between different template components are also maintained by the Template Designer (i.e. modified automatically when necessary).
● As template files are stored and read only programmatically, there is no need to know and understand their syntax. There will be no syntax errors either.
● The actual syntax of template files may be optimized not for human programmers, but for faster loading and processing of templates by the Template Processor. There is no need for a compilation phase.
● The separation of template semantics from the particular structure of template files allows for faster and easier evolution of the template language.
The obsolete constructs of older template versions can be automatically converted into new structures. Both old and new templates will look and work up to date.

Integrations

Generation of XSD Diagrams

DocFlex/XML is able to work with any kind of diagrams (i.e. inserting them automatically in the generated output). That is supported on the level of templates, along with the generation of hypertext imagemaps. DocFlex/XML provides no diagramming engine of its own. Instead, it includes integrations with the two most popular XML editors that do generate XSD diagrams:

● Altova XMLSpy
● Oxygen XML Editor

Effectively, the third-party software is used as a dynamically linked diagramming engine. The advantage of such integrations is that when you are a user of one of those XML editors, you will get in the documentation generated by DocFlex the same diagrams as you see in your XML editor.

Apache Ant & Maven

As a pure Java application, DocFlex/XML can be run in any environment that runs Java itself. The Template Processor can be easily integrated with Ant (it can be specified directly in the Ant build file). In the case of Maven, DocFlex/XML includes a simple Maven plugin. It is also possible to use all diagramming integrations with both Ant and Maven.

Links

DocFlex/XML (home page): http://www.filigris.com/docflex-xml/
DocFlex/XML XSDDoc: http://www.filigris.com/docflex-xml/xsddoc/
XSDDoc examples: http://www.filigris.com/docflex-xml/xsddoc/examples/
XMLSpy integration: http://www.filigris.com/docflex-xml/xmlspy/
OxygenXML integration: http://www.filigris.com/docflex-xml/oxygenxml/
Free downloads: http://www.filigris.com/downloads/
This original article: http://www.filigris.com/ann/docflex-xsd/

July 23, 2014

by Leonid Rudy

· 7,653 Views

Building Extremely Large In-Memory InputStream for Testing Purposes

For some reason I needed an extremely large, possibly even infinite InputStream that would simply return the same byte[] over and over. This way I could produce an insanely big stream of data by repeating a small sample. Somewhat similar functionality can be found in Guava: Iterable<T> Iterables.cycle(Iterable<T>) and Iterator<T> Iterators.cycle(Iterable<T>). For example, if you need an infinite source of 0 and 1, simply say Iterables.cycle(0, 1) and get 0, 1, 0, 1, 0, 1... infinitely. Unfortunately I haven't found such a utility for InputStream, so I jumped into writing my own. This article documents the many mistakes I made during that process, mostly due to overcomplicating and overengineering a straightforward solution.

We don't really need an infinite InputStream; being able to create a very large one (say, 32 GiB) is enough. So we are after the following method:

    public static InputStream repeat(byte[] sample, int times)

It basically takes a sample array of bytes and returns an InputStream returning these bytes. However, when sample runs out, it rolls over, returning the same bytes again. This process is repeated the given number of times, until the InputStream signals end. One solution that I haven't really tried, but which seems most obvious:

    public static InputStream repeat(byte[] sample, int times) {
        final byte[] allBytes = new byte[sample.length * times];
        for (int i = 0; i < times; i++) {
            System.arraycopy(sample, 0, allBytes, i * sample.length, sample.length);
        }
        return new ByteArrayInputStream(allBytes);
    }

I see you laughing there! If sample is 100 bytes and we need 32 GiB of input repeating these 100 bytes, the generated InputStream shouldn't really allocate 32 GiB of memory; we must be more clever here. As a matter of fact, repeat() above has another subtle bug. Arrays in Java are limited to 2^31-1 entries (int), and 32 GiB is way above that. The reason this program compiles is a silent integer overflow here: sample.length * times. This multiplication doesn't fit in int.
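To make the overflow concrete, here is a minimal sketch using assumed numbers: a 100-byte sample repeated 343,597,384 times, which is just over 32 GiB. The class and method names are only for illustration. Math.multiplyExact (available since Java 8) is one way to turn the silent wrap-around into an exception.

```java
// Illustrative only: the sample size and repeat count are assumed values
// chosen so that the total is just over 32 GiB.
final class OverflowDemo {

    /** Multiplies in 32-bit arithmetic, as sample.length * times does. */
    static int intProduct(int sampleLength, int times) {
        return sampleLength * times; // wraps around silently on overflow
    }

    /** Widens to long before multiplying, so the true size survives. */
    static long longProduct(int sampleLength, int times) {
        return (long) sampleLength * times;
    }

    public static void main(String[] args) {
        int sampleLength = 100;
        int times = 343_597_384; // roughly 32 GiB worth of 100-byte samples

        System.out.println(intProduct(sampleLength, times));  // prints 32
        System.out.println(longProduct(sampleLength, times)); // prints 34359738400

        try {
            Math.multiplyExact(sampleLength, times); // fails loudly instead of wrapping
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

With these assumed values the int product wraps to 32, so the naive version would size its array at 32 bytes rather than fail at the allocation.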
OK, let's try something that at least theoretically can work. My first idea was as follows: what if I create many ByteArrayInputStreams sharing the same byte[] sample (they don't do an eager copy) and somehow join them together? Thus I needed some InputStream adapter that could take an arbitrary number of underlying InputStreams and chain them together: when the first stream is exhausted, switch to the next one. This awkward moment when you look for something in Apache Commons or Guava and apparently it was in the JDK forever... java.io.SequenceInputStream is almost ideal. However, its two-argument constructor can only chain precisely two underlying InputStreams. Of course, since SequenceInputStream is an InputStream itself, we can use it recursively as an argument to an outer SequenceInputStream. Repeating this process, we can chain an arbitrary number of ByteArrayInputStreams together:

    public static InputStream repeat(byte[] sample, int times) {
        if (times <= 1) {
            return new ByteArrayInputStream(sample);
        } else {
            return new SequenceInputStream(
                    new ByteArrayInputStream(sample),
                    repeat(sample, times - 1)
            );
        }
    }

If times is 1, just wrap sample in a ByteArrayInputStream. Otherwise use SequenceInputStream recursively. I think you can immediately spot what's wrong with this code: too deep recursion. The nesting level is the same as the times argument, which will reach millions or even billions. There must be a better way. Luckily a minor improvement changes the recursion depth from O(n) to O(log n):

    public static InputStream repeat(byte[] sample, int times) {
        if (times <= 1) {
            return new ByteArrayInputStream(sample);
        } else {
            return new SequenceInputStream(
                    repeat(sample, times / 2),
                    repeat(sample, times - times / 2)
            );
        }
    }

Honestly this was the first implementation I tried. It's a simple application of the divide and conquer principle, where we produce the result by evenly splitting it into two smaller sub-problems.
Looks clever, but there is one issue: it's easy to prove we create t (t = times) ByteArrayInputStreams and O(t) SequenceInputStreams. While the sample byte array is shared, millions of InputStream instances are wasting memory. This leads us to an alternative implementation, creating just one InputStream regardless of the value of times:

    import com.google.common.collect.Iterators;
    import org.apache.commons.lang3.ArrayUtils;

    public static InputStream repeat(byte[] sample, int times) {
        final Byte[] objArray = ArrayUtils.toObject(sample);
        final Iterator<Byte> infinite = Iterators.cycle(objArray);
        final Iterator<Byte> limited = Iterators.limit(infinite, sample.length * times);
        return new InputStream() {
            @Override
            public int read() throws IOException {
                return limited.hasNext() ? limited.next() & 0xFF : -1;
            }
        };
    }

We will use Iterators.cycle() after all. But first we have to translate byte[] into Byte[], since iterators can only work with objects, not primitives. There is no idiomatic way to turn an array of primitives into an array of boxed types, so I use ArrayUtils.toObject(byte[]) from Apache Commons Lang. Having an array of objects, we can create an infinite iterator that cycles through the values of sample. Since we don't want an infinite stream, we cut off the infinite iterator using Iterators.limit(Iterator, int), again from Guava. Now we just have to bridge from Iterator<Byte> to InputStream; after all, semantically they represent the same thing.

This solution suffers from two problems. First of all, it produces tons of garbage due to unboxing. Garbage collection is not that much concerned about dead, short-living objects, but it still seems wasteful. The second issue we already faced previously: the sample.length * times multiplication can cause integer overflow. It can't be fixed because Iterators.limit() takes int, not long, for no good reason. BTW we avoided a third problem by doing a bitwise and with 0xFF: otherwise a byte with value -1 would signal end of stream, which is not the case.
x & 0xFF is correctly translated to unsigned 255 (int). So even though the implementation above is short and sweet, declarative rather than imperative, it's too slow and limited. If you have a C background, I can imagine how uncomfortable you were seeing me struggle. After all, the most straightforward, painfully simple and low-level implementation was the one I came up with last:

    public static InputStream repeat(final byte[] sample, final int times) {
        return new InputStream() {
            private long pos = 0;
            private final long total = (long) sample.length * times;

            @Override
            public int read() throws IOException {
                // & 0xFF keeps a byte of value -1 from being mistaken for end of stream
                return pos < total ? sample[(int) (pos++ % sample.length)] & 0xFF : -1;
            }
        };
    }

GC free, pure JDK, fast and simple to understand. Let this be a lesson for you: start with the simplest solution that jumps to your mind, don't overengineer and don't be too smart. My previous solutions (declarative, functional, immutable, etc.) maybe looked clever, but they were neither fast nor easy to understand. The utility we just developed was not just a toy project; it will be used later in a subsequent article.
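As a quick sanity check, the sketch below restates the last approach with the byte masked to the 0..255 range that InputStream.read() is specified to return, and drains a small stream whose sample deliberately contains a 0xFF byte. The RepeatDemo class and the count() helper are hypothetical names added only for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class RepeatDemo {

    /** Same idea as the last repeat() in the article, with the byte masked to 0..255. */
    static InputStream repeat(final byte[] sample, final int times) {
        return new InputStream() {
            private long pos = 0;
            private final long total = (long) sample.length * times;

            @Override
            public int read() throws IOException {
                if (pos >= total) {
                    return -1; // end of stream
                }
                // Without & 0xFF, a (byte) 0xFF would be returned as -1.
                return sample[(int) (pos++ % sample.length)] & 0xFF;
            }
        };
    }

    /** Drains the stream and returns how many bytes were read. */
    static long count(InputStream in) {
        try {
            long n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = {1, (byte) 0xFF, 3};
        System.out.println(count(repeat(sample, 4))); // prints 12
    }
}
```

If the mask were dropped, the count for this sample would stop at 1, because the second byte would be read as -1 and look like end of stream.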

July 23, 2014

by Tomasz Nurkiewicz

· 7,573 Views

5 Reasons to Use a Java Data Grid in Your Application

In this post we explore 5 reasons to use a Java Data Grid for caching Java objects in memory in your applications. In a later post we will explore some of the other data grid capabilities, beyond data storage, that can revolutionize your Java architectures, like on-grid computation and events.

Memory is Fast

Java Data Grids store Java objects in memory. Memory access is fast, with low latency. So if access to data storage, whether disk or database, is the primary bottleneck in your application, then using a data grid as an in-memory cache in front of your storage tier will give you a performance boost.

Scale out your Application Shared State

If you need to share state across JVMs to scale out your application, then using a Java Data Grid rather than a database will increase your scalability. In a typical shared state architecture, the application server tier stores shared Java objects in the data grid, and these objects are available to all application server nodes in your architecture. Separating the data grid tier from the application server tier has a number of advantages:

● Applications can be redeployed and restarted without losing the shared state.
● Data Grid JVMs and Application JVMs can be tuned separately.
● State can be shared across multiple different applications.
● Each tier can be scaled horizontally and separately depending on workload.

Typical use cases for shared state include PCI-compliant storage of card security codes, in-game state in online games, web session data, and prices and catalogues in ecommerce. Anything that needs low latency access can be stored in the shared data grid.

High Availability for In-Memory Data

As well as low latency access and scaling out shared state, Java Data Grids also provide high availability for your in-memory data.
When storing Java objects in a data grid, a primary object is stored in one of the Data Grid JVMs and secondary backup copies of the object are stored in different Data Grid JVM nodes, ensuring that if you lose a node you don't lose any data. Clients of the data grid do not need to know where data is in order to access it, so high availability is transparent to your application.

Scale Out In-Memory Data Volumes

Java objects in data grids aren't fully replicated across all Data Grid JVMs but are stored as a primary object and a secondary object. This means the more Data Grid JVM nodes we add, the more JVM heap we have for storing Java objects in-memory (and remember, memory is fast). For example, if we build a Data Grid with 20 JVMs, each with 4 GB of free heap (after per-JVM overhead), we could theoretically store 80 GB (4 × 20) of shared Java objects. If we assume we have 1 duplicate for high availability, this cuts our storage in half, so we can store 40 GB (0.5 × 4 × 20) of Java objects in memory.

Native Integration with JPA

Java Data Grids have native integration with JPA frameworks like TopLink and Hibernate, whereby the Data Grid can act as a second level cache between JPA and the database. This can give a large performance boost to your database-driven application if the latency associated with database access is a key performance bottleneck.
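The capacity arithmetic from the "Scale Out In-Memory Data Volumes" section can be written down as a tiny calculation. This is only a back-of-the-envelope sketch: the class and method names are invented here, and it assumes every object has the same number of backup copies and that heap is spread evenly across nodes.

```java
public class GridCapacity {

    // Usable capacity in GB: total free heap across the grid, divided by the
    // number of copies kept of each object (1 primary + backupCopies backups).
    public static double usableGb(int nodes, double freeHeapGbPerNode, int backupCopies) {
        double rawGb = nodes * freeHeapGbPerNode;
        return rawGb / (1 + backupCopies);
    }

    public static void main(String[] args) {
        System.out.println(usableGb(20, 4.0, 0)); // prints 80.0 (no backups)
        System.out.println(usableGb(20, 4.0, 1)); // prints 40.0 (one backup per object)
    }
}
```

Real grids also spend heap on indexes, serialization overhead and headroom for garbage collection, so a figure like this is best treated as an upper bound.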

July 22, 2014

by Steve Millidge

· 7,428 Views

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message. I had a data frame:

    n = c(2, 3, 5)
    s = c("aa", "bb", "cc")
    b = c(TRUE, FALSE, TRUE)
    df = data.frame(n, s, b)

And wanted to group and count on column ‘b’ so I’d get back a count of 2 for TRUE and 1 for FALSE. I wrote this code:

    ddply(df, "b", function(x) {
      countr <- length(x$n)
      data.frame(count = count)
    })

which when evaluated gave the following error:

    Error in vector(type, length) : vector: cannot make a vector of mode 'closure'.

It took me quite a while to realise that I’d just made a typo and assigned the count to a variable called ‘countr’ instead of ‘count’. As a result of that typo there was no local variable called ‘count’, so I think R went looking for one elsewhere in the lexical scope. Judging by the mention of mode ‘closure’ in the error, what it found was a function rather than a number, most likely plyr’s own count function. If I’d defined the variable ‘count’ outside the call to the ddply function then my typo wouldn’t have resulted in an error but rather an unexpected result, e.g.

    > count = 10
    > ddply(df, "b", function(x) {
    +   countr <- length(x$n)
    +   data.frame(count = count)
    + })
          b count
    1 FALSE     4
    2  TRUE     4

Once I spotted the typo and fixed it, things worked as expected:

    > ddply(df, "b", function(x) {
    +   count <- length(x$n)
    +   data.frame(count = count)
    + })
          b count
    1 FALSE     1
    2  TRUE     2

July 10, 2014

by Mark Needham

· 8,800 Views

Designing a Data Architecture to Support both Fast and Big Data

Originally written by Scott Jarr for VoltDB.

In post one of this series, we introduced the ideas that a Corporate Data Architecture was taking shape and that working with Fast Data is different from working with Big Data. In the second post we looked at examples of Fast Data and what is required of applications that interact with Fast Data. In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.

The following diagram depicts a basic view of how the “Big” side of the picture is starting to fill out. At the center is a Data Lake, or pool, or reservoir, or… there is no shortage of clever names and debate over what to call it. What is clear is that this is the spot in which the enterprise will dump ALL of its data. This component is not necessarily unique because of its design or functionality, but because it is an enormously cost-effective system to store everything. Essentially, it is a distributed file system on cheap commodity machines. There may or may not be a single winning technology here. It may be HDFS or some other store (maybe S3 if you’re on Amazon), but the point is, this is where all data will go. This platform will:

1. Store data that will be sent to other data management products, and
2. Support frameworks for executing jobs directly against the data in the file system.

Moving around the outside of our Data Lake are the complementary pieces of technology that allow people to gain insight and value from the data stored in the Data Lake. Starting at 12 o’clock in the diagram above and moving clockwise:

BI – Reporting: Data warehouses do an excellent job of reporting, and will continue to offer this capability. Some data will be exported to those systems and temporarily stored there, while other data will be accessed directly from the Data Lake in a hybrid fashion.
These data warehouse systems were specifically designed to run complex report analytics, and do this well.

SQL on Hadoop: There is a lot of innovation here. The goal of many of these products is to displace the data warehouse. Advances have been made with the likes of Hawq and Impala. But make no mistake, there is a long way to go for these systems to get near the speed and efficiency of the data warehouses, especially those with columnar designs. SQL-on-Hadoop systems exist for a couple of important reasons: 1) SQL is still the best way to get at data, and 2) processing can occur without moving big chunks of data around.

Exploratory Analytics: This is the realm of the data scientist. These tools offer the ability to “find” things in data – patterns, obscure relationships, statistical rules, etc. Mahout and R are popular tools in this category.

MapReduce: This is a lazily-named group of all the job scheduling and management tasks that often occur on Hadoop (I really should come up with something more accurate). Many Hadoop use cases today involve pre-processing or cleaning data prior to the use of the analytics tools described above. These are the tools and interfaces that allow that to happen.

ETL of Enterprise Apps: Last, at 6 o’clock, is the ETL process that will help get all the legacy data from our trusty enterprise applications into our data lake that stores everything. These applications will slowly migrate to full-fledged Fast+Big Data apps in time, which I will discuss in a future post. But suffice it to say: once I add sensors to a manufacturing line, I have a Fast+Big Data problem.

OK, we now have analytics … so what? Why do we do analytics in the first place? Simple. We want: better decisions, better personalization, better detection, better …. Interaction. Interaction is what the application is responsible for, and the most valuable improvements come when you can do these interactions accurately and in real time.
This brings us to the second half of the architecture, where we deal with Fast Data to make better, faster real-time applications, depicted in the diagram below.

The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold petabytes of data and generate extensive reports.

The nature of Fast Data produces a number of critical requirements to get the most out of it. These include the ability to: ingest and interact with the data feed; make decisions on each event in the feed; provide visibility into fast-moving data with real-time analytics; integrate seamlessly into the systems designed to store Big Data; and serve analytic results and knowledge from the Big Data systems quickly to users and applications, closing the data loop.

There is no better technology to meet these requirements than an operational database. The challenge we have faced is that there hasn’t been an operational database that can manage this kind of throughput. As a result, there have been a number of Band-Aids people have used to attempt to meet their needs, often giving up capabilities and always adding complexity.

In the next post, I will detail the capabilities I see customers looking for to support their Fast Data applications. Then we will take a look at the results of attempting this solution with a popular alternative, stream processing.

July 9, 2014

by John Piekos

· 14,147 Views

What is Scalable Machine Learning?

Scalability has become one of those core concept slash buzzwords of big data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast.

The terms “scalable” and “large scale” were used in machine learning circles long before there was big data. There had always been certain problems which led to a large amount of data, for example in bioinformatics, or when dealing with large numbers of text documents. So finding learning algorithms, or more generally data analysis algorithms, which can deal with a very large set of data was always a relevant question.

Interestingly, this issue of scalability was seldom solved using actual scaling in machine learning, at least not in the big data kind of sense. Part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. Instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently.

To illustrate this, let’s search the NIPS proceedings (the annual Advances in Neural Information Processing Systems conference, NIPS for short, is one of the big ML community meetings) for papers which have the term “scalable” in the title. Here are some examples:

Scalable Inference for Logistic-Normal Topic Models: “… this paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation …” Partially collapsed Gibbs sampling is a kind of estimation algorithm for certain graphical models.
A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks: “… with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours …” Stochastic variational inference is both an approximation and an estimation algorithm.

Scalable Kernels for Graphs with Continuous Attributes: “… in this paper, we present a class of path kernels with computational complexity $O(n^2(m + \delta^2))$ …” This algorithm has a runtime that is quadratic in the number of data points, so it wouldn’t scale out well even if you could.

Usually, even if there is potential for scalability, it is something that is “embarrassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. Still, the actual “scalability” comes from the algorithmic side.

So what do scalable ML algorithms look like? A typical example is the stochastic gradient descent (SGD) class of algorithms. These algorithms can be used, for example, to train classifiers like linear SVMs or logistic regression. One data point is considered at each iteration. The prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller.

Vowpal Wabbit is one program based on this approach, and it has a nice definition of what it considers scalable to mean in machine learning: “There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation.”
So “scalable” means having a learning algorithm which can deal with any amount of data without consuming ever-growing amounts of resources like memory. For SGD-type algorithms this is the case, because all you need to store are the model parameters, usually a few tens to hundreds of thousands of double precision floating point values, so maybe a few megabytes in total. The main problem in speeding up this kind of computation is how to stream the data by fast enough.

To put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out, because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily.

I know that gradient descent is often taken as an example for MapReduce and other approaches, as in this paper on the architecture of Spark, but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. While this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set.

If you want to know more, this large scale learning challenge Sören Sonnenburg organized in 2008 still has valuable information on how to deal with massive data sets.

Of course, there are things which can easily be scaled well using Hadoop or Spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. Another area where parallelization is easy and useful is when you are using cross validation to do model selection, where you usually have to train a large number of models for different parameter sets to find the combination which performs best.
Again, even here there is potential for speeding up such computations further using better algorithms, like in this paper of mine.

I’ve just scratched the surface of this, but I hope you got the idea that scalability can mean quite different things. In big data (meaning the infrastructure side of it), what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. In machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. Often, this allows you to speed up your computations significantly by decoupling computations. Parallelization is important, too, but alone it won’t get you very far.

Luckily, there are projects like Spark and Stratosphere/Flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.
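To make the stochastic gradient descent description above more concrete, here is a minimal sketch of SGD for logistic regression. It is not how Vowpal Wabbit or any particular library implements it; the class name, the synthetic data stream (label 1 whenever x0 + x1 > 0) and the fixed learning rate are all assumptions chosen for illustration. The point to notice is that the only state kept between data points is the weight vector.

```java
import java.util.Random;

public class SgdSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD step: look at a single point (x, y), compute the prediction
    // error, and move the weights a little against the gradient of the log loss.
    static void step(double[] w, double[] x, int y, double learningRate) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        double error = sigmoid(z) - y;
        for (int i = 0; i < w.length; i++) {
            w[i] -= learningRate * error * x[i];
        }
    }

    public static int predict(double[] w, double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return z >= 0 ? 1 : 0;
    }

    // Streams synthetic points past the model; memory use does not depend on
    // how many points are processed, only on the number of weights.
    public static double[] train(long seed, int steps) {
        Random rnd = new Random(seed);
        double[] w = new double[3]; // two features plus a bias term
        for (int t = 0; t < steps; t++) {
            double x0 = rnd.nextDouble() * 2 - 1;
            double x1 = rnd.nextDouble() * 2 - 1;
            int y = (x0 + x1 > 0) ? 1 : 0; // made-up labelling rule
            step(w, new double[] {x0, x1, 1.0}, y, 0.1);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] w = train(42L, 20000);
        System.out.println(predict(w, new double[] {0.8, 0.7, 1.0}));
        System.out.println(predict(w, new double[] {-0.8, -0.7, 1.0}));
    }
}
```

The batch variant mentioned in the text would instead sum the gradient over every point before touching the weights, which is what makes it easy to parallelize and slow to converge.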

July 3, 2014

by Mikio Braun

· 18,570 Views · 1 Like

Making Operations on Volatile Fields Atomic

The expected behaviour for volatile fields is that they should behave in a multi-threaded application the same as they do in a single-threaded application. They are not forbidden to behave the same way, but they are not guaranteed to behave the same way. The solution in Java 5.0+ is to use AtomicXxxx classes; however, these are relatively inefficient in terms of memory (they add a header and padding) and performance (they add a reference and give little control over their relative positions), and syntactically they are not as clear to use. IMHO a simple solution is for volatile fields to act as they might be expected to, the way the JVM must already support in AtomicFields, which is not forbidden in the current JMM (Java Memory Model) but not guaranteed.

Why make fields volatile?

The benefit of volatile fields is that they are visible across threads, and some optimisations which avoid re-reading them are disabled, so you always check the current value again even if you didn't change it. E.g. without volatile:

    Thread 2: int a = 5;
    Thread 1: a = 6;
    (later)
    Thread 2: System.out.println(a); // may still print 5

With volatile:

    Thread 2: volatile int a = 5;
    Thread 1: a = 6;
    (later)
    Thread 2: System.out.println(a); // prints 6

Why not use volatile all the time?

Volatile read and write access is substantially slower. When you write to a volatile field it stalls the entire CPU pipeline to ensure the data has been written to cache. Without this, there is a risk the next read of the value sees an old value, even in the same thread (see AtomicLong.lazySet(), which avoids stalling the pipeline). The penalty can be in the order of 10x, which you don't want to be paying on every access.

What are the limitations of volatile?

A significant limitation is that operations on the field are not atomic, even when you might think they are. Even worse, there is usually no visible difference: the code can appear to work for a long time, even years, and then suddenly and randomly break due to an incidental change such as the version of Java used, or even where the object is loaded into memory, e.g. which programs you loaded before running the program. E.g. updating a value:

    Thread 2: volatile int a = 5;
    Thread 1: a += 1;
    Thread 2: a += 2;
    (later)
    Thread 2: System.out.println(a); // prints 6, 7 or 8

This is an issue because the read of a and the write of a are done separately, and you can get a race condition. 99%+ of the time it will behave as expected, but sometimes it won't.

What can you do about it?

You need to use AtomicXxxx classes. These wrap volatile fields with operations which behave as expected.

    Thread 2: AtomicInteger a = new AtomicInteger(5);
    Thread 1: a.incrementAndGet();
    Thread 2: a.addAndGet(2);
    (later)
    Thread 2: System.out.println(a); // prints 8

What do I propose?

The JVM has a means to behave as expected; the only surprising thing is that you need to use a special class to do what the JMM won't guarantee for you. What I propose is that the JMM be changed to support the behaviour currently provided by the concurrency AtomicClasses. In each case the single-threaded behaviour is unchanged. A multi-threaded program which does not see a race condition will behave the same. The difference is that a multi-threaded program would no longer have to see a race condition, because the underlying behaviour changes.

    current method            suggested syntax                 notes
    x.getAndIncrement()       x++ or x += 1
    x.incrementAndGet()       ++x
    x.getAndDecrement()       x-- or x -= 1
    x.decrementAndGet()       --x
    x.addAndGet(y)            (x += y)
    x.getAndAdd(y)            ((x += y)-y)
    x.compareAndSet(e, y)     (x == e ? x = y, true : false)   Need to add the comma syntax used in other languages.

These operations could be supported for all the primitive types such as boolean, byte, short, int, long, float and double.
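The race on a volatile field described above can be made visible with a small experiment. This is a sketch under stated assumptions: the class name, thread count and iteration count are arbitrary, and how many updates the volatile counter loses (if any) depends on the machine and the run. The AtomicInteger total is the only value that is guaranteed.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterRace {

    static volatile int plain = 0;                      // visible across threads, but += is not atomic
    static final AtomicInteger atomic = new AtomicInteger();

    // Runs the given number of threads, each incrementing both counters perThread times.
    public static void run(int threads, int perThread) throws InterruptedException {
        plain = 0;
        atomic.set(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    plain += 1;               // separate read and write: updates can be lost
                    atomic.incrementAndGet(); // single atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        run(4, 100_000);
        System.out.println("volatile int:  " + plain);        // may be less than 400000
        System.out.println("AtomicInteger: " + atomic.get()); // always 400000
    }
}
```

On a single-core machine or with very short loops the volatile counter may happen to come out right, which is exactly the "works for years, then breaks" behaviour the article warns about.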
Additional assignment operators could be supported, such as:

    current method            suggested syntax   notes
    Atomic multiplication     x *= 2;
    Atomic subtraction        x -= y;
    Atomic division           x /= y;
    Atomic modulus            x %= y;
    Atomic shift              x <<= y;
    Atomic shift              x >>= z;
    Atomic shift              x >>>= w;
    Atomic and                x &= ~y;           clears bits
    Atomic or                 x |= z;            sets bits
    Atomic xor                x ^= w;            flips bits

What is the risk?

This could break code which relies on these operations occasionally failing due to race conditions. It might not be possible to support more complex expressions in a thread-safe manner. This could lead to surprising bugs, as the code can look like it works when it doesn't. Nevertheless, it will be no worse than the current state.

JEP 193 - Enhanced Volatiles

There is a JEP, JEP 193, to add this functionality to Java. An example is:

    class Usage {
        volatile int count;

        int incrementCount() {
            return count.volatile.incrementAndGet();
        }
    }

IMHO there are a few limitations in this approach. The syntax is a fairly significant change; changing the JMM might not require many changes to the Java syntax, and possibly no changes to the compiler. It is a less general solution; it can be useful to support operations like volume += quantity; where these are double types. It places more burden on the developer to understand why he/she should use this instead of x++;

Conclusion

Much of the syntactic and performance overhead of using AtomicInteger and AtomicLong could be removed if the JMM guaranteed that the equivalent single-threaded operations behaved as expected for multi-threaded code. This feature could be added to earlier versions of Java by using byte code instrumentation.
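For comparison, here is roughly what the operations in the tables above look like with classes that already exist in the JDK (Java 8). This is a sketch, not the JEP 193 proposal: the class name UsageSketch and its methods are invented for illustration, and the double addition uses a hand-written compare-and-set loop over the raw bits, which is one common workaround rather than the only one.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;
import java.util.concurrent.atomic.AtomicLong;

public class UsageSketch {

    // The field stays a plain volatile int; no wrapper object per instance.
    volatile int count;

    private static final AtomicIntegerFieldUpdater<UsageSketch> COUNT =
            AtomicIntegerFieldUpdater.newUpdater(UsageSketch.class, "count");

    // Doubles have no atomic class of their own, so the bits are kept in an AtomicLong.
    private final AtomicLong volumeBits = new AtomicLong(Double.doubleToRawLongBits(0.0));

    // Equivalent of an atomic ++count.
    public int incrementCount() {
        return COUNT.incrementAndGet(this);
    }

    // Equivalent of an atomic count *= factor.
    public int multiplyCount(int factor) {
        return COUNT.updateAndGet(this, v -> v * factor);
    }

    // Equivalent of an atomic count &= ~mask (clears bits).
    public int clearBits(int mask) {
        return COUNT.updateAndGet(this, v -> v & ~mask);
    }

    // Equivalent of an atomic volume += quantity for a double.
    public double addVolume(double quantity) {
        while (true) {
            long oldBits = volumeBits.get();
            double updated = Double.longBitsToDouble(oldBits) + quantity;
            if (volumeBits.compareAndSet(oldBits, Double.doubleToRawLongBits(updated))) {
                return updated;
            }
            // Another thread changed the value in between; retry with the new value.
        }
    }

    public static void main(String[] args) {
        UsageSketch u = new UsageSketch();
        System.out.println(u.incrementCount()); // prints 1
        System.out.println(u.multiplyCount(6)); // prints 6
        System.out.println(u.clearBits(2));     // prints 4
        System.out.println(u.addVolume(1.5));   // prints 1.5
    }
}
```

The amount of ceremony needed for something as small as count *= 6 is a fair illustration of the syntactic overhead the article argues against.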

June 27, 2014

by Peter Lawrey

· 11,943 Views