DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Big Data Topics

article thumbnail
The Near Future of IoT
[This article was written by Sean Lorenz.] Pundits within the technology sphere have been calling 2014 the year of the Internet of Things (IoT). The market revenue potentials are forecasted into the trillions and it’s a Fortune 500 land grab with major companies moving quickly to stake their claims [1]. If this sounds a bit like pages from an American Wild West history book, a frontier analogy isn’t too far off. This is an exciting turning point in technology that—thanks to advances in plummeting sensor costs, wireless communication, and chip size reduction—will soon make today’s futuristic IoT concepts seem humorous in retrospect. While it’s difficult to see where the market is going, given the exponential rate of change in IoT technology, I have noticed several key trends emerging. As a fellow IoT prospector on the frontier, this is my account of the most evident trends as well as some educated predictions for the future. Trends 1. Business Value Over Technology Focus Like any promising new technology still in its infancy stage, the true innovation stems from tech-savvy researchers and tinkerers that build fascinating devices that sometimes have no consumer base–I’m looking at you, robotics market. We have all heard about the smart toothbrushes and smart egg trays coming to market and thought: “Interesting! I wouldn’t buy one, but… sure!” Perhaps the biggest trend is a shift from thinking, “let’s build it because we can” to “what business problem are we solving here?” IoT developers are getting wise to this mentality and building user-focused MVPs (Minimum Viable Products) that will begin hitting the market in late 2014 and early 2015. 2. Keeping It Real At my company, Xively, we often get asked what are the real use cases for the IoT. Many times our customers walk in the door with a vague idea of how connecting their product or service to the Internet would be potentially interesting, but need a little help with seeing how an IoT-enabled product can transform their business—internally and externally. The reason for this is that most of the exciting, transformative elements happen under the hood. Right now, the true “wow” moments in the industry are far from sexy: energy savings in enterprise complexes, CRM & ERP integration, service and support, supply chain efficiencies, product part failure and alert, and so on… you get the idea. Smart homes that respond to our every whim are really great ideas, but these products aren’t integral to our lives yet. Large manufacturing companies and enterprises are using the Internet of Things to manage internal operations and efficiency while also engaging their customers more fully with new IoT data sources aggregated in existing services like Salesforce1 or SAP. 3. Publish-Subscribe The IoT protocol wars are heating up, but allegiances aside, publish-subscribe messaging is what the bulk of implemented models use for connecting devices to the cloud. Pub-sub protocols such as MQTT, CoAP, and AMQP are attractive for connected product development thanks to their ease of scalability and many-to-one/one-to-many possibilities. Given the massive variance of the IoT market, there is bound to be more than one protocol that wins in the end; yet before we get to that point, there are plenty of bugs and vulnerabilities to patch across all of the thriving, open IoT protocols out there. 4. Security Panic! Hacked refrigerators, big box stores, and security cameras… oh my! There has been no shortage of concern for privacy, security, and compliance in the Internet of Things space. Like any news story, some of this attention is warranted and some overblown. Just like your pre-IoT old-fashioned Internet, creating specific application keys and advanced permissioning systems for hardware connecting to the cloud is essential. The amount of nodes at the edge connecting to services across the Internet will be far larger than anything we see now, but IoT platforms are already addressing these complex device lifecycle management issues that are crucial for protecting personal and enterprise information in a connected world. Near-Term Predictions Now lets hop in the DeLorean and look into the future. Rather than focus on five, ten, or twenty years into the future, let’s focus only on the next few years. Why? As I mentioned in the beginning, the IoT landscape changes on a day-to-day basis, so even a prediction looking forward six-months from now can be unreliable. This list contains no self-driving cars or sentient AIs. Instead, it makes some pretty sure bets for what to expect over the horizon. 1. A Household Name Usually the second question after “what’s your name?” at a dinner party is the inevitable “so what do you do?” Mentioning the Internet of Things to non-techies still draws blank stares and looks of confusion. Those looks are justified given the not-so-great marketing name of IoT and the myriad definitions trying to explain what it actually is. Whether it’s called the Internet of Things, Internet of Everything, or just the good ol’ Internet, the concept of connecting any and everything to the Internet will begin to make sense for everyday consumers. 2. Consumers Slow to Adopt Many IoT products are still just toys in many people’s minds. Startups are building products that address problems which most consumers don’t see as a problem yet. This isn’t to say the consumer IoT market will evaporate. It just means we need to get smarter about what customers actually want from smart devices. Today’s wearable products remind me of the Newton—Apple’s infamous PDA. The problem wasn’t the idea, but rather the timing. The Apple Newton seemed clunky, not very powerful, and low on the usability scale. Years later, the iPhone and iPad came along with a set of features and a form factor that customers were looking for. The same feels true of wearables right now—they may need a few more years to incubate before the general public gives two thumbs up. Other consumer IoT markets such as the smart home or driverless cars seem to be in the same situation as the wearables market, but this is changing quickly with major players like Apple and Google moving into these arenas. For example, in the home automation space, frameworks like Apple HomeKit will be essential for unifying disparate protocols and clouds into one application that can handle various products’ data, automating much of the technology and pushing it into the background. I am sure there is a brilliant developer learning Swift and building the first killer smart home app as we speak. 3. Analytics and Automation This prediction probably comes as no surprise, but it is worth stating. Most companies willing to foray into the IoT unknown are, for now, happy with connecting their devices to an external application or cloud service. Having a place to send the data is usually the first step in constructing an IoT system. So what do you do with all this data once you have it? Reporting tools for IoT are just starting to become available, but this is just the tip of the iceberg. The real magic lies in the ability to use exploratory and predictive algorithms to make actionable intelligence a reality. These insights are beneficial to both businesses understanding their customers and to the customers themselves. One could imagine closing the feedback loop between sensor, cloud, and actuator by adding some beautiful supervised machine learning code into the cloud platform at some point in the chain. There are currently a handful of analytics startups focusing on IoT specifically, but this market is about to explode from both platform and application perspectives. 4. IoT Startups Galore For any developers out there interested in the IoT with a real customer pain that needs solving, now is the time to get coding and building that pitch deck. With hardware back en vogue, venture capital funding of IoT-centric companies ison the rise [2]. Having been to a number of IoT events, the amount of enthusiasm by VC and angel investors is palpable. There’s a definite need for developers with great, connected product and service ideas; so, if you haven’t already, I strongly suggest putting on your favorite prospecting gear and exploring the untamed wild west of the Internet of Things. [1] https://internetofeverything.cisco.com/sites/default/files/docs/en/ioe_public_sector_vas_white%20paper_121913final.pdf [2] http://www.cbinsights.com/blog/internet-of-things-investing-snapshot 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW
August 28, 2014
by Benjamin Ball
· 10,519 Views
article thumbnail
How to Setup Realtime Analytics over Logs with ELK Stack
Once we know something, we find it hard to imagine what it was like not to know it. - Chip & Dan Heath, Authors of Made to Stick, Switch Update: I have recently published a book on ELK stack titled - Learning ELK Stack , more details can be found here. What is the ELK stack ? The ELK stack is ElasticSearch, Logstash and Kibana. These three provide a fully working real-time data analytics tool for getting wonderful information sitting on your data. ElasticSearch ElasticSearch,built on top of Apache Lucene, is a search engine with focus on real-time analysis of the data, and is based on the RESTful architecture. It provides standard full text search functionality and powerful search based on query. ElasticSearch is document-oriented/based and you can store everything you want as JSON. This makes it powerful, simple and flexible. Logstash Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use.In ELK Stack logstash plays an important role in shipping the log and indexing them later which can be supplied to Elastic Search. Kibana Kibana is a user friendly way to view, search and visualize your log data, which will present the data stored from Logstash into ElasticSearch, in a very customizable interface with histogram and other panels which provides real-time analysis and search of data you have parsed into ElasticSearch. How Do I Get It ? http://www.elasticsearch.org/overview/elkdownloads/ How Do They Work Together ? Logstash is essentially a pipelining tool. In a basic, centralized installation a logstash agent, known as the shipper, will read input from one to many input sources and output that text wrapped in a JSON message to a broker. Typically Redis, the broker, caches the messages until another logstash agent, known as the collector, picks them up, and sends them to another output. In the common example this output is Elasticsearch, where the messages will be indexed and stored for searching. The Elasticsearch store is accessed via the Kibana web application which allows you to visualize and search through the logs. The entire system is scalable. Many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers. Then many collectors can be reading those messages and writing them to an Elasticsearch cluster. (E)lasticSearch (L)ogstash (K)ibana (The ELK Stack) How Do I Fetch Useful Information Out of Logs? Fetching useful information from logs is one of the most important part of this stack and is being done in logstash using its grok filters and a set of input , filter and output plugins which helps to scale this functionality for taking various kinds of inputs ( file,tcp, udp, gemfire, stdin, unix, web sockets and even IRC and twitter and many more) , filter them using (groks,grep,date filters etc.) and finally write ouput to ElasticSearch,redis,email,HTTP,MongoDB,Gemfire , Jira , Google Cloud Storage etc. A Bit More About Log Stash Filters Transforming the logs as they go through the pipeline is possible as well using filters. Either on the shipper or collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc) parsed out into individual fields so they can be searched on more seamlessly. Information can be dropped if it isn’t important. Sensitive data can be masked. Messages can be tagged. The list goes on. e.g. input { file { path => ["var/log/apache.log"] type => "saurzcode_apache_logs" } } filter { grok { match => ["message","%{COMBINEDAPACHELOG}"] } } output{ stdout{} } Above example takes input from an apache log file applies a grok filter with %{COMBINEDAPACHELOG}, which will index apache logs information on fields and finally output to Standard Output Console. Writing Grok Filters Writing grok filters and fetching information is the only task that requires some serious efforts and if done properly will give you great insights in to your data like Number of Transations performed over time, Which type of products have most hits etc. Below links will help you a lot in writing grok filters and test them with ease - Grok Debugger http://grokdebug.herokuapp.com/ Grok Patterns Lookup https://github.com/elasticsearch/logstash/tree/v1.4.2/patterns References http://www.elasticsearch.org/overview/ http://logstash.net/ http://rashidkpc.github.io/Kibana/about.html
August 26, 2014
by Saurabh Chhajed
· 47,849 Views · 4 Likes
article thumbnail
The Programming Challenges of IoT
Pragmatic developers can look at the Internet of Things in two ways: This is amazing. I can only begin to imagine how I can directly improve the world outside the set of networked computer boxes. This is terrifying. If something goes wrong, then it’s on me—and this time the system affected extends outside the set of networked computer boxes. IoT is amazing in the way it bridges physical and virtual environments, but even the phrase “Internet of Things” should give a developer pause. Computers are pretty smart. Things are stupid. IoT tries to put Things online and tries to make them into inter-networked computers. That’s pop-philosophy, but you want to develop in the real world. So what real-world challenges will you face when you shoot for the IoT moon? Two Types of Challenges It seems there are two types of programming challenges for the Internet of Things: Data and control (the comp-sci and networking stuff) Information and business logic (the info-sci and human-computer interaction stuff) For this article, we’re going to talk about the programming problems we can solve around IoT. We’ll start at the bottom (data and control) and work our way up to the big picture (information and business logic). Type 1: Data and Control Challenge 1.1: Power This one is pretty obvious. Many IoT devices are wireless, and no one has invented thumbnail fusion reactors yet. One solution is equally obvious: pick your algorithms carefully. If you can save cycles to perform a given task, then do it. Libraries for implementing power-optimized algorithms will presumably spring up in greater numbers, but even so, you may need to inject some heavy-duty comp-sci know-how into IoT app development. The second solution is more complex than the first. Higher-level developers will have to think more about Dynamic Power Management (DPM), which just means: shutting down devices when they don’t need to be on and starting them up when they do. Normally the operating system worries about this, but an IoT application that orchestrates wearables and phones, for example, will know things that each device’s OS won’t—and therefore will be able to switch things on and off more intelligently than each device’s individual OS. Another option is to write or customize an embedded OS. Challenge 1.2: Latency Latency on IoT sits in two places: at the source and in the pipes. The basic problem is a physical one. Thing-chips often have to be small, which means that the chip can only be as powerful as current transistor technology allows. Another problem is power. Many small devices transmit and receive data in discrete active/sleep cycles (think TDMA) in order to save bandwidth and power, but this increases latency inversely to power saved. Another tradeoff is that network topologies optimized for IoT can involve more hops over slower devices. Mesh networks, for example, are immune to the failure of a few nodes. Similarly, “fog” and “edge” computing paradigms relieve Internet infrastructure by doing as much as possible without hub-nodes. The downside is that each node (a) can’t do very much on its own and (b) can only talk to neighboring nodes. The problem in the pipes is a matter of network infrastructure. Simply: the more Things, the less available bandwidth. Infrastructure technology will get faster, but cell networks won’t catch up overnight. And Things, unlike fancier computers, are often supposed to transmit blindly—that is, without anyone necessarily asking them to. This means there’s a massive potential for wasted bandwidth. Challenge 1.3: Unreliability The third challenge flows from the first two. Devices are unreliable–“Things” even more so. The distributed and decentralized virtues of IoT bring their own reliability problems. Here are just a few: Ubiquitous devices are cheap, so they fail more often. Truly ad-hoc connectivity implies ephemeral SLA, so uptime and recovery time may be unclear. Loosely controlled devices may have better things to do than give you their data (or computing resources), so concurrency may grow very complex. Less-reliable hardware generates less-reliable information (‘does my outlying datapoint just signify device failure?’), so you may need to chew your data more thoroughly at the application level. In a sense, IoT decouples low-level (the sub-session layer) from high-level channel capacity, because the distribution of error-sources on IoT is more heavily weighted toward originating or remote nodes. This means more error-correcting for application developers. Type 2: Information and Business Logic Challenge 2.1: Vast & Thin Data Sensors on smartphones are already generating oceans of raw data. These sensors are pretty sophisticated. Every major mobile OS provides a unified, simple API to access clean sensor and geo data. But start grabbing this data and it’s not immediately clear what to do with it. Try to think of killer applications for barometric data—besides weather and elevation (with GPS)—off the top of your head. Raw sensor data is extremely thin. It doesn’t explain itself, and we haven’t yet produced a complete mapping from physical measurements to business logic—let alone software design. Even if you know what to do with sensor/geo data eventually, you may have to learn new algorithms and data structures to process immediately. Geo-graphs aren’t CS101 graph data structures (for one thing, edge length is a first-class citizen of geo-graphs). The size of data over IoT is itself a problem. Wireless sensors beget tons of data. All the problems (and opportunities) of Big Data cascade naturally from IoT. Massively distributed computing on IoT devices is an exciting thought, but the toolchain for splitting calculations over a thousand idle Fitbits just isn’t here yet. 2. Context-Sensitivity Consider the term “ubiquitous computing,” defined as: what happens when wirelessly connected sensors and actuators, placed more or less everywhere, allow software to interact with much larger swaths of the physical world than just hardware or bare metal. Put ubiquitous computing on the Internet, and IoT makes the software context much larger. This has implications at two basic levels. At a high computer-architectural level: IoT extends the concept of computing environment well outside the von Neumann machine and weakens the concept of peripheral I/O. In an IoT-world interface, sensors are input and actuators are output. As IoT devices process increasingly at the edge (within individual nodes), the devices that appear peripheral to other nodes are actually doing an awful lot of computation. At a high business-logic level: the more stuff outside the computer-box affects the program, the less predictable the program behavior becomes at runtime. The same bizarrely-birthed memory leak might slow down the UI in a smartphone context but contribute to a cascading electrical grid failure in an IoT context. This means that IoT demands more self-monitoring and self-repairing code. Two Types of Solutions Plenty of researchers are working on ambitious solutions to the programming challenges presented by IoT. Two of the more exciting examples include: Abstract Task Graph—a data-driven model that maps the network graph to an application graph [1] Computational REST—replaces content resources with computation resources [2] There are also a few more strategies you can use right now to solve some of the IoT programming challenges mentioned above. Reactive ProgrammingThis general purpose paradigm responds to all major application-level challenges and embraces opportunities presented by IoT. The four definitive attributes of a reactive application are: event-driven, scalable, resilient, and responsive [3]. These four are excellent guiding principles for IoT applications at a high, cross-stack level. Flow-based Programming and the Actor ModelBoth present strongly independent components where only messages can affect processes. Both are deeply amenable to concurrency (for example, shared state is discouraged), nondeterminism, and scaling without exponential complexity growth, because components are black boxes. FBP is a bit more pragmatic and restrictive while the actor model is less restrictive and a bit harder to implement. FBP has already been implemented in Javascript (NoFlo), and the actor model has been implemented in Java (Akka) [4][5][6]. What’s important to remember is that there are already tools and techniques that can help you build IoT applications. FBP, actors, and reactive programming all have key attributes for creating applications that leverage the strengths of IoT to overcome its challenges. [1] https://www.usenix.org/legacy/event/mobisys05/eesr05/tech/full_papers/bakshi/bakshi.pdf [2] http://isr.uci.edu/tech_reports/UCI-ISR-10-3.pdf [3] http://www.reactivemanifesto.org/ [4] http://jpaulmorrison.com/fbp/ [5] http://arxiv.org/ftp/arxiv/papers/1008/1008.1459.pdf [6] http://noflojs.org/ [7] http://akka.io/ 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW
August 14, 2014
by John Esposito
· 16,330 Views
article thumbnail
An Early Mover's Guide to the Internet of Things
[This article was written by Andreea Borca, developer of patient-empowering solutions for the healthcare industry, co-host of Farstuff: The IoT Podcast, and featured author in DZone's 2014 Guide to Internet of Things]. The creation of the Internet was a significant shift in the way people acquire information, interact with each other, and make decisions. Now, the Internet is expanding its reach to a range of devices that can gather and analyze physical data and react to that data in a variety of applications that we’ve never seen before. This “Internet of Things” marks another dynamic shift in the history of technology. This new stage in the Internet’s evolution is changing it from a tool that we actively need to engage with—deliberately using a browser to access it—to one that passively endows the world around us with a “mind” of its own. We are developing a world where things interact intelligently and cooperate to achieve goals without explicit guidance from human operators. Defining the Internet of Things First, we need to define the Internet of Things (also called “The Internet of Everything” by Cisco). A system falls under the Internet of Things definition if it meets the following criteria, known as the “3 Cs”: 1. It must Connect – to the physical world around itself collecting information, to other things in order to interact with them effectively, to the internet or a network, etc. 2. It must Compute – by processing the inputs it receives in some way and making them meaningful to other systems. 3. It must Communicate – with the network, with other things, and with the user if necessary (more often than not, as you’ll see, communicating to the user may be an unnecessary burden). Challenges for the Internet of Things Efficiency Devices within the Internet of Things (IoT) only need to do the bare minimum necessary to effectively work within the existing ecosystem. Many of the newest products rely heavily on the power of your smartphone to connect to the Internet and orchestrate devices, but there is also extensive pressure to reduce the size, energy consumption, and cost of the processing entities within IoT devices. In order to reduce power consumption and manage node outages, there is a concept of daisy-chaining across a network of devices into a more powerful central hub. This is known as mesh networking, and it’s becoming quite popular for IoT systems. Security, Privacy & the need to Share A core requirement of a well-functioning IoT device is to collect, transfer, and store data from a wide variety of sources. As more sensors arrive in cities and healthcare institutions, that increasingly connected information will unavoidably lead to more concern about security and privacy. The debate is still raging over balancing the clear benefits of new discoveries from processing Big Data with the strong personal fear of losing privacy. With IoT now in the picture, there is concern about devices that continuously and passively collect information on users. One recent clash over always-on sensors came with the release of Microsoft’s Xbox One Kinect console, which has a camera that is constantly pointed at your living room. Although the camera itself is not always on, the backlash over that possibility was fierce [1]. Finding this balance will quickly become a requirement for continued progress. Furthermore, the very nature of IoT and the connectivity network necessary for its success does make it particularly vulnerable in certain instances. Devices are especially vulnerable when connected over WiFi, because low tech sensor nodes with minimal computing power tend to be less secure, making them the ideal point of entry for infiltrators. Standards As with all new technologies, the battle over standards is always a struggle. Nest, the company that developed the most popular smart home thermostat, and its new owner, Google, are now making significant strides trying to establish the Nest platform as the foundation for all consumer-based IoT devices and their software counterparts [2]. Cisco, Qualcomm, IBM, Microsoft, and most other major players have a similar strategy for creating standard models for approaching the Internet of Things. The pressure to standardize is especially clear when new devices are appearing weekly. ZigBee already has extensive reach as an established standard for many household IoT devices. However, as a preferred codebase has yet to emerge as the standard of choice, it is recommended to connect with major standardization organizations like the IEEE, IETF, and the ZigBee Alliance [3]. Currently, the most common sensor networks use protocols such as Bluetooth Low Energy (BLE), RFID tags, ZigBee, and Wi-Fi. There are also iBeacons, which allow devices like smartphones to better identify their location and potential needs with NFC-powered micro-location and GPS technology. Opportunities for the Internet of Things There are numerous prospects to consider when looking to develop IoT products. Given the multi-trillion dollar projections for the future IoT economy, we should take a look at these emerging markets for IoT tech [4]. Consumers The consumer IoT space has bred a small but growing segment of followers that have invested early into “smart” tech. At this year’s CES, we saw everything from the Babolat Tennis Racket that becomes your personal tennis coach to the Kolibree Toothbrush that monitors your gum health while you brush. The fastest growing consumer IoT segment seems to be in smart home technology, with products such as self-managing refrigerators and resident-sensing door locks. Commercial Retailers have already proven adept at collecting a consumer’s shopping history. With the functionality of NFC-powered beacons, these retailers are eager to personalize your shopping experience in a whole new way. Essentially, each physical shopping trip can now be as littered with targeted ads as any typical online search, much like a scene from the 2002 sci-fi film Minority Report. Walk into a store and instantly the advertising screens on the wall change to address your particular demographic, income level, and shopping preferences. If you’ve connected your Google calendar to certain applications, these screens would show outfits targeting your next big event. Signs on clothing racks sense you coming near and change prices, fully leveraging a custom pricing model that would have economists drooling. And as you try on outfits, the smart mirror in the dressing room recommends accessories or comments on alternatives that might be a better fit for your body type. After all of these IoT events have helped you with your purchase, there’s no need to checkout. You’ve registered with the store and there’s a beacon at the exit that registers what you picked up and charges your card automatically as you leave. Healthcare With the recent U.S. mandate that all health records must be digital, there has been an explosion in the marketplace of new, patient-centered, smart health devices. The excitement of a healthcare revolution among top innovative companies, incubators, and startups predicts that this trend is not likely to taper off anytime soon. The key areas of focus so far have been: monitoring technologies like wearables (especially passive monitoring), function improving technologies, education, and notification technologies. Wearables are generally the first consumer touch point in the IoT health sphere. With the popularity of Fitbit pedometers and Withings scales, the market is starting to experiment with internal monitoring and potentially replacing some organs completely in the near future. A study at Boston University has had incredibly positive results creating an artificial pancreas for Type 1 diabetics by inserting an insulin and glucagon pump that responds when an attached glucometer goes below a certain level, just like an actual pancreas. Proteus, a promising startup out of San Francisco, has created an all-natural microchip in a pill that the patient swallows in order to monitor whether they are remembering to take their medication. The pill sends data to an armband that the user is wearing, which then can send notifications to family members regarding the patient’s status. The most impressive feature is the fact that these chips are powered by the energy in the patient’s digestive system. Cities, Infrastructure, and Industry The long-term vision of the future includes technology such as self-driving cars and city lights that alert police when there’s been an accident. In this stage of development, the majority of value is coming from technologies that monitor and collect data in urban settings. From an evolutionary perspective, the IoT city as a whole is still in what many would consider a learning phase. The main objective is to collect as much data as possible, make it available via open APIs, and encourage motivated data analysts to find opportunities for improvement in utility usage, environmental impacts, and service management for larger populations. This is one area where being an industrial country like the U.S. may actually impede the ability to progress as quickly as our less established counterparts. Third world countries that haven’t yet built a solid infrastructure allow for the creativity and flexibility to implement sophisticated solutions unfettered by generations of previous development. Silicon Valley powerhouses like Facebook and Google are actively engaged in projects to create a free global Wi-Fi network, and key locations in Africa have allowed them to experiment with these projects. Being an Early Mover In the very near future, as more and more things connect to the internet, internet connectivity from IoT devices will dwarf the amount of traditional web browsing. The core standards and assumptions that will drive this next revolution in computing technology are still being established and, as a result, building anything that can add value to this exploding industry (software, hardware, devices, sensors, beacons etc.) is a remarkable opportunity for the right developer. Right now is the time to start contributing to the development of these technologies if you want to be an early mover in IoT. 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW
August 12, 2014
by Benjamin Ball
· 15,075 Views
article thumbnail
JBoss Data Grid: Installation and Development
In this blog, we will discuss one particular data grid platform from Redhat namely JBoss Data Grid (JDG). We will firstly cover how to access and install this data grid platform and then we will demonstrate how to develop and deploy a simple remote client/server data grid application which utilises the HotRod protocol. We will be using the latest release JDG 6.2 from Redhat in this article. Installation Overview To start using JDG, firstly log on to the redhat site https://access.redhat.com/home and download the software from the Downloads section of the site. We wish to download JDG 6.2 server by clicking on the appropriate links in the Downloads section. For future reference, it is also useful to download the quickstart and maven repository zip files. To install JDG, we simply unzip the JDG server package into an appropriate directory in your environment. JDG Overview In this section, we will provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options either to run JDG in standalone or clustered mode. We can start JDG in either mode by invoking the stanadalone or clustered start up scripts in the / bin directory. To configure the JDG in either mode we need to configure the files standalone.xml and clustered.xml. In our case we will creating a distributed cache which will run on 3 node JDG cluster so we will be utilizing the clustered startup script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate xml configuration file above. We should also note the principal difference between the standalone and clustered configuration file is that in the clustered configuration file there is a JGroups subsystem configured element which allows for communication and messaging between configured cache instances running in a JDG cluster. Development Environment Setup and Configuration In this section, we will detail how to develop and configure a simple datagrid application which will be deployed to a 3 node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache. We will firstly discuss setting a new distributed cache on a 3 node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. Firstly, we will create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping each JDG installation into each directory. We will now configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory. We will name each of the cluster configuration files as "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively. We will now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each clustered.xml file. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To setup a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml" ......... ...... .............. ...... ...... /socket-binding-group> We will discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we will configure hotrod clients to connect to the JDG server instance on socket 11222. The name of the cache container to host each of the cache instances will be held in the container named "clusteredcache". We have configured the infinispan core subsystem to the default cache container named "clusteredcacahe" whereby we will allow for jmx statistics to be collected relating the configured cache entries i.e statistics="true" We have created a new distributed cache named "directory-dist-cache" whereby there will be two copies of each cache entry held on two of the 3 cluster nodes. We have also set up an eviction policy whereby should there be more than 20 entries in our cache then cache entries will be removed using the LRU algorithm We should have configured nodes "server2" and "server3" to start up with a port offset of 100 and 200 respectively by configuring the socketing binding group element appropriately. Please view the socket bindings noted below. To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry: ...... ...... /socket-binding-group> To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry: ...... ...... /socket-binding-group> Before discussing the setup and configuration of our Hotrod client which will be used to interact with our JDG clustered HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly. Open up 3 Windows or Linux consoles and execute the following start up commands: Console 1: 1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the first instance of our JDG cluster denoted "server1": clustered -c=clustered1.xml -Djboss.node.name=server1 Console 2: 1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the second instance of our JDG cluster denoted "server2": clustered -c=clustered2.xml -Djboss.node.name=server2 Console 3: 1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the third instance of our JDG cluster denoted "server3": clustered -c=clustered3.xml -Djboss.node.name=server3 Providing all 3 JDG instances have started up correctly, you should see output in the console window whereby we can see there are 3 JDG instances in the JGroups view: HotRod Client Development Setup Now that the Hotrod server is up and running, we need to develop a Hotrod Java client which will interact with the clustered server application. The development environment consists of the following tools. 1) JDK Hotspot 1.7.0_45 2) IDE - Eclipse Kepler Build id: 20130919-0819 The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions: a) add new cinema objects. b) add and remove shows to each cinema object. c) print the list of all cinemas and shows stored in our distributed cache. The source code can be downloaded from github @ https://github.com/davewinters/JDG. We could use maven here to build and execute our application by configuring the maven settings.xml to point to the maven repository files we downloaded earlier and set up a maven project file (pom.xml) to build and execute the client application. In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, one should complete the following steps: 1) Create a new Java Project in Eclipse 2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code that has been downloaded from Github mentioned previously. 3) Now we need to configure the build path in Eclipse to contain the appropriate JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example on my machine these jar files are located in the directory: \server1\jboss-datagrid-6.2.0-server\client\hotrod\java 4. Providing the Eclipse build path has been configured appropriately, the application source should compile without issue. 5. We will need to execute the Hotrod application by opening the console window and executing the following command. Note the path specified here will differ depending on where the JDG client jar files and application class files are located in your environment: java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory 6. The Hotrod client at runtime provides the end user with a number of different options to interact with the distributed cache as we can view from the console window below. Client Application Principal API Details We will not provide a detailed overview of the Hotrod application code however we will describe the principal API and code details briefly. In order to interact with the distributed cache on the JDG cluster using the Hotrod protocol, we will use the RemoteCacheManager Object which will allow us to retrieve a remote reference to the distributed cache. We have initialised a Properties object with the list of JDG instances and the associated with HotRod server port on each instance. We can add Cinema objects into the distributed cache using the RemoteCache.put() method. private RemoteCacheManager cacheManager; private RemoteCache cache; ..... Properties properties = new Properties(); properties.setProperty(ConfigurationProperties.SERVER_LIST, "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422"); cacheManager = new RemoteCacheManager(properties); cache = cacheManager.getCache("directory-dist-cache"); ..... cache.put(cinemaKey, cinemalist); In the webinar below, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/ Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development In this webinar we will look at the basics of setting up JBoss Data Grid covering installation, configuration and development. We will look at practical examples of storing data, viewing the data in the cache and removing it. We will also take a look at the different clustered modes and what effect these have on the storage of your data:
July 25, 2014
by David Winters
· 16,074 Views
article thumbnail
5 Reasons to Use a Java Data Grid in Your Application
In this post we explore 5 reasons to use a Java Data Grid for caching Java objects in-memory in your applications. In a later post we will explore some of the other data grid capabilities, beyond data storage, that can revolutionize your Java architectures, like on-grid computation and events. Memory is Fast Java Data Grids store Java objects in memory. Memory access is fast with low latency. So if access to data storage either disk or database is the primary bottleneck in your application then using a data grid as an in-memory cache in front of your storage tier will give you a performance boost. Scale out your Application Shared State If you need to share state across JVMs to scale out your application then using a Java Data Grid rather than a database will increase your scalability. A typical shared state architecture is shown below, the application server tier stores shared Java objects in the data grid and these objects are available to all application server nodes in your architecture. Separating the data grid tier from the application server tier has a number of advantages; Applications can be redeployed and restarted without losing the shared state Data Grid JVMs and Application JVMs can be tuned separately State can be shared across multiple different applications. Each tier can be scaled horizontally separately depending on work load Typical use cases for shared state include; PCI compliant storage of card security codes; In-game state in online games; web session data; prices and catalogues in ecommerce. Anything that needs low latency access can be stored in the shared data grid. High Availability for In-Memory Data As well as low latency access and scaling out shared state. Java Data Grids also provide high availability for your in-memory data. When storing Java objects in a data grid a primary object is stored in one of the Data Grid JVMs and secondary back up copies of the object are stored in different Data Grid JVM node, ensuring that if you lose a node then you don't lose any data. Clients of the data grid do not need to know where data is to access it so high availability is transparent to your application. Scale Out In-Memory Data Volumes Java objects, in data grids, aren't fully replicated across all Data Grid JVMs but are stored as a primary object and a secondary object. This means the more Data Grid JVM nodes we add the more JVM heap we have for storing Java objects in-memory (and remember memory is fast). For example if we build a Data Grid with 20 JVMs each with 4Gb free heap (after per JVM overhead) we could theoretically store 80Gb (4 times 20) of shared Java objects. If we assume we have 1 duplicate for high availability this cuts our storage in half so we can store 40Gb (.5 time 4 times 20 ) of Java Objects in memory. Native Integration with JPA Java Data Grids have native integration with JPA frameworks like TopLink and Hibernate whereby the Data Grid can act as a second level cache between JPA and the database. This can give a large performance boost to your database driven application if latency associated with database access is a key performance bottleneck.
July 22, 2014
by Steve Millidge
· 7,397 Views
article thumbnail
Designing a Data Architecture to Support both Fast and Big Data
Originally written by Scott Jarr for VoltDB. In post one of this series, we introduced the ideas that a Corporate Data Architecture was taking shape and that working with Fast Data is different from working with Big Data. In the second post we looked at examples of Fast Data and what is required of applications that interact with Fast Data. In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big. The following diagram depicts a basic view of how the “Big” side of the picture is starting to fill out. At the center is a Data Lake, or pool or reservoir or…. there is no shortage of clever names and debate over what to call it. What is clear is this is the spot in which the enterprise will dump ALL of its data. This component is not necessarily unique because of its design or functionality, but because it is an enormously cost effective system to store everything. Essentially, it is a distributed file system on cheap commodity machines. There may or may not be a single winning technology here. It may be HDFS or some other store (maybe S3 if you’re on Amazon), but the point is, this is where all data will go. This platform will: 1. Store data that will be sent to other data management products, and 2. Support frameworks for executing jobs directly against the data in the file system. Moving around the outside of our Data Lake are the complementary pieces of technology that allow people to gain insight and value from the data stored in the Data Lake. Starting at 12 o’clock in the diagram above and moving clockwise: BI – Reporting: Data warehouses do an excellent job of reporting, and will continue to offer this capability. Some data will be exported to those systems and temporarily stored there, while other data will be accessed directly from the Data Lake in a hybrid fashion. These data warehouse systems were specifically designed to run complex report analytics, and do this well. SQL on Hadoop: There is a lot of innovation here. The goal of many of these products is to displace the data warehouse. Advances have been made with the likes of Hawq and Impala. But make no mistake, there is a long way to go for these systems to get near the speed and efficiency of the data warehouses, especially those with columnar designs. SQL-on-Hadoop systems exist for a couple of important reasons: 1) SQL is still the best way to get at data, and 2) Processing can occur without moving big chunks of data around. Exploratory Analytics: This is the realm of the data scientist. These tools offer the ability to “find” things in data – patterns, obscure relationships, statistical rules, etc. Mahout and R are popular tools in this category. MapReduce: This is a lazily-named group of all the job scheduling and management tasks that often occur on Hadoop (I really should come up with something more accurate). Many Hadoop use cases today involve pre-processing or cleaning data prior to the use of the analytics tools described above. These are the tools and interfaces that allow that to happen. ETL of Enterprise Apps: Last at 6 o’clock is the ETL process that will help get all the legacy data from our trusty enterprise applications into our data lake that stores everything. These applications will slowly migrate to full-fledged Fast+Big Data apps in time, which I will discuss in a future post. But suffice it to say: once I add sensors to a manufacturing line, I have a Fast+Big Data problem. OK, we now have analytics … so what? Why do we do analytics in the first place? Simple. We want: Better decisions Better personalization Better detection Better …. Interaction. Interaction is what the application is responsible for, and the most valuable improvements come when you can do these interactions accurately and in real-time. This brings us to the second half of the architecture where we deal with Fast Data to make better, faster real-time applications, depicted in the diagram below. The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold Petabytes of data and generate extensive reports. The nature of Fast Data produces a number of critical requirements to get the most out of it. These include the ability to: Ingest / interact with the data feed Make decisions on each event in the feed Provide visibility into fast-moving data with real-time analytics Seamlessly integrate into the systems designed to store Big Data Ability to serve analytic results and knowledge from the Big Data systems quickly to users and applications, closing the data loop. There is no better technology to meet these requirements than an operational database. The challenge we have faced is that there hasn’t been an operational database that can manage this kind of throughput. As a result, there have been a number of Band-Aids people have used to attempt to meet their needs, often giving up capabilities and always adding complexity. In a next post, I will detail the capabilities I see customers looking for to support their Fast Data applications. Then we will take a look at the results of attempting this solution with a popular alternative, stream processing. Originally written by Scott Jarr for VoltDB.
July 9, 2014
by John Piekos
· 14,134 Views
article thumbnail
What is Scalable Machine Learning?
scalability has become one of those core concept slash buzzwords of big data. it’s all about scaling out, web scale, and so on. in principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. the terms “scalable” and “large scale” have been used in machine learning circles long before there was big data. there had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. so finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the big data kind of sense. part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently. to illustrate this, let’s search for nips papers (the annual advances in neural information processing systems, short nips, conference is one of the big ml community meetings) for papers which have the term “scalable” in the title. here are some examples: scalable inference for logistic-normal topic models … this paper presents a partially collapsed gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation … partially collapsed gibbs sampling is a kind of estimation algorithm for certain graphical models. a scalable approach to probabilistic latent space inference of large-scale networks … with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours … stochastic variational inference algorithm is both an approximation and an estimation algorithm. scalable kernels for graphs with continuous attributes … in this paper, we present a class of path kernels with computational complexity $o(n^2(m + \delta^2 ))$ … and this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could. usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. still, the actual “scalability” comes from the algorithmic side. so how do scalable ml algorithms look like? a typical example are the stochastic gradient descent (sgd) class of algorithms. these algorithms can be used, for example, to train classifiers like linear svms or logistic regression. one data point is considered at each iteration. the prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller. vowpal wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning: there are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. this project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation. so “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. for sgd type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. the main problem to speed this kind of computation up is how to stream the data by fast enough. to put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily. i know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of spark , but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. while this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set. if you want to know more, this large scale learning challenge sören sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets. of course, there are things which can be easily scaled well using hadoop or spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine . i’ve just scratched the surface of this, but i hope you got the idea that scalability can mean quite different things. in big data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. in machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. often, this allows you to speed up your computations significantly by decoupling computations. parallelization is important, too, but alone it won’t get you very far. luckily, there are projects like spark and stratosphere/flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.
July 3, 2014
by Mikio Braun
· 18,540 Views · 1 Like
article thumbnail
Top 10 Hadoop Shell Commands to Manage HDFS
So you already know what Hadoop is? Why it is used? What problems you can solve with it?
June 30, 2014
by Saurabh Chhajed
· 485,684 Views · 8 Likes
article thumbnail
Internet of Things (IoT) Reference Architecture
to converge internet of thing devices with corporate it solutions, teams require a reference architecture for the internet of things (iot). the reference architecture must include devices, server-side capabilities, and cloud architecture required to interact with and manage the devices. a reference architecture should provide architects and developers of iot projects with an effective starting point that addresses major iot project and system requirements. a high-level iot reference architecture may include the following layers (see figure 1): external communications - web/portal, dashboard, apis event processing and analytics (including data storage) aggregation / bus layer – esb and message broker device communications devices cross-‐cutting layers include: device and application management identity and access management a more detailed architecture component description can be found in the iot reference architecture white paper .
June 18, 2014
by Chris Haddad
· 17,729 Views
article thumbnail
Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel
Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.
June 3, 2014
by Yves Trudeau
· 460,615 Views · 86 Likes
article thumbnail
How to Migrate from MySQL to MongoDB
In the last week I was working on a key project to migrate a BI platform from MySQL to MongoDB. We chose that database due to its support and scalability.
April 14, 2014
by Moshe Kaplan
· 115,897 Views · 6 Likes
article thumbnail
Running Hadoop MapReduce Application from Eclipse Kepler
it's very important to learn hadoop by practice. one of the learning curves is how to write the first map reduce app and debug in favorite ide, eclipse. do we need any eclipse plugins? no, we do not. we can do hadoop development without map reduce plugins this tutorial will show you how to set up eclipse and run your map reduce project and mapreduce job right from your ide. before you read further, you should have setup hadoop single node cluster and your machine. you can download the eclipse project from github . use case: we will explore the weather data to find maximum temperature from tom white’s book hadoop: definitive guide (3rd edition) chapter 2 and run it using toolrunner i am using linux mint 15 on virtualbox vm instance. in addition, you should have hadoop (mrv1 am using 1.2.1) single node cluster installed and running, if you have not done so, would strongly recommend you do it from here download eclipse ide, as of writing this, latest version of eclipse is kepler 1. create new java project 2. add dependencies jars right click on project properties and select java build path add all jars from $hadoop_home/lib and $hadoop_home (where hadoop core and tools jar lives) 3. create mapper package com.letsdobigdata; import java.io.ioexception; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.longwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.mapper; public class maxtemperaturemapper extends mapper { private static final int missing = 9999; @override public void map(longwritable key, text value, context context) throws ioexception, interruptedexception { string line = value.tostring(); string year = line.substring(15, 19); int airtemperature; if (line.charat(87) == '+') { // parseint doesn't like leading plus // signs airtemperature = integer.parseint(line.substring(88, 92)); } else { airtemperature = integer.parseint(line.substring(87, 92)); } string quality = line.substring(92, 93); if (airtemperature != missing && quality.matches("[01459]")) { context.write(new text(year), new intwritable(airtemperature)); } } } 4. create reducer package com.letsdobigdata; import java.io.ioexception; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.reducer; public class maxtemperaturereducer extends reducer { @override public void reduce(text key, iterable values, context context) throws ioexception, interruptedexception { int maxvalue = integer.min_value; for (intwritable value : values) { maxvalue = math.max(maxvalue, value.get()); } context.write(key, new intwritable(maxvalue)); } } 5. create driver for mapreduce job map reduce job is executed by useful hadoop utility class toolrunner package com.letsdobigdata; import org.apache.hadoop.conf.configured; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.intwritable; import org.apache.hadoop.io.text; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; /*this class is responsible for running map reduce job*/ public class maxtemperaturedriver extends configured implements tool{ public int run(string[] args) throws exception { if(args.length !=2) { system.err.println("usage: maxtemperaturedriver "); system.exit(-1); } job job = new job(); job.setjarbyclass(maxtemperaturedriver.class); job.setjobname("max temperature"); fileinputformat.addinputpath(job, new path(args[0])); fileoutputformat.setoutputpath(job,new path(args[1])); job.setmapperclass(maxtemperaturemapper.class); job.setreducerclass(maxtemperaturereducer.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); system.exit(job.waitforcompletion(true) ? 0:1); boolean success = job.waitforcompletion(true); return success ? 0 : 1; } public static void main(string[] args) throws exception { maxtemperaturedriver driver = new maxtemperaturedriver(); int exitcode = toolrunner.run(driver, args); system.exit(exitcode); } } 6. supply input and output we need to supply input file that will be used during map phase and the final output will be generated in output directory by reduct task. edit run configuration and supply command line arguments. sample.txt reside in the project root. your project explorer should contain following ] 7. map reduce job execution 8. final output if you managed to come this far, once the job is complete, it will create output directory with _success and part_nnnnn , double click to view it in eclipse editor and you will see we have supplied 5 rows of weather data (downloaded from ncdc weather) and we wanted to find out the maximum temperature in a given year from input file and the output will contain 2 rows with max temperature in (centigrade) for each supplied year 1949 111 (11.1 c) 1950 22 (2.2 c) make sure you delete the output directory next time running your application else you will get an error from hadoop saying directory already exists. happy hadooping!
February 21, 2014
by Hardik Pandya
· 144,704 Views · 2 Likes
article thumbnail
Big Data Search, Part 6: Sorting Randomness
As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4 GB of random numbers, let us find how many times we have the number 1. Because I wanted to ensure a consistent answer, I wrote: public static IEnumerable RandomNumbers() { const long count = 1024 * 1024 * 1024L * 1; var random = new MyRand(); for (long i = 0; i < count; i++) { if (i % 1024 == 0) { yield return 1; continue; } var result = random.NextUInt(); while (result == 1) { result = random.NextUInt(); } yield return result; } } /// /// Based on Marsaglia, George. (2003). Xorshift RNGs. /// http://www.jstatsoft.org/v08/i14/paper /// public class MyRand { const uint Y = 842502087, Z = 3579807591, W = 273326509; uint _x, _y, _z, _w; public MyRand() { _y = Y; _z = Z; _w = W; _x = 1337; } public uint NextUInt() { uint t = _x ^ (_x << 11); _x = _y; _y = _z; _z = _w; return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8)); } } I am using a custom Rand function because it is significantly faster than System.Random. This generate 4GB of random numbers, at also ensure that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine. For fun, I run the external sort routine in 32 bits mode, with a buffer of 256MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random number, sorting then compressing the values resulted in about 60% compression rate. The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for real world scenario, it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help. Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million. That let us complete the entire process, but then it halts with the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceed 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time. I tried thinking about ways to parallel, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them. Obviously we can sort each 10 files set in parallel, then sort the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this. For what is it worth, you can find the code for this here.
February 5, 2014
by Oren Eini
· 9,054 Views
article thumbnail
Big Data Search, Part 5: Sorting Optimizations
I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here. In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort. Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file. Well, one of the reason for this performance is that we execute this in a serial fashion, that is, one after another. But we have to completely isolated indexes, there is no reason why we can’t parallelize the work between them. For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a buffer while we are adding to the buffer. We wrote to intermediary files, but we wrote to those using plain file I/O. But it is usually a lot more costly to write to disk than to compress and then write to disk. We are writing sorted data, so it is probably going to compress pretty well. Those are the things that pop to mind. Can you think of additional options?
January 27, 2014
by Oren Eini
· 7,865 Views
article thumbnail
Big Data Search, Part 4: The Index Format is Horrible
I have completed my own exercise, and while I wanted to try it with “few allocations” rule, it is interesting to see just how far out there the code is. This isn’t something that you can really use for anything except as a basis to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file. That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance. Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB file. In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to quarter of a second. Now, to be fair, there is going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with ere. Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(logN) in action), but that is still pretty expensive. Even worse, this is what happens when we are running on a single query at a time. What do you think will happen if we are actually running this with multiple threads generating queries. Now we will have a lot of seeks (effective random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still… Obviously, there are multiple solutions for this issue. B+Trees solve the problem by packing multiple keys into a single page, so instead of doing a O(log2N), you are usually doing O(log36N) or O(log100N). Consider those fan outs, we will have 6 – 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the trees are likely to reside in memory (and probably in L1, if we are speaking about that). However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry to the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the numbers actually are probably worse. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here. However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is rely on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here. This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That make it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that as I understand it, a lot of the time, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb. It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
January 24, 2014
by Oren Eini
· 12,103 Views
article thumbnail
How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1
Learn how to set up a four node Hadoop cluster using AWS EC2, PuTTy(gen), and WinSCP.
January 23, 2014
by Hardik Pandya
· 135,937 Views · 3 Likes
article thumbnail
Big Data Search, Part 3: Binary Search of Textual Data
The index I created for the exercise is just a text file, sorted by the indexed key. When doing a search by a human, that makes it very easy to work with. Much easier than trying to work with a binary file, it also helps debugging. However, it does make it running a binary search on the data a bit harder. Mostly because there isn’t a nice way to say “give me the #th line”. Instead, I wrote the following: public void SetPositionToLineAt(long position) { // now we need to go back until we either get to the start of the file // or find a \n character const int bufferSize = 128; _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity); var charCount = _encoding.GetMaxCharCount(bufferSize); if (charCount > _charBuf.Length) _charBuf = new char[Utils.NearestPowerOfTwo(charCount)]; while (true) { _input.Position = position - (position < bufferSize ? 0 : bufferSize); var read = ReadToBuffer(bufferSize); var buffer = _buffer.GetBuffer(); var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0); for (int i = chars - 1; i >= 0; i--) { if (_charBuf[i] == '\n') { _input.Position = position - (bufferSize - i) + 1; return; } } position -= bufferSize; if (position < 0) { _input.Position = 0; return; } } } This code starts at an arbitrary byte position, and go backward until it find the new line character ‘\n’. This give me the ability to go to a rough location and get the line oriented input. Once I have that, the rest is pretty easy. Here is the binary search: while (lo <= hi) { position = (lo + hi) / 2; _reader.SetPositionToLineAt(position); bool? result; do { result = _reader.ReadOneLine(); } while (result == null); // skip empty lines if (result == false) yield break; // couldn't find anything var entry = _reader.Current.Values[0]; match = Utils.CompareArraySegments(expectedIndexEntry, entry); if (match == 0) { break; } if (match > 0) lo = position + _reader.Current.Values.Sum(x => x.Count) + 1; else hi = position - 1; } if (match != 0) { // no match yield break; } The idea is that this positions us on the location of the index that has an entry with a value that is equal to what we are searched on. We then write the following to actually get the data from the actual data file: // we have a match, now we need to return all the matches _reader.SetPositionToLineAt(position); while(true) { bool? result; do { result = _reader.ReadOneLine(); } while (result == null); // skip empty lines if(result == false) yield break; // end of file var entry = _reader.Current.Values[0]; match = Utils.CompareArraySegments(expectedIndexEntry, entry); if (match != 0) yield break; // out of the valid range we need _buffer.SetLength(0); _data.Position = Utils.ToInt64(_reader.Current.Values[1]); while (true) { var b = _data.ReadByte(); if (b == -1) break; if (b == '\n') { break; } _buffer.WriteByte((byte)b); } yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length); } As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that. We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.
January 22, 2014
by Oren Eini
· 5,880 Views
article thumbnail
Big Data Search, Part 2: Setting Up
the interesting thing about this problem is that i was very careful in how i phrased things. i said what i wanted to happen, but didn’t specify what needs to be done. that was quite intentional. for that matter, the fact that i am posting about what is going to be our acceptance criteria is also intentional. the idea is to have a non trivial task, but something that should be very well understood and easy to research. it also means that the candidate needs to be able to write some non trivial code. and i can tell a lot about a dev from such a project. at the same time, this is a very self contained scenario. the idea is that this is something that you can do in a short amount of time. the reason that this is an interesting exercise is that this is actually at least two totally different but related problems. first, in a 15tb file, we obviously cannot rely on just scanning the entire file. that means that we have to have an index. and that means we have to build it. interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory. the second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. note that i am not really interested in actual efficiencies for this particular scenario. i care about being able to see the code. see that it works, etc. my solution, for example, is a single threaded system that make no attempt at parallelism or i/o optimizations. it clocks at over 1 gb / minute and the memory consumption is at under 150mb. queries for a unique value return the result in 0.0004 seconds. queries that returned 153k results completed in about 2 seconds. when increasing the used memory to about 650mb, there isn’t really any difference in performance, which surprised me a bit. then again, the entire code is probably highly inefficient. but that is good enough for now. the process is kicked off with indexing: 1: var options = new directoryexternalstorageoptions("/path/to/index/files"); 2: var input = file.openread(@"/path/to/data/crimes_-_2001_to_present.csv"); 3: var sorter = new externalsorter(input, options, new int[] 4: { 5: 1,// case number 6: 4, // ichr 7: 8: }); 9: 10: sorter.sort(); i am actually using the chicago crime data for this. this is a 1gb file that i downloaded from the chicago city portal in csv format. this is what the data looks like: the externalsorter will read and parse the file, and start reading it into a buffer. when it gets to a certain size (about 64mb of source data, usually), it will sort the values in memory and output them into temporary files. those file looks like this: initially, i tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. the format is pretty simple, you have the value of the left, and on the right you have start position of the row for this value. we generate about 17 such temporary files for the 1gb file. one temporary file per each 64 mb of the original file. this lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 gb or maybe more. our test machine has 16 gb of ram, so doing a sort and outputting a temporary file every 8 gb can be a good way to handle things. but that is beside the point. the end result is that we have multiple sorted files, but they aren’t sequential. in other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. we need to merge all of them together. luckily, this is easy enough to do. we basically have a heap that we feed entries from the files into. and that pretty much takes care of this. see merge sort if you want more details about this. the end result of merging all of those files is… another file, just like them, that contains all of the data sorted. then it is time to actually handle the other issue, actually searching the data. we can do that using simple binary search, with the caveat that because this is a text file, and there is no fixed size records or pages, it is actually a big hard to figure out where to start reading. in effect, what i am doing is to select an arbitrary byte position, then walk backward until i find a ‘\n’. once i found the new line character, i can read the full line, check the value, and decide where i need to look next. assuming that i actually found my value, i can now go to the byte position of the value in the original file and read the original line, giving it to the user. assuming an indexing rate of 1 gb / minute a 15 tb file would take about 10 days to index. but there are ways around that as well, but i’ll touch on them in my next post. what all of this did was bring home just how much we usually don’t have to worry about such things. but i consider this research well spent, we’ll be using this in the future.
January 21, 2014
by Oren Eini
· 3,447 Views
article thumbnail
Top Posts of 2013: Google's Big Data Papers
I’ll review Google’s most important Big Data publications and discuss where they are (as far as they’ve disclosed).
December 30, 2013
by Mikio Braun
· 117,083 Views
  • Previous
  • ...
  • 145
  • 146
  • 147
  • 148
  • 149
  • 150
  • 151
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×