Using Splunk for Machine Data Analytics
Machine data, or “data exhaust,” analysis is one of the fastest-growing segments of “big data.” This data is generated by websites, applications, servers, networks, mobile devices and other sources. The goal is to aggregate, parse and visualize it (log files, scripts, messages, alerts, changes, IT configurations, tickets, user profiles and so on) to spot trends and act.
By monitoring and analyzing everything from customer clickstreams, transactions and log files to network activity and call records, a new breed of startups is racing to convert “invisible” machine data into useful performance insights. The label for this type of analytics is operational or application performance intelligence.
In this posting we cover a low-profile big data company, Splunk, which recently went public. Splunk already has more than 3,500 customers. It ended its first day on the stock market with an amazing 108.7 percent bump in price from its $17-per-share IPO.
Splunk’s search, analysis and visualization capabilities are used by companies ranging from Comcast to Zynga to make sense of the reams of log data they generate every second.
Some real-world customer examples include:
- E-commerce… Expedia uses Splunk to avoid website outages by monitoring server and application health and performance. Today, ~3,000 users at Expedia use Splunk to gain real-time visibility on tens of terabytes of unstructured, time-sensitive machine data (from not only their IT infrastructure, but also online bookings, deal analysis and coupon use).
- SaaS… Salesforce.com uses Splunk to mine the large quantities of data generated from its entire technology stack. Salesforce.com has >500 users of Splunk dashboards, from IT users monitoring customer experience to product managers performing analytics on services like ‘Chatter.’ With Splunk, SFDC claims to have taken application troubleshooting for 100,000 customers to the next level.
- Digital publishing… NPR uses Splunk to gain insight into their digital asset infrastructure and to monitor and troubleshoot their end-to-end asset delivery infrastructure. They use Splunk to measure program popularity and views by device, reconcile royalty payments for digital rights, measure abandonment rates and more.
Machine Data Basics
Data, in general, falls into 3 categories:
- Business application data,
- Human-generated content and
- Machine data.
Business data is the digital information used by organizations to conduct their daily operations, such as payroll, supply chain and financial data. Most biz apps rely on relational database technology and software that have pre-defined data structures, or schemas, for organizing, storing, accessing and reporting on structured data.
Human-generated content is the digital information derived from human-to-human (H2H) interactions, including email, spreadsheets and documents, mobile text messages, video, photos, recorded audio and social media messaging. Human-generated content comes in the form of unstructured data, which means that it’s not optimized for storage in a relational database.
Machine data (or data exhaust) is produced 24x7x365 by nearly every software application and electronic device. The applications, servers, network devices, sensors, browsers, desktop and laptop computers, mobile devices and various other systems deployed to support operations are continuously generating information relating to their status and activities.
Machine data can be found in a variety of formats such as application log files, call detail records, clickstream data associated with user web interactions, data files, system configuration files, alerts and tickets.
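To make one of these formats concrete, here is a minimal Python sketch that turns a single web access log line into named fields. The sample line, the field names and the `parse_access_line` helper are all invented for illustration (the layout loosely follows the common web server access log style); this is not how Splunk itself parses data.

```python
import re
from datetime import datetime

# Hypothetical access-log line, invented for illustration only.
SAMPLE = '203.0.113.7 - - [12/Apr/2012:10:15:32 +0000] "GET /cart/checkout HTTP/1.1" 200 5123'

# One named group per field we want to pull out of the raw line.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Convert the text fields that have an obvious type.
    event["timestamp"] = datetime.strptime(event["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event

event = parse_access_line(SAMPLE)
print(event["client"], event["path"], event["status"])
```

Every other format in the list (call detail records, configuration files, tickets) would need its own pattern, which is exactly why handling many formats at once is hard.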
Machine data is generated by both machine-to-machine (M2M) as well as human-to-machine (H2M) interactions. Outside of traditional IT infrastructure, every processor-based system, including HVAC controllers, smart electrical meters, GPS devices and RFID tags, and many consumer-oriented systems, such as mobile devices, automobiles and medical devices that contain embedded electronic devices, are also continuously generating machine data.
Machine data can be structured or unstructured. The growth of machine data has accelerated in recent years. The increasing complexity of IT infrastructures driven by the adoption of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is contributing to the growth.
Leveraging Machine Data
The requirement for organizations: end-to-end visibility, analytics, and real-time intelligence across all of their applications, services and IT infrastructure to achieve required service levels, manage costs, mitigate security risks, demonstrate and maintain compliance and gain new insights to drive better business decisions.
Machine data provides a definitive, time-stamped record of current and historical activity and events within and outside an organization, including application and system performance, user activity, system configuration changes, electronic transaction records, security alerts, error messages and device locations.
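Because every event carries a timestamp, records from unrelated systems can be lined up into one timeline. The Python sketch below shows the idea under simple assumptions: the three sources and their events are hypothetical, and each source is assumed to already be in time order.

```python
import heapq
from datetime import datetime

# Hypothetical events: (timestamp, source, message). Each list is time-ordered.
web_log = [
    (datetime(2012, 4, 12, 10, 15, 32), "web", "GET /cart/checkout 200"),
    (datetime(2012, 4, 12, 10, 15, 41), "web", "POST /order 500"),
]
app_log = [
    (datetime(2012, 4, 12, 10, 15, 40), "app", "payment gateway timeout"),
]
db_log = [
    (datetime(2012, 4, 12, 10, 15, 39), "db", "connection pool exhausted"),
]

def merged_timeline(*sources):
    """Merge already-sorted event streams into one stream ordered by timestamp."""
    return list(heapq.merge(*sources, key=lambda event: event[0]))

timeline = merged_timeline(web_log, app_log, db_log)
for timestamp, source, message in timeline:
    print(timestamp.isoformat(), source, message)
```

Read top to bottom, the merged timeline suggests a story no single log tells on its own: a database problem, then an application timeout, then a failed order on the website.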
Machine data in a typical enterprise is generated in a multitude of formats and structures, as each software application or hardware device records and creates machine data associated with its specific use. Machine data also varies among vendors and even within the same vendor across product types, families and models.
The figure below illustrates the type of machine data created and the business and IT insights that can be derived when a single web visitor makes a purchase in a typical ecommerce environment:
The illustration above is an example of the type and amount of valuable information generated by a single website visitor that is recorded. A typical ecommerce site serving thousands of users a day will generate gigabytes of machine data which can be used to provide significant insights into the IT infrastructure and business operations.
The Gap: Existing IT and BI Solutions are Unable to Handle Machine Data.
While machine data has always been generated by computing environments, many organizations have failed to recognize the value of this data or have encountered challenges extracting value from it. As a result, heterogeneous machine data is largely ignored and restricted to ad hoc use at the time of troubleshooting IT failures or errors.
A number of IT management products are available to analyze log files and other information related to specific devices, applications or use cases. However, these point solutions are narrowly scoped to only work with specific data formats and systems and are unable to correlate machine data from multiple sources, formats and systems for real-time analysis without significant configuration. Because each point solution targets a specific use case or data format, multiple point solutions are required to understand, cross-correlate and take advantage of the multitude of machine data sets available to an organization. This can lead to significant IT complexity as well as significant capital and IT resource expenditures.
While computing environments have always generated large amounts of machine data, current legacy IT operations management, security and compliance and BI technologies, such as relational databases, online analytical processing (OLAP) engines and other analytical tools are built on software optimized for structured data, namely data where the structure is known and can thus be placed into pre-defined relational databases.
Most of today’s enterprise applications are architected for managing data in legacy relational databases. However, because machine data exists in a variety of formats and can be structured or unstructured, these legacy systems are not optimized to address the massive amounts of dynamic machine data generated within an organization.
Unstructured data is extremely diverse and complex. Legacy data tools, which are generally designed to handle structured data, need to be re-architected to address the complexity of machine data. If either the analysis or the format of the data changes, the legacy systems need to re-collect and normalize the data, and the applications that leverage the database need to modify their structure to handle the new data formats. Many legacy solutions are also expensive to install and maintain, often needing deployment and update cycles that require professional services, extensive training and technical support over several months, and sometimes years.
Point products as well as legacy IT systems were not built to address the challenges and opportunities of machine data. Moreover, existing solutions and systems are not architected to take advantage of price/performance improvements of computing and storage systems, and in many cases require significant investment in computing hardware. Because of these limitations, these solutions and systems are unable to provide historical and real-time operational intelligence across a wide variety of use cases.
Market Opportunity Being Addressed by Splunk
Splunk believes there is a big opportunity to help organizations unlock the value of machine data. Organizations need to capture the value locked in their machine data to enable more effective application management, IT operations management, security and compliance, and to derive intelligence and insight across the organization.
The software segments that operational intelligence addresses have been estimated by Gartner to be ~$32 billion in 2012. Specifically, Gartner expects the market for products addressing IT operations, which includes application management, to be approximately $18.6 billion in 2012; the market for BI related products, including web analytics software, to be ~$12.5 billion in 2012; and the market for security information and event management software to be ~$1.3 billion in 2012.
Splunk started out building technology used by sysadmins to search computer log files for security issues, server-level bugs or other problems.
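As a quick sanity check, the three segment estimates quoted above can be added up in a few lines of Python. The segment labels are shortened paraphrases, and the only inputs are the figures already cited.

```python
# Segment estimates for 2012 in billions of US dollars, as quoted above.
segments_billion_usd = {
    "IT operations (incl. application management)": 18.6,
    "BI (incl. web analytics)": 12.5,
    "Security information and event management": 1.3,
}

# Round to one decimal to avoid floating-point noise in the sum.
total = round(sum(segments_billion_usd.values()), 1)
print(f"Total: ${total} billion")

# Share of the total contributed by each segment, in percent.
for name, value in segments_billion_usd.items():
    print(f"{name}: {100 * value / total:.1f}%")
```

The three segments sum to $32.4 billion, which is consistent with the ~$32 billion headline figure.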
The core of Splunk’s software is a proprietary machine data engine, comprised of collection, indexing, search and data management capabilities. The software can collect and index terabytes of information daily, irrespective of format or source.
The machine data engine uses an innovative data architecture that enables dynamic schema creation on the fly, allowing users to run queries on data without having to understand the structure of the data prior to collection and indexing.
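The general idea of applying a schema at query time rather than at load time can be sketched in Python. This is only an illustration of the concept, not Splunk's actual engine: the raw events, the `key=value` convention and the `extract_fields` and `search` helpers are all assumptions made for the example.

```python
import re

# Raw events are stored exactly as they arrived; no schema is declared up front.
# (Hypothetical sample data.)
RAW_EVENTS = [
    "2012-04-12T10:15:32Z action=purchase user=alice amount=59.90 status=ok",
    "2012-04-12T10:15:40Z action=login user=bob status=failed reason=bad_password",
    "2012-04-12T10:16:02Z action=purchase user=carol amount=12.00 status=ok",
]

KEY_VALUE = re.compile(r"(\w+)=(\S+)")

def extract_fields(raw):
    """Discover whatever key=value fields this particular event contains."""
    return dict(KEY_VALUE.findall(raw))

def search(events, **criteria):
    """Extract fields at query time and keep events matching every criterion."""
    for raw in events:
        fields = extract_fields(raw)
        if all(fields.get(key) == value for key, value in criteria.items()):
            yield fields

purchases = list(search(RAW_EVENTS, action="purchase"))
revenue = sum(float(event["amount"]) for event in purchases)
print(len(purchases), "purchases, revenue", round(revenue, 2))
```

Note that the login event carries a `reason` field the purchase events lack, and nothing had to be redefined to accommodate it. With a fixed relational schema, that new field would typically mean a table change before the data could even be loaded.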
The machine data fabric for data collection and indexing delivers speed and scalability when processing massive amounts of machine data.
Deployment of Splunk
The software can be deployed in a variety of environments ranging from a single laptop to a distributed enterprise IT environment handling massive amounts of data. The combination of Splunk forwarders, indexers and search heads creates a machine data fabric that allows for the efficient, secure and real-time collection and indexing of machine data regardless of network, data center or IT infrastructure topology.
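To give a feel for how the three roles divide the work, here is a toy Python model. It is a deliberately simplified sketch: the class names mirror the roles described above, but the round-robin distribution, the token index and all method names are assumptions for illustration and say nothing about how Splunk implements these components.

```python
from collections import defaultdict

class Indexer:
    """Stores raw events and keeps a token -> event positions index."""
    def __init__(self):
        self.events = []
        self.index = defaultdict(set)

    def ingest(self, timestamp, raw):
        position = len(self.events)
        self.events.append((timestamp, raw))
        for token in raw.lower().split():
            self.index[token].add(position)

    def search(self, term):
        positions = self.index.get(term.lower(), ())
        return [self.events[i] for i in sorted(positions)]

class Forwarder:
    """Sits near the data source and spreads events across indexers."""
    def __init__(self, indexers):
        self.indexers = indexers
        self.sent = 0

    def send(self, timestamp, raw):
        # Simple round-robin distribution (an assumption of this sketch).
        target = self.indexers[self.sent % len(self.indexers)]
        target.ingest(timestamp, raw)
        self.sent += 1

class SearchHead:
    """Fans a query out to every indexer and merges results by timestamp."""
    def __init__(self, indexers):
        self.indexers = indexers

    def search(self, term):
        hits = []
        for indexer in self.indexers:
            hits.extend(indexer.search(term))
        return sorted(hits)

indexers = [Indexer(), Indexer()]
forwarder = Forwarder(indexers)
forwarder.send(1, "ERROR payment gateway timeout")
forwarder.send(2, "INFO user login ok")
forwarder.send(3, "ERROR db connection refused")

results = SearchHead(indexers).search("error")
for timestamp, raw in results:
    print(timestamp, raw)
```

The point of the split is that collection, storage and querying can each be scaled independently: more forwarders for more sources, more indexers for more volume, and a search head that hides the distribution from the user.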
The diagram below shows a representative Splunk deployment topology in a distributed environment making use of machine data fabric.