DZone
The Latest Big Data Topics

The Limitations of the IoT and How the Web of Things Can Help
Understand the limitations of the Internet of Things and how the Web of Things can help build an application layer for the IoT.
September 28, 2015
by Dominique Guinard
· 27,320 Views · 6 Likes
Problems Solved by IoT
We spoke with 20 executives across the IoT space about the problems the Internet of Things is addressing.
September 24, 2015
by Tom Smith
· 37,140 Views · 5 Likes
Customer Journey Analytics and Data Science
Deciphering the “nuts-and-bolts” of individual customer journeys (and deducing intent) is core to improving customer experience and driving brand loyalty.
September 9, 2015
by Ravi Kalakota
· 8,506 Views · 1 Like
Too Big Data: Coping with Overplotting
Written by Tim Brock.

Scatter plots are a wonderful way of showing (apparent) relationships in bivariate data. Patterns and clusters that you wouldn't see in a huge block of data in a table can become instantly visible on a page or screen. With all the hype around big data in recent years it's easy to assume that having more data is always an advantage. But as we add more and more data points to a scatter plot we can start to lose these patterns and clusters. This problem, a result of overplotting, is demonstrated in the animation below.

The data in the animation above is randomly generated from a pair of simple bivariate distributions. The distinction between the two distributions becomes less and less clear as we add more and more data.

So what can we do about overplotting? One simple option is to make the data points smaller. (Note this is a poor "solution" if many data points share exactly the same values.) We can also make them semi-transparent. And we can combine these two options.

These refinements certainly help when we have ten thousand data points. However, by the time we've reached a million points the two distributions have seemingly merged into one again. Making points smaller and more transparent might help things; nevertheless, at some point we may have to consider a change of visualization. We'll get on to that later.

But first let's try to supplement our visualization with some extra information. Specifically, let's visualize the marginal distributions. We have several options. There's far too much data for a rug plot, but we can bin the data and show histograms. Or we can use a smoother option: a kernel density plot. Finally, we could use the empirical cumulative distribution. This last option avoids any binning or smoothing but the results are probably less intuitive. I'll go with the kernel density option here, but you might prefer a histogram.
The animated GIF below is the same as the GIF above but with the smoothed marginal distributions added. I've left scales off to avoid clutter and because we're only really interested in rough judgements of relative height.

Adding marginal distributions, particularly the distribution of variable 2, helps clarify that two different distributions are present in the bivariate data. The twin-peaked nature of variable 2 is evident whether there are a thousand data points or a million. The relative sizes of the two components are also clear. By contrast, the marginal distribution of variable 1 only has a single peak, despite coming from two distinct distributions. This should make it clear that adding marginal distributions is by no means a universal solution to overplotting in scatter plots.

To reinforce this point, the animation below shows a completely different set of (generated) data points in a scatter plot with marginal distributions. The data again comes from a random sample of two different 2D distributions, but both marginal distributions of the complete dataset fail to highlight this separation. As previously, when the number of data points is large the distinction between the two clusters can't be seen from the scatter plot either.

Returning to point size and opacity, what do we get if we make the data points very small and almost completely transparent? We can now clearly distinguish two clusters in each dataset. It's difficult to make out any fine detail though.

Since we've lost that fine detail anyway, it seems apt to question whether we really want to draw a million data points. It can be tediously slow and impossible in certain contexts. 2D histograms are an alternative. By binning data we can reduce the number of points to plot and, if we pick an appropriate color scale, pick out some of the features that were lost in the clutter of the scatter plot.
After some experimenting I picked a color scale that ran from black through green to white at the high end. Note, this is (almost) the reverse of the effect created by overplotting in the scatter plots above.

In both 2D histograms we can clearly see the two different clusters representing the two distributions from which the data is drawn. In the first case we can also see that there are more counts from the upper-left cluster than the bottom-right cluster, a detail that is lost in the scatter plot with a million data points (but more obvious from the marginal distributions). Conversely, in the case of the second dataset we can see that the "heights" of the two clusters are roughly comparable.

3D charts are overused, but here (see below) I think they actually work quite well in terms of providing a broad picture of where the data is and isn't concentrated. Feature occlusion is a problem with 3D charts, so if you're going to go down this route when exploring your own data I highly recommend using software that allows for user interaction through rotation and zooming.

In summary, scatter plots are a simple and often effective way of visualizing bivariate data. If, however, your chart suffers from overplotting, try reducing point size and opacity. Failing that, a 2D histogram or even a 3D surface plot may be helpful. In the latter case be wary of occlusion.
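The binning step behind a 2D histogram can be sketched in a few lines. This is only an illustration in plain Python, not the author's code: the two Gaussian clusters, the plotted range and the 50-by-50 grid are all assumptions chosen to resemble the first dataset described above, and `bin_points` is a hypothetical helper name.

```python
import random
from collections import Counter


def bin_points(points, x_range, y_range, bins):
    """Count points per cell of a bins-by-bins grid (a 2D histogram)."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted area
        # Map each coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[(row, col)] += 1
    return counts


# Assumed data: two overlapping clusters of unequal size (upper-left is larger).
rng = random.Random(0)
points = [(rng.gauss(-1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(60_000)]
points += [(rng.gauss(1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(40_000)]

counts = bin_points(points, x_range=(-5.0, 5.0), y_range=(-5.0, 5.0), bins=50)
```

However many points go in, at most 2,500 cells come out, and each cell's count can then be mapped to a color scale. The cell near the center of the larger cluster ends up with a visibly higher count than the cell near the center of the smaller one, which is the difference the scatter plot hides.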
July 3, 2015
by Josh Anderson
· 13,592 Views
Crowdsourcing our way to better food hygiene
The last few years have seen a tremendous boom in the number of sources online relaying information about restaurant quality. Whether it’s review sites or more general social media, there is no shortage of feedback on how people have found a particular restaurant.

I wrote a few years ago about a project from the University of Rochester that aimed to mine Twitter for mentions of eating out, with the hope of producing a detailed and comprehensive map of food hygiene standards throughout restaurants in New York. The system, called nEmesis, analyzed millions of tweets, and was on the hunt for people sharing an attack of food poisoning after visiting a restaurant. You might think, or hope at least, that this would be a relatively small number, but over a four-month period they found 480 such mentions in New York City alone from a total of 23,000 restaurant visitors. What’s more, the data collected correlated well with public health data on those diners.

Crowdsourcing food hygiene

A recent Harvard-led project is hoping to provide similar assistance to the Boston food hygiene authorities by providing more intelligent information on which to base their inspection checks. Rather than using Twitter for data, however, the Harvard project is turning to the review website Yelp. They have launched a Netflix-style competition to create an algorithm that can search through the ratings of restaurants in Boston and produce recommendations for which restaurants warrant a visit from the hygiene police.

The competition, organized by the data company DrivenData, will see the raw data posted online and then an army of data scientists charged with solving the puzzle. The founders observed that whilst the collection of machine-readable data was now mandated by the government, there was a literacy problem that left much of that data dormant and unused.
Bringing data science to the masses

And so the competition was born to try and make data science affordable for organizations with a clear social need but no budget to afford what are still very expensive skill sets. Of course, the food hygiene challenge is but one of the challenges on the DrivenData website, with the venture coming a long way from their first challenge to make a better algorithm for improving spending in schools.

The organization tries to ensure that whatever winning entries emerge from the competitions receive support and help to grow and improve. The winner of that initial competition, for instance, eventually turned their algorithm into a software tool for schools to use. The eventual aim is to establish a community of data scientists that are happy to deploy their talents for socially worthwhile endeavors.

“Our mindset has grown; we want to solve the big-picture data literacy and data capacity problems in the social and public sectors,” the creators say. “We think competitions are a great mechanism to do that right now, but our goal is to do more, to serve that community in other ways.”

Suffice to say, challenges have come a long way from their beginnings in the 18th century when the UK government launched such a competition to help find longitude more easily. The likes of the X Prize have taken them to newfound heights, and it’s great to see organizations like DrivenData apply the concept to more manageable challenges.

Of course, they aren’t the only organization seeking to make algorithms more accessible. I wrote last year about the Algorithmia social network, which aims to connect organizations with lots of data with algorithms that are being under-utilized. The aim is that this match-up will create not just new insights but extra profits.

Data science is undoubtedly a burgeoning field, and it’s one with a great many exciting developments in it.
July 2, 2015
by Adi Gaskell
· 870 Views · 1 Like
Emerging Niches and Technologies in Mobile App Development
There has been a wide array of successful consumer apps, such as Angry Birds, WhatsApp and Dropbox. After years in the spotlight, these consumer app giants finally understood the importance of offering enterprise-grade features, and in the last few years the focus has shifted to enterprise mobile apps. Rapid development, tracking or monitoring apps, wearable apps, Internet of Things apps, geo-location technologies like iBeacon and geofencing in business apps: the list of emerging app niches and technologies is long. Let us have a quick look at some of the most definitive app niches and technologies of recent times.

Enterprise apps

While smartphones and mobile devices continue to move off the shelves and millions of apps keep the app stores brimming with energy, activity and competition, most consumer apps still fail to earn enough to survive beyond the year of their launch. This has been the sorry storyline for consumer apps for years, so for some time the focus of developers has been shifting towards enterprise apps. Moreover, businesses are now bent on going mobile and are keen to develop apps that make their business processes more productive. Although enterprise mobile apps have only just started to take off, this new and broad app niche has already shown huge promise of overtaking consumer apps in little more than a year.

Rapid development

As enterprises focus on embracing mobile apps in their business processes, the new demand for enterprise-grade apps has made a rapid development cycle a necessity. When winning the competition boils down to a fast and user-focused mobile presence, fast-paced development naturally becomes the rule. This overwhelming demand for business apps and enterprise-grade software has made rapid development a criterion in the present scenario. Shortening the development lifecycle has now become the major focus for most mobile app development companies around the world.
Mobile monitoring apps

The wide adoption of mobile devices and apps among all age groups in recent years has given rise to certain concerns. Child security, parental worries about negative influences on children, and employers' concerns about employee productivity and information security are some of the major ones centered on mobile devices. iOS or Android monitoring software, child phone tracker apps, mobile spy software and text message tracking apps are a few of the app types becoming increasingly popular for addressing these concerns in family or workplace environments.

Internet of Things (IoT) apps

The world around us is becoming connected through mobile devices, and the gadgets and devices around us are increasingly equipped with a mobile control interface. This new horizon of interconnected devices is referred to as the Internet of Things, or IoT. Now an electric toaster can be controlled from its app on a mobile device. Similarly, a music system can be turned on and off, tuned and given other commands through its mobile app. This new breed of apps is being called IoT apps.

Wearable apps

Smartphones and other smart mobile devices now play the central role in connecting all types of wearable smart devices. Most smartwatch apps are still, in character, only extensions of their mobile counterparts. But as the smartwatch slowly becomes the next big device platform and the most common wearable, a new breed of apps is being developed that targets smartwatch and wearable users alongside the corresponding mobile apps. From smart jewelry to health trackers and fitness bands to optically mounted computers like Google Glass, these new wearable devices will be the target development platform for a vast majority of mobile app developers in the time to come.

More user-optimized mobile UI design

UI design is presently the area of mobile app development receiving the most focus around the world.
Experiments and analysis aimed at making UIs better and more user-optimized continue, and a wide variety of new techniques and design approaches are giving rise to an unprecedented level of excellence in user experience. From motivational design to flat design and playful interfaces, we have come across quite a few dominant design trends and techniques.

Geo-location technologies

Contextual, user-specific push notifications are the new way to engage users with a mobile app and to earn revenue in the process. There is no better way to do this than by knowing the user's location. When you know that a user is close to your retail shop, you can send a notification with an offer to grab their attention and prompt a visit to your store. Knowing the user's location thus translates into far more contextual and business-driven messaging and notifications. Several mobile-friendly geo-location technologies, such as iBeacon, geofencing and geomagnetics, let you integrate location-based user engagement features into your app.
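The "user is close to your retail shop" check at the heart of geofencing can be sketched as a distance test. This is a minimal illustration under stated assumptions, not how iBeacon or any platform geofencing API works: the coordinates, the 200-metre radius and the function names are all hypothetical, and the distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(user, store, radius_m):
    """True when the user's (lat, lon) lies within radius_m of the store."""
    return distance_m(*user, *store) <= radius_m


# Hypothetical coordinates for illustration only.
STORE = (51.5074, -0.1278)
nearby_user = (51.5080, -0.1270)
distant_user = (51.5300, -0.1278)

should_notify_nearby = inside_geofence(nearby_user, STORE, radius_m=200)
should_notify_distant = inside_geofence(distant_user, STORE, radius_m=200)
```

In a real app the operating system's location services would normally trigger the fence; the sketch only shows the decision that sits behind the notification.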
July 2, 2015
by Juned Ghanchi
· 3,829 Views
The Secret to More Efficient Data Science with Neo4j and R [OSCON Preview]
It’s a sad but true fact: Most data scientists spend 50-80% of their time cleaning and munging data and only a fraction of their time actually building predictive models. This is most often true in a traditional stack, where most of this data munging consists of writing lines upon lines of some flavor of SQL, leaving little time for model-building code in statistical programming languages such as R.

These long, cryptic SQL queries not only slow development time but also prevent useful collaboration on analytics projects, as contributors struggle to understand each other’s SQL code.

For example, in graduate school, I was on a project team where we used Oracle to store Twitter data. The kinds of queries my classmates and I were writing were unmaintainable and impossible to understand unless the author was sitting next to you. No one worked on the same queries together because they were so unwieldy. This not only hindered our collaboration efforts but also slowed our progress on the project. If we had been using an appropriate data store (like a graph database) we would have spent significantly less time pulling our hair out over the queries.

Why Today’s Data Is Different

This data-munging problem has persisted in the data science field because data is becoming increasingly social and highly connected. Forcing this kind of interconnected data into an inherently tabular SQL database, where relationships are only abstract, leads to complicated schemas and overly complex queries. Yet, several NoSQL solutions – specifically in the graph database space – exist to store today’s highly-connected data. That is, data where relationships matter.

A lot of data analysis today is performed in the context of better understanding people’s behavior or needs, such as:

  • How likely is this visitor to click on advertisement X?
  • Which products should I recommend to this user?
  • How are User A and User B connected?
People, as we know, are inherently social, so most of these questions can be answered by understanding the connections between people: User A is similar to User B, and we already know that User B likes this product, so let’s recommend this product to User A.

The Good News: Data-Munging No More

Data science doesn’t have to be 80% data munging. With the appropriate technology stack, a data scientist’s development process is seamless and short. It’s time to spend less time writing queries and more time building models by combining the flexibility of an open-source, NoSQL graph database with the maturity and breadth of R – an open-source statistical programming language.

The combination of Neo4j’s ability to store highly-connected, possibly-unstructured data and R’s functional, ad-hoc nature creates the ideal data analysis environment. You don’t have to spend an hour writing CREATE TABLE statements. You don’t have to spend all day on StackOverflow figuring out how to traverse a tree in SQL. Just Cypher and go.

Learn More at OSCON 2015

At my upcoming OSCON session we will walk through a project in which we analyze #OSCON Twitter data in a reproducible, low-effort workflow without writing a single line of SQL. For this highly-connected dataset we will use Neo4j, an open-source graph database, to store and query the data while highlighting the advantages of storing such data in a graph versus a relational schema. Finally, we will cover how to connect to Neo4j from an R environment for the purposes of performing common data science tasks, such as analysis, prediction and visualization.

Written by Nicole White.
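The "User A is similar to User B, so recommend what User B likes" reasoning can be made concrete with a tiny in-memory graph. This is an illustrative Python sketch only; it is not Neo4j, Cypher or R, and the users, products, the Jaccard similarity measure and the function names are all assumptions made for the example.

```python
# Hypothetical "likes" relationships: user -> set of products.
likes = {
    "user_a": {"book", "lamp"},
    "user_b": {"book", "lamp", "kettle"},
    "user_c": {"bike"},
}


def jaccard(a, b):
    """Share of products two users have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Suggest products liked by the most similar other user but not yet by `user`."""
    others = [u for u in likes if u != user]
    if not others:
        return set()
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    if jaccard(likes[user], likes[nearest]) == 0.0:
        return set()  # nobody is similar enough to borrow from
    return likes[nearest] - likes[user]


suggestions = recommend("user_a", likes)
```

Here `user_a` shares two products with `user_b`, so the one product only `user_b` likes becomes the suggestion. In a graph database the same question is a short traversal over relationships rather than a chain of table joins, which is the point the article makes.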
June 30, 2015
by Mark Needham
· 1,646 Views
JBoss BPM Suite Quick Guide: Import External Data Models to BPM Project
You are working on a big project, developing rules, events and processes at your enterprise for mission critical business needs. Part of the requirements state that a certain business unit will be providing their data model for you to leverage. This data model will not be designed in the JBoss BPM Suite Data Modeler but you need to have access to it while working on your rules, events and processes from the business central dashboard.

For this article we will be using the JBoss BPM Travel Agency demo project as a reference, with its current data model built externally to the JBoss BPM Suite business central. The external data model is called the acme-data-model and is found in the project directory.

This data model is built during installation and provides you with an object data model as a Java Archive (JAR) file which is installed into the JBoss BPM Suite business central component by placing it into the following location:

jboss-eap-6.4/standalone/deployments/business-central.war/WEB-INF/lib/acmeDataModel-1.0.jar

Figure: Authoring --> Artifact repository.

This way of deploying the data model means that it is available to all projects you work on in JBoss BPM Suite business central, something that might not always be preferable. What we need is a way to deploy external data models into JBoss BPM Suite and then selectively add them to projects as needed.

Within JBoss BPM Suite there is an Artifact Repository that is made just for this purpose. We can upload through the business central dashboard UI all our models and then pick and choose from the repository artifacts (your data model is one artifact) on a per project basis. This gives you absolute control over the models that a project can access.

Figure: Choose external data model file.
There are a few steps involved that we will take you through here to change the current installation of JBoss BPM Travel Agency, where the acmeDataModel-1.0.jar file will be removed from the previously mentioned business central component, uploaded into the Artifact Repository and added to the Special Trips Agency project. Here is how you can do it yourself:

1. Obtain and install the JBoss BPM Travel Agency demo project.
2. Remove the current data model from the global business central application:
   $ rm ./target/jboss-eap-6.4/standalone/deployments/business-central.war/WEB-INF/lib/acmeDataModel-1.0.jar
3. Start the JBoss BPM Suite server after installation as stated in the installation instructions.
4. Log in to JBoss BPM Suite at http://localhost:8080/business-central with u: erics p: bpmsuite1!
5. Go to AUTHORING --> ARTIFACT REPOSITORY.
6. Go to UPLOAD --> CHOOSE FILE... --> projects/acme-data-model/target/acmeDataModel-1.0.jar --> click the button to UPLOAD. This puts the external data model into the JBoss BPM Suite artifact repository.
7. Go to AUTHORING --> PROJECT AUTHORING --> OPEN PROJECT EDITOR.
8. In the project editor select GENERAL PROJECT SETTINGS --> DEPENDENCIES.
9. In dependencies select ADD FROM REPOSITORY --> in the pop-up SELECT the entry acmeDataModel-1.0.jar.

Figure: Upload external model jar file.
Figure: Select dependencies to add to project.

This will result in the external data model being added only to the Special Trips Agency project and not available to other projects unless they add this same dependency from the JBoss BPM Suite artifact repository.

If you build & deploy the project and run it as described in the project instructions, you will find that the external data model is available and used by the various rules and process components that are the JBoss BPM Travel Agency.

As a closing note, this works exactly the same for JBoss BRMS projects.
June 29, 2015
by Eric D. Schabell
· 3,162 Views · 1 Like
Spark Grows Up and Scales Out
Written by Craig Wentworth. To understand the furor that’s greeted recent vendor announcements around open source analytics computing engine Spark, and some commentary seemingly setting up a Spark versus Hadoop battle, it’s worth taking a moment to recap on what each actually is (and is not). As I covered in last year’s MWD report on Hadoop and its family of tools, when people talk about Apache Hadoop they’re often referring to a whole framework of tools designed to facilitate distributed parallel processing of large datasets. That processing was traditionally confined to MapReduce batch jobs in Hadoop’s early days, though Hadoop 2 brought the YARN resource scheduler and opened up Hadoop to streaming, real-time querying and a wider array of analytical programming applications (beyond MapReduce). Spark has been designed to run on top of Hadoop’s Distributed File System (amongst other data platforms) as an alternative to MapReduce – tuned for real-time streaming data processing and fast interactive queries, and with multi-genre analytics applicability (machine learning, time series, graph, SQL, streaming out-of-the-box). It gets that speed advantage by caching in-memory (rather than writing interim results to disk, as MapReduce does), but with that approach comes a need for higher-spec physical machines (compared with MapReduce’s tolerance for commodity hardware). So, Spark isn’t about to replace Hadoop -- but it may well supplant MapReduce (especially in growing real-time use cases). Those “Spark vs Hadoop” headlines are about as meaningful as one proclaiming “mushrooms vs pizza." Yes, mushroom might be a more suitable topping than, say, pepperoni (especially in a vegetarian use case), but it’ll still be deployed on the same dough and tomato sauce pizza platform. Nobody’s about to suggest the mushroom should go it alone! 
But what’s behind the headlines and the hype is a story of enterprise adoption – or at least vendors anticipating that adoption and investing in ‘the weaponization of Spark’ as it faces the more exacting standards of security, scaling performance, consistency, etc. which come with mainstream enterprise deployment. Big names like IBM, Databricks (the company formed by the originators of Spark), and MapR made commitments in and around the Spark Summit earlier this month. MapR has announced three new Quick Start Solutions for its Hadoop distribution to help customers get started with Spark in real-time security log analytics, genome sequencing, and time series analytics; Databricks’ cloud-hosted Spark platform (formerly known as Databricks Cloud) has become generally available; and IBM announced a raft of measures designed to give Spark a significant shot in the arm – it’s open sourcing its SystemML technology to bolster Spark’s machine learning capabilities, integrating Spark into its own analytics platforms, investing in Spark training and education, committing 3,500 of its researchers and developers to work on Spark-related projects, and offering Spark as a service on its Bluemix developer cloud. Given the overlap with Databricks’ business model (of offering development, certification, and support for Spark), IBM’s intentions are likely to tread on some toes before long – but for now, at least, both companies are content to focus on the combined push benefiting the Spark community and its enterprise aspirations overall (though clearly IBM’s betting on all this investment buying it some influence over where Spark goes next). It’s worth bearing in mind that not all its supporters champion Spark wholesale and all the interested parties tend to be interested in particular bits of Spark (as wide-ranging as it is) because of overlaps with their own preferred toolsets. 
For instance, although Spark supports many analytics genres, Cloudera focuses on its machine learning capabilities (as it has its own SQL-on-Hadoop tool in Impala), and MapR and Hortonworks also promote Drill and Hive as their favoured source of SQL-on-Hadoop. IBM’s support is focused on Spark’s machine learning and in-memory capabilities (hence the SystemML open sourcing news). In the face of such strong vendor preferences, how long before some of Spark’s current features fall away (or at least start to show the effects of being starved of as much care and feeding as is bestowed upon vendors’ favourite Spark components)? The Spark community is at much the same place the Hadoop one was at a while back – it’s showing great promise and suitability in key growth workloads (in Spark’s case, such as real-time IoT applications). However, the product as it stands is too immature for many enterprise tastes. Cue enterprise software vendors stepping up to help grow Spark up fast. Their challenge though is to smooth out the edges without smothering what made it so interesting in the first place.
June 28, 2015
by Angela Ashenden
· 2,353 Views
Analyzing Application Workload Data with Apprenda and R
One of the most important kinds of data a Platform as a Service (PaaS) can leverage is its knowledge of guest applications that run within its purview. A PaaS should know all sorts of things about guest applications – their architecture, dependencies, scale across infrastructure, and more. Data including application resource utilization metrics (CPU, RAM, etc.) are key for things like data center capacity planning, policy enforcement, and application isolation in the enterprise. A PaaS such as Apprenda provides this information through a centralized single lens – in our case, a collection of RESTful APIs – making it easier than ever before to run analytics on application metrics in the data center. Apprenda’s approach as a PaaS is to provide developers and platform operators with helpful information through platform extensibility and APIs. This is because there are plenty of tools in the data center that provide advanced analytic capabilities, so long as you can feed them the information they need. We integrate with tools like System Center, New Relic, and more all the time because these are the tools our customers have invested in, and they are great at what they do. Our job is not to reinvent these tools but instead to provide data. Apprenda captures information about applications such as their duration of deployment, resource policy (allocation of CPU and memory), actual utilization of resources, scale (# of instances), custom metadata, and more. All of this information can be fed into data center tools that help IT make important, data-driven, decisions. In the land of DevOps, however, it is not uncommon for folks to use this data in creative and innovative ways. Often times this means using the mechanism “du jour,” which can be scripting (PowerShell), a programming language (R), or an entire runtime (Node.js) to quickly and effectively grab, process, and manipulate data. 
As an example, let’s look at R, which is a powerful programming language centered on data mining and statistical analysis. It provides straightforward facilities for many types of data-analytics techniques, and is extensible using community-maintained packages. In the simple example below, I use standard R functions plus three packages (easily included using R’s install.packages() function):

1. jsonlite for parsing JSON data that the Apprenda API returns.
2. httr for handling the HTTP requests necessary to authenticate and retrieve data.
3. plotrix for help rendering a plot of retrieved data.

From there it’s pretty straightforward. The first step is to authenticate with your Apprenda environment:

I’ve now stored my Apprenda session token in a variable called ‘token.’ I’ll include that token as a header in my API call to get application data:

GET() is a function provided by the httr package that simplifies an HTTP request to the API. I’ve added the Apprenda session token to the HTTP Headers for authentication, and included a query string parameter that will help return all currently running application workloads on the platform.

The data that is returned is parsed and stored in the variable (in R, a vector) called ‘r’ which now has 151 records, one for each application workload. Each record in ‘r’ has 15 variables (properties) that we can use to run analytics across the entire collection of results. For the purposes of illustration, I’m going to use the variable componentType, which represents Apprenda’s knowledge of the type of application workload that was deployed – there are seven self-explanatory types: UserInterface, PublicUserInterface, WindowsService, JavaWebApplication, LinuxService, WcfService, and Database.
When the collection is then grouped by componentType, it becomes pretty simple to plot a chart showing the distribution of workload by the type of component:

The resulting plot (pie3D() comes from the plotrix package) looks like this:

I’ve had conversations with IT folks who couldn’t describe the architectural makeup of their application portfolio in any level of detail, yet in this case we pulled the data in real time with one line of R. Admittedly, a pie chart is a pretty watered down way to look at this information, but the point is the data is available and can be grouped, filtered, manipulated, and analyzed very simply with R. For this example, I used the open-source edition of RStudio.

Some other powerful information that could be gleaned from the platform’s APIs:

1. The average discrepancy between resource allocation and actual utilization per workload. (This is helpful in capacity planning.)
2. The longest-running application workload.
3. The most distributed applications. (This could aid in scaling decisions.)

There are many more.

A PaaS such as Apprenda is, by nature, in a unique spot in the data center stack because it maintains knowledge of both infrastructure and applications. It also serves as a hub for data that, when analyzed creatively, provides new insights. These insights are an opportunity for enterprises to enhance their practices to better serve developers and applications while operating more efficiently than ever.
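The grouping step described above is simple enough to sketch outside R as well. The snippet below is an illustrative Python sketch, not the article's R code and not a call to the Apprenda API: the records and their `name` field are made up, and only the componentType property and its values come from the text.

```python
from collections import Counter

# Made-up workload records standing in for a parsed API response.
workloads = [
    {"name": "billing-ui", "componentType": "UserInterface"},
    {"name": "billing-svc", "componentType": "WcfService"},
    {"name": "billing-db", "componentType": "Database"},
    {"name": "reports-svc", "componentType": "WcfService"},
    {"name": "portal", "componentType": "JavaWebApplication"},
]


def count_by_component_type(records):
    """Group workload records by componentType and count each group."""
    return Counter(record["componentType"] for record in records)


def percentage_shares(counts):
    """Convert group counts into the percentage slices a pie chart would show."""
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


counts = count_by_component_type(workloads)
shares = percentage_shares(counts)
```

The resulting shares are what a pie chart of the workload distribution would display; the same grouped counts could also feed the other questions listed above, such as which applications are most distributed.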
June 27, 2015
by Matthew Ammerman
· 3,189 Views
Web Data Mining Services Give Business Intelligence to Your Start-up!
The business sphere has become an extremely competitive arena. Dynamics change in a blink. Times have become highly unpredictable, and businesses today need to be agile while being equipped with reliable, accurate, relevant, and actionable business intelligence. Every business venture has its fair share of ebbs and tides. It is even more of a challenge to prove your capabilities and gain a strong hold in the market when you are just taking your first steps. For startups, getting the smallest nuances of running a business right from day one is the most crucial part. To sail smoothly through this enormously competitive space, startups need to perform above and beyond expectations from the very beginning.

The initial barriers are easier to overcome when your business is armed with the smallest details of the market. But how do you catch the pulse of the market, you will ask? Data extraction, or data mining, services are the answer. Data mining equips you with rich business intelligence that in turn gives you firm control of things and empowers you to make informed business decisions and create more targeted, applicable, and growth-oriented business strategies. Data extraction services gather huge volumes of data that is highly varied, precise, and relevant. Most importantly, it is very useful for your new startup. A meticulous study of this data allows you to analyze things in great detail, and arranging the scattered information into meaningful clusters helps you get the whole picture.

Which are the different ways for startups to effectively use web data mining?

Web data mining covers a wide array of techniques, which can be employed for a variety of purposes to generate various kinds of important data and gain actionable insights. In fact, for a startup, the most critical part is deciding where and how to use this powerful technique to get valuable information that can make a difference to the company’s overall future prospects. Let’s check out some of the interesting avenues where you can apply web data extraction techniques.

Digging information for social rankings and backlinks

For any startup, the most crucial business process is analyzing its competitors. This is one area where web data extraction comes across as an instrumental enabler. Many startups have effectively used data mining to fish out critically useful information about the social rankings of competing companies. Social ranking is an equally important factor, since ‘social actions’ on the internet are the building blocks of opinion and build a reputation in this day and age. Keeping these things in mind, you can use web data extraction to dig out social rankings related to content created by your competitors. With thorough analysis, you can get a very clear picture of the entire situation, which helps you arrive at concrete conclusions about what your competitors are doing well and what sells best.

Obtaining contact information

Building a strong network is the best bet for getting through a volatile market, especially when you are a newcomer. Whether it is with prospective or existing customers, industry peers, associates, or competitors, excellent networking built on open and transparent communication is a driving force behind the success of your startup. To have such an effective communication and networking channel, you need a large, robust list of contact information that matches your exact requirements. Mining data from multiple web sources is well suited to this. In a short period of time you can collect rich contact information that can be leveraged in a number of ways. You can form long-lasting business relationships or let potential customers know what you offer; this information gives a thrust to your startup and propels it to new levels of recognition.

For building brand, promotion, and advertisement

For startups, the very first wave of promotion is the key that builds strong brand value in the market and supports long-term business success. It is during this initial phase that the first public perception of your company is created and the essentials of public opinion start taking shape. For this reason, you need to be precise with your marketing and promotion during these formative years. To achieve this, you need a strong, in-depth understanding of the audience you want to target. You need to classify your target audience based on factors like age, gender, income, demographics, and preferences. Such detailed understanding can be attained only when you have voluminous social data about the targeted audience, and web data extraction is an effective way to obtain it.

With such a powerful weapon in your arsenal, you can boost your startup and take it a long way with clever decisions and timely implementation. Web data extraction can be one of the most valuable tools a startup has. Used appropriately, it should give you plenty of relevant business intelligence to help you shine in this competitive market.
June 26, 2015
by Ritesh Sanghani
· 1,619 Views
Spring Integration Kafka 1.2 is Available, With 0.8.2 Support and Performance Enhancements
Spring Integration Kafka 1.2 is out with a major performance overhaul.
June 25, 2015
by Pieter Humphrey
· 2,998 Views
8 Key Findings About IoT Development
IoT is really hot, but it can also be a bit confusing. Read about these 8 key findings on IoT development.
June 24, 2015
by Burke Holland
· 1,902 Views
Information Builders Showcases Hot Business Intelligence Trends in "Summer Shorts" Webcast Series
London, UK – June 23, 2015 – Information Builders, a leader in business intelligence (BI) and analytics, information integrity, and integration solutions, today announced a new webcast series, “Summer Shorts,” designed to provide viewers quick overviews of the hottest topics in BI and analytics. Information Builders’ Summer Shorts will help enterprises rethink information strategies in a world transformed by the forces of mobile, social, cloud, advanced analytics, and big data. In each session, an Information Builders expert will offer a fun, informative presentation on a different BI and analytics discipline. Viewers can join one or all of the sessions below to learn tips for leveraging emerging technologies for better BI. 8 July | 14:00 BST / 15:00 CET | The Art of Dashboard Design for Business Intelligence – What are your dashboards telling you and your customers? Peter O’Grady will walk through design theories, design and layout considerations, and form-factor awareness and responsive design. Be empowered to change your data visualisation strategies, practices, and processes. 22 July | 14:00 BST / 15:00 CET | Advanced Data Visualization – Data visualisation is red hot, and for good reason. Companies in all sectors are finding hidden insights with sophisticated data visualisation. In this webcast by Porter Thorndike, attendees will learn advanced tips for data analysis, visualisation plug-in architecture, polished finished examples, and visualisation-based InfoApps™ from Information Builders. 5 August | 14:00 BST / 15:00 CET | Social and Feedback Analysis – Join this social media analytics webcast to learn how to better understand customer sentiment and behavior. Dan Grady will discuss how to capitalise on the opportunities presented by social media, including integrating social data with enterprise data, improving customer engagement, and picking the right platform to consolidate and share this information. 
19 August | 14:00 BST / 15:00 CET | 5 Hot Trends for Business Intelligence – Mobile, social, cloud, advanced analytics, and big data aren’t just big trends, they also raise big questions in BI and analytics. Chris Banks will describe in this webcast why BI is vital to making these trends work for companies. It will cover how to build once and responsibly deploy BI to mobile devices, how to expose relevant analytics to customers and partners, and best practices for harnessing big data.
June 23, 2015
by Fran Cator
· 1,094 Views
This Week In Modern Software: Inside Obama’s Geek Squad
[This article was written by Kevin Casey]

Welcome to This Week in Modern Software, or TWiMS, New Relic’s weekly roundup of the need-to-know news, stories, and events of interest surrounding software analytics, cloud computing, application monitoring, development methodologies, programming languages, and the myriad other issues that influence modern software. This week, our top story goes inside President Obama’s secret team of tech geeks, 140 of them and counting.

TWiMS Top Story: Inside Obama’s Stealth Startup (Fast Company)

What it’s about: If the President of the United States walked into the room and personally recruited you to rebuild the country’s technology infrastructure, could you turn him down? He’s serious, and that room is the Roosevelt Room in the West Wing of the White House, by the way. As Lisa Gelobter says: “What are you going to say that?” Gelobter’s answer was “Yes”: she’s now chief digital officer for the US Department of Education, part of a 140-person-and-counting tech team that’s functioning something like an elite startup embedded inside the federal government. Its business? Only modernizing the technical infrastructure, applications, and processes of just about every federal agency.

Why you should care: What was once something of a tech desert, the federal government, is beginning to draw top private-sector talent inside the Beltway. The team, led by Mikey Dickerson (who helped lead the team that rescued Healthcare.gov) and former US CTO Todd Park, also includes the likes of former Googler Matthew Weaver, and it hopes to hit 500 people by the end of 2016, shortly before President Obama leaves office. Its challenges are immense, from tackling government bureaucracy (to test just how entrenched the suits were, Weaver requested the official title “Rogue Leader”, and he got it) to the fact that its recruiting pitch includes the phrase: “You’ll have to take a pay cut.” But its mission is both noble and necessary, and the appeal of working on major problems with enormous public impact appears to be working. Recommended reading.

Further reading: Mikey Dickerson’s 10 Tips for Dealing with Bureaucracy (New Relic Blog) [Video]

Airbnb Open Sources Software to Lure Talent Amid ‘Insane’ Competition (CIO Journal)

What it’s about: Airbnb added three new apps to its open source portfolio earlier this month, but the motivation wasn’t just to give employees the best business tools or to contribute to the software community at large. Sure, that might have been part of the equation, but the rental booking site hopes open-sourcing some of its toolkit will help recruit the best software talent in the face of what director of engineering Mike Curtis calls “insane” competition in the Silicon Valley labor market.

Why you should care: In the software arms race, any little edge counts. Curtis tells CIO Journal that Airbnb will keep the proprietary stuff closely guarded, of course. But it will open source “generic” tools with wider industry use cases, such as its recently released Aerosolve machine-learning package and its Airpal cloud-based data querying tool. The latter, which works with Facebook’s open source PrestoDB, aims to simplify SQL queries to the point where you don’t need to be a big data wonk or business intelligence guru to run one. Indeed, one in three Airbnb employees has run a query on it in the year since it launched. Airbnb has contributed a dozen open source tools on its aptly named Nerds site (gotta love that!) to date, something the company hopes both contributes to the greater good and advertises its software innovation to potential hires.

Google Is Wielding Its Own Secret Weapon in the Cloud (The New York Times)

What it’s about: In the cutthroat competition for public cloud business, Google may be its own best customer testimonial. In advance of this week’s Open Network Summit, the Times’ Bits blog looked at Google’s plan to not only unveil cloud customers such as HTC but also reveal much more than ever before about its own infrastructure. Google did just that on Wednesday, offering a look inside its data center networking, including its massive-capacity, lightning-fast Jupiter network.

Why you should care: As major cloud players continue to zap prices with their shrink-rays, it’s increasingly clear that features and underlying platforms will distinguish one from the other when enterprise users make their pick. Google is taking a big step toward writing its own story in this regard, and the synopsis might read something like: “We’re pretty good at this stuff.” Its Jupiter fabrics deliver 1 petabit per second of bisection bandwidth, according to Google, or “enough for 100,000 servers to exchange information at 10Gb/s each, enough to read the entire scanned contents of the Library of Congress in less than 1/10th of a second.” If it sounds like a bit of bragging, well, yeah, it is. But it’s bragging with a purpose: attracting devs who want access to the same technology without having to build it themselves. Google’s Amin Vahdat connected the dots in a blog post: “The same networks that power all of Google’s internal infrastructure and services also power Google Cloud Platform.”

Move Over, Meeker: Byron Deeter’s State of the Cloud Report (Bessemer Venture Partners)

What it’s about: With a nod to Mary Meeker’s classic State of the Internet report, Bessemer Venture Partners’ Byron Deeter checks in with his 2015 State of the Cloud Report. Given cloud computing’s relative youth and rampant ascension, it’s no surprise the stats are staggering. Here’s one to start: cloud revenues have increased tenfold in the last six years, from a scant $5.6 billion in 2008 to more than $56 billion in 2014. And they are going to more than double again in the next four years, according to BVP’s projections, to $127.5 billion in 2018.

Why you should care: Deeter’s full presentation is worth a weekend watch or read, but it’s the forward-looking slides that may be most compelling for software pros. Deeter notes both the immense risks and opportunities in cloud security, unveiling a 10-point security plan for cloud startups on slide 37. To underscore the security landscape, Deeter quotes an unnamed cloud CEO who says a DDoS attack that took down the firm’s API caused more customer churn in one day than in the rest of its history. Wow. He also addresses the exploding market for cloud services built specifically for developers including, yes, New Relic. And for mobile developers, slide 44 underscores something we’ve talked about before in this space: the real money’s in enterprise apps, and it’s still a largely untapped market. Click through the full slide deck here or watch video of Deeter’s presentation here.

Bandwidth: The Next Frontier of Cloud Computing (ZDNet)

What it’s about: Is networking the next big thing in the everything-as-a-service age? It just might be, as firms like Pacnet vie to deliver networking capacity on a pay-for-what-you-use model that some industry folks say better suits cloud environments facing significant but uneven networking needs.

Why you should care: As author Drew Turney notes, there’s a common blind spot when it comes to cloud computing’s many shapes and sizes: moving all that data from points A to Z, and everywhere in between, which can cause both performance problems and undue financial pressures. The promise of Networking-as-a-Service (NaaS), industry execs tell Turney, is that it can provide more efficient, scalable networking for short-term usage bursts such as customer traffic spikes or large cloud backup-and-storage jobs, enabling companies to later dial down their capacity as needed. Combined with Software-Defined Networking (SDN), NaaS makes it possible to build intelligent applications that manage their own networking needs, which might be the most significant enterprise potential of NaaS, says Nuage Networks architect Marten Hauville.

Page Bloat: Average Web Page Now More Than 2MB (The Performance Beacon, SOASTA)

What it’s about: Do you need to put your website on a diet? Apparently so: the average Web page topped 2 MB as of May 2015, according to ongoing tracking at The Performance Beacon. That’s double the average page weight from just three years ago. The site projects average page weight will exceed 3 MB in late 2017.

Why you should care: Performance, performance, performance: slow speeds are a killer in the modern software era. While author and SOASTA UX evangelist Tammy Everts rightly notes that page weight is not the only factor in Web optimization, we’re simply not paying it enough attention when designing and building Web pages. Images are the big culprit in the Web’s expanding waistline: they comprise nearly two-thirds of the average page’s weight, and video is a growing part of our Web diet, too. But other factors such as custom fonts play a role, adding weight even as the Web sheds previous performance hogs like Flash. The ideal weight? 1 MB, she says, which will save crucial seconds in load times. Sounds like it’s time to hit the virtual treadmill.
June 23, 2015
by Fredric Paul
· 1,086 Views
Big Data TCO Lessons From Virtualization Technology Sprawl
The complexity of big data makes it a difficult concept for many to grasp, and utilizing it effectively is one of the biggest challenges businesses face today. There is little doubt that big data offers organizations a number of clear advantages, but applying them across the entire enterprise is an obstacle that can be formidable, even daunting, to the most technologically savvy companies. One department might be able to create its own business solutions through big data analytics, while another department might come up with answers of its own, but the lack of true coordination and collaboration remains a significant problem. Businesses aren’t without help in this area, however, because they’ve encountered similar problems before. Many companies have dealt with issues such as virtualization technology sprawl, and the lessons learned from addressing that problem could prove exceptionally valuable when dealing with big data total cost of ownership (TCO).

To understand the problem and the solution, we must first look back at the rapid growth of virtualization technology, more specifically server virtualization. As businesses adopted virtualization, the mainframe systems soon diverged into multiple systems. The more popular virtualization became, the more projects were taken on and the more technologies diverged. Larger companies eventually sought technology specialists to work within their areas of expertise. The result of using these individual teams was virtualization technology sprawl, an inefficient development that eventually led to even higher operational costs. For all the benefits virtualization technology offered, many of them were outweighed by the increased demands and greater management complexity that came from technology sprawl.

Businesses were quick to come up with new solutions for the problem. The most common was to adopt a converged infrastructure. This strategy directly addressed the higher operational costs that resulted from technology sprawl, breaking through the silos by taking multiple technologies and combining them into single stacks for computing, storage, and networking. This made the management of virtualization technology much easier, since operational complexity was significantly reduced. In other words, management of this technology was kept at a reasonable size.

The same principle can apply to big data management across an entire organization. When it comes to management of big data and Hadoop security, it’s easy to get caught up in the immensity of it all. The fact that big data is so versatile and can be applied to so many different use cases also means it can apply to any number of different divisions within a company. This creates silos and a general desire to hold onto data sets. In other words, big data ends up in a sprawl of its own, becoming that much more unwieldy and complicated, which is a major problem for a technology that’s already so complex to begin with.

The lesson that every company should take away from the solution to virtualization technology sprawl is the breaking down of barriers to big data management. It all comes down to ready access to all the necessary data no matter what role an employee may have within a company. Businesses shouldn’t have to worry over the cost of storing and processing data, since the insights gained from big data analytics are particularly valuable. Most importantly, it’s about keeping big data from getting too big, to the point where it becomes unmanageable and merely adds to the overall operating costs of a company. It’s true that big data introduces more complexity, but businesses that have learned how to store and process it efficiently, sometimes through big data platforms or cloud-based services, are in a more advantageous position than companies still dealing with technology sprawl. The lessons learned from previous problems can indeed play a helpful role in solving the problems many experience today.
June 22, 2015
by Rick Delgado
· 1,945 Views
ParStream to Present Requirements of an Analytics Platform for IoT at the TDWI Munich Conference 2015
COLOGNE, Germany – June 22, 2015 – ParStream, the IoT analytics company, today announced its participation at the TDWI Munich Conference 2015, one of the largest gatherings of expert Business Intelligence, Big Data and data warehousing leaders and educators in Europe. The conference will take place June 22-24, 2015 at the MOC Order and Event Center in Munich, Germany. Albert Aschauer, Sales Director DACH at ParStream, will present on requirements for an analytics platform for the Internet of Things (IoT) based on real-world use cases from the renewable energy and telecommunications industries. Big Data, fast data, edge analytics and real-time insights are driving new technology innovation to meet the demand for getting more value from IoT data. Additional details on the speaking session are below. What: “Requirements of an Analytics Platform for the Internet of Things” When: Monday, June 22, 2015 at 11:35 a.m. CEST Who: Albert Aschauer, Sales Director DACH at ParStream Where: MOC Munich, Germany – Room F112 To schedule a one-on-one meeting with Albert Aschauer and ParStream at TDWI Munich Conference 2015, send an email to events(at)parstream(dot)com.
June 22, 2015
by Fran Cator
· 1,112 Views
Spring XD 1.2 GA, Spring XD 1.1.3 and Flo for Spring XD Beta Released
Written by Mark Pollack.

Today, we are pleased to announce the general availability of Spring XD 1.2, Spring XD 1.1.3 and the release of Flo for Spring XD Beta.

  • 1.2.0.GA: zip
  • 1.1.3.RELEASE: zip
  • Flo for Spring XD Beta

You can also install XD 1.2 using brew and rpm.

The 1.2 release includes a wide range of new features and improvements. The release journey was an eventful one, mainly due to Spring XD’s popularity with so many different groups, each with their respective request priorities. However, the Spring XD team rose to the challenge, and it is rewarding to look back and review the amount of innovation delivered to meet our commitments toward simplifying big data complexity. Here is a summary of what we have been busy with for the last 3 months and the value created for the community and our customers.

Flo for Spring XD and UI improvements

Flo for Spring XD is an HTML5 canvas application that runs on top of the Spring XD runtime, offering a graphical interface for creating, managing, and monitoring streaming data pipelines. Here is a short screencast showing you how to build an advanced stream definition. You can browse the documentation for additional information and links to additional screencasts of Flo in action. The XD admin screen also includes a new Analytics section that allows you to easily view gauges, counters, field-value counters and aggregate counters.

Performance Improvements

Anticipating increased high-throughput and low-latency IoT requirements, we’ve made several performance optimizations within the underlying message-bus implementation to deliver several million messages per second transported between Spring XD containers using Kafka as a transport. With these optimizations, we are now on par with the performance from Kafka’s own testing tools. However, we are using the more feature-rich Spring Integration Kafka client instead of Kafka’s high-level consumer library. For anyone who is interested in reproducing these numbers, please refer to the XD benchmarking blog, which describes the tests performed and infrastructure used in detail.

Apache Ambari and Pivotal HD

To help automate the deployment of Spring XD on an Apache HadoopⓇ cluster, we added an Apache AmbariⓇ plugin for Spring XD. The plugin is supported on both Pivotal HD 3.0 and Hortonworks HDP 2.2 distributions. We also added support in Spring XD for Pivotal HD 3.0, bringing the total number of Hadoop versions supported to five.

New Sources, Processors, Sinks, and Batch Jobs

One of Spring XD’s biggest value propositions is its complete set of out-of-the-box data connectivity adapters that can be used to create real-time and batch-based data pipelines, and these require little to no user code for common use cases. With the help of community contributions, we now have MongoDB, VideCap, and FTP as source modules, an XSLT-transformer processor, and an FTP sink module. The XD team also developed a Cassandra sink and a language-detection processor. Recognizing their important role in the Pivotal Big Data portfolio, we have also added native integration with Pivotal Greenplum Database and Pivotal HAWQ through a gpfdist sink for real-time streaming, as well as support for gpload-based batch jobs.

Adding to our developer productivity theme and the use of Spring XD in production for high-volume data ingest use cases, we are delighted to recognize Simon Tao and Yu Cao (EMC² Office of The CTO & Labs China), who have been operationalizing Spring XD data pipelines in production since 2014 and who contributed the VideCap source module. Their use case and implementation specifics (in their own words) are below.

“There are significant demands to extract insights from large magnitude of unstructured video streams for the video surveillance industry. Prior to being analyzed by data scientists, the video surveillance data needs to be ingested in the first place. To tackle this challenge, we built a highly scalable and extensible video-data ingestion platform using Spring XD. This platform is operationally ready to ingest different kinds of video sources into a centralized Big Data Lake. Given the out-of-the-box features within Spring XD, the platform is designed to allow rich video content processing capabilities such as video transcoding and object detection, etc. The platform also supports various types of video sources, data processors, and data exporting destinations (e.g. HDFS, Gemfire XD and Spark), which are built as custom modules in Spring XD and are highly reusable and composable. With a declarative DSL, a video ingestion stream will be handled by a video ingestion pipeline defined as Directed Acyclic Graph of modules. The pipeline is designed to be deployed in a clustered environment with upstream modules transferring data to downstream ones efficiently via the message bus. The Spring-XD distributed runtime allows each module in the pipeline to have multiple instances that run in parallel on different nodes. By scaling out horizontally, our system is capable of supporting large scale video surveillance deployment with high volume of video data and complex data processing workloads.”

Custom Module Registry and HA Support

Though we have had the flexibility to configure a shared network location for distributed availability of custom modules (via xd.customModule.home), we also recognized the importance of keeping the module registry resilient under failure scenarios; hence, we now have an HDFS-backed module registry. Having this setup for production deployment provides consistent availability of custom module bits and the flexibility of choices, as needed by the business requirements.

Pivotal Cloud Foundry Integration

Furthering the Pivotal Cloud Foundry integration efforts, we have made several foundation-level changes to the Spring XD runtime, so we are able to run Spring XD modules as cloud-native apps in Lattice and Diego. We have aggressive roadmap plans to launch Spring XD on Diego proper. While studying Diego’s Receptor API (written in Go!), we created a Java Receptor API, which is now proposed to Cloud Foundry for incubation.

Next Steps

We have some very interesting developments on the horizon. Perhaps most importantly, we will be launching new projects that focus on message-driven and batch-oriented “data microservices”. These will be built directly on Spring Boot as well as Spring Integration and Spring Batch, respectively. Our main goal is to provide the simplest possible developer experience for creating cloud-native, data-centric microservice apps. In turn, Spring XD 2.0 will be refactored as a layer above those projects, to support the composition of those data microservices into streams and jobs as well as all of the “as a service” aspects that it provides today, but it will have a major focus on deployment to Cloud Foundry and Lattice. We will be posting more on these new projects soon, so stay tuned!

Feedback is very important, so please get in touch with questions and comments via:

  • StackOverflow spring-xd tag
  • Spring JIRA or GitHub Issues

Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal, Pivotal HD, Pivotal Greenplum Database, Pivotal Gemfire and Pivotal Cloud Foundry are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop, Hadoop and Apache Ambari are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
June 21, 2015
by Pieter Humphrey
· 3,729 Views
Data's Hierarchy of Needs
This post was originally published in the AppsFlyer blog.

A couple of weeks ago Nir Rubinshtein and I presented AppsFlyer’s data architecture at a meetup of Big Data & Data Science Israel. One of the concepts that I presented there, which is worth expanding upon, is “Data’s Hierarchy of Needs”:

  • Data should Exist
  • Data should be Accessible
  • Data should be Usable
  • Data should be Distilled
  • Data should be Presented

How can we make data “achieve its pinnacle of existence” and be acted upon? In other words, what are the areas that should be addressed when designing a data architecture if you want it to be complete and to enable creating insights and value from the data you generate and collect? If done properly, your users might just act upon the data you provide. This list might seem a little simplistic, but it is not a prescription of what to do; rather, it is a set of reminders of the areas we need to cover and the questions we need answered to properly create a data architecture.

Data Should Exist

Well, of course data should exist, and it probably does. What you should ask yourself, however, is whether the data that exists is the right data. Does the retention policy you have serve the business needs? Does the availability fit your needs? Do you have all the needed links (foreign keys) to other data so you’d be able to connect it later for analysis?

To make this more concrete, consider the following example: AppsFlyer accepts several types of events (launches, in-app events, etc.) which are tied to apps. Apps are connected to accounts (an account would have one or more applications, usually at least an iOS app and an Android one). If we saved only the latest snapshot of the accounts and an app changed ownership, the historical data from before that change would be skewed. If we treat the accounts as a slowly changing dimension of the events, then we’d be able to handle the transition correctly.
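One way to picture the slowly changing dimension described above is a per-app ownership history against which each event is resolved by its own date. The Python sketch below is illustrative only: the OWNERSHIP table, the identifiers in it, and the owner_at function are hypothetical and are not AppsFlyer's implementation.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical ownership history per app: (effective_from, account_id),
# kept sorted by effective_from. A "latest snapshot" would hold only the
# last entry and lose the earlier owner.
OWNERSHIP = {
    "app-1": [
        (date(2014, 1, 1), "account-A"),
        (date(2015, 3, 1), "account-B"),
    ],
}


def owner_at(app_id, event_date, history=OWNERSHIP):
    """Return the account that owned app_id on event_date, or None."""
    periods = history[app_id]
    starts = [start for start, _ in periods]
    # Index of the last period that started on or before event_date.
    idx = bisect_right(starts, event_date) - 1
    if idx < 0:
        return None  # the event predates any known ownership record
    return periods[idx][1]


if __name__ == "__main__":
    print(owner_at("app-1", date(2014, 6, 1)))
    print(owner_at("app-1", date(2015, 6, 1)))
```

Because the lookup is keyed on the event's date rather than on the current state, an event recorded before the ownership change still resolves to the earlier account.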
Note that we may still choose to provide the new owner with the historical data, but now it is not the only option the system supports, and the decision can be based on business needs.

Data Should Be Accessible

If data is written to disk, it is at least accessible programmatically. However, there can be many levels of accessibility, and we need to think about our end users’ needs and the level of access they’d require. At AppsFlyer, data existence (mentioned above) is handled by processing all the messages that go through our queues using Kafka, but that data is saved in sequence files and stored by event time. Most of our usage scenarios do have a time component, but they are primarily organized by app or account. Any processing that needs a specific account and accesses the raw events would have to sift through tons of records (3.7+ billion a day at the time of this post) to find the few relevant ones. Thus, one basic move toward accessibility of data is to sort by app so that queries only need to access a small subset of the data and thus run much faster.

Then we need to consider the “hotness” of the data, i.e. what response times we need and for which types of data. For instance, aggregations such as retention reports need to be accessed online (so-called “sub-second” response), latest counts need near real-time access, and explorations of data for new patterns can take hours. To support these varied usage scenarios, we need to create multiple projections of our data, most likely using several different technologies. AppsFlyer stores raw data in sequence files, processed data in Parquet files (accessible via Apache Spark), aggregations and recent data in a columnar RDBMS, and near real-time data in memory.

The three storage mechanisms mentioned above (Parquet, columnar RDBMS, and in-memory data grid) used at AppsFlyer all have SQL access; this is not by chance.
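As a small illustration of two ideas from this section (organizing events by app so a query touches only a small subset, and exposing the data through SQL), here is a minimal sketch using Python's built-in `sqlite3` module. SQLite merely stands in for the stores named above; the `events` table, its columns, the index name, and the sample rows are all assumptions made for the example, and the index plays the role that an app-sorted file layout would play at scale.

```python
import sqlite3

# Hypothetical sample events: (app_id, event_time, event_type).
SAMPLE_EVENTS = [
    ("app-1", "2015-06-20T23:00:00", "launch"),
    ("app-1", "2015-06-21T08:00:00", "launch"),
    ("app-2", "2015-06-21T08:30:00", "in-app"),
    ("app-1", "2015-06-21T09:15:00", "in-app"),
]


def build_store(rows=SAMPLE_EVENTS):
    """Create an in-memory event store organized by app, then by time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (app_id TEXT, event_time TEXT, event_type TEXT)"
    )
    # The composite index lets a per-app query skip other apps' rows.
    conn.execute(
        "CREATE INDEX idx_events_app_time ON events (app_id, event_time)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def events_for_app(conn, app_id, since):
    """Return (event_time, event_type) for one app from `since` onward."""
    cur = conn.execute(
        "SELECT event_time, event_type FROM events "
        "WHERE app_id = ? AND event_time >= ? ORDER BY event_time",
        (app_id, since),
    )
    return cur.fetchall()


conn = build_store()
```

Because the interface is plain SQL, the same question ("what happened in app-1 since midnight?") can be asked by an analyst without writing any code against the raw files.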
While we (the industry) went through a short period of NoSQL, SQL or almost-SQL is once again becoming the norm, even for semi-structured and poly-structured data. Providing an SQL interface to your data is another important aspect of data accessibility, as it allows expanding the user base for the data beyond R&D. Again, this is important not just for your relational data.

Data Should Be Usable

What’s the difference between accessible data and usable data? For one, there’s data cleansing. This is a no-brainer if you pull data from disparate systems, but it is also needed if your source is a single system. Data cleansing is what traditional ETL is all about, and the techniques still apply.

Another aspect of making data usable is enriching it, or connecting it to additional data. Enrichment can draw on internal sources, like linking CRM data to the account info. It can also draw on external sources, as with getting the app category from the app store or the device screen size from a device database.

Last but not least, consider the legal and privacy aspects of the data. Before allowing access to the data you may need to mask sensitive information or remove privacy-related data (sometimes you shouldn’t even save it in the first place). At AppsFlyer we take this issue very seriously and make major efforts to comply when working with partners and clients to make sure privacy-related data is handled correctly. In fact, we are also undergoing independent SOC auditing to make sure we are compliant with the highest standards.

To summarize: to make the data usable you have to make sure it is correct, connect it to other data, and make sure that it is compliant with legal and privacy requirements.

Data Should Be Distilled

Distilling insights is the reason we perform all the previous steps. Data in itself is of little use if it doesn’t help us make better decisions.
There are multiple types of insights you can generate here, beginning with the more traditional BI scenarios of slice-and-dice analytics, going through real-time aggregations and trend analysis, and ending with applying machine learning or “advanced analytics”. You can see one example of the type of insights that can be gleaned from our data by looking at the Gaming Advertising Performance Index we recently published.

Data Should Be Presented

This point ties in nicely with the Gaming Advertising Performance Index example provided above. Getting insights is an important step, but if you fail to present them in a coherent and cohesive manner, then the actual value users can derive from them is limited at best. Note that even if you use insights for making decisions (e.g. recommending a product to a user), you’d still need to present how well this decision is doing.

There are many issues that need to be dealt with from a UX perspective, both in how users interact with the data and in how the data is presented. An example of the former is deciding on chart types for the data. A simple example of the latter: when presenting projected or inaccurate data, it should be clear to the users that they are looking at approximations, to prevent support calls about numbers not adding up.

Making sure all the areas discussed above are covered and handled properly is a lot of work, but providing a solution that actually helps your users make better decisions is well worth it. The data’s hierarchy of needs is not a prescription of how to get there; it is merely a set of waypoints to help navigate toward this end goal. It helps me think holistically about AppsFlyer’s data needs, and I hope that after reading this post it will help you too.

For more information about our architecture, check out the presentation from the meetup: Architecture for Real-Time and Batch Big Data Analytics Distilling insights @ AppsFlyer
June 21, 2015
by Arnon Rotem-gal-oz
· 1,133 Views
Enabling DataOps with Easy Log Analytics
DataOps is becoming an important consideration for organizations. Why? Well, DataOps is about making sure data is collected, analyzed, and available across the company, i.e. Ops insight for your decision-making systems like Hubspot, Tableau, Salesforce and more. Such systems are key to day-to-day operations and in many cases are as important as keeping your customer-facing systems up and running. If you think about it, today every online business is a data-driven business! Everyone is accountable for having up-to-the-minute answers on what is happening across their systems. You can’t do this reliably without having DataOps in place.

We have seen this trend across our own customer base at Logentries, where more and more customers are using log data to implement DataOps across their organization. Using log data for DataOps allows you to perform the following:
1. Troubleshoot the systems managing your data by identifying errors and correlating data sources
2. Get notified when one of these systems is experiencing issues via real-time alerts or anomaly detection
3. Analyze how these systems are used by the organization

Logentries has always been great at 1 and 2 above, and this week we have enhanced Logentries to allow you to perform easier and more powerful analytics with our new easy-to-use, SQL-like query language, Logentries QL (LEQL). LEQL is designed to make analyzing your log data dead simple. There are too many log management tools that are built around complex query languages and require data scientists to operate. Logentries is all about making log data accessible to anyone. With LEQL you can use analytical functions like CountUnique, Min, Max, GroupBy, Sort and more. A number of our users have already been testing these out via our beta program.

One great example is how Pluralsight has been using Logentries to manage and understand the usage of their Tableau environment. For example:
  • Calculating the rate of errors over the past 24 hours, e.g. using the LEQL Count function
  • Understanding user usage patterns, e.g. using GroupBy to understand queries performed grouped by different users
  • Sorting the data to find the most popular queries and how long they are taking

Being able to answer these types of questions enables DataOps teams to understand where they need to invest time going forward. For example, do I need to add capacity to improve query performance? Are internal teams having a good user experience, or are they getting a lot of errors when they try to access data?

At Logentries we are all about making the power of log data accessible to everyone, and as we do this we are constantly seeing cool new use cases for logs. If you have some cool use cases, do let us know!
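The three analyses listed above (an error rate over the past 24 hours, usage grouped by user, and queries sorted by popularity) can be sketched in plain Python to show what each one computes. This is not LEQL syntax, which the post does not show; the log entries, their field names (`time`, `user`, `query`, `seconds`, `level`), and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log entries from a BI/query environment.
LOGS = [
    {"time": datetime(2015, 6, 19, 8, 0), "user": "ana",
     "query": "sales_by_region", "seconds": 2.0, "level": "INFO"},
    {"time": datetime(2015, 6, 20, 9, 0), "user": "ana",
     "query": "sales_by_region", "seconds": 4.0, "level": "INFO"},
    {"time": datetime(2015, 6, 20, 10, 0), "user": "ben",
     "query": "churn_report", "seconds": 10.0, "level": "ERROR"},
    {"time": datetime(2015, 6, 20, 11, 0), "user": "ben",
     "query": "sales_by_region", "seconds": 3.0, "level": "INFO"},
    {"time": datetime(2015, 6, 20, 12, 0), "user": "cy",
     "query": "churn_report", "seconds": 6.0, "level": "INFO"},
]


def error_rate(logs, now, window=timedelta(hours=24)):
    """Fraction of entries in the trailing window that are errors."""
    recent = [e for e in logs if now - window <= e["time"] <= now]
    if not recent:
        return 0.0
    errors = sum(1 for e in recent if e["level"] == "ERROR")
    return errors / len(recent)


def queries_per_user(logs):
    """Group-by: number of queries issued by each user."""
    return Counter(e["user"] for e in logs)


def popular_queries(logs):
    """Sort: (query, run count, mean seconds), most popular first."""
    durations = defaultdict(list)
    for e in logs:
        durations[e["query"]].append(e["seconds"])
    stats = [(q, len(d), sum(d) / len(d)) for q, d in durations.items()]
    return sorted(stats, key=lambda s: s[1], reverse=True)
```

A query language built for logs expresses each of these as a single statement; the sketch only makes explicit the counting, grouping, and sorting that such a statement performs.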
June 21, 2015
by Trevor Parsons
· 968 Views