For a look at what's been happening outside of the Big Data Zone, we've assembled a collection of links from around the web covering all the tutorials, tools, new releases, rants, and raves you might have missed over the past couple of weeks:
Tutorials & Tools
The range of technologies available for collecting and examining data is constantly growing, in both web and desktop applications, which provide several great interfaces.
This time I would like to show you how to sanitize your database using a very simple Ruby script. This example has a very specific goal. You will find this useful if you are using Postgres as your database engine and Rails as your backend platform. Rails is widely used with Postgres as a database engine, so I think this example could come in handy for a great number of developers.
OlegDB is a database that meets the bottom line head-on. It operates under a startling new enterprise-ready paradigm which we call MAYO: Marginally available Yoke and Oil database.
Splunk, nominally used for system logs, shows signs of evolving into a data processing platform via a partnership with Tableau Software.
Piglet is a simple environment for processing and analyzing small data sets, inspired by Apache Pig. It supports a small number of Pig-inspired capabilities, including a simple type system (BYTEARRAY, CHARARRAY, LONG, DOUBLE, BOOLEAN).
In February 2014, the Apache Storm community released Storm version 0.9.1. Storm is a distributed, fault-tolerant, and high-performance real-time computation system that provides strong guarantees on the processing of data.
Solr is the poster boy of the search market. Generally it is run as an HTTP server that answers queries. I was more interested in running it in an embedded manner and getting results via JNI.
When we set out to look at natural language processing solutions, the first thing we realized was that NLTK has found a sweet spot. NLTK (or Natural Language ToolKit) is a Python solution for natural language processing. It is able to solve many (if not all) of the problems that heavier, more established solutions like the Stanford NLP tools can. However, unlike those tools, it is accessible to those of us who are not experts in NLP.
Google Compute Engine VMs provide a fast and reliable way to run Apache Hadoop. Today, we’re making it easier to run Hadoop on Google Cloud Platform with the Preview release of the Google Cloud Storage connector for Hadoop that lets you focus on your data processing logic instead of on managing a cluster and file system.
In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.
News & Opinion
Foremost in recent discussions has been the need to consolidate definitions of differing levels of privacy risk, from personally identifiable records through to truly anonymous information. One sticking point has been where information falls somewhere between these two extremes. The latest proposal includes an attempt to establish a third, intermediate classification, but this step is easier said than done.
With Hadoop turning into a one-size-fits-all repository for data, an array of search solutions specifically for Hadoop have come to the fore over the past year. One of those contenders, LucidWorks, has joined with Hortonworks, one of the major distributors of Hadoop, to offer the LucidWorks edition of Hadoop search engine Solr as a reference architecture for searches on the Hortonworks Data Platform, or HDP.
Part of the reason internet-connected devices are increasingly common is a decrease in costs: not only are sensors becoming cheaper to make, but it is also less expensive to store the data.
New Relic Insights allows developers to harvest real-time statistics about running apps and crunch the results in its cloud.
Anything that looks too good to be true usually is. Such might be the case with Apache Hadoop, the much-ballyhooed open-source project that everyone keeps talking about.
The possibilities are far-reaching, but realizing them requires a mindset shift. For old industries, it requires leadership with a clear enough vision to respond to the opportunities of data-centric applications and services.
Why You Should Never Trust Data Visualization
The Counter Fraud Management Software offering uses analytics to root out fraudulent claims.
Pete Warden is spot on about being skeptical of data, but it is data visualization, not data science, where caution is most crucial.