Rob J Hyndman, 04/10/14
0 replies

Interpreting Noise

What is going on here is that the commentators are assuming we live in a noise-free world. However, the world is noisy: real data are subject to random fluctuations, and are often also measured inaccurately. So to interpret every little fluctuation is silly and misleading.

Ayende Rahien, 04/10/14
0 replies

Sorting with Lucene

How do you do sorting on a field value? The answer is, not easily.

Matthew Dubins, 04/09/14
0 replies

Ontario First Nations Libraries Compared Using Ontario Open Data

The author of this article uses data concerning First Nations libraries in Ontario to demonstrate variations of data visualization in R.

Mehdi Daoudi, 04/08/14
0 replies

Simpson’s Paradox: DevOps’ Big Data Problem

Simpson’s Paradox is a phenomenon in which a trend identified from a population is reversed when investigated at the sub-population levels. Think about that again – conclusions drawn from an overall set of data are not indicative of the behavior of the underlying subsets.

Ayende Rahien, 04/08/14
0 replies

Peeking into Lucene indexing

Continuing his trip into the Lucene codebase, the author is now looking into the indexing process as it happens. Interestingly enough, that is something that we never really had to look at before.

Joe Stein, 04/07/14
0 replies

Beyond MapReduce and Apache Hadoop 2.X

In this recap of a podcast with Bikas Saha and Arun Murthy, the author got to hear about some of what is in Hadoop 2.4 and what is coming in 2.5.

Istvan Szegedi, 04/07/14
0 replies

Apache Spark - a Fast Big Data Analytics Engine

Apache Spark is an increasingly popular alternative to MapReduce, offering a more performant execution engine while still using Hadoop HDFS as the storage engine for large data sets.

Paul Miller, 04/07/14
0 replies

Microsoft Corporate Vice President Discusses Data, Data Platforms, and More

The Data Platform Group at Microsoft does a lot, from SQL Server and their Hadoopey HDInsight offering through to Business Intelligence and analytics capabilities which sit in or on top of the humble Excel spreadsheet.

Arthur Charpentier, 04/07/14
0 replies

Data News: "'Data' the Buzzword vs. Data the Actual Thing," and More

This installment of Arthur Charpentier's regular collection of data science-related links includes "'Big Data' the Buzzword vs. Data the Actual Thing," the influential interaction between data visualization and storytelling, what big data can say about relationships between world leaders, and more.

Sarah Ervin, 04/06/14
0 replies

The Best of the Week (Mar. 28): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 28 to Apr. 03). This week's best include the Apache Solr/Lucene 4.7.1 announcement, a discussion of how some tools can make things harder instead of easier, and an overview of the upcoming ApacheCon.

Dmitry Kan, 04/04/14
0 replies

Implementing own LuceneQParserPlugin for Solr

One convenience of this implementation is that we can deploy the above classes in a jar under the Solr core's lib directory. We do not need to overhaul the Solr source code and deal with deploying some "custom" Solr shards.

Chris Haddad, 04/04/14
0 replies

Big Data Blocking and Tackling

Are you practicing Big data blocking and tackling actions? While Hadoop makes it easier to warehouse data, effective analytics across disparate data sources still requires defining data semantics, data mapping, and master data sources. Don’t forget these important foundational building blocks.

Rafał Kuć, 04/03/14
0 replies

Apache Solr and Lucene 4.7.1

Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and the Apache Solr search server, numbered 4.7.1.

Arthur Charpentier, 04/03/14
0 replies

Data News: Google Data Flu, "Simplifying Data Analysis & Making Sense of Big Data," and More

This installment of Arthur Charpentier's regular collection of data science-related links includes problems with Google's data-based flu tracker, "Simplifying Data Analysis & Making Sense of Big Data," and more.

Arthur Charpentier, 04/03/14
0 replies

Linear ‘Prediction’ for AR Time Series

In this article, Arthur Charpentier details the mathematics behind linear prediction for AR time series.

Mark Hinkle, 04/02/14
0 replies

ApacheCon Approaches

ASF is the home for the majority of open source big data projects and ApacheCon is a must-attend event if you care about big data. Being able to converse with many members of various Apache project communities is invaluable.

Bill Bejeck, 04/02/14
0 replies

MapReduce Algorithms - Understanding Data Joins Part II

In this post the author demonstrates how to perform a map-side join when both data sets are large and can’t fit into memory. Map-side joins offer substantial gains in performance since we are avoiding the cost of sending data across the network.

Radu Gheorghe, 04/02/14
0 replies

Encrypting Logs on Their Way to Elasticsearch

If your Elasticsearch cluster is in a remote location you might need to forward your data over an encrypted channel.

Kosta Stojanovski, 04/02/14
0 replies

Eclipse's BIRT: Using Design Engine API

This article is dedicated to the manipulation of tables in Eclipse BIRT rptdesign.xml files via the Design Engine API.

Ayende Rahien, 04/01/14
0 replies

An Exploration Into Lucene Disk Format

The author wanted to know a lot more about exactly how Lucene is storing data on disk. They know the general stuff about segments and files, etc. But the author wanted to see the actual bits & bytes. So they started tracing into Lucene, trying to figure out what it is doing.

Michael Brenner, 04/01/14
0 replies

Big Data Is Driving Content Marketing Strategy

According to a new report by Gartner, one-third of companies will face an information crisis within the next 3 years.

Oliver Hookins, 03/31/14
0 replies

Tools That Make Your Life Harder

Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves setting up each field label and data type individually and then a regex to parse the fields out of the input files. If you have a deserializer UDF this is made much easier in either case.

Sarah Ervin, 03/30/14
0 replies

The Best of the Week (Mar. 21): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 21 to Mar. 27). This week's best include Java 8's impact on database access, how to use CustomScoreQuery with Solr/Lucene Scoring, and Apache Accumulo's ability to preserve security.

Arthur Charpentier, 03/28/14
0 replies

Data News: "74,476 Reasons You Should Always Get the Bigger Pizza", and More

This installment of Arthur Charpentier's regular collection of data science-related links includes "74,476 Reasons You Should Always Get the Bigger Pizza," the distribution of user-selected Twitter languages, climate forecasts, and a picture of the human journey created from global DNA data.

Arthur Charpentier, 03/27/14
0 replies

Seasonal or Periodic Time Series

This might explain why, in R, when we ask for an autoregressive process of order P, we get a model with P parameters to estimate, and even if some are not significant, we usually keep them for the forecast.