Rob J Hyndman 04/10/14
1190 views
0 replies

Interpreting Noise

What is going on here is that the commentators are assuming we live in a noise-free world. However, the world is noisy: real data are subject to random fluctuations, and are often also measured inaccurately. So to interpret every little fluctuation is silly and misleading.

Ayende Rahien 04/10/14
2470 views
0 replies

Sorting with Lucene

How do you do sorting on a field value? The answer is, not easily.

Matthew Dubins 04/09/14
388 views
0 replies

Ontario First Nations Libraries Compared Using Ontario Open Data

The author of this article uses data concerning First Nations libraries in Ontario to demonstrate variations of data visualization in R.

Mehdi Daoudi 04/08/14
1406 views
0 replies

Simpson’s Paradox: DevOps’ Big Data Problem

Simpson’s Paradox is a phenomenon in which a trend identified from a population is reversed when investigated at the sub-population levels. Think about that again – conclusions drawn from an overall set of data are not indicative of the behavior of the underlying subsets.

Ayende Rahien 04/08/14
1839 views
0 replies

Peeking into Lucene indexing

Continuing his trip into the Lucene codebase, the author is now looking into the indexing process as it happens. Interestingly enough, that is something that we never really had to look at before.

Joe Stein 04/07/14
1793 views
0 replies

Beyond MapReduce and Apache Hadoop 2.X

In this recap of a podcast with Bikas Saha and Arun Murthy, the author got to hear about some of what is in Hadoop 2.4 and what is coming in 2.5.

Istvan Szegedi 04/07/14
2449 views
0 replies

Apache Spark - a Fast Big Data Analytics Engine

Apache Spark is an increasingly popular alternative to MapReduce that offers a more performant execution engine while still using Hadoop HDFS as the storage engine for large data sets.

Paul Miller 04/07/14
285 views
0 replies

Microsoft Corporate Vice President Discusses Data, Data Platforms, and More

The Data Platform Group at Microsoft does a lot, from SQL Server and their Hadoopey HDInsight offering through to Business Intelligence and analytics capabilities which sit in or on top of the humble Excel spreadsheet.

Arthur Charpentier 04/07/14
258 views
0 replies

Data News: "'Data' the Buzzword vs. Data the Actual Thing," and More

This installment of Arthur Charpentier's regular collection of data science-related links includes "'Big Data' the Buzzword vs. Data the Actual Thing," the influential interaction between data visualization and storytelling, what big data can say about relationships between world leaders, and more.

Sarah Ervin 04/06/14
4113 views
0 replies

The Best of the Week (Mar. 28): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 28 to Apr. 03). This week's best include the Apache Solr/Lucene 4.7.1 announcement, a discussion of how some tools can make things harder instead of easier, and an overview of the upcoming ApacheCon.

Dmitry Kan 04/04/14
1540 views
0 replies

Implementing own LuceneQParserPlugin for Solr

One convenience of this implementation is that we can deploy the above classes in a jar under the Solr core's lib directory. We do not need to overhaul the Solr source code or deal with deploying "custom" Solr shards.

Chris Haddad 04/04/14
1786 views
0 replies

Big Data Blocking and Tackling

Are you practicing big data blocking and tackling? While Hadoop makes it easier to warehouse data, effective analytics across disparate data sources still requires defining data semantics, data mapping, and master data sources. Don’t forget these important foundational building blocks.

Rafał Kuć 04/03/14
5348 views
0 replies

Apache Solr and Lucene 4.7.1

Today the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server, numbered 4.7.1.

Arthur Charpentier 04/03/14
1364 views
0 replies

Data News: Google Data Flu, "Simplifying Data Analysis & Making Sense of Big Data," and More

This installment of Arthur Charpentier's regular collection of data science-related links includes problems with Google's data-based flu tracker, "Simplifying Data Analysis & Making Sense of Big Data," and more.

Arthur Charpentier 04/03/14
294 views
0 replies

Linear ‘Prediction’ for AR Time Series

In this article, Arthur Charpentier details the mathematics behind linear prediction for AR time series.

Mark Hinkle 04/02/14
1897 views
0 replies

ApacheCon Approaches

ASF is the home for the majority of open source big data projects and ApacheCon is a must-attend event if you care about big data. Being able to converse with many members of various Apache project communities is invaluable.

Bill Bejeck 04/02/14
565 views
0 replies

MapReduce Algorithms - Understanding Data Joins Part II

In this post the author demonstrates how to perform a map-side join when both data sets are large and can’t fit into memory. Map-side joins offer substantial gains in performance since we are avoiding the cost of sending data across the network.

Radu Gheorghe 04/02/14
503 views
0 replies

Encrypting Logs on Their Way to Elasticsearch

If your Elasticsearch cluster is in a remote location you might need to forward your data over an encrypted channel.

Kosta Stojanovski 04/02/14
549 views
0 replies

Eclipse's BIRT: Using Design Engine API

This article is dedicated to the manipulation of tables in Eclipse BIRT rptdesign.xml files via the Design Engine API.

Ayende Rahien 04/01/14
1467 views
0 replies

An Exploration Into Lucene Disk Format

The author wanted to know a lot more about exactly how Lucene stores data on disk. They knew the general stuff about segments, files, and so on, but wanted to see the actual bits and bytes. So they started tracing into Lucene, trying to figure out what it is doing.

Michael Brenner 04/01/14
1229 views
0 replies

Big Data Is Driving Content Marketing Strategy

According to a new report by Gartner, one-third of companies will face an information crisis within the next 3 years.

Oliver Hookins 03/31/14
4015 views
0 replies

Tools That Make Your Life Harder

Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves setting up each field label and data type individually and then a regex to parse the fields out of the input files. If you have a deserializer UDF, this is made much easier in either case.

Sarah Ervin 03/30/14
2725 views
0 replies

The Best of the Week (Mar. 21): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 21 to Mar. 27). This week's best include Java 8's impact on database access, how to use CustomScoreQuery with Solr/Lucene Scoring, and Apache Accumulo's ability to preserve security.

Arthur Charpentier 03/28/14
2180 views
0 replies

Data News: "74,476 Reasons You Should Always Get the Bigger Pizza", and More

This installment of Arthur Charpentier's regular collection of data science-related links includes "74,476 Reasons You Should Always Get the Bigger Pizza," the distribution of user-selected Twitter languages, climate forecasts, and a picture of the human journey created from global DNA data.

Arthur Charpentier 03/27/14
1448 views
0 replies

Seasonal or Periodic Time Series

This might explain why, in R, when we ask for an autoregressive process of order P, we get a model with P parameters to estimate, and even if some are not significant, we usually keep them for the forecast.