Chris Haddad 04/04/14
1744 views
0 replies

Big Data Blocking and Tackling

Are you practicing big data blocking and tackling? While Hadoop makes it easier to warehouse data, effective analytics across disparate data sources still requires defining data semantics, data mapping, and master data sources. Don’t forget these important foundational building blocks.

Rafał Kuć 04/03/14
5189 views
0 replies

Apache Solr and Lucene 4.7.1

Today the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server, numbered 4.7.1.

Arthur Charpentier 04/03/14
1314 views
0 replies

Data News: Google Data Flu, "Simplifying Data Analysis & Making Sense of Big Data," and More

This installment of Arthur Charpentier's regular collection of data science-related links includes problems with Google's data-based flu tracker, "Simplifying Data Analysis & Making Sense of Big Data," and more.

Arthur Charpentier 04/03/14
274 views
0 replies

Linear ‘Prediction’ for AR Time Series

In this article, Arthur Charpentier details the mathematics behind linear prediction for AR time series.

Mark Hinkle 04/02/14
1863 views
0 replies

ApacheCon Approaches

The ASF is home to the majority of open source big data projects, and ApacheCon is a must-attend event if you care about big data. Being able to converse with many members of various Apache project communities is invaluable.

Bill Bejeck 04/02/14
535 views
0 replies

MapReduce Algorithms - Understanding Data Joins Part II

In this post the author demonstrates how to perform a map-side join when both data sets are large and can’t fit into memory. Map-side joins offer substantial performance gains because they avoid the cost of sending data across the network.
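
To make the idea concrete, here is a minimal Python sketch of the merge step that this kind of join can rely on. It is not the author's Hadoop code: it assumes both inputs are already sorted by the join key, and the name `merge_join` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs, each sorted by key.

    Assumption: both inputs arrive sorted by key. Only the right-hand
    rows for the current key are buffered in memory.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lkey, lrows = next(left_groups, (None, None))
    rkey, rrows = next(right_groups, (None, None))
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # No matching key on the right yet: advance the left side.
            lkey, lrows = next(left_groups, (None, None))
        elif lkey > rkey:
            # No matching key on the left yet: advance the right side.
            rkey, rrows = next(right_groups, (None, None))
        else:
            # Keys match: emit every left/right combination for this key.
            right_values = [value for _, value in rrows]
            for _, left_value in lrows:
                for right_value in right_values:
                    yield lkey, left_value, right_value
            lkey, lrows = next(left_groups, (None, None))
            rkey, rrows = next(right_groups, (None, None))


# Hypothetical sample data, sorted by key.
users = [(1, "ann"), (2, "bob"), (4, "dee")]
orders = [(1, "book"), (1, "pen"), (3, "cup"), (4, "ink")]
joined = list(merge_join(users, orders))
```

Because each side is read once, in order, neither data set has to fit in memory as a whole.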

Radu Gheorghe 04/02/14
479 views
0 replies

Encrypting Logs on Their Way to Elasticsearch

If your Elasticsearch cluster is in a remote location, you might need to forward your data over an encrypted channel.

Kosta Stojanovski 04/02/14
480 views
0 replies

Eclipse's BIRT: Using Design Engine API

This article is dedicated to manipulating tables in Eclipse BIRT rptdesign.xml files via the Design Engine API.

Ayende Rahien 04/01/14
1426 views
0 replies

An Exploration Into Lucene Disk Format

The author wanted to know much more about exactly how Lucene stores data on disk. They knew the general picture of segments and files but wanted to see the actual bits and bytes, so they started tracing into Lucene to figure out what it is doing.

Michael Brenner 04/01/14
1186 views
0 replies

Big Data Is Driving Content Marketing Strategy

According to a new report by Gartner, one-third of companies will face an information crisis within the next 3 years.

Oliver Hookins 03/31/14
3927 views
0 replies

Tools That Make Your Life Harder

Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves setting up each field label and data type individually and then a regex to parse the fields out of the input files. If you have a deserializer UDF this is made much easier in either case.
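
The same three ingredients (field labels, data types, and a regex) can be illustrated outside Hive and Pig. The following is a rough Python sketch, not code from the article; the log layout, `LOG_PATTERN`, `FIELD_TYPES`, and `parse_line` are all assumptions made for illustration.

```python
import re

# Hypothetical access-log layout, chosen only for illustration.
# Named groups play the role of the field labels.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Per-field data types; fields not listed here stay as strings.
FIELD_TYPES = {
    "status": int,
    "size": lambda text: 0 if text == "-" else int(text),
}


def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    for name, cast in FIELD_TYPES.items():
        record[name] = cast(record[name])
    return record


sample = '127.0.0.1 - - [31/Mar/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
record = parse_line(sample)
```

Declaring each label and type separately from the pattern is where most of the setup lines go, whatever the tool.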

Sarah Ervin 03/30/14
2669 views
0 replies

The Best of the Week (Mar. 21): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 21 to Mar. 27). This week's best include Java 8's impact on database access, how to use CustomScoreQuery with Solr/Lucene Scoring, and Apache Accumulo's ability to preserve security.

Arthur Charpentier 03/28/14
2156 views
0 replies

Data News: "74,476 Reasons You Should Always Get the Bigger Pizza", and More

This installment of Arthur Charpentier's regular collection of data science-related links includes "74,476 Reasons You Should Always Get the Bigger Pizza," the distribution of user-selected Twitter languages, climate forecasts, and a picture of the human journey created from global DNA data.

Arthur Charpentier 03/27/14
1406 views
0 replies

Seasonal or Periodic Time Series

This might explain why, in R, when we ask for an autoregressive process of order P, we get a model with P parameters to estimate, and even if some are not significant, we usually keep them for the forecast.

Joe Stein 03/26/14
2164 views
0 replies

Big Data with Apache Accumulo Preserving Security with Open Source

Apache Accumulo is a system built for doing random I/O with petabytes of data. Distributing the computation to the data with cell-level security is where Accumulo really shines.

Rob J Hyndman 03/26/14
427 views
0 replies

Fitting Models to Short Time Series

I often get asked how few data points can be used to fit a time series model. As with almost all sample size questions, there is no easy answer.

Madhuka Udantha 03/26/14
601 views
0 replies

Apache ZooKeeper Intro and Sample

ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a subproject of Hadoop.

Lukas Eder 03/25/14
30935 views
12 replies

Java 8 Will Revolutionize Database Access

Java 8 is finally here! After years of waiting, Java programmers will finally get support for functional programming in Java.

Daniel Bartl 03/25/14
1231 views
0 replies

Does Datameer Support a Full Big Data Analysis Process?

DAS is a platform for Hadoop which includes data source integration, an analytics engine and visualization functionality. This promise of a fully integrated Big Data analysis process motivated me to test the product.

Doug Turnbull 03/25/14
2654 views
0 replies

Using CustomScoreQuery For Custom Solr/Lucene Scoring

Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.

Itamar Syn-hershko 03/24/14
521 views
0 replies

Multi-lingual search with Lucene and Elasticsearch

Last night I gave a talk at SkillsMatter London on multi-lingual search with Lucene and Elasticsearch. The talk covered various challenges with indexing texts in various languages: tokenization, term normalization and stemming. I started with...

Sarah Ervin 03/23/14
2636 views
0 replies

The Best of the Week (Mar. 14): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include world domination by algorithms, the sometimes scary potential of the internet of things, and tips on how to get started with Avro.

Rafał Kuć 03/21/14
4159 views
0 replies

Solr 4.7 – Efficient Deep Paging

The deeper you want to go in the results, the slower the query will be. This is because Solr needs to prepare the data from the beginning for each query. Until Solr 4.7 there wasn’t a good solution for that problem.
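
The cost difference can be shown with a small Python simulation. This is not Solr code; it only contrasts offset paging with a cursor-style alternative, assuming the client passes back the sort values of the last document it saw and that the sort ends in a unique tie-breaker. The names `sort_key`, `page_by_offset`, and `page_by_cursor` are hypothetical.

```python
import heapq


def sort_key(doc):
    # Highest score first; the unique id breaks ties so the order is total.
    return (-doc["score"], doc["id"])


def page_by_offset(docs, start, rows):
    """Offset paging: rank start + rows documents, then discard `start` of them."""
    top = heapq.nsmallest(start + rows, docs, key=sort_key)
    return top[start:]


def page_by_cursor(docs, cursor, rows):
    """Cursor paging: only documents sorting after the cursor compete,
    so at most `rows` documents are ranked regardless of depth."""
    candidates = (d for d in docs if cursor is None or sort_key(d) > cursor)
    page = heapq.nsmallest(rows, candidates, key=sort_key)
    next_cursor = sort_key(page[-1]) if page else cursor
    return page, next_cursor


# Hypothetical index contents.
docs = [{"id": i, "score": i % 5} for i in range(20)]

# Walk three pages with a cursor and compare with the equivalent offset query.
cursor = None
pages = []
for _ in range(3):
    page, cursor = page_by_cursor(docs, cursor, rows=5)
    pages.append(page)
third_page_by_offset = page_by_offset(docs, start=10, rows=5)
```

With offsets, the work per query grows with `start + rows`; with a cursor, it stays bounded by `rows`, at the price of only being able to move forward from the last page seen.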

Ravi Kalakota 03/21/14
448 views
0 replies

Big Data Performance Anxiety and Data Grids

My advice is, be careful about “Big Data” hype. We’re hearing about implementations that simply take too long, 9 to 12 months, to produce value. The “Data Scientist” concept is often divorced from the domain expertise needed to produce real business insight.

Rob J Hyndman 03/20/14
391 views
0 replies

The Forecast Mean After Back-Transformation

Many functions in the forecast package for R will allow a Box-Cox transformation. The models are fitted to the transformed data and the forecasts and prediction intervals are back-transformed. Occasionally the mean forecast is required.
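
Simply back-transforming a point forecast gives the median on the original scale rather than the mean, provided the forecast distribution on the transformed scale is symmetric. A common approximation for the mean uses a second-order Taylor expansion of the inverse transformation. The Python sketch below illustrates that approximation; it is not the forecast package's code, and the names `w` (forecast on the transformed scale), `sigma2` (forecast variance on the transformed scale), and `lam` (Box-Cox parameter) are assumptions.

```python
import math


def inv_box_cox(w, lam):
    """Invert the Box-Cox transformation w = (y**lam - 1) / lam (log when lam == 0)."""
    if lam == 0:
        return math.exp(w)
    return (lam * w + 1) ** (1 / lam)


def back_transformed_mean(w, sigma2, lam):
    """Approximate mean on the original scale via a second-order Taylor expansion.

    The plain back-transform inv_box_cox(w, lam) is scaled by a
    correction factor that depends on the forecast variance sigma2.
    """
    median = inv_box_cox(w, lam)
    if lam == 0:
        return median * (1 + sigma2 / 2)
    return median * (1 + sigma2 * (1 - lam) / (2 * (lam * w + 1) ** 2))


# Hypothetical forecast on the log scale (lam = 0).
median_forecast = inv_box_cox(0.0, 0)
mean_forecast = back_transformed_mean(0.0, 0.2, 0)
```

For `lam` below 1 the correction factor exceeds 1, so the approximate mean sits above the back-transformed point forecast; at `lam = 1` the transformation is linear and the two coincide.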