Ayende Rahien, 04/24/14
What Lucene Does, a Look Under the Hood

There is a lot going on behind the scenes.

Peter Zaitsev, 04/23/14
Using Apache Hadoop and Impala Together with MySQL for Data Analysis

Hadoop + Impala gives us an easy way to analyze large datasets using SQL, with the ability to scale even on old hardware.

Jim King, 04/23/14
Compute Data Outside Database to Alleviate Warehouse Expansion Pressure

Computation outside of a database is an alternative to expanding storage capacity.

Rob J Hyndman, 04/23/14
7 Forecasting Blogs

There are several other blogs on forecasting that readers might be interested in. Here are seven worth following.

Anoop Madhusudanan, 04/22/14
Hadoop On Azure / HDInsight – Quick Intro Video On Writing Map Reduce Jobs In C#

Here is a quick intro screencast on Big Data and creating MapReduce jobs in C# to distribute the processing of large volumes of data, leveraging Microsoft Azure HDInsight / Hadoop On Azure.

Arthur Charpentier, 04/22/14
Data News: Why the Boom in Big Data Journalism Makes Sense & More

This installment of Arthur Charpentier's regular collection of data science-related links includes 9 problems with big data, the "Wonk Bubble" and big data journalism, advice from Cathy O'Neil about putting your trust in data analysis, and how big data is the next frontier for innovation, competition, and productivity.

Madhuka Udantha, 04/22/14
Intro to Machine Learning

Machine learning is a "field of study that gives computers the ability to learn without being explicitly programmed."

Amar Mattey, 04/21/14
Writing effective custom queries in Hibernate

There are many instances where we will have to write custom queries with Hibernate.

Sarah Ervin, 04/20/14
The Best of the Week (Apr. 11): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Apr. 11 to Apr. 17). This week's best include a guide to real-time big data, an evaluation of big data platforms, and the dark sides of Lucene.

Angela Ashenden, 04/18/14
Teradata Looks to Build Bridges and Cross the Big Data Divide

Teradata unveiled a number of enhancements to its core data management offerings. One announcement stood out: the launch of QueryGrid, a tool designed to orchestrate the execution of analytic processing across parallel databases.

Ayende Rahien, 04/17/14
The Dark Sides of Lucene

The author has been using Lucene for the past six or seven years, and after his last post, he thought it would be a good idea to talk a bit about the kind of things that it isn't doing well.

Rob J Hyndman, 04/17/14
Generating Tables in LaTeX

Typing tables in LaTeX can get messy, but there are some good tools to simplify the process.

Arthur Charpentier, 04/16/14
Seasonal Unit Roots

So, whatever the test, we always reject the assumption that there is a seasonal unit root. That does not mean that we cannot have a strong cycle! Actually, the series is almost periodic. But there is no unit root!

John Cook, 04/16/14
The Mean of the Mean is the Mean

The hypothesis of this theorem is that the underlying distribution has a mean. Let's see where things break down if the distribution does not have a mean.

Hubert Klein Ikkink, 04/15/14
Coloring Different Data Sources in IntelliJ IDEA

The database plugin in IntelliJ IDEA is a useful tool for working with data in databases. As long as we have a JDBC driver to connect to the database, we can configure a data source.

Oliver Hookins, 04/15/14
(Something, Something) Big Data!

SQL as an interface to big data operations is desirable for the same reasons the author found it useful, but it also introduces some performance implications that are not suited to traditional MapReduce-style jobs, which tend to have completion times in the tens of minutes to hours rather than seconds.

Paul Miller, 04/15/14
Infochimps CEO Jim Kaskade Talks About Acquisition and the Big Data Opportunity

Infochimps has moved in a different direction, focusing far more attention upon the tools and services required to work with data, less upon offering a place for customers to find data. We touch upon Hadoop’s role within the growing big data ecosystem, asking if it’s as important as its backers tend to claim.

Bill Jones, 04/15/14
Social Media Mining With R

Get down with R and start visualizing your data in a whole new way!

Mikio Braun, 04/14/14
Mikio's Guide To Real-Time Big Data

In most of these applications, you have to deal with event data that arrives “in real time”. Data is constantly changing, and you usually want to consider the data over a certain time frame (“page views in the last hour”) instead of just taking all of the past data into account.

Sarah Ervin, 04/13/14
The Best of the Week (Apr. 04): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Apr. 04 to Apr. 10). This week's best include a discussion of interoperability in the Internet of Things (IoT) discipline, a look at Apache Spark, and an adventure into Lucene's indexing process.

Sarah Ervin, 04/12/14
Big Data Zone Link Roundup (Apr. 12)

For a look at what's been happening outside of the Big Data Zone, we've assembled a collection of links including the 30 best tools for data visualization, different perspectives on Hadoop and related tools, New Relic's Splunk-style Analytics, and the role of big data in the rise of the Internet of Things (IoT).

Nati Shalom, 04/11/14
Notes from Big Data Business Challenges Panel Discussion

In this post the author summarizes their notes from a conference that included the following topics: Is Big Data a Big Hype? How do you make sense out of your Big Data? Do we need a new role for Chief Data Officer? What is the business value behind Big Data? Is there a good visualization tool for Big Data?

Fredric Paul, 04/11/14
Building “The House of Data”

The solution was to “build the house of data” and for the time being, that means using Hadoop for what it calls internally, “hadumping.”

Rob J Hyndman, 04/10/14
Interpreting Noise

What is going on here is that the commentators are assuming we live in a noise-free world. However, the world is noisy: real data are subject to random fluctuations, and are often also measured inaccurately. So to interpret every little fluctuation is silly and misleading.

Ayende Rahien, 04/10/14
Sorting with Lucene

How do you do sorting on a field value? The answer is, not easily.