Mark Needham, 07/08/14
0 replies

R: Aggregate by different functions and join results into one data frame

In continuing my analysis of the London Neo4j meetup group using R, I wanted to see which days of the week we organise meetups and how many people RSVP affirmatively on each day.

Mark Needham, 07/07/14
0 replies

R: Order by Data Frame Column and Take Top 10 Rows

I’ve been doing some ad-hoc analysis of the Neo4j London meetup group using R and Neo4j, and having worked out how to group by certain keys, the next step was to order the rows of the data frame.

Whitney Baker, 07/06/14
0 replies

The Best of the Week (June 27): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 27 to July 4). This week's topics include lying with data, Hadoop shell commands to manage HDFS, payloads in Solr, the origins of the term "big data" and interactive data visualization.

Daniel Bartl, 07/04/14
0 replies

Interview with MapR’s M.C. Srivas about Apache Drill

Recently we had M.C. Srivas, CTO and Co-Founder of MapR Technologies, as a speaker at our Munich Hadoop User Group. He gave a nice talk about the Apache Drill project, which is developing a tool that provides fast interactive SQL on Hadoop and other data sources.

Rambo Zhou, 07/04/14
0 replies

Refactor of Hadoop

In Hadoop 1.x, there are some problems, for example, HA and too many small files. In Hadoop 2.x with YARN, there is no HA problem. But only one of the masters in YARN can be active and serve clients, while the other masters are on standby. I think this is a waste of masters.

Mikio Braun, 07/03/14
0 replies

What is Scalable Machine Learning?

Scalability has become one of those core concepts slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast.

Tomer Levi, 07/03/14
0 replies

Using Solr 4.9 new ChildDocTransformerFactory

Lucene & Solr 4.9 were released a couple of weeks ago and introduced a new result document transformer called ChildDocTransformerFactory.

Jim King, 07/03/14
0 replies

How to Implement Complex Business Rules with a Powerful Computation Tool

In an ever-changing business world, people nowadays often find themselves stuck on various computing problems involving complex business rules.

Paul Miller, 07/02/14
0 replies

How to Lie with Data

There are plenty of books telling data scientists (whatever they are) and others how to visualise data, how to tell stories, and how to persuade.

Erick Erickson, 07/02/14
0 replies

Payloads Are Neat, but Where’s a Complete Example for Solr?

I’ve been a bit frustrated whenever I discuss payloads in Solr by the lack of an example I could find that gave me all the pieces in a single place. So I decided to create one for Solr 4.0+ (actually, 4.8.1 at the time of this writing, but this should apply to the whole 4.x code line).

Whitney Baker, 07/02/14
0 replies

The Origins of the Term "Big Data"

It may have come about innocuously enough, in an article written in 1989 about organizations mining data for marketing purposes.

Ricky Ho, 07/01/14
1 reply

Interactive Data Visualization

Recently, the "interactive report" has become a hot topic in data visualization. I believe it is becoming the next-generation UI paradigm for KPI reports.

Trevor Parsons, 07/01/14
0 replies

Google Cloud DataFlow – A Game Changer?

So the biggest revolution in database and analytics technology, namely the distributed batch processing technique known as MapReduce (and the associated Hadoop-centric ecosystem that has built up around it), is a legacy technology for one Silicon Valley player.

Saurabh Chhajed, 07/01/14
0 replies

Hadoop: Getting Started with Pig

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data analysts to write complex data transformations without knowing Java.

Saurabh Chhajed, 06/30/14
0 replies

Top 10 Hadoop Shell Commands to manage HDFS

The top 10 basic HDFS operations, run through Hadoop shell commands, that are useful for managing files on HDFS clusters.

Whitney Baker, 06/30/14
0 replies

Teradata Enhances Open Source R Analytics

Teradata announced Teradata Aster R this week, which seeks to lift memory and processing limitations in open source R analytics.

Whitney Baker, 06/29/14
0 replies

The Best of the Week (June 20): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 13 to June 20). This week's topics include Python, search relevancy, patent data and the nature of invention, scalable machine learning and anomaly detection.

Rafał Kuć, 06/27/14
0 replies

Lucene and Solr 4.9

This week the Apache Lucene and Solr PMC announced a new version of the Apache Lucene library and the Apache Solr search server, numbered 4.9. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.

Jakub Holý, 06/27/14
0 replies

Review: Clojure for Machine Learning (Ch 1-3)

Packt Publishing has asked me to review their new book, Clojure for Machine Learning (4/2014) by Akhil Wali. Interested in both Clojure and M.L., I have taken up the challenge and want to share my impressions of the first chapters.

Whitney Baker, 06/26/14
1 reply

Google I/O: Dumping MapReduce

At its developer conference on Wednesday, Google followed a burgeoning trend in dumping MapReduce in favor of what they're calling Google Cloud Dataflow.

Whitney Baker, 06/26/14
0 replies

The Parable of Google Flu

In a white paper made publicly available by Harvard University, researchers broach the topic of Google Flu Trends – commonly hailed as an innovative and thorough application of Big Data – and some of its shortcomings.

Whitney Baker, 06/25/14
0 replies

Chicago Moves Forward with Big Data Sensors

A series of sensors will be installed on light poles in a downtown Chicago district later this summer. According to the Chicago Tribune, the data-collection sensors will measure air quality, light intensity, sound volume, heat, precipitation, and wind, in addition to detecting the wireless devices of passing crowds.

Arthur Charpentier, 06/25/14
0 replies

Tukey and Mosteller’s Bulging Rule (and Ladder of Powers)

I read something about a “Ladder of Powers Rule”, also called “Tukey and Mosteller’s Bulging Rule”. To be honest, I had never heard about this rule before. But that won’t be the first time I learn something while working on my notes for a course!

Jim King, 06/25/14
0 replies

How to Boost Computing Capability of Java over Massive Data

Programmers usually resort to a database to perform computations over massive data in Java. However, a database is unavailable or inconvenient in some application scenarios. In such cases, the native capability of Java is the key to achieving the goal.

Arthur Charpentier, 06/24/14
0 replies

Data News: How To Be Bayesian in Python

In this edition of "Somewhere Else" from Freakonometrics: how to be Bayesian in Python, how the New York Times is using machine learning to analyze its readership, an experiment on paid search effectiveness, and more.