Mark Needham, 07/10/14
2149 views
0 replies

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message.

John Piekos, 07/09/14
7255 views
0 replies

Designing a Data Architecture to Support both Fast and Big Data

In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.

Nati Shalom, 07/09/14
1580 views
0 replies

How to do real time complex query on Big Data

In the past few years, almost every function of our lives has become dependent on real-time applications. Whether it is updating our friends on every move we make via social media or shopping on e-commerce websites, we have become completely dependent on getting the correct information quickly.

Mark Needham, 07/08/14
963 views
0 replies

Data Science: Mo' Data, Mo' Problems

Over the last couple of years I’ve worked on several proof-of-concept style Neo4j projects, and on a lot of them people have wanted to work with their entire data set, which I don’t think makes sense so early on.

Whitney Baker, 07/08/14
762 views
0 replies

Leaving Data on the Table: Obstacles to Big Data Analytics

There is frequent conversation about the explosive growth of Big Data in the age of wearables, compulsive social media and ever more capable computers, but when it comes down to gleaning useful insights from data, data scientists face more challenges with variety than with sheer volume.

Mark Needham, 07/08/14
474 views
0 replies

R: Aggregate by different functions and join results into one data frame

In continuing my analysis of the London Neo4j meetup group using R I wanted to see which days of the week we organise meetups and how many people RSVP affirmatively by the day.

Mark Needham, 07/07/14
2182 views
0 replies

R: Order by Data Frame Column and Take Top 10 Rows

I’ve been doing some ad-hoc analysis of the Neo4j London meetup group using R and Neo4j and, having worked out how to group by certain keys, the next step was to order the rows of the data frame.

Whitney Baker, 07/06/14
4027 views
0 replies

The Best of the Week (June 27): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 27 to July 4). This week's topics include lying with data, Hadoop shell commands to manage HDFS, payloads in Solr, the origins of the term "big data" and interactive data visualization.

Daniel Bartl, 07/04/14
2214 views
0 replies

Interview with MapR’s M.C. Srivas about Apache Drill

Recently we had M.C. Srivas, CTO and Co-Founder of MapR Technologies, as a speaker at our Munich Hadoop User Group. He gave a nice talk about the Apache Drill project, which develops a tool providing fast interactive SQL on Hadoop and other data sources.

Rambo Zhou, 07/04/14
1175 views
0 replies

Refactor of Hadoop

In Hadoop 1.x there are some problems, for example HA and too many small files. In Hadoop 2.x with YARN, there is no HA problem, but only one of the masters in YARN can be active and serve clients while the other masters are on standby. I think this is a waste of masters.

Mikio Braun, 07/03/14
2668 views
0 replies

What is Scalable Machine Learning?

Scalability has become one of those core concept slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast.

Tomer Levi, 07/03/14
1002 views
0 replies

Using Solr 4.9 new ChildDocTransformerFactory

Lucene & Solr 4.9 were released a couple weeks ago and introduced a new result document transformer called ChildDocTransformerFactory.

Jim King, 07/03/14
699 views
0 replies

How to Implement Complex Business Rules with a Powerful Computation Tool

In an ever-changing business world, people nowadays often find themselves stuck on computing problems that involve complex business rules.

Paul Miller, 07/02/14
2477 views
0 replies

How to Lie with Data

There are plenty of books telling data scientists (whatever they are) and others how to visualise data, how to tell stories, and how to persuade.

Erick Erickson, 07/02/14
2664 views
0 replies

Payloads Are Neat, but Where’s a Complete Example for Solr?

I’ve been a bit frustrated whenever I discuss payloads in Solr by the lack of an example that gives me all the pieces in a single place. So I decided to create one for Solr 4.0+ (actually, 4.8.1 at the time of this writing, but this should apply to the whole 4.x code line).

Whitney Baker, 07/02/14
3265 views
0 replies

The Origins of the Term "Big Data"

It may have come about innocuously enough, in an article written in 1989 about organizations mining data for marketing purposes.

Ricky Ho, 07/01/14
4200 views
1 reply

Interactive Data Visualization

Recently, the "interactive report" has become a hot topic in data visualization. I believe it is becoming the next-generation UI paradigm for KPI reports.

Trevor Parsons, 07/01/14
1834 views
0 replies

Google Cloud DataFlow – A Game Changer?

So the biggest revolution in database and analytics technology, namely the distributed batch processing technique known as MapReduce (and the associated Hadoop-centric ecosystem that has built up around it), is a legacy technology for one Silicon Valley player.

Saurabh Chhajed, 07/01/14
1907 views
0 replies

Hadoop: Getting Started with Pig

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data analysts to write complex data transformations without knowing Java.

Saurabh Chhajed, 06/30/14
2634 views
0 replies

Top 10 Hadoop Shell Commands to manage HDFS

Top 10 basic Hadoop HDFS operations managed through shell commands which are useful to manage files on HDFS clusters.

Whitney Baker, 06/30/14
1713 views
0 replies

Teradata Enhances Open Source R Analytics

Teradata announced Teradata Aster R this week, which seeks to lift memory and processing limitations in open source R analytics.

Whitney Baker, 06/29/14
2902 views
0 replies

The Best of the Week (June 20): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 13 to June 20). This week's topics include Python, search relevancy, patent data and the nature of invention, scalable machine learning and anomaly detection.

Rafał Kuć, 06/27/14
3009 views
0 replies

Lucene and Solr 4.9

This week the Apache Lucene and Solr PMC announced version 4.9 of the Apache Lucene library and the Apache Solr search server. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.

Jakub Holý, 06/27/14
4018 views
0 replies

Review: Clojure for Machine Learning (Ch 1-3)

Packt Publishing has asked me to review their new book, Clojure for Machine Learning (4/2014) by Akhil Wali. Interested in both Clojure and M.L., I have taken the challenge and want to share my impressions from the first chapters.

Whitney Baker, 06/26/14
8402 views
1 reply

Google I/O: Dumping MapReduce

At its developer conference on Wednesday, however, Google followed a burgeoning trend in dumping MapReduce in favor of what they're calling Google Cloud Dataflow.