Ricky Ho 07/30/14
0 replies

Incorporate domain knowledge into predictive model

To create a predictive model, feature engineering (defining the set of inputs) is a key part, if not the most important one. In this post, I'd like to share my experience in how to come up with the initial set of features and how to evolve it as we learn more.

Mark Needham 07/28/14
0 replies

R: ggplot – Plotting back to back charts using facet_wrap

Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid. I wanted to find a better way.

Bruno Terkaly 07/28/14
0 replies

Fundamentals of Machine Learning

Let's face it - computing was created to analyze data and machine learning represents the state-of-the-art in making sense of data. For many years it has been out of reach for the common developer.

Steven Lott 07/25/14
0 replies

Building Probabilistic Graphical Models with Python

A colleague had some questions about the book named above, some of which were irrational. I'll try to tackle the rational questions, since they emphasize my point on ways not to ask questions about books.

Rob J Hyndman 07/24/14
0 replies

Plotting the characteristic roots for ARIMA models

When modelling data with ARIMA models, it is sometimes useful to plot the inverse characteristic roots. The following functions will compute and plot the inverse roots for any fitted ARIMA model (including seasonal models).

Ayende Rahien 07/23/14
0 replies

Avoid where in a reduce clause

We got a customer question about a map/reduce index that produced the wrong results. The problem was a mismatch between the conceptual model and the actual model of how Map/Reduce works.

Rob J Hyndman 07/22/14
0 replies

I am not an econometrician

Econometrics is often “theory driven” while statistics tends to be “data driven”. I discovered this in the interview for my current job when someone criticized my research for being “data driven” and asked me to respond.

Mike Driscoll 07/21/14
0 replies

An Intro to Peewee – Another Python ORM

I thought it would be fun to try out a few different Python object relational mappers (ORMs) besides SQLAlchemy. I recently stumbled across a project known as peewee. For this article, we will take the examples from my SQLAlchemy tutorial and port them to peewee to see how it stands up.

Mark Needham 07/21/14
0 replies

R: ggplot: Problem automatically picking scale for difftime object

I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London. I tried to use ggplot to create a bar chart of the data. Unfortunately that resulted in this error:

Mark Needham 07/17/14
0 replies

R: Apply a Custom Function Across Multiple Lists

In my continued playing around with R, I wanted to map a custom function over two lists, comparing each item with its corresponding item in the other list.

Rob J Hyndman 07/17/14
0 replies

Variations on Rolling Forecasts

Rolling forecasts are commonly used to compare time series models. Here are a few of the ways they can be computed using R. I will use ARIMA models as a vehicle of illustration, but the code can easily be adapted to other univariate time series models.

Jason Baldridge 07/16/14
0 replies

Emotional Contagion: Contextualizing the Controversy

The past two weeks have seen a great deal of discussion around the recent computational social science study of Kramer, Guillory and Hancock (2014), “Experimental evidence of massive-scale emotional contagion through social networks”.

Doug Turnbull 07/16/14
0 replies

Improving The Camel Solr Component

We’ve been using Apache Camel a fair amount recently as our ingestion pipeline of choice. It presents a fairly nice DSL for wiring together different data sources, performing transformations, and finally sending data to Solr.

Allen Coin 07/15/14
0 replies

Data Visualization: A Day in the Life of an NYC Taxi

This visualization displays the data for one random NYC yellow taxi on a single day in 2013. See where it operated, how much money it made, and how busy it was over 24 hours.

Doug Turnbull 07/15/14
0 replies

Reindexing Collections with Solr’s Cursor Support

When a Solr schema changes, we Solr devs know what’s next: a large reindex of all of our data to capture any changes to index-time analysis. When we deliver solutions to our customers, we frequently need to build this in as a feature.

Yonik Seeley 07/15/14
0 replies

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter

In this post, I show how this filter, combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr: how to deal with multi-term synonyms.

Arthur Charpentier 07/15/14
0 replies

Bayesian Wizardry for Muggles

Monday, I will be giving the closing talk of the R in Insurance Conference, in London, on Bayesian Computations for Actuaries, or, to be more specific, Getting into Bayesian Wizardry… (with the eyes of a muggle actuary).

Whitney Baker 07/13/14
0 replies

The Best of the Week (July 4): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (July 4 to July 11). This week's topics include designing data architecture, Python in universities and other links, R tricks and Big Data white papers.

Brian O'Neill 07/12/14
0 replies

Applied Big Data: The Freakonomics of Healthcare

Our healthcare system is still (mostly) based on capitalism: more patients + more visits = more money. Within such a system, it is not in the best interest of healthcare providers to have healthy patients.

Angela Ashenden 07/11/14
0 replies

Overwhelmed by the Volume of Big Data Information Out There? Step This Way...

If you’re trying to navigate your way through the Big Data landscape and can’t see the wood for the trees, allow us to show you the way by cutting right to the chase with this new series of concise and right-to-the-point reports.

Eyal Golan 07/11/14
0 replies

Parse Elasticsearch Results Using Ruby

One of the modules in our project is an Elasticsearch cluster. In order to fine-tune the configuration (shards, replicas, mapping, etc.) and the queries, we created a JMeter environment.

Arthur Charpentier 07/10/14
0 replies

Python's Popularity in Universities and other Big Data Links from Somewhere Else

Some data writings worth reading from Freakonometrics.

Barton George 07/10/14
0 replies

Sumo Logic and Machine Data Intelligence — DevOps Days Austin

Today’s interview from DevOps Days Austin features Sumo Logic’s co-founder and CTO, Christian Beedgen. If you’re not familiar with Sumo Logic, it’s a log management and analytics service. I caught up with Christian right after he got off stage on day one.

Mark Needham 07/10/14
0 replies

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values and ran into a strange (to me) error message.

John Piekos 07/09/14
0 replies

Designing a Data Architecture to Support both Fast and Big Data

In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.