Jakub Holý 08/04/14
2731 views
0 replies

Most interesting links of July '14

A curated collection of the most interesting articles, links, and news in the programming world from last month, July of 2014.

Robert Diana 08/01/14
5282 views
0 replies

Geek Reading July 31, 2014

These items are a combination of tech business news, development news and programming tools and techniques.

Ricky Ho 07/30/14
5966 views
0 replies

Incorporate domain knowledge into predictive model

To create a predictive model, feature engineering (defining the set of inputs) is a key part, if not the most important one. In this post, I'd like to share my experience of how to come up with the initial set of features and how to evolve it as we learn more.

Mark Needham 07/28/14
2062 views
0 replies

R: ggplot – Plotting back to back charts using facet_wrap

Earlier in the week I showed a way to plot back to back charts using R’s ggplot library but looking back on the code it felt like it was a bit hacky to ‘glue’ two charts together using a grid. I wanted to find a better way.

Bruno Terkaly 07/28/14
1842 views
0 replies

Fundamentals of Machine Learning

Let's face it - computing was created to analyze data and machine learning represents the state-of-the-art in making sense of data. For many years it has been out of reach for the common developer.

Mark Needham 07/27/14
3366 views
0 replies

Java: Determining the Status of Data Import Using Kill Signals

A few weeks ago I was working on the initial import of ~ 60 million bits of data into Neo4j and we kept running into a problem where the import process just seemed to freeze and nothing else was imported.

Steven Lott 07/25/14
1418 views
0 replies

Building Probabilistic Graphical Models with Python

A colleague had some questions about the book named above, some of which were irrational. I'll try to tackle the rational questions, since they emphasize my point on ways not to ask questions about books.

Rob J Hyndman 07/24/14
3334 views
0 replies

Plotting the characteristic roots for ARIMA models

When modelling data with ARIMA models, it is sometimes useful to plot the inverse characteristic roots. The following functions will compute and plot the inverse roots for any fitted ARIMA model (including seasonal models).
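
As a rough illustration of what an inverse characteristic root is (this is not the R code from the post), the sketch below handles only the AR(2) case, in Python. The function names `ar2_inverse_roots` and `is_stationary` are hypothetical, as are the coefficient values used to try them out. The model is stationary when every inverse root lies inside the unit circle.

```python
import cmath


def ar2_inverse_roots(phi1, phi2):
    """Inverse characteristic roots of the AR(2) polynomial 1 - phi1*z - phi2*z**2.

    Substituting lam = 1/z shows they are the roots of
    lam**2 - phi1*lam - phi2 = 0, solved here with the quadratic formula.
    """
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return ((phi1 + disc) / 2, (phi1 - disc) / 2)


def is_stationary(phi1, phi2):
    """True when both inverse roots lie strictly inside the unit circle."""
    return all(abs(root) < 1 for root in ar2_inverse_roots(phi1, phi2))
```

For example, `is_stationary(0.5, 0.3)` is true, while `is_stationary(0.5, 0.6)` is false because one inverse root falls outside the unit circle.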

Ayende Rahien 07/23/14
5452 views
0 replies

Avoid where in a reduce clause

We got a customer question about a map/reduce index that produced the wrong results. The problem was a mismatch between the conceptual model and how Map/Reduce actually works.

Rob J Hyndman 07/22/14
4246 views
0 replies

I am not an econometrician

Econometrics is often “theory driven” while statistics tends to be “data driven”. I discovered this in the interview for my current job when someone criticized my research for being “data driven” and asked me to respond.

Mike Driscoll 07/21/14
4146 views
0 replies

An Intro to Peewee – Another Python ORM

I thought it would be fun to try out a few different Python object relational mappers (ORMs) besides SQLAlchemy. I recently stumbled across a project known as peewee. For this article, we will take the examples from my SQLAlchemy tutorial and port them to peewee to see how it stands up.

Mark Needham 07/21/14
2940 views
0 replies

R: ggplot: Problem automatically picking scale for difftime object

I thought it’d be interesting to create some visualisations around the times that people RSVP ‘yes’ to the various Neo4j events that we run in London. I tried to use ggplot to create a bar chart of the data. Unfortunately that resulted in this error:

Mark Needham 07/17/14
3248 views
0 replies

R: Apply a Custom Function Across Multiple Lists

In my continued playing around with R I wanted to map a custom function over two lists, comparing each item with its corresponding item.
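
The same idea can be sketched in Python rather than R, purely as an illustration and not as the approach taken in the post. The `compare` function and the two sample lists are hypothetical.

```python
def compare(left, right):
    """Hypothetical custom function: difference between two paired items."""
    return left - right


first = [1, 2, 3]
second = [3, 2, 1]

# map() with two iterables passes one item from each list to every call.
differences = list(map(compare, first, second))
print(differences)  # [-2, 0, 2]
```

Note that `map` stops at the end of the shorter input, so lists of unequal length are silently truncated.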

Rob J Hyndman 07/17/14
2865 views
0 replies

Variations on Rolling Forecasts

Rolling forecasts are commonly used to compare time series models. Here are a few of the ways they can be computed using R. I will use ARIMA models as a vehicle of illustration, but the code can easily be adapted to other univariate time series models.
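
One common variation, one-step forecasts from an expanding training window, can be sketched in Python as follows. This is a minimal sketch and not the R code from the post: the two toy forecasters stand in for a fitted ARIMA model, and all function names are hypothetical.

```python
from statistics import mean


def rolling_one_step_errors(series, min_train, forecast):
    """Forecast each point from all earlier points and collect the errors.

    `forecast` is any function that maps a training list to a one-step forecast.
    """
    errors = []
    for origin in range(min_train, len(series)):
        training = series[:origin]  # expanding window up to the forecast origin
        errors.append(series[origin] - forecast(training))
    return errors


def mean_absolute_error(errors):
    return mean(abs(e) for e in errors)


def last_value(training):
    """Naive forecaster: repeat the most recent observation."""
    return training[-1]


def historical_mean(training):
    """Forecaster that predicts the mean of everything seen so far."""
    return mean(training)
```

Comparing `mean_absolute_error` across forecasters on the same series gives a simple out-of-sample ranking of the candidate models.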

Jason Baldridge 07/16/14
3869 views
0 replies

Emotional Contagion: Contextualizing the Controversy

The past two weeks have seen a great deal of discussion around the recent computational social science study of Kramer, Guillory and Hancock (2014), “Experimental evidence of massive-scale emotional contagion through social networks”.

Doug Turnbull 07/16/14
1123 views
0 replies

Improving The Camel Solr Component

We’ve been using Apache Camel a fair amount recently as our ingestion pipeline of choice. It presents a fairly nice DSL for wiring together different data sources, performing transformations, and finally sending data to Solr.

Allen Coin 07/15/14
4691 views
0 replies

Data Visualization: A Day in the Life of an NYC Taxi

This visualization displays the data for one random NYC yellow taxi on a single day in 2013. See where it operated, how much money it made, and how busy it was over 24 hours.

Doug Turnbull 07/15/14
796 views
0 replies

Reindexing Collections with Solr’s Cursor Support

When a Solr schema changes, we Solr devs know what’s next: a large reindex of all of our data to capture any changes to index-time analysis. When we deliver solutions to our customers, we frequently need to build this in as a feature.

Yonik Seeley 07/15/14
643 views
0 replies

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter

In this post, I show how this filter, combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr: how to deal with multi-term synonyms.

Arthur Charpentier 07/15/14
1529 views
0 replies

Bayesian Wizardry for Muggles

On Monday, I will be giving the closing talk of the R in Insurance Conference in London, on Bayesian Computations for Actuaries or, to be more specific, Getting into Bayesian Wizardry… (with the eyes of a muggle actuary).

Whitney Baker 07/13/14
2569 views
0 replies

The Best of the Week (July 4): Big Data Zone

Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (July 4 to July 11). This week's topics include designing data architecture, Python in universities and other links, R tricks and Big Data white papers.

Brian O' Neill 07/12/14
3001 views
0 replies

Applied Big Data: The Freakonomics of Healthcare

Our healthcare system is still (mostly) based on capitalism: more patients + more visits = more money. Within such a system, it is not in the best interest of healthcare providers to have healthy patients.

Angela Ashenden 07/11/14
2378 views
0 replies

Overwhelmed by the Volume of Big Data Information Out There? Step This Way...

If you’re trying to navigate your way through the Big Data landscape and can’t see the wood for the trees, allow us to show you the way by cutting right to the chase with this new series of concise and right-to-the-point reports.

Eyal Golan 07/11/14
636 views
0 replies

Parse Elasticsearch Results Using Ruby

One of the modules in our project is an Elasticsearch cluster. In order to fine-tune the configuration (shards, replicas, mapping, etc.) and the queries, we created a JMeter environment.

Arthur Charpentier 07/10/14
4170 views
0 replies

Python's Popularity in Universities and other Big Data Links from Somewhere Else

Some data writings worth reading from Freakonometrics.