Big Data/BI Zone
Mark Needham, 05/15/13
1725 views
0 replies

Book Review: The Signal and the Noise by Nate Silver

Nate Silver is famous for having correctly predicted the winner in all 50 states in the 2012 United States presidential election. Sid recommended his book so that I could learn more about statistics for the A/B tests we were running.

Marko Rodriguez, 05/15/13
1057 views
0 replies

Educating the Planet and Graph Databases

New data processing technologies and theories in education are moving much of the learning experience into the digital space — into massive open online courses (MOOCs). Two years ago Pearson contacted Aurelius about applying graph theory and network science to this burgeoning space.

John Cook, 05/15/13
1642 views
0 replies

Synchronizing Cicadas with Python

Suppose you want to know when your great-grandmother was born. You can’t find the year recorded anywhere. But you did discover an undated letter from her father that mentions her birth and one curious detail: the 13-year and 17-year cicadas were swarming.
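
The arithmetic behind the puzzle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the emergence years below are hypothetical, not taken from the article, and real broods would need real reference years.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple: how often two cycles line up."""
    return a * b // gcd(a, b)


def coinciding_years(first_a, period_a, first_b, period_b, start, end):
    """Years in [start, end] in which both broods emerge.

    first_a and first_b are known emergence years for each brood
    (hypothetical inputs here).
    """
    return [
        year
        for year in range(start, end + 1)
        if (year - first_a) % period_a == 0 and (year - first_b) % period_b == 0
    ]


# 13 and 17 are coprime, so the two cycles coincide every 13 * 17 years.
cycle = lcm(13, 17)

# Hypothetical data: suppose both broods emerged together in 1900.
years = coinciding_years(1900, 13, 1900, 17, 1900, 2200)
```

Because the two periods share no common factor, a double emergence is rare, which is what makes the detail in the letter so useful for dating.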

Nick Johnson, 05/14/13
4474 views
0 replies

Algorithm of the Week: Damn Cool Secure Permutations with Block Ciphers

A secure permutation is one in which an attacker, given any subset of the permutation, cannot determine the order of any other elements. A simple example of this would be to take a cryptographically secure pseudo-random number generator, seed it with a secret key, and use it to shuffle your sequence.
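
The simple example in the teaser, shuffling with a keyed generator, could be sketched as below. Everything here is an assumption for illustration: `keyed_stream` and `keyed_shuffle` are hypothetical names, and HMAC-SHA256 in counter mode stands in for "a cryptographically secure pseudo-random number generator". It is not the block-cipher technique the title refers to, and it has not been vetted for production use.

```python
import hashlib
import hmac


def keyed_stream(key):
    """Yield an endless stream of 256-bit integers derived from a secret key.

    Assumption: HMAC-SHA256 over an incrementing counter is used as the
    keyed pseudo-random source.
    """
    counter = 0
    while True:
        block = hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
        yield int.from_bytes(block, "big")


def keyed_shuffle(items, key):
    """Fisher-Yates shuffle driven by the keyed stream.

    The same key always gives the same order. Reducing a 256-bit draw
    modulo a small range leaves a negligible bias.
    """
    out = list(items)
    stream = keyed_stream(key)
    for i in range(len(out) - 1, 0, -1):
        j = next(stream) % (i + 1)
        out[i], out[j] = out[j], out[i]
    return out


shuffled = keyed_shuffle(range(10), b"secret key")
```

Note that this approach has to materialize and shuffle the whole sequence, which is the cost a permutation computed element by element would avoid.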

Chase Seibert, 05/14/13
2374 views
0 replies

Hive with HBase Quickstart

Though there is some decent documentation, I found setting up Hive with an HBase back end to be somewhat fiddly. Hopefully this guide will help you get started more quickly.

Michael Mccandless, 05/14/13
1850 views
1 reply

Eating Dogfood with Lucene

For the past few weeks I've been building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.

Arthur Charpentier, 05/14/13
166 views
0 replies

Reproducibility and Randomness

The good thing is that even complex functions (logistic regression, regression trees, etc) produce the same kind of outputs. But we found a problem that we could not fix: generating identical training subsets of observations…

Paul Miller, 05/13/13
1766 views
0 replies

Seeking Simplicity's Sweet Spot

"Everything should be made as simple as possible, but not simpler." These words have resonated with me recently, as I’ve heard pitches from one company after another, all of which are trying to cut through the complexity of data to make it accessible.

John Cook, 05/13/13
1504 views
0 replies

Mutually Odd Functions

The floor of a real number x is the largest integer n ≤ x, written ⌊x⌋. The ceiling of a real number x is the smallest integer n ≥ x, written ⌈x⌉. The floor and ceiling have the following symmetric relationship
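
The teaser is cut off before the relationship itself, so the check below uses a standard identity linking the two functions, floor(-x) = -ceil(x) and ceil(-x) = -floor(x), which may or may not be the exact form the article goes on to state. The helper name `mirrors` is hypothetical.

```python
import math


def mirrors(x):
    """Check that negating the argument swaps floor and ceiling, up to sign."""
    return (
        math.floor(-x) == -math.ceil(x)
        and math.ceil(-x) == -math.floor(x)
    )


samples = [2.5, -2.5, 3.0, 0.0, -0.1, 7.999]
all_hold = all(mirrors(x) for x in samples)
```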

Arthur Charpentier, 05/13/13
605 views
0 replies

Playing Cards with R

In my courses on R, I usually show how to insert a picture as a background for a graph. But it is also possible to treat the picture as an object and to insert it in a graph wherever we would like to see it.

Arthur Charpentier, 05/13/13
410 views
0 replies

Poisson Regression on Non-Integers

In my course on claims reserving techniques, I mentioned the use of Poisson regression, even if incremental payments were not integers. For instance, we did consider incremental triangles...

Christopher Taylor, 05/12/13
2981 views
0 replies

Why Don't We Just Hadoopify It?

Matt Schumpert took the stage at the InterOp Big Data Workshop in Las Vegas yesterday to talk about the myths and realities of Big Data. Matt is Director of Solutions Engineering at Datameer and has customers that include Visa, Sears, and three of the world's five largest banks. He’s an expert on Big Data and brought the following insights...

Arthur Charpentier, 05/12/13
897 views
0 replies

Data News: "Life in the City is One Giant Math Problem" and More

Arthur Charpentier's data link roundup takes a look at the mathematics of life in the city, the Batman equation, an accurate geek CT scan, and much more.

Eli Bendersky, 05/11/13
2342 views
0 replies

Python Will Have enums in 3.4!

After months of intensive discussion (more than 1,000 emails in dozens of threads spread over two mailing lists, and a couple of hundred additional private emails), PEP 435 has been accepted and Python will finally have an enumeration type in 3.4!
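
For readers who have not seen the result, here is a minimal sketch of the `enum` module that PEP 435 added to the standard library. It needs Python 3.4 or later, and the `Color` class and its members are illustrative names only.

```python
from enum import Enum


class Color(Enum):
    """An illustrative enumeration; the members are arbitrary examples."""
    RED = 1
    GREEN = 2
    BLUE = 3


by_value = Color(2)        # look up a member by its value
by_name = Color["BLUE"]    # look up a member by its name
names = [member.name for member in Color]  # iteration follows definition order
```

Members are singletons, so they can be compared with `is`, and a member is not equal to its underlying integer value.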

Christopher Taylor, 05/11/13
1545 views
1 reply

I Have All the Data in the World, Now What?

The Big Data Workshop at InterOp Las Vegas wrapped up the morning with a presentation on Big Data requirements by John West, CTO and Founder of Fabless Labs. John kicked off with the challenge of having your enormous data set all ready to work with when you discover any one of the following problems...

Charles Anderson, 05/10/13
4576 views
0 replies

Review: Hadoop Beginner's Guide

Hadoop Beginner's Guide provides an introduction to how to get up and running with the core components of Hadoop, some higher level tools like Hive, integration tools like Sqoop and Flume, and it also provides some good starting information relating to operational issues with Hadoop.

Christopher Taylor, 05/10/13
220 views
0 replies

Forget the "Big" Part for a Moment, Think About Data

Sometimes, just sometimes, what happens in Las Vegas shouldn’t stay in Las Vegas. That was clearly the case this morning when TIBCO CTO Matt Quinn took the stage to talk about the myths and realities of Big Data.

Bootstrap Mark..., 05/10/13
333 views
0 replies

Why Aren’t BI Users Analyzing Hadoop Data? – Part 2

Continuing from part 1 of this blog series, we will focus on answering the question “Who are some of the folks working to make Hadoop more accessible to BI users?”

Eric Gregory, 05/10/13
255 views
0 replies

On Hadoop On Azure

Yaniv Rodenski delivers a fifty-minute talk about Hadoop on Azure.

Kay Cichini, 05/09/13
1394 views
0 replies

Creating a QGIS-Style (qml-file) with an R-Script

How do you get from a txt-file with short names and labels to a QGIS-Style (qml-file)? I used the R-script below to create a style for this legend table, where I copy-pasted the parts I needed to a txt-file, as for the WRB-FULL (WRB-FULL: Full soil code of the STU from the World Reference Base for Soil Resources).

Gary Sieling, 05/09/13
1724 views
0 replies

Extracting PDF Text with Scala

This example extracts the text contents of a PDF for use in other systems. This demonstrates some basic differences from Java: multi-line strings (hooray!), imports, primitive arrays, and what implementing an interface looks like.

John Cook, 05/09/13
1492 views
0 replies

Almost If and Only If

If you used the Perrin condition to test whether numbers less than a billion are prime, you would correctly identify all 50,847,534 primes as primes. But out of the 949,152,466 composite numbers, you would falsely report 17 of these as prime.
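
As a rough sketch of the test being described: the Perrin sequence starts P(0) = 3, P(1) = 0, P(2) = 2 and continues with P(n) = P(n-2) + P(n-3), and every prime n divides P(n). The function name `passes_perrin` is hypothetical, and the simple linear-time loop is an assumption chosen for readability, not necessarily how the article computes it.

```python
def passes_perrin(n):
    """Return True if n divides the n-th Perrin number.

    Every prime passes; a composite that passes is a Perrin pseudoprime.
    Values are reduced modulo n at each step to keep them small.
    """
    if n < 2:
        return False
    # a, b, c hold P(k), P(k+1), P(k+2) modulo n, starting at k = 0.
    a, b, c = 3 % n, 0, 2 % n
    for _ in range(n):
        a, b, c = b, c, (a + b) % n
    return a == 0


small_primes = [n for n in range(2, 30) if passes_perrin(n)]
```

The smallest composite that slips through is 271441, which is 521 squared, so below that point the condition agrees exactly with primality.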

Niranjan Tallapalli, 05/09/13
2441 views
0 replies

Hashmap Internal Implementation Analysis in Java

A detailed analysis of java.util.HashMap’s implementation, its internals, and its working concepts.

John Cook, 05/08/13
1010 views
0 replies

Ramanujan Approximation for Circumference of an Ellipse

There’s no elementary formula for the circumference of an ellipse, but there is an elementary approximation that is extremely accurate.
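
Ramanujan gave more than one approximation, and the teaser does not say which is meant. The sketch below assumes the more accurate of his two well-known formulas, π(a + b)(1 + 3λ² / (10 + √(4 − 3λ²))) with λ = (a − b) / (a + b), where a and b are the semi-axes. The function name is hypothetical.

```python
import math


def ramanujan_circumference(a, b):
    """Approximate the circumference of an ellipse with semi-axes a and b.

    Assumption: this is Ramanujan's second approximation, which may or
    may not be the one the article discusses.
    """
    lam = (a - b) / (a + b)
    correction = 3 * lam**2 / (10 + math.sqrt(4 - 3 * lam**2))
    return math.pi * (a + b) * (1 + correction)


circle = ramanujan_circumference(1.0, 1.0)     # a circle of radius 1
flattened = ramanujan_circumference(1.0, 0.0)  # degenerate ellipse; exact value is 4
```

For a circle the correction term vanishes and the formula is exact. Even in the fully flattened case, the worst one for this formula, the result is off by well under one percent.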

Arthur Charpentier, 05/08/13
1705 views
0 replies

Data News: "Algorithms Every Data Scientist Should Know" and More

Arthur Charpentier's regular roundup of stats and data science-related links points us to algorithms every data scientist should know, a free ebook on probabilistic programming and Bayesian methods for coders, and much, much more.