In this post we will use Avro for serializing and deserializing data. We will use these 3 methods in which we can use Avro for serialization/deserialization: Using Avro command line tools, using Avro Java API without code generation, and using Avro Java API with code generation.
Sometimes, reading candidates' answers pisses the author off. He has a question that goes something like this: "We have a 15TB CSV file that contains web log entries, sorted by date. Find all log entries within a given date range. You may not read more than 32 MB." This is how one candidate replied.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone! This week's best include an interview with Solr and Lucene specialist Rafał Kuć, a look at designing map/reduce algorithms, a tutorial on cleaning and optimizing the ElasticSearch indexes of Logstash, and more!
Interested in free books on statistical learning? There are a few recently available online, including Elements of Statistical Learning, Introduction to Statistical Learning with Applications in R, and Statistical foundations of machine learning.
Sometimes you just want to make games play themselves. This recent blog post applies reinforcement learning techniques to the controversial and recently-disappeared mobile game, Flappy Bird.
This tutorial will show you how to set up Eclipse and run your MapReduce project and jobs right from your IDE.
Apache Avro is a popular data serialization format and is gaining more users, because many Hadoop-based tools natively support Avro for serialization and deserialization. In this post we will understand some basics about Avro.
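For context, an Avro schema is plain JSON. A minimal record schema looks something like this (the record and field names are illustrative, not from the post):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]}
  ]
}
```

The union `["int", "null"]` is Avro's idiom for an optional field; because the schema travels with the data (or is agreed on out of band), readers and writers can evolve independently.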
This installment of Arthur Charpentier's regular collection of data science-related links includes a free "Introduction to R" ebook, another on Big Data Analytics, a visualization of Yankee and Red Sox fan borders using Facebook data, how Apple and Amazon's security flaws led to one hacker's skills, and more.
In our space, some of the most current healthcare-related information is found on the internet. Our crawlers run against hundreds of websites. We have a fairly large web harvester, which is what drove me to explore crawling the web with Nutch and Cassandra.
Every week here and in our newsletter, we feature a new developer/blogger from the DZone community to catch up and find out what he or she is working on now and what's coming next. This week we're talking to Rafał Kuć, software architect and Solr and Lucene specialist.
A few months ago, the author saw a link on Twitter to a graph charting the similarities of different foods based on their flavor compounds, in addition to their prevalence in recipes. The author decided to use the data for something different: Figuring out which ingredients tended to correlate across recipes.
This word might not exist, but mainstreamification is what’s happening to data analysis right now. Projects like Pandas, scikit-learn, MLbase, and Apache Mahout have made data analysis more accessible. The general message is: Data analysis has become super easy. But has it?
If you want to use Java objects as a data source and data set in Eclipse BIRT, you need to do so using a scripted data source and scripted data set. This article presents the usage of scripted data sets in Eclipse BIRT.
This installment of Arthur Charpentier's regular collection of data science-related links includes "An Economist's Guide to Visualizing Data," a look at computational actuarial science (CAS) with R, an economic analysis of the Somali pirate business model, and much more.
Recently the author read a book on Map/Reduce algorithms by Lin and Dyer. This book gives a deep insight in designing efficient M/R algorithms. Today, in this post, he will discuss the in-mapper combining algorithm and a sample M/R program using this algorithm.
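The in-mapper combining pattern is easy to sketch outside Hadoop: the mapper accumulates partial aggregates in local state and emits them once at the end of its input split (in Hadoop, from `cleanup()`), instead of emitting one pair per record for the shuffle to haul around. A Python word-count illustration of the idea (the book's examples are Java-style pseudocode; this sketch is mine):

```python
from collections import defaultdict

def mapper_with_combining(lines):
    """Word-count mapper using the in-mapper combining pattern:
    accumulate counts in a local dict across the whole input split,
    then emit one (word, partial_count) pair per distinct word,
    instead of one (word, 1) pair per occurrence."""
    counts = defaultdict(int)
    for line in lines:                 # one map() call per record
        for word in line.split():
            counts[word] += 1          # combine locally; emit nothing yet
    return sorted(counts.items())      # cleanup(): emit aggregated pairs
```

The payoff is fewer intermediate key-value pairs crossing the network, at the cost of holding the partial aggregates in the mapper's memory.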
ElasticSearch index files grow large quickly, and one of the most common questions about them is how to optimize and clean them, getting rid of old records you're no longer interested in. A very easy way to accomplish these tasks is with a pair of short scripts.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include some embarrassing code samples from job candidates, data science news from Arthur Charpentier, a review of Andrew Chisholm's "Exploring Data With RapidMiner," and more!
Last year, the author mentioned that unit-root tests are dangerous, because they might lead us to strange models. For instance, in an earlier post, the author found that the temperature observed in January 2013, in Montréal, might be considered a random walk process (or at least an integrated process).
In this article, the author will try to show how someone who digs deeply into data visualization has an advantage over those who take shortcuts. The example he will use is from a field completely unrelated to software development: music.
In this post, we resume our series on implementing the algorithms found in Data-Intensive Text Processing with MapReduce, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and completely skip the reducing phase.
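The precondition for a map-side join is that both inputs are already sorted and partitioned on the join key, so each mapper can simply stream-merge its two co-partitioned splits. A Python sketch of that per-mapper merge (assuming unique keys on each side; illustrative only, not the book's code):

```python
def map_side_join(left, right):
    """Merge-join two datasets already sorted by join key, mirroring
    what each mapper does in a map-side join: no shuffle, no reducers,
    just a linear merge of co-partitioned inputs. `left` and `right`
    are lists of (key, value) pairs sorted by key, keys unique per side."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))   # matching keys: emit joined record
            i += 1
            j += 1
        elif lk < rk:
            i += 1                     # left key has no partner yet
        else:
            j += 1                     # right key has no partner yet
    return out
```

Because the merge is a single linear pass per mapper, the expensive sort-and-shuffle of a reduce-side join disappears entirely; the cost is pushed upstream into keeping the inputs sorted and identically partitioned.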
Sometimes it takes very little time to know that a candidate is going to be pretty horrible. As you can probably guess, the sort of questions we ask tend to be “find me this data in this sort of file.” Probably the fastest indication is when they send me projects like these.
This installment of Arthur Charpentier's regular collection of data science-related links includes a discussion of open source data science tools and the bifurcation between Hadoop users and SQL/Excel users, an "idiot's" exploration of Bayesian analysis, and a visualization of iPhone users vs. Android users.
According to DJ Walker-Morgan at MongoHQ, you don't have Big Data. You may have lots of data, but that doesn't mean Big Data. Walker-Morgan's recent post on the subject covers a number of different ideas regarding Big Data and what it all means.
Google screwed the pooch with their protobuf 2.5 release. Code generated with protobuf 2.5 is binary incompatible with older protobuf libraries. Unfortunately, the current stable release of Flume 1.4 packages protobuf 2.4.1, and if you try to use HDFS on Hadoop 2.2 as a sink, you'll be smacked with an exception.