In honor of Facebook's 10th anniversary, HadoopSphere pointed to Facebook's influence on the open source Big Data community by assembling a list of major contributions, as well as an infographic to show it all off.
With Avro, the context and the values are separated. This means the schema describing the structure of the information is stored once, instead of being stored or streamed with every record, over and over and over (and over) again.
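The saving can be sketched with a stdlib-only toy (this is not Avro's actual binary encoding, though that format likewise omits field names from the data): compare a self-describing format that repeats field names in every record with a schema-once layout whose records carry only values.

```python
import json

# Self-describing format: every record repeats the field names.
records = [{"name": "ada", "clicks": 3}, {"name": "bob", "clicks": 7}]
self_describing = "\n".join(json.dumps(r) for r in records)

# Avro-style separation: the schema is written once up front,
# and each record stores only its values, in schema order.
schema = {"fields": ["name", "clicks"]}
values_only = json.dumps(schema) + "\n" + "\n".join(
    json.dumps([r[f] for f in schema["fields"]]) for r in records
)

print(len(self_describing), len(values_only))
```

With only two records the totals are close, but the values-only layout grows by just the values per record, while the self-describing one repeats every field name each time.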
As of the Infinispan 5.2 release, we have implemented fully distributed execution of both the map and reduce phases of MapReduceTask.
Before anyone freaks out, the author's talking about a technology collapse, not a market collapse or steep downhill slope of a hype curve. Market demands are pushing our systems to ingest increasing amounts of data in a shorter time, while also making that data available to an increasing variety of queries.
Elastic MapReduce has the ability to run jobs against datasets located in S3 (rather than on HDFS, as is usually the case). The author believes this was once a customisation AWS had applied to Hadoop, but that it has been in mainline Hadoop for some time now.
In one of the author's previous posts, he explained how to convert JSON data to Avro data and vice versa using the Avro tools' command-line options. Today he looks at the options for converting CSV data to Avro format, since as of now the Avro tools offer no option to accomplish this.
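Until such an option exists, one workaround (a sketch, with hypothetical file and field names) is to convert the CSV to newline-delimited JSON using Python's standard library, then hand that to the JSON-to-Avro conversion the author already described:

```python
import csv
import io
import json

# Hypothetical CSV input; in practice this would be read from a file.
csv_text = "id,name\n1,ada\n2,bob\n"

# Convert each CSV row into one JSON record per line, matching the
# field names and types expected by the target Avro schema.
json_lines = [
    json.dumps({"id": int(row["id"]), "name": row["name"]})
    for row in csv.DictReader(io.StringIO(csv_text))
]
ndjson = "\n".join(json_lines)
print(ndjson)

# The resulting file can then be converted with the Avro tools, e.g.:
#   java -jar avro-tools.jar fromjson --schema-file user.avsc input.json > output.avro
```

The schema file (`user.avsc` above) is an assumption; it has to declare the same fields and types the converter emits.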
Here is a short snippet of an Ansible playbook that installs R and any required packages on all nodes of the cluster. Note that the command installs each package only if it is not already present, but it muddles the “changed” status in Ansible’s PLAY RECAP by incorrectly reporting one change per R package on every run.
The current version of Oozie (4.0.0) doesn’t build correctly when you try to target Hadoop 2.2. The Oozie team has a fix going into release 4.0.1 (see OOZIE-1551), but until then you can hack the Maven files to get 4.0.0 working.
It’s sometimes easy to assume that the clusters of commodity servers commonly associated with big data have made high performance computing (HPC) installations a thing of the past. But Robert Clyde argues that HPC has evolved, and that the machines in HPC labs now look an awful lot like regular computers.
Another day, another data breach. The author just received another “we’re sorry you got hacked” letter. This is the fifth letter he has received in the past three months: Forbes.com, Target, Neiman Marcus, a credit card company, and a previous employer. What is going on?
One difference between MapReduce and Impala is that in Impala, intermediate data moves directly from process to process instead of being staged on HDFS for downstream processes to read. This provides a huge performance advantage while consuming fewer cluster resources.
Python has a vast library of modules that are included with its distribution. The csv module gives the Python programmer the ability to parse CSV (Comma Separated Values) files.
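A minimal example of the module in action, parsing an in-memory CSV string with a header row:

```python
import csv
import io

# DictReader uses the header row as keys, yielding one dict per data row.
data = "city,population\nParis,2140526\nOslo,697010\n"
rows = list(csv.DictReader(io.StringIO(data)))

for row in rows:
    print(row["city"], row["population"])
```

In real use the `io.StringIO` wrapper would be replaced by an open file handle; `csv.reader` is the lower-level alternative when you want plain lists instead of dicts.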
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include machine learning and Flappy Bird, a how-to on using ElasticSearch from AngularJS, a collection of free books on statistical learning, and more!
This week, Apache Hadoop 2.3.0 was released. There are a lot of bug fixes and small changes in this one - you can read it all in Apache's release notes - but there are also some bigger changes, such as in-memory caching for HDFS and a heterogeneous storage hierarchy in HDFS.
The author has started work on the second edition of his book, which will bring existing coverage up to date, and also add new chapters covering things like YARN, Running Storm on YARN, pulling data out of Kafka into HDFS, using Spark for in-memory, iterative data processing, and more.
Sometimes it is useful to “backcast” a time series — that is, forecast in reverse time. Although there are no in-built R functions to do this, it is very easy to implement.
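The idea translates to any language; here is a minimal sketch in Python rather than R, using a naive drift forecast as a stand-in for a real forecasting model: reverse the series, forecast forward, then reverse the forecasts.

```python
def drift_forecast(series, h):
    """Naive drift forecast: extend the average per-step change."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + slope * (i + 1) for i in range(h)]

def backcast(series, h):
    reversed_series = series[::-1]           # reverse time
    fc = drift_forecast(reversed_series, h)  # forecast the reversed series
    return fc[::-1]                          # reverse back: earliest value first

y = [10, 12, 14, 16, 18]
print(backcast(y, 3))  # → [4.0, 6.0, 8.0], the values "before" the series starts
```

Any forecasting method can be dropped in for `drift_forecast`; the backcasting trick is only the reverse-forecast-reverse wrapper around it.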
This article introduces the Query By Example (QBE) capabilities of Hibernate. It also presents how to use QBE together with Criteria Query, and gives an overview of QBE usage in other JPA implementations.
Not all the data we handle is easy to query, often because much of what we process is unstructured: logs, archived documents, user data, or text fields in our databases that we know contain useful information, but that we just don’t know how to get at.
Today the Apache Lucene and Solr PMC announced a new release of the Apache Lucene library and the Apache Solr search server – version 4.7. This is another release from the 4.x branch, bringing new features and bug fixes.
In the author's previous post, he described how an in-mapper combiner can make an M/R program more efficient, and showed M/R algorithms for average calculation both with and without the in-mapper combiner optimization. In this post, the author shares code for both algorithms.
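For readers who want the shape of the technique without the Hadoop boilerplate, here is a hypothetical pure-Python simulation: rather than emitting one (key, value) pair per input record, the mapper accumulates partial (sum, count) pairs per key in memory and emits them once when it finishes, so the reducer can still compute an exact average.

```python
from collections import defaultdict

def mapper(records):
    # In-mapper combiner: aggregate in memory instead of emitting per record.
    partials = defaultdict(lambda: [0.0, 0])
    for key, value in records:
        partials[key][0] += value   # running sum
        partials[key][1] += 1       # running count
    # Emit once per key (the cleanup/close step of a Hadoop mapper).
    for key, (s, c) in partials.items():
        yield key, (s, c)

def reducer(key, partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return key, total / count

# Simulate the shuffle: group mapper output by key, then reduce.
records = [("a", 1), ("a", 3), ("b", 10)]
grouped = defaultdict(list)
for key, pair in mapper(records):
    grouped[key].append(pair)
averages = dict(reducer(k, v) for k, v in grouped.items())
print(averages)  # → {'a': 2.0, 'b': 10.0}
```

Note that emitting (sum, count) pairs rather than averages is what keeps the result correct: averages of averages would be wrong when partial counts differ.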
Hive supports different data types for use in table columns. The data types supported by Hive can be broadly classified into primitive and complex types. The primitive data types supported by Hive are listed here.
This installment of Arthur Charpentier's regular collection of data science-related links includes thoughts on Big Data visualization, a statistical model for predicting Olympic medal winners, the role of intuition in Big Data, data science's move from hubris and machismo to human-centered design, and much more.
In this post we will use Avro for serializing and deserializing data, covering three methods: using the Avro command-line tools, using the Avro Java API without code generation, and using the Avro Java API with code generation.
Sometimes, reading candidates' answers pisses the author off. He has an interview question that goes something like this: "We have a 15TB CSV file that contains web log entries sorted by date. Find all log entries within a given date range. You may not read more than 32 MB." This is how one candidate replied.
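For reference, the kind of answer the question invites (a sketch, assuming each entry starts with a `YYYY-MM-DD` date and lines are newline-terminated) is a binary search on byte offsets: each probe seeks into the file, discards the partial line it landed in, and inspects a single date, so the total bytes read stay far below the 32 MB budget even for a 15 TB file.

```python
import io

def find_start(f, size, date):
    """Byte offset of the first entry whose date is >= `date`."""
    def probe(mid):
        f.seek(mid)
        if mid > 0:
            f.readline()  # discard the partial line we landed in
        line = f.readline()
        return not line or line[:10].decode() >= date
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(mid):
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo > 0:
        f.readline()
    return f.tell()

def entries_in_range(f, size, start_date, end_date):
    # Seek to the first in-range entry, then scan forward until past the range.
    f.seek(find_start(f, size, start_date))
    out = []
    for line in f:
        if line[:10].decode() > end_date:
            break
        out.append(line.decode().rstrip("\n"))
    return out

# Toy "log file" in memory; a real run would open the 15 TB file in binary mode.
dates = ["2014-01-01", "2014-01-03", "2014-02-10", "2014-02-14", "2014-03-01"]
log = "".join(f"{d} entry-{i}\n" for i, d in enumerate(dates)).encode()
f = io.BytesIO(log)
result = entries_in_range(f, len(log), "2014-01-02", "2014-02-14")
print(result)
```

Each probe reads at most two lines, and there are O(log n) probes, which is the property the 32 MB constraint is designed to test for.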
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone! This week's best include an interview with Solr and Lucene specialist Rafał Kuć, a look at designing map/reduce algorithms, a tutorial on cleaning and optimizing the ElasticSearch indexes of Logstash, and more!