What is going on here is that the commentators are assuming we live in a noise-free world. However, the world is noisy — real data are subject to random fluctuations, and are often also measured inaccurately. So to interpret every little fluctuation is silly and misleading.
How do you do sorting on a field value? The answer is, not easily.
The author of this article uses data concerning First Nations libraries in Ontario to demonstrate variations of data visualization in R.
Simpson’s Paradox is a phenomenon in which a trend identified from a population is reversed when investigated at the sub-population levels. Think about that again – conclusions drawn from an overall set of data are not indicative of the behavior of the underlying subsets.
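The reversal is easiest to see with numbers. A minimal sketch using the well-known kidney-stone treatment figures (not from the article itself): treatment A beats treatment B within every subgroup, yet loses on the pooled data.

```python
data = {
    # stone size: (A_success, A_total, B_success, B_total)
    "small": (81, 87, 234, 270),
    "large": (192, 263, 55, 80),
}

def rate(success, total):
    return success / total

# Within each subgroup, treatment A outperforms treatment B.
for size, (a_s, a_t, b_s, b_t) in data.items():
    assert rate(a_s, a_t) > rate(b_s, b_t)

# Pooled across subgroups, the trend reverses: B outperforms A.
a_s = sum(v[0] for v in data.values())
a_t = sum(v[1] for v in data.values())
b_s = sum(v[2] for v in data.values())
b_t = sum(v[3] for v in data.values())
assert rate(b_s, b_t) > rate(a_s, a_t)
print(f"A overall: {rate(a_s, a_t):.0%}, B overall: {rate(b_s, b_t):.0%}")
```

The reversal happens because subgroup sizes are unbalanced: A was mostly applied to the hard cases, dragging its pooled rate down.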
Continuing his trip into the Lucene codebase, the author now looks at the indexing process as it happens. Interestingly enough, that is something we never really had to look at before.
In this recap of a podcast with Bikas Saha and Arun Murthy, the author got to hear about some of what is in Hadoop 2.4 and what is coming in 2.5.
Apache Spark is an increasingly popular replacement for MapReduce, offering a more performant execution engine while still using Hadoop HDFS as the storage layer for large data sets.
The Data Platform Group at Microsoft does a lot, from SQL Server and their Hadoopey HDInsight offering through to Business Intelligence and analytics capabilities which sit in or on top of the humble Excel spreadsheet.
This installment of Arthur Charpentier's regular collection of data science-related links includes "'Big Data' the Buzzword vs. Data the Actual Thing," the influential interaction between data visualization and storytelling, what big data can say about relationships between world leaders, and more.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 28 to Apr. 03). This week's best include the Apache Solr/Lucene 4.7.1 announcement, a discussion of how some tools can make things harder instead of easier, and an overview of the upcoming ApacheCon.
One convenience of this implementation is that we can deploy the above classes in a jar under the Solr core's lib directory. We do not need to overhaul the Solr source code or deal with deploying "custom" Solr shards.
Are you practicing big data blocking-and-tackling actions? While Hadoop makes it easier to warehouse data, effective analytics across disparate data sources still requires defining data semantics, data mapping, and master data sources. Don't forget these important foundational building blocks.
Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and the Apache Solr search server, numbered 4.7.1.
This installment of Arthur Charpentier's regular collection of data science-related links includes problems with Google's data-based flu tracker, "Simplifying Data Analysis & Making Sense of Big Data," and more.
In this article, Arthur Charpentier details the mathematics behind linear prediction for AR time series.
ASF is the home for the majority of open source big data projects and ApacheCon is a must-attend event if you care about big data. Being able to converse with many members of various Apache project communities is invaluable.
In this post the author demonstrates how to perform a map-side join when both data sets are large and can't fit into memory. Map-side joins offer substantial gains in performance since they avoid the cost of sending data across the network.
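A map-side join over two large inputs relies on both being sorted and identically partitioned by the join key, so each mapper can stream through its two splits in lockstep without buffering either in memory. As a language-agnostic illustration of that underlying sorted-merge step (this is a sketch, not the article's Hadoop code, and it assumes unique keys on each side):

```python
def merge_join(left, right):
    """Yield (key, left_val, right_val) for keys present in both inputs.
    Both inputs must already be sorted by key, with unique keys."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield (l[0], l[1], r[1])
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)   # left key too small, advance left
        else:
            r = next(right, None)  # right key too small, advance right

users  = [(1, "ann"), (2, "bob"), (4, "eve")]
orders = [(1, "book"), (3, "lamp"), (4, "pen")]
print(list(merge_join(users, orders)))
# → [(1, 'ann', 'book'), (4, 'eve', 'pen')]
```

The pre-sorting and co-partitioning requirement is the real cost here; the merge itself uses constant memory regardless of input size.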
If your Elasticsearch cluster is in a remote location you might need to forward your data over an encrypted channel.
This article is dedicated to the manipulation of tables in Eclipse BIRT rptdesign.xml files via the Design Engine API.
The author wanted to know a lot more about exactly how Lucene stores data on disk. They know the general stuff about segments and files, but they wanted to see the actual bits and bytes, so they started tracing into Lucene to figure out what it is doing.
According to a new report by Gartner, one-third of companies will face an information crisis within the next 3 years.
Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves setting up each field label and data type individually and then a regex to parse the fields out of the input files. If you have a deserializer UDF, this is made much easier in either case.
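To illustrate the kind of field-extracting regex both tools end up applying line by line, here is a hypothetical Python sketch for an Apache common-log line (the field names and log format are assumptions for illustration, not taken from the article):

```python
import re

# Named groups play the role of the per-field labels you would declare
# in a Hive or Pig table definition.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = LOG_RE.match(line)
print(m.group("ip"), m.group("path"), m.group("status"))
# → 127.0.0.1 /index.html 200
```

A deserializer UDF hides exactly this boilerplate: one regex plus one label and type declaration per field.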
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 21 to Mar. 27). This week's best include Java 8's impact on database access, how to use CustomScoreQuery with Solr/Lucene Scoring, and Apache Accumulo's ability to preserve security.
This installment of Arthur Charpentier's regular collection of data science-related links includes "74,476 Reasons You Should Always Get the Bigger Pizza," the distribution of user-selected Twitter languages, climate forecasts, and a picture of the human journey created from global DNA data.
This might explain why, in R, when we ask for an autoregressive process of order p, we get a model with p parameters to estimate, and even if some are not significant, we usually keep them for the forecast.
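The article works in R (with ar() and arima()); as a rough analogue, here is a hedged Python sketch that fits an AR(p) model by ordinary least squares and shows the point: asking for order 5 on data that is really AR(2) still returns exactly 5 coefficients, the extra ones near zero but kept for forecasting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: x_t = 0.5*x_{t-1} - 0.3*x_{t-2} + noise
n = 2000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

def fit_ar(series, p):
    """Least-squares AR(p) fit; always returns exactly p coefficients."""
    y = series[p:]
    # Column k holds lag k+1 of the series.
    X = np.column_stack([series[p - 1 - k : len(series) - 1 - k] for k in range(p)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Order 5 requested, order 2 truth: five estimates come back regardless.
print(fit_ar(x, 5))
```

The first two estimates land near 0.5 and -0.3; the remaining three hover near zero, yet a forecast from this model would still use all five.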