Are you practicing Big data blocking and tackling actions? While Hadoop makes it easier to warehouse data, effective analytics across disparate data sources still requires defining data semantics, data mapping, and master data sources. Don’t forget these important foundational building blocks.
Today the Apache Lucene and Solr PMC announced another version of the Apache Lucene library and the Apache Solr search server, numbered 4.7.1.
This installment of Arthur Charpentier's regular collection of data science-related links includes problems with Google's data-based flu tracker, "Simplifying Data Analysis & Making Sense of Big Data," and more.
In this article, Arthur Charpentier details the mathematics behind linear prediction for AR time series.
ASF is the home for the majority of open source big data projects and ApacheCon is a must-attend event if you care about big data. Being able to converse with many members of various Apache project communities is invaluable.
In this post the author demonstrates how to perform a map-side join when both data sets are large and can’t fit into memory. Map-side joins offer substantial gains in performance since we are avoiding the cost of sending data across the network.
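A minimal sketch of the idea behind that join, in plain Python rather than Hadoop: if both inputs are already sorted (and identically partitioned) on the join key, each mapper can stream-merge its pair of partitions without holding either data set in memory. The data and function names below are invented for illustration.

```python
# Sorted-merge join: the core trick behind a map-side join of two large,
# pre-sorted inputs. Neither side is materialized in memory; we only
# advance two cursors. (This sketch assumes keys are unique on each side.)

def merge_join(left, right):
    """Join two iterables of (key, value) pairs, both sorted by key."""
    left_iter, right_iter = iter(left), iter(right)
    l = next(left_iter, None)
    r = next(right_iter, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_iter, None)      # left key too small: advance left
        elif l[0] > r[0]:
            r = next(right_iter, None)     # right key too small: advance right
        else:
            yield (l[0], l[1], r[1])       # keys match: emit joined record
            l = next(left_iter, None)
            r = next(right_iter, None)

users = [(1, "alice"), (2, "bob"), (4, "dave")]
orders = [(1, "book"), (2, "lamp"), (3, "pen")]
print(list(merge_join(users, orders)))
# [(1, 'alice', 'book'), (2, 'bob', 'lamp')]
```

In Hadoop the same effect is achieved by partitioning and sorting both data sets identically in a previous job, so each map task can merge matching partition files directly.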
If your Elasticsearch cluster is in a remote location you might need to forward your data over an encrypted channel.
This article is dedicated to the manipulation of tables in Eclipse BIRT rptdesign.xml files via the Design Engine API.
The author wanted to know much more about exactly how Lucene stores data on disk. They knew the general picture of segments, files, and so on, but wanted to see the actual bits and bytes, so they started tracing into Lucene to figure out what it was doing.
According to a new report by Gartner, one-third of companies will face an information crisis within the next 3 years.
Both Hive and Pig require approximately the same number of lines to set up the log parsing, mostly because it involves defining each field label and data type individually and then a regex to parse the fields out of the input files. If you have a deserializer UDF, this is much easier in either case.
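What that per-field setup amounts to can be sketched outside of Hive or Pig: declare each field and its type, then apply one regex to split a log line. The (simplified) log format, field names, and regex below are invented for illustration; a Hive regex SerDe does essentially the same work.

```python
import re

# One regex with a named group per declared field, plus explicit type
# casts afterwards -- the same bookkeeping a Hive/Pig log-parsing setup
# spells out field by field.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    m = LOG_RE.match(line)
    if m is None:
        return None                       # unparseable line
    row = m.groupdict()
    row["status"] = int(row["status"])    # cast fields, as a SerDe would
    row["bytes"] = int(row["bytes"])
    return row

line = '127.0.0.1 - - [21/Mar/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
print(parse_line(line)["status"])  # 200
```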
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 21 to Mar. 27). This week's best include Java 8's impact on database access, how to use CustomScoreQuery with Solr/Lucene Scoring, and Apache Accumulo's ability to preserve security.
This installment of Arthur Charpentier's regular collection of data science-related links includes "74,476 Reasons You Should Always Get the Bigger Pizza," the distribution of user-selected Twitter languages, climate forecasts, and a picture of the human journey created from global DNA data.
This might explain why, in R, when we ask for an autoregressive process of order p, we get a model with p parameters to estimate, and even if some are not significant, we usually keep them for the forecast.
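The point above can be sketched numerically: an AR(p) one-step forecast uses all p estimated coefficients, significant or not. The coefficients and data here are made up for illustration (in R one would use `arima()` or the forecast package).

```python
# One-step AR(p) forecast: X_{t+1} = c + phi_1*X_t + ... + phi_p*X_{t-p+1}.
# Every coefficient contributes, even a tiny, "insignificant" one.

def ar_forecast(history, phi, intercept=0.0):
    """Forecast the next value of an AR(len(phi)) process."""
    p = len(phi)
    recent = history[-p:][::-1]        # X_t, X_{t-1}, ..., X_{t-p+1}
    return intercept + sum(f * x for f, x in zip(phi, recent))

# AR(3) with a small third coefficient that we keep anyway:
phi = [0.5, 0.2, 0.01]
x = [1.0, 1.2, 0.9, 1.1]
print(ar_forecast(x, phi))  # 0.5*1.1 + 0.2*0.9 + 0.01*1.2 ≈ 0.742
```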
Apache Accumulo is a system built for doing random I/O over petabytes of data. Distributing the computation to the data, with cell-level security, is where Accumulo really shines.
I often get asked how few data points can be used to fit a time series model. As with almost all sample size questions, there is no easy answer.
ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was originally a subproject of Hadoop.
Java 8 is finally here! After years of waiting, Java programmers will finally get support for functional programming in Java.
DAS is a platform for Hadoop that includes data source integration, an analytics engine, and visualization functionality. The promise of a fully integrated Big Data analysis process motivated me to test the product.
Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.
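Lucene's CustomScoreQuery is a Java API; the pattern it implements can be sketched in a few lines of Python: one component decides *which* documents match, and a wrapper substitutes its own score for ordering. All names and the popularity-boost heuristic below are invented for illustration.

```python
# The CustomScoreQuery pattern: keep the wrapped query's matching,
# replace its scoring.

def match(docs, term):
    """The 'wrapped query': pure matching, with a default score of 1.0."""
    return [(d, 1.0) for d in docs if term in d["text"]]

def custom_score(doc, base_score):
    """The rescoring hook: here, boost by a stored 'popularity' field."""
    return base_score * doc.get("popularity", 1.0)

docs = [
    {"id": 1, "text": "big data tools", "popularity": 2.0},
    {"id": 2, "text": "big data at scale", "popularity": 5.0},
    {"id": 3, "text": "small data", "popularity": 9.0},
]
hits = [(d, custom_score(d, s)) for d, s in match(docs, "big data")]
hits.sort(key=lambda h: h[1], reverse=True)
print([d["id"] for d, _ in hits])  # [2, 1] -- doc 3 never matched
```

Note that document 3, despite its high popularity, is never rescored at all: the custom score only reorders documents the wrapped query already matched, which is exactly the separation CustomScoreQuery gives you.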
Last night I gave a talk at SkillsMatter London on multi-lingual search with Lucene and Elasticsearch. The talk covered various challenges with indexing texts in various languages: tokenization, term normalization, and stemming. I started with...
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include world domination by algorithms, the sometimes scary potential of the internet of things, and tips on how to get started with Avro.
The deeper you want to go into the results, the slower the query will be, because Solr needs to prepare the data from the beginning for each query. Until Solr 4.7 there wasn’t a good solution to that problem.
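Why deep paging with `start`/`rows` is expensive can be sketched in Python: with an offset, the engine must collect everything up to `start + rows` on every request, whereas a cursor (the approach behind the `cursorMark` feature added in Solr 4.7) only needs documents that sort after the last one already returned. The functions below are a toy model, not Solr code.

```python
# Toy model of two pagination strategies over documents sorted by score.
scores = list(range(99_999, -1, -1))   # pretend these are docs, best first

def page_by_offset(start, rows):
    """Offset paging: must materialize the top start+rows results."""
    return scores[: start + rows][start:]

def page_by_cursor(after, rows):
    """Cursor paging: resume strictly after the last returned sort value."""
    return [s for s in scores if s < after][:rows]

# The same deep page, reached two ways:
print(page_by_offset(99_990, 5))   # had to consider 99,995 candidates
print(page_by_cursor(10, 5))       # only needs values below the cursor
```

Both calls return the same page, but the offset version's work grows with the page's depth, while the cursor version's work is independent of how deep you already are.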
My advice is to be careful about “Big Data” hype. We’re hearing about implementations that simply take too much time, 9 to 12 months, to produce value. The “Data Scientist” concept is often divorced from the domain expertise needed to produce real business insight.
Many functions in the forecast package for R will allow a Box-Cox transformation. The models are fitted to the transformed data and the forecasts and prediction intervals are back-transformed. Occasionally the mean forecast is required.
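The round trip described above can be sketched numerically. Naively back-transforming a Box-Cox point forecast yields the forecast *median*; when the mean is required, a bias adjustment based on the forecast variance is applied. The adjustment formula below follows Hyndman's published bias-adjustment note; the numbers are invented for illustration, and in R this is handled by the forecast package itself.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform: log(y) if lambda=0, else (y^lambda - 1)/lambda."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def inv_boxcox(z, lam):
    """Naive back-transform: returns the forecast *median*."""
    return math.exp(z) if lam == 0 else (lam * z + 1) ** (1 / lam)

def inv_boxcox_mean(mu, sigma2, lam):
    """Bias-adjusted back-transform: approximate forecast *mean*,
    using the variance sigma2 of the forecast on the transformed scale."""
    if lam == 0:
        return math.exp(mu) * (1 + sigma2 / 2)
    return inv_boxcox(mu, lam) * (
        1 + sigma2 * (1 - lam) / (2 * (lam * mu + 1) ** 2)
    )

y = 20.0
z = boxcox(y, 0.5)
print(inv_boxcox(z, 0.5))            # round trip recovers 20.0
print(inv_boxcox_mean(z, 0.1, 0.5))  # slightly larger: mean > median here
```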