Teradata unveiled a number of enhancements to its core data management offerings. One announcement stood out: the launch of QueryGrid, a tool designed to orchestrate the execution of analytic processing across parallel databases.
The author has been using Lucene for the past six or seven years, and after his last post, he thought it would be a good idea to talk a bit about the kinds of things it doesn't do well.
Typing tables in LaTeX can get messy, but there are some good tools to simplify the process.
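For context, even a minimal hand-written table (a generic illustration, not code from the article) involves a fair amount of alignment and rule bookkeeping, which is exactly what such tools automate:

```latex
\begin{tabular}{lrr}
  \hline
  Item & Count & Share \\
  \hline
  Red  &    12 & 0.40  \\
  Blue &    18 & 0.60  \\
  \hline
\end{tabular}
```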
So, whatever the test, we always reject the hypothesis of a seasonal unit root. That does not mean we cannot have a strong cycle! Actually, the series is almost periodic. But there is no unit root!
The hypothesis of this theorem is that the underlying distribution has a mean. Let's see where things break down if the distribution does not have a mean.
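The classic distribution with no mean is the standard Cauchy. A quick sketch (my own illustration, not code from the article) shows its sample mean refusing to settle down as the sample grows:

```python
import math
import random

def running_means(n_points, seed=0):
    """Sample means of a standard Cauchy distribution at increasing
    sample sizes. Because the Cauchy has no mean, these do not
    converge the way they would for, say, a normal distribution."""
    rng = random.Random(seed)
    total = 0.0
    means = {}
    checkpoints = set(n_points)
    for i in range(1, max(n_points) + 1):
        # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy.
        total += math.tan(math.pi * (rng.random() - 0.5))
        if i in checkpoints:
            means[i] = total / i
    return means

print(running_means([100, 10_000, 1_000_000]))
```

Run it with different seeds: the reported means jump around wildly instead of stabilizing, which is precisely where the theorem's hypothesis breaks down.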
The database plugin in IntelliJ IDEA is a useful tool for working with data in databases. As long as we have a JDBC driver to connect to the database, we can configure a data source.
SQL as an interface to big data operations is desirable for the same reasons the author found it useful, but it also brings performance expectations that traditional MapReduce-style jobs cannot meet: users expect results in seconds, while such jobs tend to have completion times in the tens of minutes to hours.
Infochimps has moved in a different direction, focusing far more attention upon the tools and services required to work with data, less upon offering a place for customers to find data. We touch upon Hadoop’s role within the growing big data ecosystem, asking if it’s as important as its backers tend to claim.
Get down with R and start visualizing your data in a whole new way!
In most of these applications, you have to deal with event data that arrives in real time. The data is constantly changing, and you usually want to consider it over a certain time frame ("page views in the last hour") instead of taking all past data into account.
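The "page views in the last hour" idea can be sketched with a simple sliding-window counter (my own minimal example, not the article's implementation), where old events are evicted as time moves on:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events seen in the last `window_seconds` seconds
    (e.g. page views in the last hour), discarding older data."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps, oldest first

    def record(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Evict everything that has fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

views = SlidingWindowCounter(window_seconds=3600)
for t in (10, 20, 3000, 4000):
    views.record(t)
print(views.count(now=4100))  # prints 2: only the events at 3000 and 4000 remain
```

Real streaming systems use fancier structures (time-bucketed counters, decaying averages), but the core idea is the same: the window, not the full history, defines the answer.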
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Apr. 04 to Apr. 10). This week's best include a discussion of interoperability in the Internet of Things (IoT) discipline, a look at Apache Spark, and an adventure into Lucene's indexing process.
For a look at what's been happening outside of the Big Data Zone, we've assembled a collection of links including the 30 best tools for data visualization, different perspectives on Hadoop and related tools, New Relic's Splunk-style Analytics, and the role of big data in the rise of the Internet of Things (IoT).
In this post the author summarizes their notes from a conference that included the following topics: Is Big Data a Big Hype? How do you make sense out of your Big Data? Do we need a new role for Chief Data Officer? What is the business value behind Big Data? Is there a good visualization tool for Big Data?
The solution was to “build the house of data,” and for the time being that means using Hadoop for what the company internally calls “hadumping.”
How do you do sorting on a field value? The answer is, not easily.
What is going on here is that the commentators are assuming we live in a noise-free world. However, the world is noisy — real data are subject to random fluctuations, and are often also measured inaccurately. So to interpret every little fluctuation is silly and misleading.
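To make that concrete, here is a small simulation (my own sketch, not data from the article): a series whose true level never moves still shows month-to-month changes that invite over-interpretation.

```python
import random

# Twelve "monthly" readings from a process whose true level is a
# constant 100, plus Gaussian measurement noise (illustrative only).
rng = random.Random(42)
series = [100 + rng.gauss(0, 5) for _ in range(12)]

# Month-to-month changes: any "trend" read into these is an artifact
# of random fluctuation, since the underlying level is flat.
changes = [round(b - a, 1) for a, b in zip(series, series[1:])]
print(changes)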
The author of this article uses data concerning First Nations libraries in Ontario to demonstrate variations of data visualization in R.
Simpson’s Paradox is a phenomenon in which a trend identified in a population is reversed when investigated at the sub-population level. Think about that again – conclusions drawn from an overall set of data are not indicative of the behavior of the underlying subsets.
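The reversal is easiest to see with numbers. Below is a sketch using the well-known kidney-stone treatment data often cited for this paradox (my own illustration, not taken from the article): treatment A beats B within each subgroup, yet B looks better when the subgroups are pooled.

```python
# (successes, total) per treatment and stone size
data = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

for size in ("small", "large"):
    a = rate(*data["A"][size])
    b = rate(*data["B"][size])
    assert a > b  # A wins in every subgroup...
    print(f"{size}: A={a:.0%} vs B={b:.0%}")

overall = {
    t: rate(sum(s for s, _ in g.values()), sum(n for _, n in g.values()))
    for t, g in data.items()
}
print(f"overall: A={overall['A']:.0%} vs B={overall['B']:.0%}")
assert overall["B"] > overall["A"]  # ...but B wins overall
```

The reversal happens because A was disproportionately assigned the hard (large-stone) cases, so the pooled numbers mix unlike groups.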
Continuing his trip into the Lucene codebase, the author now looks at the indexing process as it happens. Interestingly enough, that is something he never really had to look at before.
In this recap of a podcast with Bikas Saha and Arun Murthy, the author hears about some of what is in Hadoop 2.4 and what is coming in 2.5.
Apache Spark is an increasingly popular alternative that replaces MapReduce with a more performant execution engine while still using Hadoop HDFS as the storage engine for large data sets.
The Data Platform Group at Microsoft does a lot, from SQL Server and their Hadoopey HDInsight offering through to Business Intelligence and analytics capabilities which sit in or on top of the humble Excel spreadsheet.
This installment of Arthur Charpentier's regular collection of data science-related links includes "'Big Data' the Buzzword vs. Data the Actual Thing," the influential interaction between data visualization and storytelling, what big data can say about relationships between world leaders, and more.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (Mar. 28 to Apr. 03). This week's best include the Apache Solr/Lucene 4.7.1 announcement, a discussion of how some tools can make things harder instead of easier, and an overview of the upcoming ApacheCon.
One convenience of this implementation is that we can deploy the above classes in a jar under the Solr core's lib directory. We do not need to overhaul the Solr source code or deal with deploying "custom" Solr shards.