I’ve been playing around with Python over the last few days while cleaning up a data set, and one thing I wanted to do was translate date strings into timestamps.
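A minimal sketch of that translation using only the standard library; the format string here is an assumption about what the date strings look like, and the result is pinned to UTC so it is deterministic:

```python
from datetime import datetime, timezone

def to_timestamp(date_string, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a date string and return a Unix timestamp in seconds."""
    dt = datetime.strptime(date_string, fmt)
    # Treat the parsed time as UTC so the result doesn't depend on the local zone.
    return dt.replace(tzinfo=timezone.utc).timestamp()

print(to_timestamp("2014-10-24 12:00:00"))  # 1414152000.0
```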
The hts package for R allows for forecasting hierarchical and grouped time series data. The idea is to generate forecasts for all series at all levels of aggregation without imposing the aggregation constraints, and then to reconcile the forecasts so they satisfy the aggregation constraints.
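The hts package is R, but the reconciliation idea can be illustrated with its simplest scheme, bottom-up: discard the independently generated aggregate forecast and rebuild it from the bottom-level forecasts, so the hierarchy adds up by construction. This is a hypothetical Python sketch of that scheme, not the hts implementation:

```python
def bottom_up(base_forecasts):
    """Bottom-up reconciliation.

    base_forecasts: dict mapping series name -> forecast value, with
    'total' as the aggregate and every other key a bottom-level series.
    """
    reconciled = {k: v for k, v in base_forecasts.items() if k != "total"}
    # Replace the independently made total with the sum of its children.
    reconciled["total"] = sum(reconciled.values())
    return reconciled

# The base forecasts violate the aggregation constraint (100 != 40 + 55)...
base = {"total": 100.0, "A": 40.0, "B": 55.0}
# ...but the reconciled ones satisfy it by construction.
print(bottom_up(base))  # {'A': 40.0, 'B': 55.0, 'total': 95.0}
```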
In this post my aim is to get Hadoop up and running on an Ubuntu host, first in Local (Standalone) Mode and then in Pseudo-Distributed Mode.
Usually, by the time you have revised a paper, some references have been added and others dropped, so you need to spend some time checking that every entry in the bibliography is actually cited in the paper. I wanted to do that manually this weekend, but @3wen suggested writing a simple R function to scan the tex file (as well as the aux file, actually) to remove uncited references.
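The suggested function is in R, but the aux-scanning step is easy to sketch in any language: BibTeX records one `\citation{key1,key2}` line in the .aux file per `\cite` command, so collecting those keys tells you what is actually cited. A hedged Python illustration (the function name and sample are mine, not from the post):

```python
import re

def cited_keys(aux_text):
    """Collect the citation keys recorded in a LaTeX .aux file."""
    keys = set()
    for match in re.findall(r"\\citation\{([^}]*)\}", aux_text):
        keys.update(k.strip() for k in match.split(","))
    return keys

aux = "\\citation{smith2010}\n\\citation{doe2012,smith2010}\n"
print(sorted(cited_keys(aux)))  # ['doe2012', 'smith2010']
```

Any bibliography entry whose key is absent from this set is a candidate for removal.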
Andrews curves are a method for visualizing multidimensional data by mapping each observation onto a function. It has been shown that Andrews curves preserve means, distances (up to a constant), and variances, which means that data points whose Andrews curves lie close together are likely to be close together themselves.
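The mapping itself is a Fourier-style series: an observation x = (x1, x2, x3, ...) becomes f_x(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ..., plotted over t in (-π, π). A small self-contained sketch of the function:

```python
import math

def andrews_curve(x, t):
    """Evaluate the Andrews function of observation x at angle t:
    f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    """
    value = x[0] / math.sqrt(2)
    for i, coord in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # frequencies go 1, 1, 2, 2, 3, 3, ...
        trig = math.sin if i % 2 == 1 else math.cos
        value += coord * trig(k * t)
    return value

# At t = 0 the sine terms vanish: f(0) = 1/sqrt(2) + 3 for x = (1, 2, 3).
print(andrews_curve([1.0, 2.0, 3.0], 0.0))
```

Evaluating this over a grid of t values for each row of a data set and plotting the resulting curves gives the visualization.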
Deep belief networks have made it possible to train computers to predict whether a sentence is positive, negative, or neutral. Sentiment analysis mostly makes headlines through the analysis of tweets. However, are there business applications beyond social networking analytics? Here are five examples:
In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant. On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models which has a whole slide explaining its meaning:
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (October 17 - October 24). This week's topics include an Apache Hadoop FAQ for executives, information retrieval with Apache Lucene and Tika, and validating configuration.
This is a sequel to what was presented in parts 1 and 2 of this tutorial; after indexing and querying, we can highlight the results of a search by making use of Highlighters.
A sequel to what was implemented in Part 1 of this tutorial; we continue indexing and improving search conditions through different features provided by the Apache Lucene library.
This tutorial will explain the Lucene and Tika frameworks through their core concepts (parsing, MIME detection, indexing, scoring, boosting) via illustrative examples that should be applicable not only to seasoned software developers but also to beginners in content analysis and programming.
Apache Hadoop has slowly been infiltrating the mainstream business world, but many executives are still left with doubts about whether adopting Hadoop is a sound strategy for their organization. Is Hadoop enterprise friendly? Is it economical for an organization to use?
Review of "Scaling Apache Solr" book.
Every week here and in our newsletter, we feature a new developer/blogger from the DZone community to catch up and find out what he or she is working on now and what's coming next. This week we're talking to Ashwini Kuntamukkala, Software Architect at SciSpike, Inc.
When I’m working with people on Hadoop, I ask what you would think is a simple question: what version of Hadoop are you using? In reality, though, it’s not as straightforward as you might think.
One question which pops up again and again when I talk about streamdrill is whether the same thing couldn’t be done with X, where X is one of Hadoop, Spark, Go, or some other piece of Big Data infrastructure. The truth is that there’s a huge gap between “in principle” and “in reality”, and I’d like to spell this difference out in this post.
What exactly is a data platform? Get a better understanding of big data and its applications. In this article I’ll be talking about the HortonWorks Data Platform as a reference platform.
With so many events taking place it can be a very daunting task finding the one that perfectly fits your interests and needs. That being said, I’ve done some research and compiled a comprehensive list of 22 Big Data and Business Intelligence events that you must attend during Q4 of 2014.
A crash course on JSON Schema. A nearly complete coverage of the Draft 4 specification, in brief.
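As a taste of what the Draft 4 specification covers, here is a small hypothetical schema (the property names are illustrative, not from the post):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer", "minimum": 0 }
  },
  "required": ["name"]
}
```

Against this schema, `{"name": "Ada", "age": 36}` validates, while `{"age": -1}` fails twice: `name` is missing from `required`, and `age` violates `minimum`.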
I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.
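The book does its fitting with R’s lm(); as a language-agnostic reminder of what simple linear regression computes, here is the closed-form ordinary-least-squares fit in plain Python, on toy data rather than the meetup set:

```python
def simple_linear_regression(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x  # intercept passes through the means
    return a, b

# A perfectly linear toy set, y = 1 + 2x, so the fit recovers a=1, b=2.
print(simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```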
In my own experience as an editor who covers BI, I read numerous BI articles, and I have found that despite the disproportionately low number of women in technology, many of the articles I’ve read were authored by women. In BI, the works of women have provided great insight and thought leadership to the community, and I personally want to list nine of the top women writers who have helped shape my view of BI.
I’ve been playing around with R data frames a bit more, and one thing I wanted to do was derive a new column based on the text contained in an existing column.
In the introduction of Computational Actuarial Science with R, there was a short paragraph on how to import only some parts of a large database by selecting specific variables.
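The book does this in R, but the same column-selection idea works anywhere; a minimal Python sketch using the standard csv module (file layout and column names are made up for illustration):

```python
import csv
import io

def read_columns(f, wanted):
    """Read only the named columns from a CSV file object,
    skipping every other variable in the file."""
    reader = csv.DictReader(f)
    return [{k: row[k] for k in wanted} for row in reader]

data = io.StringIO("id,age,income,region\n1,34,52000,north\n2,29,48000,south\n")
print(read_columns(data, ["id", "income"]))
# [{'id': '1', 'income': '52000'}, {'id': '2', 'income': '48000'}]
```

For genuinely large files this streams row by row, so only the selected variables need to be kept in memory.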
I’ve been working through the exercises from An Introduction to Statistical Learning, and one of them required you to create a pairwise correlation matrix of the variables in a data frame.
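In R this is a one-liner with cor(); to make the computation concrete, here is a dependency-free Python sketch that builds the same pairwise Pearson correlation matrix for a toy "data frame" (a dict of columns):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(frame):
    """Pairwise correlations for a dict of column name -> values."""
    cols = list(frame)
    return {a: {b: pearson(frame[a], frame[b]) for b in cols} for a in cols}

# y is a positive multiple of x, z decreases as x increases.
frame = {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0], "z": [3.0, 2.0, 1.0]}
m = correlation_matrix(frame)
print(round(m["x"]["y"], 3), round(m["x"]["z"], 3))  # 1.0 -1.0
```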
At the Heidelberg Laureate Forum I had a chance to interview John Tate. In his remarks below, Tate briefly comments on his early work on number theory and cohomology. Most of the post consists of his comments on the work of Alexander Grothendieck.