This is a sequel to what was presented in Part 1 and Part 2 of this tutorial; after indexing and querying, we can highlight the results of a search by making use of Highlighter(s).
A sequel to what was implemented in Part 1 of this tutorial; we continue indexing and improving search conditions through different features provided by the Apache Lucene library.
This tutorial explains the Lucene and Tika frameworks through their core concepts (parsing, MIME detection, indexing, scoring, boosting) via illustrative examples that should be accessible not only to seasoned software developers but also to beginners in content analysis and programming.
Apache Hadoop has slowly been infiltrating the mainstream business world, but many executives are still left with doubts about whether adopting Hadoop is a sound strategy for their organization. Is Hadoop enterprise friendly? Is it economical for an organization to use?
Review of "Scaling Apache Solr" book.
Every week here and in our newsletter, we feature a new developer/blogger from the DZone community to catch up and find out what he or she is working on now and what's coming next. This week we're talking to Ashwini Kuntamukkala, Software Architect at SciSpike, Inc.
When I’m working with people on Hadoop, I ask what you would think is a simple question: what version of Hadoop are you using? In reality, though, it’s not as straightforward as you might think.
One question which pops up again and again when I talk about streamdrill is whether that cannot be done by X, where X is one of Hadoop, Spark, Go, or some other piece of Big Data infrastructure. The truth is that there’s a huge gap between “in principle” and “in reality”, and I’d like to spell this difference out in this post.
With so many events taking place it can be a very daunting task finding the one that perfectly fits your interests and needs. That being said, I’ve done some research and compiled a comprehensive list of 22 Big Data and Business Intelligence events that you must attend during Q4 of 2014.
What exactly is a data platform? Get a better understanding of big data and its application. In this article I’ll be talking about the Hortonworks Data Platform as a reference platform.
A crash course on JSON Schema. A nearly complete coverage of the Draft 4 specification, in brief.
I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set.
In my own experience as an editor who covers BI, I read numerous BI articles, and I have found that despite the disproportionately low number of women in technology, many of the articles that I’ve read were authored by women. In BI, the works of women have provided great insight and thought leadership to the BI community, and I personally want to list nine of the top women writers who have helped shape my view on BI.
I’ve been playing around with R data frames a bit more and one thing I wanted to do was derive a new column based on the text contained in the existing column.
In the introduction of Computational Actuarial Science with R, there was a short paragraph on how we could import only some parts of a large database by selecting specific variables.
I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pairwise correlation matrix of variables in a data frame.
At the Heidelberg Laureate Forum I had a chance to interview John Tate. In his remarks below, Tate briefly comments on his early work on number theory and cohomology. Most of the post consists of his comments on the work of Alexander Grothendieck.
OpenCSV is one of the best tools for CSV operations. We will see how to use OpenCSV for basic reading and writing operations.
Recently I authored a section of the DZone Guide for Big Data 2014. I wrote about MapReduce and the evolution of Hadoop.
When developers talk about using data, they are usually concerned with ACID, scalability, and other operational aspects of managing data. But data science is not just about making fancy business intelligence reports for management. Data drives the user experience directly, not after the fact.
Want to learn more about "self-service" BI programs? Find out why many organizations are looking to leverage these technologies and programs in their quest to become more data-driven.
In my continued playing around with meetup data I wanted to plot the number of members who join the Neo4j group over time. I wanted to plot the actual count alongside a rolling average for which I created the following data frame:
Hadoop isn't the only big data tool out there. Check out this list of big data tools available.
The marketing departments of software vendors have done a good job making Big Data go mainstream, whatever that means. The promise is that we can achieve anything if we make use of Big Data: business insight and beating our competition into submission. Yet there is no well-publicised successful Big Data implementation. The question is: why not?
With MapReduce, companies no longer need to delete old logs that are ripe with insights—or dump them onto unmanageable tape storage—before they’ve had a chance to analyze them. Today, the Apache Hadoop project is the most widely used implementation of MapReduce.