When the author went on Toronto Open Data’s website and found a dataset of licensed child care centers throughout Toronto, he thought he might have a fun time analyzing a topic that he thankfully has not had to deal with thus far! In this article, you'll find the process of mapping buildings and the R code to do it.
This presentation from Hilary Mason at devs love bacon is an introduction to machine learning for those who have no prior experience with it. Take a look if you're interested in a quick, fun overview to help you get started.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include a reflection on curing cancer with data visualization, how to compare word counts in two text documents using R, and working with Java 8 Lambda expressions and JDBC.
In this article, you'll find a top 100 list of the most popular Java libraries, based on 10,000 GitHub projects and an analysis of the top trends in Java. Like the author, you may be surprised by some of the results.
Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses and 8,000 check-in sets. We are interested in finding out whether it is possible to visually cluster businesses by category based on their check-in data.
The possibilities in the field of mobile healthcare seem enormous. In the UK at least, much of community health is delivered in a labour-intensive way, with professionals either going out to households or patients coming into GP surgeries.
Experienced developers interested in learning more about programming in R have a fantastic resource in John Cook's "R programming for those coming from other languages." Cook's guide is to-the-point and concise, and focuses on the information needed to become productive with R, without a lot of fluff.
Data access, specifically SQL access from within Java, has never been nice. This is in large part because the JDBC API has a lot of ceremony. In this article, you'll learn how to make SQL access easier in Java using Java 8 Lambda expressions and Streams.
You’ve just found a bowl you know nothing about. You start pulling out marbles and the first 99 marbles are red. Will the 100th marble be red as well? D’oh. But is it really that obvious? How can you be sure?
This recent article discusses the emerging field of web science and the increasing popularity of Python over previous standards, such as Stata and R, as the language of choice for data analysis.
One of the questions the author tends to get is what happens with a SolrCloud cluster when ZooKeeper fails. Not a single ZooKeeper instance failure, but the whole ensemble not being accessible. Because the answer to this question is easy to verify, the author decided to show what happens when ZooKeeper fails.
This installment of Arthur Charpentier's regular collection of data science-related links includes 5 ways to work with Big Data in R, "Statistical inference in massive data sets," how to analyze your network of Facebook friends with R, and more.
Recently, the Apache Lucene and Solr PMC announced version 4.6 of the Apache Lucene library and the Apache Solr search server. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.
The Python vs. R article cited in this post clarifies the reasons why a programming language is a better choice than a "tool" or "platform."
This installment of Arthur Charpentier's data science-related links includes an article on Lucien Le Cam and Bayes, a discussion of the multilingual cyberspace, a visualization of income disparity over time, and more.
As an old Spring Data fan, when I found out Spring Data offered a Solr module, I jumped at the chance to try it.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include a history of Python's emergence as the language of choice for data science, how to integrate R with Cloudera Impala, an introduction to machine learning with R, and more.
Everyone thought that The Netherlands Cancer Institute’s 12-year-old dataset on breast cancer was old news. That was until a researcher, Pek Lum, analyzed and visualized the dataset using topological data analysis (TDA) and advanced machine learning technology.
Here is a pair of code samples in R designed to compare word counts in two pieces of text. The first attempts to reinvent the wheel, while the second utilizes the capabilities of existing packages.
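The article's samples are in R and are not reproduced here. As a rough illustration of the idea only, here is a minimal Python sketch that leans on the standard library rather than reinventing the wheel. The function names `word_counts` and `compare_counts`, the tokenization rule, and the two sample sentences are all assumptions made for this example, not taken from the article.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each run of letters/apostrophes as a word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def compare_counts(text_a, text_b):
    """Return {word: (count_in_a, count_in_b)} for every word in either text."""
    a, b = word_counts(text_a), word_counts(text_b)
    # A Counter returns 0 for missing keys, so words unique to one text work too.
    return {word: (a[word], b[word]) for word in sorted(a.keys() | b.keys())}


# Hypothetical sample inputs, chosen only to show the shape of the result.
comparison = compare_counts("the cat sat on the mat", "the dog sat")
for word, (in_a, in_b) in comparison.items():
    print(f"{word:<5} {in_a} {in_b}")
```

Using `Counter` keeps the counting itself to a single line, which is the same trade-off the blurb describes: the hand-rolled version teaches the mechanics, while the library version is shorter and harder to get wrong.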
When’s the last time you saw a photo of Nessie, Sasquatch, or a Martian ship? When few carried cameras, the rare photo of the unexplained (a grainy image in a Scottish lake, or a flash of light in the sky that seemed to hover) was enough to create a broadly spread story. No longer.
Implementing basic learning algorithms is an important step, but in a way, it's also the simple step. The hard part is integrating all these learning methods into a whole.
In this installment of Arthur Charpentier's data science-related links, you'll find a visualization of Waldo's locations on the page, the reason "why Python is steadily eating other languages," the first in a series of posts on "recommendations engine," and more.
This recent tutorial demonstrates how to use non-Java languages (R, in particular) to work with Hadoop data through MapReduce and Hive. Though the tutorial focuses on R, it is also meant to open doors for users working with other tools, such as Python, Ruby, and Linux commands or shell scripts.
Impala uses Hadoop as a storage engine, but moves away from MapReduce algorithms toward distributed queries. Also, R can be integrated with Impala to provide fast, interactive queries running on top of Hadoop data sets. The data can then be further processed or visualized within R.