One of the author's clients has some heavy-duty requirements for boosting functions. It's right on the boundary of what he thinks is appropriate for Solr. He might as well make the boosting functions as readable and well-organized as possible, though, so let's take a look at his strategy.
This set of slides presents an introduction to machine learning with R. It covers the strong points of R as a language and the basic concepts and uses of machine learning, providing an overview of each topic complete with code samples in R and images of the visualized data.
For Big Data Analytics, Power Query enables users to easily discover, combine, and refine data for better analysis in Excel, and Power Map lets them plot geographic and temporal data visually, analyze that data in 3D, and create interactive tours to share with others.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone. This week's best include techniques for estimating ages based on first names, a new R virtual machine written in Java, and a collection of data science-related links, including "The Hidden Technology That Makes Twitter Huge."
Nowadays, Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on small and medium-sized data sets. And rightly so, I think, given the large number of available tools. However, it wasn't always like this.
This installment of Arthur Charpentier's regular collection of data science-related links includes a free e-book on "Applied Epidemiology Using R," an argument that statistics are the least important part of data science, and what every programmer should know about memory.
Those of you who work with R (or Java) might be interested in FastR, an R virtual machine written in Java. R is not the fastest or most efficient language out there, and FastR aims to improve on it in a number of ways.
The book introduces the IPython basics and then focuses on how to combine IPython with some of the most useful libraries for data analysis, such as NumPy, Matplotlib, Basemap, and pandas. Every topic is covered with examples, and the code presented is also available online.
The author did some research into DisMax-type searching. This is a Solr search type that you can use simply by adding it to your cfsearch tag. In this article, you'll learn some of the advantages of DisMax, as well as some issues the author faced in implementing it.
In the author's last post, he wrote about how he compiled a US Social Security Administration dataset into something usable in R, and mentioned some issues with scaling it up to handle bigger datasets. This time, he shows how he prepped the dataset to be more scalable.
This new tool seems to be centered on an easy-to-use interface for analyzing Hadoop data, likely aiming to be more accessible to less technical employees, among other things. It lets users manipulate data in Excel and then scale that work out to datasets stored in Hadoop.
This installment of Arthur Charpentier's collection of data science-related links includes time series analyses in R, an article about "Probability, Gambling and the Origins of Risk Management," an analysis of The Economist's explanation of statistical significance, and more.
After reading a post with lists of the trendiest names in US history, the author decided to compile the lists using R. In this post, the author discusses building a dataframe, as well as writing a function to query it.
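The post builds its dataframe and query function in R; as a rough, language-agnostic sketch of the same idea, here is a hypothetical Python/pandas version. The toy data and column names are assumptions for illustration, not the author's actual dataset.

```python
import pandas as pd

# Hypothetical toy data; the real post compiles the full SSA names dataset.
names = pd.DataFrame({
    "name":  ["Mary", "Mary", "Linda", "Linda", "Linda"],
    "year":  [1940, 1950, 1940, 1947, 1950],
    "count": [56200, 65500, 9800, 99700, 41300],
})

def query_name(df, name):
    """Return the per-year counts for a single name, sorted by year."""
    return df[df["name"] == name].sort_values("year").reset_index(drop=True)

linda = query_name(names, "Linda")
peak_year = int(linda.loc[linda["count"].idxmax(), "year"])
```

With a query function like this, finding a name's peak year is a one-liner, which is essentially what "trendiness" calculations over the dataframe boil down to.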
The following problem illustrates how the smallest changes to a problem can have large consequences. As explained at the end of the post, this problem is a little artificial, but it illustrates difficulties that come up in realistic problems.
The recommendation lists in the author's previous post suffer from a lack of diversity. For example, a list may contain the same book in soft cover, hard cover, and Kindle versions. Because users' interests are diverse, a better recommendation list should contain items that cover a broad spectrum of the user's interests.
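One simple way to attack the duplicate-editions problem (a sketch of the general idea, not the author's actual method) is to collapse all editions of a work to a canonical id and keep only the top-scoring one. The `canonical_id` mapping here is a hypothetical stand-in for whatever edition-grouping data is available:

```python
def diversify(recommendations, canonical_id, k=10):
    """Greedily keep only the top-scoring edition of each underlying work.

    recommendations: (item, score) pairs sorted by score, descending.
    canonical_id:    function mapping an item to the work it represents,
                     so all formats of one book collapse to a single id.
    """
    seen, result = set(), []
    for item, score in recommendations:
        work = canonical_id(item)
        if work in seen:
            continue  # skip duplicate editions of a work already listed
        seen.add(work)
        result.append((item, score))
        if len(result) == k:
            break
    return result

recs = [("dune-hardcover", 0.9), ("dune-kindle", 0.85), ("hyperion-paperback", 0.6)]
top = diversify(recs, canonical_id=lambda item: item.split("-")[0], k=2)
```

Deduplication only fixes the format problem; covering a broad spectrum of interests takes a diversity-aware re-ranking on top of it.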
This installment of Arthur Charpentier's regular collection of data science-related links includes analyzing baseball data with R, a profile of a sword-swallowing statistician, and the technology behind Twitter that creates a massive network of data.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone! This week's best include the open source announcement of Facebook's Presto, an analysis of the multi-armed bandit algorithm, and practical uses for Big Data in terms of social and economic efficiency.
Logentries processes over 10 billion log events every day. That’s quite a lot of data. So, the Logentries research team decided to take advantage of their unique position and set out to examine a sample of their overall user base for insights: 22 billion log events from over 6,000 Heroku applications.
Since I started using ff and ffbase, I have resorted to saving and loading my ff dataframes with ffsave and ffload. The syntax isn't bad, but the process your computer goes through to save and load an ff dataframe is a bit cumbersome. For that reason, I was happy to learn about an alternative.
Today, the author wanted to publish a post on generating functions, partly because his students just finished their Probability exam (and a few questions were related). In this article, you'll look at generating functions and how to work with them in R.
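As background (standard definitions, not taken from the post itself): for a nonnegative integer-valued random variable \(X\), the probability generating function is

```latex
G_X(t) = \mathbb{E}\left[t^X\right] = \sum_{k=0}^{\infty} \Pr(X = k)\, t^k ,
\qquad G_X'(1) = \mathbb{E}[X],
```

and for independent \(X\) and \(Y\), \(G_{X+Y}(t) = G_X(t)\,G_Y(t)\), which is what makes generating functions so convenient for sums of independent variables, the kind of question that shows up on probability exams.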
Users of the cluster manager Apache Mesos might find this recent set of tutorials from Mesosphere to be useful. The tutorials cover a lot of ground, focusing on Hadoop, Spark, Chronos, and more, and look like a good start for anybody looking to work in new ways with Mesos.
Recently, the author was working with one of his students on mortality tables. Since the data has some missing values, they wanted to use Generalized Nonlinear Models. In this article, you will learn how to get a smooth estimator of the mortality surface and how to code an analysis of their dataset.
The author needed to write a classification algorithm to work out whether a person on the Titanic survived, and luckily, scikit-learn has extensive documentation on each of its algorithms. Unfortunately, most examples used NumPy data structures, and the author had loaded the data using pandas!
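The gap between the two is small in practice: a pandas DataFrame exposes its underlying NumPy arrays, which is what scikit-learn estimators expect. A minimal sketch (the toy frame, column names, and classifier choice here are illustrative assumptions, not the author's code):

```python
import pandas as pd
import numpy as np

# Hypothetical Titanic-style frame; the real data comes from a CSV.
df = pd.DataFrame({
    "pclass":   [3, 1, 3, 1],
    "age":      [22.0, 38.0, 26.0, 35.0],
    "survived": [0, 1, 1, 1],
})

# scikit-learn estimators take plain NumPy arrays, so pull them out
# of the DataFrame before calling fit(X, y):
X = df[["pclass", "age"]].values   # feature matrix, shape (n_samples, n_features)
y = df["survived"].values          # label vector, shape (n_samples,)

# clf = SomeClassifier().fit(X, y)  # then follow the NumPy-based examples as-is
```

Once `X` and `y` are plain arrays, the NumPy-centric examples in the scikit-learn documentation apply directly.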
In this paper, @Tom7 writes about an algorithm he’s developed over several weekends that plays classic NES games. The software consists of two utilities: learnfun and playfun. Learnfun watches you play a game and figures out what it means to win. Playfun uses that knowledge to play the game.
This recent article asks when big data will become profitable. That's not to say that there is no profit in big data, but that many companies are jumping on the data bandwagon expecting immediate results and finding that there's a lot more to it than just collecting data.