In case you missed them, here is a curated list of the best articles of this week from The Big Data Zone. This week: Randomization and probabilistic techniques to scale up machine learning, creating a skewed random discrete distribution in Python, Bayes factors versus P-values, a new Python podcast, and the Big Data Challenge.
These are only a few instances of probabilistic bounds being applied to solve real-world machine learning problems; there are many more. In fact, I find that the scalability of machine learning correlates directly with the application of probabilistic techniques to the model. As I mentioned earlier, the point of this post is to share some of my thoughts as I continue to learn techniques for scaling up machine learning models.
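The excerpt doesn't name a specific technique, but a classic example of trading exactness for scalability under a probabilistic bound is the Count-Min sketch, which estimates item frequencies in sub-linear memory. The sketch below is an illustrative implementation of that general idea, not code from the article:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in sub-linear memory.

    Standard guarantee: an estimate overshoots the true count by at
    most (e / width) * N with probability at least 1 - exp(-depth),
    where N is the total number of items added.
    """

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One salted hash per row; md5 used only as a cheap stand-in
        # for a proper family of pairwise-independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, idx in self._indexes(item):
            self.table[row][idx] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across
        # rows is the tightest (never under-) estimate.
        return min(self.table[row][idx] for row, idx in self._indexes(item))

cms = CountMinSketch(width=1000, depth=5)
for word in ["ted", "ted", "barney", "ted"]:
    cms.add(word)
print(cms.estimate("ted"))  # at least 3; exact unless all rows collide
```

The appeal for scaling is that memory is fixed by `width * depth`, independent of how many distinct items stream through.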
I’m planning to write a variant of the TF/IDF algorithm over the HIMYM corpus that weights in favour of terms appearing in a medium number of documents, and as a prerequisite I needed a function that, given a number of documents, returns a weighting.
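One simple way to build such a function is a Gaussian bump centred on a "medium" document count, so terms appearing in very few or very many documents are down-weighted. This is a sketch of the general shape, not the article's actual function; `peak_fraction` and `spread_fraction` are illustrative parameters I've introduced:

```python
import math

def medium_df_weight(doc_count, total_docs, peak_fraction=0.3, spread_fraction=0.15):
    """Return a weight in (0, 1] that peaks when a term appears in a
    'medium' fraction of the corpus.

    peak_fraction and spread_fraction are hypothetical tuning knobs:
    the weight is maximal at peak_fraction * total_docs and falls off
    like a Gaussian with standard deviation spread_fraction * total_docs.
    """
    peak = peak_fraction * total_docs
    spread = spread_fraction * total_docs
    return math.exp(-((doc_count - peak) ** 2) / (2 * spread ** 2))

# Terms in ~30% of a 100-document corpus get the highest weight;
# near-ubiquitous and near-unique terms are penalised.
print(medium_df_weight(30, 100))   # 1.0 (the peak)
print(medium_df_weight(2, 100))    # small
print(medium_df_weight(95, 100))   # small
```

The resulting weight can then replace the usual IDF factor when scoring terms.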
Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions. The following illustration of this difference comes from a talk by Luis Pericci last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.
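I can't reproduce the exact numbers from that talk, but the standard illustration of this divergence (often called Lindley's paradox) is easy to sketch: in a normal model with known variance, fix the test statistic at z = 1.96, so the two-sided p-value stays at 0.05 regardless of sample size, and watch the Bayes factor for the null grow with n. The closed form below assumes H0: θ = 0 versus H1: θ ~ N(0, τ²):

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def bayes_factor_null(z, n, tau_over_sigma=1.0):
    """Bayes factor in favour of H0: theta = 0 against H1: theta ~ N(0, tau^2),
    for a normal mean with known sigma and z = sqrt(n) * xbar / sigma.
    Closed form for the conjugate normal-normal model."""
    r = n * tau_over_sigma ** 2
    return math.sqrt(1 + r) * math.exp(-(z ** 2 / 2) * (r / (1 + r)))

z = 1.96  # "just significant" at the 5% level for every n
for n in (10, 100, 10_000):
    print(n, round(two_sided_p_value(z), 3), round(bayes_factor_null(z, n), 2))
```

The p-value says "reject H0 at 5%" at every sample size, while the Bayes factor moves from mildly favouring H1 at n = 10 to favouring H0 by more than ten to one at n = 10,000 — a stark disagreement from the same data summary.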
I’m super excited to announce that I just launched a brand new podcast for Python developers called Talk Python To Me. This weekly podcast already has the first episode published and some amazing guests lined up.
Big Data analytics is now moving towards accurately defining data, handling it uniformly, and developing data-driven smart products.