I’ve been looking back over some of the early code I wrote using R before I knew about the dplyr library and thought it’d be an interesting exercise to refactor some of the snippets.
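A hypothetical before/after of the kind of refactoring described (the data and column names are illustrative, not the author’s actual snippets):

```r
library(dplyr)

# Toy data for illustration only
df <- data.frame(team = c("A", "A", "B"), goals = c(1, 3, 2))

# Before: base R aggregation
aggregate(goals ~ team, data = df, FUN = sum)

# After: the dplyr equivalent
df %>% group_by(team) %>% summarise(goals = sum(goals))
```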
The release of the latest JBoss Developer Studio (JBDS) brings with it questions about how to get started with the various JBoss Integration and BPM product tool sets that are not installed out of the box.
A few months ago I wrote a blog post explaining how to dynamically/programmatically group a data frame by a field using dplyr, but that approach has been deprecated in the latest version. It turns out the ‘group_by_’ function doesn’t want to receive a list of fields, so let’s remove the call to list.
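A sketch of that change, with illustrative data and a field name of my own (not the original post’s exact code):

```r
library(dplyr)

df    <- data.frame(name = c("a", "a", "b"), value = 1:3)
field <- "name"

# Deprecated approach: wrapping the field in a list
# df %>% group_by_(list(field)) %>% summarise(count = n())

# Fixed: pass the field name directly to group_by_
df %>% group_by_(field) %>% summarise(count = n())
```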
I’ve been looking through the code from Martin Eastwood’s excellent talk ‘Predicting Football Using R’ and was intrigued by the code which reshapes the data into the format expected by glm. I really like dplyr’s pipeline operator, so I thought I’d try to translate Martin’s code to use that and other dplyr functions.
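An illustrative reshape in the same spirit (invented results and column names of my own, not Martin’s code): each match becomes two rows, one per team, which is the long format glm() expects for a Poisson goals model.

```r
library(dplyr)

matches <- data.frame(home   = c("Arsenal", "Chelsea", "Spurs"),
                      away   = c("Chelsea", "Spurs", "Arsenal"),
                      hgoals = c(2, 1, 1),
                      agoals = c(1, 2, 3))

# One row per team per match: goals scored, opponent faced, home indicator
reshaped <- bind_rows(
  matches %>% transmute(team = home, opponent = away, goals = hgoals, at_home = 1),
  matches %>% transmute(team = away, opponent = home, goals = agoals, at_home = 0)
)

model <- glm(goals ~ team + opponent + at_home, family = poisson, data = reshaped)
```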
The “frequency” is the number of observations per season. This is the opposite of the definition of frequency in physics, or in Fourier analysis, where “period” is the length of the cycle and “frequency” is the inverse of the period. When using the ts() function in R, use the following choices: frequency 1 for annual data, 4 for quarterly, 12 for monthly, and 52 for weekly.
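A quick illustration (my own example, not from the post): two years of monthly data, so frequency = 12.

```r
# Monthly series starting January 2013
x <- ts(rnorm(24), start = c(2013, 1), frequency = 12)
frequency(x)   # 12

# Quarterly data would use frequency = 4, weekly data frequency = 52
q <- ts(rnorm(8), start = c(2013, 1), frequency = 4)
```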
While working across multiple data science projects, I observed that a group of strategic projects share a similar pattern, one to which a common methodology can be applied. In this post, I want to sketch this methodology at a high level.
When working with large amounts of data, backing up and, if necessary, restoring it is an important requirement. Elasticsearch has a snapshot and restore module that addresses this need.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (November 07 - November 14). This week's topics include learning R, sample range estimates, extracting datasets in Excel, R class for wildfire scientists, and converting a named vector to a data frame in R.
This article endeavors to explain how Big Data will bring about changes in information processing in the IT world. Its aim is to reach out to people seeking clarity on this concept, which has been surrounded by so much hype.
I’ve been playing around with igraph’s page rank function to see who the most central nodes in the London NoSQL scene are, and I wanted to put the result in a data frame to make the data easier to work with.
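A minimal sketch with a toy graph (the real analysis used London NoSQL meetup data; the names here are placeholders):

```r
library(igraph)

g  <- graph_from_literal(Alice-Bob, Bob-Carol, Carol-Alice, Carol-Dave)
pr <- page_rank(g)$vector          # named numeric vector of PageRank scores

# Convert the named vector into a data frame and sort by centrality
df <- data.frame(name = names(pr), pageRank = as.numeric(pr),
                 stringsAsFactors = FALSE)
df[order(-df$pageRank), ]
```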
This is SAS's view on big data. The article discusses how big data can be used to make better decisions, cut costs, and gain competitive advantages.
The title of the post is a bit long, but that’s the problem I was facing this morning: importing datasets from files, online. I mean, it was not a “problem”, more a challenge (I should be able to do it in R, directly).
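A minimal sketch of the idea: read.csv() can take a URL directly (the URL below is a placeholder, not the dataset from the post).

```r
url <- "https://example.com/data.csv"   # hypothetical remote file
df  <- read.csv(url, stringsAsFactors = FALSE)
head(df)
```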
Mazama Science has just finished creating class materials on using R for the AirFire team at the USFS Pacific Wildland Fire Sciences Lab in Seattle, Washington. Autodidacts new to R should take about 20-30 hrs to complete the course.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (October 31 - November 07). This week's topics include getting started with Hadoop and MapReduce, data structural integrity, Big Data goals, prediction intervals, JSF versus JSP with CRUD applications.
This article presents some of the basic concepts you need to understand in order to write a “Hello world” program in the R programming language.
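For reference, the canonical first program in R:

```r
print("Hello, world!")

# or, without the [1] index prefix that print() adds to its output
cat("Hello, world!\n")
```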
Big data has a large role to play in the real estate industry.
I’ve been doing some work with Focused Objective lately, and today the following question came up in our discussion. If you’re sampling from a uniform distribution, how many samples do you need before your sample range has an even chance of covering 90% of the population range?
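A quick Monte Carlo check of the question (my own sketch, not the original analysis): for each sample size n, estimate the probability that the range of n draws from Uniform(0, 1) covers at least 90% of the population range.

```r
set.seed(42)
covers <- function(n, trials = 10000) {
  # Fraction of trials where the sample range spans at least 0.9
  mean(replicate(trials, diff(range(runif(n))) >= 0.9))
}
sapply(15:20, covers)   # the probability first exceeds 0.5 around n = 17
```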
We make decisions every day; everything we say and do is the result of a decision, whether we make it consciously or not. No matter how big or small the choice, there's no (easy) formula for making the right decision.
Don’t be afraid of these new applications. They are coming whether you like it or not. Embrace them and understand them as best you can. Then sit back and think about what the network can do for them. You can significantly impact their ability to perform.
Salesforce, the world’s largest enterprise cloud computing company, has recently unveiled “Wave”, its new enterprise business intelligence solution.
The Hadoop MapReduce framework provides a way to process large datasets in parallel on large clusters of commodity hardware. This is an update to an earlier version of the article.
While there is certainly plenty of feel-good hyperbole about the “making the world a better place” nature of big data, it is more than offset by real-world details of how data is being used to solve day-to-day business problems.
Can an image capture an entire system's structural integrity? Can we tell from a graphic whether a system is well-structured? The Blighttown corollary highlights the importance of a good package structure, as this structure will probably constrain the quality of the entire system's structure.
Almost all prediction intervals from time series models are too narrow. This is a well-known phenomenon and arises because they do not account for all sources of uncertainty: the random error term, the parameter estimates, the choice of model, and the assumption that the historical data generating process will continue into the future. When we produce prediction intervals for time series models, we generally only take into account the first of these sources.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (October 24 - October 31). This week's topics include Twitter data analysis, running Hadoop on Ubuntu, information retrieval with Apache Lucene, a method for data visualization, and removing references with R.