In my continued playing around with plyr’s ddply function, I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values, and I ran into a strange (to me) error message.
In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream of integrating Fast Data and Big Data.
In the past few years, almost every part of our lives has become dependent on real-time applications. Whether it is updating our friends on every move we make via social media or shopping on e-commerce websites, we have become completely dependent on getting the correct information quickly.
Over the last couple of years I’ve worked on several proof-of-concept-style Neo4j projects, and on a lot of them people have wanted to work with their entire data set, which I don’t think makes sense so early on.
There is frequent conversation about the explosive growth of Big Data in the age of wearables, compulsive social media and ever more capable computers, but when it comes down to gleaning useful insights from data, data scientists face more challenges with variety than with sheer volume.
In continuing my analysis of the London Neo4j meetup group using R, I wanted to see on which days of the week we organise meetups and how many people RSVP affirmatively on each day.
I’ve been doing some ad-hoc analysis of the Neo4j London meetup group using R and Neo4j. Having worked out how to group by certain keys, the next step was to order the rows of the data frame.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 27 to July 4). This week's topics include lying with data, Hadoop shell commands to manage HDFS, payloads in Solr, the origins of the term "big data," and interactive data visualization.
Recently we had M. C. Srivas, CTO and co-founder of MapR Technologies, as a speaker at our Munich Hadoop User Group. He gave a nice talk about the Apache Drill project, which is developing a tool that provides fast, interactive SQL on Hadoop and other data sources.
In Hadoop 1.x, there are some problems, for example, no high availability (HA) and poor handling of too many small files.
In Hadoop 2.x YARN, the HA problem is addressed. But only one of the masters in YARN can be active and serve clients, while the other masters stand by. I think this is a waste of master resources.
Scalability has become one of those core concept slash buzzwords of Big Data. It’s all about scaling out, web scale, and so on. In principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast.
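The scale-out idea can be sketched on a single machine with Python's multiprocessing module: the same piece of code is applied to a partitioned data set by a configurable number of workers, a toy stand-in for throwing more computers at the problem (the function and data here are hypothetical, purely for illustration).

```python
from multiprocessing import Pool

def count_words(chunk):
    # The "one piece of code" we want to scale: count words in a chunk of lines.
    return sum(len(line.split()) for line in chunk)

def scaled_word_count(lines, workers):
    # Split the data into one chunk per worker and process the chunks in parallel.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_words, chunks))

if __name__ == "__main__":
    data = ["big data is big"] * 1000
    # The result is the same whether we use 1 worker or 8 -- only the speed changes.
    print(scaled_word_count(data, workers=4))  # 4000
```

The point of the sketch is that `count_words` never changes as `workers` grows; only the partitioning and the amount of hardware do.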
Lucene & Solr 4.9 were released a couple of weeks ago and introduced a new result document transformer called ChildDocTransformerFactory.
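As a quick illustration (the field names here are hypothetical), the transformer is invoked through the `fl` parameter via the `[child]` tag, attaching matching child documents to each parent document in the results:

```
q=doc_type:parent
&fl=id,title,[child parentFilter=doc_type:parent childFilter=comment_text:solr limit=10]
```

`parentFilter` tells Solr which documents in the block-joined index are parents; the optional `childFilter` and `limit` restrict which children come back.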
In an ever-changing business world, people nowadays often find themselves stuck in various computing problems involving complex business rules.
There are plenty of books telling data scientists (whatever they are) and others how to visualise data, how to tell stories, and how to persuade.
I’ve been a bit frustrated whenever I discuss payloads in Solr by the lack of an example I could find that gave me all the pieces in a single place. So I decided to create one for Solr 4.0+ (actually, 4.8.1 at the time of this writing, but this should apply to the entire 4.x code line).
It may have come about innocuously enough, in an article written in 1989 about organizations mining data for marketing purposes.
Recently, the "interactive report" has become a hot topic in data visualization. I believe it is becoming the next-generation UI paradigm for KPI reports.
So the biggest revolution in database and analytics technology, namely the distributed batch-processing technique known as MapReduce (and the associated Hadoop-centric ecosystem that has built up around it), is a legacy technology for one Silicon Valley player.
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data analysts to write complex data transformations without knowing Java.
Here are the top 10 basic Hadoop HDFS operations, managed through shell commands, that are useful for managing files on HDFS clusters.
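A few of the most common operations look like this (the paths are illustrative, not from the article):

```shell
# List the contents of an HDFS directory
hadoop fs -ls /user/hadoop

# Copy a local file into HDFS
hadoop fs -put access.log /user/hadoop/logs/

# Print a file stored in HDFS to stdout
hadoop fs -cat /user/hadoop/logs/access.log

# Remove a file from HDFS
hadoop fs -rm /user/hadoop/logs/access.log
```

Each command mirrors its familiar Unix counterpart but operates on the distributed file system rather than the local disk.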
Teradata announced Teradata Aster R this week, which seeks to lift memory and processing limitations in open source R analytics.
Make sure you didn't miss anything with this list of the Best of the Week in the Big Data Zone (June 13 to June 20). This week's topics include Python, search relevancy, patent data and the nature of invention, scalable machine learning, and anomaly detection.
This week the Apache Lucene and Solr PMC announced version 4.9 of the Apache Lucene library and the Apache Solr search server. This is the next release in the 4.x line of both Apache Lucene and Apache Solr.
Packt Publishing has asked me to review their new book, Clojure for Machine Learning (4/2014) by Akhil Wali. Interested in both Clojure and machine learning, I have taken up the challenge and want to share my impressions of the first chapters.
At its developer conference on Wednesday, however, Google followed a burgeoning trend in dumping MapReduce in favor of what they're calling Google Cloud Dataflow.