Hadoop and Mahout in Data Mining
Introduction: Data Mining
This post walks through a typical solution architecture built from big data components such as Hadoop, Mahout, Sqoop, Flume and the Mongo-Hadoop Connector. Many successful companies have been around for at least 20 years and have invested heavily in IT infrastructure. Over that time they have accumulated transaction data and log archives, and they are eager to analyse this data to see whether it can improve business efficiency.
Below is the component diagram showing how each of the above frameworks fits into the ecosystem.
In a typical scenario, as customers use the company's IT systems to buy its products, customer data such as purchasing patterns, location, gender and the purchases of similar customers is stored for data mining. Based on this data, the company can help customers make buying decisions. If you look carefully, Netflix, Amazon, eBay and Gilt are already doing this.
Mahout is an open source framework with a built-in recommender engine. It also runs on Hadoop using the Map/Reduce paradigm. In a typical scenario:
- User-item data stored in MySQL is imported into the Hadoop Distributed File System (HDFS) using an ETL tool like Sqoop
- User click information stored in log files is exported into HDFS using log migration tools like Flume
- If the transaction data is stored in a NoSQL database, there are connectors, such as the Mongo-Hadoop Connector, to export the data to HDFS
Once the data is in HDFS, we can run a Mahout batch job to analyse it and then load the processed results back into the transactional database.
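To make the batch analysis step a little more concrete, here is a minimal sketch in plain Python of item-based recommendation by co-occurrence counting. It is not Mahout's API and it does not run on Hadoop; it only imitates, in memory, the group-then-count shape that a Map/Reduce recommender job might follow. The `purchases` data and the function names `group_by_user`, `cooccurrence` and `recommend` are all hypothetical, invented for illustration.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical (user, item) pairs standing in for the user-item data in HDFS.
purchases = [
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("bob", "lamp"), ("bob", "desk"),
    ("carol", "book"), ("carol", "desk"),
]

def group_by_user(pairs):
    """Map-like step: collect the set of items each user bought."""
    baskets = defaultdict(set)
    for user, item in pairs:
        baskets[user].add(item)
    return baskets

def cooccurrence(baskets):
    """Reduce-like step: count how often two items share a basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts

def recommend(user, baskets, counts, top_n=3):
    """Score items the user lacks by co-occurrence with items they own."""
    owned = baskets.get(user, set())
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

baskets = group_by_user(purchases)
counts = cooccurrence(baskets)
print(recommend("alice", baskets, counts))  # prints ['desk']
```

In a real deployment the grouping and counting would be distributed across the cluster, and the resulting recommendations would be written back to the transactional database rather than printed.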