
Scaling a Sales Recommendation Engine With Apache Spark and MongoDB


Last time, we covered building a sales recommendation engine with Apache Spark. To go one step further, I have replaced the file system layer with MongoDB.



In my last post on building a sales recommendation engine with Apache Spark, we built a standalone version of the spark-mllib ALS demonstration using local CSV files.

ml-bigdata-starter/README.md at master · ERS-HCL/ml-bigdata-starter

To go one step further, I have replaced the file system layer with MongoDB. MongoDB provides a Spark connector (mongo-spark-connector) that wraps the standard Java/Scala driver with Spark-compatible data formats and APIs.

Getting Started

Apart from the Spark core APIs, you need the following dependency to connect to the MongoDB server.


<dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
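For completeness, the Spark core and MLlib APIs used later need their own dependencies; a sketch (the versions here are illustrative, but the Scala suffix must match the connector's `_2.11`):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.2.1</version>
</dependency>
```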

Preparing the Datasets

In the MongoDB scenario, instead of creating files, we need the same data as JSON documents in collections.

Sales orders:

UserId  UserName  ProductId  ProductName  Rate  Quantity  Amount
1       User 1    1          Product 1    10    5         50
1       User 1    2          Product 2    20    10        200
1       User 1    3          Product 3    10    15        150
2       User 2    1          Product 1    10    5         50
2       User 2    2          Product 2    20    20        400
2       User 2    4          Product 4    10    15        150

Sales leads:

UserId  UserName  ProductId  ProductName
1       User 1    4          Product 4
1       User 1    5          Product 5
2       User 2    3          Product 3
2       User 2    6          Product 6
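The tables show display names, while the mapping code later in the article reads fields named userCode, productCode, and amount. That suggests collection documents shaped roughly like the following; the exact schema here is an assumption. A SalesOrders document:

```json
{ "userCode": 1, "userName": "User 1", "productCode": 1, "productName": "Product 1", "rate": 10, "quantity": 5, "amount": 50 }
```

And a SalesLeads document:

```json
{ "userCode": 1, "userName": "User 1", "productCode": 4, "productName": "Product 4" }
```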

We need to predict/recommend the most relevant products for both users based on their past order history. Here, we can see that both User 1 and User 2 ordered Product 1 and Product 2, and each also ordered one product the other did not (Product 3 and Product 4, respectively).

Now, we predict each user's rating for the product the other user ordered, as well as for a brand-new product.

Implementation

Step 1

Our first step is to create a Spark configuration with the MongoDB-specific connection properties.

SparkConf conf = new SparkConf()
    .setAppName("rnd")
    .setMaster("local")
    .set("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/sparkdb.myCollection")
    .set("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/sparkdb.myCollection");
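The snippets that follow reference a JavaSparkContext named jsc, which is never shown being created; a minimal sketch of that step, assuming the Spark dependencies are on the classpath:

```java
import org.apache.spark.api.java.JavaSparkContext;

// Build the context from the MongoDB-aware configuration above;
// every MongoSpark.load(...) call below goes through it.
JavaSparkContext jsc = new JavaSparkContext(conf);
```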

Step 2

Now you can read the training data via the JavaMongoRDD API and convert it to Rating objects using the JavaRDD map API.

private static JavaMongoRDD<Document> getJavaMongoRDD(JavaSparkContext jsc, String collName) {
    Map<String, String> readOverrides = new HashMap<String, String>();
    readOverrides.put("collection", collName);
    readOverrides.put("readPreference.name", "secondaryPreferred");
    ReadConfig readConfig = ReadConfig.create(jsc).withOptions(readOverrides);

    JavaMongoRDD<Document> mongoRDD = MongoSpark.load(jsc, readConfig);
    return mongoRDD;
}

JavaMongoRDD<Document> salesOrdersRDD = getJavaMongoRDD(jsc, "SalesOrders");

// Map each sales order document to a Rating(user, item, rating) tuple,
// using the order amount as the implicit rating
JavaRDD<Rating> ratings = salesOrdersRDD.map(new Function<Document, Rating>() {
    public Rating call(Document d) {
        return new Rating(d.getInteger("userCode"), d.getInteger("productCode"),
                ((Number) d.get("amount")).doubleValue());
    }
});

Step 3

The next step is to train the matrix factorization model using the ALS algorithm.


int rank = 10, numIterations = 10; // example hyperparameter values
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations);

Step 4

Now, we load the sales leads collection and convert it to user-product tuples.


JavaMongoRDD<Document> salesLeadsRDD = getJavaMongoRDD(jsc, "SalesLeads");

// Create user-item tuples from the sales leads
JavaRDD<Tuple2<Object, Object>> userProducts = salesLeadsRDD.map(new Function<Document, Tuple2<Object, Object>>() {
    public Tuple2<Object, Object> call(Document d) {
        return new Tuple2<Object, Object>(d.getInteger("userCode"), d.getInteger("productCode"));
    }
});

Step 5

Finally, we can predict the ratings for these user-product pairs with a single API call.


// Predict the ratings of the products not rated by user 
JavaRDD<Rating> recomondations = model.predict(userProducts.rdd()).toJavaRDD().distinct();

Step 6

Optionally, you can sort the output using a basic pipeline operation:


// Sort the recommendations by rating in descending order
recomondations = recomondations.sortBy(new Function<Rating, Double>() {
    @Override
    public Double call(Rating v1) throws Exception {
        return v1.rating();
    }
}, false, 1);

Step 7

Now, you can display your result using the basic JavaRDD API.


// Print the recommendations
recomondations.foreach(new VoidFunction<Rating>() {
    @Override
    public void call(Rating rating) throws Exception {
        String str = "User : " + rating.user() +
                " Product : " + rating.product() +
                " Rating : " + rating.rating();
        System.out.println(str);
    }
});
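Step 1 also configured spark.mongodb.output.uri, so instead of printing, the predictions could be written back to MongoDB. A minimal sketch, assuming a target collection named Recommendations (that collection name is not part of the original article):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.recommendation.Rating;
import org.bson.Document;

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.WriteConfig;

// Convert each predicted Rating into a MongoDB document
JavaRDD<Document> docs = recomondations.map(new Function<Rating, Document>() {
    public Document call(Rating r) {
        return new Document("userCode", r.user())
                .append("productCode", r.product())
                .append("rating", r.rating());
    }
});

// Write to the "Recommendations" collection of the configured output database
Map<String, String> writeOverrides = new HashMap<String, String>();
writeOverrides.put("collection", "Recommendations");
WriteConfig writeConfig = WriteConfig.create(jsc).withOptions(writeOverrides);
MongoSpark.save(docs, writeConfig);
```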

Output

User : 2 Product : 3 Rating : 54.54927015541634
User : 1 Product : 4 Rating : 49.93948224984236

Conclusion

The above output suggests that User 2 is most likely to buy Product 3 and User 1 is most likely to buy Product 4.

This also shows that there are no useful recommendations for the entirely new products (Product 5 and Product 6), as they have no similarity to anything in the past order history.


Topics:
spark, mllib, mongodb, recommendation engine, ai, tutorial

Opinions expressed by DZone contributors are their own.
