Product Catalog with MongoDB, Part 1: Schema Design
Join the DZone community and get the full member experience.
Join For FreeThis post is part of the Product Catalog MongoDB Series, in which we will cover many aspects of building a Product Catalog with MongoDB. This approach has been tested with a varied product catalog of 130 million items running on a single server (EC2 i2.2xlarge).
MongoDB seems to be the perfect fit to implement a product catalog since products maps so well to documents. But as we shall see it is not as easy as it seems! The data is fairly complex with many relationships involved. Also almost every other system will want to make use of the catalog instead of making its own copy, so typically a low latency, scalable and geo distributed catalog service is the ideal solution.
A product has at least the following information:
- Item: the overall product info (e.g. Levi’s 501)
- Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC
- Price: price information may vary based on the store, the variant, etc
- Hierarchy: the item taxonomy
- Facet: facets to search products by
- Vendors: a given sku may be available through different vendors if the site is a marketplace
A classic pitfall is to try to fit everything into a single document. As a result you end up with something very complex with many nested lists, which makes it difficult to navigate and index. Additionally APIs find themselves sending back massive documents even if only partial info is needed. In certain cases, we’ve seen items with 1000s of variant (e.g. Automotive part) which go beyond 16MB of pure JSON (in which case compression becomes mandatory)! Instead here we are going to model the data in a way that is natural, maps well to the API, at the sweet spot between normalization and denorm.
Item Model
The item collection has document representing the high level data of a product. Here is a sample item document for a shoe:
{ "_id": "054VA72303012P", "desc": [ { "lang": "en", "val": "Give your dressy look a lift with these women's Kate high-heel shoes by Metaphor. These playful peep-toe pumps feature satin-wrapped stiletto heels and chiffon pompoms at the toes. Rhinestones on each of the silvertone buckles add just a touch of sparkle to these shoes for a flirty footwear look that's made for your next night out." } ], "name": "Women's Kate Ivory Peep-Toe Stiletto Heel", "lname": "women's kate ivory peep-toe stiletto heel", "category": "/84700/80009/1282094266/1200003270", "brand": { "id": "2483510", "img": { "src": "http://i.sears.com/s/i/bl/image/spin_prod_metadata_168138610" }, "name": "Metaphor" }, "assets": { "imgs": [ { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_967112812", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945877912", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945878012", "width": "1900" } } ] }, "shipping": { "dimensions": { "height": "13.0", "length": "1.8", "width": "26.8" }, "weight": "1.75" }, "specs": [ { "name": "Heel Height (in.)", "val": "3.75" } ], "attrs": [ { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" }, { "name": "Upper Material", "value": "Synthetic" }, { "name": "Toe", "value": "Open toe" } { "name": "Brand", "value": "Metaphor" } ], "variants": { "cnt": 9, "attrs": [ { "dispType": "COMBOBOX", "name": "Width", }, { "dispType": "DROPDOWN", "name": "Color", }, { "dispType": "DROPDOWN", "name": "Shoe Size", } ] }, "lastUpdated": 1400877254787 }
Fields of interest:
- _id: the product id
- lastUpdated: useful timestamp to see recently updated
- category: the category path made up of hierarchy nodes
- name: the product name
- lname: a lower-case version of the name. This can be useful for doing case-insensitive matching with an index
- brand: the brand
- desc: list of descriptions (website, retail box, etc)
- assets: list of assets (images, etc)
- attrs: list of attributes as name-value pairs. Will be used to implement facetting. Note that the brand is also included as one attribute.
- variants: some information on variants, but not the variants themselves
Common queries (indexed):
- find by id:
{ _id: "the product id" }
- find by category prefix:
{ product.cat: { $regex: "^category prefix" } }
- find by case-insensitive name prefix:
{ product.lname: { $regex: "^name prefix" } }
Variant Model
The Variant documents represent specific variations of a product. Certain products only exist in a unique variant (e.g. XBox, no options to pick) whereas other products may have thousands. Here is a sample variant document for the same shoe:
{ "_id": "05458452563", "name": "Width:Medium,Color:Ivory,Shoe Size:6.5", "lname": "width:medium,color:ivory,shoe size:6.5", "itemId": "054VA72303012P", "altIds": { "upc": "632576103580" }, "assets": { "imgs": [ { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348512" }, { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348612" } ] }, "attrs": [ { "name": "Width", "value": "Medium" }, { "name": "Color", "family": "White", "value": "Ivory" }, { "name": "6.5", "value": "6.5" } ] }
Fields of interest:
- _id: the SKU
- itemId: the parent item id.
- attrs: a list of attributes specific to the variant. Note that some of the attributes may have both a specific value (e.g. ivory) and a family value (e.g. white).
- assets: assets specific to the variant (e.g. image with a specific color).
Common query parameters (indexed):
- find by SKU:
{ _id: "the sku" }
- find by item Id:
{ itemId: "item id" }
Hierarchy Model
The Hierarchy document represents a node of the hierarchical tree representing product taxonomy. The top level nodes represent departments, while further nodes represent specific categories.
{ "_id": "1200003270", "name": "Women's Heels & Pumps", "count": 223, "parents": [ "1282094266" ], "facets": [ "Heel Height", "Toe", "Upper Material", "Width", "Shoe Size", "Color" ] }
Fields of interest:
- _id: the category id
- name: the category name
- count: the number of items in this category. It can be a useful statistic to display.
- parents: list of parent nodes. Simpler implementations could make use of a single value.
- facets: list of facets that exist for this category (e.g. color, size). This info will be used when displaying the facets available in the searching page.
Common queries (indexed):
- find by parent id:
{ p: "parent id" }
- find top level departments:
{ p: null }
Facet Model
The facet document represent a name/value pair representing a product attribute.
{ "_id": "accessory type=hosiery", "name": "Accessory Type", "value": "Hosiery", "count": 14 }
Fields of interest:
- _id: the id, which is a concatenation of lower-cased facet name and value.
- name: the facet name with original casing, e.g. “Accessory Type”
- value: the facet value with original casing, e.g. “Hosiery”. Important note: here the value should be the family value is possible, e.g. “White” rather than “Ivory”. Those facets will be used for searching items, and the family value is better for that purpose.
- count: the number of items that have this facet. This count will be important in defining the order of attributes in a query when doing faceted search.
Common query parameters (indexed):
- find a specific facet:
{ _id: "name_value" }
- find facets for a name:
{ _id: { $regex: "^name_" } }
Price Model
The Price document obviously represents the price of an item, but there is quite a bit more to it. We want to be able to vary the price per variant (e.g. gold color is more expensive) or per store (e.g. online store is cheaper). While we will not touch on the store model in this post, let’s just imagine we have a few thousand stores which are grouped into a dozen store groups (e.g. online, west coast, etc). If we implement this naively, we would end up with 1000 stores x 200m variants = 2 billion price documents! Instead let’s be a bit smarter and make good use of MongoDB’s querying capability, by allowing to price products at different levels as needed, thus keeping the number of documents in the millions. A document looks like:
{ "_id": "SPM8824542513_1234", "price": "69.99", "sale": { "salePrice": "42.72", "saleEndDate": "2050-12-31 23:59:59" }, "lastUpdated": 1374647707394 }
Fields of interest:
- _id: the id is built in a specific way. It is the concatenation of the item information and store information. The item information is either the item id or the variant id (SKU). The store information is either the store group id or the store id.
- price: the regular price
- sale: sales information, optional
Common queries (indexed):
- find all prices by item id:
{ _id: { "$regex": "^itemId_" } }
- find all prices by SKU (price could be at item level):
{ _id: { "$in: [ { "$regex": "^itemId_" }, { "$regex": "^sku_" } ] }
- find price for a given SKU and store (4 combinations are possible):
{ _id: { "$in: [ "itemId_storeGroupId", "itemId_storeId", "sku_storeGroupId", "sku_storeId" ] }
- find items on sale, starting with ones ending soonest:
{ "sale.saleEndDate": { $ne: null } } with sort by { "sale.saleEndDate": 1 } (sparse index on "sale.saleEndDate")
Summary Model
The previous documents are now properly modeled, easy to maintain, and can efficiently power an API to serve product details pages. The last, and most difficult issue left to tackle is how to do faceted searching and other kind of browsing. We could leave it off to a full text search system or similar software, but it’s actually doable with MongoDB. For this purpose we face some tough challenges:
- whatever the search is, need a response within milliseconds returning hundreds of items
- the search can be a combination of many facets: category, brand, etc
- facets can be both at the item and variant levels: color, size, etc. If matching a specific variant, we should display that specific image (e.g. red shoes).
- hundreds of variants of the same item could match, in which case only the parent item should be returned as result
- efficient sorting on several attributes: price, popularity
- pagination feature which requires deterministic ordering
For this purpose we create a separate collection called Summary in which each document represent the summary information of an item and all its variants. The data is stripped out to the minimum needed to power a browse & search feature. Such a document looks like:
{ "_id": "3ZZVA46759401P", "name": "Women's Chic - Black Velvet Suede", "lname": "women's chic - black velvet suede", "dep": "84700", "cat": "/84700/80009/1282094266/1200003270", "desc": [ { "lang": "en", "val": "This pointy toe slingback features a high quality upper and a classy, simple silhouette. This heel has a classic shape, an adjustable ankle strap for a vintage feel and a secure fit. The Chic is the perfect combination between dressy and professional." } ], "img": [ { "height": "330", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726201", "title": "spin_prod_591726201", "width": "450" } ], "attrs": [ "heel height=mid (1-3/4 to 2-1/4 in.)", "brand=metaphor" ], "sattrs": [ "upper material=synthetic", "toe=open toe" ], "vars": [ { "id": "05497884001", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6" ] }, { "id": "05497884002", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6.5" ] }, { "id": "05497884004", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=7.5" ] } ] }
Fields of interest:
- _id: the item id
- name: the item name
- lname: the item name, lower-case
- img: list of images, ideally just the thumbnail
- dep: the department (top level of category). Needs to be separate for proper indexing
- cat: the category path
- attrs: the item attributes, to be indexed
- sattrs: the item secondary attributes, not to be indexed
- vars: list of variants
- vars.id: the variant sku
- vars.attrs: the variant attributes, to be indexed
- vars.sattrs: the secondary variant attributes, not to be indexed
Indices:
- department + attr + category + _id
- department + vars.attrs + category + _id
- department + category + _id
- department + price + _id
- department + rating + _id
Common queries (indexed):
- find by department:
{ dep: "department" }
- find by category prefix:
{ dep: "department", cat: { $regex: "^category prefix"} }
- find by item attribute:
{ dep: "department", attrs: "name=value" }
- find by several item attributes:
{ dep: "department", attrs: { $all: [ "name=value", ... ] }
- find by variant attribute:
{ dep: "department", vars.attrs: "name=value" }
- find by several variant attributes:
{ dep: "department", vars.attrs: { $all: [ "name=value", ... ] }
- find by item attributes, variant attributes, category:
{ dep: "department", attrs: { $all: [ "name=value", ... ], vars.attrs: { $all: [ "name=value", ... ], cat: { $regex: "^category prefix"} }
A few interesting notes on indexing / querying:
- each index starts with the department, which is a convenient way to subdivide our product catalog. It is an acceptable restriction to force the user to pick a department before displaying any kind of search facet (unless we’re displaying a pre-computed list like “most popular”). Hence having the department there will ensure that there is always a large amount of filtering done for cheap by the index :)
- each index ends with “_id” which is useful for pagination. It will give sorting on _id for free for some common queries. It’s always better to avoid resorting the skip/limit for pagination, which is only fine for a low number of pages.
- for queries using “$all“ the most restrictive attribute should be specified first (e.g. “color=red”). This information can be inferred from the “facet“ collection described earlier. This piece is critical to make facetted searches efficient and keep them in the few milliseconds.
Conclusion
In conclusion, we’ve seen here how to model and index a product catalog in MongoDB which will allow high performance, flexibility, and easy maintenance. More details on this topic can be see in the MongoDB World video Product Catalog
Stay tuned for more information on our Product Catalog MongoDB solution, including:
- How to implement full text search within MongoDB or with a connector to an external FTS system
- More statistics and benchmarking on the faceted search capability
- Operational considerations: geo distributed for low latency queries, stringent read latency SLA, and spiky catalog updates
Also check out another interesting topic: how to log all user activities around the site and run useful analytics on them in the MongoDB World video covering the Insight Component
Published at DZone with permission of Antoine Girbal, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments