DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • MongoDB to Couchbase for Developers, Part 1: Architecture
  • Extracting Data From Very Large XML Files With X-definition
  • MongoDB to Couchbase: An Introduction to Developers and Experts
  • Manage Hierarchical Data in MongoDB With Spring

Trending

  • Your Ultimate Website QA Checklist
  • How to Write for DZone Publications: Trend Reports and Refcards
  • How to Format Articles for DZone
  • Beyond Microservices: The Emerging Post-Monolith Architecture for 2025
  1. DZone
  2. Data Engineering
  3. Databases
  4. Product Catalog with MongoDB, Part 1: Schema Design

Product Catalog with MongoDB, Part 1: Schema Design

By 
Antoine Girbal user avatar
Antoine Girbal
·
Sep. 25, 14 · Interview
Likes (8)
Comment
Save
Tweet
Share
64.6K Views

Join the DZone community and get the full member experience.

Join For Free

This post is part of the Product Catalog MongoDB Series, in which we will cover many aspects of building a Product Catalog with MongoDB. This approach has been tested with a varied product catalog of 130 million items running on a single server (EC2 i2.2xlarge).

MongoDB seems to be the perfect fit to implement a product catalog since products maps so well to documents. But as we shall see it is not as easy as it seems! The data is fairly complex with many relationships involved. Also almost every other system will want to make use of the catalog instead of making its own copy, so typically a low latency, scalable and geo distributed catalog service is the ideal solution.

A product has at least the following information:

  • Item: the overall product info (e.g. Levi’s 501)
  • Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC
  • Price: price information may vary based on the store, the variant, etc
  • Hierarchy: the item taxonomy
  • Facet: facets to search products by
  • Vendors: a given sku may be available through different vendors if the site is a marketplace

A classic pitfall is to try to fit everything into a single document. As a result you end up with something very complex with many nested lists, which makes it difficult to navigate and index. Additionally APIs find themselves sending back massive documents even if only partial info is needed. In certain cases, we’ve seen items with 1000s of variant (e.g. Automotive part) which go beyond 16MB of pure JSON (in which case compression becomes mandatory)! Instead here we are going to model the data in a way that is natural, maps well to the API, at the sweet spot between normalization and denorm.

Item Model

The item collection has document representing the high level data of a product. Here is a sample item document for a shoe:

{
    "_id": "054VA72303012P",
    "desc": [
        {
            "lang": "en",
            "val": "Give your dressy look a lift with these women's Kate high-heel shoes by Metaphor. These playful peep-toe pumps feature satin-wrapped stiletto heels and chiffon pompoms at the toes. Rhinestones on each of the silvertone buckles add just a touch of sparkle to these shoes for a flirty footwear look that's made for your next night out."
        }
    ],
    "name": "Women's Kate Ivory Peep-Toe Stiletto Heel",
    "lname": "women's kate ivory peep-toe stiletto heel",
    "category": "/84700/80009/1282094266/1200003270",
    "brand": {
        "id": "2483510",
        "img": {
            "src": "http://i.sears.com/s/i/bl/image/spin_prod_metadata_168138610"
        },
        "name": "Metaphor"
    },
    "assets": {
        "imgs": [
            {
                "img": {
                    "height": "1900",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_967112812",
                    "width": "1900"
                }
            },
            {
                "img": {
                    "height": "1900",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945877912",
                    "width": "1900"
                }
            },
            {
                "img": {
                    "height": "1900",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945878012",
                    "width": "1900"
                }
            }
        ]
    },
    "shipping": {
        "dimensions": {
            "height": "13.0",
            "length": "1.8",
            "width": "26.8"
        },
        "weight": "1.75"
    },
    "specs": [
        {
            "name": "Heel Height (in.)",
            "val": "3.75"
        }
    ],
    "attrs": [
        {
            "name": "Heel Height",
            "value": "High (2-1/2 to 4 in.)"
        },
        {
            "name": "Upper Material",
            "value": "Synthetic"
        },
        {
            "name": "Toe",
            "value": "Open toe"
        }
        {
            "name": "Brand",
            "value": "Metaphor"
        }
    ],
    "variants": {
        "cnt": 9,
        "attrs": [
            {
                "dispType": "COMBOBOX",
                "name": "Width",
            },
            {
                "dispType": "DROPDOWN",
                "name": "Color",
            },
            {
                "dispType": "DROPDOWN",
                "name": "Shoe Size",
            }
        ]
    },
    "lastUpdated": 1400877254787
}

Fields of interest:

  • _id: the product id
  • lastUpdated: useful timestamp to see recently updated
  • category: the category path made up of hierarchy nodes
  • name: the product name
  • lname: a lower-case version of the name. This can be useful for doing case-insensitive matching with an index
  • brand: the brand
  • desc: list of descriptions (website, retail box, etc)
  • assets: list of assets (images, etc)
  • attrs: list of attributes as name-value pairs. Will be used to implement facetting. Note that the brand is also included as one attribute.
  • variants: some information on variants, but not the variants themselves

Common queries (indexed):

  • find by id:
    { _id: "the product id" }
  • find by category prefix:
    { product.cat: { $regex: "^category prefix" } }
  • find by case-insensitive name prefix:
    { product.lname: { $regex: "^name prefix" } }

Variant Model

The Variant documents represent specific variations of a product. Certain products only exist in a unique variant (e.g. XBox, no options to pick) whereas other products may have thousands. Here is a sample variant document for the same shoe:

{
    "_id": "05458452563",
    "name": "Width:Medium,Color:Ivory,Shoe Size:6.5",
    "lname": "width:medium,color:ivory,shoe size:6.5",
    "itemId": "054VA72303012P",
    "altIds": {
        "upc": "632576103580"
    },
    "assets": {
        "imgs": [
            {
                "width": "1900",
                "height": "1900",
                "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348512"
            },
            {
                "width": "1900",
                "height": "1900",
                "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348612"
            }
        ]
    },
    "attrs": [
        {
            "name": "Width",
            "value": "Medium"
        },
        {
            "name": "Color",
            "family": "White",
            "value": "Ivory"
        },
        {
            "name": "6.5",
            "value": "6.5"
        }
    ]
}

Fields of interest:

  • _id: the SKU
  • itemId: the parent item id.
  • attrs: a list of attributes specific to the variant. Note that some of the attributes may have both a specific value (e.g. ivory) and a family value (e.g. white).
  • assets: assets specific to the variant (e.g. image with a specific color).

Common query parameters (indexed):

  • find by SKU:
    { _id: "the sku" }
  • find by item Id:
    { itemId: "item id" }

Hierarchy Model

The Hierarchy document represents a node of the hierarchical tree representing product taxonomy. The top level nodes represent departments, while further nodes represent specific categories.

{
    "_id": "1200003270",
    "name": "Women's Heels & Pumps",
    "count": 223,
    "parents": [
        "1282094266"
    ],
    "facets": [
        "Heel Height",
        "Toe",
        "Upper Material",
        "Width",
        "Shoe Size",
        "Color"
    ]
}

Fields of interest:

  • _id: the category id
  • name: the category name
  • count: the number of items in this category. It can be a useful statistic to display.
  • parents: list of parent nodes. Simpler implementations could make use of a single value.
  • facets: list of facets that exist for this category (e.g. color, size). This info will be used when displaying the facets available in the searching page.

Common queries (indexed):

  • find by parent id:
    { p: "parent id" }
  • find top level departments:
    { p: null }

Facet Model

The facet document represent a name/value pair representing a product attribute.

{
    "_id": "accessory type=hosiery",
    "name": "Accessory Type",
    "value": "Hosiery",
    "count": 14
}

Fields of interest:

  • _id: the id, which is a concatenation of lower-cased facet name and value.
  • name: the facet name with original casing, e.g. “Accessory Type”
  • value: the facet value with original casing, e.g. “Hosiery”. Important note: here the value should be the family value is possible, e.g. “White” rather than “Ivory”. Those facets will be used for searching items, and the family value is better for that purpose.
  • count: the number of items that have this facet. This count will be important in defining the order of attributes in a query when doing faceted search.

Common query parameters (indexed):

  • find a specific facet:
    { _id: "name_value" }
  • find facets for a name:
    { _id: { $regex: "^name_" } }

Price Model

The Price document obviously represents the price of an item, but there is quite a bit more to it. We want to be able to vary the price per variant (e.g. gold color is more expensive) or per store (e.g. online store is cheaper). While we will not touch on the store model in this post, let’s just imagine we have a few thousand stores which are grouped into a dozen store groups (e.g. online, west coast, etc). If we implement this naively, we would end up with 1000 stores x 200m variants = 2 billion price documents! Instead let’s be a bit smarter and make good use of MongoDB’s querying capability, by allowing to price products at different levels as needed, thus keeping the number of documents in the millions. A document looks like:

{
    "_id": "SPM8824542513_1234",
        "price": "69.99",
        "sale": {
            "salePrice": "42.72",
            "saleEndDate": "2050-12-31 23:59:59"
        },
    "lastUpdated": 1374647707394
}

Fields of interest:

  • _id: the id is built in a specific way. It is the concatenation of the item information and store information. The item information is either the item id or the variant id (SKU). The store information is either the store group id or the store id.
  • price: the regular price
  • sale: sales information, optional

Common queries (indexed):

  • find all prices by item id:
    { _id: { "$regex": "^itemId_" } }
  • find all prices by SKU (price could be at item level):
    { _id: { "$in: [ { "$regex": "^itemId_" }, { "$regex": "^sku_" } ] }
  • find price for a given SKU and store (4 combinations are possible):
    { _id: { "$in: [ "itemId_storeGroupId", "itemId_storeId", "sku_storeGroupId", "sku_storeId" ] }
  • find items on sale, starting with ones ending soonest:
    { "sale.saleEndDate": { $ne: null } } with sort by { "sale.saleEndDate": 1 } (sparse index on "sale.saleEndDate")

Summary Model

The previous documents are now properly modeled, easy to maintain, and can efficiently power an API to serve product details pages. The last, and most difficult issue left to tackle is how to do faceted searching and other kind of browsing. We could leave it off to a full text search system or similar software, but it’s actually doable with MongoDB. For this purpose we face some tough challenges:

  • whatever the search is, need a response within milliseconds returning hundreds of items
  • the search can be a combination of many facets: category, brand, etc
  • facets can be both at the item and variant levels: color, size, etc. If matching a specific variant, we should display that specific image (e.g. red shoes).
  • hundreds of variants of the same item could match, in which case only the parent item should be returned as result
  • efficient sorting on several attributes: price, popularity
  • pagination feature which requires deterministic ordering

For this purpose we create a separate collection called Summary in which each document represent the summary information of an item and all its variants. The data is stripped out to the minimum needed to power a browse & search feature. Such a document looks like:

{
    "_id": "3ZZVA46759401P",
    "name": "Women's Chic - Black Velvet Suede",
    "lname": "women's chic - black velvet suede",
    "dep": "84700",
    "cat": "/84700/80009/1282094266/1200003270",
    "desc": [
        {
            "lang": "en",
            "val": "This pointy toe slingback features a high quality upper and a classy, simple silhouette. This heel has a classic shape, an adjustable ankle strap for a vintage feel and a secure fit. The Chic is the perfect combination between dressy and professional."
        }
    ],
    "img": [
        {
            "height": "330",
            "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726201",
            "title": "spin_prod_591726201",
            "width": "450"
        }
    ],
    "attrs": [
        "heel height=mid (1-3/4 to 2-1/4 in.)",
        "brand=metaphor"
    ],
    "sattrs": [
        "upper material=synthetic",
        "toe=open toe"
    ],
    "vars": [
        {
            "id": "05497884001",
            "img": [
                {
                    "height": "400",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301",
                    "title": "spin_prod_591726301",
                    "width": "450"
                }
            ],
            "attrs": [
                "width=medium",
                "color=black",
                "shoe size=6"
            ]
        },
        {
            "id": "05497884002",
            "img": [
                {
                    "height": "400",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301",
                    "title": "spin_prod_591726301",
                    "width": "450"
                }
            ],
            "attrs": [
                "width=medium",
                "color=black",
                "shoe size=6.5"
            ]
        },
        {
            "id": "05497884004",
            "img": [
                {
                    "height": "400",
                    "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301",
                    "title": "spin_prod_591726301",
                    "width": "450"
                }
            ],
            "attrs": [
                "width=medium",
                "color=black",
                "shoe size=7.5"
            ]
        }
    ]
}

Fields of interest:

  • _id: the item id
  • name: the item name
  • lname: the item name, lower-case
  • img: list of images, ideally just the thumbnail
  • dep: the department (top level of category). Needs to be separate for proper indexing
  • cat: the category path
  • attrs: the item attributes, to be indexed
  • sattrs: the item secondary attributes, not to be indexed
  • vars: list of variants
  • vars.id: the variant sku
  • vars.attrs: the variant attributes, to be indexed
  • vars.sattrs: the secondary variant attributes, not to be indexed

Indices:

  • department + attr + category + _id
  • department + vars.attrs + category + _id
  • department + category + _id
  • department + price + _id
  • department + rating + _id

Common queries (indexed):

  • find by department:
    { dep: "department" }
  • find by category prefix:
    { dep: "department", cat: { $regex: "^category prefix"} }
  • find by item attribute:
    { dep: "department", attrs: "name=value" }
  • find by several item attributes:
    { dep: "department", attrs: { $all: [ "name=value", ... ] }
  • find by variant attribute:
    { dep: "department", vars.attrs: "name=value" }
  • find by several variant attributes:
    { dep: "department", vars.attrs: { $all: [ "name=value", ... ] }
  • find by item attributes, variant attributes, category:
    { dep: "department", attrs: { $all: [ "name=value", ... ], vars.attrs: { $all: [ "name=value", ... ], cat: { $regex: "^category prefix"} }

A few interesting notes on indexing / querying:

  • each index starts with the department, which is a convenient way to subdivide our product catalog. It is an acceptable restriction to force the user to pick a department before displaying any kind of search facet (unless we’re displaying a pre-computed list like “most popular”). Hence having the department there will ensure that there is always a large amount of filtering done for cheap by the index :)
  • each index ends with “_id” which is useful for pagination. It will give sorting on _id for free for some common queries. It’s always better to avoid resorting the skip/limit for pagination, which is only fine for a low number of pages.
  • for queries using “$all“ the most restrictive attribute should be specified first (e.g. “color=red”). This information can be inferred from the “facet“ collection described earlier. This piece is critical to make facetted searches efficient and keep them in the few milliseconds.

Conclusion

In conclusion, we’ve seen here how to model and index a product catalog in MongoDB which will allow high performance, flexibility, and easy maintenance. More details on this topic can be see in the MongoDB World video Product Catalog

Stay tuned for more information on our Product Catalog MongoDB solution, including:

  • How to implement full text search within MongoDB or with a connector to an external FTS system
  • More statistics and benchmarking on the faceted search capability
  • Operational considerations: geo distributed for low latency queries, stringent read latency SLA, and spiky catalog updates

Also check out another interesting topic: how to log all user activities around the site and run useful analytics on them in the MongoDB World video covering the Insight Component

Database MongoDB Attribute (computing) Document Schema Design

Published at DZone with permission of Antoine Girbal, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • MongoDB to Couchbase for Developers, Part 1: Architecture
  • Extracting Data From Very Large XML Files With X-definition
  • MongoDB to Couchbase: An Introduction to Developers and Experts
  • Manage Hierarchical Data in MongoDB With Spring

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!