Build a Scalable E-commerce Platform: System Design Overview
This article explores the architecture of a distributed and scalable e-commerce platform with multiple services and components, hosted on a cloud platform like AWS.
Topic: High-level design of an e-commerce platform like Amazon, Walmart, etc.
Objective: Big picture, challenges, and options to overcome them
Brief: In this article, we’ll explore the architecture of an e-commerce platform. The architecture supports critical features like product search, user authentication, order management, inventory updates, payment processing, etc. It does not support the delivery process. These components work in harmony to deliver a seamless shopping experience while maintaining reliability, scalability, and performance.
What Is an E-commerce System, and What Are Its Benefits?
An e-commerce system is a SaaS platform that enables businesses to sell products and, in some cases, services online. It provides a digital storefront where customers browse, purchase, and manage orders, and it integrates various software components to facilitate online transactions, manage inventory, process payments, handle logistics, etc.
Below are the benefits of e-commerce platforms:
- Global reach: E-commerce systems allow businesses to reach a global audience, breaking geographical barriers and expanding their customer base.
- Convenience: Online shopping provides customers with a convenient way to browse and purchase products at any time.
- Scalability: A scalable e-commerce system can handle a large volume of transactions to meet growing demand.
- Cost-effective: Online stores don’t need brick-and-mortar stores, thus saving operational costs. Some companies do have warehouses, but the operation costs for them are still lower, especially because warehouses can have a lot of automation.
- Recommendations: E-commerce systems enable personalized recommendations, improving the shopping experience.
- Automation: Automated processes streamline order management, inventory tracking, shipping, and returns.
Requirements and Goals of the System
Functional Requirements (FR)
- Authorization
- Search
- Add to Cart
- Order/Purchase (Payment)
- Notification (Stage of Order)
- User Service
- Inventory Management
- Order History
- Stretch Goal: Recommendation System
Non-Functional Requirements (NFR)
- Availability (for the Platform)
- Consistency (for the Inventory)
- Reliability (for data)
- Stretch Goals: Monitoring, Observability
Out of Scope
- Delivery System
- API Design
- Data Models
- Shipping, Returns, Customer Support, and other post-purchase logistics
Design Considerations
- The system overall will be read-heavy, as people browse more than buy, so search and recommendations must be low latency.
- Data should be 100% reliable. If a user creates an account or a seller creates a product listing, the system will guarantee that it will never be lost.
- Sellers can create as many listings, and customers can order as many items as they want. The system must support efficient data management.
- Tasks like notifications are allowed to have high latency but have to be reliable.
- Order storage will be divided into two parts: hot storage for orders less than a year old (AWS RDS) and cold storage for orders more than a year old (AWS S3 with Athena).
Back-of-the-Envelope Calculations
Daily Active Users (DAU)
The platform serves around 10M DAUs. Assume 10% are active at peak times.
Search Requests
- Assume 1-2 searches per session. With 10% of DAUs (1M users) active in a peak hour, that's at least 1M searches/hour.
- Each search query might return several items, requiring low latency searches and high read throughput.
Item Service
With 10M DAUs each viewing 5-10 items per session (across a catalog of over 10M products), the Item Service sees 50-100M transactions per day (TPD).
Cart and Checkout
- If 1% of roughly 10M daily searches lead to an add-to-cart action, that's 0.1M cart transactions per day (TPD).
- Assume 10-20% of those turn into purchases, so 0.01-0.02M (10,000-20,000) orders per day.
Recommendations
Each search triggers a recommendation service request, so the service needs to handle 1M recommendations/hour.
Inventory Management
The inventory service needs to maintain real-time stock information. Assuming frequent product updates, that’s 10M TPD.
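The daily volumes above can be converted into average per-second rates with a few lines of arithmetic. This is a rough sketch under the article's assumptions. It uses the lower bound of one search per user per day, and peak traffic will sit well above these averages.

```python
# Back-of-the-envelope traffic estimate using the article's assumptions.
DAU = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(transactions_per_day: float) -> float:
    """Average requests per second for a given daily volume."""
    return transactions_per_day / SECONDS_PER_DAY


item_views_tpd = (DAU * 5, DAU * 10)            # 5-10 item views per user
searches_tpd = DAU * 1                          # lower bound: 1 search per user
cart_tpd = searches_tpd // 100                  # 1% of searches add to cart
orders_tpd = (cart_tpd // 10, cart_tpd // 5)    # 10-20% of carts convert
inventory_tpd = 10_000_000                      # stock updates per day

print(f"Item views: {per_second(item_views_tpd[0]):,.0f}-"
      f"{per_second(item_views_tpd[1]):,.0f} req/s on average")
print(f"Cart adds:  {per_second(cart_tpd):.2f} req/s on average")
print(f"Orders:     {orders_tpd[0]:,}-{orders_tpd[1]:,} per day")
print(f"Inventory:  {per_second(inventory_tpd):,.0f} updates/s on average")
```

The averages are modest. The read-heavy item and search paths dominate, which is why they get the caching and search-cluster attention later in the design.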
Data Storage
- User DB: 10M users at ~1 KB each is about 10 GB of raw data. Budgeting for redundancy (replicas, indexes, backups) and growth, plan for around 100 GB.
- Item DB: 10M products at ~1 KB of product details each is about 10 GB of raw data; with a redundancy allowance like the one for the User DB, plan for around 100 GB. Images and videos are extra and belong in a file storage system fronted by a CDN.
- Order DB: Assuming 1000 orders per user over a lifetime (10 billion orders for 10M users), order storage could take around 100 TB, which works out to roughly 10 KB per order.
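The storage figures can be sanity-checked the same way. The sketch below computes raw sizes only; replication, indexes, backups, and growth headroom come on top. The 10 KB order size is not stated in the list but is what the 100 TB figure implies for 10 billion orders.

```python
# Raw storage estimate (decimal units); redundancy and indexes are extra.
KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000


def raw_size_bytes(records: int, bytes_per_record: int) -> int:
    """Raw bytes needed for `records` rows of a fixed average size."""
    return records * bytes_per_record


users_raw = raw_size_bytes(10_000_000, 1 * KB)
items_raw = raw_size_bytes(10_000_000, 1 * KB)

ORDER_BYTES = 10 * KB  # assumption implied by the 100 TB figure
orders_raw = raw_size_bytes(10_000_000 * 1_000, ORDER_BYTES)

print(f"User DB raw:  {users_raw / GB:.0f} GB")
print(f"Item DB raw:  {items_raw / GB:.0f} GB")
print(f"Order DB raw: {orders_raw / TB:.0f} TB")
```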
Latency and Throughput
Services should maintain low latency (<100ms) with high throughput, ensuring scalability and responsiveness across the platform.
High-Level Design
Use the FRs and the NFRs as the guiding light to build this system.
Build a microservice for each feature. This is not a hard and fast rule, but by and large, it holds. Use your best judgment.
For orders, have hot storage and cold storage. Hot storage will store orders that are less than a year old, which, upon reaching the one-year mark, will be migrated to cold storage. This will help keep the hot storage small and fast.
This will work because old orders are rarely queried for, and if they are, then they can be fetched from cold storage, which won’t be as fast as hot storage, and that’s fine because they won’t be expected to be fetched very quickly.
Based on the data size estimates above, the following technologies are reasonable choices:
Storage
- OpenSearch cluster for search functionality
- S3 with Athena for cold storage of orders
- RDBMS for everything else
Queuing Infrastructure: Kafka
- To be used for tasks that require fault tolerance and don’t require low turnaround time (TAT)
- Example tasks: Order status notifications (ordered, shipped, delivered, etc.)
Distributed Caching Infrastructure
- Searching: Search queries generate high concurrent reads, which can cause bottlenecks. A distributed cache allows scalability and redundancy across multiple nodes. This type of cache avoids overloading a single point (centralized cache).
- Why not centralized? A centralized cache would limit scalability and could become a bottleneck under high read demand, impacting system performance.
- Recommendations: Personalized data and recommendations need fast, scalable access. A distributed cache can handle many concurrent users with large datasets.
- Why not centralized? Centralized caching can’t efficiently scale to the level needed for serving personalized data to a large number of users, resulting in slower response times.
- Inventory management: Inventory data is accessed frequently and needs to be synchronized across multiple nodes due to its dynamic nature. Distributed caching ensures updates are consistent and distributed for fast access.
- Why not centralized? Inconsistent inventory data across a centralized cache could lead to data discrepancies, especially during peak times or high loads.
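One common way a distributed cache spreads keys across nodes is consistent hashing. The article does not name a specific technique, so treat the following as an illustrative sketch; the node names and the `HashRing` class are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring that maps cache keys to cache nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` virtual points to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("product:42"))
```

When a node is removed, only the keys that lived on that node move to another one. This is what lets the cache tier grow or shrink without invalidating most entries.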
Centralized Cache: Memcached
- User service: User profile data might not change frequently but is essential for personalization. A centralized cache works well because the data is relatively static and needs to be accessed reliably.
- Why not distributed? A distributed cache is overkill for relatively stable, low-frequency access data. A centralized cache ensures simplicity with lower overhead and synchronization challenges.
- Carting: Cart data is user-specific and doesn’t require high scalability. Centralized caching is effective for handling real-time access with low concurrent modification, as only a few updates happen per user session.
- Why not distributed? There’s no significant need for scalability across multiple nodes for cart data, and centralized caching provides simplicity and fast access without additional management.
- Auth: Authentication-related data (like tokens or session details) benefits from consistency and quick access. Centralized caching ensures that user sessions can be validated quickly without network delays across multiple nodes.
- Why not distributed? Distributed caching could introduce synchronization issues for session management and token expiration across multiple nodes, complicating consistency. A centralized cache avoids these problems.
Race Condition
What happens when there are multiple customers trying to buy the same product at the same time? The same item can’t be sold to multiple people if the inventory is limited. So what’s the solution?
There are multiple ways of handling this:
DB Locking
The system (Item Management Service) locks the inventory record when a customer starts a transaction, preventing others from accessing it until the process completes. In an RDBMS, this happens automatically through row-level locking when a transaction updates the row.
This is not a great method business-wise as there may be more items available than the locking user requested, and so the other users waiting for the lock to release implies a poor customer experience. Amazon found that every 100ms of latency costs them 1% in sales [source].
Optimistic Concurrency Control
The system checks inventory availability just before confirming the purchase. While OCC minimizes locking overhead, it is not foolproof. It works on the assumption that most DB transactions don’t conflict, allowing multiple transactions to proceed without locking. However, OCC can fail in high-contention scenarios, where many transactions update the same data simultaneously. In such cases, conflicts might occur, leading to failed transactions that must be retried.
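A common way to implement OCC is a version column that every writer must match. The sketch below uses an in-memory SQLite database as a stand-in for the production RDBMS; the `items` table layout and the `purchase_occ` function are assumptions for illustration.

```python
import sqlite3


def purchase_occ(conn, item_id, qty, max_retries=3):
    """Decrement stock only if the row is unchanged since it was read."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT stock, version FROM items WHERE id = ?", (item_id,)
        ).fetchone()
        if row is None or row[0] < qty:
            return False  # unknown item or not enough stock
        stock, version = row
        cur = conn.execute(
            "UPDATE items SET stock = ?, version = ? "
            "WHERE id = ? AND version = ?",
            (stock - qty, version + 1, item_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # nobody changed the row in between
        # Conflict: another writer bumped the version; re-read and retry.
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO items VALUES (1, 5, 0)")
conn.commit()
```

Under high contention, most attempts hit the retry branch, which is the failure mode described above.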
Queueing Mechanism
A queue ensures a first come, first served (FCFS) model and is fault tolerant, but it is too slow for a high-transaction e-commerce platform.
Stock Reservation
Temporarily reserve items for customers to complete purchases within a time window. How this would work:
- Have an Item Reservations (IR) table which will have a quantity column, reserved_until column, status column (expired, reserved, purchased) among others.
- Before purchasing, check if the requested quantity is available in stock. If enough stock is available, reserve the items by inserting a record into the IR table; else, decline the purchase.
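- After reserving, update the stock in the items table by reducing the item count. Perform the availability check, the reservation, and the stock update in a single transaction so that two concurrent buyers cannot reserve the same units.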
- To remove/expire the reservations, set up a scheduled job (say using a cron job) to periodically check for expired reservations and release stock (by incrementing back the item count).
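The steps above can be sketched with SQLite standing in for the production database. Column names beyond those listed (quantity, reserved_until, status) are assumptions. This variant also folds the availability check and the stock decrement into one conditional UPDATE, a small deviation from the step order above that keeps the check and the update atomic.

```python
import sqlite3

SCHEMA = """
CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER NOT NULL);
CREATE TABLE item_reservations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    reserved_until REAL NOT NULL,  -- Unix timestamp
    status TEXT NOT NULL           -- reserved | purchased | expired
);
"""


def reserve(conn, item_id, quantity, now, hold_seconds=600):
    """Reserve stock if available; return the reservation id or None."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE items SET stock = stock - ? WHERE id = ? AND stock >= ?",
            (quantity, item_id, quantity),
        )
        if cur.rowcount == 0:
            return None  # not enough stock: decline the purchase
        cur = conn.execute(
            "INSERT INTO item_reservations "
            "(item_id, quantity, reserved_until, status) "
            "VALUES (?, ?, ?, 'reserved')",
            (item_id, quantity, now + hold_seconds),
        )
        return cur.lastrowid


def release_expired(conn, now):
    """Body of the scheduled job: return expired holds to stock."""
    with conn:
        rows = conn.execute(
            "SELECT id, item_id, quantity FROM item_reservations "
            "WHERE status = 'reserved' AND reserved_until <= ?",
            (now,),
        ).fetchall()
        for res_id, item_id, quantity in rows:
            conn.execute(
                "UPDATE items SET stock = stock + ? WHERE id = ?",
                (quantity, item_id),
            )
            conn.execute(
                "UPDATE item_reservations SET status = 'expired' WHERE id = ?",
                (res_id,),
            )
        return len(rows)
```

A cron-style job would call `release_expired` periodically with the current time, and the checkout path would mark a reservation as purchased once payment succeeds.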
High-Level System Design Diagram
Component Design
1. API Gateway and Load Balancer
Acts as a single entry point for all user interactions with the platform, routing incoming requests to the appropriate microservices. It also handles auth, rate limiting, logging, security, caching, and content encoding.
The Load Balancer distributes traffic across nodes to ensure high availability, reliability, and performance by preventing a single node from getting too hot. AWS API Gateway is a good choice for this.
2. Search Service
Enables users to search for products by querying the search cluster. It provides search functionality such as text-based queries, filters, and ranking of results by relevance. It also ensures quick retrieval of search results by indexing the product catalog stored in the Item Service and Details DB. This service should return results even if the user makes a typing mistake, which can be achieved through fuzzy search. Amazon OpenSearch Service is a good choice for this.
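As an illustration, a search request with typo tolerance and filters could be expressed in the OpenSearch query DSL as a `bool` query with a fuzzy `match` clause. The field names (`title`, `category`, `price`) are assumptions about the index mapping, and `build_search_query` is a hypothetical helper.

```python
import json


def build_search_query(text, category=None, max_price=None, size=20):
    """Build a query body: fuzzy full-text match plus optional filters."""
    filters = []
    if category is not None:
        filters.append({"term": {"category": category}})
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    return {
        "size": size,
        "query": {
            "bool": {
                # "fuzziness": "AUTO" tolerates small typos in the query text.
                "must": [
                    {"match": {"title": {"query": text, "fuzziness": "AUTO"}}}
                ],
                # Filters narrow the result set without affecting scoring.
                "filter": filters,
            }
        },
    }


body = build_search_query("wireles headphnes", category="electronics", max_price=100)
print(json.dumps(body, indent=2))
```

Results come back ordered by relevance score by default, which covers the ranking requirement without extra configuration.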
3. Detail Page Service
Product selection by the user from the search results will trigger a call to the Detail Page Service, which retrieves product details from the Details DB, like product specifications, price, reviews, and availability, ensuring that the user can view all necessary information to make a purchase decision. This service may further call different hydration sources for reviews, metadata, images, videos, etc.
4. Location Service
Manages the geo aspects of the platform. It tracks user locations (as per user permission and local regulations) to personalize shipping options, calculate delivery times, and show promotions. It stores location data in the Location DB and integrates with the Search Service and Purchase Service for geographically tailored experiences. Data storage must follow compliance and regulatory requirements.
5. Auth Service
Handles authentication and authorization processes, managing user logins, sessions, and registration. It ensures only authenticated users can interact with the system by integrating with external authentication providers or using local credentials.
6. User Service
Manages user profiles, including personal information, addresses, and preferences, which are stored in the User DB. This service is integral to customizing user experiences, enabling features like personalized recommendations, saved carts, and wish lists.
7. Cart Service
Enables users to add, update, and remove items from their carts, persisting this data in the Cart DB. It ensures users can seamlessly transition between sessions without losing cart information and calls the Purchase Service during checkout.
8. Purchase Service
Processes orders, integrating with the Payment Gateway for secure transactions. It handles the checkout process and calculates total costs, including taxes and shipping. After successful payment, it passes the order details to the Order Service for further processing.
9. Item Service
Manages product catalog and stores product details, availability, and metadata in the Items DB. This service integrates with Search Service to provide indexed data for search queries and with Inventory Service to ensure real-time stock updates.
10. Inventory Service
Manages product availability across various warehouses. It tracks stock levels and updates the system when items are sold or restocked. The service ensures accurate stock counts and integrates with the Item Service to display real-time product availability to users.
11. Item Management Service
Handles the management of product listings, allowing sellers or warehouse workers to add, update, or remove products from the catalog. It integrates with Inventory Service to manage stock levels and Item Service to update product metadata.
12. Order Service
Tracks orders from the time they are placed until they are fulfilled. It stores order information in the Orders DB and integrates with the Order Status Service to update users on their order status.
13. Order Status Service
Provides real-time updates on the status of user orders. It integrates with Order Service and tracks orders as they move through various stages, from processing to delivery. This information is stored in the Status DB, and users are kept informed via the Notification Service.
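The stages an order moves through can be modeled as a small state machine so that invalid jumps are rejected. This is a sketch with assumed stage names; the CANCELLED state in particular is an illustrative addition, not something the article specifies.

```python
from enum import Enum


class OrderStatus(Enum):
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"  # assumed stage, for illustration


# Which stages may follow each stage.
ALLOWED = {
    OrderStatus.PROCESSING: {OrderStatus.SHIPPED, OrderStatus.CANCELLED},
    OrderStatus.SHIPPED: {OrderStatus.DELIVERED},
    OrderStatus.DELIVERED: set(),
    OrderStatus.CANCELLED: set(),
}


def advance(current: OrderStatus, new: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new
```

Each successful transition is a natural point to write to the Status DB and emit an event for the Notification Service.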
14. Migration Service
Responsible for migrating orders older than a year (configurable) to a cold storage DB. This is done to keep the hot storage DB small and fast. Since old orders are seldom accessed by users, this decision works well. AWS S3 with Athena is a good choice for this.
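A migration pass could look roughly like the following. SQLite stands in for the hot database and a plain list of JSON lines stands in for the objects written to cold storage; the `orders` table layout and the function name are assumptions.

```python
import json
import sqlite3


def migrate_old_orders(hot, cold_sink, cutoff, batch_size=1000):
    """Move orders placed before `cutoff` (ISO date string) to cold storage."""
    moved = 0
    while True:
        rows = hot.execute(
            "SELECT id, user_id, placed_at, payload FROM orders "
            "WHERE placed_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        # Write to cold storage first so a crash never loses an order;
        # the worst case is a duplicate that a later run can overwrite.
        for order_id, user_id, placed_at, payload in rows:
            cold_sink.append(json.dumps({
                "id": order_id,
                "user_id": user_id,
                "placed_at": placed_at,
                "payload": payload,
            }))
        with hot:  # delete the migrated batch in one transaction
            hot.executemany(
                "DELETE FROM orders WHERE id = ?", [(r[0],) for r in rows]
            )
        moved += len(rows)
```

Working in batches keeps each delete transaction short, so the job does not hold locks on the hot store for long.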
15. Notification Service
Sends updates to users about user-configured or default events, e.g., order confirmations, shipping notifications, and promotions. It stores notifications in the Notification DB and integrates with the Order Status Service to keep users informed. This service uses a queuing mechanism like Kafka.
This keeps notifications fault tolerant, and since they are not time-critical, the added latency is acceptable. More critical notifications would need a real-time mechanism such as WebSockets, push notifications, or SSE (Server-Sent Events).
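The retry behavior that a queue makes possible can be sketched with Python's in-memory `queue.Queue` standing in for Kafka. A real Kafka consumer would track offsets and commits instead; the event shape and the `drain_notifications` function are assumptions for illustration.

```python
import queue


def drain_notifications(q, send, max_attempts=3):
    """Deliver queued events, re-queueing failures up to max_attempts."""
    delivered, dead_letter = [], []
    while not q.empty():
        event = q.get()
        try:
            send(event)
            delivered.append(event["order_id"])
        except Exception:
            event["attempts"] = event.get("attempts", 0) + 1
            if event["attempts"] < max_attempts:
                q.put(event)  # retry later; extra latency is acceptable
            else:
                dead_letter.append(event["order_id"])  # needs follow-up
    return delivered, dead_letter
```

Failed deliveries are retried rather than dropped, which trades latency for reliability, in line with the requirement that notifications may be slow but must not be lost.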
16. Recommendation Service
Provides personalized product suggestions to users, leveraging user data from the User Service and order history from Order Service. By analyzing user behavior and preferences, this service enhances user engagement by offering relevant products.
17. ES Cluster (Elasticsearch)
Powers the search capabilities by indexing data from the Item Service and Details DB. It stores data in a format optimized for fast retrieval and supports the full-text search functionality of the platform. It also supports fuzzy search so that queries with typing errors still return relevant results. Amazon OpenSearch Service is a good choice for this.
18. Payment Gateway
Facilitates secure payment processing by connecting the platform to external payment providers like Visa, MasterCard, etc. It ensures sensitive payment details are handled securely and that funds are transferred to the merchant upon a successful purchase.
19. Queuing Infra (Kafka)
Facilitates async communication between services. It ensures that updates, such as inventory changes or order status, are propagated efficiently across different components of the system without requiring direct coupling between services. It also facilitates non-time-critical operations like notifications.
Conclusion
This architecture ensures scalability, reliability, and seamless user experiences while maintaining a loosely coupled, service-oriented approach.