Real-Time Big Data Processing with Spark and MemSQL
As more business decisions are based on data and insights, batch and offline reporting from data aren't enough. What is needed in order to available with minimal latency?
Join the DZone community and get the full member experience.Join For Free
I got an opportunity to work extensively with big data and analytics in Myntra, an e-commerce store based in India. Data-driven intelligence is one of the core values at Myntra, so crunching and processing data and reporting meaningful insights for the company is of utmost importance.
Every day, millions of users visit Myntra via the app or website, generating billions of clickstream events. It's very important for the data platform team to scale to such a huge number of incoming events, ingest them in real-time with minimal or no loss, and process the unstructured or semi-structured data to generate insights.
We use a varied set of technologies and in-house products to achieve the above, including Go, Kafka, Secor, Spark, Scala, Java, S3, Presto, and Redshift.
As more and more business decisions tend to be based on data and insights, batch and offline reporting from data were simply not enough. We required real-time user behavior analysis, real-time traffic, real-time notification performance, and more to be available with minimal latency. We needed to ingest as well as filter and process data in real-time and also persist it in a write-fast performant data store to do dashboarding and reporting.
Meterial is a pipeline that does exactly this and even more with a feedback loop for other teams to take action from the data in real-time.
Meterial is powered by:
Data transformer based on Apache Spark.
MemSQL real-time DB.
Our event collectors written in Golang sit behind Amazon ELB to receive events from the app or website. They add a timestamp to the incoming clickstream events and push them into Kafka. From Kafka, a Meterial-ingestion layer based on Apache Spark streaming ingests somewhere around four million events per minute, filters and transforms the incoming events based on a configuration file, and persists them to a MemSQL row-store every minute. MemSQL return results for queries spawning across millions of rows with sub-second latency.
Our in-house dashboarding and reporting framework — namely, UDP (Universal Dashboarding platform) — has services that query MemSQL every minute and store the result in a UDP query cache, from where it is served to all the connected clients using socket-based connections. The results are displayed in form of graphs, charts, tables, and other numerous widgets supported by UDP. The same UDP APIs are also used by Slack bots to post data into Slack channels in real-time using Slack outgoing webhooks.
As all transactional data currently lies in Redshift and there are requirements where reporting of commerce data with user data every 15 minutes is needed, Meterial also enables this ad-hoc analysis on data for our team of data analysts. Every fifteen minutes, data from MemSQL for that interval is dumped into S3, from where it is loaded to Redshift using our S3 (redshift ETLs).
We selected Spark as our streaming engine because of its proven scale, its powerful community support, the expertise within the team, and the easy scalability with proper tuning. For real-time datastore choice, we did POC on multiple DBs and drilled down to MemSQL. MemSQL is a high-performance, in-memory, and disk-based database that combines the horizontal scalability of distributed systems with the familiarity of SQL. We have seen MemSQL support very high concurrent reads/writes very smoothly at scale with proper tuning. Currently, we are exploring MemSQL column store as our OLAP DB for our AB Test framework (Morpheus) and Segmentation Platform (Personify).
Sample UI Screenshots
Future of Real-Time Analytics at Myntra
Using real-time data with predictive analytics, machine learning, and artificial intelligence opens new doors to understanding user behavior, what paths and funnels lead to commerce, etc. Getting such information in real-time can definitely help us boost our commerce and take corrective actions if something goes wrong as soon as possible. We are constantly working on improving and enhancing it.
Opinions expressed by DZone contributors are their own.