Real-Time Big Data Ingestion With Meterial

Using real-time data with predictive analytics, Machine Learning, and AI opens new doors to understand user behavior, what paths and funnels lead to commerce, and more.

By Lesia Mryoshnickenko · Mar. 12, 17 · Opinion


I got an opportunity to work extensively with big data and analytics at Myntra. Data-driven intelligence is one of the core values at Myntra, so crunching and processing data and reporting meaningful insights for the company is of utmost importance.

Every day, millions of users visit Myntra through our app or website, generating billions of clickstream events. This makes it critical for the data platform team to scale to such a huge number of incoming events, ingest them in real time with minimal or no loss, and process the unstructured/semi-structured data to generate insights.

We use a varied set of technologies and in-house products to achieve this, including but not limited to Go, Kafka, Secor, Spark, Scala, Java, S3, Presto, and Redshift.


Motivation

As more and more business decisions came to be based on data and insights, batch and offline reporting was simply not enough. We required real-time user behavior analysis, real-time traffic, real-time notification performance, and other metrics to be available with minimal latency so that business users could make decisions. We needed to ingest, filter, and process data in real time, and also persist it in a performant, write-fast data store to power dashboarding and reporting on top of it.

Meterial is one such pipeline. It does exactly this and more, with a feedback loop that lets other teams act on the data in real time.

Architecture

[Architecture diagram]

Meterial is powered by:

  1. Apache Kafka.
  2. A data transformer based on Apache Spark.
  3. The MemSQL real-time database.
  4. A React.js-based UI.

Deep Dive

Our event collectors, written in Go, sit behind Amazon ELB and receive events from our app and website. They add a timestamp to the incoming clickstream events and push them into Kafka.
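
The production collectors are written in Go; purely as a minimal sketch of the same stamp-and-publish idea, here it is in Scala using the standard Kafka producer client. The broker address, topic name, and payload shape are assumptions.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object CollectorSketch {
      // Assumed broker address; the serializers are the stock string serializers.
      private val props = new Properties()
      props.put("bootstrap.servers", "kafka:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      private val producer = new KafkaProducer[String, String](props)

      // Stamp the raw event with the collector's receive time, then publish it.
      def collect(rawEvent: String): Unit = {
        val stamped = s"""{"received_at":${System.currentTimeMillis},"event":$rawEvent}"""
        producer.send(new ProducerRecord[String, String]("clickstream", stamped))
      }
    }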

From Kafka, a Meterial ingestion layer based on Apache Spark Streaming ingests around 4 million events/minute, filters and transforms the incoming events based on a configuration file, and persists them to the MemSQL rowstore every minute. MemSQL returns results for queries spanning millions of rows with sub-second latency.
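
A minimal sketch of this layer, assuming Spark Streaming's Kafka direct stream, a stand-in whitelist predicate in place of the real configuration file, and a hypothetical clickstream_minute table with placeholder credentials. MemSQL speaks the MySQL wire protocol, so a stock MySQL JDBC driver suffices.

    import java.sql.DriverManager
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object MeterialIngestionSketch {
      // Stand-in for the configuration-driven filter described above.
      def isWhitelisted(event: String): Boolean = event.contains("\"eventType\"")

      def main(args: Array[String]): Unit = {
        // One-minute micro-batches, matching the per-minute persistence cadence.
        val ssc = new StreamingContext(new SparkConf().setAppName("meterial-ingestion"), Seconds(60))
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka:9092", // assumed address
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "meterial")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("clickstream"), kafkaParams))

        stream.map(_.value).filter(isWhitelisted).foreachRDD { rdd =>
          rdd.foreachPartition { events =>
            // Batch the partition's events into the (hypothetical) rowstore table.
            val conn = DriverManager.getConnection("jdbc:mysql://memsql:3306/meterial", "app", "secret")
            val ps = conn.prepareStatement("INSERT INTO clickstream_minute (ts, payload) VALUES (NOW(), ?)")
            try {
              events.foreach { e => ps.setString(1, e); ps.addBatch() }
              ps.executeBatch()
            } finally { ps.close(); conn.close() }
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }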

Our in-house dashboarding and reporting framework (UDP: Universal Dashboarding Platform) has services that query MemSQL every minute and store the results in a UDP query cache, from where they are served to all connected clients over socket-based connections.
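
The sketch below shows the polling half of that design; the query registry, credentials, and cache shape are assumed, and the socket push to connected clients is omitted.

    import java.sql.DriverManager
    import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
    import scala.collection.mutable.ArrayBuffer

    object UdpQueryCacheSketch {
      // Hypothetical dashboard query; UDP's real registry is configuration-driven.
      val queries = Map(
        "traffic_per_minute" ->
          "SELECT ts, COUNT(*) FROM clickstream_minute GROUP BY ts ORDER BY ts DESC LIMIT 60")
      val cache = new ConcurrentHashMap[String, Seq[(String, Long)]]()

      def refresh(): Unit = {
        val conn = DriverManager.getConnection("jdbc:mysql://memsql:3306/meterial", "app", "secret")
        try queries.foreach { case (name, sql) =>
          val rs = conn.createStatement().executeQuery(sql)
          val rows = ArrayBuffer.empty[(String, Long)]
          while (rs.next()) rows += ((rs.getString(1), rs.getLong(2)))
          cache.put(name, rows.toSeq) // connected clients would be pushed this snapshot
        } finally conn.close()
      }

      def main(args: Array[String]): Unit = {
        // Re-run every registered query once a minute, matching the cadence above.
        Executors.newSingleThreadScheduledExecutor()
          .scheduleAtFixedRate(() => refresh(), 0, 1, TimeUnit.MINUTES)
      }
    }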

Results are displayed in the form of graphs, charts, tables, and numerous other widgets supported by UDP. The same UDP APIs are also used by Slack bots to post data into Slack channels in real time via Slack incoming webhooks.
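
A sketch of that last hop, posting a metric into a channel through a webhook URL; the URL and message are placeholders, and real code would JSON-escape the text.

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    object SlackNotifierSketch {
      // Placeholder webhook URL; each channel gets its own.
      val webhook = "https://hooks.slack.com/services/T000/B000/XXXX"

      def post(text: String): Unit = {
        val conn = new URL(webhook).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out = conn.getOutputStream
        out.write(s"""{"text":"$text"}""".getBytes(StandardCharsets.UTF_8)) // naive: no escaping
        out.close()
        conn.getResponseCode // forces the request and surfaces HTTP errors
        conn.disconnect()
      }
    }

    // Example: SlackNotifierSketch.post("traffic: 4.1M events in the last minute")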

As all transactional data currently lives in Redshift, and some reports need to combine commerce data with user data every 15 minutes, Meterial also enables ad-hoc analysis of this data for our team of data analysts. Every fifteen minutes, data from MemSQL for that interval is dumped into S3, from where it is loaded into Redshift using our S3-to-Redshift ETLs.
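
A sketch of the Redshift side of that ETL, assuming a 15-minute dump already sits in S3; the bucket, prefix, table, and IAM role are illustrative, and the Redshift JDBC driver is assumed on the classpath.

    import java.sql.DriverManager

    object S3ToRedshiftSketch {
      def load(intervalPrefix: String): Unit = {
        val conn = DriverManager.getConnection(
          "jdbc:redshift://redshift-cluster:5439/analytics", "etl", "secret")
        // COPY bulk-loads the interval's dump from S3 into the reporting table.
        try conn.createStatement().execute(
          s"""COPY clickstream_15min
             |FROM 's3://meterial-dumps/$intervalPrefix/'
             |IAM_ROLE 'arn:aws:iam::123456789012:role/meterial-etl'
             |FORMAT AS CSV""".stripMargin)
        finally conn.close()
      }
    }

    // Example: S3ToRedshiftSketch.load("2017-03-12T10-15")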

We selected Spark as our streaming engine because of its proven scale, strong community support, existing expertise within the team, and easy scalability with proper tuning.

For the real-time datastore, we ran PoCs on multiple databases and narrowed the choice down to MemSQL. MemSQL is a high-performance database, with an in-memory rowstore and a disk-based columnstore, that combines the horizontal scalability of distributed systems with the familiarity of SQL.

We have seen MemSQL handle very high concurrent reads and writes smoothly at our scale, given proper tuning.

Currently, we are exploring MemSQL columnstores as OLAP databases for our A/B testing framework (Morpheus) and segmentation platform (Personify).
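
To make the rowstore/columnstore distinction concrete, here is illustrative MemSQL DDL for both flavors (table and column names are made up): the default rowstore keeps data in memory for fast concurrent reads and writes, while a clustered columnstore key creates a disk-backed table better suited to OLAP scans.

    import java.sql.DriverManager

    object MemsqlSchemaSketch {
      // Default (no columnstore key): an in-memory rowstore table.
      val rowstoreDdl =
        """CREATE TABLE IF NOT EXISTS clickstream_minute (
          |  ts DATETIME NOT NULL,
          |  payload JSON NOT NULL,
          |  KEY (ts)
          |)""".stripMargin

      // A clustered columnstore key makes the table a disk-backed OLAP store.
      val columnstoreDdl =
        """CREATE TABLE IF NOT EXISTS clickstream_history (
          |  ts DATETIME NOT NULL,
          |  payload JSON NOT NULL,
          |  KEY (ts) USING CLUSTERED COLUMNSTORE
          |)""".stripMargin

      def main(args: Array[String]): Unit = {
        val conn = DriverManager.getConnection("jdbc:mysql://memsql:3306/meterial", "app", "secret")
        try {
          conn.createStatement().execute(rowstoreDdl)
          conn.createStatement().execute(columnstoreDdl)
        } finally conn.close()
      }
    }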

Sample UI Screenshots

[Screenshot: traffic dashboard]

[Screenshot: notification dashboard]

Using real-time data with predictive analytics, machine learning, and artificial intelligence opens altogether new doors to understanding user behavior, which paths and funnels lead to commerce, and more. Getting such information in real time can definitely help us boost commerce and take corrective action as soon as possible if something goes wrong. We are constantly working on improving and enhancing the platform.


Published at DZone with permission of Lesia Mryoshnickenko, DZone MVB. See the original article here.

