With the growing need for and popularity of AI, crawling data from the web and summarizing it according to business requirements is one of the most common problems many teams have to deal with. The problem becomes harder when the data source can be any website, or even a large set of websites, because building a generic solution that meets every business requirement is very difficult.
I have been dealing with such problems for the last couple of years, and recently I tried an approach that I found to be reliable, robust, optimized, and easy to maintain.
The approach is the 3 M approach, where the 3 Ms stand for the following:
Microservices
MessageQ
Multithreading
I used the microservices approach for separation of concerns. I segregated the AI algorithms and business logic from the engineering work to make the application easier for the team to maintain. There are two main reasons behind this segregation:
Usually, data scientists and ML researchers are good at AI tasks but not as strong at engineering tasks, and vice versa for product engineers.
For most data scientists and ML researchers, the preferred programming language is Python, because it offers a huge ecosystem of AI libraries and is relatively easy compared to other languages. On the other hand, for enterprise server-side applications, Java is the most popular choice, with robust and powerful frameworks like Spring and Hibernate that provide plenty of features to make developers' jobs easy and simple.
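In practice, this separation means the Java application only talks to the Python AI service over plain HTTP. Below is a minimal sketch of such a client; the `/summarize` endpoint and the JSON field name are illustrative assumptions, not the actual API of the application described here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AiServiceClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;

    public AiServiceClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Builds the JSON payload sent to the Python service.
    // Kept as a separate method so it can be tested without a running server.
    static String buildPayload(String url) {
        return "{\"url\": \"" + url + "\"}";
    }

    // Posts a request to the (hypothetical) /summarize endpoint of the
    // Django-based AI service and returns its response body.
    public String summarize(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/summarize"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildPayload(url)))
                .build();
        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Because the boundary is just HTTP and JSON, the AI team can change models and libraries on the Python side without touching the Java codebase.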
Crawling data from the web is a tedious and unreliable process, as every website differs from every other. The time taken to crawl also varies significantly from one website to another and depends on many factors. To make the process reliable, I used RabbitMQ to queue the requests and process them asynchronously. This lets the application handle requests in a controlled way, and the user does not have to wait for a long time while a request is being processed.
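With Spring AMQP, the queueing side of this setup needs very little code: the web layer publishes the request and returns immediately, and a listener consumes it asynchronously. The sketch below shows the shape of such a configuration; the queue name `crawl-requests` is an illustrative assumption, not the name used in the original application.

```java
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;

@Configuration
class CrawlQueueConfig {
    @Bean
    Queue crawlQueue() {
        // Durable queue, so queued crawl requests survive a broker restart.
        return new Queue("crawl-requests", true);
    }
}

@Component
class CrawlRequestPublisher {
    private final RabbitTemplate rabbitTemplate;

    CrawlRequestPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // The HTTP layer calls this and responds to the user right away;
    // the actual crawl happens later, off the request thread.
    void enqueue(String url) {
        rabbitTemplate.convertAndSend("crawl-requests", url);
    }
}

@Component
class CrawlRequestListener {
    // Invoked asynchronously by Spring AMQP for each queued request.
    @RabbitListener(queues = "crawl-requests")
    void handle(String url) {
        // hand the URL off to the crawler / AI service here
    }
}
```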
Multithreading is used in the application for parallel processing of the requests queued in the message queue. I used a thread pool with a maximum size of 30. The MessageQ listener continuously observes the MessageQ and asks the thread manager to start a thread whenever one is available in the pool.
The entire application is divided into the three components listed below, along with the technologies used in each:
Advisor-app: Java, Spring Boot, Spring Data, MySQL, and MongoDB
Advisor-Msg: Spring Boot, and RabbitMQ
Advisor-AI: Python, Django, NLP, NER, and various AI algorithms
Below is the architecture of the application: