Scalable Support Request Analysis Using Embeddings, HDBSCAN, and Tiny LLMs
Effective use of historical data to classify support requests into buckets and identify emerging or slightly different patterns.
Data Exploration
Analyze the historical data to understand data quality, recurring key phrases, noise, and other patterns. Also, examine meta-attributes such as manual tagging, assigned department, assigned personnel, etc. Use spaCy or any other library to identify the most common words. This will indicate which words need to be masked, replaced, or normalized.
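As a rough illustration of this exploration step, the sketch below counts the most frequent tokens across a few requests. It uses only the Python standard library as a stand-in for spaCy; the sample tickets, the tiny stop-word list, and the function name top_terms are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; spaCy ships a much fuller one).
STOP_WORDS = {"the", "a", "an", "is", "to", "my", "i", "and", "on", "in", "for", "of", "not"}

def top_terms(texts, n=5):
    """Return the n most frequent non-stop-word tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical sample requests.
tickets = [
    "Cannot login to the portal after password reset",
    "Password reset email not received",
    "Login page is blank on the portal",
]
print(top_terms(tickets, n=4))
```

Terms that dominate this list without carrying meaning for the domain are candidates for masking or normalization in the next step.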
Data Cleansing and Enrichment
Identify domain-specific noise and aliases, and define regex rules to remove or standardize them.
For product-related support requests, the same product may be referred to by different names; these can be masked with a common placeholder such as "product". Identify any other words or phrases that need masking as well.
This prevents the clustering algorithm from forming separate clusters for what is really the same issue described with irrelevant variations, which in turn improves accuracy.
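The masking rules described above might be sketched as an ordered list of regex substitutions. The product aliases ("Acme Cloud", "AC Suite"), the extra placeholders (email, ticket_id), and the function name clean are all hypothetical; real rules would come out of the data exploration step.

```python
import re

# Hypothetical rules: (compiled pattern, placeholder). Order matters.
RULES = [
    # Different names for the same product collapse to one placeholder.
    (re.compile(r"\b(?:acme\s*cloud|ac\s*suite)\b", re.I), "product"),
    # Email addresses are noise for clustering purposes.
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "email"),
    # Ticket or case references such as "Ticket #4521" or "case 99".
    (re.compile(r"\b(?:ticket|case)\s*#?\d+\b", re.I), "ticket_id"),
    # Collapse runs of whitespace last.
    (re.compile(r"\s+"), " "),
]

def clean(text):
    """Apply every rule in order, then trim and lowercase the result."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text.strip().lower()

print(clean("AcmeCloud crashes, see Ticket #4521.  Contact joe@example.com"))
```

Keeping the rules in a plain list makes it easy to review their effect on a sample of records and adjust them before applying them to the whole dataset.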
Apply to the Dataset
Keep one copy of the raw data before applying the cleansing and enrichment steps. After applying the transformations, review the effectiveness of the processed data and make any necessary adjustments.
Once everything is validated, embed the data into vectors using open-source embedding models. Select the appropriate model based on the domain data. A few examples are in the screenshot below:

Context Weightage
In some scenarios, we may need to provide additional weightage to certain keywords or phrases based on the domain and the nature of the data. To achieve this, identify those key phrases and create a standardized string representation for each row record. Then generate a separate vector embedding for this context.
Create a new function to combine the row data vector (alpha) and the context vector (beta), where a configurable parameter can be used to adjust the relative weightage between them.
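One possible form of such a combining function is a weighted sum of the two normalized vectors. This is only a sketch: the article does not specify the formula, so the linear blend, the parameter name context_weight, and its default of 0.3 are assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def combine(alpha, beta, context_weight=0.3):
    """Blend the row vector (alpha) with the context vector (beta).

    context_weight is the configurable share given to the context vector:
    0.0 ignores the context, 1.0 uses only the context.
    """
    if len(alpha) != len(beta):
        raise ValueError("both vectors must have the same dimension")
    if not 0.0 <= context_weight <= 1.0:
        raise ValueError("context_weight must be between 0 and 1")
    a = l2_normalize(alpha)
    b = l2_normalize(beta)
    blended = [(1 - context_weight) * x + context_weight * y for x, y in zip(a, b)]
    # Re-normalize so cosine-based comparisons stay on the same scale.
    return l2_normalize(blended)

print(combine([1.0, 0.0], [0.0, 1.0], context_weight=0.5))
```

Normalizing both inputs first keeps one vector from dominating simply because its embedding model produces larger magnitudes. This blend requires both embeddings to share a dimension; concatenating scaled vectors is an alternative when they do not.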

Extracting Context Information Using Tiny LLM
Use an appropriate lightweight LLM to extract context-specific information based on the domain. For example, in the medical domain, medical entities are key elements. If possible, generate structured contextual information using the LLM. This will improve accuracy, and you can fine-tune the weightage through multiple iterations.

Identify a Suitable ML Algorithm for Clustering
The HDBSCAN algorithm provides an effective approach for clustering, as it can identify clusters of varying density and also classify noise points as a separate group. Adjusting the parameters is critical for achieving good results. Therefore, perform multiple iterations and experiments to identify the most suitable parameter settings for the specific use case.
Choosing min_cluster_size is key. You can determine it either using the silhouette score or based on domain knowledge and the data. Setting a high value tends to increase the number of records labeled as noise, because smaller groups no longer qualify as clusters.

K-means is also a good option when the number of clusters is known in advance, but it assigns every record to a cluster and has no notion of noise. Within each cluster, you can calculate each record's distance from the centroid to identify outliers, or the outer layer. For example, if you have a clear idea of the support request categories, you can create that number of clusters and then inspect the outer layer of each one to detect new trends in the requests.
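The centroid-distance idea can be sketched for a single cluster as follows. The cutoff rule (treat the farthest 20% of records as the outer layer) and the names centroid, outer_layer, and keep_fraction are assumptions for illustration; the article does not prescribe a threshold.

```python
import math

def centroid(points):
    """Component-wise mean of the points in one cluster."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def outer_layer(points, keep_fraction=0.8):
    """Return the indices of the points farthest from the centroid.

    The closest keep_fraction of points are treated as the core;
    everything beyond that is reported as the outer layer.
    """
    c = centroid(points)
    distances = [math.dist(p, c) for p in points]
    order = sorted(range(len(points)), key=lambda i: distances[i])
    cutoff = int(len(points) * keep_fraction)
    return sorted(order[cutoff:])

# Hypothetical 2-D embeddings for the records of one cluster.
cluster_points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
print(outer_layer(cluster_points))
```

In practice the points would be the embedding vectors of the records assigned to one K-means cluster, and the function would run once per cluster.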

Noise Cluster Drilldown
Depending on the data volume and parameter settings, the algorithm may label more records as noise than expected. In that case, run a second, drill-down clustering pass on the noise records with a smaller min_cluster_size. This produces another set of buckets made up of smaller clusters. Some records will still be noise; these can be treated as new trends or rare occurrences, and an LLM can be used to generate details for them.
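After the drill-down pass there are two sets of labels to reconcile. The sketch below shows one way to fold the second-pass labels back into the first-pass labels without ID collisions. It assumes the usual HDBSCAN convention that -1 means noise, and that second_labels was produced by re-clustering only the first-pass noise records, in their original order; the function name merge_drilldown is hypothetical.

```python
def merge_drilldown(first_labels, second_labels):
    """Merge drill-down labels into the first-pass labels.

    first_labels:  one label per record, -1 meaning noise.
    second_labels: one label per first-pass noise record, in order.
    Second-pass cluster IDs are shifted past the first-pass IDs.
    """
    noise_positions = [i for i, lab in enumerate(first_labels) if lab == -1]
    if len(noise_positions) != len(second_labels):
        raise ValueError("second_labels must cover exactly the first-pass noise records")
    # First free cluster ID after the first pass.
    offset = max(first_labels, default=-1) + 1
    merged = list(first_labels)
    for pos, lab in zip(noise_positions, second_labels):
        # Records that are still noise after the drill-down stay at -1.
        merged[pos] = -1 if lab == -1 else offset + lab
    return merged

first = [0, -1, 1, -1, -1, 0]
second = [0, -1, 0]  # labels for the three noise records above
print(merge_drilldown(first, second))
```

Records that remain at -1 after merging are the ones to review as possible new trends or rare occurrences.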
Generate Title and Short Description Using an LLM
Once the two levels of clusters are identified, use a small LLM to generate a title and a short description for each cluster. You can either use an LLM that can run on a CPU or use OpenAI or any other subscription-based LLM service.
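Whichever LLM is used, the input is typically a handful of representative records per cluster. The sketch below only builds the prompt text; the wording of the prompt, the sample limits, and the function name build_prompt are assumptions, and the call to the LLM itself is deliberately left out because it depends on the chosen model or service.

```python
def build_prompt(cluster_id, samples, max_samples=5, max_chars=300):
    """Build a title/description prompt from a few records of one cluster."""
    # Cap the number and length of samples to keep the prompt small,
    # which matters most for small CPU-hosted models.
    lines = [f"- {text[:max_chars]}" for text in samples[:max_samples]]
    return (
        f"The following support requests belong to cluster {cluster_id}:\n"
        + "\n".join(lines)
        + "\n\nReturn a title of at most 8 words and a one-sentence description."
    )

prompt = build_prompt(3, ["Password reset email not received", "Reset link expired"])
print(prompt)
```

Sending a few capped samples per cluster, rather than every record, is what keeps LLM usage proportional to the number of clusters instead of the number of requests.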

Incremental Data or New Request Row Data
After building the clustering model on historical data (in the hdbscan library, fit it with prediction_data=True so that prediction is possible later), new records can be validated against the existing clusters using HDBSCAN prediction and cosine similarity. Based on the results, the new request can be assigned to the appropriate cluster or categorized as noise.
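The cosine similarity part of this check can be sketched with plain Python. The HDBSCAN side would typically go through the hdbscan library's approximate_predict function and is not shown here. The per-cluster centroid vectors, the 0.8 threshold, and the function names are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign(new_vec, centroids, threshold=0.8):
    """Return (cluster_id, similarity) for the closest centroid.

    centroids maps cluster_id -> centroid vector. If even the best match
    falls below the threshold, the record is reported as noise (-1).
    """
    best_id, best_sim = -1, -1.0
    for cluster_id, center in centroids.items():
        sim = cosine(new_vec, center)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_sim < threshold:
        return -1, best_sim
    return best_id, best_sim

# Hypothetical centroids for two existing clusters.
centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
print(assign([0.9, 0.1], centroids))
print(assign([1.0, 1.0], centroids))
```

Comparing this result with the label from HDBSCAN prediction gives two independent signals; disagreement between them is a reasonable cue to treat the record as noise or send it for review.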

Outer Layer or Slightly Different From the Core Cluster
Define a scale to separate close and distant records within each cluster, based on the membership probability that HDBSCAN reports for each record. A value of 1 represents 100% probability; such records form the core of the cluster. Values between 0.01 and 0.49 indicate records that are farther from the core and have a lower probability of belonging to it. These records may represent emerging trends or requests that are slightly different from the core pattern.

Based on the above scale or segregation, we can use an LLM to generate a detailed report. For example, the 100% probability data will represent the core support request records, while the other sets will slightly differ in their requests compared to the core support context.
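The probability scale in this section might be applied as in the sketch below before any records are handed to an LLM. Only the core band (probability 1) and the outer band (0.01 to 0.49) come from the text; the near_core band for the remaining range, the treatment of values below 0.01 as noise, and the function names are assumptions.

```python
def band(probability):
    """Map a cluster membership probability to a named band."""
    if probability >= 1.0:
        return "core"        # 100% probability: core of the cluster
    if probability >= 0.5:
        return "near_core"   # assumption: this range is not defined in the text
    if probability >= 0.01:
        return "outer"       # farther from the core, possible emerging trend
    return "noise"           # assumption: effectively no membership

def group_by_band(records, probabilities):
    """Group the records of one cluster by probability band."""
    groups = {"core": [], "near_core": [], "outer": [], "noise": []}
    for record, probability in zip(records, probabilities):
        groups[band(probability)].append(record)
    return groups

# Hypothetical records and probabilities for one cluster.
groups = group_by_band(["a", "b", "c", "d"], [1.0, 0.7, 0.2, 0.0])
print(groups)
```

Each band can then be summarized separately, so the report contrasts the core requests with those in the outer band.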
Conclusion
The above approach helps eliminate most LLM-related costs and is highly feasible for processing large datasets. The risk of hallucination is also limited, as the solution does not rely entirely on an LLM.
When implementing this in a production-level system, it is important to continuously evaluate the noise data and determine when the clusters need to be rebuilt from scratch, especially after a significant number of new records have been incrementally assigned over time. A small LLM running on a CPU can add noticeable latency, so make sure you understand the use case clearly before deciding to use one. Design the architecture accordingly; for example, move LLM-related actions to sequential backend processes and, if needed, deploy multiple LLM instances for load balancing.