
Text Similarity: Python-sklearn on MongoDB Collection

Check out some Python code that can calculate the similarity of an indexed field between all the documents of a MongoDB collection.

By Anis Hajri · Jun. 11, 19 · Code Snippet

Overview

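In this article, I set up a Python script that calculates the similarity of an indexed text field between every pair of documents in a MongoDB collection. Similarity is measured as the Euclidean distance between the word-count vectors of the two texts, and pairs whose distance falls below a threshold are saved to a second collection. To improve performance, I split the work across four threads.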
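To make the distance measure concrete, here is a minimal sketch that uses only the standard library. It approximates what sklearn's CountVectorizer (with default settings: lowercasing, tokens of two or more word characters) followed by euclidean_distances computes for a pair of texts. The helper names bag_of_words and count_distance are hypothetical and exist only for this illustration; the script itself relies on sklearn.

```python
import math
import re
from collections import Counter

# Approximates CountVectorizer's default tokenizer:
# lowercase the text, keep tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(text):
    """Return a word -> count mapping for one text (hypothetical helper)."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


def count_distance(text_a, text_b):
    """Euclidean distance between the word-count vectors of two texts."""
    counts_a = bag_of_words(text_a)
    counts_b = bag_of_words(text_b)
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for missing words, so both vectors share one vocabulary.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))


if __name__ == "__main__":
    # The two texts differ in one word each ("mat" vs "hat").
    print(count_distance("the cat sat on the mat", "the cat sat on the hat"))
```

A distance of 0 means the two texts contain the same words with the same counts; the more the word counts differ, the larger the distance.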

The script is detailed below; I hope it will be useful.

Python Script

import multiprocessing
import threading

import pymongo
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

NUM_THREADS = 4
# Only pairs closer than this Euclidean distance are stored.
DISTANCE_THRESHOLD = 4


class SimilarityThread(threading.Thread):
    """Compares the documents in [startIndex, endIndex) with every later document."""

    def __init__(self, threadID, data_array, startIndex, endIndex, similarity_collection):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.data_array = data_array
        self.startIndex = startIndex
        self.endIndex = endIndex
        self.similarity_collection = similarity_collection

    def run(self):
        calculateSimilarity(self.data_array, self.startIndex, self.endIndex,
                            self.similarity_collection)


def calculateDistance(txt1, txt2):
    # euclidean_distances returns a 1x1 array for two single-row inputs.
    return float(euclidean_distances(txt1, txt2)[0][0])


def calculateSimilarity(data_array, startIndex, endIndex, similarity_collection):
    vectorizer = CountVectorizer()
    totalSize = len(data_array)
    for idx in range(startIndex, endIndex):
        h = data_array[idx]
        # Compare with every later document, not only those in this thread's slice.
        for idx1 in range(idx + 1, totalSize):
            h1 = data_array[idx1]
            # Refit on the pair so both texts share one vocabulary.
            features = vectorizer.fit_transform([h['text'], h1['text']])
            distance = calculateDistance(features[0], features[1])
            hSimilarity = {'idOrigin': h['id'], 'idTarget': h1['id'], 'distance': distance}
            print(hSimilarity)
            if distance < DISTANCE_THRESHOLD:
                print("Distance ====> %.2f " % distance)
                similarity_collection.insert_one(hSimilarity)


def processTextSimilarity(totalSize, data_array, similarity_collection):
    num_cores = multiprocessing.cpu_count()
    print(":::num cores ==> %d " % num_cores)
    # Ceiling division so the slices together cover every document.
    chunkSize = -(-totalSize // NUM_THREADS)
    threads = []
    for threadID in range(1, NUM_THREADS + 1):
        startIndex = (threadID - 1) * chunkSize
        endIndex = min(startIndex + chunkSize, totalSize)
        thread = SimilarityThread(threadID, data_array, startIndex, endIndex,
                                  similarity_collection)
        thread.start()
        threads.append(thread)

    # Wait for all threads to complete
    for t in threads:
        t.join()


def main():
    print('****** Text Similarity::start ******')
    connection = pymongo.MongoClient("mongodb://localhost")
    db = connection.kalamokomnoor
    article = db.article
    article_similarity = db.article_similarity

    # Load the documents once: a cursor has no len() and should not be shared across threads.
    data_array = list(article.find({}).sort("id", pymongo.ASCENDING))
    totalSize = len(data_array)
    print('###### :: totalSize : %d ' % totalSize)

    processTextSimilarity(totalSize, data_array, article_similarity)

    print('****** Text Similarity::Ending ******')


if __name__ == '__main__':
    main()
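
When the documents are divided among several workers, it is easy to lose pairs whose two documents land in different slices. The following sketch, which needs neither MongoDB nor sklearn, shows one way to split the index range and confirm that every pair (i, j) with i < j is produced exactly once. The helper names chunk_bounds and pairs_for_chunk are hypothetical, and the even split via divmod is an assumption for illustration, not necessarily how the script sizes its slices.

```python
def chunk_bounds(total, workers):
    """Split range(total) into `workers` contiguous [start, end) slices."""
    base, extra = divmod(total, workers)
    bounds = []
    start = 0
    for i in range(workers):
        # The first `extra` slices take one additional item each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def pairs_for_chunk(start, end, total):
    """Pairs (i, j) with i inside the slice and j any later index."""
    return [(i, j) for i in range(start, end) for j in range(i + 1, total)]


if __name__ == "__main__":
    total = 10
    all_pairs = []
    for start, end in chunk_bounds(total, 4):
        all_pairs.extend(pairs_for_chunk(start, end, total))
    # Every unordered pair should appear exactly once: total * (total - 1) / 2.
    print(len(all_pairs), len(set(all_pairs)))
```

Note that slices of equal size do not carry equal work here: the first slice compares its documents with almost the whole collection, while the last slice has very few later documents to compare with.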




