Algorithms and NoSQL (Part 3): the MongoDB Aggregation Framework

By Davy Suvee · Feb. 08, 12


In part 1 of this article, I described the use of MongoDB to solve a specific chemoinformatics problem, namely the computation of molecular similarities through Tanimoto coefficients. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data-transfer overhead. To circumvent this problem, part 2 of this article describes the use of MongoDB's built-in map-reduce functionality to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through JavaScript is rather slow, and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.
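As a refresher, the Tanimoto coefficient of two compounds is the number of shared fingerprints divided by the total number of distinct fingerprints across both. A minimal pure-Python sketch (the fingerprint IDs are invented for illustration):

```python
def tanimoto(fingerprints_a, fingerprints_b):
    """Tanimoto coefficient: |A ∩ B| / |A ∪ B| for two fingerprint sets."""
    a, b = set(fingerprints_a), set(fingerprints_b)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Two toy compounds sharing 2 of their 4 distinct fingerprints.
print(tanimoto([1960, 15111, 5186], [1960, 15111, 756]))  # 2 / 4 = 0.5
```

The remainder of the article expresses exactly this computation as a pipeline of MongoDB operators.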

Recently, MongoDB introduced its new aggregation framework. This framework provides a simpler solution for calculating aggregate values, without having to rely upon the powerful but heavyweight map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB aggregation framework. The complete source code can be found on the Datablend public GitHub repository.

1. MongoDB Aggregation Framework

The MongoDB aggregation framework draws on the well-known Linux pipeline concept, where the output of one command is piped, or redirected, to be used as the input of the next command. In the case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as $match, $limit and $skip, take a document as input and output the same document if a certain set of criteria is met. Other operators, such as $project and $unwind, take a single document as input and reshape that document or emit multiple documents based upon a certain projection. Finally, the $group operator takes multiple documents as input and groups them into a single document by aggregating the relevant values. Expressions can be used within some of these operators to calculate new values or execute string operations.

Multiple operators are combined into a single pipeline that is applied to a list of documents. The pipeline itself is executed as a MongoDB command, resulting in a single MongoDB document that contains an array of all documents that came out at the end of the pipeline. The next paragraph details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.
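The pipeline idea itself is easy to mimic outside MongoDB: each operator is a function from a stream of documents to a stream of documents, and the pipeline is just their composition. A hypothetical sketch (the toy operators and documents are invented, not MongoDB code):

```python
from functools import reduce

def run_pipeline(documents, operators):
    """Feed the document stream through each operator in turn, Unix-pipe style."""
    return reduce(lambda docs, op: op(docs), operators, list(documents))

# Toy operators standing in for $match and $limit.
match_even = lambda docs: [d for d in docs if d["n"] % 2 == 0]
limit_two = lambda docs: docs[:2]

result = run_pipeline([{"n": i} for i in range(10)], [match_even, limit_two])
print(result)  # [{'n': 0}, {'n': 2}]
```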

2. Molecular Similarity Pipeline

When applying a pipeline to a certain collection, all documents contained within this collection are given as input to the first operator. It is considered best practice to filter this list as quickly as possible to limit the total number of documents that are passed through the pipeline. In our case, this means filtering out all documents that will never be able to satisfy the target Tanimoto coefficient. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the $match operator looks as follows:

{ "$match" :
    { "fingerprint_count" : { "$gte" : 32 , "$lte" : 50}}
}

Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the $match operator is able to use the index that we have defined for the fingerprint_count property. For computing the Tanimoto coefficient, we need to calculate the number of shared fingerprints between a certain input compound and the compound we are targeting. In order to be able to work at the fingerprint level, we use the $unwind operator. $unwind peels off the elements of an array one by one, returning a stream of documents in which the specified array is replaced by one of its elements. In our case, we apply $unwind to the fingerprints property. Hence, each compound document will result in n compound documents, where n is the number of unique fingerprints contained within the compound.

{ "$unwind" : "$fingerprints"}

In order to calculate the number of shared fingerprints, we start off by filtering out all documents that do not have a fingerprint in the list of fingerprints of the target compound. To do so, we again apply the $match operator, this time filtering on the fingerprints property, retaining only documents that contain a fingerprint that is in the list of target fingerprints.

{ "$match" :
    { "fingerprints" :
        { "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900 , ..., 2473] }
    }
}

As we only match fingerprints that are in the list of target fingerprints, the output can be used to count the total number of shared fingerprints. For this, we apply the $group operator on the compound_cid property, through which we create a new type of document containing the number of matching fingerprints (by summing the number of occurrences), the total number of fingerprints of the input compound, and the smiles representation.

{ "$group" :
    { "_id" : "$compound_cid" ,
      "fingerprintmatches" : { "$sum" : 1} ,
      "totalcount" : { "$first" : "$fingerprint_count"} ,
      "smiles" : { "$first" : "$smiles"}
    }
}
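In plain Python terms, this $group stage behaves roughly as follows on the unwound, $in-filtered document stream (the sample documents are invented):

```python
def group_matches(unwound_docs):
    """Simulate the $group stage: count matching fingerprints per compound
    ($sum: 1), keeping the first-seen fingerprint_count and smiles ($first)."""
    groups = {}
    for doc in unwound_docs:
        g = groups.setdefault(doc["compound_cid"], {
            "_id": doc["compound_cid"],
            "fingerprintmatches": 0,                 # accumulated via $sum: 1
            "totalcount": doc["fingerprint_count"],  # $first semantics
            "smiles": doc["smiles"],                 # $first semantics
        })
        g["fingerprintmatches"] += 1
    return list(groups.values())

docs = [
    {"compound_cid": 1, "fingerprint_count": 3, "smiles": "CCO", "fingerprints": 1960},
    {"compound_cid": 1, "fingerprint_count": 3, "smiles": "CCO", "fingerprints": 15111},
]
print(group_matches(docs)[0]["fingerprintmatches"])  # 2
```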

We now have all parameters in place to calculate the Tanimoto coefficient. For this we use the $project operator which, in addition to copying the compound id and smiles properties, also adds a new, computed property named tanimoto.

{ "$project" :
    { "_id" : 1 ,
      "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract" : [ { "$add" : [ 40 , "$totalcount"] } , "$fingerprintmatches"] } ] } ,
      "smiles" : 1
    }
}
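The expression implements the set-based Tanimoto formula Nab / (Na + Nb − Nab), with Na = 40 (the target compound's fingerprint count) hard-coded into the pipeline. A quick sanity check in Python:

```python
def projected_tanimoto(fingerprintmatches, totalcount, target_count=40):
    """Mirror the $project expression: shared / (target + candidate - shared)."""
    return fingerprintmatches / (target_count + totalcount - fingerprintmatches)

# A candidate with 45 fingerprints sharing 38 with the 40-fingerprint target:
print(projected_tanimoto(38, 45))  # 38 / (40 + 45 - 38) = 38/47 ≈ 0.808
```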

As we are only interested in compounds that have a target Tanimoto coefficient of 0.8 or higher, we apply an additional $match operator to filter out all the ones that do not reach this coefficient.

{ "$match" :
    { "tanimoto" : { "$gte" : 0.8}
}

The full pipeline command can be found below.

{ "aggregate" : "compounds" ,
  "pipeline" : [
    { "$match" :
        { "fingerprint_count" : { "$gte" : 32 , "$lte" : 50} }
    },
    { "$unwind" : "$fingerprints"},
    { "$match" :
        { "fingerprints" :
            { "$in" : [ 1960 , 15111 , 5186 , 5371 , 756 , 1015 , 1018 , 338 , 325 , 776 , 3900, ... , 2473] }
        }
    },
    { "$group" :
        { "_id" : "$compound_cid" ,
          "fingerprintmatches" : { "$sum" : 1} ,
          "totalcount" : { "$first" : "$fingerprint_count"} ,
          "smiles" : { "$first" : "$smiles"}
        }
    },
    { "$project" :
        { "_id" : 1 ,
          "tanimoto" : { "$divide" : [ "$fingerprintmatches" , { "$subtract" : [ { "$add" : [ 89 , "$totalcount"]} , "$fingerprintmatches"] } ] } ,
          "smiles" : 1
        }
    },
    { "$match" :
       { "tanimoto" : { "$gte" : 0.05} }
    } ]
}

The output of this pipeline contains a list of compounds which have a Tanimoto coefficient of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:

(Figure: visual representation of the molecular similarity pipeline)
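To double-check the pipeline's logic end to end, the stages can be replayed in pure Python on a handful of invented compounds; the result should match a direct set-based Tanimoto computation:

```python
def tanimoto_pipeline(compounds, target_fps, threshold=0.8):
    """Replay the aggregation pipeline in Python: the fingerprint-count $match,
    shared-fingerprint counting ($unwind + $in + $group), the Tanimoto
    $project, and the final $match against the threshold."""
    target = set(target_fps)
    na = len(target)
    lo, hi = na * threshold, na / threshold  # same bounds as the first $match
    results = []
    for c in compounds:
        n = len(c["fingerprints"])
        if not (lo <= n <= hi):
            continue                                    # first $match stage
        shared = len(target & set(c["fingerprints"]))   # $unwind + $in + $group
        t = shared / (na + n - shared)                  # $project expression
        if t >= threshold:                              # final $match stage
            results.append((c["compound_cid"], t))
    return results

compounds = [
    {"compound_cid": 1, "fingerprints": list(range(40))},        # identical to target
    {"compound_cid": 2, "fingerprints": list(range(5, 45))},     # 35 shared: below 0.8
    {"compound_cid": 3, "fingerprints": list(range(100, 120))},  # filtered out early
]
print(tanimoto_pipeline(compounds, list(range(40))))  # [(1, 1.0)]
```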

3. Conclusion

The new MongoDB aggregation framework provides a set of easy-to-use operators that allow users to express map-reduce-style algorithms in a more concise fashion. The pipeline concept beneath it offers an intuitive way of processing data. It is no surprise that this pipeline paradigm has been adopted by various NoSQL approaches, including TinkerPop's Gremlin framework and Neo4j's Cypher implementation.

Performance-wise, the pipeline solution is a major improvement upon the map-reduce implementation. The employed operators are natively supported by the MongoDB platform, which results in a huge performance improvement with respect to interpreted JavaScript. As the aggregation framework is also able to work in a sharded environment, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. Great work from the MongoDB team!


Source: http://datablend.be/?p=1400
