HTAP Using a Star Query on MongoDB Atlas Search Index

The MongoDB document model for OLTP mirrors business objects and resembles a star schema. With Atlas Search indexes, it enables HTAP without a separate analytical DB.

By Franck Pachot · Jun. 12, 25 · Tutorial

MongoDB is often chosen for online transaction processing (OLTP) due to its flexible document model, which can align with domain-specific data structures and access patterns. Beyond basic transactional workloads, MongoDB also supports search capabilities through Atlas Search, built on Apache Lucene. When combined with the aggregation pipeline, this enables limited online analytical processing (OLAP) functionality suitable for near-real-time analytics. Because MongoDB uses a unified document model, these analytical queries can run without restructuring the data, allowing for certain hybrid transactional and analytical (HTAP) workloads. This article explores such a use case in the context of healthcare.

Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, to support efficient ad-hoc queries. This typically requires a dimensional schema, or star schema, which is distinct from the normalized operational schema used for transactional updates. MongoDB can support a similar querying approach using its document schema, which is often designed for operational use. By adding an Atlas Search index to the collection storing transactional data, certain analytical queries can be supported without restructuring the schema.

To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilized the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning.

Import the Dataset

The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records.

I downloaded the CSV files and uncompressed them:

curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw"

unzip medsynora-dw.zip


I imported each file into a collection, using mongoimport from the MongoDB Database Tools:


for i in "MedSynora DW"/*.csv
do
 mongoimport -d "MedSynoraDW" --file="$i"  --type=csv --headerline -c "$(basename "$i" .csv)" -j 8                                                              
done


For this demo, I'm interested in two fact tables: FactEncounter and FactLabTest. Here are the fields described in the file headers:

# head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv

==> MedSynora DW/FactEncounter.csv <==
Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent

==> MedSynora DW/FactLabTests.csv <==
Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue


The fact tables referenced the following dimensions:


# head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv

==> MedSynora DW/DimDisease.csv <==
Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit

==> MedSynora DW/DimDoctor.csv <==
Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count

==> MedSynora DW/DimInsurance.csv <==
InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments

==> MedSynora DW/DimPatient.csv <==
Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type

==> MedSynora DW/DimRoom.csv <==
RoomKey,Care_Level,Room Type


Here is the dimensional model, often referred to as a "star schema" because the fact tables sit at the center and reference the dimensions. Because of normalization, a one-to-many composition within the facts is split across two CSV files so that it fits into two SQL tables:

(Figure: Star schema with facts and dimensions. The facts are stored in two tables in the CSV files or an SQL database, but in a single collection in MongoDB, which holds the fact measures and the dimension keys referencing the keys of the dimension collections.)

MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship.

Since this is the appropriate data model for an operational database holding such data, I created a new collection with an aggregation pipeline, replacing the two collections imported from the normalized CSV files:


db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 });

db.FactEncounter.aggregate([
  {
    $lookup: {
      from: "FactLabTests",
      localField: "Encounter_ID",
      foreignField: "Encounter_ID",
      as: "LabTests"
    }
  },
  {
    $addFields: {
      LabTests: {
        $map: {
          input: "$LabTests",
          as: "test",
          in: {
            Phase: "$$test.Phase",
            LabType: "$$test.LabType",
            TestName: "$$test.TestName",
            TestValue: "$$test.TestValue"
          }
        }
      }
    }
  },
  {
    $out: "FactEncounterLabTests"
  }
]);


Here is how one document looks:

AtlasLocalDev atlas [direct: primary] MedSynoraDW> db.FactEncounterLabTests.find().limit(1)
[
  {
    _id: ObjectId('67fc3d2f40d2b3c843949c97'),
    Encounter_ID: 2158,
    Patient_ID: 'TR479',
    Disease_ID: 1632,
    ResponsibleDoctorID: 905,
    InsuranceKey: 82,
    RoomKey: 203,
    CheckinDate: '2024-01-23 11:09:00',
    CheckoutDate: '2024-03-29 17:00:00',
    CheckinDateKey: 20240123,
    CheckoutDateKey: 20240329,
    Patient_Severity_Score: 63.2,
    RadiologyType: 'None',
    RadiologyProcedureCount: 0,
    EndoscopyType: 'None',
    EndoscopyProcedureCount: 0,
    CompanionPresent: 'True',
    LabTests: [
      {
        Phase: 'Admission',
        LabType: 'CBC',
        TestName: 'Lymphocytes_abs (10^3/µl)',
        TestValue: 1.34
      },
      {
        Phase: 'Admission',
        LabType: 'Chem',
        TestName: 'ALT (U/l)',
        TestValue: 20.5
      },
      {
        Phase: 'Admission',
        LabType: 'Lipids',
        TestName: 'Triglycerides (mg/dl)',
        TestValue: 129.1
      },
      {
        Phase: 'Discharge',
        LabType: 'CBC',
        TestName: 'RBC (10^6/µl)',
        TestValue: 4.08
      },
...


In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections.
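
For instance, a minimal sketch of those indexes, using the key fields from the CSV headers, could look like the following. It assumes the dimension keys are unique, as expected for dimension tables, so that the later $lookup stages join back efficiently:

db.DimPatient.createIndex({ Patient_ID: 1 }, { unique: true });
db.DimDoctor.createIndex({ Doctor_ID: 1 }, { unique: true });
db.DimDisease.createIndex({ Disease_ID: 1 }, { unique: true });
db.DimInsurance.createIndex({ InsuranceKey: 1 }, { unique: true });
db.DimRoom.createIndex({ RoomKey: 1 }, { unique: true });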

Atlas Search Index

Search indexes differ from regular indexes, which rely on a single composite key: a search index can cover multiple fields without requiring a specific key order. This makes them well suited for ad-hoc queries, where the filtering dimensions are not known in advance.

I created a single Atlas Search index encompassing all dimensions and measures I intended to use in predicates, including those in embedded documents.

db.FactEncounterLabTests.createSearchIndex(
  "SearchFactEncounterLabTests", {
    mappings: {
      dynamic: false,
      fields: {
        "Encounter_ID":        { "type": "number"  },
        "Patient_ID":          { "type": "token"  },
        "Disease_ID":          { "type": "number"  },
        "InsuranceKey":        { "type": "number"  },
        "RoomKey":             { "type": "number"  },
        "ResponsibleDoctorID": { "type": "number" },
        "CheckinDate":         { "type": "token"  },
        "CheckoutDate":        { "type": "token"  },
        "LabTests":            {
          "type": "document" , fields: {
            "Phase":           { "type": "token"  },
            "LabType":         { "type": "token"  },
            "TestName":        { "type": "token"  },
            "TestValue":       { "type": "number" }
          }
        }
      }
    }
  }
);


Since I don't need full-text search on the keys, I mapped the character-string keys as token and the integer keys as number. The keys are generally used for equality predicates, but some can also serve range predicates when the format permits, such as the check-in and check-out dates, which are stored as sortable date-time strings.
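
As a hedged sketch of such a range predicate (assuming the Atlas Search range operator accepts string/token paths, which recent Atlas versions support), a filter on encounters checked out during 2024 could be pushed into the same index search, since the date strings sort lexicographically:

db.FactEncounterLabTests.aggregate([
  { $search: {
      index: "SearchFactEncounterLabTests",
      compound: {
        must: [
          { in:    { path: "LabTests.LabType", value: "Lipids" } },
          // lexicographic range on the token-indexed date string
          { range: { path: "CheckoutDate", gte: "2024-01-01", lt: "2025-01-01" } }
        ]
      }
  } },
  { $count: "encounters" }
])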

In relational databases, the star schema approach involves limiting the number of columns in fact tables due to their typically large number of rows. Dimension tables, which are generally smaller, can include more columns and are often denormalized in SQL databases, making the star schema more common than the snowflake schema. Similarly, in document modeling, embedding all dimension fields can increase the size of fact documents unnecessarily, so referencing dimension collections is often preferred. MongoDB’s data modeling principles allow it to be queried similarly to a star schema without additional complexity, as its design aligns with common application access patterns.

Star Query

A star schema allows queries that filter on fields of the dimension collections to be processed in several stages:

  1. In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.
  2. In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.
  3. A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection. This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted.

For an example query, I aimed to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions.

Search Aggregation Pipeline

To access the fact collection efficiently and apply all filters at once, I started building an aggregation pipeline whose first stage is a search on the search index. Filters on fields of the fact collection can be applied there directly, while filters that involve dimensions are added during the first stage of the star query. I used a local variable holding a compound operator so that more filters can be appended for each dimension.

Before proceeding through the star query stages to add filters on dimensions, my query included a filter on the lab type, which was part of the fact collection and indexed.

const search =  {
    "$search": {
      "index": "SearchFactEncounterLabTests",
      "compound": {
        "must": [
          { "in":    { "path": "LabTests.LabType" , "value": "Lipids"   } },
        ]
      },
      "sort": { CheckoutDate: -1 }
    }
  }


I added a sort operation to order the results by check-out date in descending order. This illustrated the advantage of sorting during the index search rather than in later stages of the aggregation pipeline, especially when a limit was applied.
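
For example, a Top-n variant is a small addition to the pipeline: because the $search stage already returns documents in CheckoutDate order, a $limit placed right after it (100 is an arbitrary value here) caps the work done by all later stages:

db.FactEncounterLabTests.aggregate([
  search,            // the sorted $search stage defined above
  { $limit: 100 }    // documents already arrive in index sort order
])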

I used this local variable to add more filters in Stage 1 of the star query, so that it could be executed for Stage 2 and collect documents for Stage 3.

Stage 1: Query the Dimension Collections

In the first stage of the star query, I obtained the dimension keys from the dimension collections. For every dimension with a filter, I ran a find() on the dimension collection to collect its keys and appended a must condition to the compound clause of the fact index search.

The following added the conditions on the Patient (female patients over 170 cm):

search["$search"]["compound"]["must"].push( { in: {
 path: "Patient_ID",                         // Foreign Key in Fact
 value: db.DimPatient.find(                  // Dimension collection
  {Gender: "Female", Height: { "$gt": 170 }} // filter on Dimension
 ).map(doc => doc["Patient_ID"]).toArray() } // Primary Key in Dimension
})


The following added the conditions on the Doctor (Japanese):

search["$search"]["compound"]["must"].push( { in: {
 path: "ResponsibleDoctorID",               // Foreign Key in Fact
 value: db.DimDoctor.find(                  // Dimension collection
  {"Doctor Nationality": "Japanese" }       // filter on Dimension
 ).map(doc => doc["Doctor_ID"]).toArray() } // Primary Key in Dimension
})


The following added the condition on the Room (Deluxe):

search["$search"]["compound"]["must"].push( { in: {
 path: "RoomKey",                         // Foreign Key in Fact
 value: db.DimRoom.find(                  // Dimension collection
  {"Room Type": "Deluxe" }                // filter on Dimension
 ).map(doc => doc["RoomKey"]).toArray() } // Primary Key in Dimension
})


The following added the condition on the Disease (Hematology):

search["$search"]["compound"]["must"].push( { in: {
 path: "Disease_ID",                         // Foreign Key in Fact
 value: db.DimDisease.find(                  // Dimension collection
  {"Disease Type": "Hematology" } // filter on Dimension
 ).map(doc => doc["Disease_ID"]).toArray() } // Primary Key in Dimension
})


Finally, here's the condition on the Insurance coverage (greater than 80%):

search["$search"]["compound"]["must"].push( { in: {
 path: "InsuranceKey",                         // Foreign Key in Fact
 value: db.DimInsurance.find(                  // Dimension collection
  {"Coverage Limit": { "$gt": 0.8 } }          // filter on Dimension
 ).map(doc => doc["InsuranceKey"]).toArray() } // Primary Key in Dimension
})


All these search criteria had the same structure: a find() on the dimension collection with the filters from the query, resulting in an array of dimension keys (similar to primary keys in a dimension table) that were used to search the fact documents by referencing them (like foreign keys in a fact table).

Each of these steps queried the dimension collection to obtain a simple array of dimension keys, which were then added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions were pushed down into the query on the fact collection.
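
The repetition could be factored into a small helper. The following addDimensionFilter function is a hypothetical convenience (not part of the original script) that runs the dimension find() and appends the resulting keys as an in clause on the fact path:

function addDimensionFilter(search, dimCollection, dimFilter, dimKeyField, factPath) {
  const keys = db.getCollection(dimCollection)  // Dimension collection
    .find(dimFilter)                            // filter on Dimension
    .map(doc => doc[dimKeyField])               // Primary Key in Dimension
    .toArray();
  search["$search"]["compound"]["must"].push(
    { in: { path: factPath, value: keys } }     // Foreign Key in Fact
  );
}

// The Room condition above, expressed with the helper:
addDimensionFilter(search, "DimRoom", { "Room Type": "Deluxe" }, "RoomKey", "RoomKey");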

Stage 2: Query the Fact Search Index

Using the results from the dimension queries, I built the following pipeline search step:

AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search)
{
  '$search': {
    index: 'SearchFactEncounterLabTests',
    compound: {
      must: [
        { in: { path: 'LabTests.LabType', value: 'Lipids' } },
        {
          in: {
            path: 'Patient_ID',
            value: [
              'TR551',    'TR751',    'TR897',    'TRGT201',  'TRJB261',
              'TRQG448',  'TRSQ510',  'TRTP535',  'TRUC548',  'TRVT591',
              'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896',
              'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672',
              'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003',
              'TRE004',   'TRMN351',  'TRRY492',  'TRTI528',  'TRAKA962',
              'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384',
              'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419',
              'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079',
              'TR663',    'TR808',    'TR849',    'TRKA286',  'TRLC314',
              'TRMG344',  'TRPT435',  'TRVZ597',  'TRXC626',  'TRACT773',
              'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167',
              'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514',
              'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990',
              'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374',
              'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026',
              'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354',
              'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589',
              'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987',
              'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121',
              ... 6184 more items
            ]
          }
        },
        {
          in: {
            path: 'ResponsibleDoctorID',
            value: [ 830, 844, 862, 921 ]
          }
        },
        { in: { path: 'RoomKey', value: [ 203 ] } },
        {
          in: {
            path: 'Disease_ID',
            value: [
              1519, 1506, 1504, 1510,
              1515, 1507, 1503, 1502,
              1518, 1517, 1508, 1513,
              1509, 1512, 1516, 1511,
              1505, 1514
            ]
          }
        },
        { in: { path: 'InsuranceKey', value: [ 83, 84 ] } }
      ]
    },
    sort: { CheckoutDate: -1 }
  }
}


MongoDB Atlas Search indexes, which are built on Apache Lucene, handle queries with multiple conditions and long arrays of values. In this example, the search operation uses the compound operator with a must clause to apply filters across attributes, once the complex conditions on the dimensions have been resolved into lists of dimension keys.

Using the search operation defined above, I ran an aggregation pipeline to retrieve the documents of interest:

db.FactEncounterLabTests.aggregate([
 search,
])

With my example, nine documents were returned in 50 milliseconds.

Estimate the Count

This approach works well for queries with multiple filters where individual conditions are not very selective but their combination is. Querying dimensions and using a search index on facts helps avoid scanning unnecessary documents. However, depending on additional operations in the aggregation pipeline, it is advisable to estimate the number of records returned by the search index to prevent expensive queries.

In applications that allow multi-criteria queries, it is common to set a threshold and return an error or warning if the estimated number of documents exceeds it, prompting users to add more filters. To support this, you can run a $searchMeta operation on the index before a $search. For example, the following checks that the number of documents returned by the filter is less than 10,000:

MedSynoraDW> db.FactEncounterLabTests.aggregate([ 
  { "$searchMeta": {                                                                         
     index: search["$search"].index,
     compound: search["$search"].compound,
     count: { "type": "lowerBound" , threshold: 10000 }
     }
  }
])

[ { count: { lowerBound: Long('9') } } ]


In my case, with nine documents, I can add more operations to the aggregation pipeline without expecting a long response time. If many more documents than expected are returned, the additional steps in the pipeline will take longer. When tens or hundreds of thousands of documents are expected as input to a complex aggregation pipeline, the application can warn the user that execution will not be instantaneous and offer to run it as a background job with a notification when done. With such a warning, the user may decide to add more filters, or to set a limit and work on a Top-n result, which is added to the aggregation pipeline after the sorted search.

Stage 3: Join Back to Dimensions for Projection

The first step of the aggregation pipeline fetches all the documents needed for the result, and only those documents, using efficient access through the search index. Once filtering is complete, the smaller set of documents is used for aggregation or projection in the later stages of the aggregation pipeline.

In the third stage of the star query, the pipeline performs lookups on the dimension collections to retrieve additional attributes needed for aggregation or projection. It may re-read some collections that were already used for filtering, which is not a problem since the dimensions remain small. For larger dimensions, the first stage could keep this information in a temporary array to avoid extra lookups, although this is rarely necessary.

For example, when I wanted to display additional information about the patient and the doctor, I added two lookup stages to my aggregation pipeline:

 {
    "$lookup": {
      "from": "DimDoctor",
      "localField": "ResponsibleDoctorID",
      "foreignField": "Doctor_ID",
      "as": "ResponsibleDoctor"
    }
  },
  {
    "$lookup": {
      "from": "DimPatient",
      "localField": "Patient_ID",
      "foreignField": "Patient_ID",
      "as": "Patient"
    }
  },


For simplicity in this demo, I imported the dimensions directly from the CSV files. In a well-designed database, the dimension key should be the document's _id field, and the collection should be created as a clustered collection. This design ensures efficient joins from the fact documents. Most dimensions are compact and stay in memory.
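
A sketch of that design, using DimDoctor as an example (DimDoctorClustered is a hypothetical name for the rebuilt collection): create the dimension as a clustered collection keyed on _id, then copy the imported documents with the dimension key promoted to _id so that lookups read straight from the clustered index. The $lookup stages would then join on foreignField "_id" instead of the imported key field.

// create the clustered dimension collection (clustered on _id)
db.createCollection("DimDoctorClustered", {
  clusteredIndex: { key: { _id: 1 }, unique: true }
});

// copy the imported dimension, promoting the dimension key to _id
db.DimDoctor.aggregate([
  { $addFields: { _id: "$Doctor_ID" } },
  { $project: { Doctor_ID: 0 } },
  { $merge: { into: "DimDoctorClustered", on: "_id",
              whenMatched: "replace", whenNotMatched: "insert" } }
])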

I added a final projection to fetch only the fields I needed. The full aggregation pipeline, using the search defined above with filters and arrays of dimension keys, is:

db.FactEncounterLabTests.aggregate([
  search,
  {  
    "$lookup": {  
      "from": "DimDoctor",  
      "localField": "ResponsibleDoctorID",
      "foreignField": "Doctor_ID",
      "as": "ResponsibleDoctor"
    }
  },
  {
    "$lookup": {
      "from": "DimPatient",
      "localField": "Patient_ID",
      "foreignField": "Patient_ID",
      "as": "Patient"
    }
  },
  {
    "$project": {
      "Patient_Severity_Score": 1,
      "CheckinDate": 1,
      "CheckoutDate": 1,
      "Patient.name": {
        "$concat": [
          { "$arrayElemAt": ["$Patient.First Name", 0] },
          " ",
          { "$arrayElemAt": ["$Patient.Last Name", 0] }
        ]
      },
      "ResponsibleDoctor.name": {
        "$concat": [
          { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Name", 0] },
          " ",
          { "$arrayElemAt": ["$ResponsibleDoctor.Doctor Surname", 0] }
        ]
      }
    }
  }
])


On a small instance, it returned the following result in 50 milliseconds:

[
  {
    _id: ObjectId('67fc3d2f40d2b3c843949a97'),
    CheckinDate: '2024-02-12 17:00:00',
    CheckoutDate: '2024-03-30 13:04:00',
    Patient_Severity_Score: 61.4,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Niina Johanson' } ]
  },
  {
    _id: ObjectId('67fc3d2f40d2b3c843949f5c'),
    CheckinDate: '2024-04-29 06:44:00',
    CheckoutDate: '2024-05-30 19:53:00',
    Patient_Severity_Score: 57.7,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Cindy Wibisono' } ]
  },
  {
    _id: ObjectId('67fc3d2f40d2b3c843949f0e'),
    CheckinDate: '2024-10-06 13:43:00',
    CheckoutDate: '2024-11-29 09:37:00',
    Patient_Severity_Score: 55.1,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Asta Koch' } ]
  },
  {
    _id: ObjectId('67fc3d2f40d2b3c8439523de'),
    CheckinDate: '2024-08-24 22:40:00',
    CheckoutDate: '2024-10-09 12:18:00',
    Patient_Severity_Score: 66,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Paloma Aguero' } ]
  },
  {
    _id: ObjectId('67fc3d3040d2b3c843956f7e'),
    CheckinDate: '2024-11-04 14:50:00',
    CheckoutDate: '2024-12-31 22:59:59',
    Patient_Severity_Score: 51.5,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Aulikki Johansson' } ]
  },
  {
    _id: ObjectId('67fc3d3040d2b3c84395e0ff'),
    CheckinDate: '2024-01-14 19:09:00',
    CheckoutDate: '2024-02-07 15:43:00',
    Patient_Severity_Score: 47.6,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Laura Potter' } ]
  },
  {
    _id: ObjectId('67fc3d3140d2b3c843965ed2'),
    CheckinDate: '2024-01-03 09:39:00',
    CheckoutDate: '2024-02-09 12:55:00',
    Patient_Severity_Score: 57.6,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Gabriela Cassiano' } ]
  },
  {
    _id: ObjectId('67fc3d3140d2b3c843966ba1'),
    CheckinDate: '2024-07-03 13:38:00',
    CheckoutDate: '2024-07-17 07:46:00',
    Patient_Severity_Score: 60.3,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Monica Zuniga' } ]
  },
  {
    _id: ObjectId('67fc3d3140d2b3c843969226'),
    CheckinDate: '2024-04-06 11:36:00',
    CheckoutDate: '2024-04-26 07:02:00',
    Patient_Severity_Score: 62.9,
    ResponsibleDoctor: [ { name: 'Sayuri Shan Kou' } ],
    Patient: [ { name: 'Stanislava Beranova' } ]
  }
]


The star query approach focuses solely on filtering to obtain the input for further processing, while retaining the full power of aggregation pipelines.

Additional Aggregation After Filtering

With the set of documents efficiently filtered upfront, I can apply aggregations before the projection. For example, the following grouped by doctor and counted the number of patients and the range of severity scores:

db.FactEncounterLabTests.aggregate([  
  search,  
  {  
    "$lookup": {  
      "from": "DimDoctor",  
      "localField": "ResponsibleDoctorID",  
      "foreignField": "Doctor_ID",  
      "as": "ResponsibleDoctor"  
    }  
  },  
  {  
    "$unwind": "$ResponsibleDoctor"  
  },  
  {  
    "$group": {  
      "_id": {  
        "doctor_id": "$ResponsibleDoctor.Doctor_ID",  
        "doctor_name": { "$concat": [ "$ResponsibleDoctor.Doctor Name", " ", "$ResponsibleDoctor.Doctor Surname" ] }  
      },  
      "min_severity_score": { "$min": "$Patient_Severity_Score" },  
      "max_severity_score": { "$max": "$Patient_Severity_Score" },  
      "patient_count": { "$sum": 1 }  // Count the number of patients  
    }  
  },  
  {  
    "$project": {  
      "doctor_name": "$_id.doctor_name",  
      "min_severity_score": 1,  
      "max_severity_score": 1,  
      "patient_count": 1  
    }  
  }  
])  


My filters matched documents from only one doctor and nine patients:

[
  {
    _id: { doctor_id: 862, doctor_name: 'Sayuri Shan Kou' },
    min_severity_score: 47.6,
    max_severity_score: 66,
    patient_count: 9,
    doctor_name: 'Sayuri Shan Kou'
  }
]


Using a MongoDB document model, this method enables direct analytical queries on the operational database, removing the need for a separate analytical database. The search index operates as the analytical component for the operational database and works with the MongoDB aggregation pipeline. Since the search index runs as a separate process, it can be deployed on a dedicated search node to isolate resource usage. When running analytics on an operational database, queries should be designed to minimize impact on the operational workload.

Conclusion

MongoDB’s document model with Atlas Search indexes supports managing and querying data following a star schema approach. By using a single search index on the fact collection and querying dimension collections for filters, it is possible to perform ad-hoc queries without replicating data into a separate analytical schema as typically done in relational databases.

This method resembles the approach used in SQL databases, where a star schema data mart is maintained apart from the normalized operational database. In MongoDB, the document model uses embedding and referencing patterns similar to a star schema and is structured for operational transactions. Search indexes provide similar functionality without moving data to a separate system.

The method, implemented as a three-stage star query, can be integrated into client applications to optimize query execution and enable near-real-time analytics on complex data. This approach supports hybrid transactional and analytical processing (HTAP) workloads.
