Using TypeDB to Assess Covid-19 Prevention Measures
What are the impacts of semantic graph technologies in the analysis of the impact of anti-COVID measures?
On April 22, 2021, during the TypeDB community's Orbit 2021 symposium, I presented the progress of our work on the use case "What are the impacts of semantic graph technologies in the analysis of the impact of anti-COVID measures?"
The goal of the open-source project PKN12 (Pandemic Knowledge Network; I will come back to the meaning of the number 12) is to propose new tools for analyzing the open data available on the epidemic, using TypeDB.
This pandemic has progressed differently in different countries: countries detect, test, treat, isolate, trace, and mobilize their people in response; countries with only a handful of cases can prevent those cases from becoming clusters, and those clusters from becoming community transmission. But how can we begin to understand the factors influencing the evolution of this pandemic?
Governments provide "open data" about COVID-19, such as daily PCR test statistics and, since 2021, vaccine distribution statistics. With this open data, we can juxtapose the various events of 2020-2021 (lockdowns and curfews, non-essential store closures, restrictions on holidays and festivals, and so on). Following lockdown decisions, it was generally observed that the spread slowed down but then accelerated again once restrictions were lifted.
In addition, it is known that some regions are more affected by the epidemic than others: different hypotheses have been proposed, such as the proximity of border areas favoring traffic, but also the climate.
So, to better understand the different drivers of the pandemic, the PKN12 project aims to integrate all the open insights into a single database (using TypeDB) in order to identify the interactions between social, medical, and environmental events over the COVID-19 period.
The PKN12 experiment started in 2020, and the database now includes a large amount of interconnected data:
The data were integrated through a generic 5W model (Who, What, Why, Where, and When) of concepts and relations, taking advantage of TypeDB's abstraction of the logical model and its polymorphism, as illustrated by the following extract:
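To make the 5W pattern concrete, here is a minimal sketch of what a normalized event record could look like, written as a plain Python dataclass. All field names and values are illustrative assumptions for this article, not the actual PKN12 schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event5W:
    """Generic 5W event: every ingested record is normalized to this shape."""
    who: str     # reporting entity, e.g. a regional health agency
    what: str    # event type, e.g. "pcr-test-count" or "lockdown"
    why: str     # cause or decision behind the event, if known
    where: str   # region/department code
    when: date   # date (or period start) of the event

# hypothetical example: a daily PCR statistic normalized as a 5W event
e = Event5W(who="health-agency", what="pcr-test-count",
            why="daily-report", where="FR-75", when=date(2021, 1, 15))
```

The benefit of such a uniform shape is that PCR tests, lockdowns, weather readings, and vaccination figures all become comparable events that can be related to each other in the graph.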
It is illustrated below by a PCR test record that shows raw integrated data augmented by a post-calculation step of derived values (3-day sliding average, speed, and acceleration). A lockdown integration example is also given: the lockdown record is quite similar to the PCR test record, differing only in the `time` class, modelled as a period, with curfew slots depending on the government's decisions (e.g., 9 P.M. to 6 A.M.).
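The derived values mentioned above can be sketched in a few lines of Python: the sliding average smooths the raw series, and the "speed" and "acceleration" are simply its first and second day-to-day differences. The numbers below are hypothetical daily counts, not PKN12 data:

```python
def sliding_average(values, window=3):
    """3-day sliding average; shorter windows at the start of the series."""
    return [sum(values[max(0, i - window + 1): i + 1]) /
            (i - max(0, i - window + 1) + 1)
            for i in range(len(values))]

def derivative(values):
    """Day-to-day difference: the 'speed' of a series, or the
    'acceleration' when applied to a speed series."""
    return [b - a for a, b in zip(values, values[1:])]

daily = [100, 120, 150, 200, 260, 250, 230]  # hypothetical daily positives
smooth = sliding_average(daily)              # 3-day sliding average
speed = derivative(smooth)                   # first derivative
accel = derivative(speed)                    # second derivative
```

Each derivative shortens the series by one point, so a week of raw data yields five acceleration values.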
This illustrates the analysis capacity of the technology: how a lockdown record can be linked to PCR test records:
Now, let’s talk about the methodology. The general principle is first to ingest the historic raw data, and then, each day, the newly refreshed raw data, as event data (5W pattern).
The events can be complemented by master data such as the list of regions/departments, the topology of departments/regions (neighbors and main routes between them), and the register of shops by department/region with their normal opening hours.
After raw data ingestion, post-calculation steps are run (computing derived values, inferring the succession of events, and linking PCR test events to probable causes). The derived post-calculation is of great importance, as its purpose is to identify singular points that mark inflexions in the trend of the epidemic.
If we take a deep dive into the analysis of derived values, the evolution of the data is quite non-linear, as shown by the speed and acceleration curves below (red and orange); the assumption is that there are cycles inside the time-series data:
So, a step of the methodology was to apply mathematical treatments (FFT, filtering) to the time-series data in order to identify cycles and isolate singular points. These treatments reveal several frequencies: 3 days and 7 days, with harmonics.
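As a toy illustration of this cycle detection, the sketch below runs a naive discrete Fourier transform (a pure-Python DFT rather than a real FFT library) over a synthetic daily series built with a 7-day cycle; the series and its parameters are assumptions for the example only:

```python
import cmath
import math

def dft_magnitudes(x):
    """Naive discrete Fourier transform; returns the magnitude per frequency bin."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# synthetic series: 28 days with a weekly (7-day) cycle on top of a constant level
days = 28
series = [100 + 30 * math.sin(2 * math.pi * t / 7) for t in range(days)]

mags = dft_magnitudes(series)
# ignore bin 0 (the mean) and look only at the first half of the spectrum
peak_bin = max(range(1, days // 2), key=lambda k: mags[k])
period = days / peak_bin  # dominant period in days
```

On real data one would of course use an FFT implementation and window the series, but the principle is the same: the dominant spectral peak corresponds to the weekly cycle.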
The first frequency, 3 days, matches the window of the sliding average, so it does not seem to be significant.
But 7 days is the duration of a week, and so seems to underline the influence of the weekend.
Then, after frequency filtering, we can automatically extract the singular points from which we would like to analyse subgraphs (e.g., probable causes of effect):
Singular Points Automatic Filtering
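One simple way to pick out singular points after filtering is to look for sign changes in the acceleration series, i.e. the inflexions of the smoothed curve. This is a hedged sketch of the idea, with a hypothetical acceleration series, not the exact PKN12 criterion:

```python
def singular_points(accel):
    """Indices where the acceleration changes sign (inflexion points)."""
    points = []
    for i in range(1, len(accel)):
        if accel[i - 1] == 0:
            continue  # previous point already flagged a flat spot
        if accel[i] == 0 or (accel[i] > 0) != (accel[i - 1] > 0):
            points.append(i)
    return points

# hypothetical filtered acceleration series around a trend reversal
accel = [2.0, 1.0, 0.5, -0.5, -1.5, -0.5, 0.5]
flagged = singular_points(accel)  # sign flips at indices 3 and 6
```

Each flagged index can then serve as the root of a subgraph query back into the database.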
So, having identified the singular PCR test points among all the data, the goal is to analyse the subgraphs built from these points.
The two following examples show the kinds of subgraphs that can be built from singular points back to lockdown dates or weather factors:
Weather Nodes Correlation
Regarding the analysis of climate impacts, the opportunities are numerous, and we look to group nodes by aggregating values (e.g., below -10°C, -10°C to -5°C, -5°C to 5°C, 5°C to 10°C, 10°C to 20°C, 20°C to 30°C, above 30°C, and so on). But in this case, it is not a single record but the succession of previous records linked to the singular point that is interesting. For this analysis, TypeQL rules help to explore the time-series data (sequenced nodes to facilitate navigation) and its ancestors down to a specific depth (between 7 and 20 days, as the COVID-19 incubation period is around 8 days).
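The temperature aggregation can be sketched as a simple bucketing function; the bucket edges below follow the ranges listed in the article, while the readings are hypothetical:

```python
from collections import Counter

# temperature bucket edges used for aggregation (degrees Celsius)
BUCKET_EDGES = [-10, -5, 5, 10, 20, 30]

def temperature_bucket(t):
    """Map a temperature to a label like '<-10', '-10..-5', ..., '>30'."""
    if t < BUCKET_EDGES[0]:
        return f"<{BUCKET_EDGES[0]}"
    for lo, hi in zip(BUCKET_EDGES, BUCKET_EDGES[1:]):
        if lo <= t < hi:
            return f"{lo}..{hi}"
    return f">{BUCKET_EDGES[-1]}"

# group hypothetical weather readings by bucket
readings = [-12, -7, 0, 8, 15, 25, 33]
groups = Counter(temperature_bucket(t) for t in readings)
```

Aggregating weather nodes this way keeps the subgraphs around singular points small enough to compare across regions.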
Then, the impact of links between regions can also be analyzed along the neighbor and population-migration axes (including a distance weight on population migration):
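A hedged sketch of one way to weight migration links by distance, borrowing a simple gravity-model form (the actual PKN12 weighting may differ, and the populations and distances below are made up for the example):

```python
def migration_weight(pop_a, pop_b, distance_km):
    """Gravity-style weight: larger populations attract more flow,
    and distance damps it quadratically."""
    return (pop_a * pop_b) / (distance_km ** 2)

# hypothetical departments: (population, distance in km from a reference department)
neighbors = {"dep-75": (2_100_000, 10), "dep-13": (2_000_000, 660)}
ref_pop = 1_400_000

weights = {dep: migration_weight(ref_pop, pop, dist)
           for dep, (pop, dist) in neighbors.items()}
strongest = max(weights, key=weights.get)  # nearby departments dominate
```

Such weights give each inter-region edge a magnitude, so subgraph traversals can prioritize the links most likely to carry transmission.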
If we take a step back, this work based on modeling COVID-related data in a graph has allowed us to explore new ways of analyzing and correlating these data, compared to classical, purely statistical methodologies.
The sources of the open-source project have been published on GitHub.
The latest figures for the epidemic in France are rather encouraging, and at the same time, vaccination is progressing.
So, the rules have been relaxed in France, as in many countries in Europe: shops, bars, and restaurants have reopened under certain restrictions, which are gradually being reduced.
However, everyone must remain cautious, and the PKN12 project will continue its analyses. Our latest work now includes data on vaccination rates by region/department, broken down by age group and type of vaccine.
The next step is to correlate the graph records of the PCR tests with the earlier vaccination records, while juxtaposing the evolution of the opening rate of stores, bars, and restaurants for a given region/department (reduced to the hours actually open outside of curfew), as well as the weather conditions over the same period.
Although this work was carried out on open datasets from France, our wish is that it can be extended, with the help of the TypeDB community, with datasets from other countries, in order to enrich the analyses, improve the methodology, and take advantage of collaborations globally.
Many areas of benefit are possible beyond the enrichment with international data, such as the integration of anonymized individual data: tracked (blockchain) surveys enabling the introduction of a recommendation engine, finer-grained data, and the identification of clusters.
To take advantage of all the data now ingested, a great next step will be the implementation of machine learning algorithms such as KGCNs/KGLIB (https://github.com/vaticle/kglib) to allow predictions. Another important step will be to publish the database in the cloud so that everyone can query the data as a service.
Opinions expressed by DZone contributors are their own.