Data gathering/collection (applying existing methodology) This includes an inventory of the existing data from the first 6 months of the pandemic in Rwanda (the first case was identified in March 2020) and a 1-year data collection (through mobile-application surveys, telephone calls and face-to-face interviews). The expected data sources will come in different formats, ranging from COVID-19-related data registered in Excel documents, via data sources containing Minimum Clinical Data (MCD) in DHIS2 and other systems, to more granular Electronic Medical Record (EMR) data in Open Clinic, OpenMRS and other EMR systems. We will start by mapping full hospital patient records, focusing on 15 hospitals located in regions with high numbers of COVID-19 patients, and complement these with other isolated datasets. The list of hospitals will be defined at the start of the project, as data on COVID-19 are still accumulating. In total, the study will target about 1 terabyte (1,000 gigabytes) of data, a volume large enough for such AI modelling.
- The new data collection will follow validated guidelines/principles for data collection and will use a longitudinal approach based on mobile-App questionnaires: a minimum of 214 people per administrative district (6,420 persons throughout Rwanda, totalling 154,080 survey entries over 24 weeks) will be required to respond to the mobile-App questionnaire weekly for 6 months (24 weeks).
- A minimum sample of 10 persons per district will be reached by the data collector (twice: at the beginning and at the end of the study) via a validation phone call or a face-to-face questionnaire, if the COVID-19 situation in Rwanda allows.
- A sub-group of patients cured of COVID-19 will be specifically followed. If a followed subject has a medical file in a participating hospital, the two datasets will be linked, with the possibility of data-linkage requests in the future.
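As a sanity check on the sampling figures above, the stated totals follow directly from Rwanda's 30 administrative districts (the district count is our assumption; the 214-per-district minimum and 24-week duration come from the protocol text):

```python
# Sanity check of the planned survey sample sizes.
# Assumes Rwanda's 30 administrative districts; the per-district minimum
# and study duration are taken from the protocol above.
DISTRICTS = 30
PER_DISTRICT = 214
WEEKS = 24

participants = DISTRICTS * PER_DISTRICT   # persons enrolled nationwide
survey_entries = participants * WEEKS     # weekly mobile-App entries overall

print(participants, survey_entries)  # 6420 154080
```

This confirms the 6,420-person and 154,080-entry figures quoted in the protocol.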
- Face mask use;
- Hand hygiene;
- Adherence to social-distancing and other risk-minimization measures;
- Recent exposure to risk situations and COVID-19 measures.
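One way to capture the questionnaire topics above in the mobile App is a simple per-week record; the field and district names below are illustrative sketches, not the final instrument:

```python
# Illustrative shape of one weekly mobile-App questionnaire response.
# Field names are hypothetical; the real instrument will be defined
# by the validated data-collection guidelines cited in the protocol.
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class WeeklySurveyEntry:
    participant_id: str
    district: str
    week_of: date
    wears_face_mask: bool          # face mask use
    hand_hygiene_compliant: bool   # hand hygiene
    keeps_social_distance: bool    # distancing / risk-minimization measures
    recent_risk_exposure: bool     # recent risk-situation exposure

entry = WeeklySurveyEntry("P-0001", "Gasabo", date(2021, 3, 1),
                          True, True, False, True)
print(asdict(entry)["wears_face_mask"])  # True
```

Storing each weekly response as a flat record like this keeps the longitudinal series per participant easy to link to hospital datasets later.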
Infrastructure for data harmonizing (developing novel techniques) For data harmonization, custom-designed ETL scripts will be developed per data source to extract, transform and load the source data into an OMOP CDM database instance. In the early stages, when the hospital EHRs are not yet harmonized, we will also use synthetic-data approaches to help automate the harmonization processes. The data-owner-side infrastructure will include the OMOP CDM database instance, the Arachne client, the OHDSI Atlas analytical tool, RStudio, and Jupyter. The data harmonization process converts the observational data from the format of the source system to the OMOP Common Data Model (OMOP CDM), the CDM maintained by the Observational Health Data Sciences and Informatics (OHDSI) community. This project will benefit from consortium members (led by UGent, with support from experts at Edence Health NV) in the steps involved in the data harmonization process, typically:
- Mapping workshop: a face-to-face (in person or via video conference) workshop, usually a full day, where the initial mapping from the source data to the OMOP CDM is discussed in detail.
- Structure mapping + final mapping document: based on the mapping workshop, its documentation and notes, the structure mapping is finalized and recorded in the mapping document. This forms the basis of the ETL design.
- Code mapping: depending on which source terminologies are used in the data source, mapping the local codes to the standard vocabularies used in OMOP CDM (LOINC, SNOMED, RxNorm, etc.) can be either a short, easy process or a long, involved one with multiple iterations.
- Implementation of the ETL (Extract, Transform and Load: database functions combined into one tool to pull data out of one database and place it into another): the ETL script(s) that transform the source data into the OMOP CDM database instance, normally written in Python.
- ETL testing: the ETL scripts are tested on development data and, ideally, also on the data source's test data.
- ETL deployment: once the ETL scripts have been tested successfully, they are packaged and deployed using GitHub and Docker.
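The transform step of such an ETL can be sketched in Python as mapping one hypothetical source registry row onto (partial) rows for the OMOP CDM PERSON and CONDITION_OCCURRENCE tables. The source column names are assumptions for illustration; the concept IDs shown (8507/8532 for the OMOP gender concepts, 37311061 for the standard COVID-19 condition concept) are the commonly cited OHDSI standard concepts, and a real ETL would use the full CDM DDL and mapped vocabularies:

```python
# Minimal illustrative transform step of a source-to-OMOP-CDM ETL.
# Column subset only; a real implementation follows the mapping document.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}  # OMOP standard gender concepts
COVID19_CONCEPT_ID = 37311061             # OMOP standard COVID-19 concept

def transform(source_row: dict) -> tuple[dict, dict]:
    """Map one hypothetical registry row to partial PERSON and
    CONDITION_OCCURRENCE rows (unknown gender maps to concept 0)."""
    person = {
        "person_id": source_row["patient_no"],
        "gender_concept_id": GENDER_CONCEPTS.get(source_row["sex"], 0),
        "year_of_birth": source_row["birth_year"],
    }
    condition = {
        "person_id": source_row["patient_no"],
        "condition_concept_id": COVID19_CONCEPT_ID,
        "condition_start_date": source_row["test_date"],
    }
    return person, condition

person, condition = transform(
    {"patient_no": 42, "sex": "F", "birth_year": 1988,
     "test_date": "2020-03-14"}
)
print(person["gender_concept_id"], condition["condition_concept_id"])
```

In practice one such `transform` exists per source table, exercised by the ETL tests on development data before deployment.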
HIGH-LEVEL CONCEPTUAL FRAMEWORK OF THE PROJECT
Infrastructure for data access, query, and data analysis (mixing existing methods and innovative techniques) The central platform for data access, query, and data analysis, i.e. the central-site setup, will manage and coordinate the studies performed across the participating data sources. The central site should at a minimum consist of a database, an Atlas instance, a catalogue of data sources, an RStudio instance, and possibly also a Jupyter server instance. Depending on the network infrastructure chosen (see above), there may also be an installation of the Arachne central server. The database, for example a PostgreSQL database, will include an OMOP CDM schema as well as additional schema(s) to support a central data catalogue and study coordination.
- New techniques exist for creating synthetic data and using it to help automate harmonization processes and train models. These approaches will also be used in our project from the very beginning, when harmonized data from hospital EHRs are not yet available, especially by leveraging synthetic datasets available in the OHDSI community (e.g. generated with Synthea) to train the different algorithms/models before applying them to real data.
- The OMOP CDM schema will use the same OMOP CDM vocabulary version as the participating sites and will allow studies to be prepared and tested. If needed, a synthetic dataset (e.g. Synthea-generated) or an available local dataset can be loaded.
- There will also be results schemas able to hold the Achilles output for each data source site; this allows a central view of the descriptive statistics for each site.
- The database will also be the place to gather aggregated results from the data source sites as part of defined studies.
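As a sketch of this last point, the central database can hold a per-site, per-study results table that stores only aggregated numbers shipped by the data-source sites. SQLite stands in here for the PostgreSQL instance, and all table, study and site names are illustrative:

```python
# Illustrative central results store: sites upload aggregates, never
# row-level data; the central platform can then combine them per study.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the central PostgreSQL DB
conn.execute("""
    CREATE TABLE study_results (
        study_id TEXT,
        site_id  TEXT,
        metric   TEXT,
        value    REAL
    )
""")

site_uploads = [
    ("covid-severity-01", "hospital_A", "n_patients", 1250),
    ("covid-severity-01", "hospital_B", "n_patients", 830),
]
conn.executemany("INSERT INTO study_results VALUES (?, ?, ?, ?)",
                 site_uploads)

total = conn.execute(
    "SELECT SUM(value) FROM study_results WHERE metric = 'n_patients'"
).fetchone()[0]
print(total)  # 2080.0
```

Keeping only aggregates in the central schema is what makes the federated design privacy-preserving: patient-level data never leave the data-owner sites.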
Data analysis and interpretation (mixing existing methods and innovative techniques) The federated datasets are challenging to analyse with traditional statistical methods because, like other real-world data (RWD), they are 1) collected without any intention of being used in research; 2) incomplete and not cleaned; and 3) collected sporadically rather than through a pure longitudinal approach, so cohort-like data cannot be derived from them directly. The current project will leverage AI techniques, including machine learning and data mining, which add value by discovering hidden patterns and relationships between data points. The machine learning model consists of two modules: GRU-ODE, responsible for learning the continuous dynamics of the latent process that generates the observations, and GRU-Bayes, responsible for handling incoming observations and updating the current conditional estimate of the latent process. These two modules are similar in essence to the propagation and update steps of a Kalman filter. With GRU-ODE, we are able to project the hidden process h(t) forward in time and hence, indirectly, future observations. GRU-Bayes performs the update of the hidden state conditioned on new observations. Yet, unlike a Kalman filter, this approach allows very complex dynamics to be learned for the latent process. The figure below shows the overall architecture we propose to support this project. The design incorporates the following parts: Central platform: includes a data catalogue describing the different data sources, the Arachne central hub, a central OHDSI Atlas instance, a central database, as well as RStudio and Jupyter
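To make the propagation/update analogy concrete, the sketch below shows the two steps in their linear special case, a scalar Kalman filter: the prediction step plays the role that GRU-ODE generalizes (projecting the latent state between observations), and the measurement update plays the role of GRU-Bayes (conditioning on a new observation). This is an illustration of the analogy only, not the GRU-ODE-Bayes model itself, and all parameter values are arbitrary:

```python
# Scalar Kalman filter: propagate (cf. GRU-ODE), then update (cf. GRU-Bayes).
def kalman_step(x, P, z, F=1.0, Q=0.1, H=1.0, R=0.5):
    # Propagate: project the state estimate and its variance forward.
    x_pred = F * x
    P_pred = F * P * F + Q
    # Update: fold in the new observation z via the Kalman gain.
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1 - K * H) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                     # initial estimate and its variance
for z in [1.0, 1.2, 0.9, 1.1]:      # noisy observations of a latent level
    x, P = kalman_step(x, P, z)
print(round(x, 3), round(P, 3))
```

The estimate converges toward the observed level while its variance shrinks; GRU-ODE-Bayes keeps this alternating structure but replaces the fixed linear dynamics with learned, highly non-linear ones, which is what makes it suited to sporadic RWD.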