Student research opportunities
Flexible and realistic synthetic medical data generator
Project Code: CECS_1115
This project is available at the following levels:
CS single semester, Honours, Masters
Keywords:
Medical data, data types, longitudinal data, medical codes and classes, realistic test data
Supervisors:
Professor Peter Christen, Dr Dinusha Vatsalan
Outline:
Information technology is increasingly used to support healthcare applications and clinical research. Medical data are recorded electronically so that patient care, the management of diseases, drugs, and resources, and health surveillance can be carried out more efficiently and effectively. Such digital medical data also support clinical research through the aggregation and statistical analysis of observations gathered from populations of patients.
Clinical research and development requires medical data for the evaluation of new algorithms and systems. However, privacy and confidentiality concerns impede the collection and sharing of such medical data across different organizations. An alternative is to generate synthetic medical test data for the evaluation of clinical research. Few generators have so far been developed for medical test data. An important requirement for such generators is that the data they produce exhibit the characteristics of real data. To the best of our knowledge, no such realistic medical test data generator is freely available for clinical research evaluation and development.
Medical data are longitudinal and span different data types, ranging from narrative textual data to numerical measurements, recorded signals, drawings, images, and videos. Several internationally accepted coding systems for diseases, drugs, etc. are used in healthcare systems and applications. Developing a flexible and extensible tool that can generate realistic longitudinal medical data incorporating different data types and standard codes would be a useful direction for clinical research and health data analytics.
Goals of this project
The project aims to develop a synthetic data generator tool for medical data that preserves various characteristics of the original data. The goals of the project are:
1. Conduct research on standards, codes and different types of medical data.
2. Design and develop a synthetic medical data generator tool by modelling real data characteristics and relationships.
3. Test the tool by generating and analyzing different sets of medical data.
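To give a feel for what goal 2 might involve, the following is a minimal Python sketch of a longitudinal record generator, not a proposed design. It assumes a hypothetical record layout (patient_id, date, diagnosis, systolic_bp), uses three ICD-10 categories with made-up relative frequencies, and models a blood pressure measurement as a simple random walk so that later visits depend on earlier ones. All function names, field names, and parameter values are illustrative assumptions.

```python
import random
from datetime import date, timedelta

# Illustrative ICD-10 categories with made-up relative frequencies
# (an assumption for this sketch, not derived from any real population).
DIAGNOSIS_WEIGHTS = {
    "I10": 0.5,  # essential (primary) hypertension
    "E11": 0.3,  # type 2 diabetes mellitus
    "J45": 0.2,  # asthma
}


def generate_patient(patient_id, n_visits, rng):
    """Generate one patient's longitudinal record as a list of visit dicts."""
    codes = list(DIAGNOSIS_WEIGHTS)
    weights = list(DIAGNOSIS_WEIGHTS.values())
    # Hypothetical starting point: first visit some time in 2010.
    visit_date = date(2010, 1, 1) + timedelta(days=rng.randrange(365))
    systolic = rng.gauss(125, 10)  # hypothetical baseline in mmHg
    visits = []
    for _ in range(n_visits):
        visits.append({
            "patient_id": patient_id,
            "date": visit_date.isoformat(),
            "diagnosis": rng.choices(codes, weights=weights, k=1)[0],
            "systolic_bp": round(systolic),
        })
        # Later visits depend on earlier ones: dates move strictly forward
        # and the measurement drifts from its previous value (random walk).
        visit_date += timedelta(days=rng.randint(30, 180))
        systolic += rng.gauss(0, 5)
    return visits


def generate_dataset(n_patients, n_visits, seed=42):
    """Generate records for several patients, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [generate_patient(f"P{i:04d}", n_visits, rng)
            for i in range(n_patients)]


if __name__ == "__main__":
    for visit in generate_dataset(n_patients=2, n_visits=3)[0]:
        print(visit)
```

A realistic tool would replace the hard-coded weights and the random walk with distributions and dependencies estimated from real data or taken from published statistics, and would cover further data types such as free text and signals.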
Requirements/Prerequisites
This project is available as a one-semester Computer Science project for both undergraduate and MComp students, or as a one-year Honours project.
Interested students should have good programming skills (ideally in Python) and background knowledge in algorithms, data structures, and software engineering.
It is an advantage if students have knowledge of medical data representation, analysis and mining, and/or have successfully completed courses on databases, data mining, or document computing.
Student Gain
Clinical research has emerged as a promising field in the healthcare industry. Much research in clinical data mining and analytics relies on medical test data for evaluating and comparing new techniques and algorithms. This project allows the student to learn the basics of medical data representation, storage and processing, and to contribute to an important problem in clinical research and development. The project contributes a baseline for privacy-preserving medical data mining, which has a high impact in the healthcare and research industries.
Background Literature
The following materials provide specific background literature on medical data, different data types, and characteristics of synthetic data generators that will be required to conduct the project.
Links
Flexible and extensible generation and corruption of personal data (Christen and Vatsalan, 2014)
Medical Data: Their acquisition, storage, and use (Shortliffe and Barnett)
A Method for Generation and Distribution of Synthetic Medical Record Data for Evaluation of Disease-Monitoring System (Lombardo and Moniz, 2008)
Customized test data generator for HL7v3 based healthcare information systems (Egner et al., 2013)
GeCo: an online personal data generator and corruptor