ECML PKDD 2016 - Medical Mining Tutorial

Tutorial Title: "Learning from Hospital Data and Learning from Cohorts"

By: Panagiotis Papapetrou and Myra Spiliopoulou

Data mining is used intensively in medicine and healthcare. Electronic Health Records (EHRs) are perceived as big medical data. Using them, scientists strive to predict patients' progress while in the hospital, to detect adverse drug effects, and to identify phenotypes of correlated diseases (as they occur in a hospital), among other learning tasks. Next to EHRs, medical research is no less interested in learning from cohort data, i.e., from a carefully selected set of persons with and without the outcome under observation. From these data, which are small in number but have a large number of dimensions, scientists want, e.g., to predict how people with and without a disease evolve, to assess how they respond to a treatment, and to identify phenotypes of a disease as it occurs in the population.

In this tutorial, we discuss learning on hospital data and learning on cohorts. We begin by introducing key terms and then discuss example objectives for mining on hospital data and on cohort data. Then, we focus on specific application areas. For the cohort data, we present examples of exploratory analysis on population-based and clinical studies, with emphasis on the role of time in these studies. For the hospital data, we present examples of learning from time-stamped data, heterogeneous data, and then focus on the problem of discovering adverse drug effects.

Hospital data vs Cohort data

The proliferation of medical data and applications has increased the need for extracting useful knowledge that healthcare domain experts can use effectively. Our main focus in this tutorial will be on EHRs and cohorts.

The adoption of EHRs has caused a massive increase in the amount of healthcare documentation. Numerous data sources are available in EHRs, including billing codes of diagnoses, laboratory results, drug prescriptions, and clinical notes. Such data sources can be exploited for developing robust predictive models for solving challenging tasks within the domain of healthcare, such as detecting adverse events (AEs). It has been estimated that preventable AEs in hospitals have an annual cost of $3.5 billion in the United States and 6.5 billion SEK in Sweden. Avoiding or reducing AEs within healthcare can lead not only to reduced human suffering but also to substantial economic savings.

At the same time, cohort data are abundant; they typically refer to medical data obtained from a carefully selected set of persons with and without the outcome under observation. The challenging characteristic of these data is that they are small in number but have a large number of dimensions. Interesting problems involving cohort data include constructing machine learning models that predict how people with and without a disease evolve, assess how they respond to a treatment, and identify phenotypes of a disease as it occurs in the population.

Finally, a serious obstacle to deploying data mining in medical research and healthcare informatics is that this research is hypothesis-driven and follows workflows that differ from the way data mining scholars typically approach and solve research problems. While medical researchers are often willing to offer their data for data-driven learning, it is the task of data mining scholars to analyze the data in a way that can be understood and exploited by medical researchers. The knowledge and techniques presented in this tutorial will also serve as guidelines for novice and experienced data mining researchers, so that their methods and results when mining medical data will be useful to medical domain and healthcare experts.

Tutorial Outline


PART 1. Learning from cohort data - MYRA SPILIOPOULOU

In this part, we introduce key terms and core scientific questions in epidemiological research.

We start with examples of epidemiological studies, especially population-based studies, and we discuss how such a study is used in medical research. We look briefly at the role of supervised learning, and then turn to explorative, unsupervised and semi-supervised learning on cohort data.

We focus on the role of explorative analysis for the identification and the description of subpopulations exposed to higher risk with respect to some disease. There, we distinguish between exploration of the data space, e.g. with association rules and classification rules, and exploration of data and feature space, e.g. with subspace clustering methods. We close this part with a discussion on constraint-based methods.
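As a rough illustration of how a rule can describe a subpopulation at higher risk, the sketch below computes support, confidence, and lift for a single rule on a toy cohort. The records, the feature names (`smoker`, `bmi_high`, `outcome`), and the helper `rule_measures` are hypothetical and chosen only for illustration; they are not taken from the tutorial material.

```python
from typing import Dict, FrozenSet, List


def rule_measures(
    records: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: str,
) -> Dict[str, float]:
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    n = len(records)
    # Participants that satisfy every condition of the antecedent.
    n_ante = sum(1 for r in records if antecedent <= r)
    # Participants that have the outcome, regardless of the antecedent.
    n_cons = sum(1 for r in records if consequent in r)
    # Participants that satisfy the antecedent and have the outcome.
    n_both = sum(1 for r in records if antecedent <= r and consequent in r)

    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    base_rate = n_cons / n
    lift = confidence / base_rate if base_rate else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical toy cohort: each record is the set of items true for one person.
cohort = [
    frozenset({"smoker", "bmi_high", "outcome"}),
    frozenset({"smoker", "bmi_high", "outcome"}),
    frozenset({"smoker", "bmi_high"}),
    frozenset({"smoker"}),
    frozenset({"bmi_high", "outcome"}),
    frozenset({"bmi_high"}),
    frozenset(),
    frozenset(),
]

measures = rule_measures(cohort, frozenset({"smoker", "bmi_high"}), "outcome")
print(measures)
```

A lift above 1 suggests that the outcome is more frequent in the subpopulation than in the whole cohort. With cohorts this small, such a value is only a starting point for discussion with domain experts, not evidence of an effect.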

Longitudinal epidemiological data require methods that exploit time. Among the many challenges posed in the analysis of cohorts over time, we focus on the challenge of systematically incomplete data and discuss methods for addressing this challenge.

We close this part by switching from population-based epidemiological studies to clinical studies. We explain how unsupervised learning, especially clustering, is used in such studies, and provide two examples.

A preliminary set of slides for Part 1 is here. In this slide set, all pictures and tables have been replaced by placeholders for copyright reasons. After the tutorial, this slide set will be replaced by the presented one. Pictures and tables not subject to copyright restrictions may also be added.

PART 2. Learning from hospital data - PANAGIOTIS PAPAPETROU

This part of the tutorial introduces core scientific research questions within the area of learning from electronic health records by exploiting static as well as temporal features. Our main focus is on representations and methods for learning from time-stamped electronic hospital data. Different representations of electronic patient records are discussed in conjunction with state-of-the-art methods on descriptive and predictive modeling.

We first elaborate on feature extraction from electronic health records and how these features can be employed for building efficient and effective predictive models. We then shift the focus towards the temporal dimension of the data, and present key algorithms for obtaining effective temporal abstractions that can be used for machine learning tasks, such as sequential pattern mining, subgroup discovery, and classification.
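One simple form of temporal abstraction can be sketched as follows: time-stamped measurements are mapped to qualitative states, and consecutive measurements in the same state are merged into intervals that sequence-based methods can then consume. The thresholds, the readings, and the function names below are hypothetical; this is a minimal sketch, not the specific algorithms covered in the tutorial.

```python
from typing import List, Tuple

Measurement = Tuple[int, float]  # (day of stay, measured value)
Interval = Tuple[str, int, int]  # (state, first day, last day)


def to_state(value: float, low: float, high: float) -> str:
    """Map a numeric value to a qualitative state using two thresholds."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def state_intervals(
    measurements: List[Measurement], low: float, high: float
) -> List[Interval]:
    """Merge consecutive measurements with the same state into intervals."""
    intervals: List[Interval] = []
    for day, value in sorted(measurements):
        state = to_state(value, low, high)
        if intervals and intervals[-1][0] == state:
            # Same state as the previous measurement: extend that interval.
            _, start, _ = intervals[-1]
            intervals[-1] = (state, start, day)
        else:
            # State changed (or first measurement): open a new interval.
            intervals.append((state, day, day))
    return intervals


# Hypothetical lab readings for one patient; thresholds are illustrative only.
readings = [(0, 4.1), (1, 3.2), (2, 3.0), (3, 4.5), (4, 5.6), (5, 5.9)]
intervals = state_intervals(readings, low=3.5, high=5.0)
print(intervals)
```

The resulting list of labeled intervals is one possible input representation for tasks such as sequential pattern mining or classification.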

In the last part of this session, we present key research topics, algorithms, and results related to the detection of adverse drug events from electronic health records. Our discussion centers both on the algorithmic details and on the medical implications of the obtained results. The presented data sources come from electronic health records including text, as well as clinical measurements, sequences of diagnosis codes, and prescriptions.
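To give a feel for what a signal of an adverse drug event can look like in counts, the sketch below computes a proportional reporting ratio (PRR), a common disproportionality measure in pharmacovigilance. It is shown here as an assumption for illustration and is not necessarily one of the methods presented in the tutorial; the counts and the function name are hypothetical.

```python
def proportional_reporting_ratio(
    exposed_with_event: int,
    exposed_without_event: int,
    unexposed_with_event: int,
    unexposed_without_event: int,
) -> float:
    """PRR = event rate among exposed patients / event rate among unexposed."""
    exposed_total = exposed_with_event + exposed_without_event
    unexposed_total = unexposed_with_event + unexposed_without_event
    if exposed_total == 0 or unexposed_total == 0:
        raise ValueError("both groups must contain at least one patient")
    if unexposed_with_event == 0:
        raise ValueError("PRR is undefined when no unexposed patient has the event")
    exposed_rate = exposed_with_event / exposed_total
    unexposed_rate = unexposed_with_event / unexposed_total
    return exposed_rate / unexposed_rate


# Hypothetical counts from patient records: 20 of 100 patients on the drug
# show the event, compared with 10 of 200 patients not on the drug.
prr = proportional_reporting_ratio(20, 80, 10, 190)
print(round(prr, 2))
```

A ratio well above 1 marks a drug and event pair as worth closer inspection. It does not establish causality, and confounding in hospital data can easily inflate or hide such a signal.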

The slides of Part 2 can be found here. In the present version, pictures have been removed.

The Tutors

Myra Spiliopoulou is Professor of Business Information Systems at the Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany. Her main research is on mining dynamic complex data. Her publications are on mining complex streams, mining evolving objects, adapting models to drift and building models that capture drift. She focusses on two application areas: business (including opinion stream mining and adaptive recommenders) and medical research (including epidemiological mining and learning from clinical studies). She served as PC Co-Chair of ECML PKDD 2006, NLDB 2008 and the 36th Annual Conference of the German Classification Society (GfKl 2012, Hildesheim, August 2012). She is involved in the organizing committees of several conferences. She is PC Co-Chair for CBMS 2016. She was Tutorials Co-Chair at ICDM 2010, Workshops Co-Chair at ICDM 2011, and Demo Track Co-Chair of ECML PKDD 2014 and 2015, and is senior PC member of recent conferences such as ECML PKDD 2014 and 2015 and SIAM Data Mining 2015. She has held tutorials on data mining topics at KDD 2009 and 2015, PAKDD 2013 and PAKDD 2016, and at most ECML PKDD conferences in recent years.

Prof. Myra Spiliopoulou

Research Group on Knowledge Management and Discovery (KMD),

Faculty of Computer Science, Otto-von-Guericke-University Magdeburg,

PO Box 4120, 39016 Magdeburg, Germany

Email: myra@iti.cs.uni-magdeburg.de

URL: http://www.kmd.ovgu.de/Team/Academic+Staff/Myra+Spiliopoulou.html

Panagiotis Papapetrou is Associate Professor at the Department of Computer and Systems Sciences at Stockholm University and Adjunct Professor at the Computer Science Department at Aalto University. His area of expertise is algorithmic data mining with particular focus on mining and indexing sequential data, complex metric and nonmetric spaces, biological sequences, time series, and sequences of temporal intervals. Panagiotis received his PhD in Computer Science from Boston University in 2009, was a post-doctoral researcher at Aalto University during 2009-2013, and was a lecturer at the University of London during 2012-2013. He has participated in 4 EU projects, 5 NSF grants, and 2 Academy of Finland centers of excellence. He is general chair of the 15th International Symposium on Intelligent Data Analysis 2016 and board member of the Swedish AI Society. He is Associate Editor for the Journal of Data Mining and Knowledge Discovery.

Prof. Panagiotis Papapetrou

Data Science group

Department of Computer and Systems Sciences

PO Box 7003, 164 07, Stockholm, Sweden

Email: panagiotis@dsv.su.se

URL: http://people.dsv.su.se/~panagiotis/