KDD 2015 - Medical Mining Tutorial

KDD 2015 Tutorial on "Medical Mining"

By: Myra Spiliopoulou, Pedro Pereira Rodrigues, Ernestina Menasalvas

Abstract

In 2015, we are experiencing a proliferation of scientific publications, conferences and funding programs on KDD for medicine and healthcare, in short KDD for health. However, medical scholars and practitioners work differently from KDD researchers: their research is mostly hypothesis-driven, not data-driven. It is the KDD researchers who should learn how medical researchers and practitioners work, what questions they have and what methods they use, and how mining methods can fit into their research frame and their everyday business. The purpose of this tutorial is to contribute to this learning process.

We address medicine and healthcare, where the expertise of KDD scholars is needed and familiarity with the basics of medical research is a prerequisite. We aim to provide basics for (1) mining in epidemiology and (2) mining in the hospital. We also address, to a lesser extent, the subject of (3) preparing and annotating Electronic Health Records for mining.

Target audience and prerequisites

This tutorial is intended for all conference participants who have an interest in medicine (including healthcare and medical research) as an application domain for mining.

Background

Background on data mining is expected, since we are not going to explain the learning approaches (e.g. artificial neural networks, density-based clustering, hidden Markov models) but only how each approach has been adjusted to perform the medical mining task. We can, however, expect this background from KDD participants, including students.

Experts in some field of medical mining (e.g. knowledge discovery in biotechnology, knowledge discovery from medical images, machine learning for brain signals) are likely to be familiar with some parts of the tutorial, but not with all three parts.

Importance of topic and benefit for the KDD participants

There is a proliferation of medical data and of applications in which mining is needed. One serious obstacle to deploying mining in medical research is that this research is hypothesis-driven and follows workflows that do not agree with the way mining scholars work. Medical researchers are willing to offer their data for data-driven learning, but it is the task of mining scholars to analyze the data in a way that can be understood and exploited by medical researchers. A most disappointing experience for a mining scholar, especially a novice, is to hear a medical researcher say "nice results but I will not use them". This kind of episode is mentioned in our tutorial and thoroughly explained. The purpose of the tutorial is to provide some basics that will help mining scholars, and not only novices, avoid such experiences.

Who will attend?

The application of KDD methods in medical research, biomedicine and healthcare enjoys increasing interest among KDD participants. KDD 2013 featured two workshops and one tutorial associated with KDD for health, while KDD 2014 hosted six workshops on health-related subjects. So, we expect a strong interest in this tutorial.

Aren't these issues already known to KDD participants?

KDD workshops and older tutorials took the data perspective, elaborating on how hospital data, big health data, m-health data, brain data etc. can be successfully analyzed. However, understanding what medical researchers and hospitals need and expect is no less essential, and far from easy, because of gaps in terminology, work style and research style.

The KDD 2014 tutorial on "Computational Epidemiology" partially addressed this need: it informed KDD participants on what computational epidemiology is, what scientific questions it poses and how data miners can contribute to them. However, epidemiology covers much more than the study of epidemics. This tutorial is about discerning different aspects of epidemiological research, linking it to hospital research, separating research in the hospital from patient management in the hospital, and pointing to where the big health data are (hidden). We expect that the tutorial will be of use to the many KDD participants who are interested in mining for health, but are not (yet) familiar with how those doing research in medicine and healthcare work.

Outline of the Tutorial and Slides

Introduction:
Getting access to medical datasets is often a moment of great excitement. Analyzing the data on a disease means getting the chance to contribute to medical advancement: new ways to cure a disease, new prevention measures, better diagnostics. Quite often, the reality is disillusioning: the mining model is performing well, but the medical expert is reserved about using it. Why? We start this tutorial with the different forms of concern that a medical expert may have when confronted with a mining model. In the following parts of the tutorial, we focus on improving the interaction between the mining expert and the medical expert. To this end, we explain how medical experts work with data, what data they work on, and what mining experts can do for them.

PART 1. Mining in Epidemiology (by Myra Spiliopoulou). This part of the tutorial starts by explaining what epidemiologists study, and introduces some basic terminology on the different kinds of studies in epidemiology. In this part, we will see what a cohort is and what a wave is, what the difference is between a longitudinal and a cross-sectional study, and why clustering methods must always take the target variable (!) into account. We will discuss how basic and elaborate data mining methods can be framed to be useful in epidemiological research. Mining examples are presented throughout this part of the tutorial, but the emphasis is on showing how mining should be applied and not on identifying the most powerful methods. Many methods used in medical mining papers are rather simple; the main challenge is often in modeling the medical problem.

PART 2. Mining Hospital Data for Clinical Research and Clinical Decision Support (by Pedro Pereira Rodrigues). This part of the tutorial deals with knowledge discovery and decision support in the hospital. It starts by explaining Electronic Health Records (EHR) and lists the most prominent dangers faced by a mining scholar who wants to analyze them. We will see the processes in which EHR are used, filled or modified, the knowledge discovery tasks in which these records must be analyzed, and the challenges of such an analysis. Data mining in the hospital should ideally flow into clinical decision support (CDS). This part of the tutorial contains several cases of CDS, highlighting the importance of adhering to the hospital protocols for data processing and model evaluation, and the importance of integrating CDS into the hospital processes.

PART 3. Preparing non-structured information for medical mining (by Ernestina Menasalvas). This part of the tutorial focusses on the preparation of non-structured information contained in text and images for mining. The information in images and texts is very valuable, and there is an abundance of dedicated methods for e.g. medical image analysis. In this part, we instead discuss preparation tasks and workflows for incorporating such information into medical records for the dedicated medical mining tasks discussed in the other parts.

NOTE: In the tutorial, we do not necessarily present these three parts in that order. Also, Part 2 is usually split into a subpart for Clinical Research and one for Clinical Decision Support.

Tutors' short bio

Myra Spiliopoulou is Professor of Business Information Systems at the Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany. Her main research is on mining dynamic complex data. Her publications are on mining complex streams, mining evolving objects, adapting models to drift and building models that capture drift. She focusses on two application areas: business (including opinion stream mining and adaptive recommenders) and medical research (including epidemiological mining and learning from clinical studies). She served as PC Co-Chair of ECML PKDD 2006, NLDB 2008 and of the 36th Annual Conference of the German Classification Society (GfKl 2012, Hildesheim, August 2012). She is involved in the organization committees of several conferences; she was Tutorials Co-Chair at ICDM 2010 and Workshops Co-Chair at ICDM 2011, Demo Track Co-Chair of ECML PKDD 2014 and 2015, and is a senior PC member of recent conferences like ECML PKDD 2014, 2015 and SIAM Data Mining 2015. She has held tutorials on topics of data mining at KDD 2009, PAKDD 2013 and in many ECML PKDD conferences.

Prof. Myra Spiliopoulou
Research Group on Knowledge Management and Discovery (KMD),
Faculty of Computer Science, Otto-von-Guericke-University Magdeburg,
PO Box 4120, 39016 Magdeburg, Germany
Email: myra@iti.cs.uni-magdeburg.de
URL: http://www.kmd.ovgu.de/Team/Academic+Staff/Myra+Spiliopoulou.html

Pedro Pereira Rodrigues is Professor at the Department of Health Information and Decision Sciences, Faculty of Medicine of the University of Porto, and a researcher at the Biostatistics and Intelligent Data Analysis group of the Center for Health Technologies and Services Research. His main research area is machine learning, currently devoted to Bayesian network applications to clinical research and decision support. He has edited 4 conference proceedings, and published articles in indexed peer-reviewed journals and conference proceedings. He has helped organize events as general chair (CBMS 2013) and PC chair (ECMLPKDD 2015, CBMS 2014-15, and several thematic tracks and workshops since 2007), is a member of the steering committee of CBMS, and has been a member of the program committee for more than 20 editions of international conferences (e.g. IJCAI, ECMLPKDD, ICML, CBMS). He has also co-organized tutorials at IBERAMIA 2012 and ECMLPKDD 2014.

Prof. Pedro Pereira Rodrigues
CINTESIS & LIAAD, Health Information and Decision Sciences Department,
Faculty of Medicine of the University of Porto, Alameda Prof. Hernani Monteiro, 4200-319 Porto, Portugal
Email: pprodrigues@med.up.pt
URL: http://users.med.up.pt/pprodrigues/

Ernestina Menasalvas is Professor at the Department of Computer Systems Languages and Software Engineering, Faculty of Computer Science of Universidad Politecnica de Madrid (UPM). Her subject area is Data Mining. She studied Computer Science and she has a PhD in Computer Science. She is currently a member of MIDAS, the “Data Mining and data simulation group” at the Center of Biotechnology at UPM, and a professor of databases and data mining at UPM. Her research activities are on various aspects of data mining project development, and in recent years her research has focused on data mining in the medical field. She has participated in different research and development projects related to data integration and mining on mobile devices. She has published three international books on web mining (edited by Springer in 2003, 2004 and 2009 respectively) and in many international journals, including Data and Knowledge Engineering Journal, Information Sciences, Expert Systems with Applications, Journal of Medical Systems and International Journal of Intelligent Data Analysis.

Prof. Ernestina Menasalvas
Centro de Tecnologia Biomedica, Universidad Politecnica de Madrid,
Campus de Montegancedo, Pozuelo de Alarcon, Spain
Email: ernestina.menasalvas@upm.es
URL: https://scholar.google.com/citations?user=CVyVk2UAAAAJ&hl=en