ECML/PKDD 2014 - Medical Mining Tutorial

ECML PKDD 2014 Tutorial on "Medical Mining for Clinical Knowledge Discovery"

By: Pedro Pereira Rodrigues, Myra Spiliopoulou, Ernestina Menasalvas

Short Description

Medical data mining is a mature area of research, characterized by both simple and very elaborate methods, mostly dedicated to solving a concrete problem of disease diagnosis, disease description or success prediction for a treatment. Clinical knowledge discovery encompasses analysis of epidemiological data, and of clinical and administrative data on patients; clinical decision support builds upon findings on these data. We elaborate on how data mining can contribute to such findings, we enumerate challenges of model learning, data availability and data provenance, and identify challenges on Big Medical Data.


  • Prof. Ernestina Menasalvas
    Centro de Tecnologia Biomedica, Universidad Politecnica de Madrid,
    Campus de Montegancedo, Pozuelo de Alarcon, Spain
  • Prof. Pedro Pereira Rodrigues
    CINTESIS & LIAAD, Health Information and Decision Sciences Department,
    Faculty of Medicine of the University of Porto,
    Alameda Prof. Hernani Monteiro, 4200-319 Porto, Portugal
  • Prof. Myra Spiliopoulou
    Research Group on Knowledge Management and Discovery (KMD),
    Faculty of Computer Science, Otto-von-Guericke-University Magdeburg,
    PO Box 4120, 39016 Magdeburg, Germany


  • (+) Self-presentation of the Tutorialists and Overview of the Domain (all)
  • (1) Knowledge Discovery from Epidemiological Data - Myra Spiliopoulou
  • (2) Knowledge Discovery from Clinical and Administrative Data - Pedro Pereira Rodrigues
  • (3) Knowledge Discovery Challenges on Big Medical Data - Ernestina Menasalvas
  • (4) Knowledge Discovery for Clinical Decision Support - Pedro Pereira Rodrigues
  • (+) Concluding Remarks

Part 1 - Knowledge discovery from epidemiological data
We start the tutorial with knowledge discovery from epidemiological data: clinical diagnosis and treatment prescriptions are based on the findings of epidemiological research. Epidemiological data come from population-based studies with randomly selected participants, from cross-sectional studies and from clinical trials. Epidemiological research is largely hypothesis-driven; mining studies are rare. We elaborate on what epidemiological data look like, discuss how mining can contribute to their analysis and highlight inherent challenges of data provenance, big feature spaces, data reliability and novel types of concept drift.

Part 2 - Knowledge discovery from clinical and administrative data
Electronic Health Records (EHR) and Admission-Discharge-Transfer (ADT) systems are valuable data sources for medical data mining focusing both on clinical research and on health services research. However, these sources are also prone to erroneous, bogus, missing and default data. We will present and discuss case studies where these data quality problems yielded incorrect data mining results. Furthermore, we will present success cases where mining these sources resulted in relevant knowledge discovery in the fields of clinical and health services research.

Part 3 - Knowledge discovery challenges on Big Medical Data
Using Big Data in the healthcare sector to improve the overall efficiency and quality of care delivery still has to address several technical requirements, such as: i) generalized use of Electronic Health Records (EHR) and its implications; ii) preprocessing of natural text contained in reports, notes, etc.; iii) annotation of images; iv) dealing with data silos and building solutions that avoid them; and v) data quality mechanisms. On top of this, access to data is an important issue, and the related legal aspects have to be taken into account. We will analyze all these challenges with special emphasis on text and image processing.

Part 4 - Knowledge discovery for clinical decision support
Clinical decision support is usually seen as the final goal of knowledge discovery and modeling for clinical practice, as it aims to apply developed models to individual patients. However, the real-world application of learning-based models for clinical decision support is hindered by the need to integrate with evidence-based medicine and the acceptance by the clinicians that the model includes quality evidence regarding the particular patient. We will discuss the main issues regarding this struggle, addressing the advantages of probabilistic methods, and present success cases of probabilistic learning-based decision support systems.

Target Audience

The target groups are: postgraduate students with a solid background in data mining; research scholars who are interested in medical mining and need some guidance through the subfields of this huge research area; research scholars who work on one of the medical mining areas and are interested in transferring their methods to other areas.

The Presenters

Ernestina Menasalvas is Professor at the Department of Computer Systems Languages and Software Engineering, Faculty of Computer Science of Universidad Politecnica de Madrid (UPM) and a member of MIDAS, the Data Mining and Data Simulation group at the Center of Biotechnology at UPM. Her subject area is data mining, most recently applied to medical data. She has also participated in a range of projects related to data integration and mining on mobile devices. She has published three international books on web mining (edited by Springer in 2003, 2004 and 2009 respectively) as well as articles in several key international journals.

Pedro Pereira Rodrigues is Professor at the Department of Health Information and Decision Sciences, Faculty of Medicine of the University of Porto, and a researcher at the Biostatistics and Intelligent Data Analysis group of the Center for Health Technologies and Services Research. His main research area is machine learning, currently devoted to applications of Bayesian networks to clinical research and decision support. He has edited 4 conference proceedings, and published articles in indexed peer-reviewed journals and conference proceedings. He has helped organize events as general chair (CBMS 2013) and PC chair (CBMS 2014, ECMLPKDD 2015, and several thematic tracks and workshops since 2007), is a member of the steering committee of CBMS, and was a member of the program committee for more than 20 editions of international conferences (e.g. IJCAI, ECMLPKDD, ICML, CBMS). He has also co-organized a tutorial at IBERAMIA 2012.

Myra Spiliopoulou is Professor of Business Information Systems at the Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Germany. Her main research interest is knowledge discovery and adaptation. She has publications in international journals and conferences on web mining, text mining, model monitoring and adaptation over evolving data. She served as PC Co-Chair of ECML PKDD 2006 and NLDB 2008, as Tutorials Chair at ICDM 2010 and Workshops Chair at ICDM 2011. In 2012, she was PC Chair of the 36th Annual Conference of the German Classification Society (GfKl 2012, Hildesheim, August 2012). In addition to several tutorials at ECML PKDD, she has given tutorials at User Modeling 2007 and at KDD 2009.

LITERATURE, as of April 2014 (own papers marked with a *)

The literature below comes from the time of tutorial submission. For the updated literature list, please consult the slides of the tutorial.

Part 1a - Mining Epidemiological Data

  1. S.E. Baumeister, H. Voelzke, P. Marschall, (...), C. Schmidt, S. Flessa, D. Alte. Impact of fatty liver disease on health care utilization and costs in a general population: A 5-year observation. Gastroenterology 134 (1), 85-94, (2008)
  2. * U. Niemann, H. Voelzke, J.-P. Kuehn, M. Spiliopoulou. Learning and Inspecting Classification Rules from Longitudinal Epidemiological Data to Identify Predictive Features on Hepatic Steatosis. Journal of Expert Systems with Applications, accepted (02/2014)
  3. B. Preim, P. Klemm, H. Hauser, K. Hegenscheid, S. Oeltze, K. Toennies, H. Voelzke. Visualization in Medicine and Life Sciences III. Springer, Ch. "Visual Analytics of Image-Centric Cohort Studies in Epidemiology" (2014)
  4. H. Y. Shi, S. L. Hwang, K. T. Lee, and C. L. Lin. In-hospital mortality after traumatic brain injury surgery: a nationwide population-based comparison of mortality predictors used in artificial neural network and logistic regression models. Journal of Neurosurgery, 118, 746-752, (2013)
  5. C. Zhang, R.L. Kodell. Subpopulation-specific confidence designation for more informative biomedical classification. Artificial Intelligence in Medicine 58 (3), 155-163, (2013)

Part 1b - Dealing with Evolution in Epidemiological Data

  1. S. Ebadollahi, J. Sun, D. Gotz, J. Hu, D. Sow, and C. Neti. Predicting patient trajectory of physiological data using temporal trends in similar patients: A system for near-term prognostics. AMIA Annu. Symp. Proc., vol. 2010, pp. 192-196, (2010)
  2. * G. Krempl, Z. F. Siddiqui, and M. Spiliopoulou. Online clustering of high-dimensional trajectories under concept drift. In Proc. of ECML PKDD 2011, ser. LNAI, vol. 6912. Athens, Greece: Springer, (2011)
  3. * Z. Siddiqui, M. Oliveira, J. Gama, and M. Spiliopoulou. Where are we going? predicting the evolution of individuals. In Proc. of the IDA 2012 Conf. on Intelligent Data Analysis, vol. LNCS 7619. Helsinki, Finland: Springer, Oct. 2012, pp. 357-368, (2012)
  4. H. Wang, F. Nie, H. Huang, J. Yan, S. Kim, S. Risacher, A. Saykin, and L. Shen. High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer's disease progression prediction. In Adv. in Neural Inf. Processing Systems 25, eds., P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, 1286-1294, (2012)
  5. J. Zhou, J. Liu, V. A. Narayan, and J. Ye. Modeling disease progression via fused sparse group lasso. In Proc. of KDD 2012, pages 1095-1103. ACM, (2012)

Part 2 - Clinical and Administrative Data Mining

  1. * Cruz-Correia, R., Rodrigues, P. P., Freitas, A., Almeida, F., Chen, R., & Costa-Pereira, A. (2009). Data Quality and Integration Issues in Electronic Health Records. In V. Hristidis (Ed.), Information Discovery on Electronic Health Records (pp. 55-95). CRC Press.
  2. Cismondi, F., Fialho, A. S., Vieira, S. M., Reti, S. R., Sousa, J. M. C., & Finkelstein, S. N. (2013). Missing data in medical databases: Impute, delete or classify? Artificial Intelligence in Medicine, 1–10. doi:10.1016/j.artmed.2013.01.003
  3. Jiang, X., & Cooper, G. F. (2009). A real-time temporal Bayesian architecture for event surveillance and its application to patient-specific multiple disease outbreak detection. Data Mining and Knowledge Discovery, 20(3), 328–360. doi:10.1007/s10618-009-0151-4
  4. * Rodrigues, P. P., Dias, C. C., Rocha, D., Boldt, I., Teixeira-Pinto, A., & Cruz-Correia, R. (2013). Predicting visualization of hospital clinical reports using survival analysis of access logs from a virtual patient record. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems (pp. 461-464). Porto, Portugal. doi:10.1109/CBMS.2013.6627841
  5. * Vasco, D., Rodrigues, P. P., & Gama, J. (2013). Contextual anomalies in medical data. In Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems (pp. 544-545). Porto, Portugal. doi:10.1109/CBMS.2013.6627869
  6. Duan, L., Khoshneshin, M., Street, W. N., & Liu, M. (2013). Adverse drug effect detection. IEEE Journal of Biomedical and Health Informatics, 17(2), 305–11. doi:10.1109/TITB.2012.2227272

Part 3 - Big Medical Data

  1. Cusack CM, H. G. (2012). The future state of clinical data capture and documentation: a report from AMIA's 2011 Policy Meeting. Journal of the American Medical Informatics Association, 1-7.
  2. Hani Neuvirth, M. O.-F. (2012). Toward Personalized Care Management of Patients at Risk--the Diabetes Case Study.
  3. Raghupathi W: Data Mining in Health Care. In Healthcare Informatics: Improving Efficiency and Productivity. Edited by Kudyba S. Taylor & Francis; 2010:211-223.
  4. Raghupathi W, Kesh S: Interoperable electronic health records design: towards a service-oriented architecture. e-Service Journal 2007, 53-57.
  5. IBM: Data Driven Healthcare Organizations Use Big Data Analytics for Big Gains; 2013. s_use_big_data_analytics_for_big_gains.pdf.
  6. Ikanow: Data Analytics for Healthcare: Creating Understanding from Big Data.
  7. jStart: How Big Data Analytics Reduced Medicaid Readmissions. A Start Case Study; 2012.

Part 4 - Clinical decision support

  1. * Cardoso, T., Teixeira-Pinto, A., Rodrigues, P. P., Aragao, I., Costa-Pereira, A., & Sarmento, A. E. (2013). Predisposition, Insult/Infection, Response and Organ Dysfunction (PIRO): A Pilot Clinical Staging System for Hospital Mortality in Patients with Infection. PLoS ONE, 8(7), e70806. doi:10.1371/journal.pone.0070806
  2. * Sebastiao, R., Gama, J., Rodrigues, P. P., & Bernardes, J. (2010). Monitoring Incremental Histogram Distribution for Change Detection in Data Streams. In M. M. Gaber, R. R. Vatsavai, O. A. Omitaomu, J. Gama, N. V Chawla, & A. R. Ganguly (Eds.), Knowledge Discovery from Sensor Data (Vol. 5840, pp. 25-42). Springer Verlag. doi:10.1007/978-3-642-12519-5_2
  3. Celi, L. A., Hinske, L. C., Alterovitz, G., & Szolovits, P. (2008). An artificial intelligence tool to predict fluid requirement in the intensive care unit: a proof-of-concept study. Critical Care (London, England), 12(6), R151. doi:10.1186/cc7140
  4. Nee, O., & Hein, A. (2010). Clinical Decision Support with Guidelines and Bayesian Networks. In Advances in Decision Support Systems. INTECH.
  5. Sesen, M. B., Nicholson, A. E., Banares-Alcantara, R., Kadir, T., & Brady, M. (2013). Bayesian networks for clinical decision support in lung cancer care. PloS One, 8(12), e82349. doi:10.1371/journal.pone.0082349

Last Modification: 14.12.2016 - Contact Person: Prof. Dr. Myra Spiliopoulou