The ACM Silicon Valley Data Mining Camp on November 1, 2009 attracted more than 200 people with different backgrounds and interests. It was held at Hacker’s Dojo and sponsored by REvolution Computing, KXEN (Knowledge Extraction Engines), and LinkedIn (see notes on this event by @Andraz of Zemanta, Ken's open source tools, and relevant #dmcamp tweets).
The biomedical/healthcare data mining topic was suggested by Junling Hu and Irene Gabashvili and supported by A.J. Chen, Greg Makowski, Sukanta Ganguly, and 40 other participants of the Data Mining Camp. Below is a brief transcript of the discussion.
The session started with introductions; here are some of them:
- Irene, with background in biophysics, medical informatics and CS, pursuing a personal health management venture, interested in data mining to advance personalized medicine;
- Lawrence, with background in physics and software engineering and interest in health IT. He is the organizer of the Google Wave meetup (you may know about Google Health Wave);
- Hua, formerly with Kaiser, interested in medical scheduling and web development;
- Liana, interested in Natural Language Processing for biomedical knowledge mining;
- Maura, interested in Health IT, medical engineering and security;
- Magnus, developing Medical Databases;
- Kevin, interested in medical startups;
- Watson, with background in genomics and machine learning;
- Peters, working on medical devices and software embedded systems;
- Steve, formerly of Applied Biosystems;
- Roy of Codexis, focusing on data mining and pattern recognition in multivariate time series;
- Jima, with background in medical informatics;
- Karsleep, interested in biomedical data mining;
- Deena, scientific analyst interested in how data mining technologies could be applied to healthcare;
- Junling Hu, scientist at Bosch, working on a device and software collecting and analyzing patients' information, based on daily questionnaires and other collected data.
Junling started the session by mentioning a recently published paper on computer technologies for healthcare that sets out strategic directions in the area. Irene also suggested checking the mHealth Summit, which focuses on mobile technologies to improve research data collection, healthcare delivery, and health outcomes.
Junling described the project she was working on: an inexpensive device that collects data and sends it to a "coaching" nurse who monitors stay-at-home patients. The next step is to mine the data automatically, thus reducing the load on healthcare professionals without sacrificing patients' well-being. Junling also mentioned some of the challenges, such as the compliance of participants, who are typically not eager to fill out 20-question surveys. The problem is especially acute in obesity studies.
The data mining challenges mentioned during the session were:
(1) Missing Data
We are not talking about sparse data (discussed in one of the previous sessions on data mining with R), but about data that is actually missing. Data is sparse if only a small fraction of the attributes are non-null: the number of items we typically buy in a grocery store is much smaller than the number of products the store offers. Data is missing if the values were never entered or the member combination is not meaningful (for example, obstetrics/gynecology values are not meaningful for men). One of the suggestions from the experts in the audience was to use "multiple imputation". Other suggestions included "once-a-week" questioning instead of daily surveys. Irene mentioned the 7D-PAR (Seven-Day Physical Activity Recall), one of the standardized questionnaires developed in the 80s (1,2), and other established methods.
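To make the "multiple imputation" suggestion a bit more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method anyone in the session described: the step counts are made up, the function name `multiple_imputation_mean` is hypothetical, and the imputation model is a simple donor-based draw (an approximate Bayesian bootstrap) rather than a regression model. Each of the `m` completed data sets yields its own estimate, and the estimates are pooled with Rubin's rules so that the final variance reflects the uncertainty caused by the gaps.

```python
import random
import statistics

def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean of `values` (None = missing) from m imputed copies."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    estimates, variances = [], []
    for _ in range(m):
        # Resample the observed values first, so that each imputation
        # also reflects uncertainty about the observed distribution.
        donors = [rng.choice(observed) for _ in observed]
        # Fill every gap with a random draw from this round's donors.
        completed = [v if v is not None else rng.choice(donors) for v in values]
        estimates.append(statistics.mean(completed))
        # Sampling variance of the mean within this completed data set.
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates) if m > 1 else 0.0
    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var

# Hypothetical daily step counts; None marks a day when the survey was skipped.
steps = [4200, None, 5100, 3900, None, 4700, 5600]
pooled_mean, total_variance = multiple_imputation_mean(steps)
print(pooled_mean, total_variance)
```

The point of repeating the imputation is that a single filled-in value would look as trustworthy as a real answer; the between-imputation spread is what keeps the analysis honest about the skipped days.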
Questions and comments from the audience:
- Data mining methods used for chronic disease assessment and elderly monitoring. Junling talked about unsupervised classification algorithms and the two supervised learning methods she found most useful for her work: SVM and logistic regression. Both were equally good at predicting hospitalization events.
- Indicators of Goodness of Model Predictions. Suggested events were hospitalizations, mortality... It was noted that good indicators are yet to be found.
This was another health data mining challenge emphasized during the discussion. All standard measures can be applied, such as accuracy, precision, recall, true positives, false positives, and especially combinations of the last two. Junling mentioned a breast cancer classifier developed by Siemens and other algorithms that predict emergency situations with 90% accuracy. Irene noted that one of the problems of digital mammography and other cancer predictors is a high rate of false positives. From 30 to 40% of cancers are overdiagnosed (3), which increases healthcare costs. This has to change.
Several people in the audience emphasized that existing methods average over the population. Medicine needs to be truly personalized; we need better methods and more data.
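The measures listed above are easy to compute from the four confusion-matrix counts. The sketch below uses invented numbers (100 screened patients, 5 of whom are truly ill) and hypothetical helper names; it is only meant to show why a classifier can report roughly 90% accuracy while most of its positive calls are still false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summarize(y_true, y_pred):
    """Return the standard evaluation measures as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Invented example: 5 ill patients (4 detected, 1 missed) and
# 95 healthy patients (8 flagged by mistake, 87 correctly cleared).
y_true = [1] * 5 + [0] * 95
y_pred = [1] * 4 + [0] * 1 + [1] * 8 + [0] * 87
report = summarize(y_true, y_pred)
print(report)
```

With these made-up counts the accuracy is 0.91, yet precision is only 4/12: two out of three flagged patients are healthy. That is why the discussion singled out false positives rather than accuracy alone.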
(3) Large Number of Input Features
One of the main problems of health data mining is coping with a large number of input features. Obviously, a 20-question test is not sufficient. Should the model rely on a thousand questions or a trillion inputs? And how do we select a subset of relevant features to build robust learning models? Junling's preferred approaches are logistic regression and singular value decomposition. She would add features one by one and check whether the overall prediction accuracy remains good.
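Adding features one by one and watching the accuracy is usually called greedy forward selection. Here is a small sketch of that idea under loud assumptions: the data are invented, the function names are hypothetical, and a nearest-centroid classifier scored by leave-one-out accuracy stands in for the logistic regression mentioned in the session, purely to keep the example short and dependency-free. This variant keeps a feature only when it improves accuracy, which is one common reading of the procedure.

```python
def centroid(rows, features):
    """Mean of the selected feature columns over the given rows."""
    return [sum(r[f] for r in rows) / len(rows) for f in features]

def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lbl) for j, (x, lbl) in enumerate(zip(X, y)) if j != i]
        best_label, best_dist = None, float("inf")
        for label in sorted(set(lbl for _, lbl in train)):
            rows = [x for x, lbl in train if lbl == label]
            c = centroid(rows, features)
            dist = sum((X[i][f] - cf) ** 2 for f, cf in zip(features, c))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)

def forward_select(X, y, min_gain=1e-9):
    """Greedily add the feature that raises accuracy most; stop when none does."""
    remaining = list(range(len(X[0])))
    selected, best_acc = [], 0.0
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        acc, feat = max(scored, key=lambda s: s[0])
        if acc - best_acc <= min_gain:
            break
        selected.append(feat)
        remaining.remove(feat)
        best_acc = acc
    return selected, best_acc

# Invented data: column 0 separates the classes, column 1 is noise,
# column 2 is weakly informative.
X = [
    [1.0, 5.0, 0.0], [1.2, 1.0, 0.0], [0.9, 4.0, 1.0], [1.1, 2.0, 0.0],
    [3.0, 5.0, 1.0], [3.2, 1.0, 1.0], [2.9, 4.0, 0.0], [3.1, 2.0, 1.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]
selected, accuracy = forward_select(X, y)
print(selected, accuracy)
```

On this toy data the procedure keeps only column 0 and stops, because no further column can improve on a perfect leave-one-out score. On real questionnaire data the stopping threshold `min_gain` and the validation scheme would need much more care to avoid over-fitting.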
Questions from the audience included:
- A 3-5 year Vision for Health Data Mining: what do we expect to achieve?
Participants expressed an optimistic outlook.
- Ray: Are most input variables discrete or continuous? The answer was: mixed.
The good thing about pattern recognition is that the more patterns you have, the better it performs. Google translator is a good proof of this assertion (although this translator needs even more patterns to do a decent job).
Biomedical data sets such as CT, MRI, PET scans and other image data, gene expression, genetic variation are very large scale in nature. The challenge for data miners is to integrate and extract information from data of such scale.
Questions from the audience:
- What other large-scale studies outside the US are trying to mine patterns in health data?
There are studies in China and Taiwan using similar devices and models, and also studies in Europe.
Adding to this interesting discussion that was unfortunately interrupted because of the lack of time, I'd like to mention a few other challenges facing biomedical data mining.
- We should not underestimate the complexity of relationships between causative and effect variables in human health. Simplistic approaches are doomed to fail. Over-fitting could be a problem too.
- Integration between heterogeneous data sources and types, and putting content in context (semantic integration), remains a challenge.
- Privacy Concerns associated with the Sharing of Individual Health Information.
- Blair S. How to assess exercise training habit and physical fitness. In: Behavioral Health, edited by Matarazzo JD. New York: Wiley, 1984, p. 424-447.
- Rauramaa R., Tuomainen P., Väisänen S., and Rankinen T. Physical activity and health-related fitness in middle-aged men. Med Sci Sports Exerc 27: 707-712, 1995.
- Gøtzsche, P.C., Jørgensen, K.J., Mæhlen, J. and Zahl, P.-H. Estimation of lead time and overdiagnosis in breast cancer screening. British Journal of Cancer (2009) 100, 219–219.