Sunday, March 28, 2010

Health Data, Self-serve, Visualization, Semantic Analysis and Collective Intelligence

Notes from the Biomedical Data Mining Camp Session led by Dr. Irene Gabashvili and other health-related discussions at Data Mining Camp and Transparency Camp 2010.

The title for the session was "Biomedical Data Mining: Successes, Failures, and Challenges" (streamed online from Fireside C).
The topic stemmed from last year's similarly named session - Biomedical Data Mining: Dimensionality, Noise, Applications - now split into several discussions, including Bioinformatics & Genome Sequencing, organized by Raymond McCauley, and Dimensionality Reduction, moderated by Luca Rigazio (HLDA/HDA; LDPP; Core Vector Machines; Sparse Proj SQ; Random Projection and Feature Selection).

Main sub-topics of the biomedical session were:
  • Reality Mining
  • Visualization
  • Imaging
  • Signal Processing
Reality mining is expected to improve public health and medicine; it was named one of "10 emerging technologies that could change the world". This is not only about mining data pertaining to social behavior - although social factors do impact well-being and social data provides valuable health predictors. It is also about health data collected in real and near-real time.

The audience asked about ways to collect health data and their limitations. Some of the questions echoed the earlier Q&A with the Data Mining Camp expert panel, especially Dr. Michael Walker, author of numerous FDA- and CLIA-approved products to diagnose and treat disease. He commented on the need for tools well beyond those available today - taking cell samples every two minutes instead of laboratory testing every few months - in order to run cellular simulations and obtain parameters for differential equations. Current sampling frequencies don't come close to allowing this kind of modeling. Irene Gabashvili agreed that first-principles cellular modeling for predicting health won't be possible (although some engineers believe a platform for real-time in-vivo measurements of most cells can be developed). Yet good predictors can and will be developed, based on sensors measuring macro-level observations combined with missing-value estimators. Genetic information is not enough; we need to capture environmental risk factors. "How can we separate genes and environment?" asked one of the participants. Aurametrix's initial focus is on chemicals in our food - and even though some may argue that our taste and satiety mechanisms are dictated by genes, food analytics provides insight into the non-genetic components of our health. Other questions concerned the timeline for body sensor networks and data growth. Jeffrey Nick of EMC estimates that personal sensor data will balloon from 10% of all stored information to 90% within the next decade.
Irene Gabashvili thinks that this will happen rather sooner than later, perhaps in the next two years.
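
As a toy illustration of the kind of missing-value estimator mentioned above (the function name and numbers are invented for this sketch, not taken from any actual product), here is one way to fill gaps in a sparse sensor stream by linear interpolation:

```python
from datetime import datetime, timedelta

def fill_gaps(readings, step_minutes=5):
    """Linearly interpolate a sparse sensor stream onto a regular time grid.

    `readings` is a time-sorted list of (datetime, value) pairs.
    """
    filled = []
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        span = (t1 - t0).total_seconds()
        t = t0
        while t < t1:
            frac = (t - t0).total_seconds() / span
            filled.append((t, v0 + frac * (v1 - v0)))
            t += timedelta(minutes=step_minutes)
    filled.append(readings[-1])  # keep the final raw reading
    return filled

start = datetime(2010, 3, 28, 8, 0)
heart_rate = [(start, 60.0), (start + timedelta(minutes=10), 80.0)]
grid = fill_gaps(heart_rate)  # readings at 8:00, 8:05, 8:10
```

Real estimators would, of course, use far more than a straight line - but even this shows how sparse samples can be stretched onto the dense grid a model needs.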

Another interesting aspect of reality mining is crowdsourcing, or collective intelligence - in order to get useful information from all the data (temporospatial location, GPS, activity, food, symptoms, behavior, communication content, proximity sensing), we need to analyze it not only at the individual but also at the group level. We need to share more, without sacrificing privacy and security. Collective contributions can be reliable - Shamod Lacoui's answer to this is selecting those who contribute and restricting inputs to the domain. This would help "filter out the dross" while "saving the best". Noise must be suppressed to infer intelligence from the collection of facts, clicks, steps - whatever one can contribute. This resonates with discussions at the Transparency Camp - one of the useful tools is SwiftRiver, a free, open-source software platform to validate and filter news. Swift relies on Natural Language Processing, Machine Learning and Veracity Algorithms to track and verify the accuracy of reports and suppress noise (like duplicate content, irrelevant cross-chatter and inaccuracies). Transparency Camp also raised the question of whether there is a need for an FDA-like institution to ensure information safety and healthy information consumption.
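
A crude sketch of duplicate-content suppression of the kind such filtering tools perform (this is an illustration, not SwiftRiver's actual algorithm): compare reports by their overlapping word shingles and drop near-duplicates:

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def suppress_duplicates(reports, threshold=0.5):
    """Keep a report only if it is not too similar to one already kept."""
    kept = []
    for report in reports:
        s = shingles(report)
        if all(jaccard(s, shingles(seen)) < threshold for seen in kept):
            kept.append(report)
    return kept

reports = [
    "cholera outbreak reported near the broad street pump",
    "Cholera outbreak reported near the Broad Street pump today",
    "water quality tests scheduled for next week",
]
kept = suppress_duplicates(reports)  # the near-duplicate second report is dropped
```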

Self-serve was the topic of a smaller Data Mining Camp session. Even though it was aimed at sales reps who need to go beyond Excel spreadsheets to mine private data of interest, self-service is currently the only option for health care consumers. People need to analyze everyday life for health implications, and they need better tools so they don't focus on the metrics that are easy to collect instead of the metrics that actually need collecting.

In order to mine a high-dimensional health space, many disparate types of data should be mashed together and validated, gaps should be bridged, and structured metadata added. Randy Kerber talked about data formats and approaches to make this happen. Semantic web discussions involved NoSQL experts who mentioned the limitations of technologies gaining popularity, such as MongoDB, Cassandra and HBase. Another relevant session - on cloud computing - discussed its (sometimes over-rated?) performance and Hadoop technologies.
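
As an illustration of what "mashing" disparate sources with added metadata can look like (the feeds, field names, and mapping are all hypothetical), the basic move is to normalize each source onto a shared schema while tagging provenance:

```python
def normalize(record, source, field_map):
    """Map a source-specific record onto a shared schema, tagging provenance."""
    out = {target: record[src] for target, src in field_map.items() if src in record}
    out["_meta"] = {"source": source}
    return out

# two hypothetical feeds describing the same measurement with different field names
lab = {"pid": "p1", "glu_mgdl": 95}
app = {"patient": "p1", "glucose": 97}

unified = [
    normalize(lab, "lab_feed", {"patient_id": "pid", "glucose_mg_dl": "glu_mgdl"}),
    normalize(app, "mobile_app", {"patient_id": "patient", "glucose_mg_dl": "glucose"}),
]
```

Once records share a schema and carry their source, the validation and gap-bridging steps have something uniform to work on.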

Visualization techniques provide one of the most effective methods of extracting knowledge from health data. Remember who invented the pie chart? That's right, it was Florence Nightingale, a nurse who needed a better way to represent her data. One of the most famous examples of visualizing epidemiological data was Dr. John Snow's map of deaths from a cholera outbreak in London. Many other techniques and software tools exist, but maps remain popular - one widespread tool for epidemiological data is the Google Maps API, which lets a web page embed an interactive map with JavaScript.
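
In the spirit of Snow's map, a minimal sketch of the preprocessing behind such visualizations (the coordinates here are invented): bin case locations into grid cells to surface hotspots before plotting them.

```python
from collections import Counter

def grid_counts(cases, cell=0.01):
    """Bin (lat, lon) case locations into grid cells and count cases per cell."""
    counts = Counter()
    for lat, lon in cases:
        counts[(round(lat / cell), round(lon / cell))] += 1
    return counts

# invented case coordinates near Soho, London
cases = [(51.513, -0.137), (51.513, -0.136), (51.520, -0.140)]
hotspots = grid_counts(cases)  # two nearby cases fall into the same cell
```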

One of the participants of the biomedical session developed @kidsdata (on Twitter), which provides insights into geospatial autism statistics and visualizes trends and other useful health-related information.

Another way to display geospatial data is Dynamic Choropleth Maps. Complex networks can also be explored with alluvial diagrams and other approaches. Many more visualization techniques and tools can be applied to health data - to look at it in new ways and gain useful insights.

Some of the questions from the audience were on the availability of data. Sources discussed included CDC (see, for example, NHANES laboratory files; eHealth metrics) and Entrez Life Sciences databases.

Signal Detection and Signal Processing for Mining Information was another discussion topic.
Questions focused on data mining versus simple tracking and signal monitoring. It was agreed that data mining is the key to health management. CardioNet, body sensors (see posts on teletracking, M-health, Telemedicine: part 1; Telemedicine: part 2; Health 2.0 Software tools, Devices to keep you healthy), SNP detection, telemedicine applications, random and rare electrocardiographic events, and other applications were also discussed.
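
A toy sketch of detecting rare events in a monitored signal, such as an aberrant sample in an ECG-like trace (an illustration only, not any product's algorithm): flag samples that deviate sharply from a trailing moving average.

```python
import statistics

def flag_anomalies(signal, window=5, k=3.0):
    """Flag samples that deviate from a trailing moving average by more than
    k trailing standard deviations - a crude rare-event detector."""
    flags = []
    for i in range(window, len(signal)):
        recent = signal[i - window:i]
        mean = statistics.mean(recent)
        sd = statistics.pstdev(recent) or 1e-9  # avoid zero division on flat runs
        if abs(signal[i] - mean) / sd > k:
            flags.append(i)
    return flags

ecg = [1.0] * 20
ecg[12] = 5.0  # a single aberrant sample
events = flag_anomalies(ecg)
```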

See other materials from Data mining Camp 2010:


Sunday, March 21, 2010

Mining Data Mining Camp Impressions

Data Mining Camp, organized by Patricia Hoffman and the San Francisco Bay Area Chapter of the ACM (Association for Computing Machinery), uses an Open Space Technology (OST) approach - no formal agenda beyond the overall data mining theme. Except for the expert panel, sessions are compiled on the fly, based on real-time interest and participation.

Overall, the unconference - with a new location and almost doubled attendance - was a success, even if judging only by the results of Twitter sentiment analysis tools (the subject of a not-so-successful Data Mining Camp topic): tweetfeel, twitrratr and twendz.
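
At their simplest, such sentiment tools boil down to keyword scoring along these lines (the word lists and function are invented for illustration and vastly cruder than the real tools):

```python
POSITIVE = {"great", "success", "love", "awesome", "useful"}
NEGATIVE = {"chaos", "fail", "noise", "boring", "butchered"}

def sentiment(tweet):
    """Label a tweet by counting positive vs. negative keywords."""
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```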

Obviously, completely ad-hoc sessions can be a bit chaotic - even though organizers briefly presented their topics and rooms were assigned after counting a show of hands, there were surprises and unmet expectations. Here are some sample quotes:
DataJunkie: OMFG Chaos trying to set up and plan which sessions to attend. Idea: have an online vote, and use sim annealing for scheduling. #DMCAMP

ihat: some sessions at #dmcamp have very low signal-to-noise... feature selection referenced tibshirani and boyd. and ppl butchered their work...

Many people preferred traditional formats to round-table discussions - tutorials were the most attended sessions, while discussions were either the most or the least liked. The arrangement of chairs and the look of the room preset participants' expectations - some organizers did not really plan to present but had to come up with slides or tutorials. A great observation by Dominique Levin:

Room shape impacts success of un-conference: Circle of chairs works wonders to solicit audience participation at #dmcamp. Circle time!

Another interesting observation was that LinkedIn turned out to be the most efficient marketing tool for the conference. Twitter and other social networks did not seem to have an impact. The explanation could be very simple, though: the age group and education level of the target audience.

The retweet winner was Chris Wensel (interpreted as an "influencer" by Twitter data mining tools) - his message "Facebook dropped Cassandra for inbox search and hired HBase person to switch" was retweeted 18 times.

See also:
and last year's notes:


Thursday, March 11, 2010

Bay Area Startups Looking for Cofounders: March 2010

Starting a company can be a difficult experience. Working alone is hard and a lot less fun than working with co-founders. And even though two is said to be the right number, some data point to an increase in success rates with up to 4-5 co-founders. The Co-Founders Wanted March meetup (organized by Alain Raynaud and Thad) featured startups dedicated to improving many aspects of our lives - from education to work productivity, transportation, energy and health.

Srini Reddy is developing an online service for high school students. His startup is looking for a web application developer, a senior architect, and a senior product manager to join the founding team.

David Pollak is working on a next-generation YouTube and looking for a 4th co-founder, a web application developer.

The audience gave their hearts out in applause for Beto Juarez's performance of Mariachi La Bamba. His vision is a Mariachi website similar to (see also Alain's post-presentation interview notes and the technical co-founder description). Beto can be reached as BetoIII at Gmail and Twitter.

Dave is working on an application that helps users discover iPhone apps. He is looking for one more tech co-founder to extend bootstrapping time. He can be reached as dli at apptizr. Twitter: @apptizr

@roxolar is generating leads for solar energy; the model is similar to LendingTree's.

Fred Gibson is looking for a business/marketing cofounder and a coder (Java, Lisp) - the startup is focusing on streamlining workflows based on Steve Blank's Customer Development process. Twitter: @gibsonf1

Navneet Dalal - with a background in machine vision and learning - has an idea for gesture detection using cameras.

Doug gave a brief presentation about his company, backed by investments from Google and Facebook - they won the Android Developer Challenge for building one of the top 10 mobile apps, and yet are looking for more co-founders.

Ivan presented an early-stage bootstrapped startup making carpooling a more accessible form of transportation. He is looking for a technical co-founder with 8-10 years of experience with web apps.

Tristan Kromer presented MarketResearch Wiki, an alternative to Gartner's way of delivering market research and technology insights. The startup needs one more developer/lean co-founder (in addition to 4 technical co-founders). Twitter: @startupSquare

Matt Howes is trying to eat healthier and wants others to do the same. His startup - incorporated in September 2009 - allows members to set goals and track their progress, tapping into the $11 billion self-health market. It integrates with Facebook and Twitter (@risetribe).

"What if you could have powerful solutions at your fingertips that help you manage your health the way personal finance software manages your money?" asked Irene Gabashvili, who presented her vision for health management systems, including a product in development that focuses on relief for those suffering from digestive problems. Twitter: @Aurametrix

For more startups see:
January minutes
November minutes


Saturday, March 6, 2010

Test your genes and lose the weight?

Do Your Genes Determine Which Diet Means Weight-Loss or General Health Success?
Can nutrition be optimized with respect to your genome?

Commercial genetic tests promising personalized diet recommendations (there were 6 in 2005, not counting companies developing DNA nutri-chips to pinpoint an animal's nutrition needs) are constantly offered and withdrawn. Interest in genetic testing has been declining over the past few years (click the figure on the right to see Google Trends), but continuously published findings cause bursts of interest (click the figure on the right to see xRanks by Bing).
Clearly, genetic variations determine how we absorb, filter, metabolize, store and eliminate nutrients and how these nutrients affect processes in our body. But should we get more vitamin B if our MTR gene carries predisposition to heart disease or have extra calcium and vitamin D if our VDR gene indicates possibly weak bones?
The use of this information may be premature, and the gene-vitamin linkages may not be credible - they are far inferior to your family health portrait, and carry a much higher cost-to-benefit ratio than self-evaluation and self-awareness through personal health management tools.

"It Isn't Just What You Eat That Can Kill You, and It Isn't Just Your DNA That Can Save You - It's How They Interact," said a 2005 Newsweek article. It is also other environmental factors, your history of interactions, and the microbes you have cultivated in your system that determine what food is best for you.

Lactose intolerance, for example, may be derived from genetic tests showing how effective our bodies are at producing the lactase enzyme (the LCT gene). We can determine it much more cheaply and reliably, however, by drinking milk and watching for symptoms. The same goes for FMO3 variations and trimethylaminuria. A study of lung cancer rates in China found that the people at lowest risk were genetically deficient in an enzyme that metabolizes isothiocyanates in cruciferous vegetables, but crucifers don't seem to harm those with a functional enzyme either... except those who get intestinal gas from eating them. And that may not be because of the genes, but due to a lack of beneficial gut bacteria.

The weight-loss industry has long admitted that one specific diet isn't likely to work for everyone. How do diet programs work? Forget the ratios of carbs, fiber and protein; the most important factor is burning off more calories than you take in. Behavioral factors, rather than macronutrient metabolism, are the main influences on weight loss, so it's more about mindful eating versus mindless eating. And here genes come into play again. Satiety mechanisms are complex, involving multiple sensors that operate differently depending on the underlying genetics. That's why a carb-heavy breakfast can leave you hungry in two hours but keep your neighbor full until noon, while a high-protein diet makes you feel fuller.

This week (at the American Heart Association's Joint 50th Cardiovascular Disease Epidemiology and Prevention - and - Nutrition, Physical Activity and Metabolism conference, held March 2-5 in San Francisco, CA), Stanford University researchers reported that a genetic test can help people choose which diet works best for them. The study involved 133 overweight women, who lost more weight on a diet that matched their genes - either low-carbohydrate or low-fat. Improvements in clinical measures related to weight loss (e.g., blood triglyceride levels) paralleled the weight-loss differences.

The findings also build in part on an earlier paper - the A to Z weight-loss study, published in the Journal of the American Medical Association in 2007 - and a small (100-person) unpublished study by Interleukin Genetics.

More than 6,000 genes may be affecting our weight, including multiple genes determining our taste preferences (e.g., to bitter foods, calcium, garlic or coffee) and behavioral predispositions. Genetic testing along with protein- and metabolomic-based diagnostics would provide invaluable data for understanding ourselves. Data-mining tools for health-related factors will allow integrated and holistic analysis and help to personalize diets, fitness and medical treatments.


Tuesday, March 2, 2010

Organizing the World's information

"My guess is (it will be) about 300 years until computers are as good as, say, your local reference library in doing search," says Google's first employee and director of technology Craig Silverstein. "But we can make slow and steady progress, and maybe one day we'll get there." (Inside the Wide World of Google, CBS News, March 28, 2004). According to CEO Eric Schmidt, people care a lot about information, and the possibilities of the unfolding revolution in technology are greater than many of them even realize: "Imagine the scale of the kinds of questions you could ask that you could not ask before."

The figure is a great compilation of key Google facts by Pingdom. Among the key technical facts not shown there are far-reaching inventions such as the Programmable Search Engine (PSE; see the patent application), based on the BigTable database (see the research publication about it in PDF format), and other systems that could help Google become more semantic.

The distributed storage system for managing structured data called Bigtable resembles a database, sharing implementation strategies with parallel and main-memory databases. Instead of a full relational data model, it uses a simple data model in which data is indexed by row and column names that can be arbitrary strings. A Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes, although clients often serialize various forms of structured and semi-structured data into these strings, controlling layout through careful choices in their schemas.
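
A toy in-memory sketch of that data model (the webtable row and anchor column echo the examples in the Bigtable paper; the class itself is, of course, nothing like the real distributed system):

```python
import time

class TinyTable:
    """A toy in-memory sketch of the Bigtable data model: a map from
    (row key, column key, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, value, ts=None):
        self._cells[(row, column, ts if ts is not None else time.time())] = value

    def get(self, row, column):
        """Return the most recent value stored for (row, column), or None."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = TinyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN", ts=1)
t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN Sports", ts=2)
```

Timestamping each cell is what lets clients keep multiple versions of a value and read the latest one, as the description above notes.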

PSE and other integration technologies may be providing a higher level of semantic analysis.
These techniques could figure out the meaning of content and “fill in the blanks” when an item of information is ambiguous or missing. The idea is to enrich an information object with additional tags so that queries about lineage (where something came from) and likelihood of accuracy (the “correctness” of an information element) can be used to generate a result.

Another new concept is a probabilistic mediated schema automatically created from the data sources. Semantic mappings between the schemas of the data sources are mediated by schemas with probabilities attached to each, modeling uncertainty at the core of the system. A deterministic mediated schema, created from the probabilistic ones, is exposed to the user, who can use its terminology to interact with the system.
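
A minimal sketch of that last step - collapsing probabilistic attribute mappings into a deterministic mediated schema by keeping the most probable candidate for each source attribute (the attribute names and probabilities here are invented):

```python
def deterministic_schema(prob_mappings):
    """Collapse probabilistic source-to-mediated attribute mappings into a
    deterministic schema by keeping the most probable candidate for each."""
    return {attr: max(candidates, key=candidates.get)
            for attr, candidates in prob_mappings.items()}

# invented source attributes with probabilistic mediated-schema candidates
mappings = {
    "phone_no": {"phone": 0.9, "fax": 0.1},
    "addr": {"address": 0.7, "email": 0.3},
}
schema = deterministic_schema(mappings)  # {'phone_no': 'phone', 'addr': 'address'}
```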

The Semantic Web is emerging to help us get the most out of the world's information. Many interesting applications are already here. Some have already been acquired by major search players - Bing, for example, is based on semantic technology from Powerset, which Microsoft purchased in 2008. This blog article covers only one of the players organizing the world's information. Stay tuned for more.
