By Brenda Cass, Data Science Strategist, Validic™

Our data has a story to tell us, but only if we give up our assumptions and biases and learn how to listen. Nowhere is this more true than in healthcare, where patient-generated health care data (PGHD), electronic medical records, and insurance claims data combine to tell us the complex tale of a person’s life. Unsupervised learning is how we let machines tell us these stories. The technique finds patterns in data without human guidance so the insights it surfaces can come to us without bias. Clustering is an excellent first approach to unsupervised learning that is deceptively simple in its ability to show us new ways to look at our data.

As the word suggests, clustering algorithms group our data mathematically to show us their hidden relationships. If you imagine a two-dimensional graph with many points of data scattered across it, a clustering algorithm attempts to find a number of center points which are the shortest distance from as many points as possible and then assigns those points to a corresponding cluster. This means that, mathematically speaking, the points in each cluster are like each other, but also meaningfully different from points in another cluster. Unlike this simple example seen in the graph, however, clustering algorithms are able to do this in far more than two dimensions – well beyond what a human could ever visualize.

So why do we care? Well, in healthcare, these points are people, and the dimensions in the graph refer to all of the health risk surveys, PGHD, lab results, and other health data we have recorded for them. After letting a clustering algorithm run on this data, the machine points out people that are similar to each other in health-relevant ways. These people have similar outcomes, physical characteristics, and behaviors that we can use to target our interactions with them. When we recommend behaviors to them we can take a look at others in their cluster to see what was effective for people like them, personalizing the recommendation rather than suggesting a generic best practice.

Let’s take a look at a clustering experiment that Validic performed to explore these ideas in practice. In this experiment we clustered people based on over 90 different numerical dimensions, from the number of wellness program challenges the person participated in, to the average time spent in bed. Our final trial produced eight distinct groups. Each of these groups had a high degree of correlation with each other and the groups were also well-separated, so we are confident that people within a group resemble each other while also being different from people in different groups.

With our clusters generated, we were able to characterize these groups. Keep in mind that prior to this experiment we did not know these groups within our population existed; the machine is teaching us about our data. Cluster one, for example, has low HDL cholesterol, regularly measures their activity and sleep, spends a lot of time in bed, and has proportionally high weight for our whole population. These are people that have demonstrated engagement, are regularly submitting PGHD and regularly respond to health risk surveys, but have relatively low wellness challenge participation. If we assume that increasing challenge participation is an important and desirable goal, then we have an obvious first use for the insights gained from our clustering experiment.

We can target this group with messaging that encourages more challenge participation, and we can do it in a more nuanced way than simply filtering our users on one dimension for low participation would give us. If we applied this more rudimentary approach to messaging our users then we would inadvertently catch members of group two, which also have low challenge participation rates. Notably, however, group two has high HDL cholesterol and low proportional weight for the population – the opposite of the characteristics in group one! These people are going to have different needs and require different sorts of encouragement to boost challenge participation.

In a previous blog post I mentioned the additive nature of machine learning models, and clustering is no exception. Unsupervised learning models in particular become stepping stones that build up to more advanced machine learning models. Deeper analysis of our clustering results can be turned into a behavior recommendation system by studying the outcomes of our population. We can use the specific and measurable differences between members of a cluster to suggest individual behaviors that have worked for other people like them.

Stay tuned for my next blog post where I will be talking about the very exciting world of predictive classifiers!

View More >