Data Science Wanderlust: Analyzing Global Health with Protein Sequences
This data scientist takes us on a trip around the world, and shows how analytics of protein sequences can help us understand healthcare and economic needs more fully.
Fifteen years ago, I had the unique opportunity to go on Semester at Sea, an around-the-world trip on a converted cruise ship that combined college coursework with stops in nine countries on four continents. This once-in-a-lifetime trip instilled in me a strong sense of wanderlust and a deep desire to give back to the global community.
Every journey begins with a single step
Fast-forward to a few months ago, when I joined Exaptive on an exciting new project. A large NGO enlisted us to analyze a massive set of historical data for countries. The goal: to develop a better, more granular means of grouping countries than the outdated and crude approach of "developed" and "developing." This large, complex, messy dataset and thorny problem were a great fit for my background in artificial intelligence and data science.
My first task was to organize and clean up the raw metrics. Anyone who has done this before knows that data, like war or politics, always look a lot tidier when viewed from 30,000 feet than when you are in the trenches. I never blame the organization; this is simply reality. Countries have different names and abbreviations. They can go through peaceful mergers or bloody coups. Even something simple, like whether the data was collected by a US or EU agency, can wreak havoc on data processing.
I developed a series of Python scripts that collapsed metrics into their correct countries, handled missing or improperly formatted data, and "pivoted" the metrics, which originated as one file per metric, so that the resulting dataset was organized by country with its constituent metrics. The result was a JSON file (quickly replacing XML as the gold standard for data exchange) that was ready for further processing.
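As a rough illustration of that pivot (not the project's actual scripts), here is a minimal sketch. It assumes each metric arrives as CSV text with hypothetical `country`, `year`, and `value` columns, and it uses a made-up alias table to collapse name variants. Cells that cannot be read as numbers are skipped.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical alias table: maps name variants to one canonical country name.
COUNTRY_ALIASES = {
    "USA": "United States",
    "U.S.": "United States",
    "Viet Nam": "Vietnam",
}


def canonical_country(name):
    """Return the canonical name for a country, collapsing known variants."""
    name = name.strip()
    return COUNTRY_ALIASES.get(name, name)


def parse_value(text):
    """Parse a numeric cell; return None for missing or malformed values."""
    try:
        return float(text.replace(",", ""))
    except (AttributeError, ValueError):
        return None


def pivot_metrics(metric_files):
    """Turn {metric: csv_text} into {country: {metric: {year: value}}}."""
    by_country = defaultdict(lambda: defaultdict(dict))
    for metric, csv_text in metric_files.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            value = parse_value(row.get("value"))
            if value is None:
                continue  # skip cells that cannot be read as numbers
            country = canonical_country(row["country"])
            by_country[country][metric][int(row["year"])] = value
    return by_country


# Invented sample data: one CSV per metric, with inconsistent country names.
metric_files = {
    "literacy": "country,year,value\nUSA,2000,99.0\nViet Nam,2000,90.2\n",
    "gdp": "country,year,value\nU.S.,2000,\"10,250\"\nVietnam,2000,n/a\n",
}
pivoted = pivot_metrics(metric_files)
print(json.dumps(pivoted, indent=2, sort_keys=True))
```

In this sample, "USA" and "U.S." end up under one country, and the unreadable "n/a" GDP cell for Vietnam is dropped rather than guessed.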
Searching in the wilderness
What happened next can best be described as a digital version of wanderlust. It seems the itch from my world travels also benefits my role as a data scientist. After preparing the data, I began exploring ways to compare and group time series data like this. I started my search in the audio space, looking at algorithms that analyze and classify music. This didn't seem to be a great fit, since audio signals tend to be much messier and periodic in nature. Audio processing algorithms likely would have been overkill.
I then stumbled upon an interesting algorithm, Symbolic Aggregate approXimation (SAX). (There is a later version of the algorithm called HOT SAX. Who says scientists can't have a sense of humor?) SAX converts time series into a sequence of letters, opening the door to standard data mining techniques that work with strings. Because I had worked with biological data and text earlier in my career, this struck a chord with me. I discussed it with the team, and we decided to pursue this avenue.
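To make the idea concrete, here is a minimal sketch of the usual SAX steps using only the Python standard library: z-normalize the series, average it over equal-width segments (Piecewise Aggregate Approximation), then map each average to a letter using cut points that split a standard normal distribution into equally likely regions. This is an illustration, not the jMotif implementation; the function names, the four-letter default alphabet, and the use of the population standard deviation are assumptions made for the sketch.

```python
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # a flat series carries no shape
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: average over equal-width segments.

    Assumes segments <= len(series) so that no segment is empty.
    """
    n = len(series)
    out = []
    for i in range(segments):
        start, end = i * n // segments, (i + 1) * n // segments
        out.append(mean(series[start:end]))
    return out


def breakpoints(alphabet_size):
    """Cut points that split a standard normal into equiprobable regions."""
    normal = NormalDist()
    return [normal.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of `segments` letters."""
    cuts = breakpoints(len(alphabet))
    word = []
    for value in paa(znormalize(series), segments):
        index = sum(value > cut for cut in cuts)  # number of cuts below value
        word.append(alphabet[index])
    return "".join(word)


rising = sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4)
falling = sax([8, 7, 6, 5, 4, 3, 2, 1], segments=4)
print(rising, falling)  # abcd dcba
```

A steadily rising series and a steadily falling one produce mirror-image words, which is what lets string techniques compare the shapes of two series.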
In the spirit of rapid iteration, I quickly prototyped a web API for ingesting input and returning JSON results, integrated an open source library called jMotif that implements the SAX algorithm, and stood up an instance of the API on Amazon Web Services. This took about ten hours of work over the course of a week, which is both an impressive feat and a testament to how technology has advanced.
An unexpected turn
When we first reviewed the country metrics as SAX strings, we noticed the strings looked a lot like DNA sequences. Could a clustering algorithm be used on these strings, much the same way geneticists analyze and compare strands of DNA? We began discussing possible clustering algorithms and how to calculate the similarity of two countries.
To prove the concept, I then augmented the web API to use a basic clustering algorithm (k-means clustering) and string distance function (Levenshtein) to cluster the data. I then extended the concept by clustering individual metrics to create a vector “fingerprint” of each country. These could then be used to further cluster and analyze the countries.
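The sketch below shows one way to cluster strings by edit distance. Note the substitution: it uses k-medoids rather than k-means, because k-means averages vectors and an average is not defined for strings under Levenshtein distance, while k-medoids only needs pairwise distances. The article does not say how the two were combined in the actual API, so the function names, the naive seeding, and the sample words are all assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def k_medoids(items, k, distance, max_iter=20):
    """Cluster items around k medoids; returns (medoids, labels)."""
    medoids = list(items[:k])  # naive deterministic seeding for the sketch
    labels = []
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = [min(range(k), key=lambda c: distance(item, medoids[c]))
                  for item in items]
        # Update step: pick the member with the lowest total distance.
        new_medoids = []
        for c in range(k):
            members = [it for it, lab in zip(items, labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep an empty cluster's medoid
                continue
            new_medoids.append(min(
                members, key=lambda m: sum(distance(m, o) for o in members)))
        if new_medoids == medoids:
            break  # converged: medoids no longer change
        medoids = new_medoids
    return medoids, labels


# Invented SAX-like words: three "low" shapes and three "high" shapes.
words = ["aabb", "aabc", "abbb", "ddcc", "ddcd", "dccc"]
medoids, labels = k_medoids(words, k=2, distance=levenshtein)
print(medoids, labels)
```

Running the same clustering once per metric and collecting each country's cluster label per metric would give a vector along the lines of the fingerprint described above.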
The results were astounding. After just one more week, I had developed this, integrated it into the Exaptive platform, and we had a working application that took in country data and rendered clusters on a map.
We immediately saw some interesting patterns, even with these preliminary results. (How cool is that?)
A serendipitous meeting of two strangers
Since we were thinking of the SAX strings like DNA, we asked a distinguished cell biologist to join the conversation. He noted that the strings looked more like protein sequences than DNA. I adapted the algorithms to use a 20-character alphabet like amino acids and added magnitude from our time series data to mimic gene expression (the amount of protein produced). We saw immediate improvement when adding these elements.
He then suggested we try the gold standard for protein alignment, the Clustal Omega algorithm. Reflect on that for a moment. We started with geopolitical time series data that felt a lot like audio signals or financial data, and we ended up talking to a scientist about protein sequence clustering. That is cognitive exaptation and digital wanderlust at its finest.
I updated the web-based API once more to convert the SAX strings to protein sequences and integrated the Clustal Omega algorithm. I also added statistics to the output, so we could assess the quality of clusters returned for different algorithms and other parameters.
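One plausible form of that conversion is sketched below, assuming the SAX words use a 20-letter alphabet and each letter is mapped one-to-one onto the 20 standard amino acid codes. The output is FASTA, a plain-text sequence format that alignment tools such as Clustal Omega accept. The helper names, the particular letter mapping, and the sample countries are hypothetical; the article does not describe the actual encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def to_protein(sax_word, sax_alphabet):
    """Re-encode a SAX word symbol by symbol using amino acid letters."""
    if len(sax_alphabet) > len(AMINO_ACIDS):
        raise ValueError("SAX alphabet has more than 20 symbols")
    table = str.maketrans(sax_alphabet, AMINO_ACIDS[:len(sax_alphabet)])
    return sax_word.translate(table)


def to_fasta(sequences, width=60):
    """Format {identifier: sequence} as FASTA text, wrapping long lines."""
    lines = []
    for name, seq in sequences.items():
        lines.append(">" + name.replace(" ", "_"))  # header line, no spaces
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Assumed 20-letter SAX alphabet and invented SAX words for two countries.
SAX_ALPHABET = "abcdefghijklmnopqrst"
sax_words = {"Country A": "abt", "Country B": "tsa"}
proteins = {name: to_protein(word, SAX_ALPHABET)
            for name, word in sax_words.items()}
fasta_text = to_fasta(proteins)
print(fasta_text)
```

The mapping here is purely positional (first SAX letter to "A", last to "Y"); a real encoding might choose letters so that the aligner's substitution scores reflect how close two SAX levels are.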
Stumbling upon an unexpected find
When we completed the analysis and studied the data, we found that the protein alignment performed a little worse than our earlier approach. Alignment lines up similar stretches regardless of where they fall in a sequence, so this result suggests that when events occurred plays an important role in grouping countries. We also found the ideal cluster size to proceed with further analysis.
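The article does not say which statistics were used to compare clusterings. One common choice is the mean silhouette coefficient, which rewards clusters that are tight and well separated and can be computed from pairwise distances alone. The sketch below is an assumption-laden illustration with invented data, not the project's method.

```python
def silhouette(distances, labels):
    """Mean silhouette coefficient from a square distance matrix."""
    n = len(labels)
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i in range(n):
        own = [distances[i][j] for j in range(n)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(own) / len(own)  # mean distance within the item's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(distances[i][j] for j in range(n) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / n


# Invented example: four points on a line at positions 0, 1, 10, and 11.
positions = [0, 1, 10, 11]
distances = [[abs(p - q) for q in positions] for p in positions]
good = silhouette(distances, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette(distances, [0, 1, 0, 1])   # neighbours split apart
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters and negative scores indicate items that sit closer to another cluster. Computing the score for several cluster counts is one way to compare candidate groupings.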
I wanted to see if we were just picking up spurious correlations that had no real meaning or if we were on to something. I chose one of the metrics, malnutrition, to see if countries in a specific cluster showed similar levels and trends for malnutrition. The findings were astonishing. Separate clusters accurately captured those countries that had low and steady levels, moderate but significantly improved levels, and those where malnutrition is a real problem. I was even able to correlate specific policy decisions like education, healthcare and infrastructure to malnutrition trends.
A place to rest for the night
We presented our initial findings to the NGO, and they were visibly excited. We had some initial data to support their hypothesis that countries should be grouped not by geography or economic status, but by a deep, nuanced understanding of key metrics like literacy, GDP, CO2 emissions, infant mortality rates, and hundreds of other data points. And we had delivered results in an insanely short time frame. We were excited too. We were able to make an immediate positive impact by giving back to the global community. And we now have the tools to analyze all sorts of other time series data in a novel way. Who else can we help, and what other unintended exaptations await us? This is not the end, but a temporary place to rest before wandering again.
Coming full circle
In a very short time, we traveled (virtually, at least) across two continents and six cities in search of answers, engaged a team of four people in multiple time zones, and built a working solution that, we hope, will make an impact on how we talk about countries and tackle health and poverty issues.
It seems my wanderlust is alive and well after all. Great things can happen when you drop your preconceived notions and let the wind and road take you where they may.
A photo I took of school children from Khayelitsha Hopolang Primary School in South Africa
Published at DZone with permission of Matt Coatney, DZone MVB. See the original article here.