The availability of genetic and genomic information has exploded in the last decade as the cost of sequencing has fallen; however, much of this information is scattered across many different resources. For example, different resources on the same gene often use different identifiers, formats, and information. This fragmented data landscape makes creating and maintaining bioinformatics pipelines challenging, frustrating, and time-consuming.
As part of Dr. Andrew Su’s (Associate Professor) computational biology research group at the Scripps Research Institute, our team is interested in solving big data challenges like the fragmented gene and variant data landscape described above. Dr. Chunlei Wu (Associate Professor) spearheaded the effort to create easy-to-use gene and genetic variant annotation services so that researchers can spend more time making new discoveries and less time dealing with fragmented data.
Building the Solution
MyGene.info was the first of the two annotation services we built. In building our services, we knew there were several issues we needed to consider:
- We would be aggregating data on 13 million genes from 7 databases.
- Both the amount of data from each data source and the number of data sources were expected to keep growing, so our service had to scale accordingly.
- Users would need to be able to find the information they needed quickly, with flexible ways of finding it, without perceptible drops in performance as the amount of data grows.
Given these constraints, we employed Elasticsearch in our indexing engine. Our previous experience with CouchDB for a different resource enabled us to transition smoothly to Elasticsearch, and we were early adopters of it (circa v0.5.x). Even at the earlier stages of development, Elasticsearch was a valuable tool in our arsenal, and we had no doubt it would suit our needs.
Building on our success in making MyGene.info a highly scalable service, we went on to build MyVariant.info to address the even more fragmented data landscape of genetic variant information. MyVariant.info currently has more than 334 million unique gene variants from over 14 databases.
Users are able to search for one or thousands of gene- or variant-specific JSON objects using flexible query terms and return just the information of interest to them. If they are only interested in variant annotations from dbSNP or gene annotations from worms, they can specify those filters in their search. Most importantly, users get their results quickly. MyGene.info handles traffic from more than 5,000 concurrent users at approximately 10,000 requests per minute, and over 95% of actual user requests take less than 30 ms to process. It receives requests from over 4,000 unique IP addresses each month.
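To make the query pattern above concrete, here is a minimal sketch of how such a request might be composed from Python. The base URLs, API version segments, and the `q`, `fields`, `species`, and `size` parameter names are assumptions drawn from our understanding of the public query endpoints rather than anything specified in this post, and `build_query_url` is a hypothetical helper. It only builds the URL and sends nothing over the network.

```python
from urllib.parse import urlencode

# Assumed base URLs and API versions; they are not stated in this post.
MYGENE_QUERY_URL = "https://mygene.info/v3/query"
MYVARIANT_QUERY_URL = "https://myvariant.info/v1/query"


def build_query_url(base_url, query, fields=None, species=None, size=None):
    """Compose a GET URL for a query endpoint (hypothetical helper).

    Parameter names (q, fields, species, size) are assumptions about
    the public API; no request is sent here.
    """
    params = {"q": query}
    if fields:
        # Ask only for the annotation fields of interest.
        params["fields"] = ",".join(fields)
    if species:
        params["species"] = species
    if size is not None:
        params["size"] = size
    return base_url + "?" + urlencode(params)


# Gene annotations restricted to worms: 6239 is the NCBI taxonomy ID
# for C. elegans.
gene_url = build_query_url(
    MYGENE_QUERY_URL, "kinase", fields=["symbol", "name"], species="6239"
)

# Variants that carry a dbSNP annotation, returning only the dbSNP block.
# The "_exists_:" query-string syntax is assumed to be supported.
variant_url = build_query_url(
    MYVARIANT_QUERY_URL, "_exists_:dbsnp", fields=["dbsnp"], size=5
)

print(gene_url)
print(variant_url)
```

The resulting URLs can be passed to any HTTP client; the response is expected to be JSON containing only the requested fields.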
Tracking Our Success
We already had BioGPS.org, a well-used, user-friendly resource that originally relied on CouchDB (v1). As we migrated that service to use MyGene.info, we wanted a way to distinguish the MyGene.info traffic coming from BioGPS.org from the traffic coming from our various clients (Python, R, etc.). We used Kibana to visualize the different sources and volumes of traffic for MyGene.info and MyVariant.info. Both MyGene.info and MyVariant.info consist of two endpoints each, and Kibana gave us an easy way to inspect the usage of each service endpoint.
Scaling Towards Other BioThings
MyGene.info currently has 10 shards spread across two web nodes, three master nodes, and three data nodes. Scaling up from 13 million genes to cover 334 million variants, MyVariant.info is made up of 20 shards spread across three web nodes, three master nodes, and five data nodes. We use load balancers to distribute the queries coming into our web nodes to ensure fast and stable processing. Given the lessons on scaling that we learned when developing MyVariant.info after MyGene.info, we expect to be able to readily extend coverage to other research areas with similarly fragmented data. Gene annotation and variant annotation data are only two examples of “BioThings” with fragmented data sources, and we hope to expand our service to be of greater use to the research community.
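As a rough illustration of the layouts described above, the sketch below expresses the shard counts as an Elasticsearch index-creation body and estimates how many shards each data node would hold if they were balanced evenly. It assumes the shard counts quoted above are primary shards; the replica count is not stated here, so it is left out. The cluster labels and both helper functions are hypothetical, and only `number_of_shards` is an actual Elasticsearch index setting.

```python
import math

# Shard and data-node counts come from the text above; the dictionary
# keys are illustrative labels, not real index names.
CLUSTERS = {
    "mygene": {"primary_shards": 10, "data_nodes": 3},
    "myvariant": {"primary_shards": 20, "data_nodes": 5},
}


def index_settings(primary_shards):
    """Index-creation body setting only the primary shard count.

    The replica count is not given in the post, so it is omitted and
    the cluster default would apply.
    """
    return {"settings": {"index": {"number_of_shards": primary_shards}}}


def shards_per_node(primary_shards, data_nodes):
    """Return (min, max) primary shards per data node when evenly balanced."""
    return primary_shards // data_nodes, math.ceil(primary_shards / data_nodes)


for name, layout in CLUSTERS.items():
    low, high = shards_per_node(layout["primary_shards"], layout["data_nodes"])
    print(name, index_settings(layout["primary_shards"]), (low, high))
```

Under these assumptions, the MyGene.info data nodes would hold three or four primary shards each, and the MyVariant.info data nodes four each.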
Our blog URL is: https://www.elastic.co/blog