Visual Proteogenomics Analysis on the NetBeans Platform
My name is Jeff Jensen and I am a Research Scientist at the Pacific Northwest National Laboratory (PNNL) operated by Battelle Memorial Institute for the U.S. Department of Energy. I have a B.S. in computer science from Washington State University and am the primary software developer for the VESPA (Visual Exploration and Statistics for Proteomics Analyses) client interface.
VESPA in a Nutshell
VESPA is a visual analytics platform for exploring proteogenomics data. VESPA focuses on the integration of peptide-centric proteomics data with other high-throughput, qualitative and quantitative data, such as data from ChIP-seq analyses.
At its core, VESPA integrates bottom-up proteomics data with genome-level information, i.e., it maps peptides to their respective genome locations. The visualization allows the user to observe the location and sequence of peptides that do not match current annotations, and it offers valuable filtering criteria such as the removal of ambiguous peptides.
This capability is a necessity in proteogenomics, where scientists are either correcting mis-annotations or identifying new genes. In addition, the integration capabilities of VESPA support user-driven layering or filtering of information in the visualization, for example by adding peptide confidence scores, cleavage state (fully or partially tryptic), or the number of identifications.
History of VESPA
The concept that currently drives VESPA began as an internal project on single nucleotide polymorphisms in proteomics research, funded in 2005 for two years at one FTE. In 2007, a grant proposal was submitted to the National Institutes of Health (NIH) but was not initially funded.
In 2009, the proposal was revisited by NIH, and PNNL received a two-year grant (1R011GM-084892) using American Recovery and Reinvestment Act of 2009 (ARRA) funding to develop VESPA.
Technical Challenge 1: Visualizing Massive Amounts of Data
Coming up with ways to visualize massive amounts of data has always been challenging. How do we summarize large quantities of information and show it to users in a meaningful way and without overloading their senses? Some of the genome data we are visualizing at this time in VESPA has DNA strands approaching 5 million base pairs in length. The circular visualization (Genome window) is meant to show the user an overview of the entire genome currently loaded in VESPA.
I decided to render protein density over fixed, equal intervals of the DNA in one-degree segments of a circle that has a 10-degree gap at the 12 o’clock position. The display resolution is shown in base pairs per degree, and each segment is drawn as a Java 2D arc one degree in length with variable width and transparency. In this style, the user can very quickly see where protein density is high (the ring is opaque and wide) or low (the ring is transparent and narrow) on the genome.
This principle was extended to also show the density of peptides or orphan peptides within the DNA. Other windows in VESPA show smaller segments of DNA with much higher levels of detail, allowing users to focus on only the pieces of DNA of greatest interest to them. The higher-level visualizations allow quick navigation and show large anomalies quite readily. There are also tabular data views that show the polymer data contained in the lower-level detailed visualizations, as well as search options. This combination of visualizations arranged in a tiered hierarchy, supporting data views, and search options allows the user to see an overview, quickly navigate, and find anomalies in the genome they are studying.
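To make the density ring described earlier in this section more concrete, here is a minimal Java 2D sketch. It is not VESPA's actual code: the class name DensityRingSketch, the linear mapping from density to stroke width and alpha, and the constants for minimum and maximum width are all assumptions chosen for illustration. Only the 350 one-degree arcs, the 10-degree gap at 12 o'clock, and the "base pairs per degree" resolution come from the description above.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Arc2D;
import java.awt.image.BufferedImage;

/** Hypothetical sketch of a genome density ring; names are illustrative only. */
final class DensityRingSketch {
    static final int GAP_DEGREES = 10;              // gap centered at 12 o'clock
    static final int SEGMENTS = 360 - GAP_DEGREES;  // 350 one-degree arcs
    static final float MIN_WIDTH = 2f, MAX_WIDTH = 20f; // assumed stroke range
    static final float MIN_ALPHA = 0.1f;                // assumed faintest arc

    /** Display resolution: base pairs covered by each one-degree arc. */
    static double basePairsPerDegree(long genomeLength) {
        return (double) genomeLength / SEGMENTS;
    }

    /** Counts features (by 0-based start position) in each one-degree bin. */
    static int[] binCounts(long[] featureStarts, long genomeLength) {
        int[] counts = new int[SEGMENTS];
        double bpPerDegree = basePairsPerDegree(genomeLength);
        for (long start : featureStarts) {
            if (start < 0 || start >= genomeLength) {
                continue; // ignore positions outside the genome
            }
            int bin = (int) (start / bpPerDegree);
            counts[Math.min(bin, SEGMENTS - 1)]++; // guard against rounding
        }
        return counts;
    }

    /**
     * Java 2D measures angles counter-clockwise from 3 o'clock, so 12 o'clock
     * is 90 degrees. Bin 0 starts just right of the gap and bins run clockwise.
     */
    static double startAngle(int bin) {
        return 90.0 - GAP_DEGREES / 2.0 - bin;
    }

    /** Draws one arc per bin; denser bins are wider and more opaque. */
    static void paint(Graphics2D g, int[] counts, double cx, double cy, double radius) {
        int max = 1;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                RenderingHints.VALUE_ANTIALIAS_ON);
        for (int bin = 0; bin < counts.length; bin++) {
            float density = (float) counts[bin] / max;  // 0.0 to 1.0
            float width = MIN_WIDTH + density * (MAX_WIDTH - MIN_WIDTH);
            float alpha = MIN_ALPHA + density * (1f - MIN_ALPHA);
            g.setStroke(new BasicStroke(width, BasicStroke.CAP_BUTT,
                    BasicStroke.JOIN_MITER));
            g.setColor(new Color(0f, 0.4f, 0.8f, alpha));
            // A negative extent sweeps the arc clockwise by one degree.
            g.draw(new Arc2D.Double(cx - radius, cy - radius,
                    2 * radius, 2 * radius, startAngle(bin), -1.0, Arc2D.OPEN));
        }
    }

    public static void main(String[] args) {
        long genomeLength = 5_000_000L; // roughly the size mentioned above
        long[] starts = {1_000L, 1_500L, 2_000L, 2_500_000L, 4_999_999L};
        BufferedImage image = new BufferedImage(400, 400, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = image.createGraphics();
        paint(g, binCounts(starts, genomeLength), 200, 200, 150);
        g.dispose();
        System.out.printf("%.0f base pairs per degree%n",
                basePairsPerDegree(genomeLength));
    }
}
```

In a Swing component, the same paint method could be called from paintComponent with the component's Graphics2D. The sketch normalizes against the densest bin; a real implementation might prefer a fixed or logarithmic scale so that one outlier does not flatten the rest of the ring.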
Technical Challenge 2: Dynamically Registering Visualizations
Another significant challenge was dynamically registering visualizations contained in various modules. A tiered hierarchy scheme was developed in which a visualization will register itself after its module starts up.
It is simple to keep a number of visualizations focused or centered on the same base pair of a sequence, but the complexity increases when each visualization’s selected or visible interval impacts visualizations on the same tier, one above it, or one below it.
For example, the selection in the Genome window (red box in the upper left corner) is equivalent to the visible interval in the Intermediate window (blue ribbon top center) and the selected interval in the Intermediate window (green box in the center) is equivalent to the visible interval in the Reading Frame window. That is because the tier for the Genome is 1, Intermediate is 5, and Reading Frame is 8. The Reading Frame Data window displays all the polymer data shown in the Reading Frame window and also is a tier 8 visualization.
The user can change not only the location of any visualization’s focus, but also the size (width in base pairs) of the selected interval, which can impact other visualizations on the same and surrounding tiers. Any visualization can register, dynamically or at startup, as a tier N visualization. Changes to the selected interval at the next highest tier (numerically lower) then affect the visible intervals of all tier N visualizations, and changes to a tier N visualization’s selected interval affect the visible intervals of all visualizations at the next lowest tier (numerically higher).
Tier assignments are arbitrary positive integers and have gaps between them so that other visualizations may later be inserted into the hierarchy. Of course, this all happens without any visualization’s knowledge of another.
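The tier scheme described in this section can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not VESPA's implementation: the names TierRegistrySketch, TieredView, and Interval are hypothetical, and the sketch covers only the downward direction (a selection at one tier becoming the visible interval of the next numerically higher occupied tier). Same-tier synchronization and the NetBeans Platform wiring that registers views at module startup are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch of a tier registry; names are illustrative only. */
final class TierRegistrySketch {

    /** An interval of base pairs, from start (inclusive) to end (exclusive). */
    static final class Interval {
        final long start;
        final long end;

        Interval(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Callback a visualization implements; it never sees other views. */
    interface TieredView {
        void visibleIntervalChanged(Interval visible);
    }

    // Sorted by tier number so the neighbouring occupied tier is easy to find.
    private final NavigableMap<Integer, List<TieredView>> tiers = new TreeMap<>();

    /** Registers a view at a tier; gaps between tier numbers are allowed. */
    void register(int tier, TieredView view) {
        if (tier <= 0) {
            throw new IllegalArgumentException("tier must be a positive integer");
        }
        tiers.computeIfAbsent(tier, t -> new ArrayList<>()).add(view);
    }

    /** Removes a view and drops the tier entirely once it is empty. */
    void unregister(int tier, TieredView view) {
        List<TieredView> views = tiers.get(tier);
        if (views != null && views.remove(view) && views.isEmpty()) {
            tiers.remove(tier);
        }
    }

    /**
     * Called when a view at the given tier changes its selected interval.
     * Every view in the next numerically higher occupied tier receives that
     * interval as its new visible interval. Returns the number notified.
     */
    int selectionChanged(int tier, Interval selected) {
        Map.Entry<Integer, List<TieredView>> next = tiers.higherEntry(tier);
        if (next == null) {
            return 0; // nothing registered below this tier
        }
        for (TieredView view : next.getValue()) {
            view.visibleIntervalChanged(selected);
        }
        return next.getValue().size();
    }

    public static void main(String[] args) {
        TierRegistrySketch registry = new TierRegistrySketch();
        // Tier numbers taken from the example above: 1, 5, and 8.
        registry.register(1, v -> System.out.println("Genome shows " + v.start + "-" + v.end));
        registry.register(5, v -> System.out.println("Intermediate shows " + v.start + "-" + v.end));
        registry.register(8, v -> System.out.println("Reading Frame shows " + v.start + "-" + v.end));
        registry.register(8, v -> System.out.println("Reading Frame Data shows " + v.start + "-" + v.end));

        registry.selectionChanged(1, new Interval(100_000, 200_000)); // updates tier 5
        registry.selectionChanged(5, new Interval(120_000, 125_000)); // updates both tier 8 views
    }
}
```

Because the registry looks up the next occupied tier rather than a fixed tier number, a view registered later at, say, tier 3 would slot in between the Genome and Intermediate windows without either of them changing, which is the point of leaving gaps between tier numbers.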
Besides being built on the NetBeans Platform, VESPA relies heavily on Java 2D for all of its visualizations. Each visual component was built by extending existing Swing components or containers and providing custom paint code to render data provided by specific polymer models.
There is also a local, per-user data store attached to VESPA, which is implemented using an embedded H2 Java database. To create a project, users specify up to five files representing DNA, protein, peptide, probe (oligonucleotide), and RNA sequence information; an analysis module then processes these files and writes the data to the database.
Apache POI is used to process files in Excel format and Super CSV for .csv files. SOCR (Statistics Online Computational Resource) from UCLA is used for doing statistical analyses to support our peptide scoring algorithms.
Having built numerous Java desktop applications and drawing from my own homegrown "mini application framework" over the years, I decided it was time to invest in a real application framework.
Several factors led to the choice of the NetBeans Platform for VESPA. The NetBeans Platform is open source and written in Java, it is the only major application framework that supports its own module system as well as OSGi, applications built with it can use Swing, and there is an abundance of support and documentation available.
Working with such a framework frees developers from having to manage window systems, toolbars, menus, actions, persistence and so forth, and allows them to focus on business logic in a modular sense.
Adding new developers or even having the community contribute, should this project become open source, becomes much easier.
After attending several sessions focused on the NetBeans Platform at JavaOne in 2010, I spoke with my project team and we decided to migrate our Swing-based Java application to it.
Learning the NetBeans Platform
In my opinion, there is a substantial learning curve to using the NetBeans Platform, and my best advice is to jump in with some of the tutorials and follow them through on your own machine. This is, of course, the perspective of someone who did not join a team already using the NetBeans Platform to develop software. The community provides countless sources of information, including tutorials, screencasts, how-tos, FAQs, books, forums, and mailing lists, all of which have proven very helpful.
One area I think puzzles a lot of developers is defining window modes. There are several good examples of how to capture the definition of a mode, but sometimes complicated window setups still don’t come out right.
Keep in mind that the editor not only has its own mode defined; you will also find a definition for the editor mode in the WindowManager.wswmgr file, which you can find by viewing any module’s layer file in context, directly under the Windows2 folder. This file also defines the default size and location of the main application window. I stumbled across a recursive definition of the split desktop in an old document titled "New Window System API Changes". This helped me to better understand the multiple horizontal and vertical constraints found in .wsmode files.
Finally, to the NetBeans community, thank you and keep up the good work!