A Former Large Hadron Collider Worker Discusses Big Science, Big Data
Simon Metson started working for Cloudant in March 2012 as an “ecology engineer” and became a full time employee in July. He is based in the UK.
Before I joined Cloudant, I worked on the Compact Muon Solenoid (CMS), one of the two Large Hadron Collider (LHC) detectors that announced their discovery of the Higgs boson earlier this month. I was a member of this collaboration for 10 years, from the start of my PhD through a bunch of postdoc positions (all based in the UK at Bristol Uni), so I was really excited and pleased to see the result announced. It is the result of an amazing endeavor, where everything (hardware, software, wetware) has been pushed beyond known limits and, more importantly, delivered.
Candidate Higgs event (copyright CERN/CMS)
For most of my time with CMS, I was involved in one way or another with data and workflow management (DMWM): designing and delivering software able to process, catalog and transfer the tens of petabytes of data the detector would deliver each year. For the last four and a half years I ran the DMWM project with a colleague from Fermilab.
CMS has a tiered, distributed computing system. Initial processing happens at the Tier 0 (at CERN). The resulting data is shipped to seven national labs (the Tier 1 sites) located in the US, EU and Asia for second-pass processing and offsite tape backup. Once on tape, the data is made available to the collaboration; any CMS collaborator can request the transfer of any dataset to their local Tier 2 cluster. Last time I checked, we'd moved over 100PB of data around and had about 50PB of data active in the system.
CMS 2011 transfer link map (copyright Paul Rossman)
The DMWM project writes the software that manages these transfers, records and manages the metadata for all the datasets and submits and tracks all the processing jobs through the various batch and grid systems for both organized central processing and more ad hoc user analysis.
Getting the job done
Watching how the computing system of CMS evolved over the last ten years from something that required a lot of hand holding to something that could be relied upon to deliver, often well beyond specification, was a great experience; seeing the sites, organization, software and people mature has had a profound effect on how I think about the design, implementation and management of software projects.
The nature of these experiments is such that any and all tools are available for use. We have a complex problem to solve, which results in a complex solution. While we rationalized a lot of things (making sensible decisions can save you thousands of lines of code) we still ended up with quite a range of supported technologies. Our stack used (from memory and in no particular order):
- Languages: Python, Perl, C++ and a bit of Java
- Databases: Oracle, MySQL, CouchDB, MongoDB
- Messaging: ZeroMQ, MonALISA
- Grid middleware stacks: EGEE, LCG, OSG, and ARC
- Batch schedulers: PBS, LSF, Torque, Maui, SGE, Condor
- Distributed file systems: DPM, HDFS, CASTOR, dCache, Lustre, GPFS
Wrangling that stack sure was entertaining…
I think CMS was the first experiment to start using NoSQL tools in production; HDFS is used at some of the T2 sites for their storage system, MongoDB is used as a smart cache aggregating data from various data services and CouchDB is used for various state machine and monitoring tasks. CouchDB’s replication feature was especially useful in CMS’s distributed computing system. Being able to quickly build aggregated views of monitoring or state information was great, as was having everything served via a REST HTTP API. Having it all in one application was even better!
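Two things made that combination so convenient: replication is kicked off with a single POST to CouchDB's `_replicate` endpoint, and aggregated views are just map/reduce functions stored in a design document, served back over the same REST API. As a minimal sketch of the pattern (the database names and document fields here are hypothetical, not CMS's actual schema):

```python
import json

# Hypothetical replication request: pull job-state documents from a
# Tier 1 site's CouchDB into a central aggregation instance.
# POSTing this body to http://central:5984/_replicate starts the sync.
replication = {
    "source": "http://t1-site.example.org:5984/jobstate",
    "target": "jobstate_aggregate",
    "continuous": True,        # keep syncing as job documents change
}

# Hypothetical design document: a map/reduce view that counts jobs by
# state, queryable over plain HTTP at
# /jobstate_aggregate/_design/monitor/_view/by_state?group=true
design_doc = {
    "_id": "_design/monitor",
    "views": {
        "by_state": {
            "map": "function(doc) { emit(doc.state, 1); }",
            "reduce": "_sum",  # built-in reducer: sums the emitted 1s
        }
    },
}

print(json.dumps(replication))
print(json.dumps(design_doc))
```

The point is that state collection, aggregation and serving all live in one application: a monitoring page is just an HTTP GET against a view, with no separate ETL pipeline to operate.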
From Higgs to Cloudant
I met Mike Miller at CERN in 2007 (well, in a fondue restaurant, I think), had followed Cloudant from its inception and we’d kept in touch after he left CMS to go look at neutrinos and dark matter at University of Washington, build a company and a set of desks. He contacted me at the end of 2011 to see if I’d be interested in joining Cloudant.
Like a number of my former colleagues, I felt that it was time to move on. As Mike once said to me, “we’re builders”, and I wanted to be building new things again. I’d experienced working in a large, collaborative environment (CMS has about 4000 people involved, spread over nearly 120 institutions), with all the benefits and issues that brings, and decided that seeing what life is like in a startup would be interesting.
Am I sad that I left CMS before the discovery announcement? Not really. The nature of these collaborations is that everyone who makes a contribution is recognized; the papers are signed by “the CMS collaboration” and I am a member for the next two years. While I wasn’t directly involved in the analyses that were presented on the 4th, the software that I was responsible for enabled that research. That’s something I’m quietly proud of.
What’s great about joining Cloudant now is that I get to see that whole evolution and scale up cycle again. Everyone in the team knows that we’re just at the beginning of a very interesting journey and is really excited to see where the company goes over the coming years. I’m really looking forward to helping the company scale up; the software, the organization and, importantly, the customers.