Originally Written by Chris Biow
Ah, but a man's reach should exceed his grasp,
Or what's a heaven for?
-- Robert Browning, "Andrea del Sarto"
I’ve always been fascinated by questions of scalability. Working for then-leading enterprise search vendor Verity around 2005, we developed a sub-specialty implementing text search applications for our largest customers. Anything over 30 million documents stretched the already obsolescent Verity engine to its limits. Such implementations were clearly a level beyond mere “enterprise scale,” and there was no common term for larger problems, so we started referring to them as “empire scale.”
Today, standard terminology in IT has converged on the term “Big Data,” to describe problems whose very scale dictates the software and techniques used to process them. It also describes a good part of what we at MongoDB address with our customers. Thousands of organizations have adopted MongoDB, primarily because it has enabled fast deployment and agility at all dimensions of scalability, such as databases providing subsecond response over billions of documents, or hosting hundreds of thousands of mobile applications on a single platform.
With customers having explored so many dimensions of scalability, we wanted to explore scalability in terms of a data size that would be instantly recognizable, but in practical terms. We settled on a petabyte of data, with the goal of creating a database as inexpensively as possible.
How much is a petabyte? Under the SI system, it’s abbreviated PB and equal to 10^15 bytes, not to be confused with IEC system’s “pebibyte”, abbreviated as PiB and equal to 2^50. Since petabyte is the more common standard, we chose that as our target.
Amazon Web Services (AWS) was our natural choice of deployment target because it is a very popular environment for MongoDB. Their pricing fits well, with hourly billing for disk and server resources.
The key to MongoDB write performance, and indeed that of most databases, is storage performance. Specifically, since MongoDB uses extent-based storage of documents and B-Trees indexes, random seeks (IOPS) tend to be the storage attribute that drives performance and cost at scale. This is in contrast with Log-Structured Merge (LSM) Tree storage, which uses sequential writes of DB segments, but then later pays an overhead to merge segments. It is worth noting that with MongoDB 2.8 users will be able to chose which of these approaches is best for their application. In this post all of our tests are based on MongoDB 2.6.
AWS provides three fundamental types of storage, each with implicit limitations on total disk space, IOPS, and sequential throughput, per machine instance:
- Elastic Block Storage (EBS). The Provisioned IOPS (PIOPS) option is best for write-intensive MongoDB applications, as ours would be. At the time we ran these tests, EBS volumes were limited to maximum 1TB and 4000 PIOPS, with at most 24 volumes being mounted to each machine instance. Thus, per-instance limits are 24TB and 96K IOPS. Block size per PIOP was 16KB, for a maximum sequential throughput of 1.5 GB/s. In practice, throughput is limited by the physical 10Gbps limit on all network traffic to an instance, including EBS.
- Ephemeral SSD. i2.8xlarge instances feature 8 SSDs of 800GB each, for a total of 6.4TB and on the order of 300K IOPS at a 4KB block size, for 1.2GB/s sequential throughput.
- Ephemeral Spinning Disk. AWS’s hs1.8xlarge provides an array of 24 internal SATA drives of 2TB each. Net of some roundoff errors, they can be configured in a single RAID0 array of 46TB, in principle producing around 2.5K IOPS and up to 2.5GB/s sequential throughput.
In tabular form, with costs added (based on the us-west-2/Oregon region, as of October 2014), we get:
|Instance Type||Capacity (TB)||IOPS (K)||Sequential (GB/s)||Cost: instance and storage ($/hr)||Cost: PB total storage ($/hr)|
The far right column is the cost to maintain a petabyte of storage online. It doesn’t address how long it takes to actually write a petabyte database with MongoDB. As we mentioned above, it’s mostly IOPS that tend to limit write performance, though all the measures of performance need to be kept in mind. The IOPS requirement isn’t entirely an independent variable, depending quite a bit on block size of EBS and SSD, kinematics of spinning disk, document size, and details of synchronizing memory mapped files from RAM to disk.
So the next step was to experiment with each of our options. There are practicalities of access to these AWS resources, aside from sheer cost.
- EBS PIOPS are limited resources. By default, each AWS account is currently limited to 40K PIOPS. If we went with the maximum of 24TB and 96K PIOPS per instance, we’d need 42 instances for our petabyte, at 4M total PIOPS. While Amazon would be happy to sell that high a level of PIOPS usage, it represents a significant capital investment on their part, so they’d want some guarantee it would be used for more than a brief test. As it turned out, due to our close corporate partnership with AWS, we had a few days’ availability last year of 1M PIOPS, while Amazon was stocking up for their own customers’ Christmas rush. It let us get comfort with petascale usage of EBS PIOPS, but the window wasn’t long enough to complete a full petabyte load.
- SSD instances are also limited in availability. We did test smaller clusters of i2.8xlarge instances and found that we could fully load them in about two hours. So for $2.2K, we could build a petabyte using a cluster of 160 of these servers. Amazon was cooperative in raising our account limit, but as they warned us, there physically weren’t enough of these servers available at any given time in a single region.
- We had no problem raising our default limit for hs1.8xlarge instances and in provisioning up to 30 of them at a time. They could be bulk-loaded in 24 hours, or about $2.5K in costs.
Having chosen our (virtual) hardware architecture, we had to decide what sort of data and loading system we’d use, as well as all the usual configuration choices for hardware and virtualization.
The Yahoo Cloud Serving Benchmark (YCSB) has been widely used for testing different kinds of databases. Having to take something of a Least Common Denominator approach over all databases, it’s almost entirely key-value oriented, using none of the differentiators that have so dramatically broken out MongoDB from other databases. For that reason, it’s of little value in making comparisons, outside of the bare-bones, key-value stores for which it was written.
But for all that, it’s readily available, standard, and provides good parallel threading when configured appropriately, so we tend to use it for load generation. However, we have fixed and customized the base build to better work with MongoDB.
First, if you are going to use YCSB with MongoDB, you’ll at least want to use Achille Brighton’s fork of YCSB. Achille has continued maintenance of YCSB over the last couple of years, including MongoDB Java driver updates, corrections to connection pool sizing, elimination of superfluous error checks, explicit cursor closing, standard write concern levels, and other key updates. Please also contact us at MongoDB for assistance, so that we can provide guidance in configurations appropriate to your requirements.
For this project we configured the setup as follows:
- Client Setup: for our purpose of parallel bulk-loading, on instances that had ample, spare CPU, it made the most sense to run YCSB on the shard servers, pointed directly at the local mongod processes. This saves a network hop, circumventing the typical AWS network limitations.
- Server Setup: as summarized above, we used Amazon machine instances with a large array of spinning disks, being most cost-effective for petascale storage. More on that in a minute...
- Network Setup: although our parallel loading approach minimized network overhead, we took advantage of AWS options wherever appropriate.
We also added some project-specific enhancements to YCSB. First, even using large documents, we were going to need more than 2 billion documents to add up to a petabyte. We therefore had to revise the primary document IDs to use long integers. While we were at it, for efficiency, we changed the type of the IDs from the form “user1234” to the long integer itself. We also wanted to add document field values that would be more useful than the random byte sequences used in YCSB’s data fields. We therefore added, for each of the 8 random data fields, a parallel field with an integer value, randomly generated with a zipfian, or long-tail distribution. These 8 integer fields would support queries with meaningful selection criteria, allowing specification of result sets of any size, according to the zipfian distribution. Moreover, those result sets could use MongoDB Aggregation over the other integer fields, retrieving sums, averages, etc.
Cloud and System Specifics
For network optimization, we constrained the servers to a single AWS Availability Zone, keeping latency low. Launching all servers in a single AWS Placement Group would have provided even better network locality, but we weren’t able to provision 30 of our servers in a single Placement Group.
Since our objective was sheer data volume and speed, for purposes of Proof of Concept, we configured the 24 disks as a single, RAID 0 (striped) array. We used the XFS filesystem with 128MB log size, mounted as recommended in our docs. If this were for production purposes, you’d probably use RAID 10 (mirrored/striped), which would double the server requirement.
For an operating system, we chose Ubuntu 13.10 for its combination of XFS filesystem support and working memory cgroups. From the stock Amazon Machine Instance (AMI), we created our own AMI incorporating security updates, Oracle Java 7, recommended ulimit settings, munin-node monitoring software, Global Shell to run commands across all servers in the cluster, and our modified YCSB. Upon boot, a cloud-init script took care of configuring the disks and launching MongoDB.
Stereotypically, spinning SATA disks have rather low random IOPS performance, around 100 per disk. But when you have 24 of them in RAID 0, you get, well, 24 times the write performance, giving a perfectly respectable 2.4K IOPS. To provide optimal loading of that array, we used four MongoDB shards on each server. This approach may be warranted when dealing with large or fast disk arrays. To ensure that the mongod server processes did not contend for memory with each other or with the YCSB Java processes, we segregated each of them in its own section of RAM, using Linux cgroups.
The shards themselves were pre-split with one chunk per shard, evenly distributed over the range of primary IDs that we would load. A script then turned on the balancer process and monitored the migration of these empty chunks to each shard, turning the balancer back off once they were 1:1 with the shards. As there would be no benefit to splitting these chunks, we ran the mongos router processes with AutoSplit=off.
Since this was intended to be a bulk load, and we wanted to avoid network overhead, we launched a YCSB process locally, alongside each shard, writing directly to the mongod server process, using a range of IDs to match the chunk on that shard. We created indexes on the primary key, as required for sharding, and also on one of the zipfian-distributed integer fields, so that we could test queries more “interesting” than the simple key-value retrievals offered by YCSB.
Ready to Run
That’s it for the background and decision-making! At this point, we were ready to launch the cluster and see how it would progress toward the petabyte goal.
In the next post we’ll examine what happened.
About Chris Biow
Chris Biow is Principal Technologist and Technical Director at MongoDB, providing technical leadership for MongoDB field activities, with particular interest in enterprise database Platform-as-a-Service offerings and in the practical frontier of scaling capability. Chris has worked in Big Data and NoSQL since well before the terms were coined, as Principal Architect at Verity/Autonomy, Federal CTO/VP at MarkLogic, and with Naval Research Laboratory as a Reservist. His work includes Big Data projects with Fortune 10 companies, US Department of Defense, National Archives, and Bloomberg News. His career also includes 8 years as a Radar Intercept Officer, flying the aircraft-carrier-based F-14 Tomcat. Chris is a graduate of the United States Naval Academy and earned his Master’s degree from the University of Maryland.