There are many YCSB benchmark reports out there, but only a few run Workload E with good hygiene. Most of what I see customizes the workload away from its original intent (95% scans and 5% inserts) to some other distribution. Recently Avalon published a report comparing MongoDB 3.2 and Couchbase Server 4.5. One important note about the benchmark: it uses the most popular YCSB repo, so the client-side code is publicly available and reproducible, and the original benchmark definitions have not been modified in any way.
The benchmark environment is nine r3.8xlarge nodes (32 cores and 60GB RAM with SSDs) with 150 million items in the database. Workload E simulates the query load of threaded conversations in communication apps: queries run a "range scan" with "order by" and "limit 50".
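To make the workload concrete, here is a minimal sketch of what a single Workload E operation looks like against a sorted key space. The in-memory table, the `user<id>` key naming, and the record contents are stand-ins for illustration; the seek-then-take-50 shape follows the workload description above.

```python
import bisect

# Toy stand-in for a table sorted by key. Keys mimic YCSB's "user<id>"
# naming; the values are opaque records.
records = {f"user{i:010d}": {"field0": f"value{i}"} for i in range(1000)}
sorted_keys = sorted(records)

def scan(start_key, limit=50):
    """Range scan: seek to start_key, return the next `limit` records in key order."""
    pos = bisect.bisect_left(sorted_keys, start_key)
    return [(k, records[k]) for k in sorted_keys[pos:pos + limit]]

# One Workload E query: start somewhere in the key space, read 50 records.
result = scan("user0000000100")
```

The interesting part is not the scan itself but where it runs: the next sections look at what happens when the index serving this scan is partitioned across nodes versus held globally.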
The results show that MongoDB can provide more than 8,000 queries/sec while Couchbase Server delivers more than 30,000 queries/sec on the same nine-node AWS deployment.
There are a few reasons why the difference exists on the query workload, but I think global vs. local indexes account for a majority of it. Here is what it boils down to: Couchbase Server query execution does not have to scatter-gather, thanks to its global indexes, while MongoDB has to reach across the entire cluster due to its local indexes.
It really comes down to the laws of physics, or at least the laws of networks: when a query like the one in Workload E runs, MongoDB has to send it to every node in the system and have each node compute its own "top 50." With nine nodes, that means 450 items (9*50 items) travelling back to a single node, which then does a merge-sort to find the final "top 50."
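The scatter-gather step above can be sketched as a toy model. Each node returns its local "top 50" already sorted (the shard contents here are random numbers, made up for illustration), and the coordinating node merge-sorts the 450 incoming items down to the final 50:

```python
import heapq
import random
from itertools import islice

random.seed(7)
NODES, LIMIT = 9, 50

# Each node holds a shard of the data; a scatter-gather query asks every
# node for its local "top 50" in key order.
shards = [sorted(random.sample(range(1_000_000), 5_000)) for _ in range(NODES)]
per_node_top50 = [shard[:LIMIT] for shard in shards]

# 9 * 50 = 450 items travel back to the coordinating node...
items_on_the_wire = sum(len(t) for t in per_node_top50)

# ...which merge-sorts them and keeps only the final top 50.
final_top50 = list(islice(heapq.merge(*per_node_top50), LIMIT))
```

Note the asymmetry: 450 items cross the network so that 400 of them can be thrown away at the coordinator.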
With a global index in Couchbase Server, query execution makes only a single hop to the node that holds the index. The range scan with "order by" and "limit 50" is pushed down to the index, so only 50 items are handed back to the N1QL query engine to deliver the results.
The issue with local indexes and scatter-gather does not end here. The more nodes you add beyond the nine in this test, the worse things get. Given the query in Workload E (or any other range scan), each added node has to compute its own "top 50," and the network gets saturated with node-count*50 results travelling for each query. At a query throughput in the tens of thousands per second, life gets worse the more nodes there are in the mix. You can see the impact of network saturation in the results here, which compare throughput against the number of client threads generating the load.
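The scaling argument can be put in back-of-the-envelope form. Only the node-count*50 fan-in comes from the discussion above; the 10,000 queries/sec throughput and 1 KB item size below are assumptions, purely for illustration:

```python
LIMIT = 50            # items returned per query, as in Workload E
QPS = 10_000          # assumed query throughput, for illustration only
ITEM_BYTES = 1024     # assumed average item size, for illustration only

def scatter_gather_items(node_count, limit=LIMIT):
    """Items converging on the coordinating node per query with local indexes."""
    return node_count * limit

def fan_in_bytes_per_sec(node_count):
    """Result traffic (bytes/sec) funnelling into the coordinating node."""
    return scatter_gather_items(node_count) * ITEM_BYTES * QPS

nine = fan_in_bytes_per_sec(9)       # nine nodes: 450 items per query
eighteen = fan_in_bytes_per_sec(18)  # doubling the cluster doubles the fan-in

# With a global index, fan-in stays at LIMIT items per query
# regardless of how many nodes hold data.
global_index = LIMIT * ITEM_BYTES * QPS
```

The point of the sketch is the shape, not the absolute numbers: with local indexes the result traffic grows linearly with cluster size, while with a global index it stays flat.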
I'd encourage everyone to dig deep into the numbers and draw their own conclusions. You can find the full disclosure report here.