Usually, your application will persist of a set of queries on ArangoDB for one scenario (e.g. displaying your user's account information, etc.). When you want to make your application scale, you'd fire requests on it, and see how it behaves. Depending on internal processes execution times of these scenarios vary a bit.
We will take intervals of 10 seconds, and graph the values we will get there:
- Average - all times measured during the interval, divided by the count.
- Minimum - fastest requests.
- Maximum - slowest requests.
- The time "most," aka 95%, of your users may expect an answer within - this is called the 95th percentile.
We want to push the installation hard, and not waste too many resources on these statistics. However, we want it in real time, so we can align it to other performance figures, e.g.:
- The number of open file handles.
- Number of running threads.
- The used CPU resources.
Our Monitoring Solution
Out of first thoughts that one would want this kind of metrics the statsd pattern evolved, which has been meanwhile implemented in many monitoring solutions. The one we will pick here is collectd, since collectd can also aggregate the resource usage of a process.
Since later on, we will require to monitor per process file-descriptors, we patch collectd with this pull request (or use a newer version than 5.7.2 once it's available).
In a permanent monitoring solution, one would i.e. want to configure graphite to collect the values. To simplify the scope of this article, we use the collectd rrd backend and inspect our metrics using KCollectd which is a simple KDE program available with most current linux distributions.
Monitoring ArangoDB CPU Usage With collectd
We match the process by regular expression using this collectd configuration snippet:
CollectFileDescriptor true CollectContextSwitch true ProcessMatch "adb_single" ".*bin/arangod"
Integrating statsd in Our Setup
Statsd (as mentioned above) strives to add client instrumentation to your program. Since we only want to have statistics, we don't need 100% accuracy. Thus we don't care if some of the values measured get lost if we gain the advantage that we don't have to hold up the program reporting it.
For that reason, statsd packages are sent via UDP in a fire and forget fashion from the client to the server.
We add the statsd client to pyarango so it can give us these metrics on a per query scale.
This implementation identifies AQL queries by making a hash of the query string - therefore you need to use bind values in order to get useful values.
Getting in Touch With "Real" Numbers
Here's a little python script creating some load on the DB so we can measure some nice values:
Learning Which Is the Identifier of Our Query
On a first run, we limit the loop, so we can identify our queries in the
Inspecting the Results
We run the test for some minutes, and monitor it with KCollectd. We choose several metrics to display by dragging them from the tree on the left into a graph area:
statsd branch we find the values our AQL query emits. We create one graph with it.
To see the resource usage of the ArangoDB process we also add another graph for it, and a third graph that shows the number of file descriptors ArangoDB is using.
From inspecting the graphs, we see the query up to using 60ms which is not that bad, but its runtime behavior degrades over time - which is rather undesirable.
The blue line identifies the 95%-percentile, which is way below the brown line with the maximum reply time. This indicates that there are single replies that are much worse than the percentile - a behavior that one wouldn't like to have at all.
All of the above is usually an indicator for a full collection scan happening since we only query one document, it may be found faster or slower, which is what makes the values bounce.
Taking a look at our code, we see that this aligns to the number of documents we fill into the
Thus we take a closer look at the monitored query using explain:
AHA! no index is used - this is why it worsens with the collection filling up.
So we create an index in the setup phase of our test code:
Once its there, explain will show us it's going to be used:
and the values we measure show much better results:
While the first run wasn't finished yet after 7 minutes, the second run is already done in 4 minutes. We saw that the
Thread count numbers weren't in the range of the CPU time values at all, so we move it over to the file handle count graph.
Watching the reply times of the query from the first run, we can see worst reply times were around 60 ms. At the start we see the
truncate-operation is giving us little slower values, but during the regular test run the maximum values are at around 5ms - which is perfectly acceptable.
We also can see the resource usage by the ArangoDB process is way better - 100k jiffies compared to 900k jiffies in the first test run.
We also see that the system CPU time is 1/3rd of the total CPU, which means we do a lot of I/O (write new documents to disk) while mildly loading the machine with userspace calculations.
In the first run, the system CPU usage would be next to unmeasurable.
In the next articles, we're going to take a look at how to measure and instrument transactions, and what to look at on clusters. We will also implement a user scenario to measure the possible capacity of a given system. Stay tuned!