# Counting Triangles Smarter: or, How to Beat Big Data Vendors at Their Own Game

# Counting Triangles Smarter: or, How to Beat Big Data Vendors at Their Own Game

Join the DZone community and get the full member experience.

Join For Free**Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.**

A few months ago, I discovered Vertica’s “**Counting** **Triangles**”-article through Prismatic. The blog post describes a number of benchmarks on **counting** **triangles** in large networks. A *triangle* is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a *friendship triangle*. **Counting** all **triangles**within a large network is a rather *compute-intensive task*. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we identified a triangle.

The Vertica article illustrates how to execute an optimised implementation of the above algorithm through*Hadoop* and their own *Massive Parallel Processing (MPP) Database* product (both being run on a 4-node cluster). The dataset involves the LiveJournal social network graph, containing around *86 million relationships*, resulting in around *285 million identified triangles*. As can be expected, the Vertica solution shines in all respects (

**counting**all

**triangles**in

*97 seconds*), beating the Hadoop solution by a factor of

*40*. A few weeks later, the Oracle guys published a similar blog post, using their

*ExaData*platform, beating Vertica’s results by a factor of

*7*, clocking in at

*14 seconds*.

Although Vertica and Oracle’s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating **triangles** through a * smarter algorithm* that is able to deliver similar performance on

*commodity hardware*(i.e. my MacBook Pro Retina).

### 1. Doing it the smart way

The LiveJournal social network graph, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a *source* and *target vertex* (where each vertex is identified by an unique id). Relationships are assumed to be *bi-directional*: if person 1 knows person 2, person 2 also knows person 1.

0 1 0 7 0 8 0 9 1 2 1 5 2 0 2 5 2 7489

Let’s start by creating a *row-like* data structure for storing these relationships. The *key* of each row is the *id of the source vertex*. The row *values* are the id’s of all *target vertices* associated with the particular source vertex.

0 - 1 7 8 91 - 2 52 - 0 5 7479

With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let’s improve our data structure by *indexing* each relationship through its *lowest key*. So, even though the LiveJournal file declares the relationship as being “*2 0*“, we persist the relationship by assigning the *2*-value to the *0*-row. (Order doesn’t matter as relationships are bi-directional anyway.)

0 - 1 2 7 8 91 - 2 52 - 5 7479

Calculating **triangles** becomes a lot easier (and faster) now. If the key of a row is part of a *triangle*, its two*adjacent vertices* should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we need to check whether we can *find edges amongst the vertices* contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original *source*-values. By doing so, we get rid of one expensive *for*-loop. Nevertheless, the amount of calculations that need to be executed is still close to *2 billion*!

sourceValues = sourceRow.getValues();for (sourceValue : sourceValues) {targetValues : rows[sourceValue].getValues();for (targetValue : targetValues) {if (sourceValues.contains(targetValue) {triangles++;}}}

### 2. Persisting the relationships

The data structure as described above is persisted in a *custom datastore* that we developed at Datablend for powering the similr-engine (a chemical structure search engine). The datastore is *fully persistent* and optimised for quickly performing *set-based operations* (intersections, differences, unions, … ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around *20 seconds* on my MacBook Pro. An additional *4 seconds* is required for persisting the entire data structure to the datastore itself. So around*25 seconds* in total for effectively storing all 86 million relationships. Vertica nor Oracle mention the time it takes to persist the Livejournal dataset within their respective databases. However, I assume it also requires them a few seconds to execute this load-operation.

What about *disk usage*? The custom Datablend datastore takes the *second place*, requiring only 37 Mb more compared to Oracle’s Hybrid Columnar Compression version.

1. Oracle HCC Query High: 262 Mb2. Datablend datastore: 296 Mb (+13 %)3. Oracle HCC Query Low: 361 Mb (+38 %)4. Vertica: 560 Mb (+113 %)

### 3. Calculating the **triangles**

The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 265 million **triangles** in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes the first place, clocking in at *9 seconds*! The calculation runs fully pararellized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!

1. Datablend datastore: 9 sec (Macbook Pro Retina)2. Oracle: 14 sec (+55 %) (4-node cluster)3. Vertica: 97 sec (+1077 %) (4-node cluster)

### 4. Conclusion

Datablend’s custom datastore is a very specific solution that targets a *particular range of Big Data computations*. It is in no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don’t hesitate to contact us if you have any questions related to similr and/or Datablend.

**Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.**

Published at DZone with permission of Davy Suvee , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}