# Mutual Fund Benchmarks With Neo4j

# Mutual Fund Benchmarks With Neo4j

### Modeling data in a graph database is a wild frontier. Use the data you have and the questions you're trying to answer as a guide and be sure to let your imagination and creativity help you along the way.

Join the DZone community and get the full member experience.

Join For FreeJust the other day, I had a conversation with an Investment Risk Manager about one of the data problems his team was working on and he was wondering if Neo4j could help. Imagine you have about 20,000 mutual funds and ETFs, and you want to track how they measure up against a benchmark like, say, the returns of the S&P 500. I’m sorry, did I say *one*? I meant *all of them* — let’s say, 2,000 different benchmarks — and you want to track it *every day* for a rolling five-year period. So, that’s 20,000 securities * 2000 benchmarks * 5 years * 252 trading days a year (on average)… or... **50 billion data points**. That’s a *big* join table if we're using a relational database. How can we efficiently model this in Neo4j?

To keep things simple, let’s say we only want to track one measure, and that measure is R-squared. When we compare the change in the value of the mutual fund vs. the benchmark, how closely did the changes track? R-squared is always between 0 and 1. A value of zero means the benchmark has nothing to do with the mutual fund, and a value of one means it has everything to do with it. So how could we model this?

Our first option is to model the fund as a node, the benchmark as a node, and the data point between them as a relationship with both a date and an `r_squared`

property, as shown above. We would have 20,000 thousand Fund nodes, 2,000 Benchmark nodes, and 50 billion relationships. That’s an extremely dense graph.

It’s a start, but let’s try a query and see how our model does. I want to know what the `r_squared`

was on October 31, 2017 (Halloween) for Fund X and Benchmark Y.

```
MATCH (f:Fund {name: $fund_name})-[r:HAS_DATA]->(b:Benchmark {name: $benchmark_name})
WHERE r.date = $date
RETURN r.r_squared
```

I would create an index on `:Fund(name)`

so I can start there quickly. This is where this model fails us performance-wise. I would have to look at potentially 2,000 relationships (one for each benchmark) * 5 years * 252 days = **2.5 million** relationships in order to find the one I want.

Let’s try a different approach that you know I am a big fan of — using dated relationship types:

To find the `r_squared`

between a fund and a benchmark, I can write my query this way:

```
MATCH (f:Fund {name: $fund_name})-[r:ON_2017_10_31]->(b:Benchmark {name: $benchmark_name})
RETURN r.r_squared
```

Now, instead of checking up to 2.5 million relationships for the right one, I only have to check 2,000. One for each benchmark that exists — that’s much faster. Another benefit is that if I want all of the benchmarks for that one fund on a single day, this is easy to get:

```
MATCH (f:Fund {name: $fund_name})-[r:ON_2017_10_31]->(b:Benchmark)
RETURN b.name, r.r_squared
```

But let’s try another query. Let’s say we want to see how the `r_squared`

of a fund against a benchmark has changed since the beginning of this year.

```
MATCH (f:Fund {name: $fund_name})-[r]->(b:Benchmark {name: $benchmark_name})
WHERE TYPE(r) STARTS WITH "ON_2017"
RETURN TYPE(r) AS day, r.r_squared
```

We would once again start by finding our Fund node in an Index, and then we would have to traverse 2.5 million relationships again to find the ones we want. Now, traversing 2-4 million relationships is not the end of the world. Neo4j can do that in about a second per core. But we can do better. Let’s try moving the benchmark name into the relationship instead.

With this model, we would start at our Fund node, traverse 5 years * 252 days = **1,260 relationships** and check the date property to answer our last query.

```
MATCH (f:Fund {name: $fund_name})-[r:FOR_BENCHMARK_X]->(b:Benchmark {name: $benchmark_name})
WHERE r.date > $date
RETURN r.date, r.r_squared
```

OK… what about the previous query of getting all the `r_squared`

values for all the benchmarks for a particular day?

```
MATCH (f:Fund {name: $fund_name})-[r]->(b:Benchmark)
WHERE r.date = $date
RETURN b.name, r.r_squared
```

We have a bit of a problem here since now, we’re looking at 5 years * 252 days * 2,000 benchmarks or **2.5M relationships** again. We can’t use both the date *and* the benchmark name as a relationship type. Neo4j can only handle about 32k relationship types, not the 2.5 million we need. So what can we do?

That’s right. We can create a new type of node, a FundDay — one per fund per day, so 20,000 funds * 5 years * 252 days is about **25 million nodes**. We then connect these to our funds and to the benchmarks. Our query to get all the `r_squared`

values for all of the benchmarks for a particular day now looks like this:

```
MATCH (f:Fund {name: $fund_name})-[:ON_2017_10_31]->()-[r]->(b:Benchmark)
RETURN b.name, r.r_squared
```

We start from the Fund node, traverse to the FundDay (which we will ignore), and then to all the Benchmarks for that FundDay. So, our query will traverse one relationship for the day plus 2,000 for the benchmarks. Traversing 2,001 relationships is much faster than traversing 2.5M of them.

There is just one tiny little problem. Our graph is now comprised of 20,000 Fund nodes, 2,000 Benchmark nodes, and 25M FundDay nodes, which is fine… but we still have all the relationships between the 25M FundDay nodes and the 2,000 Benchmarks. 50 billion relationships is a lot. Can we trim this database down? We can by remembering that Neo4j can handle scalar values like Integers, Strings, and Floats, as well as Arrays of them. So, instead of storing each `r_squared`

separately, what if we stored them in an Array?

Instead of storing a single value in our `r_squared`

property, we now store an array of 5 years * 252 days `r_squared`

values. Halloween 2017 would be the 210th trading of 2017, + 4 years * 252 days… the `r_squared`

would be stored in position 1,218 in our array.

```
MATCH (f:Fund {name: $fund_name})-[r:FOR_BENCHMARK_X]->(b:Benchmark {name: $benchmark_name})
RETURN r.r_squared[1218]
```

What about all of the `r_squared`

values for a single benchmark from the first trading day of 2017 until Halloween? 4 years * 252 = 1,008, plus one for the start of 2017 until our trading day for Halloween 2017, so our query now looks like:

```
MATCH (f:Fund {name: $fund_name})-[r:FOR_BENCHMARK_X]->(b:Benchmark {name: $benchmark_name})
RETURN r.r_squared[1009..1218]
```

We start at our Fund node, traverse a single relationship to our Benchmark, and get a slice of the `r_squared`

property array as values. What about for *many* Benchmarks?

```
MATCH (f:Fund {name: $fund_name})-[r]->(b:Benchmark)
RETURN b.name, r.r_squared[1009..1218]
```

The query above costs us just 2,000 relationships traversals (one per Benchmark) and 2,000 property lookups… and the best part is we’ve reduced the size of our database to just 20,000 Fund nodes, 2,000 Benchmark nodes, and 40 million relationships (one for each combination).

There is still one small problem: Neo4j stores array properties in blocks of 120 bytes by default. An overhead of 8-10 bytes leaves 112 bytes of storage. Since each `r_squared`

value is a float that takes 8 bytes, each block would only take 14 `r_squared`

values at a time — forcing us to jump around from block to block a bunch and slow us down quite a bit.

But since we know ahead of time how many we want to store, we can change this parameter (`array_block_size`

) to a value above **10,080** (5 years * 252 days ( 8 bytes). Neo4j will use that value plus 8-10 bytes of overhead. Each page is 8,192 bytes, so we can reserve a page and a half (12,878) or just two pages for each entry (16,374), leaving 10 bytes for the overhead. Our database will be about 700 GB in size, so we need to grab an x1.16xlarge or an x1e.8xlarge with a monthly reserved three-year paid upfront cost of about $1,365 a month. Another option is that we could buy a big RAM server for about $20,000 and host it ourselves.

Remember there is no “third normal form” for modeling data in a graph database. It’s a wild frontier. Use the data you have and the questions you are trying to answer as a guide, but be sure to let your imagination and creativity help you along the way.

Published at DZone with permission of Max De Marzi , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

{{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}