Over a million developers have joined DZone.

Fast and Flexible Risk Aggregation on Apache Spark

DZone's Guide to

Fast and Flexible Risk Aggregation on Apache Spark

In this post, Deenar Toraskar, founder of risk-analytics solution provider Think Reactive and a contributor to Spark, describes why new requirements for agile, self-service, and VaR reporting help make the case for building out new analytic infrastructure on the Apache Hadoop ecosystem.

· Cloud Zone
Free Resource

MongoDB Atlas is a database as a service that makes it easy to deploy, manage, and scale MongoDB. So you can focus on innovation, not operations. Brought to you in partnership with MongoDB.

As described previously in this post, Value at Risk (VaR) is a popular risk measure used for risk management, margin calculation, regulatory reporting, capital charges, and pre-trade decision making. VaR is also often used for hedge optimization, portfolio construction, and to optimize the tracking error of a portfolio against some benchmark.

VaR is defined as predicted worst loss over a target horizon within a given confidence interval. For example, if you hold $1000 worth of GOOG shares, one-day VaR with a 95% confidence level is $22.

VaR is used in many contexts across an organization, so it can carry different names for different use cases and thereby create various reporting needs. Here is a typical set of views:


Regardless of context, it’s important to note that VaR is not simply additive: the VaR of a portfolio containing assets A and B does not equal the sum of VaR of asset A and VaR of asset B. Hence, if you were to create the following table:

CREATE TABLE positionsVaR (book INT, trader String, clientAccount String, product String riskFactor String, riskType String, var Double)

The following query would fail:

Select book, sum(VaR) from positionsVaR group by book

For this reason, traditional databases are of limited value when the VaR information being reported is not linearly aggregate-able. (Many important risk measures beyond VaR, such as counterparty credit risk, fall into this category.) For that reason, most risk-reporting warehouses will pre-aggregate all frequently-used dimensions and use the pre-aggregated values to report VaR. This approach works somewhat, but the available views are limited and fixed: calculating VaR along any other dimension, or for a custom set of assets, requires a new aggregation job and thus for users to wait for results (for hours or even days).

There are other limitations as well:

  • Due to a shallow schema with limited analytical capabilities, only standard slice-and-dice operations and simple aggregation functions can be used for reporting.
  • Limited support is available to run user-defined functions (UDFs) or to call out to external analytical libraries. Thus risk measures are pre-aggregated at a fixed set of dimensions.
  • Schema are fixed and thus new analytics and aggregations require new views and schema changes.

Why Hadoop?

In contrast, Apache Hadoop ecosystem technologies such as Apache Spark, Apache Impala (incubating), and Apache Hive—combined with serialization formats such as Apache Avro, Thrift, Protocol Buffers, Hive Serdes—can serve as the foundation of “high-definition” analytic infrastructure with these attributes:

  • Extensibility: embedding custom analytics for aggregation and reporting via UDFs and UDAFs allow a Domain-Specific Language (DSL) to be implemented as an extension to built-in functionality. UDFs can call external analytical libraries, as well.
  • Support for complex data: using persistence formats such as Avro, Thrift, Protocol Buffers, and Hive Serdes, you can model complex domain objects via their support of rich types.
  • Flexibility: Evolvable schemas allow quicker response to changing requirements.
  • High performance: Spark supports fast, large-scale data processing using an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

With this approach, users can ask any questions or run any reports they like without the need to request custom analytics jobs, making deeper insight available on-demand.

Let’s review an example. Let’s say you want to store a complete view of a trade using Spark SQL, instead of calculating a single VaR number using some assumptions. With the following SQL, all the historic P&L for the trade are stored using Hive’s Array datatype.

CREATE TABLE positions (book INT, trader String, clientAccount String, product String riskFactor String, riskType String, nominal Double, pnls Array)

Users can then combine a deep representation of the trade with custom analytics via UDFs:

/* value at Risk given pnls */
def valueAtRisk(pnls: Seq[Double], percentile: Double): Double = {
 val size = pnls.size
 val sortedVector = pnls.sorted
 val indexR = (size * ((100 - percentile) / 100)) - 1
 val upper = math.ceil(indexR).toInt.max(0).min(size - 1)
 val lower = math.floor(indexR).toInt.max(0).min(size - 1)
 if (lower == upper)
 else /* interpolate if necessary */
   ((upper - indexR) * sortedVector(lower)) + ((indexR - lower) * sortedVector(upper))
sqlContext.udf.register(“valueAtRisk”, valueAtRisk _)

Now, an entire risk profile in the data model allows users to ask any questions through the reporting layer:

/* VaR by client */
$ select clientAccount, valueAtRisk(arraySum(pnls), 99.0) from positions group by clientAccount

/* VaR by Trader */
$ select trader, valueAtRisk(arraySum(pnls), 99.0) from positions group by trader

/* What If I sell GOOG */
$ select clientAccount, valueAtRisk(arraySum(pnls), 99.0) from positions where riskFactor <> ‘GOOG’ group by clientAccount

/* Change confidence level */
$select clientAccount, valueAtRisk(arraySum(pnls), 95.0) from positions group by clientAccount

There are other benefits to this approach, as well:

  • Because the DSL is accessible via SQL, users needn’t learn a new language to access the functionality exposed via UDFs.
  • Hundreds of standard Hive and community-contributed UDFs from the likes of Facebook, Brickhouse, and do so on can be combined with custom UDFs.
  • Spark UDFs are concise, quickly implemented, and unit-testable.
  • Spark UDFs are versatile/Write Once Use Everywhere for streaming jobs, batch jobs, REST services, ad hoc queries, and machine learning.
  • Spark SQL is not limited to querying data from HDFS; it also integrates with data sources like Apache HBase, Apache Cassandra, Apache Kudu (incubating), and even relational databases. This support for polyglot persistence allows for joins across data sources; for example; positions can come from HDFS, time series from HBase or Cassandra, and business hierarchies and reference data from Oracle Database.

No Fear of Market-Risk Regulatory Changes

In January 2016, the Basel Committee on Banking Supervision published its new minimal capital requirements framework for market risk. As the committee has identified a number of weaknesses in VaR, including its inability to capture “tail risk,” it proposes a shift away from VaR in favor of Expected Shortfall (ES, aka Conditional Value at Risk [CVaR]), an alternative risk measure that is more sensitive to the shape of loss distribution in the tail of the distribution.

Fortunately, the high-def approach I have described allows calculation of new metrics without changes. Thus, you could easily calculate ES with the following.

def expectedShortfall(pnls : Seq[Double], percentile: Double): Double = {
 val size = pnls.size
 val sortedVector = pnls.sorted
 val indexR = (size * ((100 - percentile) / 100)) - 1
 val upper = math.ceil(indexR).toInt.max(0).min(size - 1)
 val lower = math.floor(indexR).toInt.max(0).min(size - 1)
 if (lower == upper)
   (sortedVector.take(lower).sum + ((upper - indexR) * sortedVector(lower)) + ((indexR - lower) * sortedVector(upper)))/indexR


Vendors have generally focused on the cost-reduction benefits of moving workloads onto the Apache Hadoop stack. In fact, in some cases, this stack can serve as the foundation of smarter, higher-definition, more agile, and faster analytic infrastructure overall.

With this approach, you can not only free your engineering team to do other things than run reports, but also mine your data arbitrarily to uncover new opportunities for business optimization.

MongoDB Atlas is the best way to run MongoDB on AWS — highly secure by default, highly available, and fully elastic. Get started free. Brought to you in partnership with MongoDB.

apache spark ,apache hadoop

Published at DZone with permission of Deenar Toraskar. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}