How to Use Window Aggregates in AnzoGraph for Running Totals or Aggregations Over Time
This article will show you how AnzoGraph provides built-in support for advanced analytics such as window aggregates.
Join the DZone community and get the full member experience.
Join For FreeAnzoGraph is a native, massively parallel processing (MPP) graph OLAP database purpose-built for interactive data warehousing analytics as well as graph analytics. In our previous article, we showed you how AnzoGraph supports not only semantic graphs but also labeled property graphs using the RDF* proposed W3C standard. This article will show you how AnzoGraph provides built-in support for advanced analytics such as window aggregates, which enable users to compute multiple aggregate values on a particular partition or window of the result set using just one query, instead of having to write and submit multiple redundant queries to get the same results.
Unlike grouped aggregate functions that group results and return a single value, window aggregates return a value for every row in the specified window. For example, using the grouped aggregate SUM function to add up the total number of tickets sold in a year returns one value: the total number of tickets sold for the year. However, by using the SUM window aggregate instead, the results could be partitioned by month so that your single query returns 12 values: the sum of the tickets sold in each month of the year.
This article describes the supported window aggregates, explains how to incorporate them into your SPARQL queries in AnzoGraph, and provides some example queries.
AnzoGraph Built-In Window Aggregates
AnzoGraph provides the following built-in aggregate and ranking functions:
- AVG: Calculates the average of the input expression values.
- COUNT: Counts the number of results defined by the expression.
- MAX: Returns the maximum of the input expression values.
- MIN: Returns the minimum of the input expression values.
- NTILE: Divides the rows in the partition into the specified number of ranked groups and returns the group that each value belongs to.
- PERCENTILE: Like using NTILE(100), this function divides the rows in the partition into 100 ranked groups and returns the group that each value belongs to.
- PRODUCT: Multiplies the input expression values and returns the product.
- QUARTILE: Like using NTILE(4), this function divides the rows in the partition into 4 ranked groups and returns the group that each value belongs to.
- ROW_NUMBER: Assigns unique numbers to each row in the partition.
- SUM: Returns the sum of the input expression values.
Window Aggregate SPARQL Syntax
Use the following syntax when incorporating window aggregates in SPARQL queries. The list below the syntax describes each option:
(FUNCTION_NAME (expression1) OVER (
[ PARTITION BY expression2 ] |
[ ORDER BY expression3 [ frame_clause ] ]
)
AS ?variable )
FUNCTION_NAME (expression1)
The window aggregate name and expression or expressions that produce the values to operate on.
OVER()
The required OVER() keyword differentiates window aggregates from other aggregate functions.
PARTITION BY expression2
This optional clause forms the groups of rows, dividing the result set into partitions. If you do not include a PARTITION BY clause, the partition becomes the entire result set identified by the WHERE clause. When PARTITION BY is included, the system calculates the specified function for the rows in each partition.
ORDER BY expression3
This optional clause defines the order or sequence of rows within each partition.
frame_clause
The optional frame clause defines the window frame, which specifies the rows in a partition to combine with the current row. There are two types of window frames:
A fixed frame with two moving endpoints, where each row becomes the current row as the window frame slides forward in the partition. This type of frame is ideal for computing aggregations over moving time frames.
A resizing frame with one anchored endpoint, where one row is a fixed endpoint and the frame resizes up (preceding) or down (following). This type of frame is ideal for computing running totals.
The reference point for all window frames is the current row. The frame clause can be one of the following options:
[ ROWS ] frame_start
[ ROWS ] BETWEEN frame_start AND frame_end
Where frame_start is one of the following options:
UNBOUNDED PRECEDING
positive_value PRECEDING
CURRENT ROW
positive_value FOLLOWING
And frame_end is one of the following options:
positive_value PRECEDING
CURRENT ROW
positive_value FOLLOWING
UNBOUNDED FOLLOWING
When a frame clause is not specified, the system treats the window frame as unbounded: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
AnzoGraph Window Aggregate Query Examples
The sample queries below analyze sales activity for a fictional e-commerce website we’ll call “Tickit,” where people buy and sell tickets for sporting events, shows, and concerts. The data consists of person, venue, category, date, event, listing, and sales classes. Note: The Tickit sample data is included with an install of AnzoGraph and is also available on Amazon S3.
The following simple query does an aggregation of the number of events per month in the Tickit graph:
SELECT ?month (COUNT(?event) OVER (PARTITION BY ?month) AS ?events)
FROM <tickit>
WHERE {
{ SELECT ?event (MONTH(?date) AS ?month)
WHERE {
?sale <dateid> ?dateid .
?dateid <caldate> ?date .
?sale <eventid> ?eventid .
?eventid <eventname> ?event .
}
}
}
ORDER BY ?month
month | events
------+--------
1 | 26117
1 | 26117
...
2 | 30911
2 | 30911
...
3 | 33932
3 | 33932
...
4 | 32196
4 | 32196
...
8 | 34315
8 | 34315
...
12 | 11648
12 | 11648
...
This example query returns the percentage of each salesperson's total sales that came from the event "Gypsy":
SELECT ?event_name ?fname ?lname
((?dollars * 100.0/(SUM(?dollars) OVER(PARTITION BY ?event))) as ?percent_of_sales)
FROM <tickit>
WHERE {
?sale <eventid> ?event .
?event <eventname> ?event_name .
?sale <sellerid> ?salesperson .
?sale <pricepaid> ?dollars .
?salesperson <firstname> ?fname .
?salesperson <lastname> ?lname .
FILTER(?event_name = "Gypsy").
}
ORDER BY ?event_name desc(?percent_of_sales)
event_name | fname | lname | percent_of_sales
-----------+---------+-----------+------------------
Gypsy | Zoe | Sosa | 100.000000
Gypsy | Xaviera | Jacobson | 50.941458
Gypsy | Brianna | Mcfarland | 50.507614
Gypsy | Alexa | Baird | 45.792564
Gypsy | Roanna | Wood | 42.040816
Gypsy | Colette | Clay | 36.938776
Gypsy | Amela | Holman | 35.727660
Gypsy | Aubrey | Terrell | 32.245681
Gypsy | Bruno | Griffin | 31.813858
Gypsy | Damian | Berger | 31.245923
Gypsy | Zelenia | Woods | 31.161644
Gypsy | Imogene | Mclean | 31.000483
Gypsy | Helen | Haney | 30.295666
Gypsy | Silas | Wynn | 28.730306
Gypsy | Caryn | Hester | 28.714012
Gypsy | Camden | Horne | 27.253097
Gypsy | Adara | Burris | 26.809349
Gypsy | Raya | Fleming | 26.790955
...
This example queries the sample Tickit data set to return a running total of the number of tickets sold for the event "Mamma Mia!":
SELECT ?event ?month (SUM(?qty) OVER (PARTITION BY ?month ORDER BY ?event
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ?tickets)
FROM <tickit>
WHERE {
?s <qtysold> ?qty .
?s <eventid> ?eventid .
?eventid <eventname> ?event .
?s <dateid> ?date .
?date <month> ?month .
filter(?event="Mamma Mia!")
}
ORDER BY ?tickets
LIMIT 500
event | month | tickets
-----------+-------+---------
Mamma Mia! | MAY | 1
Mamma Mia! | JAN | 1
Mamma Mia! | SEP | 1
Mamma Mia! | APR | 2
Mamma Mia! | JAN | 2
Mamma Mia! | AUG | 2
Mamma Mia! | JUN | 2
Mamma Mia! | MAY | 2
Mamma Mia! | NOV | 2
Mamma Mia! | SEP | 3
Mamma Mia! | NOV | 3
Mamma Mia! | FEB | 4
Mamma Mia! | APR | 4
...
Mamma Mia! | JAN | 85
Mamma Mia! | JUL | 85
Mamma Mia! | MAR | 85
Mamma Mia! | JUN | 85
Mamma Mia! | OCT | 85
Mamma Mia! | NOV | 85
Mamma Mia! | FEB | 86
Mamma Mia! | APR | 86
…
Unlike other graph databases that might require you to write several queries to gain an overall view of the values in a data set, AnzoGraph’s window aggregates enable users to run a single query that returns detailed information about the data’s patterns, such as running totals or changes over time. In addition to window aggregates, AnzoGraph includes a rich library of other SQL- and Microsoft Excel-like functions.
Opinions expressed by DZone contributors are their own.
Comments