Associative Data Modeling Demystified: Part V
Associative Data Modeling Demystified: Part V
Qlik’s competitive advantage over other BI tools is that it manages associations in memory at the engine level and not at the application level.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
In the discussion session of the previous post of our series, we mentioned two basic reasons that make RDF data modeling (and consequently SPARQL query mechanisms) disadvantageous with an interactive and associative data exploration and data integration of QlikView and Qliksense. These are the naming of predicate terms that connect resources and literals and the unidirectional architectural design of the edges.
We have already demonstrated this issue with the associative model schema design of a Movies domain in Freebase (Figure 10), the modeling of our Supplier-Parts-Catalogue toy example in Sentences, Figure 15, Figure 16, and the OrientDB Property Graph (Figure 4). In this post, we continue our journey with Qlik’s associative data model.
Qlik’s In-Memory Associative Architecture
Qlik’s competitive advantage over other BI tools is that it manages associations in memory at the engine level and not at the application level. This management of data is deep down to an atomic level of data (Granularity), i.e., every data point in every field of a table is associated with every other data point anywhere in the entire schema. We will follow Qlik’x QIX engine stepwise approach with the goal of performing an interactive data exploration of our Suppliers-Parts-Catalogue data set.
Loading the Data
The first step is to bring data into memory. A multi-table read-only, in-memory, compressed, binary, columnar data model is created. Data from each source data table is converted into two types of in-memory data structures.
- A set of columns that contain binary values, i.e., references, for each distinct original value.
- A compressed binary data table by replacing each row and column cell value with a binary reference.
For example, these statements about QlikView script syntax load data from the Parts and Suppliers tables that are stored in Excel files and reconstruct eleven unique columns and two read-only binary representation of these tables in memory.
LOAD prtID, prtName, prtColor, prtWeight, prtUnit FROM [F:\tmp\SupplierPartCatalogue\SuppliersPartsCatalogue.xlsx] (ooxml, embedded labels, table is Part); LOAD supID, supName, supAddress, supCity, supCountry, supStatus FROM [F:\tmp\SupplierPartCatalogue\SuppliersPartsCatalogue.xlsx] (ooxml, embedded labels, table is Supplier);
If two fields have the same name in two different tables (i.e., a relationship), then they have the same columnar binary representation. In order to resolve ambiguities and to associate over fields that have the same name, unique column naming is required. This is the case for our third associative, i.e., bridge, junction table. Field names
catPIDhave already been loaded and represented with the columns
prtID. The next
LOAD statement is using the alias operator (
as) to handle unique column naming.
LOAD catSID as supID, catPID as prtID, catPrice, catTotal, catDate, catChk FROM [F:\tmp\SupplierPartCatalogue\SuppliersPartsCatalogue.xlsx] (ooxml, embedded labels, table is Catalogue);
Figure 1: QlikView internal/source table view.
Once the script is executed, we can review the data table structure with the Table Viewer. The three tables are associated with two connectors that bidirectionally link them through their common fields. This layout is quite similar to the entity-relationship diagram of Microsoft Access database schema. A QlikView association resembles a SQL natural outer join. However, an outer join in SQL is unidirectional. An association always results in a full (bidirectional) linkage. In practice, we have two completely different approaches. In a typical SQL join query, we repeatedly search the index because of the join condition.
QlikView associative engine (QIX) knows how every data point is associated, therefore it can effectively determine (infers) and flag all distinct column values and all rows in each data table upon user’s selection. — Ferguson
The Blind Spot
Depending only on memory, QlikView can load many whole tables without joins instead of a limited view from the main dataset. Other query-based BI tools usually aggregate an extracted subset of data and return it in the form of a result set. This result set is completely divorced from the original data set and this very act of extraction breaks associations. For example, find out how a piece of data contained in a query relates to another piece of data outside the query.
You may think also of an internal combustion engine analogy. With the query-based paradigm, we would look at individual parts of the engine in isolation, (see the header image of our post). Using QlikView associative technology, we have access to a digital model of a complete working engine and we can tweak any part to see how that affects other parts and the engine on the whole.
Explore and Filter Data Via Selections
In this section, we will examine QlikView associative aspect from the user’s experience point of view.
Central to QlikView is the concept of a user-defined selection state. As users click around in a QlikView document, they indicate which subsets of data they are interested in analyzing and which subsets should be ignored. — QlikView Whitepaper
To demonstrate the visual effect of associations, i.e., the relationship between a value in one field and a value in another, we arranged a number of sheet objects in our Main worksheet. We chose table boxes to present rows of data from our Tables and list boxes to display a list of all possible values of a specific field. We have also added a current selections box to list user’s selected field values and a search object for searching for information anywhere in the document.
Figure 2: QlikView Sheet Object in an unselected state with Table Boxes, Current Selections, and a Search Box on the left side and a Container with List Boxes for Fields representation on the right side.
Now you can see visually what is associated and is not associated with any particular selection. For instance, Figure 3 displays the state of our Worksheet when the user selected the Part item with ID value equal to
List Box of the field (prtID) with the value
998 appears in green and it has also been added to the
Current Selections list.
List Boxes, unrelated values in all other fields appear in gray and those that are related appear in white. At the same time, when the user interacts with this particular
List Box all the rows of
Table Boxes are instantly filtered to reflect this new context.
Figure 3: Table Boxes on the left side are filtered and values of the List Boxes on the right appear in white or gray according to the current selection (green).
So far, we have seen how we can get the resulting data after user’s selection in two possible formats, a list of values (
List Box) for each field of the data Table and a grid (
Table Box) with tuples (records). You may also notice in our visual setup that we have a
Derived Relation red grid with selected fields from all three Tables of our data set. This grid is automatically constructed with those tuples that appear in the other three grids (
Catalogue Table Box,
Part Table Box and
Supplier Table Box).
We should make it clear that the update of this
Derived Relation under the hood is not performing any kind of Join (SQL) operation. The effect of tuples filtering and shading of list values is demonstrated with animated GIF images in three states. There is the Unselected State, then there is the Part with ID value
998 selected state, and from those
Catalog entries for this specific
Part, (4 entries, one for each
Supplier), we select the
Supplier with the minimum
Catalog price and that is our third state.
Figure 4: Animated grids of tuples.
Figure 5: Animated lists of values from table fields.
R3DM Hypergraph Representation
With the Associative Data Modeling, there can be a third way to present data. This is the one that engages the user with the visualization of network graphs. In the previous post of our series, we have drawn an R3DM Hypergraph and we described associations according to R3DM terminology. We have used exactly the same data of this case study example, i.e., find the Supplier with the minimum Catalog price for Part with
998, to draw Figures 1 through 5.
This time, we have redrawn the hypergraph of Figure 14 with the help of OrientDB Graph Editor, Figure 6. Every string label and a numerical value of Figure 14 has been replaced with a unique OrientDB record identifier (RID - clusterID: clusterPosition). In effect, this is how we implemented R3DM/S3DM
Sign layer. Each data item becomes an Atomic Information Resource (AIR) unit with a symbolic representation of a 2D vector (Entity/Attribute Type: Instance).
AIR units can represent in a uniform way anything, i.e. Entities, Attributes, Values, Types, Databases, etc. Their vector form can be indexed, linked, retrieved, stored efficiently, and we can use AIR units to build associations and assimilate composite information structures such as records (tuples).
To visualize R3DM associations using AIR units we have color coded the graph and we added a legend at the left side of Figure 6. Entities are distinguished from attributes by the size of the disk and grouping of entities and attributes is depicted with different shades of a color (green for Suppliers, brown for Parts, purple for Catalog items).
We can easily see four
Catalog items (#70:7, #70:11, #70:12, #70:16) associations with the four
Suppliers (#68:0, #68:1, #68:2, #68:3) and the
Part (#69:7) with ID value
998 (#77:7) that is drawn in the center of the graph with the black disk around it. USA suppliers (#68:0, #68:1) share the value of their common
supCountry attribute (#75:0). One of them (#68:1) has a Catalog entry (#70:11) with the minimum catalog price,
catPrice, (#82:9). We can also see the formation of tuples for Supplier (4), Catalog (4) and Part (1).
Figure 6: R3DM Color Coded Associations with AIR units.
Regarding the immense advantage of using AIR units to reconstruct hierarchical, table, or graph structures, we can also refer to the following QlikTech patent. In the Summary of the Invention section we read:
Each different data element value of each data element type is assigned a binary code and the data records are stored in binary-coded form. On account of the binary coding, very rapid searches can be conducted in the tables.
It is rather unfortunate that the basis of QlikView’s evolutionary associative technology has been confined to the market of proprietary software products. In HEALIS, we have made the difference by opening up these database design key principles for discussion with experts of the field and we have made them part of our R3DM/S3DM conceptual framework.
We have gone a step further to implement our framework on top of OrientDB and Intersystems Cache DBMS and instead of writing SQL queries, we have abstracted the programming of a set of functional operations that match the selection and filtering of data. We plan to continue with the last part of this post series in order to fully demonstrate how we build our system and how we test it with the Suppliers-Parts-Catalogue data set.
Published at DZone with permission of Athanassios I. Hatzis, PhD , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.