What Is Apache VXQuery and How Does It Work?

The wide use of XML for document management and data exchange has created a need to query large repositories of data. Learn how Apache VXQuery can help!

XQuery is to XML what SQL is to relational databases. It is the language for querying XML data, and it is built on XPath expressions. XQuery is supported by all major databases for finding and extracting elements and attributes from XML documents, and it is a W3C Recommendation.
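To make the idea of path-based querying concrete, here is a minimal sketch in Python. It does not use VXQuery or a real XQuery engine: the standard library's ElementTree module only supports a small XPath subset, so the numeric filter is written in Python. The sample document and the function name titles_over are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML = """
<catalog>
  <book year="2005"><title>XQuery Basics</title><price>30</price></book>
  <book year="2012"><title>Parallel Queries</title><price>45</price></book>
  <book year="2016"><title>XML at Scale</title><price>55</price></book>
</catalog>
"""

root = ET.fromstring(XML)


def titles_over(root, min_price):
    """Return the titles of books whose price exceeds min_price.

    A roughly equivalent XQuery would be:
        for $b in /catalog/book where $b/price > 40 return $b/title/text()
    ElementTree's XPath subset has no numeric comparison, so the
    "where" clause is expressed as a Python condition instead.
    """
    return [
        book.findtext("title")
        for book in root.findall("./book")
        if float(book.findtext("price")) > min_price
    ]


# A path expression with an attribute predicate, which ElementTree does support.
books_2012 = root.findall("./book[@year='2012']")

print(titles_over(root, 40))
print([b.findtext("title") for b in books_2012])
```

A full XQuery processor evaluates the whole expression itself, including the comparison, and that is the part VXQuery distributes across a cluster.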

Apache VXQuery is a standards-compliant XML query processor implemented in Java. It evaluates queries on a cluster of distributed machines.

There are many large collections of relatively small XML documents, yet there are no scalable and efficient XQuery processors available today that can process such large datasets in parallel and make the contained information accessible. VXQuery is designed to fill this gap: it uses XQuery to evaluate queries over large amounts of XML data, and over large collections of relatively small XML documents, on distributed systems.

What Is VXQuery?

The wide use of XML, mainly for document management and data exchange, has created a need to query large repositories of XML data. Apache VXQuery can efficiently query such large data collections and take advantage of parallelism. XQuery optimization rules are applied to the query plan to improve path expression efficiency and to enable query parallelism.

The system builds upon two other open-source frameworks: Hyracks, a parallel execution engine, and Algebricks, a language-agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specifics (data model, data model-dependent functions and optimizations, and a parser).

The queries are executed on a Hyracks cluster (or on a local single-node cluster if no actual cluster is available or configured). Although various native open-source XQuery processors are available (e.g., Saxon and Galax), they have been optimized for single-node processing and do not scale to many nodes in a distributed system the way VXQuery does. More detail on the parallel XQuery processor implemented in this system can be found in this research paper.

As stated in the research article by E. Preston Carman et al.:

“An experimental evaluation using a real 500GB dataset with various selection, aggregation, and join XML queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up. Experiments show that it is about 3.5x faster than Saxon (an open-source and commercial XQuery processor) on a 4-core, single node implementation, and around 2.5x faster than Apache MRQL (a MapReduce-based parallel query processor) on an eight (4-core) node cluster.”

Queries run on the distributed system in much the same way as jobs run on a Hadoop cluster: the Cluster Controller acts as the logical entry point for user requests, and the Node Controllers are linked to the Cluster Controller.

Apache VXQuery Stack

Figure 1: VXQuery Stack

Apache VXQuery’s software stack can be represented in three layers, as shown in Figure 1. The top layer, Apache VXQuery, forms an Algebricks logical plan based on parsing a supplied XQuery. The initial Algebricks logical plan is then optimized and translated into an Algebricks physical plan that maps directly to a Hyracks job. The Hyracks platform executes the job and returns the results.

Algebricks

Algebricks is designed to compile a query into an “algebraic program.” This layer is data-model neutral, and it supports other high-level data languages such as Hivesterix and AQL (AsterixDB, which is queried using AQL, is considered the sister project of VXQuery).

This language-agnostic toolbox complements the lower-level extensible Hyracks platform. Implementations of data-intensive languages can extend its model-agnostic algebraic layer to create parallel query processors on top of the Hyracks platform. A language developer is free to define the language and data model when using the Hyracks platform and the Algebricks toolkit. Algebricks features a rule-based optimizer and data model-neutral operators that each allow for language-specific customization. A system that uses Algebricks for its query processing provides its own parser and translator to translate a query to a query plan that uses Algebricks’ logical operators as an intermediate representation.

The Algebricks rule-based optimizer then transforms the query plan over three stages. The first is a logical-to-logical plan optimizer that creates alternate logical plans. Once the logical plan is finalized, the logical-to-physical plan optimizer converts the logical operators into a physical plan. Then, the physical optimizer considers the operator characteristics, partition properties, and data locality to choose the optimal physical implementation for the plan. Algebricks provides generic language-independent rewrite rules for each stage and allows for the addition of other rules. Finally, a Hyracks job is generated and submitted for execution on a Hyracks cluster.
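The staged rewriting described above can be sketched in a few lines of Python. This is a toy model, not Algebricks code: the Op class, the two rewrite rules, and the physical operator names are all hypothetical, and a plan is simplified to a linear list of operators rather than a tree.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Op:
    kind: str       # hypothetical logical operator kinds: SCAN, SELECT, AGGREGATE
    arg: str = ""


Plan = List[Op]                          # linear pipeline; first operator runs first
Rule = Callable[[Plan], Optional[Plan]]  # returns a rewritten plan, or None if it does not apply


def drop_true_selects(plan: Plan) -> Optional[Plan]:
    """Logical rule: a SELECT whose condition is always true can be removed."""
    for i, op in enumerate(plan):
        if op.kind == "SELECT" and op.arg == "true()":
            return plan[:i] + plan[i + 1:]
    return None


def merge_selects(plan: Plan) -> Optional[Plan]:
    """Logical rule: two adjacent SELECTs become one SELECT with a conjunction."""
    for i in range(len(plan) - 1):
        a, b = plan[i], plan[i + 1]
        if a.kind == "SELECT" and b.kind == "SELECT":
            merged = Op("SELECT", f"({a.arg}) and ({b.arg})")
            return plan[:i] + [merged] + plan[i + 2:]
    return None


def rewrite(plan: Plan, rules: List[Rule]) -> Plan:
    """Apply the rules repeatedly until none of them changes the plan."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_plan = rule(plan)
            if new_plan is not None:
                plan, changed = new_plan, True
    return plan


# Hypothetical logical-to-physical mapping; a real optimizer would also weigh
# partitioning and data locality when choosing an implementation.
PHYSICAL = {"SCAN": "PartitionedFileScan", "SELECT": "StreamSelect", "AGGREGATE": "HashAggregate"}


def to_physical(plan: Plan) -> List[str]:
    return [f"{PHYSICAL[op.kind]}[{op.arg}]" for op in plan]


logical = [
    Op("SCAN", "collection('/sensors')"),
    Op("SELECT", "true()"),
    Op("SELECT", "$r/type = 'TMAX'"),
    Op("SELECT", "$r/value > 300"),
    Op("AGGREGATE", "count"),
]
optimized = rewrite(logical, [drop_true_selects, merge_selects])
physical = to_physical(optimized)
for step in physical:
    print(step)
```

Both toy rules shrink the plan, so the fixed-point loop is guaranteed to terminate. A real rule set needs the same care, since rules that undo each other would loop forever.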

Hyracks

Hyracks acts as the runtime layer: it accepts and manages the data-parallel computations requested by the layers above. Algebricks submits its compiled queries to Hyracks and asks it to start a Hyracks job to execute each query. Every job is assigned a Job ID, which can be used to retrieve its results.

This generic platform offers a framework to run dataflows in parallel on a shared-nothing cluster. The system was designed to be independent of any particular data model. Hyracks processes data in partitions of contiguous bytes, moving data in fixed-sized frames that contain physical records, and it defines interfaces that allow users of the platform to specify the data-type details for comparing, hashing, serializing, and de-serializing data. A Hyracks job is defined by a dataflow DAG with operators (nodes) and connectors (edges). During execution, the operators allow the computation to consume an input partition and produce an output partition while the connectors redistribute data among partitions.
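As a rough illustration of how a connector redistributes data between partitions, the following Python sketch hash-partitions records by key and then runs an aggregation operator on each output partition independently. It is a simplified model under stated assumptions, not the Hyracks API: the function names are hypothetical, records are plain tuples rather than bytes in frames, and CRC32 stands in for whatever hash function the platform's client would supply.

```python
import zlib


def stable_hash(value: str) -> int:
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(value.encode("utf-8"))


def hash_connector(in_partitions, key, n_out):
    """Connector (edge): send every record to the output partition chosen by its key hash."""
    out = [[] for _ in range(n_out)]
    for partition in in_partitions:
        for record in partition:
            out[stable_hash(key(record)) % n_out].append(record)
    return out


def max_per_key_operator(partition):
    """Operator (node): consume one input partition, produce one output partition."""
    result = {}
    for station, temp in partition:
        result[station] = max(temp, result.get(station, temp))
    return result


# Two input partitions of (station, temperature) records, invented for this example.
inputs = [
    [("A", 10), ("B", 7), ("C", 3)],
    [("A", 12), ("C", 9), ("B", 1)],
]

shuffled = hash_connector(inputs, key=lambda r: r[0], n_out=2)
per_partition = [max_per_key_operator(p) for p in shuffled]

# After the shuffle each station lives in exactly one partition,
# so the per-partition results can simply be combined.
merged = {k: v for part in per_partition for k, v in part.items()}
print(merged)
```

Because all records with the same key land in the same partition, each operator instance can compute its answer without talking to the others, which is the point of a shared-nothing design.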

The dataflow among Hyracks operators is push-based — in other words, each source (producer) operator pushes the output frames to a target (consumer) operator. The extensible runtime platform provides a number of operators and connectors for use in forming Hyracks jobs. While each operator’s operation is defined by Hyracks, the operator relies on data-model specific functionality provided by the client of the platform.
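A push-based pipeline can be modeled with a handful of small classes. The sketch below is illustrative only: the class names and the push/close methods are assumptions made for this example and are not the Hyracks operator interfaces. Frames are modeled as short Python lists.

```python
class Operator:
    """Base class: an operator receives frames via push() and forwards them downstream."""

    def __init__(self, downstream=None):
        self.downstream = downstream

    def push(self, frame):
        raise NotImplementedError

    def close(self):
        # Propagate end-of-stream to the consumer, if there is one.
        if self.downstream is not None:
            self.downstream.close()


class Select(Operator):
    def __init__(self, predicate, downstream):
        super().__init__(downstream)
        self.predicate = predicate

    def push(self, frame):
        kept = [record for record in frame if self.predicate(record)]
        if kept:
            self.downstream.push(kept)


class Count(Operator):
    def __init__(self, downstream):
        super().__init__(downstream)
        self.total = 0

    def push(self, frame):
        self.total += len(frame)

    def close(self):
        # An aggregate can only emit its result once the input is exhausted.
        self.downstream.push([self.total])
        super().close()


class Sink(Operator):
    def __init__(self):
        super().__init__()
        self.frames = []

    def push(self, frame):
        self.frames.append(frame)


def source(records, frame_size, downstream):
    """Producer: cut the input into fixed-size frames and push each one downstream."""
    for i in range(0, len(records), frame_size):
        downstream.push(records[i:i + frame_size])
    downstream.close()


sink = Sink()
pipeline = Select(lambda r: r > 300, Count(sink))
source([120, 310, 305, 299, 400], frame_size=2, downstream=pipeline)
print(sink.frames)
```

The producer drives the computation: nothing downstream asks for data, it simply reacts to each frame as it arrives.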

VXQuery Cluster

The Hyracks Cluster Controller (CC) acts as the controlling unit for all the Node Controllers (NCs). Each NC is the control unit for a node, and the nodes hold the XML documents.

Figure 2: Structure of a VXQuery Cluster

At runtime, the Apache VXQuery cluster processes a query using the Apache VXQuery Client Library Interface (CLI), a Hyracks CC, and some Hyracks Data Nodes (as shown in Figure 3). The process starts with a user submitting an XQuery statement through the Apache VXQuery interface (REST API) for parallel execution. The REST API optimizes the query and submits the generated Hyracks job to the cluster controller, which manages and distributes tasks to each of the data nodes for evaluation. Each data node contains XML files, an XML parser, and the XQuery runtime expressions used to evaluate the node’s tasks. Finally, the cluster controller collects the data nodes’ results and sends the result back to the Apache VXQuery REST API, which returns the results to the user or to the command-line interface tool.


Figure 3: VXQuery Cluster Configuration
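The scatter-and-gather flow between the cluster controller and the data nodes can be imitated on a single machine. In this hedged Python sketch, threads stand in for data nodes and in-memory strings stand in for the XML files stored on each node. The node names, the sample documents, and the functions node_task and cluster_controller are all invented for illustration and have no counterpart in the VXQuery code base.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data placement: each "data node" holds its own small XML documents.
NODE_DATA = {
    "node1": ["<r><v>5</v><v>20</v></r>", "<r><v>7</v></r>"],
    "node2": ["<r><v>30</v></r>"],
}


def node_task(documents, threshold):
    """Runs on one data node: parse the local documents and count matching values."""
    count = 0
    for doc in documents:
        root = ET.fromstring(doc)
        count += sum(1 for v in root.iter("v") if int(v.text) > threshold)
    return count


def cluster_controller(node_data, threshold):
    """Distribute one task per node, then collect and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        futures = {
            name: pool.submit(node_task, docs, threshold)
            for name, docs in node_data.items()
        }
        partials = {name: future.result() for name, future in futures.items()}
    return partials, sum(partials.values())


partials, total = cluster_controller(NODE_DATA, threshold=6)
print(partials, total)
```

Each node only ever reads its own documents, so adding nodes adds parsing capacity without any shared state, which is the property that makes the real system scale out.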

Deeper Into Implementation

At the code level, earlier versions of the VXQuery project consisted only of the vxquery-cli, vxquery-server, vxquery-core, vxquery-benchmark, and vxquery-xtest modules, and the CLI module was the only interface exposed for executing XQueries. The current version adds a REST API (in the vxquery-rest module) that compiles and executes queries submitted by users through HTTP requests. The vxquery-cli and vxquery-xtest modules have been reworked to use the REST API, so the VXQuery core is now connected directly only to the REST API, and the other modules submit their queries through it. The component diagrams below show the component organization before and after the introduction of the REST API.

The REST API is defined in a Swagger configuration, and the REST API server is implemented using the hyracks-http package.

As per the definition of the REST API, there are two endpoints: /query and /query/result/{resultId}. They are handled by two separate servlets implemented on top of hyracks-http. The VXQuery REST server allows users to submit queries and receive results either synchronously or asynchronously through the exposed REST API. The CLI runs in synchronous mode. Users have the flexibility of calling a remote VXQuery REST API or a locally started REST server through the CLI. Users can also pass several parameters with the query statement to get additional information, such as an execution timing summary.
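A client for these two endpoints mostly has to build the right URLs. The following Python sketch shows one way to do that with the standard library. Only the two paths come from the API definition described above. The host and port, and the query parameter name statement, are assumptions made for this example and should be checked against the Swagger definition before use. The sketch deliberately stops short of sending a request.

```python
from urllib.parse import quote, urlencode

# Hypothetical address of a locally started REST server.
BASE_URL = "http://localhost:8080"


def query_url(statement, **params):
    """Build the URL for submitting a query to the /query endpoint.

    ASSUMPTION: the parameter name "statement" is a placeholder, and any
    extra keyword arguments are passed through as additional query
    parameters without being validated against the real API.
    """
    return f"{BASE_URL}/query?{urlencode({'statement': statement, **params})}"


def result_url(result_id):
    """Build the URL for fetching a result from /query/result/{resultId}."""
    return f"{BASE_URL}/query/result/{quote(str(result_id), safe='')}"


print(query_url("1+1"))
print(result_url(42))
```

In an asynchronous exchange, a client would submit the query first and later fetch the result from the second URL using the result ID returned by the server. The shape of that response is not described here, so it is left out of the sketch.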

The VXQuery XTest module runs in local mode: it creates a local Hyracks cluster and uses the locally started REST API to execute the queries in its tests. The XTest framework has 222 test cases written to verify the correct functionality and efficiency of VXQuery.

How VXQuery Starts

The CCDriver class contains the main method. When CCDriver starts, it calls ClusterControllerService to start the VXQuery server and the Cluster Controller, and then waits until the Node Controllers are connected and notified. After this point, VXQueryApplication starts as the cluster controller application, which is responsible for starting the REST server. Then VXQueryService, the main class of the REST API, is started, and the API is ready to be used by clients. Usually, a system administrator runs a Python file (named cluster_cli.py) to run the CCDriver class.

This is, in outline, how the code-level implementation works. The source code is hosted on GitHub.

Augmentations

VXQuery aims to serve not only technical users but also non-technical users, who need easy access to its functionality. A simple web interface on top of the REST API will be implemented in the near future, allowing non-technical users to submit queries to the VXQuery REST server.

More enhancements are expected in the future. Stay with Apache VXQuery for more information!

Published at DZone with permission of Erandi Ganepola. See the original article here.

Opinions expressed by DZone contributors are their own.