Alpine Data An Introduction to PFA
Alpine Data has added support for PFA, a standard for analytics exchange that is expressed in JSON or YAML. Read on for a quick introduction to PFA.
With Chorus 6.1, we have introduced support for PFA, the Portable Format for Analytics.
Before we get into what PFA is, let’s make some observations about the data science process. There are a few important questions we can ask about the process in general:
- 1.) What is our processing model?
- 2.) What are our performance characteristics?
- 3.) How do we refine our models?
- 4.) Who does the work?
Training models thrives on large batches of data and, while performance is very important (e.g. to scale), no one expects training to be instantaneous. Models are often refined rapidly, with a focus on experimentation. This work is done primarily by data scientists, lovingly cleaning their samples and elegantly attaining algorithmic excellence.
When we’re scoring inputs with these models, things change. Now, we need to deal with samples at a very fine granularity and expect a near-immediate response. This is similar to deploying an application in production, so experiments in this phase are targeted at concerns that are ancillary to the data science process. This is the job of dedicated engineers, carefully crafting the code and wrangling the rigmarole of getting it where it can do useful work.
So how do we move our analytics from training to scoring? There are a few methods in practice. One way is to have data scientists produce analytics on their platform of choice and have engineers figure out how to get them directly into production. This method tends to be rather difficult to support and scale. Another way is to task the engineers with translating the analytics output into a separate production environment (often using another language entirely), and then do even more work chasing the inevitable bugs from the translation process. Clearly, neither of these is ideal. What we need is a way to programmatically express the models in a common format so that data scientists can produce them and engineers can deploy them. One such format is PMML.
PMML is a specification expressed in XML that is designed to represent a collection of specific-purpose, configurable statistical models. It is an industry standard maintained by the Data Mining Group that allows trained models to be executed in a variety of ways, independent of the development language. Though this is all great, there is a downside: PMML has limited support for computation. The standard defines the set of supported models, so things the standard accommodates are fairly simple to accomplish. But filling needs outside the standard requires tremendous effort, or may be impossible.
To address this, the Data Mining Group has introduced PFA, an evolution of the notion of serializing analytics. Like PMML, it’s an industry-standard serialization format that abstracts the scoring of trained models from the training itself, so that computations can move between disparate environments.
One difference from PMML is that PFA is expressed in JSON or YAML as a primary syntax. A bigger difference is that the semantics of PFA are much more expressive, allowing authors to embed nearly any calculation. The language itself is statically typed and primarily functional with some imperative features. PFA hosts also have some responsibilities, among them enforcing a well-defined memory model that allows for correct concurrent execution of engines.
As a quick example, consider a PFA engine (in YAML) that adds 100 to its input and replies with the result of the calculation. The “input” field defines the type of information the engine accepts, in this case signed 32-bit whole numbers. Similarly, the “output” field defines the type of information the engine produces. The “action” field, which is a list of expressions, defines a single step: calling the built-in “+” function with the engine input and a literal 100, then returning that result as the output of the engine.
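The engine described above can be sketched as follows. This is a minimal reconstruction from the description, assuming PFA's `int` type for the signed 32-bit input and output, and quoting the `+` key so it is parsed as a string.

```yaml
# Type of data the engine accepts: signed 32-bit whole numbers
input: int
# Type of data the engine produces
output: int
# A list of expressions; the value of the last one becomes the engine's output
action:
  - {"+": [input, 100]}
```

Given an input of 1, an engine like this would be expected to reply with 101.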
More examples are available in the interactive PFA tutorials, provided by the Data Mining Group.
Chorus 6.1 adds PFA export for some of our most commonly used models, including k-means, linear and logistic regression, and PCA. You can embed these models into our PFA scoring engine for deployment onto PaaS environments like Cloud Foundry and AWS. With engines deployed and instrumented into business processes and applications, data science assets are able to drive significant business impact, not just esoteric slide decks.
Alpine plans to start rolling out more support for PFA in future releases, so stay tuned.
Published at DZone with permission of Jason Miller, DZone MVB. See the original article here.