This week I had the opportunity to participate in a panel discussion at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. The panel was part of the “Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data” organized by the DMG (Data Mining Group). The panel and the associated presentations covered in detail the challenges of operationalizing models. Too often, once the data science team has created an analytical model, operationalizing it is lengthy and labor-intensive. In many instances, there is no turnkey strategy for deploying these models to a real-time scoring solution running elsewhere in the company. Indeed, in many cases, an entirely separate team must manually develop the production version of a model in C++ or Java, with a Word document written by the data scientists serving as the model specification. As can be imagined, this process is error-prone and requires extensive testing, and it limits an enterprise’s agility and its ability to rapidly deploy and update models.
This need to recode models between training and deployment often stems from an incompatibility between the model formats produced by the data scientists’ toolchains and the formats supported by operationalized scoring engines. Ideally, a model generated by a data scientist would be directly consumable by the operationalized frameworks, with a guarantee that both components interpret the model identically.
Luckily, open standards for describing predictive models do exist, and they remove the need for costly re-implementation when models are deployed into production. A model can be described in a standard interchange format and exchanged efficiently between vendors’ tools, so that a model developed by data scientists on Hadoop can be deployed in production against a real-time stream with confidence in the fidelity of the implementation.
To date, most of the focus around predictive analytics standards has been on the PMML (Predictive Model Markup Language) model interchange format. This format, developed by the DMG, is well established in the analytics space. Most analytic tools support the export of PMML models, and a number of tools support the deployment of PMML models into production.
The DMG is now in the process of releasing a new standard to complement PMML. The standard, named PFA (Portable Format for Analytics), incorporates improvements informed by many years of observing how PMML is used. One notable improvement is comprehensive support for encapsulating data pre- and post-processing. In modern scoring flows, significant preprocessing is typically applied to the input data before the model is applied (e.g., cleansing, feature extraction, and feature construction). Historically, this preprocessing may have required a companion code fragment or script alongside the PMML model, complicating operationalization. With PFA, the entire scoring flow can be represented in a standardized manner in a single document, making the operationalization of models even more turnkey.
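To make the single-document idea concrete, here is a minimal sketch in Python. It assumes the basic layout of a PFA document (a JSON object with "input", "output", and "action" fields, where the string "input" refers to the incoming datum) and uses only the core arithmetic functions. The centering constant and the model coefficients are made up for illustration, and the small evaluate and score helpers are hypothetical stand-ins that understand three operators; they are not a PFA engine.

```python
import json
import operator

# Hypothetical scoring flow in a PFA-style document: the preprocessing step
# (centre the input by subtracting 10.0) and the model step (0.5 * x + 2.0)
# live together in one expression tree. All numbers are illustrative.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [
        {"+": [{"*": [{"-": ["input", 10.0]}, 0.5]}, 2.0]}
    ],
}

# Toy evaluator for illustration only: it handles three binary operators.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, symbols):
    """Evaluate a nested {"op": [left, right]} expression."""
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):
        # A bare string is treated as a symbol reference, e.g. "input".
        return symbols[expr]
    (name, args), = expr.items()
    left, right = (evaluate(arg, symbols) for arg in args)
    return OPS[name](left, right)


def score(document, value):
    """Run each action expression in order; the last result is the score."""
    result = None
    for expr in document["action"]:
        result = evaluate(expr, {"input": value})
    return result


# The same text can be handed from the training side to the scoring side.
serialized = json.dumps(pfa_document)
print(score(json.loads(serialized), 12.0))
```

Because the preprocessing and the model share one document, there is no companion script to keep in sync: whatever engine reads the document applies both steps.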
Alpine plans to begin rolling out support for PFA in our next release, so stay tuned!