Productionizing Data Science
Collaboration is necessary to be successful with Big Data, and the right server facilitates collaboration.
It was great speaking with Michael Berthold, founder and CEO of KNIME, during the company's fall summit. KNIME provides an open source analytics platform for data science. It allows developers, scientists, analysts, and business owners to design and implement data science workflows with added leverage from KNIME Integrations, KNIME Extensions, Community Extensions, and Partner Extensions.
According to Michael, multiple users working on the same projects need to share files, opinions, and current work, collaborating to build the best solution. A data science project rarely finishes with a trained model; the final step is to deploy the model within a production application. Scalability in real-world applications is another concern. Finally, all workflows, models, metanodes, and the data produced within the group need access rights, monitoring, versioning, and management.
KNIME Server extends the power of KNIME Analytics Platform by improving the productivity of data science teams with collaboration, scalability, deployment, and management features, giving them more freedom and flexibility.
The server enables collaboration where users can build a workflow and get feedback by sharing with team members who can comment, correct, tag, and rate the workflow via the workflow hub. Users can access interactive overviews of the workflows including the image, meta information, and the required plugins for the workflow. This is a great way to encourage discussion about data science problems among stakeholders. The workflow hub helps retain knowledge as team members come and go and promotes the reuse of successful workflows. The public workflow hub serves as a place for the KNIME community to share and rate workflows.
Database connections, logical groups of nodes, and complex Python, R, or other scripts can all be shared as metanode templates in a way that makes them available to new users. To use one, simply drag and drop the template into your workflow editor. This creates a read-only link to the original metanode template.
There are three deployment options with the KNIME server: 1) run via remote execution; 2) run via a web browser on the KNIME web portal; and, 3) run as a REST API.
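To make the REST option concrete, here is a minimal Python sketch of how a client might prepare a call that triggers a deployed workflow. The server address, the workflow path, the `:execution` suffix, the JSON input, and the use of basic authentication are all assumptions made for illustration and are not taken from the interview; the REST documentation of your own server is the authority on the real URL layout. The sketch only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_execution_request(base_url, workflow_path, inputs, user, password):
    """Build (but do not send) a POST request that would trigger a workflow.

    The URL layout and the ":execution" suffix are assumptions for this
    sketch, not a documented contract.
    """
    url = f"{base_url.rstrip('/')}/{workflow_path.strip('/')}:execution"
    body = json.dumps(inputs).encode("utf-8")
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical server, workflow path, and credentials.
request = build_execution_request(
    "https://knime.example.com/rest",
    "/teams/churn/score_customers",
    {"customer_id": 42},
    "analyst",
    "secret",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is hypothetical.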
Workflows can be set to run on a schedule, as well as remotely. You can monitor your workflow and make changes to the configuration. This is useful when your own machine lacks the high-performance hardware required to process extremely large datasets, or the GPUs needed for deep learning applications.
Security and administration controls are part of the server. After deployment, you can set permissions and control access to workflows and data files to comply with data protection policies, internal business rules, and team processes. It's also possible to integrate authentication with corporate LDAP/Active Directory setups and manage permissions for groups and individual users.
For large deployments, keeping control of common and approved preferences, such as allowed database connections, proxy settings, and Python or R settings, can become difficult. The management feature of the server makes it easy to manage all client preferences.
When deploying or updating a workflow, you can keep track of changes by taking snapshots so you can roll back to a previous version if necessary. You can also identify subtle differences between workflows with the workflow difference functionality.
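The snapshot, rollback, and difference ideas can be illustrated with a small toy model in Python. This is not how KNIME Server stores or compares workflows; the `WorkflowHistory` class, its methods, and the dictionary-based workflow configuration are all hypothetical names invented for this sketch. It simply keeps serialized copies of a configuration and uses the standard library's `difflib` to show what changed between two of them.

```python
import difflib
import json


class WorkflowHistory:
    """Toy snapshot store for a workflow configuration (illustrative only)."""

    def __init__(self):
        self._snapshots = []  # list of (comment, serialized configuration)

    def snapshot(self, config, comment):
        """Store a copy of the configuration and return its version number."""
        serialized = json.dumps(config, indent=2, sort_keys=True)
        self._snapshots.append((comment, serialized))
        return len(self._snapshots) - 1

    def rollback(self, version):
        """Return the configuration exactly as it was at the given version."""
        return json.loads(self._snapshots[version][1])

    def diff(self, old, new):
        """List the added and removed lines between two versions."""
        old_lines = self._snapshots[old][1].splitlines()
        new_lines = self._snapshots[new][1].splitlines()
        return [
            line
            for line in difflib.unified_diff(old_lines, new_lines, lineterm="")
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))
        ]


history = WorkflowHistory()
v0 = history.snapshot({"learner": "random_forest", "trees": 100}, "initial deployment")
v1 = history.snapshot({"learner": "random_forest", "trees": 150}, "more trees")

changes = history.diff(v0, v1)   # only the "trees" setting differs
restored = history.rollback(v0)  # configuration as first deployed
print(changes)
```

Serializing with sorted keys keeps the comparison stable, so a difference appears only where a setting actually changed.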
Finally, the server keeps a record of all jobs and enables you to scale by moving workflows to a distributed platform, the public cloud, or distributed executors of the KNIME server using RabbitMQ.