Alibaba Cloud Arena: An Open-Source Tool for Deep Learning

Let's take a look at Arena, a new open-source tool for Deep Learning, and explore the story behind it.

by Leona Zhang · Sep. 08, 18 · News

Alibaba Cloud introduced the Deep Learning tool Arena to the open-source community in July 2018. Now, data scientists can run Deep Learning on the cloud without having to learn to manipulate low-level IT resources. They can start a Deep Learning task within a minute, and create a heterogeneous computing cluster within fifteen minutes.

Why Build a Tool Like Arena?

Today, KubeFlow is the most popular Deep Learning solution within the Kubernetes community, so isn't Arena just reinventing the wheel? KubeFlow is a composable, portable, and scalable machine learning technology stack built on Kubernetes. It is an end-to-end solution that covers everything from JupyterHub development and TFJob model training to TF Serving and Seldon prediction. However, KubeFlow requires a mastery of Kubernetes. For example, writing a YAML file to deploy a TFJob is quite challenging for the primary users of a machine learning platform: data scientists.

Such tasks diverge from the expectations of data scientists, who care only about three things:

  1. Where the data comes from.
  2. How to run the machine learning code.
  3. How to examine training results (models and logs).

Data scientists are used to, and enjoy, writing a few simple scripts and running machine learning code on their desktops. However, local disk space limits how much data they can process, and without distributed training their computing power is limited as well.

This is why we developed Arena. This command line tool shields you from the complexities of low-level resources, environment administration, task scheduling, and GPU scheduling and assignment. Arena helps data scientists submit training tasks and check training progress in the straightforward way with which they are already familiar. When data scientists call Arena, they can designate the data source, code to download, and whether to use TensorBoard to check training results.
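
To make this concrete, here is a minimal Python sketch of how those three choices (data source, code to download, and TensorBoard) could be assembled into a single Arena command line. The `TrainingJob` class and the `build_submit_command` helper are hypothetical names used only for illustration. The `arena submit tf` subcommand and the `--name`, `--gpus`, `--image`, `--data`, `--syncMode`, `--syncSource`, and `--tensorboard` flags are assumptions based on the project's README and should be checked against `arena submit tf --help` for your version. The sketch only builds the argument list; it does not run Arena.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingJob:
    """Hypothetical description of a single training task."""
    name: str
    image: str                          # container image with the ML framework
    command: str                        # what to run inside the container
    gpus: int = 1
    git_source: Optional[str] = None    # repository holding the training code
    data_volume: Optional[str] = None   # "<volume claim>:<mount path>" for data
    tensorboard: bool = False


def build_submit_command(job: TrainingJob) -> List[str]:
    """Build an argv list for `arena submit tf` (flag names are assumptions)."""
    argv = [
        "arena", "submit", "tf",
        f"--name={job.name}",
        f"--gpus={job.gpus}",
        f"--image={job.image}",
    ]
    if job.data_volume:
        # Where the data comes from.
        argv.append(f"--data={job.data_volume}")
    if job.git_source:
        # Which code to download before training starts.
        argv += ["--syncMode=git", f"--syncSource={job.git_source}"]
    if job.tensorboard:
        # Whether to expose TensorBoard for checking results.
        argv.append("--tensorboard")
    argv.append(job.command)
    return argv


if __name__ == "__main__":
    job = TrainingJob(
        name="mnist-demo",
        image="tensorflow/tensorflow:1.5.0-devel-gpu",
        command="python code/train.py",
        git_source="https://example.com/your/training-code.git",  # placeholder URL
        tensorboard=True,
    )
    print(" ".join(build_submit_command(job)))
```

Keeping the command as a list rather than a single string means it can later be handed to `subprocess.run` without any shell quoting concerns.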

What Is the Role of Arena?

Arena currently supports standalone training and distributed training in the PS-Worker model. On the backend, it relies on the TFJob provided by KubeFlow. Soon, it will be expanded to support MPIJob and PyTorchJob as well.

It also supports real-time training operations and maintenance including:

  1. Using the "top" command to monitor the allocation and scheduling of GPU resources.
  2. CPU and GPU resource monitoring.
  3. Real-time checking of training logs.
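
As a rough illustration, the operations above might map onto Arena subcommands as in the following Python sketch. The `monitoring_commands` helper is a hypothetical name, and the `arena top node`, `arena top job`, and `arena logs` subcommands are assumptions drawn from the project's documentation, so verify them with `arena --help` before relying on them. The sketch only prints the commands.

```python
import shlex
from typing import Dict, List


def monitoring_commands(job_name: str) -> Dict[str, List[str]]:
    """Return argv lists for common checks (subcommand names are assumptions)."""
    return {
        # GPU allocation across the nodes of the cluster.
        "gpu_allocation_by_node": ["arena", "top", "node"],
        # GPU resources held by running training jobs.
        "gpu_usage_by_job": ["arena", "top", "job"],
        # Training logs of one specific job.
        "training_logs": ["arena", "logs", job_name],
    }


if __name__ == "__main__":
    for purpose, argv in monitoring_commands("mnist-demo").items():
        print(f"{purpose}: {shlex.join(argv)}")
```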

In the future, we hope to provide a Deep Learning production line through Arena that covers the whole process, including integrated training data management, experiment management, model development, continuous training, evaluation, and online prediction.

The goal of Arena is to allow data scientists to unleash the power of KubeFlow as easily as training on a desktop, while also giving them control over cluster-level scheduling and administration. We have published our source code on GitHub to better share and cooperate with the open-source community: https://github.com/AliyunContainerService/arena. Everybody is welcome to check it out and use it. If you like it, please star it. We also welcome your contributions to the code.

The Story Behind Arena

The open-source tool Arena was born as Alibaba Cloud's Deep Learning Solution. It already supports many Deep Learning frameworks (such as TensorFlow, Caffe, Horovod, and PyTorch), and it supports the whole Deep Learning production line from start to finish (including the steps of integrated training data management, experiment management, model development, continuous training and evaluation, and online prediction).

This solution deeply integrates the resources and services of Alibaba Cloud. It makes efficient use of heterogeneous resources such as CPUs and GPUs, centralizes containerization, orchestration, and management, and provides monitoring, alerting, and a platform for operations and maintenance.

Conclusion

Zhang Kai, a senior technical solution architect at Alibaba Cloud, said, "Deep Learning has brought about a revolutionary leap in the development of artificial intelligence, yet it has also sharply increased our reliance on computing and data resources. Alibaba Cloud provides end-to-end support for large-scale training, and we are continuously polishing this Deep Learning solution to make it easier to use and give it more powerful features."

Published at DZone with permission of Leona Zhang. See the original article here.

Opinions expressed by DZone contributors are their own.