Deep Learning at Alibaba Cloud With Alluxio – Running PyTorch on HDFS

Google’s TensorFlow and Facebook’s PyTorch are two deep learning frameworks popular in the open source community. Although PyTorch is still a relatively new framework, many developers have adopted it for its ease of use.

By default, PyTorch does not support training deep learning models directly on data stored in HDFS, which poses a challenge for users who keep their data sets there. Such users must either export the HDFS data at the start of every training job or modify PyTorch's source code to support reading from HDFS. Neither approach is ideal: both require extra manual work that can introduce additional uncertainty into the training job.

To avoid this problem, we chose to use Alluxio to expose HDFS to PyTorch through a POSIX filesystem interface. This approach greatly improved development efficiency at Alibaba Cloud.


This article demonstrates how this work was achieved within a Kubernetes environment.
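
To make the approach concrete before diving in: once the Alluxio FUSE daemon is mounted on a node, files in HDFS appear under an ordinary local path, so unmodified PyTorch code can read them with standard file I/O. A minimal sketch, assuming the /alluxio-fuse host mount point and the /data/MNIST/raw HDFS path that are set up later in this tutorial:

Shell

# HDFS-native access requires an HDFS client and HDFS-aware application code:
hdfs dfs -ls /data/MNIST/raw

# Through the Alluxio FUSE mount, the same data is a plain POSIX path,
# readable by any program, including an unmodified PyTorch DataLoader:
ls /alluxio-fuse/data/MNIST/raw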

Prepare an HDFS 2.7.2 Environment

For this tutorial, we used a Helm chart to install HDFS, standing in for an existing HDFS cluster.

1. Install the Helm chart for Hadoop 2.7.2

Shell

git clone https://github.com/cheyang/kubernetes-HDFS.git

kubectl label nodes cn-huhehaote.192.168.0.117 hdfs-namenode-selector=hdfs-namenode-0
#helm install -f values.yaml hdfs charts/hdfs-k8s
helm dependency build charts/hdfs-k8s
helm install hdfs charts/hdfs-k8s \
      --set tags.ha=false  \
      --set tags.simple=true  \
      --set global.namenodeHAEnabled=false  \
      --set hdfs-simple-namenode-k8s.nodeSelector.hdfs-namenode-selector=hdfs-namenode-0


2. Check the status of the Helm chart

Shell

kubectl get all -l release=hdfs


3. Access HDFS from the client

Shell

kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
root@hdfs-client-f5bc448dd-rc28d:/# hdfs dfsadmin -report
Configured Capacity: 422481862656 (393.47 GB)
Present Capacity: 355748564992 (331.32 GB)
DFS Remaining: 355748515840 (331.32 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: 172.31.136.180:50010 (172-31-136-180.node-exporter.arms-prom.svc.cluster.local)
Hostname: iZj6c7rzs9xaeczn47omzcZ
Decommission Status : Normal
Configured Capacity: 211240931328 (196.73 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 32051716096 (29.85 GB)
DFS Remaining: 179189190656 (166.88 GB)
DFS Used%: 0.00%
DFS Remaining%: 84.83%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 31 16:48:52 UTC 2020


4. Check the HDFS client configuration

Shell

[root@iZj6c61fdnjcrcrc2sevsfZ kubernetes-HDFS]# kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
root@hdfs-client-f5bc448dd-rc28d:/# cat /etc/hadoop-custom-conf
cat: /etc/hadoop-custom-conf: Is a directory
root@hdfs-client-f5bc448dd-rc28d:/# cd /etc/hadoop-custom-conf
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# ls
core-site.xml  hdfs-site.xml
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
</configuration>
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/dfs/data/0</value>
  </property>
</configuration>
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop --version
Error: No command named `--version' was found. Perhaps you meant `hadoop -version'
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop -version
Error: No command named `-version' was found. Perhaps you meant `hadoop version'
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar


5. Experiment with basic HDFS file operations

Shell

# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-03-31 16:51 /test
# hdfs dfs -mkdir /mytest
# hdfs dfs -copyFromLocal /etc/hadoop/hadoop-env.cmd /test/
# hdfs dfs -ls /test
Found 2 items
-rw-r--r--   3 root supergroup       3670 2020-04-20 08:51 /test/hadoop-env.cmd


6. Download the MNIST data and copy it into HDFS

Shell

mkdir -p /data/MNIST/raw/
cd /data/MNIST/raw/
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/train-images-idx3-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/train-labels-idx1-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/t10k-images-idx3-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/t10k-labels-idx1-ubyte.gz
hdfs dfs -mkdir -p /data/MNIST/raw
hdfs dfs -copyFromLocal *.gz /data/MNIST/raw
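
To confirm the upload, list the target directory in HDFS; the four .gz files should appear:

Shell

# sanity check: the raw MNIST archives should now be in HDFS
hdfs dfs -ls /data/MNIST/raw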



Deploy Alluxio

1. First, label the designated node (one or more nodes can be selected)

Shell

kubectl label nodes cn-huhehaote.192.168.0.117 dataset=mnist


2. Create config.yaml, configuring the node selector to target the labeled node

Shell

cat << EOF > config.yaml
image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
imageTag: "2.2.0-SNAPSHOT-b2c7e50"
nodeSelector:
    dataset: mnist
properties:
    alluxio.fuse.debug.enabled: "false"
    alluxio.user.file.writetype.default: MUST_CACHE
    alluxio.master.journal.folder: /journal
    alluxio.master.journal.type: UFS
    alluxio.master.mount.table.root.ufs: "hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020"
worker:
    jvmOptions: " -Xmx4G "
master:
    jvmOptions: " -Xmx4G "
tieredstore:
  levels:
  - alias: MEM
    level: 0
    quota: 20GB
    type: hostPath
    path: /dev/shm
    high: 0.99
    low: 0.8
fuse:
  image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
  imageTag: "2.2.0-SNAPSHOT-b2c7e50"
  jvmOptions: " -Xmx4G -Xms4G "
  args:
    - fuse
    - --fuse-opts=direct_io
EOF


It should be noted that the HDFS version needs to be specified when Alluxio is compiled.
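
For reference, here is a sketch of how a matching Hadoop version is typically specified when building Alluxio from source with Maven. This tutorial uses a prebuilt image, so the command below is an illustration of Alluxio's standard build, not a step taken from this setup:

Shell

git clone https://github.com/Alluxio/alluxio.git
cd alluxio
# build against the same HDFS version as the target cluster (2.7.2 in this tutorial)
mvn clean install -DskipTests -Dhadoop.version=2.7.2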

3. Deploy Alluxio

Shell

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/alluxio-0.12.0.tgz
tar -xvf alluxio-0.12.0.tgz
helm install alluxio -f config.yaml alluxio


4. Check the status of Alluxio and wait until all components are ready

Shell

helm get manifest alluxio | kubectl get -f -
NAME                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                   AGE
service/alluxio-master   ClusterIP   None         <none>        19998/TCP,19999/TCP,20001/TCP,20002/TCP   14h

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/alluxio-fuse     4         4         4       4            4           <none>          14h
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/alluxio-worker   4         4         4       4            4           <none>          14h

NAME                              READY   AGE
statefulset.apps/alluxio-master   1/1     14h
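
At this point, the MNIST files copied into HDFS earlier should be visible through the FUSE mount. A quick check, assuming the chart mounts Alluxio at /alluxio-fuse on the host (consistent with the --data-dir flag used when submitting the training job later):

Shell

# run on the node labeled dataset=mnist; the four .gz files should be listed
ls /alluxio-fuse/data/MNIST/raw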



Prepare PyTorch Container Image

1. Create a Dockerfile

Shell

mkdir pytorch-mnist
cd pytorch-mnist
vim Dockerfile


Populate the Dockerfile with the following content:

Dockerfile

FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-devel

# pytorch/pytorch:1.4-cuda10.1-cudnn7-devel

ADD mnist.py /

CMD ["python", "/mnist.py"]


2. Create a PyTorch Python file called mnist.py

Shell

cd pytorch-mnist
vim mnist.py


Populate the Python file with the following content:

Python

# -*- coding: utf-8 -*-
# @Author: cheyang
# @Date:   2020-04-18 22:41:12
# @Last Modified by:   cheyang
# @Last Modified time: 2020-04-18 22:44:06
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output,
                                    target,
                                    reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int,
                        default=1000,
                        metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    # '../data' is relative to the working directory (/workspace in this
    # image), i.e. /data, which the arena job maps to the Alluxio FUSE
    # path, so the raw MNIST files already in HDFS are picked up instead
    # of being downloaded again
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()


3. Build the image

Build the custom image from the same directory. The target container image in this example is registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel:

Shell

docker build -t \
 registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel .


4. Push the built image

Push registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel to an image repository. In this example, the repository was created in Alibaba Cloud's China East 1 region, which is convenient for users in the Greater China area. For details, refer to the basic operations for container images in the Alibaba Cloud documentation.
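
After logging in to the registry with docker login, the push itself is a single command:

Shell

docker push registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel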


Submit PyTorch Training Tasks

1. Install arena

Shell

$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.3.3-332fcde-linux-amd64.tar.gz
$ tar -xvf arena-installer-0.3.3-332fcde-linux-amd64.tar.gz
$ cd arena-installer/
$ ./install.sh
$ yum install bash-completion -y
$ echo "source <(arena completion bash)" >> ~/.bashrc
$ chmod u+x ~/.bashrc


2. Use arena to submit the training job, remembering to set the selector to dataset=mnist


Shell

arena submit tf \
             --name=alluxio-pytorch \
             --selector=dataset=mnist \
             --data-dir=/alluxio-fuse/data:/data \
             --gpus=1 \
             --image=registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel \
             "python /mnist.py"


3. View the training logs through arena


Shell

# arena logs --tail=20 alluxio-pytorch
Train Epoch: 12 [49280/60000 (82%)] Loss: 0.021669
Train Epoch: 12 [49920/60000 (83%)] Loss: 0.008180
Train Epoch: 12 [50560/60000 (84%)] Loss: 0.009288
Train Epoch: 12 [51200/60000 (85%)] Loss: 0.035657
Train Epoch: 12 [51840/60000 (86%)] Loss: 0.006190
Train Epoch: 12 [52480/60000 (87%)] Loss: 0.007776
Train Epoch: 12 [53120/60000 (88%)] Loss: 0.001990
Train Epoch: 12 [53760/60000 (90%)] Loss: 0.003609
Train Epoch: 12 [54400/60000 (91%)] Loss: 0.001943
Train Epoch: 12 [55040/60000 (92%)] Loss: 0.078825
Train Epoch: 12 [55680/60000 (93%)] Loss: 0.000925
Train Epoch: 12 [56320/60000 (94%)] Loss: 0.018071
Train Epoch: 12 [56960/60000 (95%)] Loss: 0.031451
Train Epoch: 12 [57600/60000 (96%)] Loss: 0.031353
Train Epoch: 12 [58240/60000 (97%)] Loss: 0.075761
Train Epoch: 12 [58880/60000 (98%)] Loss: 0.003975
Train Epoch: 12 [59520/60000 (99%)] Loss: 0.085389

Test set: Average loss: 0.0256, Accuracy: 9921/10000 (99%)



Summary

Previously, running a PyTorch program against data in HDFS required users to modify PyTorch's adapter code. With Alluxio, we were able to develop and train models quickly, without modifying PyTorch code or manually copying data out of HDFS. Setting up the entire environment within Kubernetes simplified the approach further.

Topics:
cloud computing, data engineering, data orchestration, deep learning, distributed systems, hadoop and hdfs, kubernetes, pytorch, tutorial

Published at DZone with permission of Bin Fan, DZone MVB. See the original article here.

