Deploying PyTorch Models to End Devices

03-06

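Background

Deploying AI capabilities on the server side is not always appropriate, and in the future it may be inappropriate most of the time. Gemfield lists a few scenarios:

1. The AI output serves only a single user, so the one-to-many scale advantage of a server cannot be exploited; for example, AI photography on a phone.

2. The round-trip network latency between device and server is unacceptable (to say nothing of unstable networks or frequent disconnections); for example, autonomous driving.

3. Server compute is the bottleneck. Server-side AI compute is currently quite powerful, but this resembles the socialist model of concentrating resources to accomplish big things; it does not fit a large number of small, parallel AI workloads. For example, to run AI analysis on several hundred thousand surveillance cameras, the best approach is to deploy the AI capability on the AI cameras themselves and send only the AI output back to the server. Otherwise, the several hundred thousand video streams alone are enough to overwhelm the server side.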
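To put a rough number on scenario 3, here is a back-of-the-envelope sketch. The stream count (300,000) and the per-stream bitrate (2 Mbit/s) are illustrative assumptions, not figures from this article.

```python
def aggregate_bandwidth_gbps(num_streams, mbps_per_stream):
    """Total ingest bandwidth in Gbit/s if every stream is sent to the server."""
    return num_streams * mbps_per_stream / 1000.0


# Assumed values, for illustration only
NUM_STREAMS = 300_000
MBPS_PER_STREAM = 2.0

total_gbps = aggregate_bandwidth_gbps(NUM_STREAMS, MBPS_PER_STREAM)
print(f"{total_gbps:.0f} Gbit/s of video would have to reach the server")
```

Under these assumptions the server side has to ingest hundreds of Gbit/s before any inference happens, whereas sending back only per-camera AI results needs a tiny fraction of that.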

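A few years ago, before AI had reached real products, such scenarios did not exist, so most people (and most vendors) never considered these problems. Today the AI+ revolution is producing more and more of them. To provide solutions for these scenarios (some call it edge computing; Gemfield is not sure what that term means), the hardware and software giants have begun a dramatic transformation.

AI Development and Deployment Model

Gemfield will cover four aspects: server-side AI training hardware, server-side AI training frameworks, on-device AI forward (inference) hardware, and on-device AI forward (inference) frameworks. Note the four hardware giants: Nvidia GPU, Apple NPU, Google TPU, Intel VPU.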

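Server-side AI training hardware

The market currently offers Nvidia's CUDA devices and the TPU on Google Cloud. Outside Google Cloud, for example in self-built data centers or on AWS, Azure, and Alibaba Cloud, Nvidia's CUDA devices are the de facto only AI training hardware.

Server-side AI training frameworks

The main ones on the market are TensorFlow, PyTorch, Caffe, and MXNet. Self-built AI platforms generally choose TensorFlow or PyTorch.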

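On-device AI inference hardware

Here an end device can be a PC-like device, a phone, a car, a camera, an IoT device, or any of numerous embedded devices. The AI acceleration hardware used on them includes:

Nvidia: CUDA GPUs, and Jetson for embedded use;

Intel: Movidius VPU (NCS2);

Apple: the NPU on the A12;

Google: Edge TPU.

Besides these four giants there are also Chinese chips such as Cambricon and Baidu Kunlun, but none of them has reached scale yet.

On-device AI inference frameworks

Taking phones as an example, iOS uses Apple's CoreML framework and Android uses TFLite. On the Intel NCS, Intel's NCSDK is used. There are also Nvidia's TensorRT, Tencent's NCNN, and others.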

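Development model

As analyzed above, on the server side Nvidia's CUDA devices are currently the de facto only option, so, price aside, this ecosystem is friendly to developers: everyone shares a standard language. On the wide variety of end devices, however, hardware and software are a hundred flowers blooming, with no unified standard. This leads to today's mainstream deployment approach:

Train a specific model on the server, then deploy it to the server or to an end device (in the future, mostly to end devices). This requires a tool that converts from the server-side framework to the on-device forward (inference) framework. An inference framework and its conversion tools form an independent ecosystem. Many such ecosystems are currently competing with each other, which also creates plenty of AI development headaches for ordinary developers.

In this article, Gemfield will post updates from time to time on deploying AI algorithms to end devices in practice.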

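1. PyTorch to CoreML

Apple does not officially support direct conversion from PyTorch to CoreML. However, with Apple's coremltools, PyTorch's ONNX export, and the community's ONNX-to-CoreML converter, the conversion is still easy.

Exporting the PyTorch model to ONNX during the forward pass

In the normal forward logic, add a call to the torch.onnx._export API; it only needs to run once. This is supported by the PyTorch framework itself. Note that torch.onnx._export is a private function; the public entry point is torch.onnx.export.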

import torch

input_names = ["gemfield"]
output_names = ["gemfieldout"]
# Trace the model with img_tensor and write the ONNX graph to gemfield.onnx
outputs = torch.onnx._export(model, img_tensor, "gemfield.onnx", input_names=input_names, output_names=output_names)

This produces the gemfield.onnx model. The following project can then convert gemfield.onnx into Apple's CoreML model, gemfield.mlmodel:

onnx/onnx-coreml (github.com)

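Once you have the mlmodel file, add it to an Xcode project to deploy the algorithm on iOS. At this point you need a skilled iOS developer. Note: if you use the Vision framework together with CoreML to handle image input, remember the imageCropAndScaleOption property on the VNCoreMLRequest instance, because the default value makes Vision crop your input image to a square from the center. You can change the property to avoid this: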

request.imageCropAndScaleOption = .scaleFill
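To see how much of a portrait image a centered square crop discards, here is a small geometry sketch. It is a simplified model of the behavior described above, not Vision's actual implementation; the function names are hypothetical, and the 768x1280 size is borrowed from the debugging script later in this section.

```python
def center_crop_box(width, height):
    """Centered square (left, top, right, bottom) with side min(width, height)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def scale_fill_factors(width, height, target_w, target_h):
    """Independent x/y scale factors when the whole image is stretched to the target."""
    return (target_w / width, target_h / height)


box = center_crop_box(768, 1280)
kept_fraction = ((box[2] - box[0]) * (box[3] - box[1])) / (768 * 1280)
print(box)            # (0, 256, 768, 1024)
print(kept_fraction)  # share of the original pixels that survives the crop
print(scale_fill_factors(768, 1280, 384, 640))
```

With a centered square crop, the top and bottom 256 rows of a 768x1280 image never reach the model, while stretching the whole image keeps every pixel at the cost of possible aspect-ratio distortion.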

Bonus 1: the following code prints the network structure of gemfield.mlmodel (it runs on Linux, macOS, and Windows):

import coremltools

# Load the protobuf spec of the model and print it
spec = coremltools.utils.load_spec("gemfield.mlmodel")
print(spec)

Bonus 2: the following code debugs a CoreML model off the phone (it only runs on macOS):

from PIL import Image
import coremltools
import numpy as np

mlmodel = coremltools.models.MLModel("gemfield.mlmodel")
pil_img = Image.open("civilnet.jpg")
pil_img = pil_img.resize((768, 1280))  # PIL takes (width, height)

# forward
out = mlmodel.predict({"gemfield": pil_img})

# visualize the model output: per-pixel argmax over the class axis, class 1 drawn in white
b = np.argmax(out["gemfieldout"], 0)
im = np.where(b == 1, 255, 0)
im = im.astype(np.uint8)
im = Image.fromarray(im)
im.show()

To run the script above on macOS, install the following dependencies:

sudo easy_install pip
pip install --user pillow
pip install --user coremltools
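The visualization step in the debugging script can be tried without a model. The sketch below runs the same argmax and np.where logic on a toy array; the (classes, height, width) layout is an assumption inferred from the argmax over axis 0, and the scores are made up.

```python
import numpy as np

# Toy "network output": 2 classes, 2x3 pixels, laid out as (classes, height, width)
scores = np.array([
    [[0.9, 0.2, 0.6],
     [0.1, 0.8, 0.3]],  # class 0 scores
    [[0.1, 0.8, 0.4],
     [0.9, 0.2, 0.7]],  # class 1 scores
])

# Per-pixel winning class, then a binary mask with class 1 drawn in white
labels = np.argmax(scores, 0)
mask = np.where(labels == 1, 255, 0).astype(np.uint8)

print(labels)
print(mask)
```

The cast to uint8 matters because image libraries such as PIL expect 8-bit data for a grayscale image.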

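2. PyTorch to TFLite

Overall, there are currently four ways to get from PyTorch to TFLite:

a. Use the pytorch2keras project, then convert from Keras to TFLite.

This project threw an error right at the start. Gave up.

b. Use the onnx-tensorflow project, then convert from TensorFlow.

First export an ONNX model from PyTorch, then use this project to convert it to a TensorFlow pb model.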

import onnx
from onnx_tf.backend import prepare

onnx_model = onnx.load("input_path")  # load onnx model
tf_rep = prepare(onnx_model)  # prepare tf representation
tf_rep.export_graph("output_path")  # export the model

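Went through the whole process; it does not support dilated convolution, PReLU, and so on. Gave up.

c. Use the MMdnn project to convert to IR, then from IR to TensorFlow or Keras, and then to TFLite.

Command used: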

mmtoir -f pytorch -d gemfieldnet --inputShape 3,512,1024 -n model.pth

Error:

Traceback (most recent call last):
  File "/usr/local/bin/mmtoir", line 11, in <module>
    load_entry_point(mmdnn==0.2.3, console_scripts, mmtoir)()
  File "/usr/local/lib/python3.5/dist-packages/mmdnn/conversion/_script/convertToIR.py", line 192, in _main
    ret = _convert(args)
  File "/usr/local/lib/python3.5/dist-packages/mmdnn/conversion/_script/convertToIR.py", line 92, in _convert
    parser = PytorchParser(model, inputshape[0])
  File "/usr/local/lib/python3.5/dist-packages/mmdnn/conversion/pytorch/pytorch_parser.py", line 85, in __init__
    self.pytorch_graph.build(self.input_shape)
  File "/usr/local/lib/python3.5/dist-packages/mmdnn/conversion/pytorch/pytorch_graph.py", line 124, in build
    trace.set_graph(PytorchGraph._optimize_graph(trace.graph(), False))
  File "/usr/local/lib/python3.5/dist-packages/mmdnn/conversion/pytorch/pytorch_graph.py", line 74, in _optimize_graph
    graph = torch._C._jit_pass_onnx(graph, torch.onnx.OperatorExportTypes.ONNX)
  File "/usr/local/lib/python3.5/dist-packages/torch/onnx/__init__.py", line 52, in _run_symbolic_function
    return utils._run_symbolic_function(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/onnx/utils.py", line 529, in _run_symbolic_function
    n.kindOf("value")))
RuntimeError: Unsupported prim::Constant kind: `i`. Send a bug report.

No hope of fixing it! Gave up.

d. Rewrite the model in TensorFlow.

Nothing more to say; writing it now...

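3. PyTorch to Tencent's NCNN

Convert via ONNX. However, as of January 25, 2019, ncnn does not support upsample. Gave up.

4. PyTorch to Xiaomi's MACE

Convert via ONNX. However, as of January 25, 2019, MACE supports neither the group parameter of convolution kernels nor upsample. Gave up.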

Transform model to one that can better run on device
(onnx model IR version: , 3L)
(onnx model opset import: , [domain: "" version: 9 ])
Traceback (most recent call last):
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/converter.py", line 363, in <module>
    main(unused_args=[sys.argv[0]] + unparsed)
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/converter.py", line 194, in main
    output_graph_def = converter.run()
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/converter_tool/onnx_converter.py", line 389, in run
    self.convert_ops(graph_def)
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/converter_tool/onnx_converter.py", line 444, in convert_ops
    self._op_converters[node.op_type](node)
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/converter_tool/onnx_converter.py", line 546, in convert_conv2d
    "Mace does not support group convolution yet")
  File "/home/gemfield/mace/bazel-bin/mace/python/tools/converter.runfiles/mace/mace/python/tools/convert_util.py", line 20, in mace_check
    raise Exception(msg)
Exception: Mace does not support group convolution yet
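Several of the attempts above only failed once the converter hit an unsupported operator. A pre-flight check over the exported graph can surface such problems earlier. The sketch below is a minimal illustration: the list-of-dicts node format, the function name, and the example "unsupported" set are all hypothetical stand-ins. With the onnx package, the same information can be read from model.graph.node (each node's op_type and attribute fields).

```python
def find_unsupported(nodes, unsupported_ops, allow_group_conv):
    """Return (index, op_type, reason) for every node the target cannot handle."""
    problems = []
    for index, node in enumerate(nodes):
        op_type = node["op_type"]
        if op_type in unsupported_ops:
            problems.append((index, op_type, "operator not supported"))
        elif op_type == "Conv" and not allow_group_conv:
            # ONNX Conv defaults to group=1 when the attribute is absent
            if node.get("attrs", {}).get("group", 1) != 1:
                problems.append((index, op_type, "group convolution not supported"))
    return problems


# Hypothetical, simplified stand-in for the nodes of an exported ONNX graph
nodes = [
    {"op_type": "Conv", "attrs": {"group": 1}},
    {"op_type": "Conv", "attrs": {"group": 32}},
    {"op_type": "Upsample", "attrs": {}},
    {"op_type": "Relu"},
]

problems = find_unsupported(nodes, unsupported_ops={"Upsample"}, allow_group_conv=False)
for index, op_type, reason in problems:
    print(f"node {index}: {op_type}: {reason}")
```

The set of unsupported operators depends on the converter and its version, so it would have to be maintained by hand from each project's documentation or source.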

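5. PyTorch to Caffe2

Convert via ONNX. However, upsample's bilinear operation is converted into Caffe2's nearest resize operation, so results are somewhat worse. There are two main steps: convert the ONNX model to a Caffe2 pb model, and build the Android version of the PyTorch (Caffe2) library.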
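Why a nearest-neighbor resize tends to look worse than a bilinear one can be seen in one dimension. The sketch below is a simplified illustration with hypothetical helper names, using endpoint-aligned linear interpolation; it does not reproduce the exact kernels of PyTorch or Caffe2.

```python
def upsample_nearest(values, factor):
    """Repeat each sample `factor` times."""
    return [values[i // factor] for i in range(len(values) * factor)]


def upsample_linear(values, factor):
    """Linear interpolation with aligned endpoints (needs at least 2 output samples)."""
    n_out = len(values) * factor
    out = []
    for i in range(n_out):
        pos = i * (len(values) - 1) / (n_out - 1)  # position in input coordinates
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


signal = [0.0, 1.0]
nearest = upsample_nearest(signal, 2)
linear = upsample_linear(signal, 2)
print(nearest)  # [0.0, 0.0, 1.0, 1.0]
print(linear)   # a smooth ramp from 0.0 to 1.0
```

Nearest-neighbor produces a hard step where linear interpolation produces a ramp; in a segmentation output this shows up as blockier class boundaries.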

1. Convert ONNX to a Caffe2 pb

import numpy as np
import cv2
import torch
import onnx
import caffe2.python.onnx.backend as onnx_caffe2_backend
from caffe2.python import workspace
from caffe2.python.predictor import mobile_exporter

# Input to the model
x = torch.randn(1, 3, 640, 384, requires_grad=True)

# Load the ONNX ModelProto object. model is a standard Python protobuf object
model = onnx.load("civilnet.onnx")
prepared_backend = onnx_caffe2_backend.prepare(model)
W = {model.graph.input[0].name: x.data.numpy()}

# Run the Caffe2 net
c2_out = prepared_backend.run(W)[0]
print("Exported model has been executed on Caffe2 backend, and the result looks good!")

# Extract the workspace and the model proto from the internal representation
c2_workspace = prepared_backend.workspace
c2_model = prepared_backend.predict_net

# Export init_net and predict_net; both are needed to run on mobile
init_net, predict_net = mobile_exporter.Export(c2_workspace, c2_model, c2_model.external_input)

# Save init_net and predict_net to files for later use on mobile
with open("esp_init_net.pb", "wb") as fopen:
    fopen.write(init_net.SerializeToString())
with open("esp_predict_net.pb", "wb") as fopen:
    fopen.write(predict_net.SerializeToString())

# Preprocess a test image: per-channel mean/std normalization
mean = [109.496254, 118.698456, 124.68751]
std = [58.50182, 58.50182, 58.50182]
img = cv2.imread("syszux.jpg")
img = img.astype(np.float64)
im = cv2.resize(img, (384, 640))  # cv2 takes (width, height); result shape is (640, 384, 3)
for j in range(3):
    im[:, :, j] -= mean[j]
for j in range(3):
    im[:, :, j] /= std[j]
im /= 255
im = np.expand_dims(im, axis=0)
# (1, H, W, C) -> (1, C, W, H); note that NCHW would be transpose(0, 3, 1, 2)
im = im.transpose(0, 3, 2, 1)

# Run the mobile nets generated above so that the Caffe2 workspace is properly initialized
workspace.RunNetOnce(init_net)
workspace.RunNetOnce(predict_net)

# Feed the preprocessed image into the input blob
workspace.FeedBlob("gemfield", im.astype(np.float32))
# Run predict_net to get the model output
workspace.RunNetOnce(predict_net)

# Fetch the output blob and undo the layout change
img_out = workspace.FetchBlob("gemfieldout")
img_out = img_out.transpose(0, 3, 2, 1)
img_out = np.argmax(img_out, axis=3)
# Map class index i to gray level 30*i
for i in range(9):
    img_out = np.where(img_out == i, 30 * i, img_out)
img_out = img_out.transpose(1, 2, 0)
# Cast to 8-bit before writing; the largest value is 240
cv2.imwrite("out.png", img_out.astype(np.uint8))
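The script above moves the image from OpenCV's height-width-channel layout to a channel-first layout with transpose(0, 3, 2, 1). The sketch below, on a dummy array, shows what that axis order produces compared with transpose(0, 3, 1, 2). Whether the swapped height and width are intentional is not stated in the article; the sizes follow the dummy input x of shape (1, 3, 640, 384).

```python
import numpy as np

height, width = 640, 384
hwc = np.zeros((height, width, 3), dtype=np.float32)  # stand-in for a decoded image
hwc[10, 20, 2] = 1.0  # mark one pixel: row 10, column 20, channel 2

batch = np.expand_dims(hwc, axis=0)  # (1, H, W, C)
nchw = batch.transpose(0, 3, 1, 2)   # (1, C, H, W)
ncwh = batch.transpose(0, 3, 2, 1)   # (1, C, W, H)

print(nchw.shape)  # (1, 3, 640, 384)
print(ncwh.shape)  # (1, 3, 384, 640)
```

With (0, 3, 2, 1) the marked pixel ends up at [0, 2, 20, 10], so the network sees the image with rows and columns swapped. Applying the same permutation to the output restores the original orientation, which is why the saved image still looks upright.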

2. Build the PyTorch Android library and executables (note that -DBUILD_BINARY=ON must be enabled, because we need to build the speed_benchmark executable):

gemfield@p2k:/home/gemfield/pytorch# scripts/build_android.sh -DBUILD_BINARY=ON
...
[ 98%] Linking CXX executable ../bin/tutorial_blob
[ 98%] Linking CXX executable ../bin/make_mnist_db
[ 98%] Linking CXX executable ../bin/convert_db
[100%] Linking CXX executable ../bin/predictor_verifier
[100%] Built target tutorial_blob
[100%] Linking CXX executable ../bin/split_db
[100%] Linking CXX executable ../bin/make_cifar_db
[100%] Linking CXX executable ../bin/convert_caffe_image_db
[100%] Built target make_mnist_db
[100%] Linking CXX executable ../bin/db_throughput
[100%] Linking CXX executable ../bin/print_registered_core_operators
[100%] Built target convert_db
[100%] Built target predictor_verifier
[100%] Linking CXX executable ../bin/run_plan
[100%] Built target make_cifar_db
[100%] Built target split_db
[100%] Built target convert_caffe_image_db
[100%] Built target db_throughput
[100%] Built target print_registered_core_operators
[100%] Linking CXX executable ../bin/speed_benchmark
[100%] Built target run_plan
[100%] Linking CXX shared library ../../lib/libcaffe2_detectron_ops.so

3. To check whether the exported pb model runs on an Android phone, use adb to push the executable, the model, the input, and so on to the phone:

# Linux shell commands
# (the export script above writes esp_init_net.pb and esp_predict_net.pb; rename the files or adjust the paths to match)
adb push speed_benchmark /data/local/tmp/speed_benchmark
adb push init_net.pb /data/local/tmp/
adb push predict_net.pb /data/local/tmp/
# test whether it runs
adb shell /data/local/tmp/speed_benchmark --net /data/local/tmp/predict_net.pb --init_net /data/local/tmp/init_net.pb --input data --input_dims 1,3,640,384 --input_type float --warmup 50 --iter 10

3.1. You can also serialize an image and use it as the input to the pb model:

# Linux shell commands
adb push input.blobproto /data/local/tmp/
adb shell /data/local/tmp/speed_benchmark --net /data/local/tmp/predict_net.pb --init_net /data/local/tmp/init_net.pb --input=gemfield --input_file=/data/local/tmp/input.blobproto --output_folder=/data/local/tmp --output=gemfieldout --iter=1 --caffe2_log_level=0
adb pull /data/local/tmp/gemfieldout ./output.blobproto

Visualize the output with Python:

from caffe2.proto import caffe2_pb2
from caffe2.python import utils
import numpy as np
import cv2

# Parse the serialized output blob pulled from the phone
blob_proto = caffe2_pb2.BlobProto()
with open("output.blobproto", "rb") as f:
    blob_proto.ParseFromString(f.read())

img_out = utils.Caffe2TensorToNumpyArray(blob_proto.tensor)
img_out = img_out.transpose(0, 3, 2, 1)
img_out = np.argmax(img_out, axis=3)
# Map class index i to gray level 30*i
for i in range(9):
    img_out = np.where(img_out == i, 30 * i, img_out)
img_out = img_out.transpose(1, 2, 0)
# Cast to 8-bit before writing; the largest value is 240
cv2.imwrite("mobileoutput.png", img_out.astype(np.uint8))
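The loop that maps class index i to gray level 30*i can also be written as a single multiplication. The sketch below checks on a toy label map that both forms agree; the label values are made up, and the equivalence relies on there being at most 9 classes, so that no mapped gray level collides with a later class index.

```python
import numpy as np

labels = np.array([[0, 1], [8, 3]])  # toy per-pixel class indices in 0..8

# Loop form, as in the scripts above
looped = labels.copy()
for i in range(9):
    looped = np.where(looped == i, 30 * i, looped)

# Vectorized form
scaled = labels * 30

print(looped)
print(scaled)
```

The vectorized form avoids nine passes over the image, which matters little for one frame but adds up when visualizing many outputs.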