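Using the RDK X3 as an example: modify the Head, add a fast 8 ms Python post-processing program, and run at a steady 30 fps
Building on Horizon's modifications to the YOLOv8s Backbone, this article proposes a post-processing approach for deploying YOLOv8 on a BPU with Horizon's Bernoulli2 architecture. With pretrained COCO weights (80 classes) at 640×640 resolution, the BPU accelerates the Backbone and Neck, with an inference time of about 62 ms, and the numpy-optimized post-processing takes about 8 ms. An efficient Dataflow makes full use of the X3's compute resources: Python multi-process inference plus web streaming yields a real-time object detection demo at 30 fps. All programs in this article are open source.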
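1. Introduction
The result of the serial program design is shown in the figure below. Setup: RDK X3, 4×A53@1.8GHz, 2×Bernoulli2 BPU@5TOPS, YOLOv8s with a slightly modified Backbone, 11.2 million parameters, 640×640 resolution, 80 classes, a single-core model, single CPU, single frame, single thread, and purely numpy-vectorized post-processing.
The program reads an image from local storage with OpenCV, converts it to NCHW input for the bin model, post-processes the outputs with numpy, and finally draws the detection results.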
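The conversion to NCHW described above can be sketched in plain numpy. This is a minimal illustration, not the author's code: the helper name `bgr_to_nchw_rgb` is hypothetical, the image is assumed to be already resized to 640×640, and the uint8 dtype is an assumption about what the bin model expects.

```python
import numpy as np

def bgr_to_nchw_rgb(img_bgr: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 BGR image into a 1x3xHxW RGB tensor (hypothetical helper)."""
    if img_bgr.ndim != 3 or img_bgr.shape[2] != 3:
        raise ValueError("expected an HxWx3 image")
    rgb = img_bgr[:, :, ::-1]             # BGR -> RGB by reversing the channel axis
    chw = np.transpose(rgb, (2, 0, 1))    # HWC -> CHW
    # Add the batch dimension and force a contiguous buffer for the runtime.
    return np.ascontiguousarray(chw[np.newaxis])

# Stand-in for an image read with OpenCV and resized to 640x640.
frame = np.zeros((640, 640, 3), dtype=np.uint8)
frame[..., 0] = 255                       # pure blue in BGR order
tensor = bgr_to_nchw_rgb(frame)
print(tensor.shape, tensor.dtype)
```

The explicit `np.ascontiguousarray` matters because slicing and transposing only create views; a runtime that reads the raw buffer needs the bytes laid out in NCHW order.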
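The parallel program design drops OpenCV's BGR8 Mat and uses nv12 as the key data format: the bin model is configured for nv12 input, inference runs in multiple Python processes, and TROS tools handle visualization. The end result is YOLOv8 running at 30 fps on the X3.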
[RDK X3 development board running YOLOv8s inference, 30 fps, Python multi-process] https://www.bilibili.com/video/BV1rz421B7jL/?share_source=copy_web&vd_source=5b24829d168bb2d02896ddeeaa6a20d2
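To make the nv12 format concrete: NV12 stores a full-resolution Y plane followed by one half-resolution plane of interleaved U and V samples, so a frame takes 1.5 bytes per pixel instead of 3 for BGR8. The sketch below repacks a planar I420 buffer (for example, one produced by OpenCV's `COLOR_BGR2YUV_I420` conversion) into NV12. The function name `i420_to_nv12` is hypothetical, and even width and height are assumed; in the demo pipeline the frames may already arrive as nv12, in which case no repacking is needed.

```python
import numpy as np

def i420_to_nv12(i420: np.ndarray, width: int, height: int) -> np.ndarray:
    """Repack a flat I420 buffer (Y, U, V planes) as NV12 (Y plane, interleaved UV)."""
    y_size = width * height
    uv_size = y_size // 4                 # each chroma plane is quarter resolution
    flat = i420.reshape(-1)
    if flat.size != y_size + 2 * uv_size:
        raise ValueError("buffer size does not match width x height")
    nv12 = np.empty_like(flat)
    nv12[:y_size] = flat[:y_size]                            # Y plane is unchanged
    nv12[y_size::2] = flat[y_size:y_size + uv_size]          # U samples at even offsets
    nv12[y_size + 1::2] = flat[y_size + uv_size:]            # V samples at odd offsets
    return nv12

# Tiny 4x2 example: 8 luma bytes, then 2 U bytes, then 2 V bytes.
demo_i420 = np.arange(12, dtype=np.uint8)
demo_nv12 = i420_to_nv12(demo_i420, width=4, height=2)
print(demo_nv12.tolist())
print(640 * 640 * 3 // 2, "bytes per 640x640 nv12 frame")
```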
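2. Post-processing Optimization
As shown in the figure below, the operators in the Backbone and Neck are all accelerated well by the Bernoulli2 BPU.
The Head is not accelerated well by the BPU, so it has to be pulled out entirely into post-processing and implemented on the CPU. Because deployment only involves forward propagation, there is no need to compute on all 8400 Grid Cells. The main optimization idea is to filter first and compute afterwards. The computation consists of the Sigmoid in the Classify branch, the DFL computation in the Bounding Box branch (SoftMax regression plus a Conv convolution that takes the expectation), and feature decoding (dist2bbox, ltrb2xyxy).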
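The filter-first idea can be sketched for a single feature level as follows. Sigmoid is monotonic, so comparing raw class logits against the inverse sigmoid of the confidence threshold selects the same cells as comparing scores, without evaluating any exponential on the rejected cells. This is an illustrative sketch rather than the author's implementation: the function names are hypothetical, the inputs are assumed to be float arrays already reshaped to (H*W, C), and `reg_max=16` is inferred from the 64 bbox channels.

```python
import numpy as np

def sigmoid_inverse(conf: float) -> float:
    """Logit value whose sigmoid equals conf."""
    return -np.log(1.0 / conf - 1.0)

def decode_level(bbox_raw, cls_raw, grid_w, stride, conf=0.5, reg_max=16):
    """Decode one feature level: filter on logits first, then compute.

    bbox_raw: (H*W, 4*reg_max) DFL logits; cls_raw: (H*W, num_classes) logits.
    Returns (xyxy, scores, class_ids) for the cells that pass the threshold.
    """
    cls_max = cls_raw.max(axis=1)
    keep = np.flatnonzero(cls_max >= sigmoid_inverse(conf))   # filter, no exp yet
    ids = cls_raw[keep].argmax(axis=1)
    scores = 1.0 / (1.0 + np.exp(-cls_max[keep]))             # Sigmoid on survivors only
    # DFL: softmax over the reg_max bins of each side, then the expected bin index.
    logits = bbox_raw[keep].reshape(-1, 4, reg_max)
    logits = logits - logits.max(axis=2, keepdims=True)       # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=2, keepdims=True)
    ltrb = prob @ np.arange(reg_max, dtype=np.float32)        # (K, 4) distances
    # dist2bbox: anchor points are the cell centres, in grid units.
    cx = (keep % grid_w) + 0.5
    cy = (keep // grid_w) + 0.5
    xyxy = np.stack([cx - ltrb[:, 0], cy - ltrb[:, 1],
                     cx + ltrb[:, 2], cy + ltrb[:, 3]], axis=1) * stride
    return xyxy, scores, ids

# Toy 2x2 grid with 3 classes: only cell 3 passes the threshold.
cls_demo = np.full((4, 3), -10.0, dtype=np.float32)
cls_demo[3, 1] = 2.0
bbox_demo = np.zeros((4, 64), dtype=np.float32)
bbox_demo.reshape(4, 4, 16)[:, :, 2] = 50.0   # every side sits almost exactly on bin 2
xyxy, scores, ids = decode_level(bbox_demo, cls_demo, grid_w=2, stride=8)
print(xyxy, scores, ids)
```

With a threshold of 0.5, typically only a small fraction of the 8400 cells survive, so the softmax and decoding work shrinks accordingly.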
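For explanations of everything except NMS, please refer to the author's YOLOv10 article.
NMS: removes duplicate detections of the same target to produce the final detection results, which include the class (id), the score (score), and the position (xyxy).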
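For reference, greedy NMS can be written in a few lines of numpy. The sketch below is class-agnostic and is not necessarily the variant the author uses; the function name `nms` and the example boxes are made up for illustration.

```python
import numpy as np

def nms(xyxy: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    """Greedy class-agnostic NMS. Returns indices of the kept boxes, best first."""
    x1, y1, x2, y2 = xyxy.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)           # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with all remaining boxes.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thr]      # drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=np.float32)
box_scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
kept = nms(boxes, box_scores, iou_thr=0.5)
print(kept)
```

In the example, the first two boxes overlap with an IoU of about 0.68, so the lower-scoring one is suppressed while the distant third box survives.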
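3. Reference Steps
Note: if you hit errors such as No such file or directory, No module named "xxx", or command not found, check them carefully, and do not simply copy and run the commands one by one. If the modification process is unclear, please visit the Horizon developer community and start from YOLOv5.
Download the repository containing Horizon's optimized Backbone, and set up the environment by following the official YOLOv8 documentation.
The weight files are stored in the model_zoo repository of the HorizonRDK organization. For the modification steps and the performance and accuracy data, see: 【前沿算法】地平线适配 YOLOv8 -v1.0.0 (horizon.cc), i.e. "[Frontier Algorithms] Horizon Adaptation of YOLOv8 - v1.0.0".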
- $ git clone https://github.com/HorizonRDK/model_zoo.git
Uninstall the yolo command-line package so that changes made directly in the ./ultralytics/ultralytics directory take effect.
- $ conda list | grep ultralytics
- $ pip list | grep ultralytics # or
- # If it is installed, uninstall it
- $ conda uninstall ultralytics
- $ pip uninstall ultralytics # or
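Modify the output head of Detect so that the Bounding Box and Classify information of the three feature levels are output separately, for a total of 6 output heads.
File: ./ultralytics/ultralytics/nn/modules/head.py, around line 43. Replace the forward function of the Detect class with the following: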
- def forward(self, x):
-     bbox = []
-     cls = []
-     for i in range(self.nl):
-         bbox.append(self.cv2[i](x[i]))
-         cls.append(self.cv3[i](x[i]))
-     return (bbox, cls)
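As a sanity check on what this forward function should export, the six output shapes can be computed from the input size. The sketch assumes the usual YOLOv8 strides of 8, 16, and 32 and `reg_max=16` (so 4 × 16 = 64 bbox channels); the helper name is hypothetical.

```python
def head_output_shapes(input_size=640, strides=(8, 16, 32), reg_max=16, num_classes=80):
    """Expected NCHW shapes of the 3 bbox heads and the 3 classification heads."""
    bbox = [(1, 4 * reg_max, input_size // s, input_size // s) for s in strides]
    cls = [(1, num_classes, input_size // s, input_size // s) for s in strides]
    return bbox, cls

bbox_shapes, cls_shapes = head_output_shapes()
# Total number of grid cells across the three feature levels.
num_cells = sum(h * w for _, _, h, w in bbox_shapes)
print(bbox_shapes)
print(cls_shapes)
print(num_cells)
```

The three levels contribute 6400, 1600, and 400 cells, which adds up to the 8400 Grid Cells mentioned in the post-processing section.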
Run the following Python script. If it reports No module named onnxsim, simply install that package.
- from ultralytics import YOLO
- YOLO('yolov8s.pt').export(format='onnx', simplify=True, opset=11)
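Following the Horizon OpenExplorer (天工开物) toolchain manual and the references in the OE package, check the model. Once all operators are on the BPU, compile it: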
- (bpu) $ hb_mapper checker --model-type onnx --march bernoulli2 --model yolov8s.onnx
- (bpu) $ hb_mapper makertbin --model-type onnx --config ./yolov8s.yaml
If you need to use OpenCV, NCHW input is recommended. The yaml configuration file for compilation:
- model_parameters:
-   onnx_model: './yolov8s_bernoulli2.onnx'
-   march: "bernoulli2"
-   layer_out_dump: False
-   working_dir: 'yolov8s_bernoulli2_NCHW'
-   output_model_file_prefix: 'yolov8s_bernoulli2_NCHW'
-   # remove_node_type: "Dequantize;"
- input_parameters:
-   input_name: ""
-   input_type_rt: 'rgb'
-   input_layout_rt: 'NCHW'
-   input_type_train: 'rgb'
-   input_layout_train: 'NCHW'
-   norm_type: 'data_scale'
-   scale_value: 0.003921568627451
- calibration_parameters:
-   cal_data_dir: './calibration_data_rgb_f32'
-   cal_data_type: 'float32'
- compiler_parameters:
-   compile_mode: 'latency'
-   debug: False
-   optimize_level: 'O3'
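If you are detecting on a live video stream, a DataFlow design built around nv12 is recommended, used together with hobot_dnn or this article's hobot_py_dnn. The yaml configuration file for compilation: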
- model_parameters:
-   onnx_model: './yolov8s_bernoulli2.onnx'
-   march: "bernoulli2"
-   layer_out_dump: False
-   working_dir: 'yolov8s_bernoulli2_nv12'
-   output_model_file_prefix: 'yolov8s_bernoulli2_nv12'
- input_parameters:
-   input_name: ""
-   input_type_rt: 'nv12'
-   input_type_train: 'rgb'
-   input_layout_train: 'NCHW'
-   norm_type: 'data_scale'
-   scale_value: 0.003921568627451
- calibration_parameters:
-   cal_data_dir: './calibration_data_rgb_f32'
-   cal_data_type: 'float32'
- compiler_parameters:
-   compile_mode: 'latency'
-   debug: False
-   optimize_level: 'O3'
Look up the names of the dequantize nodes of the 3 bbox output heads
In the log printed by hb_mapper makertbin, the three outputs of shape [1, 64, 80, 80], [1, 64, 40, 40], and [1, 64, 20, 20] are named output0, 384, and 394.
- ONNX IR version: 6
- Opset version: ['ai.onnx v11', 'horizon v1']
- Producer: pytorch v2.2.0
- Domain: None
- Model version: None
- Graph input:
- images: shape=[1, 3, 640, 640], dtype=FLOAT32
- Graph output:
- output0: shape=[1, 64, 80, 80], dtype=FLOAT32
- 384: shape=[1, 64, 40, 40], dtype=FLOAT32
- 394: shape=[1, 64, 20, 20], dtype=FLOAT32
- 379: shape=[1, 80, 80, 80], dtype=FLOAT32
- 389: shape=[1, 80, 40, 40], dtype=FLOAT32
- 399: shape=[1, 80, 20, 20], dtype=FLOAT32
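Remove the dequantize nodes of the 3 bbox output heads
Enter the directory containing the compiled artifacts.
List the dequantize nodes that can be removed: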
- $ hb_model_modifier yolov8s_bernoulli2_nv12.bin
- $ hb_model_modifier yolov8s_bernoulli2_NCHW.bin
The log output is as follows. The three nodes output0_HzDequantize, 384_HzDequantize, and 394_HzDequantize are the ones to remove.
- 2024-06-08 06:53:54,548 INFO log will be stored in /ws/yolov8s_bernoulli2/hb_model_modifier.log
- 2024-06-08 06:53:54,566 INFO Nodes that can be deleted: ['output0_HzDequantize', '379_HzDequantize', '384_HzDequantize', '389_HzDequantize', '394_HzDequantize', '399_HzDequantize']
Use the following commands to remove these three dequantize nodes. Note that the names may differ in your own export, so check them carefully.
- $ hb_model_modifier yolov8s_bernoulli2_nv12.bin -r output0_HzDequantize -r 384_HzDequantize -r 394_HzDequantize
- $ hb_model_modifier yolov8s_bernoulli2_NCHW.bin -r output0_HzDequantize -r 384_HzDequantize -r 394_HzDequantize
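Because the node names change between exports, it can help to derive the command from the Graph output section of the log instead of typing the names by hand. The sketch below is a hypothetical helper, not part of the toolchain: it picks the outputs with 64 channels and assumes the `<name>_HzDequantize` naming pattern seen in the log above, which should still be confirmed against the list printed by hb_model_modifier.

```python
def build_remove_command(bin_file: str, outputs: dict, bbox_channels: int = 64) -> str:
    """Build an hb_model_modifier command that removes the bbox dequantize nodes.

    outputs maps each graph output name to its NCHW shape, copied from the log.
    """
    names = [name for name, shape in outputs.items() if shape[1] == bbox_channels]
    flags = " ".join(f"-r {name}_HzDequantize" for name in names)
    return f"hb_model_modifier {bin_file} {flags}"

# Graph outputs as printed in the compile log above.
graph_outputs = {
    "output0": (1, 64, 80, 80),
    "384": (1, 64, 40, 40),
    "394": (1, 64, 20, 20),
    "379": (1, 80, 80, 80),
    "389": (1, 80, 40, 40),
    "399": (1, 80, 20, 20),
}
command = build_remove_command("yolov8s_bernoulli2_nv12.bin", graph_outputs)
print(command)
```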
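On-board performance test
Copy the bin model to the board with scp or a similar tool, then test performance with the following command. A single thread reaches about 17 fps and drives one BPU core to 100%; two threads reach 34 fps with the dual-core BPU at 200%. Inference can therefore exceed 30 fps and is not the bottleneck, so the next step focuses on optimizing post-processing.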
- hrt_model_exec perf --model_file yolov8s_bernoulli2.bin \
- --model_name="" \
- --core_id=0 \
- --frame_count=200 \
- --perf_time=0 \
- --thread_num=2 \
- --profile_path="."
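A rough frame-budget calculation shows why a serial loop cannot reach 30 fps even though inference alone is not the bottleneck. It uses the approximate 62 ms inference and 8 ms post-processing times quoted at the top of this article, and it is an idealized model that ignores preprocessing, drawing, and contention between processes.

```python
import math

def throughput_fps(latency_ms: float, workers: int = 1) -> float:
    """Ideal throughput when `workers` independent workers each take latency_ms per frame."""
    return workers * 1000.0 / latency_ms

INFER_MS = 62.0      # approximate BPU inference time from the article
POST_MS = 8.0        # approximate numpy post-processing time from the article
TARGET_FPS = 30.0

serial_fps = throughput_fps(INFER_MS + POST_MS)           # one frame at a time
frame_budget_ms = 1000.0 / TARGET_FPS                      # time available per frame
min_infer_workers = math.ceil(INFER_MS / frame_budget_ms)  # inferences that must overlap
post_fits_budget = POST_MS < frame_budget_ms

print(f"serial: {serial_fps:.1f} fps, budget: {frame_budget_ms:.1f} ms/frame")
print(f"inference workers needed: {min_infer_workers}, post-processing fits: {post_fits_budget}")
```

Under these assumptions a serial loop tops out near 14 fps, while overlapping two inferences (one per BPU core) brings the pipeline within the 33 ms budget, provided post-processing stays well below that budget.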
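4. Serial Program Design
When using the following program, remember to change the image and model file paths, and pip install any missing packages yourself. Note that this program uses the NCHW-input model.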
- # Copyright (c) 2024,WuChao D-Robotics.
- #
- # Licensed under the Apache License, Version 2.0 (the "License");
- # you may not use this file except in compliance with the License.
- # You may obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- # See the License for the specific language governing permissions and
- # limitations under the License.
- import cv2
- import numpy as np
- from scipy.special import softmax
- from time import time
- from hobot_dnn import pyeasy_dnn as dnn
- img_path = "kite.jpg"
- result_save_path = "kite.result.jpg"
- quantize_model_path = "./yolov8s_bernoulli2_NCHW_modified.bin"
- input_image_size = 640
- conf=0.5
- iou=0.5
- conf_inverse = -np.log(1/conf - 1)
- print("sigmoid_inverse threshold = %.2f"%conf_inverse)
- # Some constants and functions
- coco_names = [
- "person", "bicycle", "car", "motorcycle", "airplane",
- "bus", "train", "truck", "boat", "traffic light",
- "fire hydrant", "stop sign", "parking meter", "bench", "bird",
- "cat", "dog", "horse", "sheep", "cow",
- "elephant", "bear", "zebra", "giraffe", "backpack",
- "umbrella", "handbag", "tie", "suitcase", "frisbee",
- "skis", "snowboard", "sports ball", "kite", "baseball bat",
- "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle",
- "wine glass", "cup", "fork", "knife", "spoon",
- "bowl", "banana", "apple", "sandwich", "orange",
- "broccoli", "carrot", "hot dog", "pizza", "donut",
- "cake", "chair", "couch", "potted plant", "bed",
- "dining table", "toilet", "tv", "laptop", "mouse",
- "remote", "keyboard", "cell phone", "microwave", "oven",
- "toaster", "sink", "refrigerator", "book", "clock",
- "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
- ]
- yolo_colors = [
- (56, 56, 255), (151, 157, 255), (31, 112, 255), (29, 178, 255),
- (49, 210, 207), (10, 249, 72), (23, 204, 146), (134, 219, 61),
- (52, 147, 26), (187, 212, 0), (168, 153, 44), (255, 194, 0),
- (147, 69, 52), (255, 115, 100), (236, 24, 0), (255, 56, 132),
- (133, 0, 82), (255, 56, 203), (200, 149, 255), (199, 55, 255)]
- def draw_detection(img, box, score, class_id):
-     x1, y1, x2, y2 = box
-     color = yolo_colors[class_id%20]
-     cv2.rectangle(img, (x1, y1),