Reference for Efficiently Deploying the YOLOv8 Object Detection Algorithm on Horizon's Bernoulli2-Architecture BPU (PTQ Scheme): RDK X3 (Sunrise X3 Pi) at 30 fps

Posted on 2026-4-24 09:10:12

Using the RDK X3 as an example: modify the Head, run a fast 8 ms Python post-processing program, and hold a steady 30 fps

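Building on Horizon's modifications to the YOLOv8s Backbone, this post proposes a post-processing approach for deploying YOLOv8 on Horizon's Bernoulli2-architecture BPU. It uses 640×640 resolution and 80-class weights pretrained on the COCO dataset. The BPU accelerates the Backbone and Neck, with an inference time of about 62 ms, and the numpy-optimized post-processing takes about 8 ms. With an efficient Dataflow that makes full use of the X3's computing resources, Python multi-process inference plus Web streaming yields a 30 fps real-time object detection demo. All programs in this post are open source.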
1. Preface

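The result of the serial program design is shown in the figure below. Setup: RDK X3, 4×A53@1.8GHz, 2×Bernoulli2 BPU@5TOPS, YOLOv8s with a fine-tuned Backbone, 11.2 million parameters, 640x640 resolution, 80 classes, single-core model, single CPU, single frame, single thread, pure numpy vectorized post-processing.
The program reads an image from local storage with OpenCV, converts it to NCHW input for the bin model, runs the numpy post-processing, and finally draws the detection results.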

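The parallel program design drops OpenCV's BGR8 Mat and uses nv12 data as the key data format. The bin model is configured for nv12 input, inference runs in multiple Python processes, and TROS tools handle visualization. With this design, YOLOv8 reaches 30 fps on the X3.
[RDK X3 development board running YOLOv8s inference, 30 fps, Python multiprocessing] https://www.bilibili.com/video/BV1rz421B7jL/?share_source=copy_web&vd_source=5b24829d168bb2d02896ddeeaa6a20d2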
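
For readers unfamiliar with nv12, the memory layout can be illustrated with a small numpy sketch. It assumes the frame is already available as a planar I420 buffer (Y plane, then U, then V); the function name i420_to_nv12 is hypothetical and does not come from the original post.

```python
import numpy as np

def i420_to_nv12(i420: np.ndarray, height: int, width: int) -> np.ndarray:
    """Rearrange a flat I420 buffer (Y, U, V planes) into NV12 (Y plane + interleaved UV)."""
    area = height * width
    flat = i420.reshape(-1)
    assert flat.size == area * 3 // 2, "unexpected buffer size for I420"
    y = flat[:area]
    u = flat[area:area + area // 4]
    v = flat[area + area // 4:]
    nv12 = np.empty_like(flat)
    nv12[:area] = y          # luma plane is copied unchanged
    nv12[area::2] = u        # U samples at even offsets of the chroma plane
    nv12[area + 1::2] = v    # V samples at odd offsets of the chroma plane
    return nv12

# Tiny 4x4 example: 16 luma bytes, 4 U bytes, 4 V bytes.
h, w = 4, 4
i420 = np.concatenate([np.arange(16), np.full(4, 100), np.full(4, 200)]).astype(np.uint8)
nv12 = i420_to_nv12(i420, h, w)
print(nv12[16:])  # chroma plane alternates U, V
```

A 640×640 nv12 frame occupies 640 × 640 × 3 / 2 = 614400 bytes, half the size of the same frame in BGR8.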
2. Post-processing optimization

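As shown in the figure below, the operators in the Backbone and Neck are all accelerated well by the Bernoulli2-architecture BPU.
The Head is not accelerated well by the BPU, so it has to be taken out entirely and implemented on the CPU as part of post-processing. Because deployment only involves forward propagation, there is no need to compute the information of all 8400 grid cells. The main optimization is to filter first and compute afterwards. The computation covers the Sigmoid of the Classify part, the DFL computation of the Bounding Box part (SoftMax followed by a Conv that takes the expectation), and the feature decoding (dist2bbox, ltrb2xyxy).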
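
The filter-first idea can be sketched for a single feature level as follows. This is a minimal illustration, not the author's program: it assumes the head outputs are already float arrays in NCHW layout, and the function name decode_scale is hypothetical. The 64 bbox channels are treated as 4 sides × 16 DFL bins, and grid-cell centres use a 0.5 offset, as in the Ultralytics implementation.

```python
import numpy as np

REG_MAX = 16  # DFL bins per box side (64 channels = 4 sides x 16 bins)

def decode_scale(bbox, cls, stride, conf=0.5):
    """Filter one feature level by class score first, then decode only the survivors.

    bbox: float array (1, 64, H, W); cls: float array (1, 80, H, W) of raw logits.
    Returns (xyxy, scores, class_ids) in input-image pixels.
    """
    _, _, h, w = cls.shape
    cls = cls.reshape(cls.shape[1], -1).T      # (H*W, 80)
    bbox = bbox.reshape(bbox.shape[1], -1).T   # (H*W, 64)

    # 1) Filter on raw logits: sigmoid(x) >= conf  <=>  x >= -log(1/conf - 1).
    conf_inverse = -np.log(1.0 / conf - 1.0)
    ids = np.argmax(cls, axis=1)
    max_logit = cls[np.arange(cls.shape[0]), ids]
    keep = np.flatnonzero(max_logit >= conf_inverse)

    # 2) Sigmoid only on the kept scores.
    scores = 1.0 / (1.0 + np.exp(-max_logit[keep]))

    # 3) DFL: softmax over the 16 bins, then the expectation sum(p_k * k).
    logits = bbox[keep].reshape(-1, 4, REG_MAX)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=-1, keepdims=True)
    ltrb = prob @ np.arange(REG_MAX, dtype=np.float32)    # (n, 4)

    # 4) dist2bbox + ltrb2xyxy around each grid-cell centre, scaled by the stride.
    cx = (keep % w) + 0.5
    cy = (keep // w) + 0.5
    xyxy = np.stack([cx - ltrb[:, 0], cy - ltrb[:, 1],
                     cx + ltrb[:, 2], cy + ltrb[:, 3]], axis=1) * stride
    return xyxy, scores, ids[keep]

# Synthetic 20x20 level (stride 32) with one confident cell at row 5, column 7.
cls = np.full((1, 80, 20, 20), -10.0, dtype=np.float32)
bbox = np.zeros((1, 64, 20, 20), dtype=np.float32)
cls[0, 33, 5, 7] = 3.0
boxes, scores, class_ids = decode_scale(bbox, cls, stride=32)
print(boxes, scores, class_ids)
```

Only one of the 400 cells passes the threshold here, so the Sigmoid, SoftMax and decoding each run on a single row instead of on the whole feature map.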

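For explanations of everything except NMS, please refer to the author's YOLOv10 article.
NMS: removes duplicate detections of the same object to obtain the final detection results, which include the class (id), the score, and the position (xyxy).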
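
As an illustration of the NMS step, here is a minimal greedy, class-agnostic sketch in numpy. It is not taken from the original program, which may apply NMS per class or call a library routine such as cv2.dnn.NMSBoxes instead; the function name nms is hypothetical.

```python
import numpy as np

def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS. boxes: (n, 4) in xyxy, scores: (n,). Returns kept indices, best first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        iw = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]))
        ih = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thres]   # drop boxes that overlap the best one too much
    return keep

boxes = np.array([[0, 0, 100, 100], [5, 5, 105, 105], [200, 200, 300, 300]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the distant third box is kept.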
3. Step-by-step reference

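Note: if you hit errors such as No such file or directory, No module named "xxx", or command not found, check your setup carefully. Do not copy and run the commands line by line. If the modification process is unclear, please go to the Horizon developer community and start with YOLOv5.
Download the repository containing the Horizon-optimized Backbone, and set up the environment by following the official YOLOv8 documentation.

The weight files are stored in the model_zoo repository of the HorizonRDK organization. For the modification steps and the performance and accuracy data, see: [Frontier Algorithms] Horizon's Adaptation of YOLOv8 -v1.0.0 (horizon.cc)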

$ git clone https://github.com/HorizonRDK/model_zoo.git

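Uninstall the installed ultralytics package and its yolo command-line tools, so that edits made directly in the ./ultralytics/ultralytics directory take effect.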

$ conda list | grep ultralytics
$ pip list | grep ultralytics # or
# If it is installed, uninstall it
$ conda uninstall ultralytics
$ pip uninstall ultralytics # or

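Modify the output head of Detect so that the Bounding Box information and the Classify information of the three feature levels are output separately, for a total of 6 output heads.

File: ./ultralytics/ultralytics/nn/modules/head.py, around line 43. Replace the forward function of the Detect class with the following: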

def forward(self, x):
    bbox = []
    cls = []
    for i in range(self.nl):
        bbox.append(self.cv2[i](x[i]))
        cls.append(self.cv3[i](x[i]))
    return (bbox, cls)

Run the following Python script. If it reports No module named onnxsim, install that package.

from ultralytics import YOLO
YOLO('yolov8s.pt').export(format='onnx', simplify=True, opset=11)

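Following the Horizon OpenExplorer (天工开物) toolchain manual and the references in the OE package, check the model. Once all operators are confirmed to run on the BPU, compile it: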

(bpu) $ hb_mapper checker --model-type onnx --march bernoulli2 --model yolov8s.onnx
(bpu) $ hb_mapper makertbin --model-type onnx --config ./yolov8s.yaml

If you need to use OpenCV, NCHW input is recommended. The yaml configuration file for compilation:

model_parameters:
  onnx_model: './yolov8s_bernoulli2.onnx'
  march: "bernoulli2"
  layer_out_dump: False
  working_dir: 'yolov8s_bernoulli2_NCHW'
  output_model_file_prefix: 'yolov8s_bernoulli2_NCHW'
  # remove_node_type: "Dequantize;"
input_parameters:
  input_name: ""
  input_type_rt: 'rgb'
  input_layout_rt: 'NCHW'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
calibration_parameters:
  cal_data_dir: './calibration_data_rgb_f32'
  cal_data_type: 'float32'
compiler_parameters:
  compile_mode: 'latency'
  debug: False
  optimize_level: 'O3'
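
The configuration expects float32 RGB NCHW calibration samples, and scale_value is 1/255. A minimal sketch of preparing one sample is shown below. It assumes the image has already been resized to 640×640, and that samples are stored unscaled in [0, 255] because the data_scale normalization is applied by the compiled model; please confirm this against the toolchain manual. The function name and the output file name are hypothetical.

```python
import numpy as np

def to_calibration_tensor(img_bgr: np.ndarray) -> np.ndarray:
    """HWC uint8 BGR image (already 640x640) -> NCHW float32 RGB tensor in [0, 255]."""
    assert img_bgr.shape == (640, 640, 3) and img_bgr.dtype == np.uint8
    rgb = img_bgr[:, :, ::-1]              # BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))     # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis], dtype=np.float32)

img = np.zeros((640, 640, 3), dtype=np.uint8)
img[..., 0] = 255                          # pure blue in BGR order
tensor = to_calibration_tensor(img)
print(tensor.shape, tensor.dtype, tensor[0, :, 0, 0])
# tensor.tofile("./calibration_data_rgb_f32/sample_0.rgb")  # hypothetical file name

SCALE = 0.003921568627451                  # scale_value from the yaml
print(abs(SCALE - 1 / 255) < 1e-12)        # scale_value is 1/255
```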

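If you are detecting on a real-time video stream, a DataFlow design built around nv12 is recommended, used together with hobot_dnn or with the hobot_py_dnn from this post. The yaml configuration file for compilation: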

model_parameters:
  onnx_model: './yolov8s_bernoulli2.onnx'
  march: "bernoulli2"
  layer_out_dump: False
  working_dir: 'yolov8s_bernoulli2_nv12'
  output_model_file_prefix: 'yolov8s_bernoulli2_nv12'
input_parameters:
  input_name: ""
  input_type_rt: 'nv12'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
calibration_parameters:
  cal_data_dir: './calibration_data_rgb_f32'
  cal_data_type: 'float32'
compiler_parameters:
  compile_mode: 'latency'
  debug: False
  optimize_level: 'O3'

Find the dequantize node names of the 3 bbox output heads

In the log printed by hb_mapper makertbin, the three outputs with shapes [1, 64, 80, 80], [1, 64, 40, 40], and [1, 64, 20, 20] are named output0, 384, and 394.

ONNX IR version: 6
Opset version: ['ai.onnx v11', 'horizon v1']
Producer: pytorch v2.2.0
Domain: None
Model version: None
Graph input:
images: shape=[1, 3, 640, 640], dtype=FLOAT32
Graph output:
output0: shape=[1, 64, 80, 80], dtype=FLOAT32
384: shape=[1, 64, 40, 40], dtype=FLOAT32
394: shape=[1, 64, 20, 20], dtype=FLOAT32
379: shape=[1, 80, 80, 80], dtype=FLOAT32
389: shape=[1, 80, 40, 40], dtype=FLOAT32
399: shape=[1, 80, 20, 20], dtype=FLOAT32
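
The shapes in this log can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard YOLOv8 strides of 8, 16 and 32 and 16 DFL bins per box side; it also shows where the figure of 8400 grid cells comes from.

```python
IMG = 640
STRIDES = (8, 16, 32)
REG_MAX = 16        # DFL bins per side, so bbox heads have 4 * 16 = 64 channels
NUM_CLASSES = 80

grids = [IMG // s for s in STRIDES]   # feature-map side length per level
cells = [g * g for g in grids]        # grid cells per level
print(grids, cells, sum(cells))
for g in grids:
    # bbox head shape, then classify head shape, for this level
    print([1, 4 * REG_MAX, g, g], [1, NUM_CLASSES, g, g])
```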

Remove the dequantize nodes of the 3 bbox output heads

Enter the directory that holds the compilation output

$ cd yolov8s_bernoulli2

List the dequantize nodes that can be removed

$ hb_model_modifier yolov8s_bernoulli2_nv12.bin
$ hb_model_modifier yolov8s_bernoulli2_NCHW.bin

The output log is shown below. The three nodes to remove are output0_HzDequantize, 384_HzDequantize, and 394_HzDequantize.

2024-06-08 06:53:54,548 INFO log will be stored in /ws/yolov8s_bernoulli2/hb_model_modifier.log
2024-06-08 06:53:54,566 INFO Nodes that can be deleted: ['output0_HzDequantize', '379_HzDequantize', '384_HzDequantize', '389_HzDequantize', '394_HzDequantize', '399_HzDequantize']

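Use the following commands to remove the three dequantize nodes above. Note that these names may differ in your own export, so confirm them carefully.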

$ hb_model_modifier yolov8s_bernoulli2_nv12.bin -r output0_HzDequantize -r 384_HzDequantize -r 394_HzDequantize
$ hb_model_modifier yolov8s_bernoulli2_NCHW.bin -r output0_HzDequantize -r 384_HzDequantize -r 394_HzDequantize

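Board-side performance test

Copy the bin model to the board with scp or a similar tool, then test performance with the command below. A single thread reaches roughly 17 fps and drives one BPU core to 100%. Two threads reach 34 fps with both BPU cores at a combined 200%. Inference alone exceeds 30 fps and is not the bottleneck, so the focus next is on optimizing post-processing.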

hrt_model_exec perf --model_file yolov8s_bernoulli2.bin \
--model_name="" \
--core_id=0 \
--frame_count=200 \
--perf_time=0 \
--thread_num=2 \
--profile_path="."
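
The timings quoted in this post can be combined into a rough frame-time budget. The sketch below only does arithmetic on those numbers and ignores pre-processing and drawing, so the serial figure is an upper bound, not a measurement.

```python
INFER_MS = 62.0    # BPU inference time quoted in this post
POST_MS = 8.0      # numpy post-processing time quoted in this post
TARGET_FPS = 30

budget_ms = 1000.0 / TARGET_FPS
serial_fps = 1000.0 / (INFER_MS + POST_MS)
print(f"frame budget at {TARGET_FPS} fps: {budget_ms:.1f} ms")
print(f"serial upper bound: {serial_fps:.1f} fps")
# Two BPU threads measured at 34 fps in total leave this interval per frame:
print(f"per-frame interval at 34 fps: {1000.0 / 34:.1f} ms")
```

Running inference and post-processing back to back cannot reach 30 fps with these timings, which is why the 30 fps demo relies on two BPU cores and multiple Python processes.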

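4. Serial program design

When using the program below, remember to change the image and model file paths, and install any missing packages with pip install. Note that this program uses the model with NCHW input.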

#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import cv2
import numpy as np
from scipy.special import softmax
from time import time
from hobot_dnn import pyeasy_dnn as dnn
img_path = "kite.jpg"
result_save_path = "kite.result.jpg"
quantize_model_path = "./yolov8s_bernoulli2_NCHW_modified.bin"
input_image_size = 640
conf=0.5
iou=0.5
conf_inverse = -np.log(1/conf - 1)
print("sigmoid_inverse threshold = %.2f"%conf_inverse)
# Some constants and functions
coco_names = [
"person", "bicycle", "car", "motorcycle", "airplane",
"bus", "train", "truck", "boat", "traffic light",
"fire hydrant", "stop sign", "parking meter", "bench", "bird",
"cat", "dog", "horse", "sheep", "cow",
"elephant", "bear", "zebra", "giraffe", "backpack",
"umbrella", "handbag", "tie", "suitcase", "frisbee",
"skis", "snowboard", "sports ball", "kite", "baseball bat",
"baseball glove", "skateboard", "surfboard", "tennis racket", "bottle",
"wine glass", "cup", "fork", "knife", "spoon",
"bowl", "banana", "apple", "sandwich", "orange",
"broccoli", "carrot", "hot dog", "pizza", "donut",
"cake", "chair", "couch", "potted plant", "bed",
"dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven",
"toaster", "sink", "refrigerator", "book", "clock",
"vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]
yolo_colors = [
(56, 56, 255), (151, 157, 255), (31, 112, 255), (29, 178, 255),
(49, 210, 207), (10, 249, 72), (23, 204, 146), (134, 219, 61),
(52, 147, 26), (187, 212, 0), (168, 153, 44), (255, 194, 0),
(147, 69, 52), (255, 115, 100), (236, 24, 0), (255, 56, 132),
(133, 0, 82), (255, 56, 203), (200, 149, 255), (199, 55, 255)]
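def draw_detection(img, box, score, class_id):
    x1, y1, x2, y2 = box
    color = yolo_colors[class_id%20]
    # The source listing is cut off inside this call; the closing arguments are an assumed completion.
    cv2.rectangle(img, (x1, y1), (x2, y2), color, 2)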
