NPU (HUAWEI Ascend)

Usage

Please refer to the MMCV build documentation to install MMCV on NPU devices.

To train a model on 8 NPUs on a single machine, run the following command:

bash tools/dist_train.sh configs/ssd/ssd300_coco.py 8

You can also train the model on a single NPU with the following command:

python tools/train.py configs/ssd/ssd300_coco.py

Model	box AP	mask AP	Config	Download
ssd300	25.6	---	config	log
ssd512	29.4	---	config	log
ssdlite-mbv2*	20.2	---	config	log
retinanet-r18	31.8	---	config	log
retinanet-r50	36.6	---	config	log
yolov3-608	34.7	---	config	log
yolox-s**	39.9	---	config	log
centernet-r18	26.1	---	config	log
fcos-r50*	36.1	---	config	log
solov2-r50	---	34.7	config	log

Notes:

Unless otherwise marked, the results on the NPU are the same as those on the GPU with FP32.
(*) The results of these models on the NPU are aligned with the results of mixed-precision training on the GPU, but are lower than the FP32 results. This is mainly related to the characteristics of the model itself under mixed-precision training; users may need to adjust the hyperparameters to achieve better results.
(**) The accuracy of yolox-s on the GPU with mixed precision is 40.1, with persistent_workers=True set in the data loader config by default. Some bugs on NPUs currently prevent the last few epochs from running, but the effect on accuracy is small and the difference can be ignored.
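
For orientation, the fragment below sketches where the persistent-workers option from the (**) note typically sits in a data loader config. It is a minimal sketch, not the shipped yolox-s config: the variable name `train_dataloader` and the keys `batch_size` and `num_workers` assume an MMDetection 3.x-style layout, and the numeric values are placeholders.

```python
# Hypothetical data loader config fragment (layout assumed, values are placeholders).
train_dataloader = dict(
    batch_size=8,             # placeholder value
    num_workers=4,            # placeholder value
    persistent_workers=True,  # keep worker processes alive between epochs
)
```

Older config layouts may keep the same flag under a different top-level key, so check the config file you actually train with.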

Introduction to optimization:

Replace per-element loop calculations with whole-batch calculations to reduce the number of instructions issued.
Replace index-based calculations with mask-based calculations, because the SIMD architecture is well suited to processing contiguous data.
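
The two optimizations above can be illustrated with a small sketch. It uses NumPy as a stand-in for the tensor operations run on the device, and the function names and toy data are hypothetical; it is not the code used in the Ascend configs, only an example of the general pattern.

```python
import numpy as np

def box_areas_loop(boxes):
    """Loop version: one small computation is issued per box."""
    areas = np.empty(len(boxes), dtype=boxes.dtype)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        areas[i] = (x2 - x1) * (y2 - y1)
    return areas

def box_areas_batched(boxes):
    """Batch version: the same result from a few whole-array operations."""
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

def positive_loss_indexed(loss, labels):
    """Index version: gathers positives into an array of data-dependent size."""
    pos_idx = np.nonzero(labels > 0)[0]
    return loss[pos_idx].sum()

def positive_loss_masked(loss, labels):
    """Mask version: fixed-shape elementwise multiply over contiguous data."""
    mask = (labels > 0).astype(loss.dtype)
    return (loss * mask).sum()

# Toy inputs (hypothetical): two boxes as (x1, y1, x2, y2), four per-sample losses.
boxes = np.array([[0.0, 0.0, 2.0, 3.0], [1.0, 1.0, 4.0, 5.0]])
loss = np.array([0.5, 1.0, 2.0, 4.0])
labels = np.array([0, 3, 0, 1])

print(box_areas_loop(boxes), box_areas_batched(boxes))
print(positive_loss_indexed(loss, labels), positive_loss_masked(loss, labels))
```

Both pairs of functions return the same values; the batched and masked forms simply keep shapes fixed and operate on whole arrays, which is the property the optimizations rely on.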

Model	Config	V100 iter time	910A iter time (before -> after optimization)
ascend-ssd300	config	0.165s/iter	0.383s/iter -> 0.13s/iter
ascend-retinanet-r18	config	0.567s/iter	0.780s/iter -> 0.420s/iter
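
To read the timing table as speedup factors, the short sketch below divides the iteration times. It assumes the arrow in the 910A column means "before optimization -> after optimization"; the numbers are copied from the table and the helper name `speedup` is made up for this example.

```python
# (model, V100 s/iter, 910A s/iter before, 910A s/iter after), copied from the table.
# Assumption: the arrow in the table separates pre- and post-optimization timings.
timings = [
    ("ascend-ssd300", 0.165, 0.383, 0.13),
    ("ascend-retinanet-r18", 0.567, 0.780, 0.420),
]

def speedup(before, after):
    """Ratio of iteration times; values above 1 mean 'after' is faster."""
    return before / after

for name, v100, before, after in timings:
    print(
        f"{name}: {speedup(before, after):.2f}x vs. unoptimized 910A, "
        f"{speedup(v100, after):.2f}x vs. V100"
    )
```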

All of the above models are provided by the Huawei Ascend group.