Benchmark and Model Zoo¶

Environment¶

Hardware¶

8 NVIDIA Tesla V100 GPUs
Intel Xeon 4114 CPU @ 2.20GHz

Software environment¶

Python 3.6 / 3.7
PyTorch 1.1
CUDA 9.0.176
CUDNN 7.0.4
NCCL 2.1.15

Mirror sites¶

We use AWS as the main site to host our model zoo and maintain a mirror on Aliyun. To download from the mirror, replace https://s3.ap-northeast-2.amazonaws.com/open-mmlab with https://open-mmlab.oss-cn-beijing.aliyuncs.com in the model URLs.
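The prefix swap described above can be sketched as a small helper. This is a minimal illustration, not part of mmdetection: the function name `to_mirror_url` and the example checkpoint path are hypothetical, and only the two URL prefixes come from the text.

```python
# The two prefixes are taken from the paragraph above.
AWS_PREFIX = "https://s3.ap-northeast-2.amazonaws.com/open-mmlab"
ALIYUN_PREFIX = "https://open-mmlab.oss-cn-beijing.aliyuncs.com"


def to_mirror_url(url: str) -> str:
    """Return the Aliyun mirror URL for an AWS-hosted model URL.

    URLs that do not start with the AWS prefix are returned unchanged.
    (Hypothetical helper, for illustration only.)
    """
    if url.startswith(AWS_PREFIX):
        return ALIYUN_PREFIX + url[len(AWS_PREFIX):]
    return url


# Hypothetical checkpoint path; real paths are given by the "model" links.
example = AWS_PREFIX + "/path/to/checkpoint.pth"
print(to_mirror_url(example))
```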

Common settings¶

All FPN baselines and RPN-C4 baselines were trained using 8 GPUs with a batch size of 16 (2 images per GPU). Other C4 baselines were trained using 8 GPUs with a batch size of 8 (1 image per GPU).
All models were trained on coco_2017_train and tested on coco_2017_val.
We use distributed training, and BN layer statistics are fixed.
We adopt the same training schedules as Detectron. 1x indicates 12 epochs and 2x indicates 24 epochs, which corresponds to slightly fewer iterations than Detectron; the difference is negligible.
All pytorch-style pretrained backbones on ImageNet are from PyTorch model zoo.
For fair comparison with other codebases, we report the GPU memory as the maximum value of torch.cuda.max_memory_allocated() for all 8 GPUs. Note that this value is usually less than what nvidia-smi shows.
We report the inference time as the overall time including data loading, network forwarding and post processing.
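To make the schedule remark concrete, the sketch below converts epochs into iterations for the batch sizes listed above. The training-set size (about 117k images) and the Detectron 1x length (90k iterations at batch size 16) are assumptions that do not appear in this document; substitute the actual values for your setup.

```python
import math


def iterations_for_epochs(num_epochs, num_images, num_gpus=8, imgs_per_gpu=2):
    """Total optimizer iterations for an epoch-based schedule.

    The effective batch size is num_gpus * imgs_per_gpu, as in the
    common settings above (8 x 2 = 16 or 8 x 1 = 8).
    """
    batch_size = num_gpus * imgs_per_gpu
    iters_per_epoch = math.ceil(num_images / batch_size)
    return num_epochs * iters_per_epoch


# Assumptions (not stated in this document):
ASSUMED_TRAIN_IMAGES = 117_000  # approximate size of coco_2017_train
ASSUMED_DETECTRON_1X = 90_000   # Detectron 1x iterations at batch size 16

ours_1x = iterations_for_epochs(12, ASSUMED_TRAIN_IMAGES)
print(f"1x (12 epochs): {ours_1x} iterations")
print(f"relative to the assumed Detectron 1x: {ours_1x / ASSUMED_DETECTRON_1X:.3f}")
```

Under these assumptions, 12 epochs come out a few percent shorter than the iteration-based schedule, which is consistent with the "slightly fewer iterations" remark.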

Baselines¶

More models with different backbones will be added to the model zoo.

RPN¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	AR1000	Download
R-50-C4	caffe	1x	-	-	20.5	51.1	model
R-50-C4	caffe	2x	2.2	0.17	20.3	52.2	model
R-50-C4	pytorch	1x	-	-	20.1	50.2	model
R-50-C4	pytorch	2x	-	-	20.0	51.1	model
R-50-FPN	caffe	1x	3.3	0.253	16.9	58.2	-
R-50-FPN	pytorch	1x	3.5	0.276	17.7	57.1	model
R-50-FPN	pytorch	2x	-	-	-	57.6	model
R-101-FPN	caffe	1x	5.2	0.379	13.9	59.4	-
R-101-FPN	pytorch	1x	5.4	0.396	14.4	58.6	model
R-101-FPN	pytorch	2x	-	-	-	59.1	model
X-101-32x4d-FPN	pytorch	1x	6.6	0.589	11.8	59.4	model
X-101-32x4d-FPN	pytorch	2x	-	-	-	59.9	model
X-101-64x4d-FPN	pytorch	1x	9.5	0.955	8.3	59.8	model
X-101-64x4d-FPN	pytorch	2x	-	-	-	60.0	model

Faster R-CNN¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	Download
R-50-C4	caffe	1x	-	-	9.5	34.9	model
R-50-C4	caffe	2x	4.0	0.39	9.3	36.5	model
R-50-C4	pytorch	1x	-	-	9.3	33.9	model
R-50-C4	pytorch	2x	-	-	9.4	35.9	model
R-50-FPN	caffe	1x	3.6	0.333	13.5	36.6	-
R-50-FPN	pytorch	1x	3.8	0.353	13.6	36.4	model
R-50-FPN	pytorch	2x	-	-	-	37.7	model
R-101-FPN	caffe	1x	5.5	0.465	11.5	38.8	-
R-101-FPN	pytorch	1x	5.7	0.474	11.9	38.5	model
R-101-FPN	pytorch	2x	-	-	-	39.4	model
X-101-32x4d-FPN	pytorch	1x	6.9	0.672	10.3	40.1	model
X-101-32x4d-FPN	pytorch	2x	-	-	-	40.4	model
X-101-64x4d-FPN	pytorch	1x	9.8	1.040	7.3	41.3	model
X-101-64x4d-FPN	pytorch	2x	-	-	-	40.7	model
HRNetV2p-W18	pytorch	1x	-	-	-	36.1	model
HRNetV2p-W18	pytorch	2x	-	-	-	38.3	model
HRNetV2p-W32	pytorch	1x	-	-	-	39.5	model
HRNetV2p-W32	pytorch	2x	-	-	-	40.6	model
HRNetV2p-W48	pytorch	1x	-	-	-	40.9	model
HRNetV2p-W48	pytorch	2x	-	-	-	41.5	model

Mask R-CNN¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	mask AP	Download
R-50-C4	caffe	1x	-	-	8.1	35.9	31.5	model
R-50-C4	caffe	2x	4.2	0.43	8.1	37.9	32.9	model
R-50-C4	pytorch	1x	-	-	7.9	35.1	31.2	model
R-50-C4	pytorch	2x	-	-	8.0	37.2	32.5	model
R-50-FPN	caffe	1x	3.8	0.430	10.2	37.4	34.3	-
R-50-FPN	pytorch	1x	3.9	0.453	10.6	37.3	34.2	model
R-50-FPN	pytorch	2x	-	-	-	38.5	35.1	model
R-101-FPN	caffe	1x	5.7	0.534	9.4	39.9	36.1	-
R-101-FPN	pytorch	1x	5.8	0.571	9.5	39.4	35.9	model
R-101-FPN	pytorch	2x	-	-	-	40.3	36.5	model
X-101-32x4d-FPN	pytorch	1x	7.1	0.759	8.3	41.1	37.1	model
X-101-32x4d-FPN	pytorch	2x	-	-	-	41.4	37.1	model
X-101-64x4d-FPN	pytorch	1x	10.0	1.102	6.5	42.1	38.0	model
X-101-64x4d-FPN	pytorch	2x	-	-	-	42.0	37.7	model
HRNetV2p-W18	pytorch	1x	-	-	-	37.3	34.2	model
HRNetV2p-W18	pytorch	2x	-	-	-	39.2	35.7	model
HRNetV2p-W32	pytorch	1x	-	-	-	40.7	36.8	model
HRNetV2p-W32	pytorch	2x	-	-	-	41.7	37.5	model
HRNetV2p-W48	pytorch	1x	-	-	-	42.4	38.1	model
HRNetV2p-W48	pytorch	2x	-	-	-	42.9	38.3	model

Fast R-CNN (with pre-computed proposals)¶

Backbone	Style	Type	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	mask AP	Download
R-50-C4	caffe	Faster	1x	-	-	6.7	35.0	-	model
R-50-C4	caffe	Faster	2x	3.8	0.34	6.6	36.4	-	model
R-50-C4	pytorch	Faster	1x	-	-	6.3	34.2	-	model
R-50-C4	pytorch	Faster	2x	-	-	6.1	35.8	-	model
R-50-FPN	caffe	Faster	1x	3.3	0.242	18.4	36.6	-	-
R-50-FPN	pytorch	Faster	1x	3.5	0.250	16.5	35.8	-	model
R-50-C4	caffe	Mask	1x	-	-	8.1	35.9	31.5	model
R-50-C4	caffe	Mask	2x	4.2	0.43	8.1	37.9	32.9	model
R-50-C4	pytorch	Mask	1x	-	-	7.9	35.1	31.2	model
R-50-C4	pytorch	Mask	2x	-	-	8.0	37.2	32.5	model
R-50-FPN	pytorch	Faster	2x	-	-	-	37.1	-	model
R-101-FPN	caffe	Faster	1x	5.2	0.355	14.4	38.6	-	-
R-101-FPN	pytorch	Faster	1x	5.4	0.388	13.2	38.1	-	model
R-101-FPN	pytorch	Faster	2x	-	-	-	38.8	-	model
R-50-FPN	caffe	Mask	1x	3.4	0.328	12.8	37.3	34.5	-
R-50-FPN	pytorch	Mask	1x	3.5	0.346	12.7	36.8	34.1	model
R-50-FPN	pytorch	Mask	2x	-	-	-	37.9	34.8	model
R-101-FPN	caffe	Mask	1x	5.2	0.429	11.2	39.4	36.1	-
R-101-FPN	pytorch	Mask	1x	5.4	0.462	10.9	38.9	35.8	model
R-101-FPN	pytorch	Mask	2x	-	-	-	39.9	36.4	model

RetinaNet¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	Download
R-50-FPN	caffe	1x	3.4	0.285	12.5	35.8	-
R-50-FPN	pytorch	1x	3.6	0.308	12.1	35.6	model
R-50-FPN	pytorch	2x	-	-	-	36.4	model
R-101-FPN	caffe	1x	5.3	0.410	10.4	37.8	-
R-101-FPN	pytorch	1x	5.5	0.429	10.9	37.7	model
R-101-FPN	pytorch	2x	-	-	-	38.1	model
X-101-32x4d-FPN	pytorch	1x	6.7	0.632	9.3	39.0	model
X-101-32x4d-FPN	pytorch	2x	-	-	-	39.3	model
X-101-64x4d-FPN	pytorch	1x	9.6	0.993	7.0	40.0	model
X-101-64x4d-FPN	pytorch	2x	-	-	-	39.6	model

Cascade R-CNN¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	Download
R-50-C4	caffe	1x	8.7	0.92	5.0	38.7	model
R-50-FPN	caffe	1x	3.9	0.464	10.9	40.5	-
R-50-FPN	pytorch	1x	4.1	0.455	11.9	40.4	model
R-50-FPN	pytorch	20e	-	-	-	41.1	model
R-101-FPN	caffe	1x	5.8	0.569	9.6	42.4	-
R-101-FPN	pytorch	1x	6.0	0.584	10.3	42.0	model
R-101-FPN	pytorch	20e	-	-	-	42.5	model
X-101-32x4d-FPN	pytorch	1x	7.2	0.770	8.9	43.6	model
X-101-32x4d-FPN	pytorch	20e	-	-	-	44.0	model
X-101-64x4d-FPN	pytorch	1x	10.0	1.133	6.7	44.5	model
X-101-64x4d-FPN	pytorch	20e	-	-	-	44.7	model
HRNetV2p-W18	pytorch	20e	-	-	-	41.2	model
HRNetV2p-W32	pytorch	20e	-	-	-	43.7	model
HRNetV2p-W48	pytorch	20e	-	-	-	44.6	model

Cascade Mask R-CNN¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	mask AP	Download
R-50-C4	caffe	1x	9.1	0.99	4.5	39.3	32.8	model
R-50-FPN	caffe	1x	5.1	0.692	7.6	40.9	35.5	-
R-50-FPN	pytorch	1x	5.3	0.683	7.4	41.2	35.7	model
R-50-FPN	pytorch	20e	-	-	-	42.3	36.6	model
R-101-FPN	caffe	1x	7.0	0.803	7.2	43.1	37.2	-
R-101-FPN	pytorch	1x	7.2	0.807	6.8	42.6	37.0	model
R-101-FPN	pytorch	20e	-	-	-	43.3	37.6	model
X-101-32x4d-FPN	pytorch	1x	8.4	0.976	6.6	44.4	38.2	model
X-101-32x4d-FPN	pytorch	20e	-	-	-	44.7	38.6	model
X-101-64x4d-FPN	pytorch	1x	11.4	1.33	5.3	45.4	39.1	model
X-101-64x4d-FPN	pytorch	20e	-	-	-	45.7	39.4	model
HRNetV2p-W18	pytorch	20e	-	-	-	41.9	36.4	model
HRNetV2p-W32	pytorch	20e	-	-	-	44.5	38.5	model
HRNetV2p-W48	pytorch	20e	-	-	-	46.0	39.5	model

Notes:

The 20e schedule in Cascade (Mask) R-CNN indicates decreasing the lr at 16 and 19 epochs, with a total of 20 epochs.
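As a rough illustration of the 20e schedule, the sketch below computes a step-decayed learning rate with drops at epochs 16 and 19. The base learning rate (0.02), the decay factor (0.1), and the 0-based epoch convention are assumptions for illustration; the note above only specifies the step epochs and the total length.

```python
def lr_at_epoch(epoch, base_lr=0.02, steps=(16, 19), gamma=0.1):
    """Learning rate for a 0-based epoch index under step decay.

    The rate is multiplied by gamma once for every step epoch that has
    already been reached. base_lr and gamma are assumed values.
    """
    num_decays = sum(1 for step in steps if epoch >= step)
    return base_lr * gamma ** num_decays


TOTAL_EPOCHS = 20  # "20e"
for epoch in range(TOTAL_EPOCHS):
    print(epoch, f"{lr_at_epoch(epoch):.5f}")
```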

Hybrid Task Cascade (HTC)¶

Backbone	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	mask AP	Download
R-50-FPN	pytorch	1x	7.4	0.936	4.1	42.1	37.3	model
R-50-FPN	pytorch	20e	-	-	-	43.2	38.1	model
R-101-FPN	pytorch	20e	9.3	1.051	4.0	44.9	39.4	model
X-101-32x4d-FPN	pytorch	20e	5.8	0.769	3.8	46.1	40.3	model
X-101-64x4d-FPN	pytorch	20e	7.5	1.120	3.5	46.9	40.8	model
HRNetV2p-W18	pytorch	20e	-	-	-	43.1	37.9	model
HRNetV2p-W32	pytorch	20e	-	-	-	45.3	39.6	model
HRNetV2p-W48	pytorch	20e	-	-	-	46.8	40.7	model
HRNetV2p-W48	pytorch	28e	-	-	-	47.0	41.0	model

Notes:

Please refer to Hybrid Task Cascade for details and a more powerful model (50.7/43.9).

SSD¶

Backbone	Size	Style	Lr schd	Mem (GB)	Train time (s/iter)	Inf time (fps)	box AP	Download
VGG16	300	caffe	120e	3.5	0.256	25.9 / 34.6	25.7	model
VGG16	512	caffe	120e	7.6	0.412	20.7 / 25.4	29.3	model

Notes:

cudnn.benchmark is set to True for SSD training and testing.
Inference time is reported for batch size = 1 and batch size = 8.
The speeds on COCO and VOC differ because of different model parameters and NMS settings.

Group Normalization (GN)¶

Please refer to Group Normalization for details.

Weight Standardization¶

Please refer to Weight Standardization for details.

Deformable Convolution v2¶

Please refer to Deformable Convolutional Networks for details.

CARAFE: Content-Aware ReAssembly of FEatures¶

Please refer to CARAFE for details.

Instaboost¶

Please refer to Instaboost for details.

Libra R-CNN¶

Please refer to Libra R-CNN for details.

Guided Anchoring¶

Please refer to Guided Anchoring for details.

FCOS¶

Please refer to FCOS for details.

FoveaBox¶

Please refer to FoveaBox for details.

RepPoints¶

Please refer to RepPoints for details.

FreeAnchor¶

Please refer to FreeAnchor for details.

Grid R-CNN (plus)¶

Please refer to Grid R-CNN for details.

GHM¶

Please refer to GHM for details.

GCNet¶

Please refer to GCNet for details.

HRNet¶

Please refer to HRNet for details.

Mask Scoring R-CNN¶

Please refer to Mask Scoring R-CNN for details.

Train from Scratch¶

Please refer to Rethinking ImageNet Pre-training for details.

NAS-FPN¶

Please refer to NAS-FPN for details.

ATSS¶

Please refer to ATSS for details.

Other datasets¶

We also benchmark some methods on PASCAL VOC, Cityscapes and WIDER FACE.

Comparison with Detectron and maskrcnn-benchmark¶

We compare mmdetection with Detectron and maskrcnn-benchmark. The backbone used is R-50-FPN.

In general, mmdetection has 3 advantages over Detectron.

Higher performance (especially in terms of mask AP)
Faster training speed
Higher memory efficiency

Performance¶

Detectron and maskrcnn-benchmark use caffe-style ResNet as the backbone. We report results using both caffe-style (weights converted from here) and pytorch-style (weights from the official model zoo) ResNet backbones, written as pytorch-style results / caffe-style results.

We find that pytorch-style ResNet usually converges more slowly than caffe-style ResNet, leading to slightly lower results with the 1x schedule, but the final results with the 2x schedule are higher.

Type	Lr schd	Detectron	maskrcnn-benchmark	mmdetection
RPN	1x	57.2	-	57.1 / 58.2
RPN	2x	-	-	57.6 / -
Faster R-CNN	1x	36.7	36.8	36.4 / 36.6
Faster R-CNN	2x	37.9	-	37.7 / -
Mask R-CNN	1x	37.7 & 33.9	37.8 & 34.2	37.3 & 34.2 / 37.4 & 34.3
Mask R-CNN	2x	38.6 & 34.5	-	38.5 & 35.1 / -
Fast R-CNN	1x	36.4	-	35.8 / 36.6
Fast R-CNN	2x	36.8	-	37.1 / -
Fast R-CNN (w/mask)	1x	37.3 & 33.7	-	36.8 & 34.1 / 37.3 & 34.5
Fast R-CNN (w/mask)	2x	37.7 & 34.0	-	37.9 & 34.8 / -

Training Speed¶

The training speed is measured in s/iter. The lower, the better.

Type	Detectron (P100¹)	maskrcnn-benchmark (V100)	mmdetection (V100²)
RPN	0.416	-	0.253
Faster R-CNN	0.544	0.353	0.333
Mask R-CNN	0.889	0.454	0.430
Fast R-CNN	0.285	-	0.242
Fast R-CNN (w/mask)	0.377	-	0.328

*1. Facebook’s Big Basin servers (P100/V100) are slightly faster than the servers we use. mmdetection can also run slightly faster on FB’s servers.

*2. For fair comparison, we list the caffe-style results here.
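For a quick relative reading of the table above, the sketch below divides each baseline's s/iter by the mmdetection value. The numbers are copied from the table; the helper name `speedup` is hypothetical. Since the Detectron figures were measured on P100 GPUs and the others on V100 GPUs, the ratios against Detectron mix hardware and software effects and should be read with that caveat.

```python
# s/iter from the table: (Detectron, maskrcnn-benchmark, mmdetection).
# None marks entries reported as "-".
TRAIN_TIME = {
    "RPN": (0.416, None, 0.253),
    "Faster R-CNN": (0.544, 0.353, 0.333),
    "Mask R-CNN": (0.889, 0.454, 0.430),
    "Fast R-CNN": (0.285, None, 0.242),
    "Fast R-CNN (w/mask)": (0.377, None, 0.328),
}


def speedup(baseline_s_per_iter, ours_s_per_iter):
    """Ratio of iteration times; values above 1 mean the second is faster."""
    return baseline_s_per_iter / ours_s_per_iter


for name, (detectron, maskrcnn_bm, mmdet) in TRAIN_TIME.items():
    parts = [f"{speedup(detectron, mmdet):.2f}x vs Detectron"]
    if maskrcnn_bm is not None:
        parts.append(f"{speedup(maskrcnn_bm, mmdet):.2f}x vs maskrcnn-benchmark")
    print(f"{name}: " + ", ".join(parts))
```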

Inference Speed¶

The inference speed is measured with fps (img/s) on a single GPU. The higher, the better.

Type	Detectron (P100)	maskrcnn-benchmark (V100)	mmdetection (V100)
RPN	12.5	-	16.9
Faster R-CNN	10.3	7.9	13.5
Mask R-CNN	8.5	7.7	10.2
Fast R-CNN	12.5	-	18.4
Fast R-CNN (w/mask)	9.9	-	12.8
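The following is a minimal sketch of how an end-to-end fps figure of this kind could be measured, covering data loading, the forward pass, and post-processing in one timed loop. It is not the benchmarking script used for the table; `load`, `forward`, and `postprocess` are hypothetical callables supplied by the caller.

```python
import time


def measure_fps(samples, load, forward, postprocess):
    """Images per second over the whole pipeline.

    Timing spans data loading, network forwarding and post-processing.
    With a CUDA model, the device should be synchronized before reading
    the clock, because GPU kernels run asynchronously.
    """
    count = 0
    start = time.perf_counter()
    for sample in samples:
        postprocess(forward(load(sample)))
        count += 1
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        return float("inf")
    return count / elapsed


# Toy stand-ins for the three stages, only to show the call shape.
fps = measure_fps(range(100), lambda s: s, lambda x: x * 2, lambda y: y)
print(f"{fps:.1f} img/s")
```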

Training memory¶

Type	Detectron	maskrcnn-benchmark	mmdetection
RPN	6.4	-	3.3
Faster R-CNN	7.2	4.4	3.6
Mask R-CNN	8.6	5.2	3.8
Fast R-CNN	6.0	-	3.3
Fast R-CNN (w/mask)	7.9	-	3.4

maskrcnn-benchmark and mmdetection are clearly more memory efficient than Detectron, mainly thanks to PyTorch itself. We also apply some memory optimizations to reduce usage further.

Note that Caffe2 and PyTorch provide different APIs, with different implementations, for obtaining memory usage. For all codebases, nvidia-smi shows a larger memory usage than the numbers reported in the table above.
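For the PyTorch side, the reported figure is the maximum of torch.cuda.max_memory_allocated() over all GPUs. The sketch below shows one way to compute that from per-GPU byte counts. The helper names are hypothetical, and the conversion 1 GB = 1024^3 bytes is an assumption. In a distributed run each process usually sees only its own GPU, so the per-GPU values would have to be gathered across processes first.

```python
def peak_memory_gb(per_gpu_bytes):
    """Maximum peak allocation across GPUs, in GB (assuming 1 GB = 1024**3 bytes)."""
    return max(per_gpu_bytes) / 1024 ** 3


def collect_per_gpu_bytes():
    """Peak allocated bytes for every GPU visible to this process.

    Requires PyTorch with CUDA; imported lazily so that the rest of the
    sketch runs without it.
    """
    import torch

    return [
        torch.cuda.max_memory_allocated(device)
        for device in range(torch.cuda.device_count())
    ]


# Example with made-up byte counts for two GPUs.
print(peak_memory_gb([3 * 1024 ** 3, 2 * 1024 ** 3]))
```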