API Reference

mmdet.apis

mmdet.core

anchor

class mmdet.core.anchor.AnchorGenerator(strides, ratios, scales=None, base_sizes=None, scale_major=True, octave_base_scale=None, scales_per_octave=None, centers=None, center_offset=0.0)[source]

Standard anchor generator for 2D anchor-based detectors.

Parameters:
  • strides (list[int] | list[tuple[int, int]]) – Strides of anchors in multiple feature levels in order (w, h).
  • ratios (list[float]) – The list of ratios between the height and width of anchors in a single level.
  • scales (list[int] | None) – Anchor scales for anchors in a single level. It cannot be set at the same time as octave_base_scale and scales_per_octave.
  • base_sizes (list[int] | None) – The basic sizes of anchors in multiple levels. If None is given, strides will be used as base_sizes. (If strides are non-square, the shortest stride is taken.)
  • scale_major (bool) – Whether to multiply scales first when generating base anchors. If true, the anchors in the same row will have the same scales. By default it is True in V2.0.
  • octave_base_scale (int) – The base scale of octave.
  • scales_per_octave (int) – Number of scales for each octave. octave_base_scale and scales_per_octave are usually used in RetinaNet, and scales should be None when they are set.
  • centers (list[tuple[float, float]] | None) – The centers of the anchor relative to the feature grid center in multiple feature levels. By default it is set to None and not used. If a list of tuples of floats is given, it will be used to shift the centers of anchors.
  • center_offset (float) – The offset of center in proportion to anchors’ width and height. By default it is 0 in V2.0.

Examples

>>> from mmdet.core import AnchorGenerator
>>> self = AnchorGenerator([16], [1.], [1.], [9])
>>> all_anchors = self.grid_anchors([(2, 2)], device='cpu')
>>> print(all_anchors)
[tensor([[-4.5000, -4.5000,  4.5000,  4.5000],
        [11.5000, -4.5000, 20.5000,  4.5000],
        [-4.5000, 11.5000,  4.5000, 20.5000],
        [11.5000, 11.5000, 20.5000, 20.5000]])]
>>> self = AnchorGenerator([16, 32], [1.], [1.], [9, 18])
>>> all_anchors = self.grid_anchors([(2, 2), (1, 1)], device='cpu')
>>> print(all_anchors)
[tensor([[-4.5000, -4.5000,  4.5000,  4.5000],
        [11.5000, -4.5000, 20.5000,  4.5000],
        [-4.5000, 11.5000,  4.5000, 20.5000],
        [11.5000, 11.5000, 20.5000, 20.5000]]),
 tensor([[-9., -9., 9., 9.]])]
gen_base_anchors()[source]

Generate base anchors.

Returns:Base anchors of a feature grid in multiple feature levels.
Return type:list(torch.Tensor)
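
A quick sketch of the return value, reusing the generator from the class example above (the single base anchor of size 9 is centered at the origin, matching the first grid anchor shown earlier):

>>> from mmdet.core import AnchorGenerator
>>> self = AnchorGenerator([16], [1.], [1.], [9])
>>> self.gen_base_anchors()
[tensor([[-4.5000, -4.5000,  4.5000,  4.5000]])]
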
gen_single_level_base_anchors(base_size, scales, ratios, center=None)[source]

Generate base anchors of a single level.

Parameters:
  • base_size (int | float) – Basic size of an anchor.
  • scales (torch.Tensor) – Scales of the anchor.
  • ratios (torch.Tensor) – The ratio between the height and width of anchors in a single level.
  • center (tuple[float], optional) – The center of the base anchor related to a single feature grid. Defaults to None.
Returns:

Anchors in a single-level feature map.

Return type:

torch.Tensor

grid_anchors(featmap_sizes, device='cuda')[source]

Generate grid anchors in multiple feature levels.

Parameters:
  • featmap_sizes (list[tuple]) – List of feature map sizes in multiple feature levels.
  • device (str) – Device where the anchors will be put.
Returns:

Anchors in multiple feature levels. The sizes of each tensor should be [N, 4], where N = width * height * num_base_anchors, width and height are the sizes of the corresponding feature level, num_base_anchors is the number of anchors for that level.

Return type:

list[torch.Tensor]

num_base_anchors

total number of base anchors in a feature grid

Type:list[int]
num_levels

number of feature levels that the generator will be applied to

Type:int
single_level_grid_anchors(base_anchors, featmap_size, stride=(16, 16), device='cuda')[source]

Generate grid anchors of a single level.

Note

This function is usually called by method self.grid_anchors.

Parameters:
  • base_anchors (torch.Tensor) – The base anchors of a feature grid.
  • featmap_size (tuple[int]) – Size of the feature maps.
  • stride (tuple[int], optional) – Stride of the feature map in order (w, h). Defaults to (16, 16).
  • device (str, optional) – Device the tensor will be put on. Defaults to ‘cuda’.
Returns:

Anchors in the overall feature maps.

Return type:

torch.Tensor

single_level_valid_flags(featmap_size, valid_size, num_base_anchors, device='cuda')[source]

Generate the valid flags of anchors in a single feature map.

Parameters:
  • featmap_size (tuple[int]) – The size of feature maps.
  • valid_size (tuple[int]) – The valid size of the feature maps.
  • num_base_anchors (int) – The number of base anchors.
  • device (str, optional) – Device where the flags will be put. Defaults to ‘cuda’.
Returns:

The valid flags of each anchor in a single level feature map.

Return type:

torch.Tensor

valid_flags(featmap_sizes, pad_shape, device='cuda')[source]

Generate valid flags of anchors in multiple feature levels.

Parameters:
  • featmap_sizes (list(tuple)) – List of feature map sizes in multiple feature levels.
  • pad_shape (tuple) – The padded shape of the image.
  • device (str) – Device where the anchors will be put.
Returns:

Valid flags of anchors in multiple levels.

Return type:

list(torch.Tensor)
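
A hedged usage sketch with the 2x2 single-level generator from the class example; for a 32x32 padded image every position is valid, giving width * height * num_base_anchors = 4 flags:

>>> from mmdet.core import AnchorGenerator
>>> self = AnchorGenerator([16], [1.], [1.], [9])
>>> flags = self.valid_flags([(2, 2)], pad_shape=(32, 32), device='cpu')
>>> flags[0].shape
torch.Size([4])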

class mmdet.core.anchor.LegacyAnchorGenerator(strides, ratios, scales=None, base_sizes=None, scale_major=True, octave_base_scale=None, scales_per_octave=None, centers=None, center_offset=0.0)[source]

Legacy anchor generator used in MMDetection V1.x.

Note

Difference to the V2.0 anchor generator:

  1. The center offset of V1.x anchors is set to 0.5 rather than 0.
  2. The width/height are reduced by 1 when calculating the anchors’ centers and corners to meet the V1.x coordinate system.
  3. The anchors’ corners are quantized.
Parameters:
  • strides (list[int] | list[tuple[int]]) – Strides of anchors in multiple feature levels.
  • ratios (list[float]) – The list of ratios between the height and width of anchors in a single level.
  • scales (list[int] | None) – Anchor scales for anchors in a single level. It cannot be set at the same time as octave_base_scale and scales_per_octave.
  • base_sizes (list[int]) – The basic sizes of anchors in multiple levels. If None is given, strides will be used to generate base_sizes.
  • scale_major (bool) – Whether to multiply scales first when generating base anchors. If true, the anchors in the same row will have the same scales. By default it is True in V2.0
  • octave_base_scale (int) – The base scale of octave.
  • scales_per_octave (int) – Number of scales for each octave. octave_base_scale and scales_per_octave are usually used in RetinaNet, and scales should be None when they are set.
  • centers (list[tuple[float, float]] | None) – The centers of the anchor relative to the feature grid center in multiple feature levels. By default it is set to None and not used. If a list of tuples of floats is given, it will be used to shift the centers of anchors.
  • center_offset (float) – The offset of center in proportion to anchors’ width and height. By default it is 0 in V2.0, but it should be 0.5 for v1.x models.

Examples

>>> from mmdet.core import LegacyAnchorGenerator
>>> self = LegacyAnchorGenerator(
>>>     [16], [1.], [1.], [9], center_offset=0.5)
>>> all_anchors = self.grid_anchors(((2, 2),), device='cpu')
>>> print(all_anchors)
[tensor([[ 0.,  0.,  8.,  8.],
        [16.,  0., 24.,  8.],
        [ 0., 16.,  8., 24.],
        [16., 16., 24., 24.]])]
gen_single_level_base_anchors(base_size, scales, ratios, center=None)[source]

Generate base anchors of a single level.

Note

The width/height of anchors are reduced by 1 when calculating the centers and corners to meet the V1.x coordinate system.

Parameters:
  • base_size (int | float) – Basic size of an anchor.
  • scales (torch.Tensor) – Scales of the anchor.
  • ratios (torch.Tensor) – The ratio between the height and width of anchors in a single level.
  • center (tuple[float], optional) – The center of the base anchor related to a single feature grid. Defaults to None.
Returns:

Anchors in a single-level feature map.

Return type:

torch.Tensor

mmdet.core.anchor.anchor_inside_flags(flat_anchors, valid_flags, img_shape, allowed_border=0)[source]

Check whether the anchors are inside the border.

Parameters:
  • flat_anchors (torch.Tensor) – Flattened anchors, shape (n, 4).
  • valid_flags (torch.Tensor) – An existing valid flags of anchors.
  • img_shape (tuple(int)) – Shape of current image.
  • allowed_border (int, optional) – The border distance outside the image within which anchors are still considered valid. Defaults to 0.
Returns:

Flags indicating whether the anchors are inside a valid range.

Return type:

torch.Tensor
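
A minimal sketch (img_shape in (h, w) order is an assumption following the common mmdet convention; the second anchor lies outside the image, so its flag is cleared):

>>> import torch
>>> from mmdet.core.anchor import anchor_inside_flags
>>> flat_anchors = torch.tensor([[0., 0., 10., 10.], [-20., -20., -10., -10.]])
>>> valid_flags = torch.tensor([True, True])
>>> anchor_inside_flags(flat_anchors, valid_flags, img_shape=(32, 32))
tensor([ True, False])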

mmdet.core.anchor.images_to_levels(target, num_levels)[source]

Convert targets by image to targets by feature level.

[target_img0, target_img1] -> [target_level0, target_level1, …]
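
A hedged sketch, assuming num_levels holds the per-level anchor counts (here 4 + 2 anchors across two levels) and each per-image target has one row per anchor:

>>> import torch
>>> from mmdet.core.anchor import images_to_levels
>>> target_img0 = torch.zeros(6, 4)
>>> target_img1 = torch.ones(6, 4)
>>> level_targets = images_to_levels([target_img0, target_img1], [4, 2])
>>> [t.shape for t in level_targets]
[torch.Size([2, 4, 4]), torch.Size([2, 2, 4])]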

mmdet.core.anchor.calc_region(bbox, ratio, featmap_size=None)[source]

Calculate a proportional bbox region.

The bbox center is fixed and the new h’ and w’ are h * ratio and w * ratio.

Parameters:
  • bbox (Tensor) – Bboxes to calculate regions, shape (n, 4).
  • ratio (float) – Ratio of the output region.
  • featmap_size (tuple) – Feature map size used for clipping the boundary.
Returns:

x1, y1, x2, y2

Return type:

tuple

class mmdet.core.anchor.YOLOAnchorGenerator(strides, base_sizes)[source]

Anchor generator for YOLO.

Parameters:
  • strides (list[int] | list[tuple[int, int]]) – Strides of anchors in multiple feature levels.
  • base_sizes (list[list[tuple[int, int]]]) – The basic sizes of anchors in multiple levels.
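
Example

A construction sketch using YOLOv3-style priors for a single stride-32 level (the (w, h) base sizes below are the familiar YOLOv3 anchors and are only illustrative):

>>> from mmdet.core import YOLOAnchorGenerator
>>> self = YOLOAnchorGenerator(
>>>     strides=[32],
>>>     base_sizes=[[(116, 90), (156, 198), (373, 326)]])
>>> all_anchors = self.grid_anchors([(2, 2)], device='cpu')
>>> all_anchors[0].shape
torch.Size([12, 4])
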
gen_base_anchors()[source]

Generate base anchors.

Returns:Base anchors of a feature grid in multiple feature levels.
Return type:list(torch.Tensor)
gen_single_level_base_anchors(base_sizes_per_level, center=None)[source]

Generate base anchors of a single level.

Parameters:
  • base_sizes_per_level (list[tuple[int, int]]) – Basic sizes of anchors.
  • center (tuple[float], optional) – The center of the base anchor related to a single feature grid. Defaults to None.
Returns:

Anchors in a single-level feature map.

Return type:

torch.Tensor

num_levels

number of feature levels that the generator will be applied to

Type:int
responsible_flags(featmap_sizes, gt_bboxes, device='cuda')[source]

Generate responsible anchor flags of grid cells in multiple scales.

Parameters:
  • featmap_sizes (list(tuple)) – List of feature map sizes in multiple feature levels.
  • gt_bboxes (Tensor) – Ground truth boxes, shape (n, 4).
  • device (str) – Device where the anchors will be put.
Returns:

Responsible flags of anchors in multiple levels.

Return type:

list(torch.Tensor)

single_level_responsible_flags(featmap_size, gt_bboxes, stride, num_base_anchors, device='cuda')[source]

Generate the responsible flags of anchors in a single feature map.

Parameters:
  • featmap_size (tuple[int]) – The size of feature maps.
  • gt_bboxes (Tensor) – Ground truth boxes, shape (n, 4).
  • stride (tuple(int)) – Stride of the current level.
  • num_base_anchors (int) – The number of base anchors.
  • device (str, optional) – Device where the flags will be put. Defaults to ‘cuda’.
Returns:

The responsible flags of each anchor in a single level feature map.

Return type:

torch.Tensor

bbox

export

mask

evaluation

post_processing

optimizer

utils

mmdet.datasets

datasets

pipelines

mmdet.models

detectors

backbones

class mmdet.models.backbones.RegNet(arch, in_channels=3, stem_channels=32, base_channels=32, strides=(2, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(0, 1, 2, 3), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=-1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=True, dcn=None, stage_with_dcn=(False, False, False, False), plugins=None, with_cp=False, zero_init_residual=True)[source]

RegNet backbone.

More details can be found in the paper Designing Network Design Spaces.

Parameters:
  • arch (dict) –

    The parameter of RegNets.

    • w0 (int): initial width
    • wa (float): slope of width
    • wm (float): quantization parameter to quantize the width
    • depth (int): depth of the backbone
    • group_w (int): width of group
    • bot_mul (float): bottleneck ratio, i.e. expansion of bottleneck.
  • strides (Sequence[int]) – Strides of the first block of each stage.
  • base_channels (int) – Base channels after stem layer.
  • in_channels (int) – Number of input image channels. Default: 3.
  • dilations (Sequence[int]) – Dilation of each stage.
  • out_indices (Sequence[int]) – Output from which stages.
  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.
  • norm_cfg (dict) – dictionary to construct and config norm layer.
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.
  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.
  • zero_init_residual (bool) – whether to use zero init for last norm layer in resblocks to let them behave as identity.

Example

>>> from mmdet.models import RegNet
>>> import torch
>>> self = RegNet(
...     arch=dict(
...         w0=88,
...         wa=26.31,
...         wm=2.25,
...         group_w=48,
...         depth=25,
...         bot_mul=1.0))
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 96, 8, 8)
(1, 192, 4, 4)
(1, 432, 2, 2)
(1, 1008, 1, 1)
adjust_width_group(widths, bottleneck_ratio, groups)[source]

Adjusts the compatibility of widths and groups.

Parameters:
  • widths (list[int]) – Width of each stage.
  • bottleneck_ratio (float) – Bottleneck ratio.
  • groups (int) – number of groups in each stage
Returns:

The adjusted widths and groups of each stage.

Return type:

tuple(list)

forward(x)[source]

Forward function.

generate_regnet(initial_width, width_slope, width_parameter, depth, divisor=8)[source]

Generates per block width from RegNet parameters.

Parameters:
  • initial_width (int) – Initial width of the backbone.
  • width_slope (float) – Slope of the quantized linear function.
  • width_parameter (int) – Parameter used to quantize the width.
  • depth (int) – Depth of the backbone.
  • divisor (int, optional) – The divisor of channels. Defaults to 8.
Returns:

A list of widths of each stage and the number of stages.

Return type:

list, int

get_stages_from_blocks(widths)[source]

Gets widths/stage_blocks of network at each stage.

Parameters:widths (list[int]) – Width in each stage.
Returns:width and depth of each stage
Return type:tuple(list)
static quantize_float(number, divisor)[source]

Converts a float to the closest non-zero int divisible by divisor.

Parameters:
  • number (float) – Original number to be quantized.
  • divisor (int) – Divisor used to quantize the number.
Returns:

Quantized number that is divisible by divisor.

Return type:

int
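
A behavior sketch, assuming the standard RegNet rounding rule (round to the nearest multiple of divisor):

>>> from mmdet.models.backbones import RegNet
>>> RegNet.quantize_float(10.3, 8)
8
>>> RegNet.quantize_float(13.1, 8)
16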

class mmdet.models.backbones.ResNet(depth, in_channels=3, stem_channels=None, base_channels=64, num_stages=4, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), out_indices=(0, 1, 2, 3), style='pytorch', deep_stem=False, avg_down=False, frozen_stages=-1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, norm_eval=True, dcn=None, stage_with_dcn=(False, False, False, False), plugins=None, with_cp=False, zero_init_residual=True)[source]

ResNet backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
  • stem_channels (int | None) – Number of stem channels. If not specified, it will be the same as base_channels. Default: None.
  • base_channels (int) – Number of base channels of res layer. Default: 64.
  • in_channels (int) – Number of input image channels. Default: 3.
  • num_stages (int) – Resnet stages. Default: 4.
  • strides (Sequence[int]) – Strides of the first block of each stage.
  • dilations (Sequence[int]) – Dilation of each stage.
  • out_indices (Sequence[int]) – Output from which stages.
  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
  • deep_stem (bool) – Replace the 7x7 conv in the input stem with three 3x3 convs.
  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck.
  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters.
  • norm_cfg (dict) – Dictionary to construct and config norm layer.
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.
  • plugins (list[dict]) –

    List of plugins for stages, each dict contains:

    • cfg (dict, required): Cfg dict to build plugin.
    • position (str, required): Position inside block to insert plugin, options are ‘after_conv1’, ‘after_conv2’, ‘after_conv3’.
    • stages (tuple[bool], optional): Stages to apply plugin, length should be same as ‘num_stages’.
  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.
  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity.

Example

>>> from mmdet.models import ResNet
>>> import torch
>>> self = ResNet(depth=18)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 64, 8, 8)
(1, 128, 4, 4)
(1, 256, 2, 2)
(1, 512, 1, 1)
forward(x)[source]

Forward function.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters:pretrained (str, optional) – Path to pre-trained weights. Defaults to None.
make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer.

make_stage_plugins(plugins, stage_idx)[source]

Make plugins for the stage_idx-th ResNet stage.

Currently we support inserting context_block, empirical_attention_block and nonlocal_block into backbones like ResNet/ResNeXt. They could be inserted after conv1/conv2/conv3 of Bottleneck.

An example of the plugins format could be:

Examples

>>> plugins=[
...     dict(cfg=dict(type='xxx', arg1='xxx'),
...          stages=(False, True, True, True),
...          position='after_conv2'),
...     dict(cfg=dict(type='yyy'),
...          stages=(True, True, True, True),
...          position='after_conv3'),
...     dict(cfg=dict(type='zzz', postfix='1'),
...          stages=(True, True, True, True),
...          position='after_conv3'),
...     dict(cfg=dict(type='zzz', postfix='2'),
...          stages=(True, True, True, True),
...          position='after_conv3')
... ]
>>> self = ResNet(depth=18)
>>> stage_plugins = self.make_stage_plugins(plugins, 0)
>>> assert len(stage_plugins) == 3

Suppose stage_idx=0, the structure of blocks in the stage would be:

conv1 -> conv2 -> conv3 -> yyy -> zzz1 -> zzz2

Suppose stage_idx=1, the structure of blocks in the stage would be:

conv1 -> conv2 -> xxx -> conv3 -> yyy -> zzz1 -> zzz2

If stages is missing, the plugin would be applied to all stages.

Parameters:
  • plugins (list[dict]) – List of plugins cfg to build. The postfix is required if multiple same type plugins are inserted.
  • stage_idx (int) – Index of stage to build
Returns:

Plugins for current stage

Return type:

list[dict]

norm1

the normalization layer named “norm1”

Type:nn.Module
train(mode=True)[source]

Convert the model into training mode while keeping the normalization layers frozen.

class mmdet.models.backbones.ResNetV1d(**kwargs)[source]

ResNetV1d variant described in Bag of Tricks.

Compared with the default ResNet (ResNetV1b), ResNetV1d replaces the 7x7 conv in the input stem with three 3x3 convs. And in the downsampling block, a 2x2 avg_pool with stride 2 is added before the conv, whose stride is changed to 1.

class mmdet.models.backbones.ResNeXt(groups=1, base_width=4, **kwargs)[source]

ResNeXt backbone.

Parameters:
  • depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
  • in_channels (int) – Number of input image channels. Default: 3.
  • num_stages (int) – Resnet stages. Default: 4.
  • groups (int) – Group of resnext.
  • base_width (int) – Base width of resnext.
  • strides (Sequence[int]) – Strides of the first block of each stage.
  • dilations (Sequence[int]) – Dilation of each stage.
  • out_indices (Sequence[int]) – Output from which stages.
  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
  • frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.
  • norm_cfg (dict) – dictionary to construct and config norm layer.
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.
  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.
  • zero_init_residual (bool) – whether to use zero init for last norm layer in resblocks to let them behave as identity.
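
Example

A hedged construction sketch for the common ResNeXt-50 (32x4d) setting; the expected output channels mirror ResNet-50, since only the bottleneck grouping changes:

>>> from mmdet.models import ResNeXt
>>> import torch
>>> self = ResNeXt(depth=50, groups=32, base_width=4)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)
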
make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer.

class mmdet.models.backbones.SSDVGG(input_size, depth, with_last_pool=False, ceil_mode=True, out_indices=(3, 4), out_feature_indices=(22, 34), l2_norm_scale=20.0)[source]

VGG Backbone network for single-shot-detection.

Parameters:
  • input_size (int) – width and height of input, from {300, 512}.
  • depth (int) – Depth of vgg, from {11, 13, 16, 19}.
  • out_indices (Sequence[int]) – Output from which stages.

Example

>>> from mmdet.models import SSDVGG
>>> import torch
>>> self = SSDVGG(input_size=300, depth=11)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 300, 300)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 1024, 19, 19)
(1, 512, 10, 10)
(1, 256, 5, 5)
(1, 256, 3, 3)
(1, 256, 1, 1)
forward(x)[source]

Forward function.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters:pretrained (str, optional) – Path to pre-trained weights. Defaults to None.
class mmdet.models.backbones.HRNet(extra, in_channels=3, conv_cfg=None, norm_cfg={'type': 'BN'}, norm_eval=True, with_cp=False, zero_init_residual=False)[source]

HRNet backbone.

High-Resolution Representations for Labeling Pixels and Regions (arXiv: https://arxiv.org/abs/1904.04514).

Parameters:
  • extra (dict) – detailed configuration for each stage of HRNet.
  • in_channels (int) – Number of input image channels. Default: 3.
  • conv_cfg (dict) – dictionary to construct and config conv layer.
  • norm_cfg (dict) – dictionary to construct and config norm layer.
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.
  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.
  • zero_init_residual (bool) – whether to use zero init for last norm layer in resblocks to let them behave as identity.

Example

>>> from mmdet.models import HRNet
>>> import torch
>>> extra = dict(
>>>     stage1=dict(
>>>         num_modules=1,
>>>         num_branches=1,
>>>         block='BOTTLENECK',
>>>         num_blocks=(4, ),
>>>         num_channels=(64, )),
>>>     stage2=dict(
>>>         num_modules=1,
>>>         num_branches=2,
>>>         block='BASIC',
>>>         num_blocks=(4, 4),
>>>         num_channels=(32, 64)),
>>>     stage3=dict(
>>>         num_modules=4,
>>>         num_branches=3,
>>>         block='BASIC',
>>>         num_blocks=(4, 4, 4),
>>>         num_channels=(32, 64, 128)),
>>>     stage4=dict(
>>>         num_modules=3,
>>>         num_branches=4,
>>>         block='BASIC',
>>>         num_blocks=(4, 4, 4, 4),
>>>         num_channels=(32, 64, 128, 256)))
>>> self = HRNet(extra, in_channels=1)
>>> self.eval()
>>> inputs = torch.rand(1, 1, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 32, 8, 8)
(1, 64, 4, 4)
(1, 128, 2, 2)
(1, 256, 1, 1)
forward(x)[source]

Forward function.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters:pretrained (str, optional) – Path to pre-trained weights. Defaults to None.
norm1

the normalization layer named “norm1”

Type:nn.Module
norm2

the normalization layer named “norm2”

Type:nn.Module
train(mode=True)[source]

Convert the model into training mode while keeping the normalization layers frozen.

class mmdet.models.backbones.Res2Net(scales=4, base_width=26, style='pytorch', deep_stem=True, avg_down=True, **kwargs)[source]

Res2Net backbone.

Parameters:
  • scales (int) – Scales used in Res2Net. Default: 4
  • base_width (int) – Basic width of each scale. Default: 26
  • depth (int) – Depth of res2net, from {50, 101, 152}.
  • in_channels (int) – Number of input image channels. Default: 3.
  • num_stages (int) – Res2net stages. Default: 4.
  • strides (Sequence[int]) – Strides of the first block of each stage.
  • dilations (Sequence[int]) – Dilation of each stage.
  • out_indices (Sequence[int]) – Output from which stages.
  • style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer.
  • deep_stem (bool) – Replace the 7x7 conv in the input stem with three 3x3 convs.
  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottle2neck.
  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters.
  • norm_cfg (dict) – Dictionary to construct and config norm layer.
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.
  • plugins (list[dict]) –

    List of plugins for stages, each dict contains:

    • cfg (dict, required): Cfg dict to build plugin.
    • position (str, required): Position inside block to insert plugin, options are ‘after_conv1’, ‘after_conv2’, ‘after_conv3’.
    • stages (tuple[bool], optional): Stages to apply plugin, length should be same as ‘num_stages’.
  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed.
  • zero_init_residual (bool) – Whether to use zero init for last norm layer in resblocks to let them behave as identity.

Example

>>> from mmdet.models import Res2Net
>>> import torch
>>> self = Res2Net(depth=50, scales=4, base_width=26)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)
init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters:pretrained (str, optional) – Path to pre-trained weights. Defaults to None.
make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer.

class mmdet.models.backbones.HourglassNet(downsample_times=5, num_stacks=2, stage_channels=(256, 256, 384, 384, 384, 512), stage_blocks=(2, 2, 2, 2, 2, 4), feat_channel=256, norm_cfg={'requires_grad': True, 'type': 'BN'})[source]

HourglassNet backbone.

Stacked Hourglass Networks for Human Pose Estimation. More details can be found in the paper.

Parameters:
  • downsample_times (int) – Downsample times in a HourglassModule.
  • num_stacks (int) – Number of HourglassModule modules stacked, 1 for Hourglass-52, 2 for Hourglass-104.
  • stage_channels (list[int]) – Feature channel of each sub-module in a HourglassModule.
  • stage_blocks (list[int]) – Number of sub-modules stacked in a HourglassModule.
  • feat_channel (int) – Feature channel of conv after a HourglassModule.
  • norm_cfg (dict) – Dictionary to construct and config norm layer.

Example

>>> from mmdet.models import HourglassNet
>>> import torch
>>> self = HourglassNet()
>>> self.eval()
>>> inputs = torch.rand(1, 3, 511, 511)
>>> level_outputs = self.forward(inputs)
>>> for level_output in level_outputs:
...     print(tuple(level_output.shape))
(1, 256, 128, 128)
(1, 256, 128, 128)
forward(x)[source]

Forward function.

init_weights(pretrained=None)[source]

Init module weights.

We do nothing in this function because all modules we use (ConvModule, BasicBlock, etc.) have default initialization, and we currently don’t provide a pretrained model for HourglassNet.

Detector’s __init__() will call backbone’s init_weights() with pretrained as input, so we keep this function.

class mmdet.models.backbones.DetectoRS_ResNet(sac=None, stage_with_sac=(False, False, False, False), rfp_inplanes=None, output_img=False, pretrained=None, **kwargs)[source]

ResNet backbone for DetectoRS.

Parameters:
  • sac (dict, optional) – Dictionary to construct SAC (Switchable Atrous Convolution). Default: None.
  • stage_with_sac (list) – Which stage to use sac. Default: (False, False, False, False).
  • rfp_inplanes (int, optional) – The number of channels from RFP. Default: None. If specified, an additional conv layer will be added for rfp_feat. Otherwise, the structure is the same as base class.
  • output_img (bool) – If True, the input image will be inserted into the starting position of output. Default: False.
  • pretrained (str, optional) – The pretrained model to load.
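
Example

A hedged construction sketch; the sac and conv_cfg dicts below follow the released DetectoRS configs (an assumption about the registered ‘SAC’ and ‘ConvAWS’ conv layers), with SAC enabled in the last three stages:

>>> from mmdet.models.backbones import DetectoRS_ResNet
>>> self = DetectoRS_ResNet(
>>>     depth=50,
>>>     conv_cfg=dict(type='ConvAWS'),
>>>     sac=dict(type='SAC', use_deform=True),
>>>     stage_with_sac=(False, True, True, True))
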
forward(x)[source]

Forward function.

make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer for DetectoRS.

rfp_forward(x, rfp_feats)[source]

Forward function for RFP.

class mmdet.models.backbones.DetectoRS_ResNeXt(groups=1, base_width=4, **kwargs)[source]

ResNeXt backbone for DetectoRS.

Parameters:
  • groups (int) – The number of groups in ResNeXt.
  • base_width (int) – The base width of ResNeXt.
make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer for DetectoRS.

class mmdet.models.backbones.Darknet(depth=53, out_indices=(3, 4, 5), frozen_stages=-1, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, act_cfg={'negative_slope': 0.1, 'type': 'LeakyReLU'}, norm_eval=True)[source]

Darknet backbone.

Parameters:
  • depth (int) – Depth of Darknet. Currently only 53 is supported.
  • out_indices (Sequence[int]) – Output from which stages.
  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.
  • conv_cfg (dict) – Config dict for convolution layer. Default: None.
  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type=’BN’, requires_grad=True)
  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’LeakyReLU’, negative_slope=0.1).
  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only.

Example

>>> from mmdet.models import Darknet
>>> import torch
>>> self = Darknet(depth=53)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 416, 416)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
...
(1, 256, 52, 52)
(1, 512, 26, 26)
(1, 1024, 13, 13)
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

static make_conv_res_block(in_channels, out_channels, res_repeat, conv_cfg=None, norm_cfg={'requires_grad': True, 'type': 'BN'}, act_cfg={'negative_slope': 0.1, 'type': 'LeakyReLU'})[source]

In the Darknet backbone, a ConvLayer is usually followed by a ResBlock. This function builds such a ConvLayer-ResBlock pair. The Conv layers always have 3x3 filters with stride=2, and the number of filters in the Conv layer is the same as the out channels of the ResBlock.

Parameters:
  • in_channels (int) – The number of input channels.
  • out_channels (int) – The number of output channels.
  • res_repeat (int) – The number of ResBlocks.
  • conv_cfg (dict) – Config dict for convolution layer. Default: None.
  • norm_cfg (dict) – Dictionary to construct and config norm layer. Default: dict(type=’BN’, requires_grad=True)
  • act_cfg (dict) – Config dict for activation layer. Default: dict(type=’LeakyReLU’, negative_slope=0.1).
train(mode=True)[source]

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters:mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
Returns:self
Return type:Module
class mmdet.models.backbones.ResNeSt(groups=1, base_width=4, radix=2, reduction_factor=4, avg_down_stride=True, **kwargs)[source]

ResNeSt backbone.

Parameters:
  • groups (int) – Number of groups of Bottleneck. Default: 1
  • base_width (int) – Base width of Bottleneck. Default: 4
  • radix (int) – Radix of SplitAttentionConv2d. Default: 2
  • reduction_factor (int) – Reduction factor of inter_channels in SplitAttentionConv2d. Default: 4.
  • avg_down_stride (bool) – Whether to use average pool for stride in Bottleneck. Default: True.
  • kwargs (dict) – Keyword arguments for ResNet.
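
Example

A hedged construction sketch for ResNeSt-50 with its default split-attention settings; the expected output channels assume the usual bottleneck expansion of 4:

>>> from mmdet.models import ResNeSt
>>> import torch
>>> self = ResNeSt(depth=50, radix=2, reduction_factor=4)
>>> self.eval()
>>> inputs = torch.rand(1, 3, 32, 32)
>>> level_outputs = self.forward(inputs)
>>> for level_out in level_outputs:
...     print(tuple(level_out.shape))
(1, 256, 8, 8)
(1, 512, 4, 4)
(1, 1024, 2, 2)
(1, 2048, 1, 1)
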
make_res_layer(**kwargs)[source]

Pack all blocks in a stage into a ResLayer.

necks

dense_heads

roi_heads

losses

utils

class mmdet.models.utils.ResLayer(block, inplanes, planes, num_blocks, stride=1, avg_down=False, conv_cfg=None, norm_cfg={'type': 'BN'}, downsample_first=True, **kwargs)[source]

ResLayer to build ResNet style backbone.

Parameters:
  • block (nn.Module) – block used to build ResLayer.
  • inplanes (int) – inplanes of block.
  • planes (int) – planes of block.
  • num_blocks (int) – number of blocks.
  • stride (int) – stride of the first block. Default: 1
  • avg_down (bool) – Use AvgPool instead of stride conv when downsampling in the bottleneck. Default: False
  • conv_cfg (dict) – dictionary to construct and config conv layer. Default: None
  • norm_cfg (dict) – dictionary to construct and config norm layer. Default: dict(type=’BN’)
  • downsample_first (bool) – Downsample at the first block or last block. False for Hourglass, True for ResNet. Default: True
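
Example

A hedged sketch of building a two-block BasicBlock stage by hand (the BasicBlock import path from mmdet.models.backbones.resnet is an assumption of this example):

>>> import torch
>>> from mmdet.models.backbones.resnet import BasicBlock
>>> from mmdet.models.utils import ResLayer
>>> layer = ResLayer(BasicBlock, inplanes=64, planes=64, num_blocks=2)
>>> x = torch.rand(1, 64, 8, 8)
>>> layer(x).shape
torch.Size([1, 64, 8, 8])
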
mmdet.models.utils.gaussian_radius(det_size, min_overlap)[source]

Generate 2D gaussian radius.

This function is modified from the official github repo.

Given min_overlap, the radius can be computed by solving a quadratic equation according to Vieta’s formulas.

There are 3 cases for computing the gaussian radius, detailed below:

  • Explanation of figure: lt and br indicate the left-top and bottom-right corners of the ground truth box. x indicates the generated corner at the limited position when radius=r.
  • Case1: one corner is inside the gt box and the other is outside.
|<   width   >|

lt-+----------+         -
|  |          |         ^
+--x----------+--+
|  |          |  |
|  |          |  |    height
|  | overlap  |  |
|  |          |  |
|  |          |  |      v
+--+---------br--+      -
   |          |  |
   +----------+--x

To ensure IoU of generated box and gt box is larger than min_overlap:

\[\begin{split}\cfrac{(w-r)*(h-r)}{w*h+(w+h)r-r^2} \ge {iou} \quad\Rightarrow\quad {r^2-(w+h)r+\cfrac{1-iou}{1+iou}*w*h} \ge 0 \\ {a} = 1,\quad{b} = {-(w+h)},\quad{c} = {\cfrac{1-iou}{1+iou}*w*h} \\ {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}\end{split}\]
  • Case2: both two corners are inside the gt box.
|<   width   >|

lt-+----------+         -
|  |          |         ^
+--x-------+  |
|  |       |  |
|  |overlap|  |       height
|  |       |  |
|  +-------x--+
|          |  |         v
+----------+-br         -

To ensure IoU of generated box and gt box is larger than min_overlap:

\[\begin{split}\cfrac{(w-2*r)*(h-2*r)}{w*h} \ge {iou} \quad\Rightarrow\quad {4r^2-2(w+h)r+(1-iou)*w*h} \ge 0 \\ {a} = 4,\quad {b} = {-2(w+h)},\quad {c} = {(1-iou)*w*h} \\ {r} \le \cfrac{-b-\sqrt{b^2-4*a*c}}{2*a}\end{split}\]
  • Case3: both two corners are outside the gt box.
   |<   width   >|

x--+----------------+
|  |                |
+-lt-------------+  |   -
|  |             |  |   ^
|  |             |  |
|  |   overlap   |  | height
|  |             |  |
|  |             |  |   v
|  +------------br--+   -
|                |  |
+----------------+--x

To ensure IoU of generated box and gt box is larger than min_overlap:

\[\begin{split}\cfrac{w*h}{(w+2*r)*(h+2*r)} \ge {iou} \quad\Rightarrow\quad {4*iou*r^2+2*iou*(w+h)r+(iou-1)*w*h} \le 0 \\ {a} = {4*iou},\quad {b} = {2*iou*(w+h)},\quad {c} = {(iou-1)*w*h} \\ {r} \le \cfrac{-b+\sqrt{b^2-4*a*c}}{2*a}\end{split}\]
Parameters:
  • det_size (list[int]) – Shape of object.
  • min_overlap (float) – Min IoU with ground truth for boxes generated by keypoints inside the gaussian kernel.
Returns:

Radius of gaussian kernel.

Return type:

radius (int)

mmdet.models.utils.gen_gaussian_target(heatmap, center, radius, k=1)[source]

Generate 2D gaussian heatmap.

Parameters:
  • heatmap (Tensor) – Input heatmap; the gaussian kernel will cover it and keep the max value.
  • center (list[int]) – Coord of gaussian kernel’s center.
  • radius (int) – Radius of gaussian kernel.
  • k (int) – Coefficient of gaussian kernel. Default: 1.
Returns:

Updated heatmap covered by gaussian kernel.

Return type:

out_heatmap (Tensor)
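
A combined usage sketch for gaussian_radius and gen_gaussian_target (passing the box size as 0-dim tensors is an assumption borrowed from how CornerNet-style heads call these helpers):

>>> import torch
>>> from mmdet.models.utils import gaussian_radius, gen_gaussian_target
>>> heatmap = torch.zeros(64, 64)                  # single-class heatmap
>>> h, w = torch.tensor(12.), torch.tensor(20.)    # ground-truth box size
>>> radius = max(0, int(gaussian_radius((h, w), min_overlap=0.7)))
>>> heatmap = gen_gaussian_target(heatmap, [32, 32], radius)
>>> float(heatmap[32, 32])
1.0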

class mmdet.models.utils.MultiheadAttention(embed_dims, num_heads, dropout=0.0)[source]

A wrapper for torch.nn.MultiheadAttention.

This module implements MultiheadAttention with residual connection, and positional encoding used in DETR is also passed as input.

Parameters:
  • embed_dims (int) – The embedding dimension.
  • num_heads (int) – Parallel attention heads. Same as nn.MultiheadAttention.
  • dropout (float) – A Dropout layer on attn_output_weights. Default 0.0.
forward(x, key=None, value=None, residual=None, query_pos=None, key_pos=None, attn_mask=None, key_padding_mask=None)[source]

Forward function for MultiheadAttention.

Parameters:
  • x (Tensor) – The input query with shape [num_query, bs, embed_dims]. Same in nn.MultiheadAttention.forward.
  • key (Tensor) – The key tensor with shape [num_key, bs, embed_dims]. Same in nn.MultiheadAttention.forward. Default None. If None, the query will be used.
  • value (Tensor) – The value tensor with same shape as key. Same in nn.MultiheadAttention.forward. Default None. If None, the key will be used.
  • residual (Tensor) – The tensor used for addition, with the same shape as x. Default None. If None, x will be used.
  • query_pos (Tensor) – The positional encoding for query, with the same shape as x. Default None. If not None, it will be added to x before forward function.
  • key_pos (Tensor) – The positional encoding for key, with the same shape as key. Default None. If not None, it will be added to key before forward function. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.
  • attn_mask (Tensor) – ByteTensor mask with shape [num_query, num_key]. Same in nn.MultiheadAttention.forward. Default None.
  • key_padding_mask (Tensor) – ByteTensor with shape [bs, num_key]. Same in nn.MultiheadAttention.forward. Default None.
Returns:

forwarded results with shape [num_query, bs, embed_dims].

Return type:

Tensor
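
A self-attention usage sketch (key and value default to the query, and the residual connection is added internally, as described above):

>>> import torch
>>> from mmdet.models.utils import MultiheadAttention
>>> self_attn = MultiheadAttention(embed_dims=256, num_heads=8)
>>> x = torch.rand(100, 2, 256)           # [num_query, bs, embed_dims]
>>> query_pos = torch.rand(100, 2, 256)   # positional encoding for x
>>> self_attn(x, query_pos=query_pos).shape
torch.Size([100, 2, 256])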

class mmdet.models.utils.FFN(embed_dims, feedforward_channels, num_fcs=2, act_cfg={'inplace': True, 'type': 'ReLU'}, dropout=0.0, add_residual=True)[source]

Implements feed-forward networks (FFNs) with residual connection.

Parameters:
  • embed_dims (int) – The feature dimension. Same as MultiheadAttention.
  • feedforward_channels (int) – The hidden dimension of FFNs.
  • num_fcs (int) – The number of fully-connected layers in FFNs.
  • act_cfg (dict) – The activation config for FFNs.
  • dropout (float) – Probability of an element to be zeroed. Default 0.0.
forward(x, residual=None)[source]

Forward function for FFN.

class mmdet.models.utils.TransformerEncoderLayer(embed_dims, num_heads, feedforward_channels, dropout=0.0, order=('selfattn', 'norm', 'ffn', 'norm'), act_cfg={'inplace': True, 'type': 'ReLU'}, norm_cfg={'type': 'LN'}, num_fcs=2)[source]

Implements one encoder layer in DETR transformer.

Parameters:
  • embed_dims (int) – The feature dimension. Same as FFN.
  • num_heads (int) – Parallel attention heads.
  • feedforward_channels (int) – The hidden dimension for FFNs.
  • dropout (float) – Probability of an element to be zeroed. Default 0.0.
  • order (tuple[str]) – The order for encoder layer. Valid examples are (‘selfattn’, ‘norm’, ‘ffn’, ‘norm’) and (‘norm’, ‘selfattn’, ‘norm’, ‘ffn’). Default (‘selfattn’, ‘norm’, ‘ffn’, ‘norm’).
  • act_cfg (dict) – The activation config for FFNs. Default: ReLU.
  • norm_cfg (dict) – Config dict for normalization layer. Default layer normalization.
  • num_fcs (int) – The number of fully-connected layers for FFNs. Default 2.
forward(x, pos=None, attn_mask=None, key_padding_mask=None)[source]

Forward function for TransformerEncoderLayer.

Parameters:
  • x (Tensor) – The input query with shape [num_key, bs, embed_dims]. Same in MultiheadAttention.forward.
  • pos (Tensor) – The positional encoding for query. Default None. Same as query_pos in MultiheadAttention.forward.
  • attn_mask (Tensor) – ByteTensor mask with shape [num_key, num_key]. Same in MultiheadAttention.forward. Default None.
  • key_padding_mask (Tensor) – ByteTensor with shape [bs, num_key]. Same in MultiheadAttention.forward. Default None.
Returns:

forwarded results with shape [num_key, bs, embed_dims].

Return type:

Tensor

class mmdet.models.utils.TransformerEncoder(num_layers, embed_dims, num_heads, feedforward_channels, dropout=0.0, order=('selfattn', 'norm', 'ffn', 'norm'), act_cfg={'inplace': True, 'type': 'ReLU'}, norm_cfg={'type': 'LN'}, num_fcs=2)[source]

Implements the encoder in DETR transformer.

Parameters:
  • num_layers (int) – The number of TransformerEncoderLayer.
  • embed_dims (int) – Same as TransformerEncoderLayer.
  • num_heads (int) – Same as TransformerEncoderLayer.
  • feedforward_channels (int) – Same as TransformerEncoderLayer.
  • dropout (float) – Same as TransformerEncoderLayer. Default 0.0.
  • order (tuple[str]) – Same as TransformerEncoderLayer.
  • act_cfg (dict) – Same as TransformerEncoderLayer. Default: ReLU.
  • norm_cfg (dict) – Same as TransformerEncoderLayer. Default layer normalization.
  • num_fcs (int) – Same as TransformerEncoderLayer. Default 2.
forward(x, pos=None, attn_mask=None, key_padding_mask=None)[source]

Forward function for TransformerEncoder.

Parameters:
  • x (Tensor) – Input query. Same in TransformerEncoderLayer.forward.
  • pos (Tensor) – Positional encoding for query. Default None. Same in TransformerEncoderLayer.forward.
  • attn_mask (Tensor) – ByteTensor attention mask. Default None. Same in TransformerEncoderLayer.forward.
  • key_padding_mask (Tensor) – Same in TransformerEncoderLayer.forward. Default None.
Returns:

Results with shape [num_key, bs, embed_dims].

Return type:

Tensor
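
A shape-level sketch of running the encoder (small dimensions for brevity; embed_dims must be divisible by num_heads):

>>> import torch
>>> from mmdet.models.utils import TransformerEncoder
>>> encoder = TransformerEncoder(
>>>     num_layers=2, embed_dims=64, num_heads=4, feedforward_channels=128)
>>> x = torch.rand(600, 2, 64)  # [num_key, bs, embed_dims], e.g. a flattened 20x30 map
>>> encoder(x).shape
torch.Size([600, 2, 64])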

class mmdet.models.utils.TransformerDecoderLayer(embed_dims, num_heads, feedforward_channels, dropout=0.0, order=('selfattn', 'norm', 'multiheadattn', 'norm', 'ffn', 'norm'), act_cfg={'inplace': True, 'type': 'ReLU'}, norm_cfg={'type': 'LN'}, num_fcs=2)[source]

Implements one decoder layer in DETR transformer.

Parameters:
  • embed_dims (int) – The feature dimension. Same as TransformerEncoderLayer.
  • num_heads (int) – Parallel attention heads.
  • feedforward_channels (int) – Same as TransformerEncoderLayer.
  • dropout (float) – Same as TransformerEncoderLayer. Default 0.0.
  • order (tuple[str]) – The order for decoder layer. Valid examples are (‘selfattn’, ‘norm’, ‘multiheadattn’, ‘norm’, ‘ffn’, ‘norm’) and (‘norm’, ‘selfattn’, ‘norm’, ‘multiheadattn’, ‘norm’, ‘ffn’). Default the former.
  • act_cfg (dict) – Same as TransformerEncoderLayer. Default: ReLU.
  • norm_cfg (dict) – Config dict for normalization layer. Default layer normalization.
  • num_fcs (int) – The number of fully-connected layers in FFNs.
forward(x, memory, memory_pos=None, query_pos=None, memory_attn_mask=None, target_attn_mask=None, memory_key_padding_mask=None, target_key_padding_mask=None)[source]

Forward function for TransformerDecoderLayer.

Parameters:
  • x (Tensor) – Input query with shape [num_query, bs, embed_dims].
  • memory (Tensor) – Tensor got from TransformerEncoder, with shape [num_key, bs, embed_dims].
  • memory_pos (Tensor) – The positional encoding for memory. Default None. Same as key_pos in MultiheadAttention.forward.
  • query_pos (Tensor) – The positional encoding for query. Default None. Same as query_pos in MultiheadAttention.forward.
  • memory_attn_mask (Tensor) – ByteTensor mask for memory, with shape [num_key, num_key]. Same as attn_mask in MultiheadAttention.forward. Default None.
  • target_attn_mask (Tensor) – ByteTensor mask for x, with shape [num_query, num_query]. Same as attn_mask in MultiheadAttention.forward. Default None.
  • memory_key_padding_mask (Tensor) – ByteTensor for memory, with shape [bs, num_key]. Same as key_padding_mask in MultiheadAttention.forward. Default None.
  • target_key_padding_mask (Tensor) – ByteTensor for x, with shape [bs, num_query]. Same as key_padding_mask in MultiheadAttention.forward. Default None.
Returns:

forwarded results with shape [num_query, bs, embed_dims].

Return type:

Tensor

class mmdet.models.utils.TransformerDecoder(num_layers, embed_dims, num_heads, feedforward_channels, dropout=0.0, order=('selfattn', 'norm', 'multiheadattn', 'norm', 'ffn', 'norm'), act_cfg={'inplace': True, 'type': 'ReLU'}, norm_cfg={'type': 'LN'}, num_fcs=2, return_intermediate=False)[source]

Implements the decoder in DETR transformer.

Parameters:
  • num_layers (int) – The number of TransformerDecoderLayer.
  • embed_dims (int) – Same as TransformerDecoderLayer.
  • num_heads (int) – Same as TransformerDecoderLayer.
  • feedforward_channels (int) – Same as TransformerDecoderLayer.
  • dropout (float) – Same as TransformerDecoderLayer. Default 0.0.
  • order (tuple[str]) – Same as TransformerDecoderLayer.
  • act_cfg (dict) – Same as TransformerDecoderLayer. Default: ReLU.
  • norm_cfg (dict) – Same as TransformerDecoderLayer. Default layer normalization.
  • num_fcs (int) – Same as TransformerDecoderLayer. Default 2.
forward(x, memory, memory_pos=None, query_pos=None, memory_attn_mask=None, target_attn_mask=None, memory_key_padding_mask=None, target_key_padding_mask=None)[source]

Forward function for TransformerDecoder.

Parameters:
  • x (Tensor) – Input query. Same in TransformerDecoderLayer.forward.
  • memory (Tensor) – Same in TransformerDecoderLayer.forward.
  • memory_pos (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
  • query_pos (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
  • memory_attn_mask (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
  • target_attn_mask (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
  • memory_key_padding_mask (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
  • target_key_padding_mask (Tensor) – Same in TransformerDecoderLayer.forward. Default None.
Returns:

Results with shape [num_query, bs, embed_dims].

Return type:

Tensor

class mmdet.models.utils.Transformer(embed_dims=512, num_heads=8, num_encoder_layers=6, num_decoder_layers=6, feedforward_channels=2048, dropout=0.0, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_cfg={'type': 'LN'}, num_fcs=2, pre_norm=False, return_intermediate_dec=False)[source]

Implements the DETR transformer.

Following the official DETR implementation, this module is copy-pasted from torch.nn.Transformer, with the following modifications:

  • positional encodings are passed in MultiheadAttention
  • extra LN at the end of encoder is removed
  • decoder returns a stack of activations from all decoding layers

See paper: End-to-End Object Detection with Transformers for details.

Parameters:
  • embed_dims (int) – The feature dimension.
  • num_heads (int) – Parallel attention heads. Same as nn.MultiheadAttention.
  • num_encoder_layers (int) – Number of TransformerEncoderLayer.
  • num_decoder_layers (int) – Number of TransformerDecoderLayer.
  • feedforward_channels (int) – The hidden dimension for FFNs used in both encoder and decoder.
  • dropout (float) – Probability of an element to be zeroed. Default 0.0.
  • act_cfg (dict) – Activation config for FFNs used in both encoder and decoder. Default: ReLU.
  • norm_cfg (dict) – Config dict for normalization used in both encoder and decoder. Default layer normalization.
  • num_fcs (int) – The number of fully-connected layers in FFNs, which is used for both encoder and decoder.
  • pre_norm (bool) – Whether the normalization layer is ordered first in the encoder and decoder. Default False.
  • return_intermediate_dec (bool) – Whether to return the intermediate output from each TransformerDecoderLayer or only the last TransformerDecoderLayer. Default False. If True, the returned hs has shape [num_decoder_layers, bs, num_query, embed_dims]; if False, it has shape [1, bs, num_query, embed_dims].
forward(x, mask, query_embed, pos_embed)[source]

Forward function for Transformer.

Parameters:
  • x (Tensor) – Input query with shape [bs, c, h, w] where c = embed_dims.
  • mask (Tensor) – The key_padding_mask used for encoder and decoder, with shape [bs, h, w].
  • query_embed (Tensor) – The query embedding for decoder, with shape [num_query, c].
  • pos_embed (Tensor) – The positional encoding for encoder and decoder, with the same shape as x.
Returns:

results of decoder containing the following tensors.

  • out_dec: Output from decoder. If return_intermediate_dec is True, the output has shape [num_dec_layers, bs, num_query, embed_dims]; otherwise it has shape [1, bs, num_query, embed_dims].
  • memory: Output results from encoder, with shape [bs, embed_dims, h, w].

Return type:

tuple[Tensor]
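
A hedged end-to-end sketch with tiny dimensions (the all-zero mask marks every pixel as unpadded; shapes follow the parameter descriptions above):

>>> import torch
>>> from mmdet.models.utils import Transformer
>>> transformer = Transformer(
>>>     embed_dims=32, num_heads=4, num_encoder_layers=1,
>>>     num_decoder_layers=1, feedforward_channels=64)
>>> x = torch.rand(2, 32, 8, 8)            # [bs, c, h, w]
>>> mask = torch.zeros(2, 8, 8).bool()     # key_padding_mask, no padding
>>> query_embed = torch.rand(100, 32)      # [num_query, c]
>>> pos_embed = torch.rand(2, 32, 8, 8)    # same shape as x
>>> out_dec, memory = transformer(x, mask, query_embed, pos_embed)
>>> out_dec.shape, memory.shape
(torch.Size([1, 2, 100, 32]), torch.Size([2, 32, 8, 8]))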

init_weights(distribution='uniform')[source]

Initialize the transformer weights.

mmdet.models.utils.build_transformer(cfg, default_args=None)[source]

Builder for Transformer.

mmdet.models.utils.build_positional_encoding(cfg, default_args=None)[source]

Builder for Position Encoding.

class mmdet.models.utils.SinePositionalEncoding(num_feats, temperature=10000, normalize=False, scale=6.283185307179586, eps=1e-06)[source]

Position encoding with sine and cosine functions.

See End-to-End Object Detection with Transformers for details.

Parameters:
  • num_feats (int) – The feature dimension for each position along x-axis or y-axis. Note the final returned dimension for each position is 2 times of this value.
  • temperature (int, optional) – The temperature used for scaling the position embedding. Default 10000.
  • normalize (bool, optional) – Whether to normalize the position embedding. Default False.
  • scale (float, optional) – A scale factor that scales the position embedding. The scale will be used only when normalize is True. Default 2*pi.
  • eps (float, optional) – A value added to the denominator for numerical stability. Default 1e-6.
forward(mask)[source]

Forward function for SinePositionalEncoding.

Parameters:mask (Tensor) – ByteTensor mask. Non-zero values represent ignored positions, while zero values mean valid positions for this image. Shape [bs, h, w].
Returns:Returned position embedding with shape [bs, num_feats*2, h, w].
Return type:pos (Tensor)
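
A usage sketch (an all-zero mask means every position is valid; num_feats=128 yields the 256-channel embedding used by DETR):

>>> import torch
>>> from mmdet.models.utils import SinePositionalEncoding
>>> pos_enc = SinePositionalEncoding(num_feats=128, normalize=True)
>>> mask = torch.zeros(1, 32, 32, dtype=torch.uint8)
>>> pos_enc(mask).shape
torch.Size([1, 256, 32, 32])
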
class mmdet.models.utils.LearnedPositionalEncoding(num_feats, row_num_embed=50, col_num_embed=50)[source]

Position embedding with learnable embedding weights.

Parameters:
  • num_feats (int) – The feature dimension for each position along x-axis or y-axis. The final returned dimension for each position is 2 times of this value.
  • row_num_embed (int, optional) – The dictionary size of row embeddings. Default 50.
  • col_num_embed (int, optional) – The dictionary size of col embeddings. Default 50.
forward(mask)[source]

Forward function for LearnedPositionalEncoding.

Parameters:mask (Tensor) – ByteTensor mask. Non-zero values represent ignored positions, while zero values mean valid positions for this image. Shape [bs, h, w].
Returns:Returned position embedding with shape [bs, num_feats*2, h, w].
Return type:pos (Tensor)
init_weights()[source]

Initialize the learnable weights.