Beyond mAP: Towards better evaluation of instance segmentation

1University of Pennsylvania, 2Stanford University

Accepted as a highlight paper at CVPR 2023.


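Correctness of instance segmentation constitutes three tasks: counting the number of objects, correctly localizing all predictions, and classifying each localized prediction.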
We wanted to verify how good mAP is at evaluating these three tasks. We analyse the mAP (and the underlying precision-recall curves) of state-of-the-art models on the COCO dataset.

Problem statement

An initial analysis shows that state-of-the-art models achieve significant gains in AP but produce qualitatively unsatisfactory results with many duplicate predictions. A deeper dive shows that these predictions are not penalized by AP; rather, they are rewarded by the metric. We call these low-confidence false positives hedged predictions because they are predicted as a hedge to improve mAP but clutter the segmentation result.
AP curves for the person class for (a) MaskNMS, (b) SoftNMS, and (c) MatrixNMS show the tremendous number of false positives that are introduced (up to 4x compared to MaskNMS) while the AP remains virtually the same.
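To make the effect concrete, here is a minimal sketch of why low-confidence false positives cost nothing under AP. It assumes a single class and all-point interpolation of the precision-recall curve (COCO itself samples 101 recall points and averages over IoU thresholds, which this toy version omits). The function name `average_precision` and the toy scores are illustrative assumptions, not code or data from the paper.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class (simplified, not the COCO protocol)."""
    # Rank predictions from most to least confident.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=bool)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Precision envelope: best precision achievable at this recall or higher.
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    # Only ranks where recall increases contribute area.
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * envelope))


# Toy example: 3 ground-truth objects, 3 confident predictions (2 correct).
base_scores = [0.9, 0.8, 0.7]
base_tp = [True, False, True]

# Case 1: append 10 low-confidence false positives.
ap_base = average_precision(base_scores, base_tp, num_gt=3)
ap_only_fp = average_precision(base_scores + [0.1] * 10, base_tp + [False] * 10, num_gt=3)

# Case 2: append 9 low-confidence false positives and 1 lucky low-confidence hit.
ap_lucky = average_precision(
    base_scores + [0.1] * 9 + [0.05], base_tp + [False] * 9 + [True], num_gt=3
)

print(ap_base, ap_only_fp, ap_lucky)
```

With these toy numbers, the first two values are identical: the ten extra false positives sit at ranks where recall no longer increases, so they add no area and remove none. The third value is higher than the first, even though the prediction set now contains ten times as many false positives, which is the sense in which the metric rewards hedging.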


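We propose a three-fold contribution: a review of alternative metrics in the literature, two new measures that explicitly quantify spatial and categorical duplicate predictions, and a Semantic Sorting and NMS module that removes these duplicates.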
Figure: results without Semantic Sorting and NMS, and with Semantic Sorting and NMS.
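As a rough illustration of what removing duplicates by pixel occupancy can look like, the sketch below greedily keeps a mask only if enough of its pixels are not already claimed by a higher-confidence mask. This is an assumption-laden toy and not the paper's Semantic Sorting and NMS module: it sorts by confidence only, ignores class labels, and the function name `occupancy_filter` and the `min_new_fraction` threshold are hypothetical.

```python
import numpy as np


def occupancy_filter(masks, scores, min_new_fraction=0.5):
    """Greedy duplicate removal based on unclaimed pixels (hypothetical sketch).

    masks: boolean array of shape (N, H, W); scores: N confidences.
    Returns the indices of kept masks, highest confidence first.
    """
    masks = np.asarray(masks, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    occupied = np.zeros(masks.shape[1:], dtype=bool)  # pixels claimed so far
    keep = []
    for i in order:
        area = masks[i].sum()
        if area == 0:
            continue  # an empty mask cannot claim anything
        new_pixels = np.logical_and(masks[i], ~occupied).sum()
        # Keep the mask only if most of it covers pixels nobody has claimed yet.
        if new_pixels / area >= min_new_fraction:
            keep.append(int(i))
            occupied |= masks[i]
    return keep


# Toy example on a 4x4 image: mask 1 is a near-duplicate of mask 0.
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, :2, :2] = True   # confident object, top-left
masks[1, :2, :2] = True   # low-confidence duplicate of mask 0 ...
masks[1, 2, 0] = True     # ... plus one extra pixel
masks[2, 2:, 2:] = True   # separate object, bottom-right
print(occupancy_filter(masks, scores=[0.9, 0.3, 0.6]))
```

In this toy case, the near-duplicate mask is dropped because only one of its five pixels is unclaimed, while the two distinct objects are kept.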

Please read the paper for more exciting details!

Paper Abstract

Correctness of instance segmentation constitutes counting the number of objects, correctly localizing all predictions and classifying each localized prediction. Average Precision is the de-facto metric used to measure all these constituents of segmentation. However, this metric does not penalize duplicate predictions in the high-recall range, and cannot distinguish instances that are localized correctly but categorized incorrectly. This weakness has inadvertently led to network designs that achieve significant gains in AP but also introduce a large number of false positives. We therefore cannot rely on AP to choose a model that provides an optimal tradeoff between false positives and high recall. To resolve this dilemma, we review alternative metrics in the literature and propose two new measures to explicitly measure the amount of both spatial and categorical duplicate predictions. We also propose a Semantic Sorting and NMS module to remove these duplicates based on a pixel occupancy matching scheme. Experiments show that modern segmentation networks have significant gains in AP, but also contain a considerable amount of duplicates. Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate hedged predictions and preserve AP.


  author    = {Jena, Rohit and Zhornyak, Lukas and Doiphode, Nehal and Chaudhari, Pratik and Buch, Vivek and Gee, James and Shi, Jianbo},
  title     = {Beyond mAP: Towards better evaluation of instance segmentation},
  journal   = {CVPR},
  year      = {2023},