Authors:
(1) Zhaoqing Wang, The University of Sydney and AI2Robotics;
(2) Xiaobo Xia, The University of Sydney;
(3) Ziye Chen, The University of Melbourne;
(4) Xiao He, AI2Robotics;
(5) Yandong Guo, AI2Robotics;
(6) Mingming Gong, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence;
(7) Tongliang Liu, The University of Sydney.
Generic segmentation. Given an image, segmenting specific visual concepts has remained an active research topic in computer vision, as indicated by the extensive literature [23, 33, 44]. Generic segmentation mainly includes semantic segmentation [12, 44, 77, 79], instance segmentation [4, 7, 23], and panoptic segmentation [11, 33, 56], which correspond to different levels of granularity. In more detail, semantic segmentation [12, 20, 28, 38, 78] aims to assign a label to each pixel of the input image according to its semantic class. Instance segmentation [53, 54, 59] further attempts to distinguish different object instances of the same semantic class. Panoptic segmentation [9, 57, 74, 75] combines the characteristics of semantic and instance segmentation. Following a closed-vocabulary assumption, previous works can only predict a predefined set of object categories. In this paper, we aim to build an advanced open-vocabulary segmentation system that can categorise objects and stuff from an open vocabulary in the real world.
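To make the three levels of granularity concrete, the toy sketch below (our illustration, not part of the cited works; the image size, class ids, and masks are assumed) contrasts the output formats: a per-pixel class map for semantic segmentation, per-instance binary masks with labels for instance segmentation, and per-pixel (class id, instance id) pairs for panoptic segmentation.

```python
import numpy as np

H, W = 4, 6  # toy image size (assumed for illustration)

# Semantic segmentation: one class id per pixel; the two people are not separated.
semantic_map = np.zeros((H, W), dtype=np.int64)   # 0 = "sky" (stuff), 1 = "person" (thing)
semantic_map[2:, 1:3] = 1
semantic_map[2:, 4:6] = 1

# Instance segmentation: one binary mask and one class id per object instance.
person1 = np.zeros((H, W), dtype=bool); person1[2:, 1:3] = True
person2 = np.zeros((H, W), dtype=bool); person2[2:, 4:6] = True
instance_masks = np.stack([person1, person2])     # (num_instances, H, W)
instance_labels = np.array([1, 1])                # both instances are "person"

# Panoptic segmentation: every pixel carries a (class id, instance id) pair,
# covering countable "things" (person) and amorphous "stuff" (sky, instance id 0).
panoptic_map = np.stack([semantic_map, np.zeros((H, W), dtype=np.int64)], axis=-1)
panoptic_map[person1, 1] = 1
panoptic_map[person2, 1] = 2
```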
Vision foundation models. Recent advancements in vision foundation models have diversified optimisation techniques across various learning paradigms. These developments range from vision-only pre-training [2, 24, 25, 61, 62] to joint vision-language pre-training [30, 48, 73], and extend to multi-modal frameworks that integrate visual prompting [1]. A prime example of this evolution is SAM [34], which demonstrates the potential of large-scale training for general segmentation, offering impressive generalisability and scalability. Despite these capabilities, SAM cannot categorise its predicted masks into semantic classes, as it is supervised only with image-mask pairs.
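As a concrete illustration of this limitation, a minimal sketch of prompting SAM is shown below, assuming the public segment_anything package with a downloaded ViT-H checkpoint; the image path, checkpoint filename, and point prompt are placeholders. The predictor returns class-agnostic masks and confidence scores but no semantic labels.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Assumed checkpoint path; SAM weights must be downloaded separately.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

# A single foreground point prompt (label 1 marks a positive click).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
# `masks` are binary segments at several granularities; none carries a class name,
# which is exactly the gap that open-vocabulary segmentation aims to fill.
```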
More recently, Semantic-SAM [36] unifies different sources of human-annotated segmentation datasets and augments SAM with semantic labels and increased levels of granularity. In this work, we aim to develop a more flexible vision foundation model that can be trained with unpaired mask-text supervision (e.g., independent image-mask and image-text pairs) and can be easily adapted to different segmentation tasks.
Open-vocabulary segmentation. Open-vocabulary segmentation addresses the constraints of closed-vocabulary segmentation by allowing the segmentation of a diverse range of classes, even those unseen during training [21, 67, 68, 80]. Existing works [17, 66, 76] leverage pre-trained vision-language models (e.g., CLIP [48] and ALIGN [30]) to perform open-vocabulary segmentation. Most open-vocabulary segmentation methods rely on human-annotated supervision (i.e., image-mask-text triplets) to generalise the capability of vision-language models from the image level to the pixel level. To reduce the dependency on this labour-intensive supervision, some weakly-supervised methods have been proposed that use only text supervision [46, 65]. They learn to group image regions into segments, but struggle to distinguish different instances of the same semantic class, and their segmentation performance remains unsatisfactory [64, 85]. This dilemma drives our pursuit of a more advanced open-vocabulary segmentation framework, which aims to keep annotation costs as low as possible while achieving strong performance.
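To sketch how such pre-trained vision-language models are typically applied at the mask level, the snippet below is our illustration rather than any specific method's implementation; the classify_masks function, the text-encoder interface, and the prompt template are assumptions. It matches pooled mask embeddings against text embeddings of arbitrary category names supplied at test time.

```python
import torch
import torch.nn.functional as F

def classify_masks(mask_embeddings, class_names, text_encoder, tokenizer, temperature=0.01):
    """Assign open-vocabulary labels to masks via CLIP-style text matching.

    mask_embeddings: (num_masks, dim) region features pooled from a visual encoder.
    class_names: arbitrary category names given at test time (the open vocabulary).
    text_encoder / tokenizer: a pretrained CLIP-like text tower (assumed interface).
    """
    prompts = [f"a photo of a {name}" for name in class_names]    # common prompt template
    text_embeddings = text_encoder(tokenizer(prompts))            # (num_classes, dim)

    # Cosine similarity between every mask and every class name.
    mask_embeddings = F.normalize(mask_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = mask_embeddings @ text_embeddings.t() / temperature  # (num_masks, num_classes)

    probs = logits.softmax(dim=-1)
    return probs.argmax(dim=-1), probs                            # predicted class per mask
```

Because the vocabulary enters only through the text prompts, new categories can be recognised at inference time without retraining the segmentation model.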