Locate then Segment: A Strong Pipeline for Referring Image Segmentation
Ya Jing¹, Tao Kong², Wei Wang¹, Liang Wang², Lei Li², Tieniu Tan²
¹Chinese Academy of Sciences, ²ByteDance AI Lab
Referring image segmentation aims to segment the objects referred to by a natural language expression. In this work, we decouple the task into a "Locate-Then-Segment" (LTS) scheme. Given a language expression, people generally first attend to the corresponding target image regions, then generate a fine segmentation mask of the object based on its context. The LTS first extracts and fuses visual and textual features to get a cross-modal representation, then applies a cross-modal interaction on the visual-textual features to locate the referred object with a position prior, and finally generates the segmentation result with a light-weight segmentation network. Our LTS is simple but surprisingly effective. On three popular benchmark datasets, the LTS outperforms all previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+ and +3.4% on RefCOCOg). In addition, our model is more interpretable because it explicitly locates the object, which the visualization experiments also support.
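The three stages described above (fuse, locate, segment) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the function names `fuse`, `locate`, and `segment`, the element-wise product used for fusion, the sigmoid response used as the position prior, and the thresholding used in place of the segmentation network are all assumptions chosen only to make the data flow concrete.

```python
import math

def fuse(visual, text):
    """Fuse every spatial visual feature with the sentence vector.
    Assumption: element-wise product; the real model learns this fusion."""
    return [[[v * t for v, t in zip(cell, text)] for cell in row] for row in visual]

def locate(fused):
    """Coarse localization: squash each position's summed response to (0, 1).
    The resulting heatmap plays the role of the position prior."""
    return [[1.0 / (1.0 + math.exp(-sum(cell))) for cell in row] for row in fused]

def segment(prior, threshold=0.5):
    """Stand-in for the light-weight segmentation network: threshold the prior.
    A learned head would instead refine the fused features together with the prior."""
    return [[1 if p > threshold else 0 for p in row] for row in prior]

# Hypothetical 2x3 feature map with 2 channels per position, and a 2-d text vector.
visual = [
    [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]],
    [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
]
text = [1.0, -1.0]

heatmap = locate(fuse(visual, text))  # locate: where does the expression point?
mask = segment(heatmap)               # segment: turn the prior into a binary mask
print(mask)                           # [[1, 0, 0], [1, 1, 0]]
```

In this toy setting, positions whose features agree with the text vector receive a high prior and end up in the mask, which mirrors the idea that an explicit localization step makes the final segmentation easier to interpret.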
Visualization of the correlation heatmaps and final results predicted by our model. "gt" denotes the ground-truth segmentation mask of the input image.
  title={Locate then Segment: A Strong Pipeline for Referring Image Segmentation},
  author={Jing, Ya and Kong, Tao and Wang, Wei and Wang, Liang and Li, Lei and Tan, Tieniu},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},