TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut

(CVPR 2022) Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut

Yangtao Wang¹ Xi Shen² Yuan Yuan³ Yuming Du⁴ Maomao Li² Shell Xu Hu⁵ James L. Crowley¹ Dominique Vaufreydaz¹

Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France¹
Tencent AI Lab² MIT CSAIL³ LIGM (UMR 8049) - Ecole des Ponts, UPE⁴
Samsung AI Center, Cambridge⁵

Seg. for Video: [Paper] [GitLab] [GitHub]

Seg./Det. for Image: [GitLab] [GitHub]

Abstract

In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is weighted by a similarity score computed from the features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm.
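The graph construction and cut described above can be illustrated with a minimal sketch. It is not the released implementation: the function name `normalized_cut_bipartition`, the similarity threshold `tau`, the small weight `eps`, and the choice to split the eigenvector at its mean are assumptions made for illustration, and the input is any array of patch features rather than the output of a specific transformer.

```python
import numpy as np

def normalized_cut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N patch features (an N x D array) into two groups.

    Hypothetical helper: tau and eps are illustrative values, not taken
    from this page.
    """
    # Cosine similarity between every pair of patches (fully connected graph).
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T

    # Assumed edge weighting: 1 for similar patches, a tiny weight otherwise,
    # so the graph stays fully connected.
    W = np.where(sim > tau, 1.0, eps)

    # Relaxed Normalized Cut: solve (D - W) y = lambda * D y through the
    # symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order

    # Second smallest eigenvector, mapped back to the generalized problem.
    second = d_inv_sqrt * eigvecs[:, 1]

    # Assumed bipartition rule: split the eigenvector at its mean value.
    return second > second.mean(), second
```

Given features for a grid of patches, the returned boolean mask can be reshaped to the patch grid to obtain a coarse segmentation. Which of the two groups corresponds to the salient object is not decided by this sketch, because the sign of an eigenvector is arbitrary.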

Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6% on the VOC07, VOC12, and COCO20K datasets, respectively. For the unsupervised saliency detection task in images, this method improves the Intersection over Union (IoU) score by 4.4%, 5.6%, and 5.2% on the ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current state-of-the-art techniques. This method also achieves competitive results for unsupervised video object segmentation tasks with the DAVIS, SegTrackv2, and FBMS datasets.
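For readers unfamiliar with the IoU score used for saliency detection, the following sketch shows how it is computed for a pair of binary masks. The helper name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions for illustration and do not come from the evaluation code of this project.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over Union between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)

# One shared pixel out of three covered pixels gives an IoU of 1/3.
example = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```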

Visual Results

Segmentation Results

Video Segmentation

More results on DAVIS, FBMS and SegTrackv2

Image Segmentation

Raw Image	TokenCut	TokenCut + Bilateral Solver

More results on ECSSD, DUTS and DUT-OMRON

Detection Results

Raw Image	EigenVector Attention	Detection (Red)

More results on VOC07, VOC12 and COCO

Internet Image Results

Raw Image	Attention	Detection

Code and Paper

Code(Seg./Det. for Image)	Demo	Paper1(CVPR)	Code(Seg. for Video)	Paper2(arXiv)

To cite our papers:

  @inproceedings{wang2022tokencut,
          title={Self-supervised Transformers for Unsupervised Object Discovery using Normalized Cut},
          author={Wang, Yangtao and Shen, Xi and Hu, Shell Xu and Yuan, Yuan and Crowley, James L.
                  and Vaufreydaz, Dominique},
          booktitle={Conference on Computer Vision and Pattern Recognition},
          address = {New Orleans, LA, USA},
          month = {June},
          year={2022}
        }

  @unpublished{wang2022tokencut2, 
	  title = {{TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer 
		and Normalized Cut}}, 
	  author = {Wang, Yangtao and Shen, Xi and Yuan, Yuan and Du, Yuming and Li, Maomao and 
		Hu, Shell Xu and Crowley, James L and Vaufreydaz, Dominique}, 
	  url = {https://hal.archives-ouvertes.fr/hal-03765422}, 
	  note = {working paper or preprint}, 
	  year = {2022}, 
	  hal_id = {hal-03765422}, 
	  hal_version = {v1}
	}

Acknowledgements

This work has been partially supported by the MIAI Multidisciplinary AI Institute at the Univ. Grenoble Alpes (MIAI@Grenoble Alpes - ANR-19-P3IA-0003), and by the EU H2020 ICT48 project Humane AI Net under contract EU #952026.