Abstract:
To address the imprecise target localization of existing few-shot image semantic segmentation methods on unknown images, a multimodal cross-alignment network for few-shot image semantic segmentation is proposed. First, a set of backbone networks with shared weights maps both the support and query images into a deep feature space, where their visual encoding features are extracted. Subsequently, the class information of the objects in the support images is encoded into the textual space by a pre-trained CLIP text encoder, capturing the textual semantics of the target classes. Then, a cross-attention mechanism promotes feature interaction between the textual and visual spaces, enhancing semantic alignment across modalities. Finally, a temporary predicted query mask is used to establish a reverse cross-guidance strategy, which guides the mask prediction of known targets in the original support images. Comparative and ablation experiments on the PASCAL and COCO datasets demonstrate the superiority of the proposed method in handling unknown targets in query images.
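
The abstract does not specify layer details, so the following is only a minimal NumPy sketch of the two mechanisms it names, under stated assumptions: visual features act as attention queries and text embeddings as keys and values; the reverse guidance is approximated by masked average pooling over the query features followed by cosine matching against the support features. All names, shapes, the random stand-in features, the two-token text input, and the thresholding rule are hypothetical and are not taken from the proposed network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, w_q, w_k, w_v):
    """Visual positions attend to text tokens (assumed direction).

    visual: (N, C) flattened feature map, text: (T, C) text embeddings.
    Returns the text-conditioned features (N, D) and attention weights (N, T).
    """
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

def masked_average_pooling(features, mask):
    """Pool (N, C) features under an (N,) mask into one (C,) prototype."""
    weights = mask[:, None]
    return (features * weights).sum(axis=0) / (weights.sum() + 1e-6)

def cosine_similarity_map(features, prototype):
    """Cosine similarity of every position in (N, C) to a (C,) prototype."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-6)
    p = prototype / (np.linalg.norm(prototype) + 1e-6)
    return f @ p

# Random stand-ins for backbone features and CLIP text embeddings (assumption).
rng = np.random.default_rng(0)
H, W, C, T = 4, 4, 8, 2
query_feat = rng.standard_normal((H * W, C))
support_feat = rng.standard_normal((H * W, C))
text_emb = rng.standard_normal((T, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

# Text-visual alignment with a residual connection (assumed design).
aligned, attn = cross_attention(query_feat, text_emb, w_q, w_k, w_v)
fused = query_feat + aligned

# Temporary query mask: positions similar to the first text token (assumed rule).
temp_query_mask = (cosine_similarity_map(fused, text_emb[0]) > 0).astype(float)

# Reverse guidance: a query-derived prototype scores the support positions.
prototype = masked_average_pooling(query_feat, temp_query_mask)
support_sim = cosine_similarity_map(support_feat, prototype)
```

In a trained model the support similarity map would be compared against the known support mask, so that a poor temporary query mask is penalized; here it is only computed to show the data flow.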