Few-shot image semantic segmentation based on multimodal cross-alignment network

周莹; 赵国栋

doi:10.3969/j.issn.1007-791X.2026.01.009



Abstract: Existing few-shot image semantic segmentation methods localize targets in unseen images imprecisely. To address this problem, a few-shot image semantic segmentation method based on a multimodal cross-alignment network is proposed. First, a set of weight-sharing backbone networks maps the support and query images into a deep feature space and extracts their visual encoding features. Second, a pre-trained CLIP text encoder encodes the target-class information of the support images into the text space, capturing the textual semantics of the target classes. Third, a cross-attention mechanism establishes feature interaction between the textual and visual spaces, promoting semantic alignment across modalities. Finally, the provisionally predicted query mask is used to build a reverse cross-guidance strategy that guides mask prediction for the known targets in the original support images. Comparative and ablation experiments on the public PASCAL and COCO datasets verify the superiority of the proposed method in handling unseen targets in query images.
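The text-to-visual interaction step can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention. This is not the authors' network: the function name `cross_attention`, the projection matrices, the feature sizes, and the residual connection are all assumptions made for illustration, and random arrays stand in for the backbone features and the CLIP text embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention (illustrative only).

    visual: (n_pix, d_vis) flattened visual features, used as queries.
    text:   (n_tok, d_txt) text token embeddings, used as keys and values.
    Returns the text-enriched visual features and the attention map.
    """
    q = visual @ w_q                          # (n_pix, d)
    k = text @ w_k                            # (n_tok, d)
    v = text @ w_v                            # (n_tok, d_vis)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_pix, n_tok)
    attn = softmax(scores, axis=-1)           # each pixel attends over text tokens
    return visual + attn @ v, attn            # residual connection (an assumption)

# Placeholder inputs: an 8x8 feature map and 4 text tokens (sizes are arbitrary).
rng = np.random.default_rng(0)
n_pix, n_tok, d_vis, d_txt, d = 64, 4, 32, 16, 8
visual = rng.standard_normal((n_pix, d_vis))
text = rng.standard_normal((n_tok, d_txt))
w_q = rng.standard_normal((d_vis, d))
w_k = rng.standard_normal((d_txt, d))
w_v = rng.standard_normal((d_txt, d_vis))

aligned, attn = cross_attention(visual, text, w_q, w_k, w_v)
```

Here `aligned` keeps the shape of the visual feature map, so it could in principle feed a downstream mask decoder, and each row of `attn` is a distribution over the text tokens.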
