
SAM Guided Task-Specific Enhanced Nuclei Segmentation in Digital Pathology

1 Kumoh National Institute of Technology, Gumi, Korea 39177
2 Chungbuk National University, Cheongju, Korea 28644
MICCAI 2024

*Corresponding Author

Qualitative comparison of segmentation results on sample images with and without SAM guidance for eU-Net3+. The top row is from CryoNuSeg, the middle row from NuInsSeg, and the bottom row from CoNIC.

Abstract

Cell nuclei segmentation is crucial in digital pathology for various diagnoses and treatments, and it is predominantly performed using semantic segmentation methods that focus on scalable receptive fields and multi-scale information. In such segmentation tasks, U-Net-based task-specific encoders excel at capturing fine-grained information but fall short in integrating high-level global context. Conversely, foundation models inherently grasp coarse-level features but are less proficient than task-specific models at providing fine-grained details. To this end, we propose using a foundation model to guide task-specific supervised learning by dynamically combining their global and local latent representations via our proposed X-Gated Fusion Block, which applies a gated squeeze-and-excitation block followed by cross-attention to fuse the latent representations. Through experiments across datasets and visualization analysis, we demonstrate that integrating task-specific knowledge with the general insights of foundation models can drastically increase performance, even outperforming domain-specific semantic segmentation models to achieve state-of-the-art results: the Dice score and mIoU increase by approximately 12% and 17.22% on CryoNuSeg, by 15.55% and 16.77% on NuInsSeg, and by 9% on both metrics on the CoNIC dataset.
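For readers unfamiliar with the two reported metrics, the sketch below computes the Dice score and IoU for a single binary nuclei mask. It is a minimal illustration, not the authors' evaluation code: the function name `dice_and_iou`, the flat 0/1 mask format, and the convention of returning 1.0 when both masks are empty are all assumptions. mIoU is then a mean of such IoU values; the averaging convention (per class or per image) is not specified on this page.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat sequences of 0/1 labels.

    Hypothetical helper for illustration only.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as nucleus in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks agree perfectly.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

Both metrics reward overlap, but Dice counts the intersection twice, so it is never lower than IoU for the same pair of masks.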


The overall architecture of SAM guided task-specific segmentation.

Method

Our proposed methodology first enhances U-Net3+ with adaptive feature selection for task-specific segmentation; we call the resulting model eU-Net3+. We then use a frozen SAM encoder to guide the segmentation process by feeding global contextual features into eU-Net3+. The local and global representations are dynamically fused by the proposed X-Gated Fusion Block (X-GFB), which first applies a gated linear unit (GLU) inside a gated squeeze-and-excitation block and then applies a cross-attention block to retain both local and global awareness.
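The fusion order described above (gated squeeze-and-excitation with a GLU, then cross-attention) can be sketched in NumPy as follows. This is a rough illustration under several assumptions, not the authors' implementation: the function names, the weight shapes, applying the gated block to each stream separately, using the local (eU-Net3+) features as queries and the global (SAM) features as keys and values, single-head attention, and the residual connection are all choices made here for clarity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_se(feat, w_reduce, w_expand):
    """Squeeze-and-excitation with a GLU bottleneck (assumed layout).

    feat: (C, H, W); w_reduce: (C, 2R); w_expand: (R, C).
    """
    squeezed = feat.mean(axis=(1, 2))       # squeeze: global average pool -> (C,)
    value, gate = np.split(squeezed @ w_reduce, 2)
    hidden = value * sigmoid(gate)          # GLU: value modulated by a sigmoid gate
    scale = sigmoid(hidden @ w_expand)      # excitation weights in (0, 1), shape (C,)
    return feat * scale[:, None, None]      # reweight channels


def cross_attention(local_feat, global_feat, w_q, w_k, w_v):
    """Local tokens attend to global tokens (assumed direction).

    local_feat: (C, H, W); global_feat: (C, Hg, Wg);
    w_q, w_k: (C, D); w_v: (C, C).
    """
    c, h, w = local_feat.shape
    q_tokens = local_feat.reshape(c, -1).T   # (H*W, C)
    kv_tokens = global_feat.reshape(c, -1).T  # (Hg*Wg, C)
    q, k, v = q_tokens @ w_q, kv_tokens @ w_k, kv_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = q_tokens + attn @ v              # residual keeps local detail
    return fused.T.reshape(c, h, w)


def x_gated_fusion(local_feat, global_feat, params):
    """Gate each stream, then fuse them with cross-attention."""
    local_g = gated_se(local_feat, params["l_reduce"], params["l_expand"])
    global_g = gated_se(global_feat, params["g_reduce"], params["g_expand"])
    return cross_attention(local_g, global_g,
                           params["w_q"], params["w_k"], params["w_v"])
```

In this sketch the fused output keeps the spatial size of the local feature map, so it could be passed on to a decoder unchanged; in a trained model the weight matrices in `params` would be learned rather than supplied by hand.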