Object Segmentation Reasoning
Author(s)
Fitte-Rey, Quentin Mathieu
Abstract
Current image segmentation methods often lack flexibility, requiring retraining for new objects and struggling to interpret nuanced user requests expressed in natural language. This project addresses these limitations by exploring the integration of reasoning capabilities, via Vision-Language Models (VLMs), into the segmentation pipeline. The primary objective is to create a system capable of segmenting objects based on descriptive, free-form text prompts, thereby improving usability and adaptability for applications such as infrastructure inventory.
The research investigates two distinct architectural paradigms. The first approach, a Multi-Module Architecture, combines a reasoning VLM (Qwen) with a specialized segmentation decoder (SAM) via a trainable adapter. This method demonstrates high robustness, achieving state-of-the-art performance on reasoning benchmarks by effectively leveraging the pre-trained strengths of both components. However, analysis reveals that this disjoint architecture imposes an information bottleneck and limits end-to-end optimization.
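As a rough illustration of the adapter idea described above, the sketch below projects a single VLM hidden state into a prompt embedding for a segmentation decoder. The abstract does not specify the adapter's form, so the linear projection, the dimensions, and the names `make_linear` and `adapter_forward` are all assumptions made for illustration; no real Qwen or SAM API is used.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
VLM_HIDDEN_DIM = 8   # assumed width of the VLM hidden state
PROMPT_DIM = 4       # assumed width of the decoder's prompt embedding


def make_linear(in_dim, out_dim, seed=0):
    """Create randomly initialised weights and a zero bias for a linear layer."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def adapter_forward(hidden_state, weights, bias):
    """Project one VLM hidden state into the decoder's prompt space."""
    return [sum(w * x for w, x in zip(row, hidden_state)) + b
            for row, b in zip(weights, bias)]


weights, bias = make_linear(VLM_HIDDEN_DIM, PROMPT_DIM)

# Stand-in for the hidden state the VLM would emit for the target object.
hidden_state = [0.1 * i for i in range(VLM_HIDDEN_DIM)]

# The only information the decoder receives from the reasoning module.
prompt_embedding = adapter_forward(hidden_state, weights, bias)
```

In this sketch, everything the reasoning module knows about the target has to pass through one fixed-size vector, which is one way to picture the information bottleneck mentioned above.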
To overcome these structural limitations, the project pivots to a second, novel approach: End-to-End VLM Segmentation. This method fine-tunes a VLM to directly generate structured geometric outputs (polygons) from text prompts, unifying reasoning and localization into a single differentiable process. Training proceeds in two stages: Supervised Fine-Tuning (SFT), followed by Reinforcement Learning via Group Relative Policy Optimization (GRPO) to optimize spatial precision.
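The reinforcement learning stage can be pictured with a minimal sketch: several sampled polygon outputs for one prompt are each scored against a target, and the scores are normalised within the group, as GRPO does. The JSON polygon format, the rasterised IoU reward, the grid size, and all function names below are assumptions for illustration; the abstract does not state the reward or output format actually used.

```python
import json
import statistics


def parse_polygon(text):
    """Parse text such as '[[x, y], ...]' into (x, y) tuples; None if malformed."""
    try:
        points = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(points, list) or len(points) < 3:
        return None
    polygon = []
    for point in points:
        if not (isinstance(point, list) and len(point) == 2):
            return None
        if not all(isinstance(c, (int, float)) for c in point):
            return None
        polygon.append((float(point[0]), float(point[1])))
    return polygon


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings to the right of (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, size):
    """Return the set of grid cells whose centre lies inside the polygon."""
    return {(col, row) for row in range(size) for col in range(size)
            if point_in_polygon(col + 0.5, row + 0.5, polygon)}


def iou_reward(text, target_polygon, size=32):
    """Assumed reward: mask IoU with the target, 0.0 for unparseable output."""
    polygon = parse_polygon(text)
    if polygon is None:
        return 0.0
    pred = rasterize(polygon, size)
    target = rasterize(target_polygon, size)
    union = len(pred | target)
    return len(pred & target) / union if union else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


target = [(8, 8), (24, 8), (24, 24), (8, 24)]
completions = [
    "[[8, 8], [24, 8], [24, 24], [8, 24]]",  # exact match
    "[[8, 8], [16, 8], [16, 24], [8, 24]]",  # covers half of the target
    "not a polygon",                         # malformed output
]
rewards = [iou_reward(c, target) for c in completions]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the exact match receives a positive advantage, the malformed output a negative one, and the half-overlapping polygon sits at the group mean, so the policy update favours outputs that are both well-formed and spatially accurate.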
The thesis concludes with a critical comparison of these methodologies. The Multi-Module architecture currently offers superior stability and immediate deployment utility, while the End-to-End approach offers greater architectural simplicity and a significantly higher theoretical ceiling. The End-to-End model is currently constrained by training instability and by the topological limitations of polygon representations, but it is presented as the more promising direction for generalist segmentation. Key recommendations are to stabilize the GRPO training process, to scale up complex reasoning datasets, and to investigate alternative output representations, such as low-resolution dense masks, to unlock the full potential of this unified approach.
Date
2025-12
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)