ChatPaper.ai


Automatic Image-Level Morphological Trait Annotation for Organismal Images

April 2, 2026
Authors: Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su
cs.AI

Abstract

Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
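The localization step described above relies on a property of sparse autoencoders: individual latent neurons fire selectively on patch-level features, so the patch where a neuron activates most strongly marks a candidate morphological region. The sketch below illustrates this idea in miniature. It is not the paper's implementation: the dimensions, the weight values, the neuron index, and the helper names (`sae_encode`, `sae_decode`) are all illustrative assumptions; a real SAE would be trained with a reconstruction loss plus a sparsity penalty on the foundation model's actual patch features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 196 patch tokens (a 14x14 grid) of 768-d
# foundation-model features, expanded to 4096 sparse latents.
N_PATCH, D_MODEL, D_SAE = 196, 768, 4096

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder of a sparse autoencoder: z = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder that reconstructs the input features from z."""
    return z @ W_dec + b_dec

# Toy stand-ins for trained weights (in practice, learned by minimizing
# reconstruction error with an L1 penalty on the latent activations z).
W_enc = rng.normal(scale=0.02, size=(D_MODEL, D_SAE))
b_enc = np.full(D_SAE, -0.05)   # a negative bias pushes activations toward zero
W_dec = rng.normal(scale=0.02, size=(D_SAE, D_MODEL))
b_dec = np.zeros(D_MODEL)

patch_feats = rng.normal(size=(N_PATCH, D_MODEL))  # one image's patch features
z = sae_encode(patch_feats, W_enc, b_enc)          # shape (196, 4096), sparse-ish

# Spatial grounding: for a chosen latent neuron, the patch where it fires
# hardest marks the salient region to crop and describe with a VLM prompt.
neuron = 1234                                      # illustrative index
best_patch = int(z[:, neuron].argmax())
row, col = divmod(best_patch, 14)
print(f"neuron {neuron} peaks at patch ({row}, {col})")
```

The located grid cell would then be mapped back to image coordinates, cropped, and passed with a prompt to a vision-language model to obtain the interpretable trait description.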
PDF · April 4, 2026