AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. However, inertial signals depend heavily on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and it limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder on paired synthetic placement views and masked partial observations, tokenizes multi-position IMU signals into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot human activity recognition (HAR) across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning. It improves average Accuracy/F1/R@2 on HAR by 11.7%/11.6%/22.6%, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo
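For readers unfamiliar with the retrieval metrics quoted above, the following is a minimal sketch of how mean reciprocal rank (MRR) and recall at k (R@k) are commonly computed from a query-by-candidate similarity matrix. It is illustrative only: the function names, the toy scores, and the pessimistic tie handling are assumptions, and the abstract does not specify the exact evaluation protocol AnyMo uses.

```python
from typing import Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """Return the 1-based rank of scores[target] under descending order.

    Ties are resolved pessimistically (an assumption of this sketch):
    any other candidate with an equal score counts as ranked ahead.
    """
    target_score = scores[target]
    ahead = sum(
        1 for i, s in enumerate(scores) if i != target and s >= target_score
    )
    return ahead + 1


def mean_reciprocal_rank(
    score_rows: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Average of 1/rank of the correct candidate over all queries."""
    ranks = [rank_of_match(row, t) for row, t in zip(score_rows, targets)]
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(
    score_rows: Sequence[Sequence[float]], targets: Sequence[int], k: int
) -> float:
    """Fraction of queries whose correct candidate ranks within the top k."""
    ranks = [rank_of_match(row, t) for row, t in zip(score_rows, targets)]
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy data: 3 IMU queries scored against 3 text candidates;
# the correct caption for query i is candidate i.
toy_scores = [
    [0.9, 0.1, 0.3],  # correct candidate ranked 1st
    [0.2, 0.4, 0.8],  # correct candidate ranked 2nd
    [0.5, 0.6, 0.7],  # correct candidate ranked 1st
]
toy_targets = [0, 1, 2]

mrr = mean_reciprocal_rank(toy_scores, toy_targets)
r_at_1 = recall_at_k(toy_scores, toy_targets, k=1)
r_at_2 = recall_at_k(toy_scores, toy_targets, k=2)
```

On this toy input the correct candidates rank 1, 2, and 1, so MRR is (1 + 1/2 + 1)/3 and R@2 is 1.0. For IMU-to-text retrieval the rows would be IMU queries and the columns text candidates; text-to-IMU retrieval swaps the roles.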