AnyMo: Geometry-Aware, Setup-Agnostic Human Motion Modeling in the Wild

Abstract

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to sense human motion continuously in the wild. However, inertial signals depend heavily on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and it limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder on paired synthetic placement views and masked partial observations, tokenizes multi-position IMU signals into full-body motion tokens, and aligns these tokens with a large language model (LLM) for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot human activity recognition (HAR) across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning. AnyMo improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.
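For readers unfamiliar with the ranking metrics quoted above, the following is a minimal sketch of how mean reciprocal rank (MRR) and recall at k (R@k) are commonly computed from a query-by-candidate similarity matrix. It is not AnyMo's evaluation code: the function names, the toy matrix, and the assumption that query i's correct match is candidate i are all illustrative, and the abstract does not state the exact metric definitions used.

```python
from typing import List, Sequence


def ranks_of_matches(similarity: Sequence[Sequence[float]]) -> List[int]:
    """Return the 1-based rank of the correct candidate for each query.

    Assumption (illustrative): query i's correct match is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy scores: rows are IMU queries, columns are text candidates.
toy_similarity = [
    [0.9, 0.1, 0.2],  # correct match ranked 1st
    [0.8, 0.5, 0.3],  # correct match ranked 2nd
    [0.4, 0.6, 0.2],  # correct match ranked 3rd
]

ranks = ranks_of_matches(toy_similarity)
mrr = mean_reciprocal_rank(ranks)
r_at_2 = recall_at_k(ranks, 2)
print(f"ranks={ranks} MRR={mrr:.3f} R@2={r_at_2:.3f}")
```

Swapping the roles of rows and columns gives the text-to-IMU direction. For zero-shot recognition, R@2 would under the same assumption measure whether the true activity label is among the two highest-scoring classes.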