Mindstorms in Natural Language-Based Societies of Mind

Abstract

Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents, all communicating through the same universal symbolic language, are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents, some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.
