Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

June 17, 2024
Authors: Neelabh Sinha, Vinija Jain, Aman Chadha
cs.AI

Abstract

The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of the outputs of 10 smaller, open LMs across three aspects: task types, application domains, and reasoning types, using diverse prompt styles. We demonstrate that the most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects, supporting their strategic selection based on use case and other constraints. We also show that, if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.
