Frankentext: Stitching Random Text Fragments into Long-Form Narratives

Abstract

We introduce Frankentexts, a new type of long-form narrative produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writing. This task is a demanding test of controllable generation: the model must satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we instruct the model to produce a draft by selecting and combining human-written passages, then to iteratively revise the draft while maintaining a user-specified copy ratio. We evaluate the resulting Frankentexts along three axes: writing quality, instruction adherence, and detectability. Gemini-2.5-Pro performs surprisingly well on this task: 81% of its Frankentexts are coherent and 100% are relevant to the prompt. Notably, up to 59% of these outputs are misclassified as human-written by detectors such as Pangram, revealing limitations in AI text detectors. Human annotators can sometimes identify Frankentexts through their abrupt tone shifts and inconsistent grammar between segments, especially in longer generations. Beyond presenting a challenging generation task, Frankentexts invite discussion on building effective detectors for this new grey zone of authorship, provide training data for mixed-authorship detection, and serve as a sandbox for studying human-AI co-writing processes.
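
To make the copy-ratio constraint concrete, here is a minimal sketch of one way such a ratio could be measured. It is an illustration only, not the metric used in this work: the function name `copy_ratio`, the whitespace tokenizer, and the `min_span` threshold are all assumptions introduced for the example. A draft token counts as copied if it falls inside any run of at least `min_span` consecutive tokens that appears verbatim in one of the human-written sources.

```python
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    # Assumption: plain whitespace tokenization, case-sensitive,
    # with punctuation left attached to words.
    return text.split()


def copy_ratio(draft: str, sources: Iterable[str], min_span: int = 3) -> float:
    """Fraction of draft tokens covered by verbatim spans of at least
    `min_span` tokens found in any source (hypothetical metric)."""
    draft_tokens = tokenize(draft)
    if not draft_tokens:
        return 0.0

    # Index every n-gram of length min_span that occurs in the sources.
    source_ngrams = set()
    for source in sources:
        tokens = tokenize(source)
        for i in range(len(tokens) - min_span + 1):
            source_ngrams.add(tuple(tokens[i:i + min_span]))

    # Mark each draft token that lies inside a matching n-gram.
    covered = [False] * len(draft_tokens)
    for i in range(len(draft_tokens) - min_span + 1):
        if tuple(draft_tokens[i:i + min_span]) in source_ngrams:
            for j in range(i, i + min_span):
                covered[j] = True

    return sum(covered) / len(draft_tokens)


if __name__ == "__main__":
    human_snippets = [
        "the quick brown fox jumps over the lazy dog",
        "she opened the door and stepped into the rain",
    ]
    draft = "the quick brown fox opened the door and stepped outside"
    print(copy_ratio(draft, human_snippets))
```

In a revision loop of the kind described above, a check like this could be run after each edit so that a revision is rejected when the ratio falls below the user-specified target (for example 0.9).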