Text Data Integration

Abstract

Data comes in many forms. At a surface level, it can be viewed as either structured (e.g., a relation or key-value pairs) or unstructured (e.g., text or images). So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data poses a significant challenge to how well diverse categories of data can be meaningfully stored and processed. Data integration, a crucial part of the data engineering pipeline, addresses this challenge by combining disparate data sources and providing unified data access to end users. Until now, most data integration systems have focused on combining only structured data sources. Nevertheless, unstructured data (a.k.a. free text) also contains a plethora of knowledge waiting to be utilized. Thus, in this chapter, we first make the case for the integration of textual data and then present its challenges, state of the art, and open problems.
