Preemptive Information Extraction using Unrestricted Relation Discovery

Yusuke Shinyama
Satoshi Sekine
New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu

Abstract

We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.

1 Background

Every day, a large number of news articles are created and reported, many of which are unique. But certain types of events, such as hurricanes or murders, are reported again and again throughout the year. The goal of Information Extraction, or IE, is to retrieve a certain type of news event from past articles and present the events as a table whose columns are filled with the name of a person or company, according to its role in the event. However, existing IE techniques require a lot of human labor. First, you have to specify the type of information you want and collect articles that include this information. Then, you have to analyze the articles and manually craft a set of patterns to capture these events. Most existing IE research focuses on reducing this burden by helping people create such patterns. But each time you want to extract a different kind of information, you need to repeat the whole process: specify articles and adjust the patterns, either manually or semi-automatically. This process hides a dangerous pitfall. First, it is hard to estimate how good the system can be after months of work. Furthermore, you might not know if the task is even doable in the first place. Knowing in advance what kind of information is easily obtained would help reduce this risk.

An IE task can be defined as finding a relation among several entities involved in a certain type of event. For example, in the MUC-6 management succession scenario, one seeks a relation among COMPANY, PERSON and POST involved with hiring/firing events. Each row of an extracted table can always be read as “COMPANY hired (or fired) PERSON for POST.” The relation between these entities is retained throughout the table. There are many existing works on obtaining extraction patterns for pre-defined relations (Riloff, 1996; Yangarber et al., 2000; Agichtein and Gravano, 2000; Sudo et al., 2003).
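
To make the notion of a relation table concrete, the following is a minimal sketch of how one row of a management succession table might be represented and read back as a sentence. The class name SuccessionRow, its field names, and the sample rows are assumptions made for illustration only; they are not taken from MUC-6 data or from any existing system.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical row type: one row per hiring/firing event.
@dataclass(frozen=True)
class SuccessionRow:
    company: str
    person: str
    post: str
    action: str  # "hired" or "fired"

    def read(self) -> str:
        # Every row is read with the same template, which is what
        # "the relation is retained throughout the table" means here.
        return f"{self.company} {self.action} {self.person} for {self.post}."


# Invented sample rows, for illustration only.
table: List[SuccessionRow] = [
    SuccessionRow("Acme Corp.", "Jane Doe", "president", "hired"),
    SuccessionRow("Example Inc.", "John Roe", "chief executive", "fired"),
]

for row in table:
    print(row.read())
```

The point of the sketch is that the column roles, not the particular names, define the relation: any row that fits the template belongs in the same table.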

Unrestricted Relation Discovery is a technique to automatically discover such relations that repeatedly appear in a corpus and present them as a table, with absolutely no human intervention. Unlike most existing IE research, a user does not specify the type of articles or information wanted. Instead, a system tries to find all the kinds of relations that are reported multiple times and can be reported in tabular form. This technique will open up the possibility of trying new IE scenarios. Furthermore, the system itself can be used as an IE system, since an obtained relation is already presented as a table. If this system works to a certain extent, tuning an IE system becomes a search problem: all the tables are already built “preemptively.” A user only needs to search for a relevant table.
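
The "search problem" view can be sketched as follows. This is only an illustration under stated assumptions: each preemptively built table is assumed to carry a set of descriptive keywords, and tables are ranked by simple keyword overlap with the user's query. The names RelationTable and search_tables, the ranking scheme, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


# Hypothetical container for one preemptively built table.
@dataclass
class RelationTable:
    keywords: Set[str]  # assumed descriptors of the relation
    rows: List[Tuple[str, ...]] = field(default_factory=list)


def search_tables(tables: List[RelationTable], query: str) -> List[RelationTable]:
    """Return tables sharing at least one keyword with the query, best match first."""
    terms = set(query.lower().split())
    scored = [(len(terms & t.keywords), t) for t in tables]
    matches = [(score, t) for score, t in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in matches]


# Invented sample tables, for illustration only.
tables = [
    RelationTable({"hurricane", "storm", "hit"}, [("Hurricane A", "Region X")]),
    RelationTable({"company", "hire", "fire", "post"},
                  [("Acme Corp.", "Jane Doe", "president")]),
]

results = search_tables(tables, "company hire")
```

Under this view the user does no pattern writing at all: the cost of a new scenario is reduced to formulating a query and inspecting the returned tables.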