Preemptive Information Extraction using Unrestricted Relation Discovery

Yusuke Shinyama
Satoshi Sekine
New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu

Abstract

We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.

1 Background

Every day, a large number of news articles are created and reported, many of which are unique. But certain types of events, such as hurricanes or murders, are reported again and again throughout a year. The goal of Information Extraction, or IE, is to retrieve a certain type of news event from past articles and present the events as a table whose columns are filled with the name of a person or company, according to its role in the event. However, existing IE techniques require a lot of human labor. First, you have to specify the type of information you want and collect articles that include this information. Then, you have to analyze the articles and manually craft a set of patterns to capture these events. Most existing IE research focuses on reducing this burden by helping people create such patterns. But each time you want to extract a different kind of information, you need to repeat the whole process: specify articles and adjust the patterns, either manually or semi-automatically. There is a dangerous pitfall here. First, it is hard to estimate how good the system can be after months of work. Furthermore, you might not know if the task is even doable in the first place. Knowing in advance what kind of information is easily obtained would help reduce this risk.

An IE task can be defined as finding a relation among several entities involved in a certain type of event. For example, in the MUC-6 management succession scenario, one seeks a relation between COMPANY, PERSON and POST involved with hiring/firing events. For each row of an extracted table, you can always read it as “COMPANY hired (or fired) PERSON for POST.” The relation between these entities is retained throughout the table. There is much existing work on obtaining extraction patterns for pre-defined relations (Riloff, 1996; Yangarber et al., 2000; Agichtein and Gravano, 2000; Sudo et al., 2003).

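To make the row-reading convention concrete, here is a minimal Python sketch of a management succession table. The class name `SuccessionRow`, the `action` field, and the helper `read_row` are hypothetical names introduced for illustration, and the rows are invented placeholders, not MUC-6 data.

```python
from typing import NamedTuple


class SuccessionRow(NamedTuple):
    """One table row; the fields mirror the COMPANY, PERSON and POST roles."""
    company: str
    person: str
    post: str
    action: str  # "hired" or "fired" (hypothetical extra field)


def read_row(row: SuccessionRow) -> str:
    """Render a row with the fixed reading 'COMPANY hired (or fired) PERSON for POST.'"""
    return f"{row.company} {row.action} {row.person} for {row.post}."


# Invented placeholder rows, used only to show the table shape.
table = [
    SuccessionRow("Acme Corp", "Jane Doe", "CEO", "hired"),
    SuccessionRow("Example Inc", "John Roe", "CFO", "fired"),
]

for row in table:
    print(read_row(row))
```

Every row is read with the same template, which is what it means for the relation to be retained throughout the table.
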
Unrestricted Relation Discovery is a technique to automatically discover such relations that repeatedly appear in a corpus and present them as a table, with absolutely no human intervention. Unlike most existing IE research, a user does not specify the type of articles or information wanted. Instead, a system tries to find all the kinds of relations that are reported multiple times and can be reported in tabular form. This technique will open up the possibility of trying new IE scenarios. Furthermore, the system itself can be used as an IE system, since an obtained relation is already presented as a table. If this system works to a certain extent, tuning an IE system becomes a search problem: all the tables are already built “preemptively.” A user only needs to search for a relevant table.
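The idea of IE as a search over pre-built tables might be sketched as follows. The text does not say how tables are searched, so ranking by keyword overlap is purely an assumption made for illustration. `RelationTable`, `search_tables`, the keyword sets, and the example tables are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RelationTable:
    """A table built in advance (hypothetical structure)."""
    columns: tuple        # role labels, e.g. ("COMPANY", "PERSON", "POST")
    keywords: frozenset   # words assumed to describe the relation
    rows: list = field(default_factory=list)


def search_tables(tables, query):
    """Rank tables by keyword overlap with the query; drop tables with no overlap."""
    words = set(query.lower().split())
    scored = [(len(words & t.keywords), t) for t in tables]
    scored = [(score, t) for score, t in scored if score > 0]
    # Sort on the score only, so tables themselves are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored]


# Invented example tables standing in for preemptively built ones.
tables = [
    RelationTable(("COMPANY", "PERSON", "POST"),
                  frozenset({"hire", "fire", "company", "post"})),
    RelationTable(("HURRICANE", "LOCATION"),
                  frozenset({"hurricane", "storm", "hit"})),
]

for t in search_tables(tables, "hurricane storm"):
    print(t.columns)
```

Under this assumption, a user query costs only a lookup, because all extraction work was done when the tables were built.
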