pdfminer.six/samples/naacl06-shinyama.txt.ref

87 lines
3.4 KiB
Plaintext
Raw Normal View History

Preemptive Information Extraction using Unrestricted Relation Discovery
Yusuke Shinyama
Satoshi Sekine
New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu
We are trying to extend the boundary of
Information Extraction (IE) systems. Ex-
isting IE systems require a lot of time and
human effort to tune for a new scenario.
Preemptive Information Extraction is an
attempt to automatically create all feasible
IE systems in advance without human in-
tervention. We propose a technique called
Unrestricted Relation Discovery that dis-
covers all possible relations from texts and
presents them as tables. We present a pre-
liminary system that obtains reasonably
good results.
Abstract
1 Background
Every day, a large number of news articles are cre-
ated and reported, many of which are unique. But
certain types of events, such as hurricanes or mur-
ders, are reported again and again throughout a year.
The goal of Information Extraction, or IE, is to re-
trieve a certain type of news event from past articles
and present the events as a table whose columns are
filled with a name of a person or company, accord-
ing to its role in the event. However, existing IE
techniques require a lot of human labor. First, you
have to specify the type of information you want and
collect articles that include this information. Then,
you have to analyze the articles and manually craft
a set of patterns to capture these events. Most exist-
ing IE research focuses on reducing this burden by
helping people create such patterns. But each time
you want to extract a different kind of information,
you need to repeat the whole process: specify arti-
cles and adjust its patterns, either manually or semi-
automatically. There is a bit of a dangerous pitfall
here. First, it is hard to estimate how good the sys-
tem can be after months of work. Furthermore, you
might not know if the task is even doable in the first
place. Knowing what kind of information is easily
obtained in advance would help reduce this risk.
An IE task can be defined as finding a relation
among several entities involved in a certain type of
event.
For example, in the MUC-6 management
succession scenario, one seeks a relation between
COMPANY, PERSON and POST involved with hir-
ing/firing events. For each row of an extracted ta-
ble, you can always read it as “COMPANY hired
(or fired) PERSON for POST.” The relation between
these entities is retained throughout the table. There
are many existing works on obtaining extraction pat-
terns for pre-defined relations (Riloff, 1996; Yangar-
ber et al., 2000; Agichtein and Gravano, 2000; Sudo
et al., 2003).
Unrestricted Relation Discovery is a technique to
automatically discover such relations that repeatedly
appear in a corpus and present them as a table, with
absolutely no human intervention. Unlike most ex-
isting IE research, a user does not specify the type
of articles or information wanted. Instead, a system
tries to find all the kinds of relations that are reported
multiple times and can be reported in tabular form.
This technique will open up the possibility of try-
ing new IE scenarios. Furthermore, the system itself
can be used as an IE system, since an obtained re-
lation is already presented as a table. If this system
works to a certain extent, tuning an IE system be-
comes a search problem: all the tables are already
built “preemptively.” A user only needs to search
for a relevant table.