pdfminer.six/samples/naacl06-shinyama.txt.ref

Preemptive Information Extraction using Unrestricted Relation Discovery

Yusuke Shinyama

Satoshi Sekine

New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu

We are trying to extend the boundary of
Information Extraction (IE) systems. Ex-
isting IE systems require a lot of time and
human effort to tune for a new scenario.
Preemptive Information Extraction is an
attempt to automatically create all feasible
IE systems in advance without human in-
tervention. We propose a technique called
Unrestricted Relation Discovery that dis-
covers all possible relations from texts and
presents them as tables. We present a pre-
liminary system that obtains reasonably
good results.

Abstract

1 Background

Every day, a large number of news articles are cre-
ated and reported, many of which are unique. But
certain types of events, such as hurricanes or mur-
ders, are reported again and again throughout a year.
The goal of Information Extraction, or IE, is to re-
trieve a certain type of news event from past articles
and present the events as a table whose columns are
ﬁlled with a name of a person or company, accord-
ing to its role in the event. However, existing IE
techniques require a lot of human labor. First, you
have to specify the type of information you want and
collect articles that include this information. Then,
you have to analyze the articles and manually craft
a set of patterns to capture these events. Most exist-
ing IE research focuses on reducing this burden by
helping people create such patterns. But each time
you want to extract a different kind of information,
you need to repeat the whole process: specify arti-

cles and adjust its patterns, either manually or semi-
automatically. There is a bit of a dangerous pitfall
here. First, it is hard to estimate how good the sys-
tem can be after months of work. Furthermore, you
might not know if the task is even doable in the ﬁrst
place. Knowing what kind of information is easily
obtained in advance would help reduce this risk.
An IE task can be deﬁned as ﬁnding a relation
among several entities involved in a certain type of
For example, in the MUC-6 management
event.
succession scenario, one seeks a relation between
COMPANY, PERSON and POST involved with hir-
ing/ﬁring events. For each row of an extracted ta-
ble, you can always read it as “COMPANY hired
(or ﬁred) PERSON for POST.” The relation between
these entities is retained throughout the table. There
are many existing works on obtaining extraction pat-
terns for pre-deﬁned relations (Riloff, 1996; Yangar-
ber et al., 2000; Agichtein and Gravano, 2000; Sudo
et al., 2003).
Unrestricted Relation Discovery is a technique to
automatically discover such relations that repeatedly
appear in a corpus and present them as a table, with
absolutely no human intervention. Unlike most ex-
isting IE research, a user does not specify the type
of articles or information wanted. Instead, a system
tries to ﬁnd all the kinds of relations that are reported
multiple times and can be reported in tabular form.
This technique will open up the possibility of try-
ing new IE scenarios. Furthermore, the system itself
can be used as an IE system, since an obtained re-
lation is already presented as a table. If this system
works to a certain extent, tuning an IE system be-
comes a search problem: all the tables are already
built “preemptively.” A user only needs to search
for a relevant table.