Preemptive Information Extraction using Unrestricted Relation Discovery

Yusuke Shinyama  Satoshi Sekine
New York University
715, Broadway, 7th Floor
New York, NY, 10003
{yusuke,sekine}@cs.nyu.edu

Abstract

We are trying to extend the boundary of Information Extraction (IE) systems. Existing IE systems require a lot of time and human effort to tune for a new scenario. Preemptive Information Extraction is an attempt to automatically create all feasible IE systems in advance without human intervention. We propose a technique called Unrestricted Relation Discovery that discovers all possible relations from texts and presents them as tables. We present a preliminary system that obtains reasonably good results.

1 Background

Every day, a large number of news articles are created and reported, many of which are unique. But certain types of events, such as hurricanes or murders, are reported again and again throughout a year. The goal of Information Extraction, or IE, is to retrieve a certain type of news event from past articles and present the events as a table whose columns are filled with a name of a person or company, according to its role in the event.

However, existing IE techniques require a lot of human labor. First, you have to specify the type of information you want and collect articles that include this information. Then, you have to analyze the articles and manually craft a set of patterns to capture these events. Most existing IE research focuses on reducing this burden by helping people create such patterns. But each time you want to extract a different kind of information, you need to repeat the whole process: specify articles and adjust its patterns, either manually or semi-automatically. There is a bit of a dangerous pitfall here. First, it is hard to estimate how good the system can be after months of work. Furthermore, you might not know if the task is even doable in the first place.
Knowing what kind of information is easily obtained in advance would help reduce this risk.

An IE task can be defined as finding a relation among several entities involved in a certain type of event. For example, in the MUC-6 management succession scenario, one seeks a relation between COMPANY, PERSON and POST involved with hiring/firing events. For each row of an extracted table, you can always read it as "COMPANY hired (or fired) PERSON for POST." The relation between these entities is retained throughout the table. There are many existing works on obtaining extraction patterns for pre-defined relations (Riloff, 1996; Yangarber et al., 2000; Agichtein and Gravano, 2000; Sudo et al., 2003).

Unrestricted Relation Discovery is a technique to automatically discover such relations that repeatedly appear in a corpus and present them as a table, with absolutely no human intervention. Unlike most existing IE research, a user does not specify the type of articles or information wanted. Instead, a system tries to find all the kinds of relations that are reported multiple times and can be reported in tabular form.

This technique will open up the possibility of trying new IE scenarios. Furthermore, the system itself can be used as an IE system, since an obtained relation is already presented as a table. If this system works to a certain extent, tuning an IE system becomes a search problem: all the tables are already built "preemptively." A user only needs to search for a relevant table.
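To make the two ideas above concrete (a relation as a table whose rows can all be read the same way, and IE tuning as a search over tables built in advance), here is a minimal Python sketch. It is only an illustration under our own assumptions and not the system described in this paper: the names `RelationTable`, `read` and `search_tables` are hypothetical, the row values are made-up placeholders rather than extracted data, and the second table's column names and reading template are likewise invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTable:
    """A relation presented as a table (hypothetical structure)."""
    columns: tuple   # entity roles, e.g. ("COMPANY", "PERSON", "POST")
    template: str    # the single reading shared by every row
    rows: tuple      # one tuple of entity names per reported event

    def read(self, row):
        # Fill the shared template with this row's entities.
        return self.template.format(**dict(zip(self.columns, row)))


def search_tables(tables, wanted_columns):
    """Return the pre-built tables whose columns cover all wanted roles."""
    wanted = set(wanted_columns)
    return [t for t in tables if wanted <= set(t.columns)]


# Placeholder tables standing in for "preemptively" built output.
tables = [
    RelationTable(
        columns=("COMPANY", "PERSON", "POST"),
        template="{COMPANY} hired (or fired) {PERSON} for {POST}.",
        rows=(("Acme Corp", "Jane Doe", "CEO"),),  # made-up row
    ),
    RelationTable(
        columns=("HURRICANE", "LOCATION"),
        template="{HURRICANE} hit {LOCATION}.",
        rows=(("Hurricane X", "Region Y"),),  # made-up row
    ),
]

# The user no longer tunes patterns; they only search for a table.
hits = search_tables(tables, ["COMPANY", "PERSON"])
sentence = hits[0].read(hits[0].rows[0])
```

Here the search matches on column roles only, which is the simplest possible criterion; how a user would actually locate a relevant table is left open by the text above.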