<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:125px; width:452px; height:18px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:18px">Preemptive Information Extraction using Unrestricted Relation Discovery
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:164px; width:91px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Yusuke Shinyama
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:389px; top:164px; width:74px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Satoshi Sekine
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:255px; top:187px; width:101px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">New York University
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:244px; top:201px; width:122px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">715, Broadway, 7th Floor
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:252px; top:215px; width:106px; height:14px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:14px">New York, NY, 10003
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:222px; top:224px; width:168px; height:18px;"><span style="font-family: ZNQAHA+CMSY10; font-size:18px">{</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:11px">yusuke,sekine</span><span style="font-family: ZNQAHA+CMSY10; font-size:18px">}</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:11px">@cs.nyu.edu
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:281px; width:44px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">Abstract
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:310px; width:183px; height:175px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">We are trying to extend the boundary of
Information Extraction (IE) systems. Existing IE systems require a lot of time and
human effort to tune for a new scenario.
Preemptive Information Extraction is an
attempt to automatically create all feasible
IE systems in advance without human intervention.
We propose a technique called
Unrestricted Relation Discovery that discovers
all possible relations from texts and
presents them as tables. We present a preliminary
system that obtains reasonably
good results.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:505px; width:80px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:15px">1 Background
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:528px; width:226px; height:243px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">Every day, a large number of news articles are
written and published, many of which report unique events. But
certain types of events, such as hurricanes or murders,
are reported again and again throughout the year.
The goal of Information Extraction, or IE, is to
retrieve a certain type of news event from past articles
and present the events as a table whose columns are
filled with the names of people or companies, according
to their roles in the event. However, existing IE
techniques require a lot of human labor. First, you
have to specify the type of information you want and
collect articles that include this information. Then,
you have to analyze the articles and manually craft
a set of patterns to capture these events. Most existing
IE research focuses on reducing this burden by
helping people create such patterns. But each time
you want to extract a different kind of information,
you need to repeat the whole process: specify articles
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:284px; width:226px; height:94px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">and adjust the system's patterns, either manually or
semi-automatically. This process carries a real risk.
First, it is hard to estimate how good the system
can be after months of work. Furthermore, you
might not know if the task is even doable in the first
place. Knowing in advance what kind of information is easily
obtained would help reduce this risk.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:379px; width:226px; height:175px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">An IE task can be defined as finding a relation
among several entities involved in a certain type of
event. For example, in the MUC-6 management
succession scenario, one seeks a relation among
COMPANY, PERSON and POST involved in
hiring/firing events. Each row of an extracted table
can always be read as “COMPANY hired
(or fired) PERSON for POST.” The relation among
these entities holds for every row of the table.
Much existing work addresses obtaining extraction
patterns for pre-defined relations (Riloff, 1996;
Yangarber et al., 2000; Agichtein and Gravano, 2000; Sudo
et al., 2003).
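<br>As a minimal sketch of the definition above, the table for such a relation could be modeled as rows of typed entities, with one reading template shared by every row. The names Succession and read_row and the sample rows are hypothetical placeholders chosen for illustration; they are not taken from MUC-6 or from the system described in this paper.

```python
from typing import NamedTuple


class Succession(NamedTuple):
    """One row of a hypothetical management-succession table."""
    company: str
    person: str
    post: str
    action: str  # "hired" or "fired"


def read_row(row):
    """Read a row with the template 'COMPANY hired (or fired) PERSON for POST.'"""
    return f"{row.company} {row.action} {row.person} for {row.post}."


# Invented placeholder rows (assumption: not real extraction output).
table = [
    Succession("Acme Corp", "Jane Doe", "CEO", "hired"),
    Succession("Globex", "John Roe", "CFO", "fired"),
]

for row in table:
    print(read_row(row))
```

Because every row fits the same template, the relation stays fixed across the table while only the entities change.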
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:555px; width:226px; height:216px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:13px">Unrestricted Relation Discovery is a technique that
automatically discovers such relations that repeatedly
appear in a corpus and presents them as tables, with
no human intervention. Unlike in most existing
IE research, the user does not specify the type
of articles or information wanted. Instead, the system
tries to find all the kinds of relations that are reported
multiple times and can be presented in tabular form.
This technique opens up the possibility of trying
new IE scenarios. Furthermore, the system itself
can be used as an IE system, since an obtained
relation is already presented as a table. If this system
works to a certain extent, tuning an IE system
becomes a search problem: all the tables are already
built “preemptively.” A user only needs to search
for a relevant table.
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>