pdfminer.six/samples/nonfree/naacl06-shinyama.html.ref

<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:595px; height:842px;"></span>
<div style="position:absolute; top:50px;"><a name="1">Page 1</a></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:79px; top:125px; width:452px; height:18px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:13px">Preemptive Information Extraction using Unrestricted Relation Discovery
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:148px; top:164px; width:91px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:10px">Yusuke Shinyama
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:389px; top:164px; width:74px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:10px">Satoshi Sekine
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:222px; top:187px; width:168px; height:56px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:10px">New York University
<br>715, Broadway, 7th Floor
<br>New York, NY, 10003
<br></span><span style="font-family: ZNQAHA+CMSY10; font-size:13px">{</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:8px">yusuke,sekine</span><span style="font-family: ZNQAHA+CMSY10; font-size:13px">}</span><span style="font-family: CXOZYQ+NimbusMonL-Regu; font-size:8px">@cs.nyu.edu
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:163px; top:281px; width:44px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:10px">Abstract
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:93px; top:310px; width:183px; height:175px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">We are trying to extend the boundary of
<br>Information Extraction (IE) systems. Ex-
<br>isting IE systems require a lot of time and
<br>human effort to tune for a new scenario.
<br>Preemptive Information Extraction is an
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">attempt to automatically create all feasible
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">IE systems in advance without human in-
<br>tervention. We propose a technique called
<br>Unrestricted Relation Discovery that dis-
<br>covers all possible relations from texts and
<br>presents them as tables. We present a pre-
<br>liminary system that obtains reasonably
<br>good results.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:505px; width:80px; height:15px;"><span style="font-family: MZSZGI+NimbusRomNo9L-Medi; font-size:10px">1 Background
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:528px; width:226px; height:243px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">Every day, a large number of news articles are cre-
<br>ated and reported, many of which are unique. But
<br>certain types of events, such as hurricanes or mur-
<br>ders, are reported again and again throughout a year.
<br>The goal of Information Extraction, or IE, is to re-
<br>trieve a certain type of news event from past articles
<br>and present the events as a table whose columns are
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">ﬁlled with a name of a person or company, accord-
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">ing to its role in the event. However, existing IE
<br>techniques require a lot of human labor. First, you
<br>have to specify the type of information you want and
<br>collect articles that include this information. Then,
<br>you have to analyze the articles and manually craft
<br>a set of patterns to capture these events. Most exist-
<br>ing IE research focuses on reducing this burden by
<br>helping people create such patterns. But each time
<br>you want to extract a different kind of information,
<br>you need to repeat the whole process: specify arti-
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:313px; top:284px; width:226px; height:488px;"><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">cles and adjust its patterns, either manually or semi-
<br>automatically. There is a bit of a dangerous pitfall
<br>here. First, it is hard to estimate how good the sys-
<br>tem can be after months of work. Furthermore, you
<br>might not know if the task is even doable in the ﬁrst
<br>place. Knowing what kind of information is easily
<br>obtained in advance would help reduce this risk.
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">An IE task can be deﬁned as ﬁnding a relation
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">among several entities involved in a certain type of
<br>event. For example, in the MUC-6 management
<br>succession scenario, one seeks a relation between
<br>COMPANY, PERSON and POST involved with hir-
<br>ing/ﬁring events. For each row of an extracted ta-
<br>ble, you can always read it as “COMPANY hired
<br>(or ﬁred) PERSON for POST.” The relation between
<br>these entities is retained throughout the table. There
<br>are many existing works on obtaining extraction pat-
<br>terns for pre-deﬁned relations (Riloff, 1996; Yangar-
<br>ber et al., 2000; Agichtein and Gravano, 2000; Sudo
<br>et al., 2003).
<br>Unrestricted Relation Discovery is a technique to
<br>automatically discover such relations that repeatedly
<br>appear in a corpus and present them as a table, with
<br>absolutely no human intervention. Unlike most ex-
<br>isting IE research, a user does not specify the type
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">of articles or information wanted. Instead, a system
<br></span><span style="font-family: QTLIUY+NimbusRomNo9L-Regu; font-size:9px">tries to ﬁnd all the kinds of relations that are reported
<br>multiple times and can be reported in tabular form.
<br>This technique will open up the possibility of try-
<br>ing new IE scenarios. Furthermore, the system itself
<br>can be used as an IE system, since an obtained re-
<br>lation is already presented as a table. If this system
<br>works to a certain extent, tuning an IE system be-
<br>comes a search problem: all the tables are already
<br>built “preemptively.” A user only needs to search
<br>for a relevant table.
<br></span></div><div style="position:absolute; top:0px;">Page: <a href="#1">1</a></div>
</body></html>