2013-10-17 14:05:27 +00:00
#!/usr/bin/env python
2016-11-08 19:01:11 +00:00
Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, new dependency):
- make_compat_bytes(in_str)-> (py3->bytes | py2->str)
- make_compat_str(in_str)-> (str)
- compatible_encode_method(bytesorstring, encoding, erraction)-> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 20:08:57 +00:00
"""
Converts PDF text content (though not images containing text) to plain text, html, xml or "tags".
"""
import argparse
import logging
import six
import sys

import pdfminer.settings
pdfminer.settings.STRICT = False
import pdfminer.high_level
import pdfminer.layout
from pdfminer.image import ImageWriter


def extract_text(files=[], outfile='-',
            _py2_no_more_posargs=None,  # Bloody Python2 needs a shim
            no_laparams=False, all_texts=None, detect_vertical=None,  # LAParams
            word_margin=None, char_margin=None, line_margin=None, boxes_flow=None,  # LAParams
            output_type='text', codec='utf-8', strip_control=False,
            maxpages=0, page_numbers=None, password="", scale=1.0, rotation=0,
            layoutmode='normal', output_dir=None, debug=False,
            disable_caching=False, **other):
    if _py2_no_more_posargs is not None:
        raise ValueError("Too many positional arguments passed.")
    if not files:
        raise ValueError("Must provide files to work upon!")

    # Unless no_laparams is set, create an LAParams object and override its
    # defaults with any LAParams arguments that were given. Otherwise, pass None.
    if not no_laparams:
        laparams = pdfminer.layout.LAParams()
        for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
            paramv = locals().get(param, None)
            if paramv is not None:
                setattr(laparams, param, paramv)
    else:
        laparams = None

    imagewriter = None
    if output_dir:
        imagewriter = ImageWriter(output_dir)

    # When the output type was left at its default, infer it from the
    # output filename's suffix.
    if output_type == "text" and outfile != "-":
        for override, alttype in ((".htm", "html"),
                                  (".html", "html"),
                                  (".xml", "xml"),
                                  (".tag", "tag")):
            if outfile.endswith(override):
                output_type = alttype

    if outfile == "-":
        outfp = sys.stdout
        if outfp.encoding is not None:
            codec = 'utf-8'
    else:
        outfp = open(outfile, "wb")

    # Every local name (laparams, imagewriter, outfp, codec, ...) is forwarded
    # as a keyword argument to extract_text_to_fp.
    for fname in files:
        with open(fname, "rb") as fp:
            pdfminer.high_level.extract_text_to_fp(fp, **locals())
    return outfp
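
The suffix-based override in `extract_text` can be looked at in isolation. The following is a minimal sketch, not code from pdfminer: `infer_output_type` and `SUFFIX_TO_TYPE` are hypothetical names introduced here, and the helper assumes that an explicitly requested non-text type should always win over the filename.

```python
# Hypothetical helper (not part of pdfminer): mirrors the suffix table
# used in extract_text to pick an output type from the output filename.
SUFFIX_TO_TYPE = (
    (".htm", "html"),
    (".html", "html"),
    (".xml", "xml"),
    (".tag", "tag"),
)


def infer_output_type(outfile, output_type="text"):
    """Return the output type implied by outfile's suffix.

    The suffix is only consulted when the requested type is the default
    "text" and the output is a real file rather than stdout ("-").
    """
    if output_type != "text" or outfile == "-":
        return output_type
    for suffix, alttype in SUFFIX_TO_TYPE:
        if outfile.endswith(suffix):
            return alttype
    return output_type
```

Keeping the table as data makes it easy to see that an unknown suffix, or writing to stdout, leaves the requested type unchanged.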


def maketheparser():
    parser = argparse.ArgumentParser(description=__doc__, add_help=True)
    parser.add_argument("files", type=str, default=None, nargs="+", help="File to process.")
    parser.add_argument("-d", "--debug", default=False, action="store_true", help="Debug output.")
    parser.add_argument("-p", "--pagenos", type=str, help="Comma-separated list of page numbers to parse. Included for legacy applications, use --page-numbers for more idiomatic argument entry.")
    parser.add_argument("--page-numbers", type=int, default=None, nargs="+", help="Alternative to --pagenos with space-separated numbers; supersedes --pagenos where it is used.")
    parser.add_argument("-m", "--maxpages", type=int, default=0, help="Maximum pages to parse")
    parser.add_argument("-P", "--password", type=str, default="", help="Decryption password for PDF")
    parser.add_argument("-o", "--outfile", type=str, default="-", help="Output file (default \"-\" is stdout)")
    parser.add_argument("-t", "--output_type", type=str, default="text", help="Output type: text|html|xml|tag (default is text)")
    parser.add_argument("-c", "--codec", type=str, default="utf-8", help="Text encoding")
    parser.add_argument("-s", "--scale", type=float, default=1.0, help="Scale")
    parser.add_argument("-A", "--all-texts", default=None, action="store_true", help="LAParams all texts")
    parser.add_argument("-V", "--detect-vertical", default=None, action="store_true", help="LAParams detect vertical")
    parser.add_argument("-W", "--word-margin", type=float, default=None, help="LAParams word margin")
    parser.add_argument("-M", "--char-margin", type=float, default=None, help="LAParams char margin")
    parser.add_argument("-L", "--line-margin", type=float, default=None, help="LAParams line margin")
    parser.add_argument("-F", "--boxes-flow", type=float, default=None, help="LAParams boxes flow")
    parser.add_argument("-Y", "--layoutmode", default="normal", type=str, help="HTML Layout Mode")
    parser.add_argument("-n", "--no-laparams", default=False, action="store_true", help="Pass None as LAParams")
    parser.add_argument("-R", "--rotation", default=0, type=int, help="Rotation")
    parser.add_argument("-O", "--output-dir", default=None, help="Output directory for images")
    parser.add_argument("-C", "--disable-caching", default=False, action="store_true", help="Disable caching")
    parser.add_argument("-S", "--strip-control", default=False, action="store_true", help="Strip control in XML mode")
    return parser
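
The parser's option names line up with the keyword arguments of `extract_text` because argparse stores an option such as `--word-margin` under the attribute name `word_margin`. The sketch below illustrates that mapping with a cut-down parser. `build_demo_parser` and `demo_extract` are hypothetical stand-ins written for this example, not functions from this script.

```python
import argparse


def build_demo_parser():
    # Cut-down, hypothetical stand-in for maketheparser(): three options only.
    parser = argparse.ArgumentParser(description="demo")
    parser.add_argument("files", type=str, nargs="+")
    parser.add_argument("-A", "--all-texts", default=None, action="store_true")
    parser.add_argument("-W", "--word-margin", type=float, default=None)
    return parser


def demo_extract(files=None, all_texts=None, word_margin=None, **other):
    # Hypothetical stand-in for extract_text(): reports what it received.
    return {"files": files, "all_texts": all_texts, "word_margin": word_margin}


args = build_demo_parser().parse_args(["a.pdf", "-A", "--word-margin", "0.2"])
# Dashes in long option names become underscores in the namespace, so the
# parsed namespace can be unpacked directly into keyword arguments.
received = demo_extract(**vars(args))
```

Any namespace entry without a matching parameter ends up in the `**other` catch-all, which is why unpacking the whole namespace does not raise a `TypeError`.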


# main
def main(args=None):

    P = maketheparser()
    A = P.parse_args(args=args)

    # Convert one-based page numbers from the command line to zero-based ones.
    if A.page_numbers:
        A.page_numbers = set([x-1 for x in A.page_numbers])
    elif A.pagenos:
        # Only consulted when --page-numbers is absent, as the help text states.
        A.page_numbers = set([int(x)-1 for x in A.pagenos.split(",")])

    imagewriter = None
    if A.output_dir:
        imagewriter = ImageWriter(A.output_dir)

    if six.PY2 and sys.stdin.encoding:
        A.password = A.password.decode(sys.stdin.encoding)

    if A.output_type == "text" and A.outfile != "-":
        for override, alttype in ((".htm", "html"),
                                  (".html", "html"),
                                  (".xml", "xml"),
                                  (".tag", "tag")):
            if A.outfile.endswith(override):
                A.output_type = alttype

    if A.outfile == "-":
        outfp = sys.stdout
        if outfp.encoding is not None:
            # Why ignore outfp.encoding? :-/ stupid cathal?
            A.codec = 'utf-8'
    else:
        outfp = open(A.outfile, "wb")

    ## Test Code
    outfp = extract_text(**vars(A))
    outfp.close()
    return 0


if __name__ == '__main__': sys.exit(main())
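
The page-number handling in `main` turns the one-based numbers given on the command line into a zero-based set. A minimal sketch of that conversion follows. `normalize_page_numbers` is a hypothetical helper, not part of this script, and it assumes the precedence described in the `--page-numbers` help text (the space-separated form wins when both are given).

```python
def normalize_page_numbers(page_numbers=None, pagenos=None):
    """Return a zero-based set of page indices, or None if nothing was given.

    page_numbers: list of one-based ints, as produced by --page-numbers.
    pagenos: comma-separated string of one-based numbers, as given to --pagenos.
    Assumption: page_numbers takes precedence over pagenos.
    """
    if page_numbers:
        return set(x - 1 for x in page_numbers)
    if pagenos:
        return set(int(x) - 1 for x in pagenos.split(","))
    return None
```

Returning `None` when neither option is present matches the `page_numbers=None` default of `extract_text`.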