Commit Graph

161 Commits (b9a8920cdfdd84aea4f217aad94b7e82b1e222dc)

Author SHA1 Message Date
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
cybjit 2639b15ef4 guess argv encoding in py2 using sys.stdin.encoding 2014-09-16 23:17:26 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 714423883c setup logging for pdf2txt and fix dumppdf 2014-09-12 00:29:31 +02:00
cybjit ed13f7c47d conv_cmap py3 compat 2014-09-12 00:29:30 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 28c2a4e6ad 2.7/3.4 encoding corrected 2014-09-04 10:31:33 +02:00
unknown 7b610b34be tools must be a module to enable scripts tests 2014-09-04 09:47:33 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
unknown a6475b61b4 Python 3.4 support added and tested 2014-09-03 13:17:41 +02:00
Yusuke Shinyama fe86b4e64e Changed: StringIO -> io.BytesIO 2014-06-25 19:55:41 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama bb866ae148 Changed: new except syntax (2.6 or above). 2014-06-16 18:50:07 +09:00
Yusuke Shinyama 28e96ba3d0 Use print as a function. 2014-06-15 12:14:33 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama 17b9b19a26 Fixed for newer version: pdf2html.cgi 2014-04-02 18:54:50 +09:00
Yusuke Shinyama 340387bfc6 Cleanup: isinstance 2014-03-28 17:50:59 +09:00
Yusuke Shinyama f9079e4c0a Fixed dumppdf.py issues. 2014-03-24 20:55:00 +09:00
Yusuke Shinyama bb6f9b6fc9 Added: -R option. 2013-11-25 18:21:19 +09:00
Alex Rothberg af8c4a6b8f - only visit each objid once when dumping all objects 2013-11-18 20:41:09 -05:00
Yusuke Shinyama 2b56b2eedf Merged. 2013-11-07 19:50:41 +09:00
Matthew Duggan c1da8b835c PEP8: Remove trailing whitespace 2013-11-07 16:14:53 +09:00
Matthew Duggan 10a68c83bd Remove unused imports identified by pyflakes 2013-11-07 16:09:44 +09:00
Yusuke Shinyama d3730a29ec API change: process_pdf -> PDFPage.get_pages 2013-10-22 18:59:16 +09:00
Yusuke Shinyama 8a70a9f657 fixed: encoding problem with vertical characters. 2013-10-22 18:44:40 +09:00
Yusuke Shinyama 32844507ea Fixed some style issues. 2013-10-19 08:41:01 +09:00
Yusuke Shinyama 28cb424f8f Merge pull request #21 from eug48/master
dumppdf: support for extracting embedded files using the -E option
2013-10-18 16:23:09 -07:00
Yusuke Shinyama 6ca9ac5434 chmod fix. 2013-10-17 23:06:07 +09:00
Yusuke Shinyama 0ea08890d4 renamed: python2 -> python. 2013-10-17 23:05:27 +09:00
Yusuke Shinyama 6ad82e355c Beating the codepage dragon. 2013-10-17 22:57:48 +09:00
Yusuke Shinyama 774827b4ce Code cleanup: conv_cmap.py 2013-10-12 13:20:40 +09:00
Yusuke Shinyama f85c374cae Separated PDFPage to pdfpage.py. 2013-10-10 19:54:55 +09:00
Yusuke Shinyama c926874d20 API Change: the PDFDocument cstr now takes PDFParser. set_parser() is removed. 2013-10-10 18:40:06 +09:00
Yusuke Shinyama 2221163b94 Split pdfparser.py and pdfdocument.py. 2013-10-10 18:29:30 +09:00
Yusuke Shinyama 1467fc674c Added fallback for broken PDFs. 2013-10-09 22:45:54 +09:00
Yusuke Shinyama 06425bba00 Introducing PDFObjectNotFound 2013-10-09 21:39:23 +09:00
eug 925845b172 dumppdf: support for extracting embedded files using the -E option 2013-01-20 13:29:35 +10:00
Yusuke Shinyama 82ff98c7b3 imagewriter now works with text output 2011-11-07 01:15:10 +10:00
Yusuke Shinyama dc8fde0e47 added CCITTFaxFilter support and a very crude image extraction. 2011-07-18 21:07:00 +10:00
Yusuke Shinyama fcf0d74ecc tweaks for debugging 2011-04-21 22:07:52 +09:00
Yusuke Shinyama 4918d59bc2 disable caching support 2011-03-03 00:04:43 +09:00
Yusuke Shinyama 7dbb664db3 code cleanup and more debugging options 2011-02-14 23:42:05 +09:00
Yusuke Shinyama cbd58121e3 fix aggressive vertical writing detection (which ruins layout) 2011-02-02 23:09:34 +09:00
Yusuke Shinyama d3bcc0eef5 another minor fix 2010-12-26 19:30:46 +09:00
Yusuke Shinyama a24c452ba2 boxes_flow patch by Daniel Gerber 2010-12-26 17:26:39 +09:00
Yusuke Shinyama bf44e52cf7 merged 2010-12-25 17:54:17 +09:00
yusuke.shinyama.dummy 866f2bbb75 webapp fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@283 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:35 +00:00
yusuke.shinyama.dummy 5d98a27d9c test cases updated
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@282 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:11 +00:00
Yusuke Shinyama 432b3829d3 test cases updated 2010-12-24 22:30:25 +09:00
yusuke.shinyama.dummy 2bf9c23801 check_extractable paramater added
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@276 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-23 10:53:28 +00:00
yusuke.shinyama.dummy 7374b81383 htmlconverter improved
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@274 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 15:04:28 +00:00
yusuke.shinyama.dummy 509ab66319 stay with python2
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@264 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-19 09:57:01 +00:00
yusuke.shinyama.dummy afe33312c6 outline bug fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@249 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:14:52 +00:00
yusuke.shinyama.dummy ca5588a702 bugfix by Humberto Pereira
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@241 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:59:50 +00:00
yusuke.shinyama.dummy 4554705881 glyphlist bug (due to my misunderstanding of spec.)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@237 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:46 +00:00
yusuke.shinyama.dummy a0dd46bd8e cmap compression patch. thanks to Jakub Wilk
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@228 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-13 13:50:24 +00:00
yusuke.shinyama.dummy f9c9357547 pdf2html.cgi code cleanup
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@218 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-29 11:51:15 +00:00
yusuke.shinyama.dummy 8e92ddca30 latin2ascii.py was moved as a utility
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@215 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-05 05:51:11 +00:00
yusuke.shinyama.dummy eb535d4106 change PDFPageAggregator -> PDFLayoutAnalyzer
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@213 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:21 +00:00
yusuke.shinyama.dummy 32d65b70f8 trivial change
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@211 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:03 +00:00
yusuke.shinyama.dummy 97848409e5 fix xobject resources bug, thanks to Jose Maria
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@209 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 04:32:03 +00:00
yusuke.shinyama.dummy 9052cd1ea7 better TOC extraction
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@207 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 01:34:18 +00:00
yusuke.shinyama.dummy e77a6ba997 -A (all_texts) option added for layout analysis
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@205 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-10 11:30:03 +00:00
yusuke.shinyama.dummy 2e5b92c18a writing mode detection
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@196 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-25 11:38:47 +00:00
yusuke.shinyama.dummy ee34d8d549 bugfix (thanks to Brian Berry).
Remaining TODOs: automatic testing for vertical texts. Various layout analysis tuning.


git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@193 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 08:36:39 +00:00
yusuke.shinyama.dummy 2555b38836 fix typos (patches by sm)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@183 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-15 14:50:19 +00:00
yusuke.shinyama.dummy 2dee2efad9 apply more patches
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@181 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-13 15:00:43 +00:00
yusuke.shinyama.dummy 538a605ac0 several bugfixes.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@179 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-07 03:14:00 +00:00
yusuke.shinyama.dummy 0f8fe3f19e Page rotation bug fixed.
Various minor fixes.


git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@176 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-31 02:09:28 +00:00
yusuke.shinyama.dummy dc6e5c366d jpeg extraction support added.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@174 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-30 07:30:01 +00:00
yusuke.shinyama.dummy a9d7a00ccd trivial grammar errors
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@173 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-10 07:18:05 +00:00
yusuke.shinyama.dummy 9486303103 pdf2html.cgi
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@169 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-01 14:15:25 +00:00
yusuke.shinyama.dummy 98c8367339 warning removal.
code cleanup.
cmap bug fixed.


git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@168 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-01 03:09:26 +00:00
yusuke.shinyama.dummy fb05e4b990 for release 20091219
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@164 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 15:10:58 +00:00
yusuke.shinyama.dummy e4b089e327 include cmap
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@162 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 14:17:00 +00:00
yusuke.shinyama.dummy ed8a5362b9 renamed cmap.py -> cmapdb.py (avoiding future name changes)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@161 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 06:52:02 +00:00
yusuke.shinyama.dummy 61d4872c3a add -n option to pdf2txt.py
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@157 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-07 09:12:54 +00:00
yusuke.shinyama.dummy faa775897c another bugfix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@156 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-07 09:01:11 +00:00
yusuke.shinyama.dummy f444c88e3d testing against None with "is", not using "=="
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@153 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-06 15:10:29 +00:00
yusuke.shinyama.dummy 77986b8273 fix CMapDB initialization stuff. more code cleanup.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@148 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-03 13:39:34 +00:00
yusuke.shinyama.dummy 78f7866554 sgml to xml
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@146 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-31 03:04:56 +00:00
yusuke.shinyama.dummy 23b8058ad4 outfp closing bug fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@145 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-31 02:09:36 +00:00
yusuke.shinyama.dummy 7790808560 to 4-space indentation
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@142 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-24 04:41:59 +00:00
yusuke.shinyama.dummy 8a5bec5065 layout analysis improved.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@120 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-07-21 07:55:19 +00:00
yusuke.shinyama.dummy 787ae4f814 documentation fix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@117 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-07-11 12:42:12 +00:00
yusuke.shinyama.dummy 97dd4dda5e improved clustering
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@116 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-06-20 10:44:00 +00:00
yusuke.shinyama.dummy c7a0894182 auto detect output type
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@115 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-06-20 10:00:51 +00:00
yusuke.shinyama.dummy 8cae56a555 documentation fix.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@108 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-17 06:21:08 +00:00
yusuke.shinyama.dummy 173d095522 text spacing bug fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@106 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-16 10:42:35 +00:00
yusuke.shinyama.dummy 3e12268bf6 rename package pdflib -> pdfminer.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@103 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-16 06:12:01 +00:00
yusuke.shinyama.dummy f628c0d3fe git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@101 1aa58f4a-7d42-0410-adbc-911cccaed67c 2009-05-15 14:34:53 +00:00
yusuke.shinyama.dummy 43e5c05307 handle error when an object was not found in dumpxml()
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@92 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-26 15:03:47 +00:00
yusuke.shinyama.dummy 6d91453187 text positioning got right.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@87 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-18 17:15:49 +00:00
yusuke.shinyama.dummy f8510edffc AsciiHexDecode filter patch incorporated. Thanks to Troy Bollinger.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@86 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-08 10:55:01 +00:00
yusuke.shinyama.dummy d11012d9f7 delete unused file
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@85 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-08 10:37:13 +00:00
yusuke.shinyama.dummy 162c5f0bfa webapp fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@83 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-02 14:24:57 +00:00
yusuke.shinyama.dummy 70e42bff04 encoding bug fixed.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@74 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-03-24 16:26:59 +00:00
yusuke.shinyama.dummy b432a3f4ae patch from Troy Bollinger.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@71 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-02-28 05:44:08 +00:00
yusuke.shinyama.dummy 91770edd46 foo
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@59 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-01-10 09:25:03 +00:00
yusuke.shinyama.dummy 24bdd33557 various bugfixes
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@56 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-01-05 04:40:50 +00:00