Cathal Garvey
1b47bed306
Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
...
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, new dependency):
- make_compat_bytes(in_str)-> (py3->bytes | py2->str)
- make_compat_str(in_str)-> (str)
- compatible_encode_method(bytesorstring, encoding, erraction)-> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes, particularly around the
py2 `str.encode` method and its *assumed* equivalents in Py3.
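*Illustration of the utils helpers:*
The following is a minimal sketch of how the three compatibility helpers listed under utils might behave, not the code from this commit. It assumes UTF-8 as the default encoding instead of guessing with chardet, uses a plain `sys.version_info` check in place of six, and the default argument values are hypothetical.

```python
import sys

# Assumption: a plain version check stands in for the six.PY2/six.PY3 flags.
PY2 = sys.version_info[0] == 2


def make_compat_bytes(in_str, encoding="utf-8"):
    """Return a byte string: bytes on Py3, str on Py2."""
    if isinstance(in_str, bytes):  # on Py2, bytes is an alias of str
        return in_str
    return in_str.encode(encoding)


def make_compat_str(in_str, encoding="utf-8"):
    """Return the interpreter's native str type."""
    if isinstance(in_str, str):
        return in_str
    if PY2:
        return in_str.encode(encoding)  # Py2: unicode -> str
    return in_str.decode(encoding)      # Py3: bytes -> str


def compatible_encode_method(bytesorstring, encoding="utf-8", erraction="ignore"):
    """Apply an encoding (with its error action) and return a native str."""
    if isinstance(bytesorstring, bytes):
        text = bytesorstring.decode(encoding, erraction)
    else:
        text = bytesorstring
    encoded = text.encode(encoding, erraction)
    # Py2's native str is already a byte string; Py3 needs decoding back to text.
    return encoded if PY2 else encoded.decode(encoding)
```

For example, on Py3 `compatible_encode_method(u"caf\u00e9", "ascii", "ignore")` drops the character that ASCII cannot represent and returns a native `str`.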
2015-05-17 21:08:57 +01:00
cybjit
2639b15ef4
guess argv encoding in py2 using sys.stdin.encoding
2014-09-16 23:17:26 +02:00
cybjit
14585987c3
keep password api unicode, latin1 or utf-8 is encoded in handler
2014-09-16 22:58:25 +02:00
cybjit
714423883c
setup logging for pdf2txt and fix dumppdf
2014-09-12 00:29:31 +02:00
cybjit
ed13f7c47d
conv_cmap py3 compat
2014-09-12 00:29:30 +02:00
cybjit
0a2d90c051
pdf2txt: do not double encode stdout
2014-09-07 18:34:11 +02:00
unknown
28c2a4e6ad
2.7/3.4 encoding corrected
2014-09-04 10:31:33 +02:00
unknown
7b610b34be
tools must be a module to enable scripts tests
2014-09-04 09:47:33 +02:00
unknown
29c07ea770
Python 3.4 support and tests
2014-09-03 15:26:08 +02:00
unknown
a6475b61b4
Python 3.4 support added and tested
2014-09-03 13:17:41 +02:00
Yusuke Shinyama
fe86b4e64e
Changed: StringIO -> io.BytesIO
2014-06-25 19:55:41 +09:00
Yusuke Shinyama
44074b42ea
Added: stripcontrol for XMLConverter (-S option)
2014-06-22 00:33:00 +09:00
Yusuke Shinyama
bb866ae148
Changed: new except syntax (2.6 or above).
2014-06-16 18:50:07 +09:00
Yusuke Shinyama
28e96ba3d0
Use print as a function.
2014-06-15 12:14:33 +09:00
Yusuke Shinyama
1384a3fe8d
Code cleanup: removed some debug flags.
2014-06-14 15:43:10 +09:00
Yusuke Shinyama
17b9b19a26
Fixed for newer version: pdf2html.cgi
2014-04-02 18:54:50 +09:00
Yusuke Shinyama
340387bfc6
Cleanup: isinstance
2014-03-28 17:50:59 +09:00
Yusuke Shinyama
f9079e4c0a
Fixed dumppdf.py issues.
2014-03-24 20:55:00 +09:00
Yusuke Shinyama
bb6f9b6fc9
Added: -R option.
2013-11-25 18:21:19 +09:00
Alex Rothberg
af8c4a6b8f
- only visit each objid once when dumping all objects
2013-11-18 20:41:09 -05:00
Yusuke Shinyama
2b56b2eedf
Merged.
2013-11-07 19:50:41 +09:00
Matthew Duggan
c1da8b835c
PEP8: Remove trailing whitespace
2013-11-07 16:14:53 +09:00
Matthew Duggan
10a68c83bd
Remove unused imports identified by pyflakes
2013-11-07 16:09:44 +09:00
Yusuke Shinyama
d3730a29ec
API change: process_pdf -> PDFPage.get_pages
2013-10-22 18:59:16 +09:00
Yusuke Shinyama
8a70a9f657
fixed: encoding problem with vertical characters.
2013-10-22 18:44:40 +09:00
Yusuke Shinyama
32844507ea
Fixed some style issues.
2013-10-19 08:41:01 +09:00
Yusuke Shinyama
28cb424f8f
Merge pull request #21 from eug48/master
...
dumppdf: support for extracting embedded files using the -E option
2013-10-18 16:23:09 -07:00
Yusuke Shinyama
6ca9ac5434
chmod fix.
2013-10-17 23:06:07 +09:00
Yusuke Shinyama
0ea08890d4
renamed: python2 -> python.
2013-10-17 23:05:27 +09:00
Yusuke Shinyama
6ad82e355c
Beating the codepage dragon.
2013-10-17 22:57:48 +09:00
Yusuke Shinyama
774827b4ce
Code cleanup: conv_cmap.py
2013-10-12 13:20:40 +09:00
Yusuke Shinyama
f85c374cae
Separated PDFPage to pdfpage.py.
2013-10-10 19:54:55 +09:00
Yusuke Shinyama
c926874d20
API Change: the PDFDocument constructor now takes a PDFParser. set_parser() is removed.
2013-10-10 18:40:06 +09:00
Yusuke Shinyama
2221163b94
Split pdfparser.py and pdfdocument.py.
2013-10-10 18:29:30 +09:00
Yusuke Shinyama
1467fc674c
Added fallback for broken PDFs.
2013-10-09 22:45:54 +09:00
Yusuke Shinyama
06425bba00
Introducing PDFObjectNotFound
2013-10-09 21:39:23 +09:00
eug
925845b172
dumppdf: support for extracting embedded files using the -E option
2013-01-20 13:29:35 +10:00
Yusuke Shinyama
82ff98c7b3
imagewriter now works with text output
2011-11-07 01:15:10 +10:00
Yusuke Shinyama
dc8fde0e47
added CCITTFaxFilter support and a very crude image extraction.
2011-07-18 21:07:00 +10:00
Yusuke Shinyama
fcf0d74ecc
tweaks for debugging
2011-04-21 22:07:52 +09:00
Yusuke Shinyama
4918d59bc2
disable caching support
2011-03-03 00:04:43 +09:00
Yusuke Shinyama
7dbb664db3
code cleanup and more debugging options
2011-02-14 23:42:05 +09:00
Yusuke Shinyama
cbd58121e3
fix aggressive vertical writing detection (which ruins layout)
2011-02-02 23:09:34 +09:00
Yusuke Shinyama
d3bcc0eef5
another minor fix
2010-12-26 19:30:46 +09:00
Yusuke Shinyama
a24c452ba2
boxes_flow patch by Daniel Gerber
2010-12-26 17:26:39 +09:00
Yusuke Shinyama
bf44e52cf7
merged
2010-12-25 17:54:17 +09:00
yusuke.shinyama.dummy
866f2bbb75
webapp fixed
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@283 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:35 +00:00
yusuke.shinyama.dummy
5d98a27d9c
test cases updated
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@282 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:11 +00:00
Yusuke Shinyama
432b3829d3
test cases updated
2010-12-24 22:30:25 +09:00
yusuke.shinyama.dummy
2bf9c23801
check_extractable parameter added
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@276 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-23 10:53:28 +00:00
yusuke.shinyama.dummy
7374b81383
htmlconverter improved
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@274 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 15:04:28 +00:00
yusuke.shinyama.dummy
509ab66319
stay with python2
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@264 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-19 09:57:01 +00:00
yusuke.shinyama.dummy
afe33312c6
outline bug fixed
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@249 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:14:52 +00:00
yusuke.shinyama.dummy
ca5588a702
bugfix by Humberto Pereira
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@241 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:59:50 +00:00
yusuke.shinyama.dummy
4554705881
glyphlist bug (due to my misunderstanding of spec.)
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@237 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:46 +00:00
yusuke.shinyama.dummy
a0dd46bd8e
cmap compression patch. thanks to Jakub Wilk
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@228 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-13 13:50:24 +00:00
yusuke.shinyama.dummy
f9c9357547
pdf2html.cgi code cleanup
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@218 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-29 11:51:15 +00:00
yusuke.shinyama.dummy
8e92ddca30
latin2ascii.py was moved as a utility
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@215 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-05 05:51:11 +00:00
yusuke.shinyama.dummy
eb535d4106
change PDFPageAggregator -> PDFLayoutAnalyzer
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@213 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:21 +00:00
yusuke.shinyama.dummy
32d65b70f8
trivial change
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@211 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:03 +00:00
yusuke.shinyama.dummy
97848409e5
fix xobject resources bug, thanks to Jose Maria
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@209 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 04:32:03 +00:00
yusuke.shinyama.dummy
9052cd1ea7
better TOC extraction
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@207 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 01:34:18 +00:00
yusuke.shinyama.dummy
e77a6ba997
-A (all_texts) option added for layout analysis
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@205 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-10 11:30:03 +00:00
yusuke.shinyama.dummy
2e5b92c18a
writing mode detection
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@196 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-25 11:38:47 +00:00
yusuke.shinyama.dummy
ee34d8d549
bugfix (thanks to Brian Berry).
...
Remaining TODOs: automatic testing for vertical texts. Various layout analysis tuning.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@193 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 08:36:39 +00:00
yusuke.shinyama.dummy
2555b38836
fix typos (patches by sm)
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@183 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-15 14:50:19 +00:00
yusuke.shinyama.dummy
2dee2efad9
apply more patches
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@181 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-13 15:00:43 +00:00
yusuke.shinyama.dummy
538a605ac0
several bugfixes.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@179 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-07 03:14:00 +00:00
yusuke.shinyama.dummy
0f8fe3f19e
Page rotation bug fixed.
...
Various minor fixes.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@176 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-31 02:09:28 +00:00
yusuke.shinyama.dummy
dc6e5c366d
jpeg extraction support added.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@174 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-30 07:30:01 +00:00
yusuke.shinyama.dummy
a9d7a00ccd
trivial grammar errors
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@173 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-10 07:18:05 +00:00
yusuke.shinyama.dummy
9486303103
pdf2html.cgi
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@169 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-01 14:15:25 +00:00
yusuke.shinyama.dummy
98c8367339
warning removal.
...
code cleanup.
cmap bug fixed.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@168 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-01-01 03:09:26 +00:00
yusuke.shinyama.dummy
fb05e4b990
for release 20091219
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@164 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 15:10:58 +00:00
yusuke.shinyama.dummy
e4b089e327
include cmap
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@162 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 14:17:00 +00:00
yusuke.shinyama.dummy
ed8a5362b9
renamed cmap.py -> cmapdb.py (avoiding future name changes)
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@161 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-12-19 06:52:02 +00:00
yusuke.shinyama.dummy
61d4872c3a
add -n option to pdf2txt.py
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@157 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-07 09:12:54 +00:00
yusuke.shinyama.dummy
faa775897c
another bugfix
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@156 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-07 09:01:11 +00:00
yusuke.shinyama.dummy
f444c88e3d
testing against None with "is", not using "=="
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@153 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-06 15:10:29 +00:00
yusuke.shinyama.dummy
77986b8273
fix CMapDB initialization stuff. more code cleanup.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@148 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-11-03 13:39:34 +00:00
yusuke.shinyama.dummy
78f7866554
sgml to xml
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@146 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-31 03:04:56 +00:00
yusuke.shinyama.dummy
23b8058ad4
outfp closing bug fixed
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@145 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-31 02:09:36 +00:00
yusuke.shinyama.dummy
7790808560
to 4-space indentation
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@142 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-10-24 04:41:59 +00:00
yusuke.shinyama.dummy
8a5bec5065
layout analysis improved.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@120 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-07-21 07:55:19 +00:00
yusuke.shinyama.dummy
787ae4f814
documentation fix
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@117 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-07-11 12:42:12 +00:00
yusuke.shinyama.dummy
97dd4dda5e
improved clustering
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@116 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-06-20 10:44:00 +00:00
yusuke.shinyama.dummy
c7a0894182
auto detect output type
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@115 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-06-20 10:00:51 +00:00
yusuke.shinyama.dummy
8cae56a555
documentation fix.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@108 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-17 06:21:08 +00:00
yusuke.shinyama.dummy
173d095522
text spacing bug fixed
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@106 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-16 10:42:35 +00:00
yusuke.shinyama.dummy
3e12268bf6
rename package pdflib -> pdfminer.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@103 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-16 06:12:01 +00:00
yusuke.shinyama.dummy
f628c0d3fe
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@101 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-05-15 14:34:53 +00:00
yusuke.shinyama.dummy
43e5c05307
handle error when an object was not found in dumpxml()
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@92 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-26 15:03:47 +00:00
yusuke.shinyama.dummy
6d91453187
text positioning corrected.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@87 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-18 17:15:49 +00:00
yusuke.shinyama.dummy
f8510edffc
AsciiHexDecode filter patch incorporated. Thanks to Troy Bollinger.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@86 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-08 10:55:01 +00:00
yusuke.shinyama.dummy
d11012d9f7
delete unused file
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@85 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-08 10:37:13 +00:00
yusuke.shinyama.dummy
162c5f0bfa
webapp fixed
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@83 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-04-02 14:24:57 +00:00
yusuke.shinyama.dummy
70e42bff04
encoding bug fixed.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@74 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-03-24 16:26:59 +00:00
yusuke.shinyama.dummy
b432a3f4ae
patch from Troy Bollinger.
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@71 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-02-28 05:44:08 +00:00
yusuke.shinyama.dummy
91770edd46
foo
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@59 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-01-10 09:25:03 +00:00
yusuke.shinyama.dummy
24bdd33557
various bugfixes
...
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@56 1aa58f4a-7d42-0410-adbc-911cccaed67c
2009-01-05 04:40:50 +00:00