Cathal Garvey
1b47bed306
Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.
*In pdf2txt.py:*
* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.
*In utils:*
* Added a few compatibility functions (some string hax required chardet, a new dependency):
  - make_compat_bytes(in_str) -> (py3: bytes | py2: str)
  - make_compat_str(in_str) -> (str)
  - compatible_encode_method(bytesorstring, encoding, erraction) -> (str)
*In pdfdevice:*
* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
as well as some six.PYX checks and logic. These changes are largely responsible for
enhanced Py2/Py3 consistency.
*In converter:*
* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
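The compatibility helpers named in commit 1b47bed306 could look roughly like the sketch below. Only the three function names and their return types come from the commit message; the bodies, the default arguments, and the `PY3` flag are assumptions made for illustration. The commit says the real helpers rely on chardet to guess encodings, whereas this sketch simply assumes a caller-supplied encoding.

```python
import sys

# Assumption: a plain version check stands in for the six.PY2/six.PY3 flags.
PY3 = sys.version_info[0] >= 3


def make_compat_bytes(in_str, encoding="utf-8"):
    """Return bytes on Py3; on Py2 this is a plain str (the same type)."""
    if isinstance(in_str, bytes):
        return in_str
    return in_str.encode(encoding)


def make_compat_str(in_str, encoding="utf-8"):
    """Return the native str type of the running interpreter."""
    if isinstance(in_str, str):
        return in_str
    if PY3:
        return in_str.decode(encoding, "replace")  # bytes -> str
    return in_str.encode(encoding, "replace")      # Py2 unicode -> str


def compatible_encode_method(bytesorstring, encoding="utf-8", erraction="ignore"):
    """Mimic Py2 str.encode(): apply the encoding rules but return a native str."""
    if PY3:
        text = make_compat_str(bytesorstring, encoding)
        # Round-trip so unencodable characters are handled per erraction.
        return text.encode(encoding, erraction).decode(encoding)
    return bytesorstring.encode(encoding, erraction)


as_bytes = make_compat_bytes("abc")
as_text = make_compat_str(b"abc")
stripped = compatible_encode_method(u"caf\u00e9", "ascii", "ignore")
```

The round trip in `compatible_encode_method` is one possible reading of "str in, str out"; the actual helper may behave differently.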
Yusuke Shinyama
14fd0fd2d6
Fixed: #84 (fontname was in unicode)
2015-04-05 19:02:02 +09:00
speedplane
806ee603ff
More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.
For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
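The idea behind commit 806ee603ff, merging text fragments that sit side by side on the same visual line, can be sketched as follows. This is not the layout code from the commit: the `Box` tuple, `same_line`, `merge_horizontal`, and the 50% overlap threshold are all hypothetical names and values chosen for the illustration.

```python
from collections import namedtuple

# Hypothetical text fragment: bounding box plus its text.
Box = namedtuple("Box", "x0 y0 x1 y1 text")


def same_line(a, b):
    """True when a and b overlap vertically by more than half the shorter box."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    return overlap > 0.5 * min(a.y1 - a.y0, b.y1 - b.y0)


def merge_horizontal(boxes):
    """Group fragments that share a line, then merge each group left to right."""
    groups = []
    for box in boxes:
        for group in groups:
            if same_line(group[0], box):
                group.append(box)
                break
        else:
            groups.append([box])

    lines = []
    for group in groups:
        group.sort(key=lambda b: b.x0)
        lines.append(Box(
            min(b.x0 for b in group), min(b.y0 for b in group),
            max(b.x1 for b in group), max(b.y1 for b in group),
            " ".join(b.text for b in group),
        ))
    lines.sort(key=lambda b: -b.y1)  # top of the page first
    return lines


# Fragments emitted column by column, although the text reads horizontally.
fragments = [
    Box(0, 90, 40, 100, "Name"), Box(0, 70, 40, 80, "Ada"),
    Box(60, 90, 100, 100, "Role"), Box(60, 70, 100, 80, "Dev"),
]
merged = merge_horizontal(fragments)
```

Here the four fragments arrive in column order but come out as two horizontal lines, which is the situation the commit message describes.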
speedplane
45170e7183
There are a number of relatively complex changes here. Comments are in order of where the change appears.
1. When detecting text in a horizontal line, we already add a space between words separated by more than word_margin. Now we only do so if there is not already a space there, which prevents multiple spaces from being placed between words.
2. Detect a horizontal line if the line is zero width. This improves detection of horizontal lines when looking for both horizontal and vertical lines.
3. Don't detect a vertical line if the previous letter is whitespace. This prevents double spaces from being caught as vertical lines.
4. Improve upon an unfortunate O(N^2) algorithm that I have seen take many minutes to execute. Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only apply it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
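The first change in commit 45170e7183 (insert a word space only when no space is already present) can be illustrated with a small sketch. The function `join_chars` and the tuple layout are hypothetical, and treating `word_margin` as a fraction of the character width is an assumption made here, not something the commit message states.

```python
def join_chars(chars, word_margin=0.1):
    """Join (x0, x1, text) tuples, sorted left to right, into one line of text.

    A space is inserted when the gap to the previous character exceeds
    word_margin * character width, unless a space is already there.
    """
    out = []
    prev_x1 = None
    for x0, x1, text in chars:
        if prev_x1 is not None:
            gap = x0 - prev_x1
            margin = word_margin * (x1 - x0)
            already_spaced = out[-1].isspace() or text.isspace()
            if gap > margin and not already_spaced:
                out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)


# A wide gap with no explicit space character: one space is inserted.
gap_only = join_chars([(0, 10, "a"), (10, 20, "b"), (30, 40, "c")])

# A wide gap that follows an explicit space: no second space is added.
explicit_space = join_chars([(0, 10, "a"), (10, 15, " "), (30, 40, "b")])
```

Without the `already_spaced` check, the second input would come out with two spaces between the letters.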
Yusuke Shinyama
0112112458
Fixed: crash on invalid chr number.
2014-12-09 22:55:47 +09:00
enkore
d0379a2c44
Fix utils.decode_text
2014-12-04 17:09:52 +01:00
speedplane
36977fbe08
Add debug flags for much of the debug output.
2014-11-11 23:36:58 -05:00
speedplane
ecc4d05675
Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
cybjit
515687e1bb
more xrange to range
2014-09-16 23:17:31 +02:00
cybjit
9b2e29396b
apply_png_predictor py3
2014-09-16 22:59:29 +02:00
cybjit
ad05121c69
password py3
2014-09-16 22:59:00 +02:00
cybjit
14585987c3
keep password api unicode, latin1 or utf-8 is encoded in handler
2014-09-16 22:58:25 +02:00
cybjit
2260f77b19
fix dict_value usage in strict mode
2014-09-16 22:57:29 +02:00
cybjit
51a361c145
clean up HTMLConverter and XMLConverter encoding
2014-09-16 22:57:00 +02:00
Goulu
8861d7e0ed
version 20140915 pushed to PyPI as pdfminer_six
2014-09-15 10:33:04 +02:00
cybjit
39942b6642
avoid string formatting when not logging
2014-09-12 00:29:31 +02:00
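One common way to avoid formatting when nothing will be logged, as commit 39942b6642 sets out to do, is to pass the arguments to the logging call instead of formatting the message up front. The sketch below shows the difference with the standard `logging` module; the `Expensive` class and the logger name are hypothetical, and the commit itself may have used a different technique, such as guarding calls with a debug flag.

```python
import logging

logger = logging.getLogger("pdfminer_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)               # DEBUG messages are disabled


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive"


obj = Expensive()

# Eager: the % operator formats the message before logging checks the level.
logger.debug("object: %s" % obj)
eager_calls = obj.calls

# Lazy: logging formats the message only if the record is actually emitted.
logger.debug("object: %s", obj)
lazy_calls = obj.calls - eager_calls
```

With DEBUG disabled, the eager call still pays for one `__str__` conversion while the lazy call pays for none.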
cybjit
01821c7d1e
rename bytes to avoid built-in collision
2014-09-12 00:29:31 +02:00
cybjit
31e6afc7cf
faster and simpler bytes implementation
2014-09-12 00:29:30 +02:00
cybjit
cba5a42ba8
decipher_all bytes
2014-09-12 00:29:30 +02:00
cybjit
6357e2da80
code2cid uses int, not byte
2014-09-12 00:29:27 +02:00
cybjit
9b0a3ee53e
decode cmap font name
2014-09-11 23:30:02 +02:00
cybjit
a6f31a713d
cmap bytes and decode
2014-09-07 18:41:04 +02:00
cybjit
cc733c8217
fixes for ARC4
2014-09-07 18:38:22 +02:00
cybjit
f9a67db89b
change xrange to range
2014-09-07 18:36:12 +02:00
cybjit
0a2d90c051
pdf2txt: do not double encode stdout
2014-09-07 18:34:11 +02:00
unknown
58b8492783
no logging in travis.ci
2014-09-04 10:19:50 +02:00
unknown
1c93468c7e
faster, less verbose tests
2014-09-04 10:02:29 +02:00
unknown
4ab48d1803
Python 3.4 compatibility + tests
2014-09-04 09:36:19 +02:00
unknown
29c07ea770
Python 3.4 support and tests
2014-09-03 15:26:08 +02:00
unknown
a6475b61b4
Python 3.4 support added and tested
2014-09-03 13:17:41 +02:00
unknown
846cd18186
Python 3.4 support
2014-09-02 15:49:46 +02:00
unknown
faea7291a8
tests pass under Py 2.7 and 3.4
2014-09-01 14:16:49 +02:00
Yusuke Shinyama
b0e035c24f
Style fix: always have an explicit return.
2014-07-15 21:38:29 +09:00
Yusuke Shinyama
f5b5e31921
Fixed: DecodeParms array support.
2014-07-09 19:07:27 +09:00
Yusuke Shinyama
137fc3a1ae
Use KWD instead of token.name.
2014-06-30 19:15:21 +09:00
Yusuke Shinyama
1ccfaff411
String-Bytes distinction (first attempt).
2014-06-30 19:05:56 +09:00
Yusuke Shinyama
8791355e1d
Cleanup imports. Use relative imports.
2014-06-26 18:12:39 +09:00
Yusuke Shinyama
2e900e5d10
Fixed for consistent test results. (hopefully...)
2014-06-26 17:41:31 +09:00
Yusuke Shinyama
fe86b4e64e
Changed: StringIO -> io.BytesIO
2014-06-25 19:55:41 +09:00
Yusuke Shinyama
44074b42ea
Added: stripcontrol for XMLConverter (-S option)
2014-06-22 00:33:00 +09:00
Yusuke Shinyama
81391c09f4
Fixed: #56 (with a derpy fix)
2014-06-18 19:11:45 +09:00
Yusuke Shinyama
bb866ae148
Changed: new except syntax (2.6 or above).
2014-06-16 18:50:07 +09:00
Yusuke Shinyama
28e96ba3d0
Use print as a function.
2014-06-15 12:14:33 +09:00
Yusuke Shinyama
0387a6c260
Removed: tuple-unpacking args.
2014-06-15 12:12:13 +09:00
Yusuke Shinyama
a8ec99a848
More autotest tweaks.
2014-06-15 10:52:59 +09:00
Yusuke Shinyama
1384a3fe8d
Code cleanup: removed some debug flags.
2014-06-14 15:43:10 +09:00
Yusuke Shinyama
d9680fca7e
Plane: preserve the object order so that the test result is always consistent.
2014-06-14 14:44:53 +09:00
Yusuke Shinyama
aed248610c
Fixed: dependency on pygame in a unittest.
2014-06-14 12:05:26 +09:00
Yusuke Shinyama
8e14ebf4e1
Use logging module instead of print.
2014-06-14 12:00:49 +09:00
Yusuke Shinyama
8e8e22c095
Fixed a layout bug introduced at c97ec304.
2014-06-13 23:05:04 +09:00
numion
a4997d6f10
Implement revision 4 and 5 encryption handler.
2014-05-19 16:27:43 +02:00
Michael R. Hines
ae2547b0f2
Stop throwing exception on LITERALS_DCT_DECODE
I have PDF documents with images stream and two filters, don't throw exceptions on the second one (DCT).
2014-05-14 13:25:30 +08:00
Yusuke Shinyama
6b6fc264ff
Code refactoring: CMap and UnicodeMap both inherit CMapBase.
2014-04-16 18:57:16 +09:00
Yusuke Shinyama
b09c37902f
Fixed: issue #48 (thanks to speedplane)
2014-04-09 17:55:50 +09:00
Yusuke Shinyama
7b354c7ab3
Version 20140328
2014-03-28 22:49:18 +09:00
Yusuke Shinyama
340387bfc6
Cleanup: isinstance
2014-03-28 17:50:59 +09:00
Yusuke Shinyama
7849c8724a
Fixed: PDFXRefStream.get_objids returns invalid objids.
2014-03-28 17:29:26 +09:00
Yusuke Shinyama
57adad55d7
Revert the wrong fix.
2014-03-28 17:24:03 +09:00
Yusuke Shinyama
b18e8c549d
Version 20140327
2014-03-28 00:19:52 +09:00
Yusuke Shinyama
ee47a6603a
Fixed: issues #45
2014-03-28 00:18:17 +09:00
Yusuke Shinyama
ab03037444
Version 20140324
2014-03-24 21:03:46 +09:00
Yusuke Shinyama
4b2beba398
Code cleanup.
2014-03-24 20:59:24 +09:00
Yusuke Shinyama
f9079e4c0a
Fixed dumppdf.py issues.
2014-03-24 20:55:00 +09:00
Yusuke Shinyama
607be269ab
Applied a patch by Axel Kaiser.
2014-03-24 20:45:35 +09:00
Yusuke Shinyama
d7c4ff28e9
Applied a patch by Axel Kaiser.
2014-03-24 20:39:30 +09:00
Yusuke Shinyama
636d4caeb3
Fixed the PNG predictor bug. Thanks to Gabor Molnar.
2014-03-24 19:57:05 +09:00
Yusuke Shinyama
c97ec3048e
Changed / to // for clarity.
2013-11-26 21:35:16 +09:00
Yusuke Shinyama
b589da51b7
Fix for malformed PDFs.
2013-11-26 21:27:45 +09:00
Yusuke Shinyama
cf1e3c9973
Version bump!
2013-11-13 14:52:01 +09:00
Yusuke Shinyama
acad011e3f
Code cleanup.
2013-11-11 20:46:30 +09:00
Yusuke Shinyama
cbef967fbf
Renamed: LTAnon -> LTAnno
2013-11-11 19:17:45 +09:00
Yusuke Shinyama
c8b6d4112a
Fixed: crash with negative layout bbox.
2013-11-09 15:10:14 +09:00
Yusuke Shinyama
2b56b2eedf
Merged.
2013-11-07 19:50:41 +09:00
Matthew Duggan
2caa5edc25
PEP8: Whitespace changes to match pep8
2013-11-07 17:35:04 +09:00
Matthew Duggan
c1da8b835c
PEP8: Remove trailing whitespace
2013-11-07 16:14:53 +09:00
Matthew Duggan
024b821056
Make pyflakes happy by defining variable
2013-11-07 16:10:14 +09:00
Matthew Duggan
10a68c83bd
Remove unused imports identified by pyflakes
2013-11-07 16:09:44 +09:00
Yusuke Shinyama
4ef81ae9d8
Improved word spacing.
2013-11-05 18:25:19 +09:00
Yusuke Shinyama
02ad086f6a
fixed: HTMLConverter.
2013-10-25 18:10:40 +09:00
Yusuke Shinyama
87842233b3
Version bump!
2013-10-22 22:19:38 +09:00
Yusuke Shinyama
d3730a29ec
API change: process_pdf -> PDFPage.get_pages
2013-10-22 18:59:16 +09:00
Yusuke Shinyama
e927bd307e
fixed: https://github.com/euske/pdfminer/issues/8
2013-10-22 18:24:39 +09:00
Yusuke Shinyama
2aa757978b
Reverted to Python2.x syntax. Fixed LZW decoding.
2013-10-19 08:19:40 +09:00
Yusuke Shinyama
bfd9e93c12
Merge branch 'master' of https://github.com/JordanReiter/pdfminer into JordanReiter-master
2013-10-19 07:46:45 +09:00
Yusuke Shinyama
8e4c0c88e3
fixed: https://github.com/euske/pdfminer/issues/26
2013-10-17 23:20:08 +09:00
Yusuke Shinyama
0ea08890d4
renamed: python2 -> python.
2013-10-17 23:05:27 +09:00
Yusuke Shinyama
8d42eec94d
in_cmap is on by default.
2013-10-17 21:40:43 +09:00
Yusuke Shinyama
de9f9715e3
Added: Adobe-UCS
2013-10-17 21:35:25 +09:00
Yusuke Shinyama
1455f134c6
Fixed: missing ObjStm due to invalid seek.
2013-10-10 20:10:57 +09:00
Yusuke Shinyama
f85c374cae
Separated PDFPage to pdfpage.py.
2013-10-10 19:54:55 +09:00
Yusuke Shinyama
2df67d85ae
Expand ObjStm in XRefFallback.
2013-10-10 19:40:43 +09:00
Yusuke Shinyama
e4bc4e43b1
Code cleanup.
2013-10-10 19:17:58 +09:00
Yusuke Shinyama
cfd60eafbf
Removed PDFDocument.read_xref().
2013-10-10 18:57:08 +09:00
Yusuke Shinyama
658be970b8
Separated PDFXRefFallback.
2013-10-10 18:44:12 +09:00
Yusuke Shinyama
c926874d20
API Change: the PDFDocument constructor now takes a PDFParser. set_parser() is removed.
2013-10-10 18:40:06 +09:00
Yusuke Shinyama
557c2c72e6
Removed ObjIdRange for terseness.
2013-10-10 18:34:43 +09:00
Yusuke Shinyama
2221163b94
Split pdfparser.py and pdfdocument.py.
2013-10-10 18:29:30 +09:00
Yusuke Shinyama
1467fc674c
Added fallback for broken PDFs.
2013-10-09 22:45:54 +09:00
Yusuke Shinyama
eabe72ee63
Prevent crash with empty layout box.
2013-10-09 22:13:22 +09:00
Yusuke Shinyama
87143cb36f
Fallback when /Pages does not exist.
2013-10-09 22:08:16 +09:00