Commit Graph

427 Commits (548b933a84fef9fe2c21c78702fae0cc8a713923)

Author SHA1 Message Date
Friedrich Lindenberg 447adcf02f fix STRICT reference 2016-09-24 12:03:22 +02:00
Friedrich Lindenberg 70918095cc Return an empty list when no `Differences` are found. 2016-09-24 11:57:11 +02:00
Friedrich Lindenberg 865246bd0c fix print, upstream: 0112112458 2016-09-23 15:04:07 +02:00
Friedrich Lindenberg 0cb13983f7 Backport LICENSE. 2016-09-23 14:57:28 +02:00
Friedrich Lindenberg 1820f96481 backport changes for upstream: #145, #95, #111, #117, #129, #132. 2016-09-23 14:31:31 +02:00
Jakub Wilk 5ddbecb551 Fix typos 2016-09-13 16:25:09 +02:00
Yusuke Shinyama 3068dcdb4a Merge pull request #145 from vinayak-mehta/glyphlist_link
Replace old Adobe glyphlist link
2016-09-12 00:18:24 +09:00
Yusuke Shinyama c753dbac4c Merge pull request #117 from native-api/png_pred_errors
make ValueError's descriptive
2016-09-11 23:55:34 +09:00
Yusuke Shinyama f1dd9ea6d2 Merge pull request #129 from lucanaso/lucanaso-patch-1
Fixed for rendering non breaking spaces (cid:160)
2016-09-11 23:53:03 +09:00
Yusuke Shinyama 177a4ab937 Fixed: #132 (PDFStream.get_filters: support multiple parameterless filters) 2016-09-11 23:52:13 +09:00
Yusuke Shinyama e95a483790 Merge pull request #134 from speedplane/feature/Fix-Get-Filters
Fix Bug with PDF Stream Decoder
2016-09-11 23:48:42 +09:00
Yusuke Shinyama 64fe538b24 Fixed: #114 (UnicodeEncodeError in PSLiteral) 2016-09-11 23:43:22 +09:00
Vinayak Mehta 2926002017 Replace old Adobe glyphlist link 2016-09-08 16:34:53 +05:30
Philippe Guglielmetti 881ea17553 v 20160614 2016-06-14 19:02:07 +02:00
speedplane 2049462f6f Revert changes unrelated to this branch. 2016-06-13 23:42:21 -04:00
speedplane b0b8818a41 Fix a bug with pdfminer which occurs when two or more filters are applied to a stream, even though no parameters are specified. The code would previously drop all of the streams after the first due to misapplication of the zip function. 2016-06-13 23:35:11 -04:00
Friedrich Lindenberg 1d54ecd31c Make the logger run in a namespace. 2016-05-20 21:12:05 +02:00
Philippe Guglielmetti 21fd2bbd23 v 20160202 with Py 2.6 & Py 3.5 support 2016-02-02 15:38:51 +01:00
Goulu 5a23fad6fd Merge pull request #14 from orangain/close-device
Close device to write footer of xml/html files
2016-01-18 11:22:35 +01:00
Goulu 2103e5875e Merge pull request #13 from orangain/include-cmap
Include compiled cmap resources to simplify installation for CJK languages
2016-01-18 11:22:08 +01:00
Steve Hair 92c71436b9 Improved settings management 2016-01-10 12:17:38 -05:00
orangain f8a051adbd Close device to write footer of xml/html files 2015-12-27 20:57:00 +09:00
orangain f1d5d681b6 Include compiled cmap resources to simplify installation for CJK languages
* Run `make cmap` and `git add pdfminer/cmap`.
* Modify MANIFEST.in not to include cmaprsrc dir in the sdist package.
* Add pdfminer/cmap/README.txt to include license in the sdist package.
* Remove installation guide specific to CJK languages from README.md.
2015-12-27 13:32:29 +09:00
lucanaso 63bb3caec2 Fixed for rendering non breaking spaces (cid:160)
As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf):
"The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE."
The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)". 

This fix adds the duplicate key in latin_enc.py
glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.
2015-12-09 16:47:32 +01:00
Chris Hager 8149be1669 bugfixes 2015-12-06 00:17:58 +01:00
Chris Hager 2e1be5721f removed settings.ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:34:18 +01:00
Chris Hager b686dd0139 pdfminer/settings.py for STRICT and added ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:28:08 +01:00
Ivan Pozdeev 63c9378b8b make ValueError's descriptive 2015-08-10 03:14:51 +03:00
Alex Zagorodniuk 131cb1ea92 change STRICT to be a settings attribute 2015-06-22 19:08:35 -04:00
Goulu 623bd98452 Update __init__.py
version 20150601
2015-06-01 10:21:51 +02:00
Cathal Garvey 403711ed13 Whoops, forgot to version-gate chardet in the actual code. Thanks Travis! 2015-05-30 19:33:35 +01:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 79c97ac221 Docstrings. 2015-05-30 17:16:06 +01:00
Cathal Garvey 3b7edba48c Forgot to add the actual compartmentalised function.. 2015-05-30 17:04:28 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizonal line and merges them together into a single horizontal line if necessary.  This leads to much better text extraction  if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if separated by more than word_margin apart.  However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizonal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitspace. Prevents double spaces being caught as vert lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
cybjit 39942b6642 avoid string formating when not logging 2014-09-12 00:29:31 +02:00
cybjit 01821c7d1e rename bytes to avoid built-in collision 2014-09-12 00:29:31 +02:00
cybjit 31e6afc7cf faster and simpler bytes implementation 2014-09-12 00:29:30 +02:00
cybjit cba5a42ba8 decipher_all bytes 2014-09-12 00:29:30 +02:00
cybjit 6357e2da80 code2cid uses int, not byte 2014-09-12 00:29:27 +02:00
cybjit 9b0a3ee53e decode cmap font name 2014-09-11 23:30:02 +02:00
cybjit a6f31a713d cmap bytes and decode 2014-09-07 18:41:04 +02:00
cybjit cc733c8217 fixes for ARC4 2014-09-07 18:38:22 +02:00
cybjit f9a67db89b change xrange to range 2014-09-07 18:36:12 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 58b8492783 no logging in travis.ci 2014-09-04 10:19:50 +02:00
unknown 1c93468c7e faster, less verbose tests 2014-09-04 10:02:29 +02:00
unknown 4ab48d1803 Python 3.4 compatibility + tests 2014-09-04 09:36:19 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
unknown a6475b61b4 Python 3.4 support added and tested 2014-09-03 13:17:41 +02:00
unknown 846cd18186 Python 3.4 support 2014-09-02 15:49:46 +02:00
unknown faea7291a8 tests pass under Py 2.7 and 3.4 2014-09-01 14:16:49 +02:00
Yusuke Shinyama b0e035c24f Style fix: always have an explicit return. 2014-07-15 21:38:29 +09:00
Yusuke Shinyama f5b5e31921 Fixed: DecodeParms array support. 2014-07-09 19:07:27 +09:00
Yusuke Shinyama 137fc3a1ae Use KWD instead of token.name. 2014-06-30 19:15:21 +09:00
Yusuke Shinyama 1ccfaff411 String-Bytes distinction (first attempt). 2014-06-30 19:05:56 +09:00
Yusuke Shinyama 8791355e1d Cleanup imports. Use relative imports. 2014-06-26 18:12:39 +09:00
Yusuke Shinyama 2e900e5d10 Fixed for consistent test results. (hopefully...) 2014-06-26 17:41:31 +09:00
Yusuke Shinyama fe86b4e64e Changed: StringIO -> io.BytesIO 2014-06-25 19:55:41 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama 81391c09f4 Fixed: #56 (with a derpy fix) 2014-06-18 19:11:45 +09:00
Yusuke Shinyama bb866ae148 Changed: new except syntax (2.6 or above). 2014-06-16 18:50:07 +09:00
Yusuke Shinyama 28e96ba3d0 Use print as a function. 2014-06-15 12:14:33 +09:00
Yusuke Shinyama 0387a6c260 Removed: tuple-unpacking args. 2014-06-15 12:12:13 +09:00
Yusuke Shinyama a8ec99a848 More autotest tweaks. 2014-06-15 10:52:59 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama d9680fca7e Plane: preserve the object order so that the test result is always consistent. 2014-06-14 14:44:53 +09:00
Yusuke Shinyama aed248610c Fixed: dependency on pygame in a unittest. 2014-06-14 12:05:26 +09:00
Yusuke Shinyama 8e14ebf4e1 Use logging module instead of print. 2014-06-14 12:00:49 +09:00
Yusuke Shinyama 8e8e22c095 Fixed a layout bug introduced at c97ec304. 2014-06-13 23:05:04 +09:00
numion a4997d6f10 Implement revision 4 and 5 encryption handler. 2014-05-19 16:27:43 +02:00
Michael R. Hines ae2547b0f2 Stop throwing exception on LITERALS_DCT_DECODE
I have PDF documents with images stream and two filters, don't throw exceptions on the second one (DCT).
2014-05-14 13:25:30 +08:00
Yusuke Shinyama 6b6fc264ff Code refactoring: CMap and UnicodeMap both inherit CMapBase. 2014-04-16 18:57:16 +09:00
Yusuke Shinyama b09c37902f Fixed: issue #48 (thanks to speedplane) 2014-04-09 17:55:50 +09:00
Yusuke Shinyama 7b354c7ab3 Version 20140328 2014-03-28 22:49:18 +09:00
Yusuke Shinyama 340387bfc6 Cleanup: isinstance 2014-03-28 17:50:59 +09:00
Yusuke Shinyama 7849c8724a Fixed: PDFXRefStream.get_objids returns invalid objids. 2014-03-28 17:29:26 +09:00
Yusuke Shinyama 57adad55d7 Revert the wrong fix. 2014-03-28 17:24:03 +09:00
Yusuke Shinyama b18e8c549d Version 20140327 2014-03-28 00:19:52 +09:00
Yusuke Shinyama ee47a6603a Fixed: issues #45 2014-03-28 00:18:17 +09:00
Yusuke Shinyama ab03037444 Version 20140324 2014-03-24 21:03:46 +09:00
Yusuke Shinyama 4b2beba398 Code cleanup. 2014-03-24 20:59:24 +09:00
Yusuke Shinyama f9079e4c0a Fixed dumppdf.py issues. 2014-03-24 20:55:00 +09:00
Yusuke Shinyama 607be269ab Applied a patch by Axel Kaiser. 2014-03-24 20:45:35 +09:00
Yusuke Shinyama d7c4ff28e9 Applied a patch by Axel Kaiser. 2014-03-24 20:39:30 +09:00
Yusuke Shinyama 636d4caeb3 Fixed the PNG predictor bug. Thanks to Gabor Molnar. 2014-03-24 19:57:05 +09:00
Yusuke Shinyama c97ec3048e Changed / to // for clarity. 2013-11-26 21:35:16 +09:00
Yusuke Shinyama b589da51b7 Fix for malformed PDFs. 2013-11-26 21:27:45 +09:00
Yusuke Shinyama cf1e3c9973 Version bump! 2013-11-13 14:52:01 +09:00
Yusuke Shinyama acad011e3f Code cleanup. 2013-11-11 20:46:30 +09:00
Yusuke Shinyama cbef967fbf Renamed: LTAnon -> LTAnno 2013-11-11 19:17:45 +09:00
Yusuke Shinyama c8b6d4112a Fixed: crash with negative layout bbox. 2013-11-09 15:10:14 +09:00
Yusuke Shinyama 2b56b2eedf Merged. 2013-11-07 19:50:41 +09:00
Matthew Duggan 2caa5edc25 PEP8: Whitespace changes to match pep8 2013-11-07 17:35:04 +09:00
Matthew Duggan c1da8b835c PEP8: Remove trailing whitespace 2013-11-07 16:14:53 +09:00
Matthew Duggan 024b821056 Make pyflakes happy by defining variable 2013-11-07 16:10:14 +09:00
Matthew Duggan 10a68c83bd Remove unused imports identified by pyflakes 2013-11-07 16:09:44 +09:00
Yusuke Shinyama 4ef81ae9d8 Improved word spacing. 2013-11-05 18:25:19 +09:00
Yusuke Shinyama 02ad086f6a fixed: HTMLConverter. 2013-10-25 18:10:40 +09:00
Yusuke Shinyama 87842233b3 Version bump! 2013-10-22 22:19:38 +09:00
Yusuke Shinyama d3730a29ec API change: process_pdf -> PDFPage.get_pages 2013-10-22 18:59:16 +09:00
Yusuke Shinyama e927bd307e fixed: https://github.com/euske/pdfminer/issues/8 2013-10-22 18:24:39 +09:00
Yusuke Shinyama 2aa757978b Reverted to Python2.x syntax. Fixed LZW decoding. 2013-10-19 08:19:40 +09:00
Yusuke Shinyama bfd9e93c12 Merge branch 'master' of https://github.com/JordanReiter/pdfminer into JordanReiter-master 2013-10-19 07:46:45 +09:00
Yusuke Shinyama 8e4c0c88e3 fixed: https://github.com/euske/pdfminer/issues/26 2013-10-17 23:20:08 +09:00
Yusuke Shinyama 0ea08890d4 renamed: python2 -> python. 2013-10-17 23:05:27 +09:00
Yusuke Shinyama 8d42eec94d in_cmap is on by default. 2013-10-17 21:40:43 +09:00
Yusuke Shinyama de9f9715e3 Added: Adobe-UCS 2013-10-17 21:35:25 +09:00
Yusuke Shinyama 1455f134c6 Fixed: missing ObjStm due to invalid seek. 2013-10-10 20:10:57 +09:00
Yusuke Shinyama f85c374cae Separated PDFPage to pdfpage.py. 2013-10-10 19:54:55 +09:00
Yusuke Shinyama 2df67d85ae Expand ObjStm in XRefFallback. 2013-10-10 19:40:43 +09:00
Yusuke Shinyama e4bc4e43b1 Code cleanup. 2013-10-10 19:17:58 +09:00
Yusuke Shinyama cfd60eafbf Removed PDFDocument.read_xref(). 2013-10-10 18:57:08 +09:00
Yusuke Shinyama 658be970b8 Separated PDFXRefFallback. 2013-10-10 18:44:12 +09:00
Yusuke Shinyama c926874d20 API Change: the PDFDocument cstr now takes PDFParser. set_parser() is removed. 2013-10-10 18:40:06 +09:00
Yusuke Shinyama 557c2c72e6 Removed ObjIdRange for terseness. 2013-10-10 18:34:43 +09:00
Yusuke Shinyama 2221163b94 Split pdfparser.py and pdfdocument.py. 2013-10-10 18:29:30 +09:00
Yusuke Shinyama 1467fc674c Added fallback for broken PDFs. 2013-10-09 22:45:54 +09:00
Yusuke Shinyama eabe72ee63 Prevent crash with empty layout box. 2013-10-09 22:13:22 +09:00
Yusuke Shinyama 87143cb36f Fallback when /Pages does not exist. 2013-10-09 22:08:16 +09:00
Yusuke Shinyama 06425bba00 Introducing PDFObjectNotFound 2013-10-09 21:39:23 +09:00
Yusuke Shinyama 3c3cba2ecc Moved: import PIL. 2013-04-09 18:42:32 +09:00
Yusuke Shinyama 19e7d70ac1 Merge pull request #15 from jcushman/patch-1
2x faster layout analysis: Use set instead of list for Plane's internal collection of objects.
2013-04-09 02:39:46 -07:00
Yusuke Shinyama 4faccff9c9 Merge pull request #16 from jcushman/master
2x faster group_textboxes function.
2013-04-09 01:58:56 -07:00
Yusuke Shinyama d8bc13b3af Merge pull request #13 from gendoc/master
PDFDocument.lookup_name.lookup isn't searching for 'Names' key.
2013-04-09 01:55:54 -07:00
Jordan Reiter e28b75a462 StringIO 2013-03-27 13:14:58 -04:00
Jordan Reiter 44653071c3 Fixes for LZW error (see https://bitbucket.org/hsoft/pdfminer3k/commits/ae9a4ca0691a/) 2013-03-27 13:05:29 -04:00
jcushman f77f196cd3 2x faster group_textboxes function. 2012-06-22 18:11:45 -03:00
jcushman da3f023b2d Use set instead of list for Plane's internal collection of objects. 2012-06-22 16:36:33 -03:00
Humberto Pereira 89c81db295 PDFDocument.lookup_names.lookup didn't find 'Names' in some files 2012-03-19 16:42:58 -03:00
Jim Morrison 6413eb7de4 Deal with CMYK images by converting them to RGB. PIL does not invert CMYK images as of PIL 1.1.7, so the invert happens in ImageWriter. 2012-01-24 16:18:36 -08:00
Yusuke Shinyama c7709045e9 fixed: invalid bmp file output 2011-11-08 00:29:24 +10:00
Yusuke Shinyama 82ff98c7b3 imagewriter now works with text output 2011-11-07 01:15:10 +10:00
Yusuke Shinyama 91174b5665 avoid crash when colorspace is null. 2011-11-06 20:10:48 +10:00
Yusuke Shinyama 3d1652963a Merge github.com:euske/pdfminer 2011-10-30 15:44:49 +10:00