515 Commits (1bf3c42b59125f4491d863e1c11dca7ebbe96adc)

Author SHA1 Message Date
Vinayak Mehta 2926002017 Replace old Adobe glyphlist link 2016-09-08 16:34:53 +05:30
Philippe Guglielmetti 881ea17553 v 20160614 2016-06-14 19:02:07 +02:00
speedplane 2049462f6f Revert changes unrelated to this branch. 2016-06-13 23:42:21 -04:00
speedplane b0b8818a41 Fix a bug with pdfminer which occurs when two or more filters are applied to a stream, even though no parameters are specified. The code would previously drop all of the streams after the first due to misapplication of the zip function. 2016-06-13 23:35:11 -04:00
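Note on b0b8818a41: the message does not show the code, so the following is only a minimal sketch of how a bare `zip` can lose entries when a stream names several filters but supplies fewer parameter dicts. The helper name `pair_filters` and the filter names are assumptions made for illustration, not pdfminer's actual code.

```python
def pair_filters(filters, params=None):
    """Pair every filter name with a parameter dict (hypothetical helper)."""
    params = list(params or [])
    # Pad with empty dicts so that zip() cannot truncate the filter list.
    params += [{} for _ in range(len(filters) - len(params))]
    return list(zip(filters, params))


filters = ["ASCII85Decode", "FlateDecode"]

# zip() stops at the shorter input, so the second filter is silently lost.
naive = list(zip(filters, [{}]))

# Padding the parameters first keeps both filters.
fixed = pair_filters(filters)
```

Padding the shorter list is one possible remedy; `itertools.zip_longest` with an explicit fill value would be another.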
Friedrich Lindenberg 1d54ecd31c Make the logger run in a namespace. 2016-05-20 21:12:05 +02:00
Philippe Guglielmetti 21fd2bbd23 v 20160202 with Py 2.6 & Py 3.5 support 2016-02-02 15:38:51 +01:00
Goulu 5a23fad6fd Merge pull request #14 from orangain/close-device
Close device to write footer of xml/html files
2016-01-18 11:22:35 +01:00
Goulu 2103e5875e Merge pull request #13 from orangain/include-cmap
Include compiled cmap resources to simplify installation for CJK languages
2016-01-18 11:22:08 +01:00
Steve Hair 92c71436b9 Improved settings management 2016-01-10 12:17:38 -05:00
orangain f8a051adbd Close device to write footer of xml/html files 2015-12-27 20:57:00 +09:00
orangain f1d5d681b6 Include compiled cmap resources to simplify installation for CJK languages
* Run `make cmap` and `git add pdfminer/cmap`.
* Modify MANIFEST.in not to include cmaprsrc dir in the sdist package.
* Add pdfminer/cmap/README.txt to include license in the sdist package.
* Remove installation guide specific to CJK languages from README.md.
2015-12-27 13:32:29 +09:00
lucanaso 63bb3caec2 Fixed for rendering non breaking spaces (cid:160)
As stated in the PDF specification ISO 32000-1, table in Annex D.2 Latin Character Set and Encodings page 653 to 656 (available here: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf):
"The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE."
The duplicate key was missing, therefore PDFMiner was returning the string "(cid:160)". 

This fix adds the duplicate key in latin_enc.py
glyphlist.py does not need to be modified as it already contains a key for non breaking space https://github.com/lucanaso/pdfminer/blob/master/pdfminer/glyphlist.py#L2755.
2015-12-09 16:47:32 +01:00
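Note on 63bb3caec2: reading the quoted codes as octal, 240 is 160 in decimal, which matches the "(cid:160)" string in the message. The sketch below illustrates the idea of a duplicate key under that assumption. The table layout, the names `WIN_ANSI`, `MAC_ROMAN`, `GLYPH_TO_UNICODE` and the function `decode` are hypothetical and do not reproduce latin_enc.py.

```python
# Illustrative tables only; the real latin_enc.py is laid out differently.
WIN_ANSI = {0o40: "space", 0o240: "space"}   # 0o240 == 160, the duplicate key
MAC_ROMAN = {0o40: "space", 0o312: "space"}  # 0o312 == 202, the duplicate key
GLYPH_TO_UNICODE = {"space": " "}


def decode(code, encoding):
    """Return the character for a code, or a (cid:N) placeholder if unmapped."""
    name = encoding.get(code)
    if name is None:
        return "(cid:%d)" % code
    return GLYPH_TO_UNICODE[name]


# Without the duplicate key the code falls through to the placeholder.
without_fix = decode(160, {0o40: "space"})
with_fix = decode(160, WIN_ANSI)
```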
Chris Hager 8149be1669 bugfixes 2015-12-06 00:17:58 +01:00
Chris Hager 2e1be5721f removed settings.ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:34:18 +01:00
Chris Hager b686dd0139 pdfminer/settings.py for STRICT and added ENFORCE_CHECK_EXTRACTABLE 2015-11-01 22:28:08 +01:00
Ivan Pozdeev 63c9378b8b make ValueErrors descriptive 2015-08-10 03:14:51 +03:00
Alex Zagorodniuk 131cb1ea92 change STRICT to be a settings attribute 2015-06-22 19:08:35 -04:00
Goulu 623bd98452 Update __init__.py
version 20150601
2015-06-01 10:21:51 +02:00
Cathal Garvey 403711ed13 Whoops, forgot to version-gate chardet in the actual code. Thanks Travis! 2015-05-30 19:33:35 +01:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 79c97ac221 Docstrings. 2015-05-30 17:16:06 +01:00
Cathal Garvey 3b7edba48c Forgot to add the actual compartmentalised function.. 2015-05-30 17:04:28 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. However now, we only do it if there is not already an existing space. This prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vert lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
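Note on 45170e7183, point 1: a minimal sketch of the "insert a space only if none exists yet" rule. The tuple layout, the function name `join_chars`, the default margin and the exact gap test are assumptions for illustration. The real layout code works on layout objects and may compare distances differently.

```python
def join_chars(chars, word_margin=0.1):
    """Join (text, x0, x1) tuples, sorted left to right, into one string."""
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None:
            gap = x0 - prev_x1
            width = x1 - x0
            # Insert a space for a wide gap, but only when neither the
            # previous output nor the current text is already a space.
            if gap > word_margin * width and out[-1] != " " and text != " ":
                out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)
```

With this rule a wide gap after an explicit space character yields a single space instead of two.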
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
cybjit 39942b6642 avoid string formatting when not logging 2014-09-12 00:29:31 +02:00
cybjit 01821c7d1e rename bytes to avoid built-in collision 2014-09-12 00:29:31 +02:00
cybjit 31e6afc7cf faster and simpler bytes implementation 2014-09-12 00:29:30 +02:00
cybjit cba5a42ba8 decipher_all bytes 2014-09-12 00:29:30 +02:00
cybjit 6357e2da80 code2cid uses int, not byte 2014-09-12 00:29:27 +02:00
cybjit 9b0a3ee53e decode cmap font name 2014-09-11 23:30:02 +02:00
cybjit a6f31a713d cmap bytes and decode 2014-09-07 18:41:04 +02:00
cybjit cc733c8217 fixes for ARC4 2014-09-07 18:38:22 +02:00
cybjit f9a67db89b change xrange to range 2014-09-07 18:36:12 +02:00
cybjit 0a2d90c051 pdf2txt: do not double encode stdout 2014-09-07 18:34:11 +02:00
unknown 58b8492783 no logging in travis.ci 2014-09-04 10:19:50 +02:00
unknown 1c93468c7e faster, less verbose tests 2014-09-04 10:02:29 +02:00
unknown 4ab48d1803 Python 3.4 compatibility + tests 2014-09-04 09:36:19 +02:00
unknown 29c07ea770 Python 3.4 support and tests 2014-09-03 15:26:08 +02:00
unknown a6475b61b4 Python 3.4 support added and tested 2014-09-03 13:17:41 +02:00
unknown 846cd18186 Python 3.4 support 2014-09-02 15:49:46 +02:00
unknown faea7291a8 tests pass under Py 2.7 and 3.4 2014-09-01 14:16:49 +02:00
Yusuke Shinyama b0e035c24f Style fix: always have an explicit return. 2014-07-15 21:38:29 +09:00
Yusuke Shinyama f5b5e31921 Fixed: DecodeParms array support. 2014-07-09 19:07:27 +09:00
Yusuke Shinyama 137fc3a1ae Use KWD instead of token.name. 2014-06-30 19:15:21 +09:00
Yusuke Shinyama 1ccfaff411 String-Bytes distinction (first attempt). 2014-06-30 19:05:56 +09:00
Yusuke Shinyama 8791355e1d Cleanup imports. Use relative imports. 2014-06-26 18:12:39 +09:00
Yusuke Shinyama 2e900e5d10 Fixed for consistent test results. (hopefully...) 2014-06-26 17:41:31 +09:00
Yusuke Shinyama fe86b4e64e Changed: StringIO -> io.BytesIO 2014-06-25 19:55:41 +09:00
Yusuke Shinyama 44074b42ea Added: stripcontrol for XMLConverter (-S option) 2014-06-22 00:33:00 +09:00
Yusuke Shinyama 81391c09f4 Fixed: #56 (with a derpy fix) 2014-06-18 19:11:45 +09:00
Yusuke Shinyama bb866ae148 Changed: new except syntax (2.6 or above). 2014-06-16 18:50:07 +09:00
Yusuke Shinyama 28e96ba3d0 Use print as a function. 2014-06-15 12:14:33 +09:00
Yusuke Shinyama 0387a6c260 Removed: tuple-unpacking args. 2014-06-15 12:12:13 +09:00
Yusuke Shinyama a8ec99a848 More autotest tweaks. 2014-06-15 10:52:59 +09:00
Yusuke Shinyama 1384a3fe8d Code cleanup: removed some debug flags. 2014-06-14 15:43:10 +09:00
Yusuke Shinyama d9680fca7e Plane: preserve the object order so that the test result is always consistent. 2014-06-14 14:44:53 +09:00
Yusuke Shinyama aed248610c Fixed: dependency on pygame in a unittest. 2014-06-14 12:05:26 +09:00
Yusuke Shinyama 8e14ebf4e1 Use logging module instead of print. 2014-06-14 12:00:49 +09:00
Yusuke Shinyama 8e8e22c095 Fixed a layout bug introduced at c97ec304. 2014-06-13 23:05:04 +09:00
numion a4997d6f10 Implement revision 4 and 5 encryption handler. 2014-05-19 16:27:43 +02:00
Michael R. Hines ae2547b0f2 Stop throwing exception on LITERALS_DCT_DECODE
I have PDF documents with images stream and two filters, don't throw exceptions on the second one (DCT).
2014-05-14 13:25:30 +08:00
Yusuke Shinyama 6b6fc264ff Code refactoring: CMap and UnicodeMap both inherit CMapBase. 2014-04-16 18:57:16 +09:00
Yusuke Shinyama b09c37902f Fixed: issue #48 (thanks to speedplane) 2014-04-09 17:55:50 +09:00
Yusuke Shinyama 7b354c7ab3 Version 20140328 2014-03-28 22:49:18 +09:00
Yusuke Shinyama 340387bfc6 Cleanup: isinstance 2014-03-28 17:50:59 +09:00
Yusuke Shinyama 7849c8724a Fixed: PDFXRefStream.get_objids returns invalid objids. 2014-03-28 17:29:26 +09:00
Yusuke Shinyama 57adad55d7 Revert the wrong fix. 2014-03-28 17:24:03 +09:00
Yusuke Shinyama b18e8c549d Version 20140327 2014-03-28 00:19:52 +09:00
Yusuke Shinyama ee47a6603a Fixed: issues #45 2014-03-28 00:18:17 +09:00
Yusuke Shinyama ab03037444 Version 20140324 2014-03-24 21:03:46 +09:00
Yusuke Shinyama 4b2beba398 Code cleanup. 2014-03-24 20:59:24 +09:00
Yusuke Shinyama f9079e4c0a Fixed dumppdf.py issues. 2014-03-24 20:55:00 +09:00
Yusuke Shinyama 607be269ab Applied a patch by Axel Kaiser. 2014-03-24 20:45:35 +09:00
Yusuke Shinyama d7c4ff28e9 Applied a patch by Axel Kaiser. 2014-03-24 20:39:30 +09:00
Yusuke Shinyama 636d4caeb3 Fixed the PNG predictor bug. Thanks to Gabor Molnar. 2014-03-24 19:57:05 +09:00
Yusuke Shinyama c97ec3048e Changed / to // for clarity. 2013-11-26 21:35:16 +09:00
Yusuke Shinyama b589da51b7 Fix for malformed PDFs. 2013-11-26 21:27:45 +09:00
Yusuke Shinyama cf1e3c9973 Version bump! 2013-11-13 14:52:01 +09:00
Yusuke Shinyama acad011e3f Code cleanup. 2013-11-11 20:46:30 +09:00
Yusuke Shinyama cbef967fbf Renamed: LTAnon -> LTAnno 2013-11-11 19:17:45 +09:00
Yusuke Shinyama c8b6d4112a Fixed: crash with negative layout bbox. 2013-11-09 15:10:14 +09:00
Yusuke Shinyama 2b56b2eedf Merged. 2013-11-07 19:50:41 +09:00
Matthew Duggan 2caa5edc25 PEP8: Whitespace changes to match pep8 2013-11-07 17:35:04 +09:00
Matthew Duggan c1da8b835c PEP8: Remove trailing whitespace 2013-11-07 16:14:53 +09:00
Matthew Duggan 024b821056 Make pyflakes happy by defining variable 2013-11-07 16:10:14 +09:00
Matthew Duggan 10a68c83bd Remove unused imports identified by pyflakes 2013-11-07 16:09:44 +09:00
Yusuke Shinyama 4ef81ae9d8 Improved word spacing. 2013-11-05 18:25:19 +09:00
Yusuke Shinyama 02ad086f6a fixed: HTMLConverter. 2013-10-25 18:10:40 +09:00
Yusuke Shinyama 87842233b3 Version bump! 2013-10-22 22:19:38 +09:00
Yusuke Shinyama d3730a29ec API change: process_pdf -> PDFPage.get_pages 2013-10-22 18:59:16 +09:00
Yusuke Shinyama e927bd307e fixed: https://github.com/euske/pdfminer/issues/8 2013-10-22 18:24:39 +09:00
Yusuke Shinyama 2aa757978b Reverted to Python2.x syntax. Fixed LZW decoding. 2013-10-19 08:19:40 +09:00
Yusuke Shinyama bfd9e93c12 Merge branch 'master' of https://github.com/JordanReiter/pdfminer into JordanReiter-master 2013-10-19 07:46:45 +09:00
Yusuke Shinyama 8e4c0c88e3 fixed: https://github.com/euske/pdfminer/issues/26 2013-10-17 23:20:08 +09:00
Yusuke Shinyama 0ea08890d4 renamed: python2 -> python. 2013-10-17 23:05:27 +09:00
Yusuke Shinyama 8d42eec94d in_cmap is on by default. 2013-10-17 21:40:43 +09:00
Yusuke Shinyama de9f9715e3 Added: Adobe-UCS 2013-10-17 21:35:25 +09:00
Yusuke Shinyama 1455f134c6 Fixed: missing ObjStm due to invalid seek. 2013-10-10 20:10:57 +09:00
Yusuke Shinyama f85c374cae Separated PDFPage to pdfpage.py. 2013-10-10 19:54:55 +09:00
Yusuke Shinyama 2df67d85ae Expand ObjStm in XRefFallback. 2013-10-10 19:40:43 +09:00
Yusuke Shinyama e4bc4e43b1 Code cleanup. 2013-10-10 19:17:58 +09:00
Yusuke Shinyama cfd60eafbf Removed PDFDocument.read_xref(). 2013-10-10 18:57:08 +09:00
Yusuke Shinyama 658be970b8 Separated PDFXRefFallback. 2013-10-10 18:44:12 +09:00
Yusuke Shinyama c926874d20 API Change: the PDFDocument cstr now takes PDFParser. set_parser() is removed. 2013-10-10 18:40:06 +09:00
Yusuke Shinyama 557c2c72e6 Removed ObjIdRange for terseness. 2013-10-10 18:34:43 +09:00
Yusuke Shinyama 2221163b94 Split pdfparser.py and pdfdocument.py. 2013-10-10 18:29:30 +09:00
Yusuke Shinyama 1467fc674c Added fallback for broken PDFs. 2013-10-09 22:45:54 +09:00
Yusuke Shinyama eabe72ee63 Prevent crash with empty layout box. 2013-10-09 22:13:22 +09:00
Yusuke Shinyama 87143cb36f Fallback when /Pages does not exist. 2013-10-09 22:08:16 +09:00
Yusuke Shinyama 06425bba00 Introducing PDFObjectNotFound 2013-10-09 21:39:23 +09:00
Yusuke Shinyama 3c3cba2ecc Moved: import PIL. 2013-04-09 18:42:32 +09:00
Yusuke Shinyama 19e7d70ac1 Merge pull request #15 from jcushman/patch-1
2x faster layout analysis: Use set instead of list for Plane's internal collection of objects.
2013-04-09 02:39:46 -07:00
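Note on pull request #15: the entry only states that a set replaced a list inside Plane. A minimal sketch of why that can help is shown below. The class is a toy stand-in, and its method names are assumptions rather than the real Plane API.

```python
class Plane:
    """Toy container; only the set-based storage mirrors the commit message."""

    def __init__(self):
        self._objs = set()

    def add(self, obj):
        self._objs.add(obj)

    def remove(self, obj):
        # Average O(1) for a set; list.remove() scans the whole list.
        self._objs.remove(obj)

    def __contains__(self, obj):
        # Membership tests are also average O(1) on a set.
        return obj in self._objs

    def __len__(self):
        return len(self._objs)
```

A set also drops duplicates and does not keep insertion order, which may be related to the later entry d9680fca7e about preserving object order.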
Yusuke Shinyama 4faccff9c9 Merge pull request #16 from jcushman/master
2x faster group_textboxes function.
2013-04-09 01:58:56 -07:00
Yusuke Shinyama d8bc13b3af Merge pull request #13 from gendoc/master
PDFDocument.lookup_name.lookup isn't searching for 'Names' key.
2013-04-09 01:55:54 -07:00
Jordan Reiter e28b75a462 StringIO 2013-03-27 13:14:58 -04:00
Jordan Reiter 44653071c3 Fixes for LZW error (see https://bitbucket.org/hsoft/pdfminer3k/commits/ae9a4ca0691a/) 2013-03-27 13:05:29 -04:00
jcushman f77f196cd3 2x faster group_textboxes function. 2012-06-22 18:11:45 -03:00
jcushman da3f023b2d Use set instead of list for Plane's internal collection of objects. 2012-06-22 16:36:33 -03:00
Humberto Pereira 89c81db295 PDFDocument.lookup_names.lookup didn't find 'Names' in some files 2012-03-19 16:42:58 -03:00
Jim Morrison 6413eb7de4 Deal with CMYK images by converting them to RGB. PIL does not invert CMYK images as of PIL 1.1.7, so the invert happens in ImageWriter. 2012-01-24 16:18:36 -08:00
Yusuke Shinyama c7709045e9 fixed: invalid bmp file output 2011-11-08 00:29:24 +10:00
Yusuke Shinyama 82ff98c7b3 imagewriter now works with text output 2011-11-07 01:15:10 +10:00
Yusuke Shinyama 91174b5665 avoid crash when colorspace is null. 2011-11-06 20:10:48 +10:00
Yusuke Shinyama 3d1652963a Merge github.com:euske/pdfminer 2011-10-30 15:44:49 +10:00
dwilson 60dbf6bb69 avoids crash in pdf syntax error for missing ids
when an object id is out of range, rather than crashing, only raise a
pdf syntax error if STRICT is enabled and return None otherwise
2011-08-31 17:03:10 -04:00
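Note on 60dbf6bb69: the behaviour described in the message can be sketched as follows. The function name `get_object`, the `strict` argument and the exception class are stand-ins chosen for illustration. In pdfminer itself STRICT is a setting rather than a per-call argument.

```python
class PDFSyntaxError(Exception):
    """Stand-in for a PDF syntax error; defined here for illustration only."""


def get_object(objects, objid, strict=False):
    """Return objects[objid], tolerating out-of-range ids unless strict."""
    if not 0 <= objid < len(objects):
        if strict:
            raise PDFSyntaxError("object id out of range: %r" % objid)
        # Lenient mode: report the missing object as None instead of crashing.
        return None
    return objects[objid]
```

Callers in lenient mode then need to handle a `None` result explicitly.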
Yusuke Shinyama f638784e1d experimental layout analysis improvements 2011-08-14 09:44:21 +09:00
Yusuke Shinyama cbb8d869c7 removed initial cmap/ directory 2011-07-31 18:05:07 +10:00
Yusuke Shinyama cdef0d7883 Merge github.com:euske/pdfminer 2011-07-31 17:47:20 +10:00
Yusuke Shinyama 46bb0107aa fixed: crash due to small layout elements (thanks to hsoft) 2011-07-31 17:44:09 +10:00
Yusuke Shinyama eec317ae10 Merge pull request #6 from rsennrich/master
cleaner widths for Adobe core 14 fonts. (thanks to rsennrich)
2011-07-31 00:39:36 -07:00
Yusuke Shinyama 24cd161fb7 CCITTFaxFilter.reversed fix 2011-07-31 17:36:02 +10:00
Rico 6e4f36d9a1 get width based on utf-8 char.
fills some gaps and fixes inconsistencies between standard encodings
2011-07-23 16:34:11 +02:00
Yusuke Shinyama dc8fde0e47 added CCITTFaxFilter support and a very crude image extraction. 2011-07-18 21:07:00 +10:00
Yusuke Shinyama 2707ba75df added CCITTFaxFilter support and a very crude image extraction. 2011-07-18 21:06:50 +10:00
Yusuke Shinyama fda6f7ba5d ccitt.py added. 2011-07-18 17:36:37 +10:00
Yusuke Shinyama 0278076ea8 PNG predictor added 2011-06-07 00:46:33 +09:00
Yusuke Shinyama 18a5058af6 separated predictor functions. 2011-06-07 00:31:03 +09:00
Yusuke Shinyama 170c97a12b colorspace patch by Lieb Simon 2011-06-06 17:10:12 +09:00
Yusuke Shinyama 2e8180ddee documentation update and version bump 2011-05-15 01:37:14 +09:00
Yusuke Shinyama c134596e2f code cleanup and testcase stabilization 2011-05-15 01:22:19 +09:00
Yusuke Shinyama e5d02f8653 fixed the infinite recursion bug. 2011-05-14 16:32:09 +09:00
Yusuke Shinyama 0c41b8348e code cleanup 2011-05-14 15:51:40 +09:00
Yusuke Shinyama 038ce4cd0c added LTText.get_text() and .text property is no longer accessible. 2011-05-14 15:45:08 +09:00
Yusuke Shinyama 5004e4b28d layout analysis speedup. 2011-05-14 14:17:39 +09:00
Yusuke Shinyama 095534b294 figure object now does not call analyze. 2011-05-14 14:17:22 +09:00
Yusuke Shinyama b8d516fc52 extended Plane class. 2011-05-14 14:16:40 +09:00
Yusuke Shinyama fcf0d74ecc tweaks for debugging 2011-04-21 22:07:52 +09:00
Yusuke Shinyama 8f9684f6a6 code cleanup: layout analysis 2011-04-21 22:07:04 +09:00
Yusuke Shinyama 0e660dd385 rename: LTPolygon -> LTCurve 2011-04-20 22:05:25 +09:00
Yusuke Shinyama dab70855bf LTLine is now strictly horizontal or vertical. 2011-04-20 22:01:54 +09:00
Jonathan J Hunt ec682539da Optimized memory usage in TextConverter by ignoring all drawing commands. 2011-03-07 15:11:31 +10:00
Yusuke Shinyama 4918d59bc2 disable caching support 2011-03-03 00:04:43 +09:00
Yusuke Shinyama 18e782f330 canonicalize package names 2011-03-02 23:43:03 +09:00
Yusuke Shinyama bb26cf9180 eliminate empty textboxes 2011-03-01 20:47:20 +09:00
Yusuke Shinyama dfd621b98c minor bugfix. thanks to Hiroshi Manabe. 2011-02-28 19:50:07 +09:00
Yusuke Shinyama f22b056454 release-20110227 2011-02-27 19:53:12 +09:00
Yusuke Shinyama a8bf9b159e docstring fix 2011-02-27 13:09:12 +09:00
Yusuke Shinyama cabaa10e4f layout analysis improvement 2011-02-27 12:56:28 +09:00
Yusuke Shinyama 7dbb664db3 code cleanup and more debugging options 2011-02-14 23:42:05 +09:00
Yusuke Shinyama f00f1dbd04 better layout analysis 2011-02-14 23:41:23 +09:00
Yusuke Shinyama b2d13db29a code cleanup 2011-02-14 22:51:20 +09:00
Yusuke Shinyama cd412308bd text flow detection bug fix (thanks to fujimoto-san) 2011-02-14 22:32:55 +09:00
Yusuke Shinyama cbd58121e3 fix aggressive vertical writing detection (which ruins layout) 2011-02-02 23:09:34 +09:00
Yusuke Shinyama 109aedeb43 cfffont extension with no luck 2011-01-25 00:19:07 +09:00
Yusuke Shinyama 4eb6083c09 code cleanup 2011-01-03 18:11:22 +09:00
Yusuke Shinyama 16b2a87b24 CMAP_PATH environment variable support 2011-01-03 18:11:16 +09:00
Yusuke Shinyama 420169a692 release 20101226 2010-12-26 19:06:47 +09:00
Yusuke Shinyama a24c452ba2 boxes_flow patch by Daniel Gerber 2010-12-26 17:26:39 +09:00
Yusuke Shinyama 3da3adad9b method renamed: finish(self) -> analyze(self, laparams). 2010-12-26 16:56:21 +09:00
yusuke.shinyama.dummy 84ed94aec0 another bugfix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@281 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:41:03 +00:00
yusuke.shinyama.dummy 9bba7ac08b oops, forgot to fix this
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@280 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:40:58 +00:00
yusuke.shinyama.dummy f4ced29713 bugfix by Kevin Brubeck Unhammer
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@278 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-12-25 08:40:45 +00:00
yusuke.shinyama.dummy 2bf9c23801 check_extractable parameter added
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@276 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-23 10:53:28 +00:00
yusuke.shinyama.dummy 9f78915ea6 show cid for unknown characters
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@275 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-23 10:53:19 +00:00
yusuke.shinyama.dummy 7374b81383 htmlconverter improved
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@274 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 15:04:28 +00:00
yusuke.shinyama.dummy fb4ce96309 add font-family
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@273 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 10:07:50 +00:00
yusuke.shinyama.dummy 476ecf7e32 add html exact layout mode; default changed.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@272 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 10:07:41 +00:00
yusuke.shinyama.dummy 08c5c66917 add debugging features
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@271 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 10:07:34 +00:00
yusuke.shinyama.dummy 434b24b6e5 remove unused method
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@270 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-14 10:07:27 +00:00
yusuke.shinyama.dummy 0d1f00fa9b improved layout analysis for vertical script
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@269 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:40:14 +00:00
yusuke.shinyama.dummy 9584845358 layout analysis improved
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@268 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:40:05 +00:00
yusuke.shinyama.dummy edbd3764a7 html layout output fix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@267 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:39:48 +00:00
yusuke.shinyama.dummy 1904b61355 documentation
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@266 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-11-09 10:39:40 +00:00
yusuke.shinyama.dummy 1a25c61a9f fix empty hexstring bug and test cases.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@265 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-27 12:29:00 +00:00
yusuke.shinyama.dummy 509ab66319 stay with python2
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@264 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-19 09:57:01 +00:00
yusuke.shinyama.dummy 438b4953be documentation bit and code cleanup
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@263 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-18 15:04:49 +00:00
yusuke.shinyama.dummy 71863aec67 minor bugfix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@262 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-18 15:04:43 +00:00
yusuke.shinyama.dummy 6a4b70f54a code cleanup
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@261 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-18 15:04:38 +00:00
yusuke.shinyama.dummy 98442ed943 update the version number and documentation
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@256 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:58 +00:00
yusuke.shinyama.dummy cc139db8a7 bugfix LTChar.is_vertical undefined. verticality is now handled by LTTextBox
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@254 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:23 +00:00
yusuke.shinyama.dummy 21f6cf8fb6 removed PDFStream.decomp(). turned out zlib can handle trailing bytes.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@253 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:18 +00:00
yusuke.shinyama.dummy 0ecd0b8f9d attempt to recover encoding info from texfont
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@252 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:15:12 +00:00
yusuke.shinyama.dummy afe33312c6 outline bug fixed
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@249 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:14:52 +00:00
yusuke.shinyama.dummy 0b962443ed patch by Alexander Garden
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@248 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:14:46 +00:00
yusuke.shinyama.dummy 69d9d85685 nunpack TypeError fix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@246 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:13:52 +00:00
yusuke.shinyama.dummy 3305c07ba2 layout analysis improved
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@245 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:13:39 +00:00
yusuke.shinyama.dummy bc1303e901 layout analysis improvement 1
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@244 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-10-17 05:13:33 +00:00
yusuke.shinyama.dummy 3b2aabaa10 version bump
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@243 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 07:00:01 +00:00
yusuke.shinyama.dummy 0944cfaded test file simple3.pdf added.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@240 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:39:41 +00:00
yusuke.shinyama.dummy 83d2086f19 fix minor layout issue
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@239 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:39:31 +00:00
yusuke.shinyama.dummy b871331659 improvement in fallback
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@238 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-29 06:39:24 +00:00
yusuke.shinyama.dummy 4554705881 glyphlist bug (due to my misunderstanding of spec.)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@237 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:46 +00:00
yusuke.shinyama.dummy ac74542d1f minor bugfixes
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@234 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-08-26 15:02:29 +00:00
yusuke.shinyama.dummy 1a8692124f version bump
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@233 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-19 04:31:12 +00:00
yusuke.shinyama.dummy 2d02833936 release 20100619
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@230 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-19 03:58:20 +00:00
yusuke.shinyama.dummy f5aff374fc some wordings and documentations
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@229 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-19 03:56:50 +00:00
yusuke.shinyama.dummy a0dd46bd8e cmap compression patch. thanks to Jakub Wilk
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@228 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-13 13:50:24 +00:00
yusuke.shinyama.dummy 3f831c8104 bugfixes. thanks to Jakub Wilk
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@226 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-13 04:02:30 +00:00
yusuke.shinyama.dummy 702f3088ae unittest failure fix
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@222 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-06 05:16:29 +00:00
yusuke.shinyama.dummy cf52476f5e remove redundancy
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@221 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-06-06 05:16:21 +00:00
yusuke.shinyama.dummy fe3bdbfce0 text rise support added
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@217 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-18 14:57:04 +00:00
yusuke.shinyama.dummy 8e92ddca30 latin2ascii.py was moved as a utility
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@215 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-05-05 05:51:11 +00:00
yusuke.shinyama.dummy 7f587cafec some usage document added
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@214 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:31 +00:00
yusuke.shinyama.dummy eb535d4106 change PDFPageAggregator -> PDFLayoutAnalyzer
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@213 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:21 +00:00
yusuke.shinyama.dummy 833f859449 move TagExtractor
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@212 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 13:31:11 +00:00
yusuke.shinyama.dummy a16eba30b7 release 20100424
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@210 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 04:32:21 +00:00
yusuke.shinyama.dummy 97848409e5 fix xobject resources bug, thanks to Jose Maria
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@209 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 04:32:03 +00:00
yusuke.shinyama.dummy 9052cd1ea7 better TOC extraction
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@207 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-24 01:34:18 +00:00
yusuke.shinyama.dummy e77a6ba997 -A (all_texts) option added for layout analysis
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@205 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-10 11:30:03 +00:00
yusuke.shinyama.dummy 609c6e1f5f rename: LayoutItem -> LTItem, LayoutContainer -> LTContainer
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@203 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-10 11:29:30 +00:00
yusuke.shinyama.dummy c81142aa44 image handling addition (untested)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@202 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-04-10 11:05:02 +00:00
yusuke.shinyama.dummy 71defb2272 documentation bit, ready for release-20100327
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@198 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-27 06:06:09 +00:00
yusuke.shinyama.dummy 5f822f6dcb improved layout analysis.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@197 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-26 11:11:35 +00:00
yusuke.shinyama.dummy 2e5b92c18a writing mode detection
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@196 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-25 11:38:47 +00:00
yusuke.shinyama.dummy e536b3ef11 more bugfixes.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@194 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-23 10:29:52 +00:00
yusuke.shinyama.dummy ee34d8d549 bugfix (thanks to Brian Berry).
Remaining TODOs: automatic testing for vertical texts. Various layout analysis tuning.


git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@193 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 08:36:39 +00:00
yusuke.shinyama.dummy 25636d7c08 release-20100322
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@192 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 06:22:33 +00:00
yusuke.shinyama.dummy 40b36a7c42 consistent test results
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@191 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 06:04:54 +00:00
yusuke.shinyama.dummy a6523d1a9a patch from pietvo.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@190 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 04:46:59 +00:00
yusuke.shinyama.dummy fa13122f09 add regression tests.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@189 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 04:34:52 +00:00
yusuke.shinyama.dummy cd39642abe code cleanup
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@188 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-22 04:00:18 +00:00
yusuke.shinyama.dummy e01cb43e31 add novel layout analysis
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@187 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-21 02:21:37 +00:00
yusuke.shinyama.dummy ffaaea0bac layout analysis changed drastically.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@186 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-20 05:43:34 +00:00
yusuke.shinyama.dummy 85c5476623 A couple of bugfixes. Thanks to Sean Manefield.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@185 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-03-12 13:47:39 +00:00
yusuke.shinyama.dummy 23be96c49e CAUTION! changed the way of internal layout handling.
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@184 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-27 03:59:25 +00:00
yusuke.shinyama.dummy 2555b38836 fix typos (patches by sm)
git-svn-id: https://pdfminerr.googlecode.com/svn/trunk/pdfminer@183 1aa58f4a-7d42-0410-adbc-911cccaed67c
2010-02-15 14:50:19 +00:00