637 Commits (ec8530f6cf992ebbb8f23a3fcdbee729f4163689)

Author SHA1 Message Date
Cathal Garvey e2d3adc8c1 Adding chardet to Travis 2015-05-30 19:35:05 +01:00
Cathal Garvey 403711ed13 Whoops, forgot to version-gate chardet in the actual code. Thanks Travis! 2015-05-30 19:33:35 +01:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 79c97ac221 Docstrings. 2015-05-30 17:16:06 +01:00
Cathal Garvey 268e9fb2bd Removed typechecking, nothing's exploded yet and argparse does lots of heavy lifting already. 2015-05-30 17:05:28 +01:00
Cathal Garvey 3b7edba48c Forgot to add the actual compartmentalised function.. 2015-05-30 17:04:28 +01:00
Cathal Garvey b3553cef10 Cleaning up pdf2txt.py after the partition/move. 2015-05-30 17:03:55 +01:00
Cathal Garvey cbe270a4bf Killed the old main function for pdf2txt.py 2015-05-30 16:37:22 +01:00
Cathal Garvey ead8e778a6 Successfully compartmentalised code, getting closer to moving pdf->text as a module function. 2015-05-30 16:27:58 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
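
As a rough illustration of the compatibility helpers named in the commit above, here is a minimal sketch. Only the names `make_compat_bytes` and `make_compat_str` and their documented return types come from the commit message; the bodies are assumptions. In particular, this sketch uses a fixed UTF-8 default and `sys.version_info` where the commit mentions chardet and six, and the `encoding` parameter is hypothetical.

```python
import sys

PY3 = sys.version_info[0] >= 3


def make_compat_bytes(in_str):
    """Return bytes on Py3 and str on Py2 (assumed behaviour, not the original code)."""
    if isinstance(in_str, bytes):
        # Already bytes on Py3, or a native str on Py2 (where bytes is str).
        return in_str
    # Py3 str or Py2 unicode: encode with an assumed UTF-8 default.
    return in_str.encode("utf-8")


def make_compat_str(in_str, encoding="utf-8"):
    """Return the interpreter's native str type (assumed behaviour)."""
    if isinstance(in_str, str):
        return in_str
    if PY3:
        # Py3 bytes -> str; undecodable bytes are replaced rather than raising.
        return in_str.decode(encoding, "replace")
    # Py2 unicode -> str
    return in_str.encode(encoding)
```

The original helpers reportedly relied on chardet to guess encodings, so their behaviour on non-UTF-8 input likely differs from this sketch.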
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
Ashley Blackmore 1dbe9ff7e7 Update setup.py
Install missing pycrypto lib
2015-02-18 18:35:53 +01:00
speedplane 5609418351 Add gz to gitignore. 2014-12-14 01:29:39 -05:00
speedplane 69afd3dd30 Use a .gitignore file. 2014-12-14 01:23:44 -05:00
speedplane 2199c25493 Add my own .gitignore. 2014-12-12 00:37:54 -05:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
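
The merge described in the commit above can be sketched roughly as follows. This is not the layout code from the repository: the function name `merge_horizontal`, the `(x0, y0, x1, y1)` tuple representation, and the `max_gap` threshold are all assumptions made for illustration.

```python
def merge_horizontal(boxes, max_gap=1.0):
    """Merge boxes that share a row and sit close together horizontally.

    boxes: list of (x0, y0, x1, y1) tuples (hypothetical representation).
    max_gap: largest horizontal gap still treated as "the same line" (assumed).
    """
    merged = []
    for box in sorted(boxes):  # left to right by x0
        x0, y0, x1, y1 = box
        for i, (mx0, my0, mx1, my1) in enumerate(merged):
            same_row = y0 < my1 and my0 < y1  # vertical extents overlap
            close = x0 - mx1 <= max_gap       # small (or negative) horizontal gap
            if same_row and close:
                # Grow the existing box to cover both fragments.
                merged[i] = (mx0, min(my0, y0), max(mx1, x1), max(my1, y1))
                break
        else:
            merged.append(box)
    return merged
```

For example, two fragments on the same row that are 0.5 units apart collapse into one box, while a box on a different row is left alone.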
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. Now we only do so if there is not already a space there, which prevents multiple spaces being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical.

3.
Don't detect a vertical line if the previous letter is whitespace. Prevents double spaces being caught as vertical lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
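
Point 1 of the commit above (insert a space only when the gap is wide and no space is already present) might look something like the sketch below. The function name `join_chars`, the `(text, x0, x1)` tuples, and scaling `word_margin` by the character width are assumptions for illustration, not the repository's actual layout code.

```python
def join_chars(chars, word_margin=0.1):
    """Join characters into a string, adding at most one space per wide gap.

    chars: list of (text, x0, x1) tuples sorted left to right (hypothetical).
    word_margin: gap threshold as a fraction of the character width (assumed).
    """
    out = []
    prev_x1 = None
    for text, x0, x1 in chars:
        if prev_x1 is not None:
            margin = word_margin * (x1 - x0)
            gap_is_wide = prev_x1 < x0 - margin
            # Skip the inserted space if either side is already whitespace.
            already_spaced = text.isspace() or out[-1].isspace()
            if gap_is_wide and not already_spaced:
                out.append(" ")
        out.append(text)
        prev_x1 = x1
    return "".join(out)
```

With this rule, a PDF that already encodes an explicit space character between two distant words yields a single space in the output rather than two.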
speedplane c32550dd4a Merge branch 'fix-makefile' 2014-12-11 00:54:14 -05:00
speedplane 5cbdd915c7 Remove the dependency on python2. Also, allow tests to be run on cygwin by checking for it, and converting unix2dos line endings. 2014-12-11 00:53:33 -05:00
speedplane 830b2403e2 Merge branch 'euske-main/master' 2014-12-11 00:06:46 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
Yusuke Shinyama 75206ba18d Removed: .gitignore 2014-12-09 22:49:13 +09:00
Yusuke Shinyama 4b585221e2 Merge pull request #76 from speedplane/master
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
Philippe Guglielmetti 448aa08bc4 Merge pull request #4 from enkore/master
Fix utils.decode_text
2014-12-05 09:58:58 +01:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane 1067cb9f9f Use a .gitignore file. 2014-11-11 23:36:26 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Philippe Guglielmetti 0e40264071 Merge pull request #3 from Cybjit/master
Samples and latin1 passwords
2014-09-17 07:22:52 +02:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 2639b15ef4 guess argv encoding in py2 using sys.stdin.encoding 2014-09-16 23:17:26 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
cybjit 2ee7153f6e add python3 in sample Makefile 2014-09-16 22:56:13 +02:00
Goulu f577f76c52 renamed as pdfminer.six in PyPi 2014-09-15 11:10:00 +02:00
Goulu 03de0f4db8 forgot 'six' requirement ... 2014-09-15 10:42:08 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
Philippe Guglielmetti 4f8aa9ff5b Merge pull request #2 from Cybjit/master
CMap fixes and speed improvements
2014-09-12 07:33:06 +02:00
cybjit 714423883c setup logging for pdf2txt and fix dumppdf 2014-09-12 00:29:31 +02:00
cybjit 39942b6642 avoid string formatting when not logging 2014-09-12 00:29:31 +02:00
cybjit 01821c7d1e rename bytes to avoid built-in collision 2014-09-12 00:29:31 +02:00
cybjit 31e6afc7cf faster and simpler bytes implementation 2014-09-12 00:29:30 +02:00
cybjit ed13f7c47d conv_cmap py3 compat 2014-09-12 00:29:30 +02:00
cybjit cba5a42ba8 decipher_all bytes 2014-09-12 00:29:30 +02:00
cybjit 6357e2da80 code2cid uses int, not byte 2014-09-12 00:29:27 +02:00
cybjit 9b0a3ee53e decode cmap font name 2014-09-11 23:30:02 +02:00