Commit Graph

794 Commits (410d7ecac304100b6d2c2a08aeb3b80510dbcb96)

Author SHA1 Message Date
Ivan Pozdeev 63c9378b8b Make ValueErrors descriptive 2015-08-10 03:14:51 +03:00
orangain e143ad7ba8 Ensure to install required libraries on installation 2015-08-06 20:55:57 +09:00
Goulu bc8d631a7c Merge pull request #6 from GreenLightGo/hotfix/strict-setting
change STRICT to be a settings attribute
2015-07-21 10:43:39 +02:00
Alex Zagorodniuk 131cb1ea92 change STRICT to be a settings attribute 2015-06-22 19:08:35 -04:00
Pablo Castellano 9af4fe85e1 README: Changed line about Python 3 support 2015-06-14 17:02:12 +02:00
Goulu 623bd98452 Update __init__.py
version 20150601
2015-06-01 10:21:51 +02:00
Goulu 30e14ddf65 Merge pull request #5 from cathalgarvey/master
Lots of changes to improve compatibility and modularity
2015-06-01 10:18:49 +02:00
Cathal Garvey e2d3adc8c1 Adding chardet to Travis 2015-05-30 19:35:05 +01:00
Cathal Garvey 403711ed13 Whoops, forgot to version-gate chardet in the actual code. Thanks Travis! 2015-05-30 19:33:35 +01:00
Cathal Garvey a2ad7a6d03 Fixed some bugs preventing all tests from passing in Py2. 2015-05-30 18:02:29 +01:00
Cathal Garvey 79c97ac221 Docstrings. 2015-05-30 17:16:06 +01:00
Cathal Garvey 268e9fb2bd Removed typechecking, nothing's exploded yet and argparse does lots of heavy lifting already. 2015-05-30 17:05:28 +01:00
Cathal Garvey 3b7edba48c Forgot to add the actual compartmentalised function.. 2015-05-30 17:04:28 +01:00
Cathal Garvey b3553cef10 Cleaning up pdf2txt.py after the partition/move. 2015-05-30 17:03:55 +01:00
Cathal Garvey cbe270a4bf Killed the old main function for pdf2txt.py 2015-05-30 16:37:22 +01:00
Cathal Garvey ead8e778a6 Successfully compartmentalised code, getting closer to moving pdf->text as a module function. 2015-05-30 16:27:58 +01:00
Cathal Garvey 08cb217983 Progress, progress.. not nearly atomic enough, sorry. 2015-05-30 16:14:24 +01:00
Cathal Garvey 1b47bed306 Many changes to make pdf2txt.py work better in Py3, some in that script, others in module!
Sorry, changes should have been more atomic.

*In pdf2txt.py:*

* Re-wrote main function to use argparse instead of optparse.
* Manually tested in Py2/Py3 to get partial consistency.
* Errors abound including Tags mode, but most modes weren't working at all in Py3 anyway.
* Py2 mode *probably* unchanged, cannot find any bugs yet...
* Kept old main function for posterity, for now.

*In utils:*

* Added a few compatibility functions (some string hax required chardet, new dependency):
    - make_compat_bytes(in_str)-> (py3->bytes | py2->str)
    - make_compat_str(in_str)-> (str)
    - compatible_encode_method(bytesorstring, encoding, erraction)-> (str)

*In pdfdevice:*

* To handle different output filetypes in Py3, injected lots of calls to new utils methods,
  as well as some six.PYX checks and logic. These changes are largely responsible for
  enhanced Py2/Py3 consistency.

*In converter:*

* To handle output filetypes in Py2, injected a few checks and fixes particularly around the
  py2 `str.encode` method and its *assumed* usual use-analogies in Py3.
2015-05-17 21:08:57 +01:00
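
Note on commit 1b47bed306: the first two compatibility helpers named in the message above can be pictured roughly as follows. This is only a minimal sketch built from the signatures quoted in the commit message, not the code that was committed. The function bodies, the `PY3` and `text_type` names, and the fixed UTF-8 encoding are all assumptions made for illustration (the commit mentions six and chardet, neither of which is used here). `compatible_encode_method` is left out because the message does not say enough about its behaviour to sketch it.

```python
import sys

# Assumption: a plain version check stands in for the six.PY2/six.PY3 flags.
PY3 = sys.version_info[0] >= 3
# Native text type: str on Py3, unicode on Py2 (only evaluated on Py2).
text_type = str if PY3 else unicode  # noqa: F821

# Assumption: UTF-8 is used throughout; the commit says chardet was needed
# for some cases, which this sketch does not attempt to reproduce.
ENCODING = "utf-8"


def make_compat_bytes(in_str):
    """Return the byte-string type of the running interpreter.

    Py3 -> bytes, Py2 -> str. Text input is encoded; byte input is
    returned unchanged.
    """
    if isinstance(in_str, text_type):
        return in_str.encode(ENCODING)
    return in_str


def make_compat_str(in_str):
    """Return the native str type of the running interpreter."""
    if PY3:
        if isinstance(in_str, bytes):
            return in_str.decode(ENCODING, "replace")
        return in_str
    # Py2: native str is a byte string, so unicode has to be encoded.
    if isinstance(in_str, text_type):
        return in_str.encode(ENCODING)
    return in_str


if __name__ == "__main__":
    print(repr(make_compat_bytes("abc")))
    print(repr(make_compat_str(b"abc")))
```

The point of such helpers is that calling code can ask for "bytes" or "native str" once, instead of branching on the interpreter version at every output site.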
Yusuke Shinyama 14fd0fd2d6 Fixed: #84 (fontname was in unicode) 2015-04-05 19:02:02 +09:00
Ashley Blackmore 1dbe9ff7e7 Update setup.py
Install missing pycrypto lib
2015-02-18 18:35:53 +01:00
speedplane 5609418351 Add gz to gitignore. 2014-12-14 01:29:39 -05:00
speedplane 69afd3dd30 Use a .gitignore file. 2014-12-14 01:23:44 -05:00
speedplane 2199c25493 Add my own .gitignore. 2014-12-12 00:37:54 -05:00
speedplane 806ee603ff More fixes to layout. The compute neighbors function for horizontal lines is only intended to find neighbors on differing lines. However, it's entirely possible that horizontal neighbors could appear.
This commit finds horizontal neighbors in a horizontal line and merges them together into a single horizontal line if necessary. This leads to much better text extraction if the PDF was created in a funky way.

For example (test case coming), I have seen PDFs which are written almost like vertical columns, but the text is entirely horizontal.
2014-12-12 00:36:59 -05:00
speedplane 45170e7183 There are a number of relatively complex changes here. Comments are in order of where the change appears.
1.
When detecting text in a horizontal line, we already add a space between words if they are separated by more than word_margin. Now, however, we only do so if there is not already an existing space. This prevents multiple spaces from being placed between words.

2.
Detect a horizontal line if the line is zero width. This improves our detection of horizontal lines when looking for both horizontal and vertical lines.

3.
Don't detect a vertical line if the previous letter is whitespace. This prevents double spaces from being caught as vertical lines.

4.
Improve upon an unfortunate O(N^2) algorithm which I have seen taking many minutes to execute.  Unfortunately, while the "fix" reduces algorithmic complexity, it isn't technically correct, so we only do it when we know things will take a long time.
2014-12-12 00:36:59 -05:00
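
Note on commit 45170e7183, item 1: the spacing rule described above can be sketched as follows. This is a simplified illustration, not the layout code from the commit. The `Char` tuple, the `join_line` function, and the use of an absolute gap threshold for `word_margin` are assumptions made for the example. The check on the upcoming character is also an addition of this sketch; the commit message only mentions an existing space.

```python
from collections import namedtuple

# Hypothetical stand-in for a laid-out character: its text and the
# left (x0) and right (x1) edges of its bounding box.
Char = namedtuple("Char", "text x0 x1")


def join_line(chars, word_margin):
    """Join characters of one horizontal line, left to right.

    A space is inserted when the gap to the previous character exceeds
    word_margin, unless a space is already present at that position.
    """
    out = []
    prev_x1 = None
    for ch in chars:
        gap_is_wide = prev_x1 is not None and (ch.x0 - prev_x1) > word_margin
        # Already spaced if the last emitted character is whitespace, or
        # (assumption of this sketch) the upcoming character is whitespace.
        already_spaced = bool(out and out[-1].isspace()) or ch.text.isspace()
        if gap_is_wide and not already_spaced:
            out.append(" ")
        out.append(ch.text)
        prev_x1 = ch.x1
    return "".join(out)


if __name__ == "__main__":
    line = [Char("a", 0, 5), Char(" ", 5, 8), Char("b", 12, 17)]
    print(repr(join_line(line, word_margin=2)))
```

Without the `already_spaced` check, the example line would get a second space before "b", which is the duplicate-space problem the commit message describes.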
speedplane c32550dd4a Merge branch 'fix-makefile' 2014-12-11 00:54:14 -05:00
speedplane 5cbdd915c7 Remove the dependency on python2. Also, allow tests to be run on cygwin by checking for it, and converting unix2dos line endings. 2014-12-11 00:53:33 -05:00
speedplane 830b2403e2 Merge branch 'euske-main/master' 2014-12-11 00:06:46 -05:00
Yusuke Shinyama 0112112458 Fixed: crash on invalid chr number. 2014-12-09 22:55:47 +09:00
Yusuke Shinyama 75206ba18d Removed: .gitignore 2014-12-09 22:49:13 +09:00
Yusuke Shinyama 4b585221e2 Merge pull request #76 from speedplane/master
Fix Unicode Bug + Add GitIgnore + Add Debug Flags
2014-12-09 22:22:33 +09:00
Philippe Guglielmetti 448aa08bc4 Merge pull request #4 from enkore/master
Fix utils.decode_text
2014-12-05 09:58:58 +01:00
enkore d0379a2c44 Fix utils.decode_text 2014-12-04 17:09:52 +01:00
speedplane 36977fbe08 Add debug flags for much of the debug output. 2014-11-11 23:36:58 -05:00
speedplane 1067cb9f9f Use a .gitignore file. 2014-11-11 23:36:26 -05:00
speedplane ecc4d05675 Fix a unicode conversion bug.
See https://github.com/euske/pdfminer/issues/75
2014-11-11 23:34:33 -05:00
Philippe Guglielmetti 0e40264071 Merge pull request #3 from Cybjit/master
Samples and latin1 passwords
2014-09-17 07:22:52 +02:00
cybjit 515687e1bb more xrange to range 2014-09-16 23:17:31 +02:00
cybjit 2639b15ef4 guess argv encoding in py2 using sys.stdin.encoding 2014-09-16 23:17:26 +02:00
cybjit 9b2e29396b apply_png_predictor py3 2014-09-16 22:59:29 +02:00
cybjit ad05121c69 password py3 2014-09-16 22:59:00 +02:00
cybjit 14585987c3 keep password api unicode, latin1 or utf-8 is encoded in handler 2014-09-16 22:58:25 +02:00
cybjit 2260f77b19 fix dict_value usage in strict mode 2014-09-16 22:57:29 +02:00
cybjit 51a361c145 clean up HTMLConverter and XMLConverter encoding 2014-09-16 22:57:00 +02:00
cybjit 2ee7153f6e add python3 in sample Makefile 2014-09-16 22:56:13 +02:00
Goulu f577f76c52 renamed as pdfminer.six in PyPi 2014-09-15 11:10:00 +02:00
Goulu 03de0f4db8 forgot 'six' requirement ... 2014-09-15 10:42:08 +02:00
Goulu 8861d7e0ed version 20140915 pushed to PyPi as pdfminer_six 2014-09-15 10:33:04 +02:00
Philippe Guglielmetti 4f8aa9ff5b Merge pull request #2 from Cybjit/master
CMap fixes and speed improvements
2014-09-12 07:33:06 +02:00
cybjit 714423883c setup logging for pdf2txt and fix dumppdf 2014-09-12 00:29:31 +02:00