pdfminer.six/samples
Jeremy Singer-Vine 016239c146
Fix .paint_path handling of single line segments (#530)
* Fix .paint_path handling of single line segments

- Fixes typo ("ml" should have been "mlh")

- Removes if-statement that required individual line segments to be
  strictly horizontal or vertical.

* Treat 'ml'-shape paths as lines not curves

Although 'mlh' is the canonical implementation for a single line
segment, 'ml' is fairly common.

Adds tests and sample PDF.

* Fix trailing whitespace

* Fix point-extraction from Bézier path commands

This commit corrects the manner in which "pts" are extracted from Bézier
path commands. See Table 4.9 of PDF reference manual, and new comments
in code for details. Previously, depending on the command (c, v, or y),
the code was extracting some combination of control points (not on the
curve) and the actual points on the curve.

This commit also refactors .paint_path, so that apply_matrix_pt is only
called in one place, and to treat the "h" command in a manner more
consistent with other path commands.

* Add comments to test_paint_path_quadrilaterals

* Parse rect-forming mllll paths as rects not curves

Now that .paint_path has been refactored, adding support for
rect-forming mllll paths requires no extra code, beyond a minor tweak to
the relevant elif statement.

* One changelog line with ref to mr

* Remove PDFLayoutAnalyzer._create_curve because implementation has become trivial due to refactoring

* Extract variables from if statement to make it easier to read

* Optimize imports order

* Trigger travis build

* Revert "Trigger travis build"

This reverts commit 41c05184

* Update travis badge

* Update travis badge

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
2021-07-27 18:27:32 +02:00
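
The path handling described in the commit message above can be sketched
roughly as follows. This is a simplified illustration, not the actual
pdfminer.six code: the names path_points and classify_path are
hypothetical, paths are assumed to be lists of tuples such as
("l", x, y), and the axis-aligned test used for rects is an assumption
about what "rect-forming" means.

```python
from typing import List, Sequence, Tuple, Union

Point = Tuple[float, float]
PathCommand = Sequence[Union[str, float]]


def path_points(path: List[PathCommand]) -> List[Point]:
    """Return the on-curve point of every command in a single subpath."""
    start = (float(path[0][-2]), float(path[0][-1]))
    points = []
    for command in path:
        if command[0] == "h":
            # "h" has no operands; it closes the subpath at its start point.
            points.append(start)
        else:
            # m, l, c, v and y all end with the on-curve point (x, y);
            # any earlier operands are Bezier control points.
            points.append((float(command[-2]), float(command[-1])))
    return points


def classify_path(path: List[PathCommand]) -> str:
    """Classify a subpath as "line", "rect" or "curve" (hypothetical labels)."""
    shape = "".join(str(command[0]) for command in path)
    pts = path_points(path)
    if shape in ("ml", "mlh"):
        # A single segment, with or without the closing "h".
        return "line"
    if shape in ("mlllh", "mllll"):
        (x0, y0), (x1, y1), (x2, y2), (x3, y3), last = pts
        closed = last == pts[0]
        # Assumption: a rect has sides parallel to the axes.
        axis_aligned = (
            x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0
        ) or (
            y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
        )
        if closed and axis_aligned:
            return "rect"
    return "curve"
```

For example, classify_path([("m", 0, 0), ("l", 10, 5)]) gives "line",
and a closed four-sided path with one slanted side falls through to
"curve".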
acroform Add section to documentation with howto for AcroForm fields extraction (#458) 2020-09-10 19:18:41 +02:00
contrib Fix .paint_path handling of single line segments (#530) 2021-07-27 18:27:32 +02:00
encryption Change pycryptodome dependency to the faster, smaller, and industry standard cryptography package (#456) 2020-07-20 22:00:54 +02:00
nonfree Fix failing test on develop & cleaning up test files (#319) 2019-10-26 18:42:33 +02:00
scancode Add a test for the previous fix 2017-10-16 12:35:16 +02:00
README Added: tests for extracting tests from pdfs with Type3 fonts (#205) 2019-10-22 18:15:59 +02:00
font-size-test.pdf Fix bug in computing character bounding box (#348) 2020-01-16 22:15:50 +01:00
jo.pdf add samples, fixed silly bugs. 2007-12-31 05:02:15 +00:00
sampleOneByteIdentityEncode.pdf Adds Test Case 2019-08-10 10:19:20 +05:30
simple1.pdf testcase added 2009-10-24 02:50:07 +00:00
simple2.pdf various cleanup for release. 2008-04-27 11:47:38 +00:00
simple3.pdf test file simple3.pdf added. 2010-08-29 06:39:41 +00:00
simple4.pdf Fix ordering of textlines within a textbox when boxes_flow is disabled (#412) 2020-05-09 15:37:49 +02:00

README

This directory contains sample PDF files.

These files (including the ones in the nonfree/ subdirectory) can be
distributed freely but do not come with explicit licensing
terms or source files.

Here are the credits of the original files:

simple1.pdf:
  (Originally taken from PDF Specification 1.7, 
  Appendix G. "Simple Text String Example" and modified)

simple2.pdf:
  (Originally taken from PDF Specification 1.7, 
  Appendix G. "Simple Graphics Example" and modified)

jo.pdf:
  Kenji Miyazawa (1896-1933, copyright expired)
  Preface of "Haru to Shura"
  (File generated from jo.tex by LaTeX and dvi2pdfm)

--
contrib/matplotlib.pdf
  Copyright 2018, James R Barlow
  Example file created in matplotlib to add a Type3 font to the samples
  Released under the terms of the "LICENSE" file

--
nonfree/cmp_itext_logo.pdf
  Bruno Lowagie
  "iText Logo - Type 3 font"
  http://gitlab.itextsupport.com/itext/sandbox/raw/master/cmpfiles/fonts/cmp_itext_logo.pdf

nonfree/dmca.pdf: 
  U.S. Copyright Office
  The Digital Millennium Copyright Act
  http://www.copyright.gov/legislation/dmca.pdf

nonfree/f1040nr.pdf:
  U.S. Department of the Treasury Internal Revenue Service
  Form 1040-NR, U.S. Nonresident Alien Income Tax Return
  http://www.irs.gov/pub/irs-pdf/f1040nr.pdf

nonfree/i1040nr.pdf:
  U.S. Department of the Treasury Internal Revenue Service
  Instructions for Form 1040-NR, U.S. Nonresident Alien Income Tax Return
  http://www.irs.gov/pub/irs-pdf/i1040nr.pdf

nonfree/kampo.pdf:
  National Printing Bureau of Japan
  Official Gazette, Vol. 4817
  http://kanpou.npb.go.jp/

nonfree/nlp2004slides.pdf:
  Yusuke Shinyama and Satoshi Sekine
  "Named Entity Discovery from Comparable News Corpora"

nonfree/naacl06-shinyama.pdf:
  Yusuke Shinyama and Satoshi Sekine
  "Preemptive Information Extraction using Unrestricted Relation Discovery"

--
Files in the encryption folder have been generated with cpdf 1.7 [http://www.coherentpdf.com/]
from the base.pdf file generated with LibreOffice 4.1.1.2 as follows:

cpdf -encrypt 40bit foo baz base.pdf -o rc4-40.pdf
cpdf -encrypt 128bit foo baz base.pdf -o rc4-128.pdf
cpdf -encrypt AES foo baz base.pdf -o aes-128.pdf
cpdf -encrypt AES foo baz base.pdf -no-encrypt-metadata -o aes-128-m.pdf
cpdf -encrypt AES256 foo baz base.pdf -o aes-256.pdf
cpdf -encrypt AES256 foo baz base.pdf -no-encrypt-metadata -o aes-256-m.pdf
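
If the encrypted samples ever need to be regenerated, the commands above
could be rebuilt from a small table, as in the sketch below. It assumes
that cpdf is on the PATH and that base.pdf is in the current directory;
the names VARIANTS and build_commands are hypothetical and are not part
of this repository.

```python
import shlex
from typing import List

# (encryption method, extra flags, output file), copied from the
# cpdf commands listed above.
VARIANTS = [
    ("40bit", [], "rc4-40.pdf"),
    ("128bit", [], "rc4-128.pdf"),
    ("AES", [], "aes-128.pdf"),
    ("AES", ["-no-encrypt-metadata"], "aes-128-m.pdf"),
    ("AES256", [], "aes-256.pdf"),
    ("AES256", ["-no-encrypt-metadata"], "aes-256-m.pdf"),
]


def build_commands(source: str = "base.pdf",
                   first_password: str = "foo",
                   second_password: str = "baz") -> List[List[str]]:
    """Return one cpdf argument list per encrypted sample file."""
    commands = []
    for method, extra_flags, output in VARIANTS:
        commands.append(
            ["cpdf", "-encrypt", method, first_password, second_password,
             source, *extra_flags, "-o", output]
        )
    return commands


if __name__ == "__main__":
    # Print the commands; to execute them, pass each list to
    # subprocess.run(argv, check=True) instead.
    for argv in build_commands():
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch free of side effects;
the printed lines should match the six commands listed above.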