The document-format
PDF ("Portable Document Format")
has been around for 26 years now.
Developed by Adobe, an
open format since 2008,
based on the printer language
PostScript.
This is a page-oriented "move-to / line-to" language.
Apache PDFBox
is an open-source Java project that makes it possible to read and write PDF-documents.
In this Blog I want to sneak into the way how to write a PDF document programmatically from Java.
What You Need to Know About PDFBox
- Its 0/0 coordinate lies on bottom-left of the page. The coordinate system is bound to just one page.
- It doesn't do automatic page breaks.
- If you want to draw some text, you need to move to the font's baseline, not to top left corner.
- It doesn't support components (or document-elements) like paragraph, table, list etc.
- There is no predefined layout.
- There is no underscore text-attribute, you need to draw a line to get the underscore effect.
There is no bold text-attribute, you need to change the font.
- The
PDPageContentStream
writer is a statful object that requires calls in a certain order, else it will throw exceptions.
All of these things you need to implement when you want to use
PDFBox.
There are some github-projects
that build on PDFBox
and provide more, but they cover just pieces of the missing things.
Mind that you can also use the
"Open HTML to PDF"
library to generate PDF from HTML (its two required JAR files have about 1.5 MB).
If you can facilitate this for your project, it may save you from the complexity of PDFBox.
Hello PDF World
Following source will create a PDF file like this:
Here is the Java code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53 | import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
public class HelloWorld
{
public static void main(String[] args) throws IOException {
final PDDocument document = new PDDocument();
final PDFont font = PDType1Font.HELVETICA;
final float fontSize = 10;
final float ascent = fontSize * font.getFontDescriptor().getAscent() / 1000;
final float descent = fontSize * -font.getFontDescriptor().getDescent() / 1000;
final float widthOfM = fontSize * font.getStringWidth("M") / 1000;
System.out.println("Font size: "+fontSize+", ascent: "+ascent+", descent: "+descent+", width of 'M': "+widthOfM);
final PDRectangle pageSize = PDRectangle.A4;
final PDPage page = new PDPage(pageSize);
document.addPage(page);
final PDRectangle pageBox = page.getMediaBox();
final float width = pageBox.getWidth();
final float height = pageBox.getHeight();
System.out.println("PDF page w="+Math.round(width)+"/h="+Math.round(height));
final float lineHeight = (ascent + descent) * 1.5f; // standard line height is 150% of font height
float startX = pageBox.getLowerLeftX();
float startY = pageBox.getUpperRightY();
final PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.beginText();
contentStream.setFont(font, fontSize);
contentStream.newLineAtOffset(startX, startY - lineHeight + descent); // set position to baseline of text
contentStream.setLeading(lineHeight);
contentStream.showText("First Line");
contentStream.newLine();
contentStream.showText("Second Line");
contentStream.endText();
contentStream.close();
final String fileName = "HelloWorld.pdf";
document.save(fileName);
document.close();
System.out.println("Wrote file: "+fileName);
}
}
|
Mind that the document must be saved before it gets closed!
Output of this program is:
Font size: 10.0, ascent: 7.18, descent: 2.07, width of 'M': 8.33
PDF page w=595/h=842
Wrote file: HelloWorld.pdf
Some say this is the space between text-lines, i.e. what is below the lower end of a 'g'
but still above the upper end of an 'G' that is directly below it.
This is the older meaning of the word.
Some say this is the line height, i.e. what is below the lower end of an 'a'
and the lower end of an 'a' that is directly below it, thus it includes line spacing.
This is the modern meaning that software goes with.
Don't use this term, it is mis-leading :-)
Recipes
This is how you can get a font's geometry.
Mind that with font.getStringWidth(string)
you can measure all kinds of strings except those that contain newlines.
final PDFont font = PDType1Font.HELVETICA;
final float fontSize = 10;
final float ascent = fontSize * font.getFontDescriptor().getAscent() / 1000f;
final float descent = fontSize * -font.getFontDescriptor().getDescent() / 1000f;
// standard line height is 150% of font height
final float lineHeight = (ascent + descent) * 1.5f; // also called "leading" sometimes
The descent
is the part that hangs down from the text baseline for 'g', 'y', 'p', 'j', 'q'.
Don't forget to subtract it after a text block, else these letters may be partially covered when being on last line.
Following is the way to get a PDPageContentStream
to write text or draw lines:
final PDDocument document = new PDDocument();
final PDRectangle pageSize = PDRectangle.A4;
final PDPage page = new PDPage(pageSize);
document.addPage(page);
final PDPageContentStream contentStream = new PDPageContentStream(document, page);
Here is how to find out the page geometry:
final PDRectangle pageBox = page.getMediaBox();
final float width = pageBox.getWidth();
final float height = pageBox.getHeight();
final float startX = pageBox.getLowerLeftX();
final float startY = pageBox.getUpperRightY();
When you advance from startY
at top of page towards bottom,
you need to subtract the line height, not add it: nextY = currentY - lineHeigth
!
Resume
Lots of work to do with PDFBox.
Hard parts are pagination (text across multiple pages) and tables, and the combination of the two.