
Understanding the Portable Document Format (PDF)

Preface: I wish to acknowledge that this article was written with full reference to http://www.adobe.com/devnet/acrobat/pdfs/PDF32000

Surprisingly, it is easy and interesting to read! I am writing this tutorial out of my interest in knowing the PDF specification. My quest star

(steve@printmyfolders.com) if you find any errors. I have relied on the PDF specification (link on page top) to create this tutorial. This tut

Imagine you are alone on an island with no internet, no means of communication apart from a phone where you can only make voice call

will arrive in a week's time to take you home. All of a sudden, you remember that an important document, currently with you has to be s

your printer?

Here is one way of doing it. You would grab a ruler, measure the width and height of your page and note it down. You will measure the e

give her all these details. Provided you have given her enough details she should be able to replicate your document exactly as it is.

PDF works in a similar way. We give instructions so that a PDF viewing app, such as Adobe Reader, can understand our instructions an

a software. So we need structure, and we need to follow specific rules.

Keep this imagery in mind and it will help you understand better.

PDF files are interesting. If you were to open a PDF file in a text editor like Notepad, the contents may look like junk and probably not ve

"At the core of PDF is an advanced imaging model derived from the PostScript® page description language. This PDF Imaging Model en

interactive viewing, PDF defines a more structured format than that used by most PostScript language programs. Unlike Postscript, whic

PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactiv

The basic building blocks of a PDF file are objects. There are eight types of objects that are used in PDF files. Before we look at them we

White-space characters: Null, Horizontal tab, Line feed, Form feed, Carriage return and Space. Tip: If you need to remember them, imag

and other objects from each other. Interestingly, PDF treats all white-space characters outside a comment, string or stream the same. Ou

is that you may have 5 spaces but in reality it is considered as one. Note that this does not apply to white-space characters within strings

role in showing where a new line starts. Carriage return followed immediately by a Line feed is considered as one EOL marker.

Delimiter characters: (, ), <, >, [, ], {, }, / and % (4 pairs and 2 unique). These are used in the objects we would look at later. They basically

and then stretched [ ] and then bent again { } and then eventually made flat / %.

Regular characters: All characters other than White-space and Delimiter characters including those that are not part of the standard AS
An interesting fact to note is that a PDF may consist entirely of just ASCII characters or can consist of ASCII characters and Binary data. In

This allows a possibility of 128 unique characters for ASCII files and 256 unique characters for Binary files. Most PDF files that are encrypt

or even opened and saved in normal text editors like Notepad. It is critical that we understand the difference between ASCII and Binary f

files. http://www.cs.umd.edu/class/spring2003/cmsc311/Notes/BitOp/asciiBin.html

You may also wonder why you don't see any text (that can be seen when opened using a reader) or its equivalent when opening a PDF f

text is stored/kept) is encoded (transformed/changed) to conserve space. This is what happens with most files. The second reason could

Objects:

Here are the objects that make use of the characters we looked at above.

Each of these objects has a corresponding class in PDFBox. Have a look at this link https://pdfbox.apache.org/1.8/architecture.htm

1. Boolean objects: The keywords true and false.

2. Numeric objects: There are two types; integer & real. Integers are numbers without any decimal points and can have a + or – symbol

be expressed in exponential format.

3. String Objects: Strings contain characters (can be zero characters as well). They can be literal characters within parentheses or hexadec

(I love Java (and PDF))

There are escape characters that can be used. Refer to the PDF specification for more details. The sequence \ddd where ddd is an octal c

especially the ones that cause characters to move, for instance \n (newline), did not have any visual effects when I added them to a string

We can also use octal characters (usually to represent characters outside the printable ASCII character set) when using parentheses. An oc

(I love Java (and \052PDF))
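To make the escape rules more concrete, here is a minimal sketch in Python of how a literal string could be decoded. The function name parse_literal_string is my own invention and not part of any library. It assumes the whole string (including the outer parentheses) is already in memory as bytes, and it does not handle a backslash at the end of a line (line continuation).

```python
# Single-character escapes and the byte each one stands for.
SIMPLE_ESCAPES = {
    ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8, ord("f"): 12,
    ord("("): 40, ord(")"): 41, ord("\\"): 92,
}
OCTAL_DIGITS = b"01234567"


def parse_literal_string(raw: bytes) -> bytes:
    """Decode a PDF literal string such as b'(I love Java (and \\052PDF))'."""
    if len(raw) < 2 or raw[:1] != b"(" or raw[-1:] != b")":
        raise ValueError("a literal string must be wrapped in parentheses")
    body = raw[1:-1]
    out = bytearray()
    depth = 0  # nesting level of unescaped parentheses inside the string
    i = 0
    while i < len(body):
        byte = body[i]
        if byte == 0x5C and i + 1 < len(body):  # 0x5C is the backslash
            nxt = body[i + 1]
            if nxt in OCTAL_DIGITS:
                # Octal escape: one to three octal digits.
                j = i + 1
                while j < min(i + 4, len(body)) and body[j] in OCTAL_DIGITS:
                    j += 1
                out.append(int(body[i + 1:j].decode("ascii"), 8) & 0xFF)
                i = j
                continue
            # Known escapes map to their byte; for unknown ones the
            # backslash is dropped and the next byte is kept.
            out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if byte == ord("("):
            depth += 1
        elif byte == ord(")"):
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        out.append(byte)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return bytes(out)


print(parse_literal_string(rb"(I love Java (and \052PDF))"))
```

Octal 052 is decimal 42, which is the asterisk, so the example decodes to the text with a * in front of PDF.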

Here is an example of a String represented with hexadecimal numbers.

<48454C4C4F>

Each pair is taken as a value. In the above example, the hexadecimal value 48 is decimal 72 which is the ASCII equivalent of H. Likewise 4

make a pair) a zero is attached to the end.

<4845C>

will be considered as
<4845C0>
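The pairing rule is easy to try out. Below is a small sketch in Python; the function name decode_hex_string is hypothetical. It assumes the token arrives as text including the angle brackets, and it ignores white space between the digits.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<48454C4C4F>'."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("a hexadecimal string must be wrapped in < >")
    # Drop any white space between the digits.
    digits = "".join(token[1:-1].split())
    # An odd number of digits means the last pair is missing its
    # second digit, so a zero is attached to the end.
    if len(digits) % 2 == 1:
        digits += "0"
    return bytes.fromhex(digits)


print(decode_hex_string("<48454C4C4F>"))
print(decode_hex_string("<4845C>") == decode_hex_string("<4845C0>"))
```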

4. Name Objects: Names consist of a sequence of characters (except null). A forward slash / must be used to introduce the name. In case

followed by the hexadecimal value. All characters that are not regular characters have to be represented by the # followed by their hexad

/MyName represents MyName

/My#20Name represents My Name

/All#20#23Numbers represents All #Numbers
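Here is a rough sketch in Python of the # rule, working in both directions. The function names decode_name and encode_name are made up for this tutorial. I am assuming that names only hold characters in the single-byte (Latin-1) range, and that regular characters are the printable ASCII characters other than the delimiters.

```python
import re

DELIMITERS = "()<>[]{}/%"


def decode_name(token: str) -> str:
    """Turn '/My#20Name' into 'My Name'."""
    if not token.startswith("/"):
        raise ValueError("a name must start with a forward slash")
    # Replace every #xx with the character whose code is hexadecimal xx.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        token[1:],
    )


def encode_name(name: str) -> str:
    """Turn 'My Name' into '/My#20Name'."""
    parts = ["/"]
    for byte in name.encode("latin-1"):
        char = chr(byte)
        is_regular = 0x21 <= byte <= 0x7E and char not in DELIMITERS
        # The # itself must also be written as #23.
        if is_regular and char != "#":
            parts.append(char)
        else:
            parts.append(f"#{byte:02X}")
    return "".join(parts)


print(decode_name("/All#20#23Numbers"))
print(encode_name("My Name"))
```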

5. Array Objects: These are similar to the arrays found in computer languages but differ in that they can contain different object types (in

spaces.

[false 170 85.5 (Hello) /My#20Name]

Arrays are single dimensional but can include other arrays which can hold arrays themselves.

6. Dictionary Objects:

These are similar to an actual dictionary, where a description follows a word. The description here can be any object (including another d

dictionary is represented within << >>.

<< /Name (Steve)

/Age 99

/Address << /Number 1234

/Street (Java Street)

/Suburb (Java Town)

/PostCode (4321)

>>

>>
7. Stream Objects: Streams are similar to strings except that they can be of unlimited length. They are usually used to represent large data th

represented as a 'Contents' stream. It consists of a dictionary followed by the keyword ‘stream’, newline, the stream’s data and the k

dictionary

stream

……….

endstream

While the stream consists of the data, the dictionary contains information about the stream itself. Here are the keys that are common for

Length - Mandatory entry that contains the length (number of bytes) of the stream. An error occurs if the stream has more byte

Filter - The name(s) of the filter(s) used to decode the data on the stream. Can be a name or an array. The filters are used in the o

DecodeParms - Parameter dictionary used by the filters. Can be a dictionary or array. Parameters are values used by the filters. I

F - From PDF specification version 1.2, the stream contents can be stored in an external file. This shows the file where it is stored

FFilter - Similar to Filter entry but for the stream's external file.

FDecodeParms - Similar to DecodeParms but for stream's external file's filter

DL - An approximate size of the contents (after decoding) in the stream. This will help determine if there is enough disk space fo
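To see how Length and Filter work together, here is a minimal sketch in Python that builds a stream compressed with the FlateDecode filter and then reads it back. The function names are hypothetical. It assumes Length is written directly as an integer (in real files it can also be an indirect reference), that there is exactly one filter, and that a single line feed follows the keyword stream.

```python
import re
import zlib


def build_flate_stream(data: bytes) -> bytes:
    """Wrap data in a stream whose dictionary has Length and Filter."""
    compressed = zlib.compress(data)
    # Length counts the bytes between 'stream' and 'endstream',
    # that is, the encoded (compressed) bytes.
    dictionary = b"<< /Length %d /Filter /FlateDecode >>\n" % len(compressed)
    return dictionary + b"stream\n" + compressed + b"\nendstream"


def read_flate_stream(stream_object: bytes) -> bytes:
    """Use the Length entry to cut out the data, then undo the filter."""
    match = re.search(rb"/Length\s+(\d+)", stream_object)
    if match is None:
        raise ValueError("Length is a mandatory entry")
    length = int(match.group(1))
    start = stream_object.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(stream_object[start:start + length])


original = b"BT /F1 12 Tf (Hello) Tj ET"
stream_object = build_flate_stream(original)
print(read_flate_stream(stream_object) == original)
```

This also shows why the stream data looks like junk in a text editor: what sits between stream and endstream is the compressed form, not the original bytes.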

8. Null object - Refers to a non-existent object and is denoted by the keyword null.

Other components of a PDF file.

Comments: The comment is represented by the percentage sign i.e. %. This is commonly used to describe the version of PDF specificatio

%PDF-1.7

Indirect Objects:

An object (for example, a string) that has been given a unique object identifier, that other objects can use to refer to itself is called an ind
opening it in a text editor) you will notice lots of indirect objects.

The identifier has two parts. The first part is the object number that can be any positive integer. The second part is the generation numbe

and ’endobj’. Be aware that the combination of the object number and generation number has to be unique.

1 0 obj

(my biography)

endobj

The object number is 1 and the generation number is 0. The indirect object is the string object (my biography). If another object wanted

1 0 R

It is not an error to refer to a non-existent indirect object. Such a reference is treated as a reference to the null object.
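As a rough illustration of how definitions and references fit together, the Python sketch below collects every "number generation obj ... endobj" block and then looks up a reference of the form "number generation R". The names index_objects and resolve are my own. The regular expressions are a simplification that works for small, text-only examples; they are not safe for files that contain binary streams.

```python
import re

OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REFERENCE_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map (object number, generation number) to the object's raw body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJECT_RE.finditer(data)
    }


def resolve(reference: bytes, objects: dict):
    """Look up an indirect reference such as b'1 0 R'.

    A reference to an object that does not exist is not an error;
    None stands in for the null object here.
    """
    match = REFERENCE_RE.fullmatch(reference.strip())
    if match is None:
        raise ValueError("not an indirect reference")
    key = (int(match.group(1)), int(match.group(2)))
    return objects.get(key)


data = b"1 0 obj\n(my biography)\nendobj\n"
objects = index_objects(data)
print(resolve(b"1 0 R", objects))
print(resolve(b"7 0 R", objects))
```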

File Structure: A PDF file will initially have the following structure. However, if the file is updated or edited, additional elements may be a

1. A Header (not more than a line) showing the PDF specification this file follows.

2. A File Body that has the objects of the file

3. A Cross-reference table that has info about indirect objects

4. A Trailer that shows the location of the cross-reference table and special objects within the body of the file.
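The four parts can be put together in a few lines of Python. The sketch below is my own illustration and not the sample file from this tutorial: the function name build_minimal_pdf and the three objects (a catalog, a page tree node and one empty page) are assumptions chosen to keep the example short. The point is to show that the cross-reference table stores the byte offset of each object and that the trailer stores the byte offset of the table.

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, body, cross-reference table and trailer."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 595 842] >>",
    ]

    # 1. Header
    out = bytearray(b"%PDF-1.4\n")

    # 2. File body: remember where each object starts.
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    # 3. Cross-reference table: object 0 is always free, and every
    #    entry is exactly 20 bytes long.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # 4. Trailer: dictionary, then the offset of the xref keyword.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf = build_minimal_pdf()
print(pdf.decode("ascii"))
```

Writing the result to a file in binary mode should give a one-page blank document, although I have only checked the structure here and not every PDF reader.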

File Header: This denotes the PDF specification version of the PDF file. %PDF- followed by version number 1.N, where N is a digit betwee

%PDF-1.2

As mentioned earlier this is actually a comment that is used to specify the PDF version.

Beginning with PDF 1.4, a Version entry in the document's catalog dictionary can override the version given in the header. If the file has binary data, there will be at least four b

Again, when opening some PDF files in their raw form (as in a text editor) you may notice the four binary characters just after the first co
File Body: The File body consists of indirect objects (discussed earlier). These objects represent text and other details (like font type etc.

Cross-Reference Table: This table is similar to a directory. It contains the location of each object within the PDF file. By looking at the en

object is accessed in a random manner (rather than reading every line of the file). The cross-reference table can have one or more sectio

Note: From PDF version 1.5 onward the cross-reference table can be stored as a stream and if so you will not be able to view the table as

Each section begins with the word 'xref'. Following this line are two numbers separated by a single space. The first number is the object n

file that has been created for the first time or a PDF file that has not been incrementally updated, there shall be only one subsection and

entries for objects 1, 2, 3 & 4 will follow.

xref

0 5

Following this are the entries for each object. Each entry shall be exactly 20 bytes long. The entries are of the format

nnnnnnnnnn ggggg meol

nnnnnnnnnn - This is a ten digit value. This reveals how far the object is from the start of the file. For instance, the value

ggggg - 5-digit generation number

m - can be either 'n' or 'f'. 'n' denotes that the object is still in use and 'f' denotes that the object has been deleted and is

eol - end of line. Consists of 2 chars.

The ten digits, followed by space, followed by five digits, followed by space, followed by a single character and the eol make exactly 20 d
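The fixed width is what makes the table easy to read by position. Here is a small Python sketch that writes and reads one entry. The names XrefEntry, format_xref_entry and parse_xref_entry are hypothetical, and I am assuming the two-character end of line is a space followed by a line feed (other two-character combinations are possible).

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int       # the ten-digit value
    generation: int   # the five-digit generation number
    in_use: bool      # True for 'n', False for 'f'


def format_xref_entry(entry: XrefEntry) -> bytes:
    """Write one cross-reference entry, always 20 bytes long."""
    marker = b"n" if entry.in_use else b"f"
    return b"%010d %05d %s \n" % (entry.offset, entry.generation, marker)


def parse_xref_entry(raw: bytes) -> XrefEntry:
    """Read one entry by slicing it at fixed positions."""
    if len(raw) != 20:
        raise ValueError("an entry must be exactly 20 bytes long")
    marker = raw[17:18]
    if marker not in (b"n", b"f"):
        raise ValueError("the marker must be 'n' or 'f'")
    return XrefEntry(int(raw[0:10]), int(raw[11:16]), marker == b"n")


line = format_xref_entry(XrefEntry(offset=15, generation=0, in_use=True))
print(line, len(line))
print(parse_xref_entry(b"0000000000 65535 f \n"))
```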

Let's come back to the 0 5 that we saw in the example earlier. The 0 denotes the object number of the first object in this subsection. The v

3 and 4. The first entry in the cross-reference table is for object 0. Object 0 will have 0000000000 as its first ten digits (if there are no othe

xref

0 2

0000000000 65535 f

...........

If there are object(s) that have been deleted and are free then the ten digit number will be changed to denote the nth entry of the next fr

xref
0 4

0000000003 65535 f

0000000015 00000 n

0000000075 00000 n

0000000000 00005 f

The 0 4 denotes that there are four entries - Entry for object 0 followed by entries for objects 1, 2 & 3. The first ten digits (0000000003) o

have the object number of the next free object. In this case, as there are no other free objects it points back to object 0. Objects 1 & 2 are

therefore it can be used to refer to other data.

Let's look at another example

xref

0 4

0000000003 65535 f

0000000015 00000 n

0000000075 00000 n

0000000000 00005 f

9 2

0000000099 00000 n

0000000150 00000 n

In the above case, in addition to objects 0, 1, 2, 3 there are two other objects 9 & 10.
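Here is a minimal Python sketch that reads a section made of several subsections, using the same numbers as the example above. The function name parse_xref_section is hypothetical, and for readability it works on text split into lines rather than on fixed 20-byte slices.

```python
def parse_xref_section(text: str) -> dict:
    """Map each object number to (ten-digit value, generation, marker)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("a section must begin with the word xref")
    table = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number, then number of entries.
        first, count = (int(value) for value in lines[i].split())
        i += 1
        for number in range(first, first + count):
            value, generation, marker = lines[i].split()
            table[number] = (int(value), int(generation), marker)
            i += 1
    return table


SECTION = """xref
0 4
0000000003 65535 f
0000000015 00000 n
0000000075 00000 n
0000000000 00005 f
9 2
0000000099 00000 n
0000000150 00000 n
"""

table = parse_xref_section(SECTION)
print(sorted(table))
print(table[9], table[10])
```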

An object cannot be entered in more than one subsection within a section.

When an indirect object is deleted, its entry is marked as free (by changing the n to f) and linked to the linked list of free objects. Its gen

maximum of 65,535. For instance, an indirect object that was referenced as 1 0 will become 1 1 when reused.

Trailer: The end of a PDF file is read first by the PDF reading application. The trailer holds information about the location and details of th

certain fields.

The second part has the keyword startxref, and in the next line, a number. The number denotes how far (in bytes) the keyword xref (of th

A random PDF taken from my computer has this trailer. Looking at this trailer I can assume that the xref of the last section of the cross-re

trailer

<< /Size 62 /Root 1 0 R /Info 2 0 R

/ID [(X$X@>66...)(X$X@>66...)]
>>

startxref

361441

%%EOF

The following keys are mandatory for the trailer dictionary.

Size - Total number of entries in the cross reference table (combination of original & update sections). The value has to be an integer. In

Root - Is an indirect reference to the PDF's catalog (which we will learn later). In the example above, I can assume that the indirect object

Some keys are mandatory when certain capabilities are used. We may look at the Info & ID keys later.
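Since a reader starts from the end of the file, the first steps can be sketched in Python as below. The function names find_startxref and read_trailer_keys are my own, the 1024-byte tail is an arbitrary assumption, and the regular expressions only cope with a classic trailer like the one shown above (not with cross-reference streams).

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the number that follows the last startxref keyword."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    return int(matches[-1])


def read_trailer_keys(data: bytes) -> dict:
    """Pull the mandatory Size and Root entries out of the last trailer."""
    trailer = data[data.rindex(b"trailer"):]
    size = re.search(rb"/Size\s+(\d+)", trailer)
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    if size is None or root is None:
        raise ValueError("Size and Root are mandatory")
    return {
        "Size": int(size.group(1)),
        "Root": (int(root.group(1)), int(root.group(2))),
    }


END_OF_FILE = (
    b"trailer\n"
    b"<< /Size 62 /Root 1 0 R /Info 2 0 R\n"
    b"/ID [(X$X@>66...)(X$X@>66...)]\n"
    b">>\n"
    b"startxref\n"
    b"361441\n"
    b"%%EOF\n"
)

print(find_startxref(END_OF_FILE))
print(read_trailer_keys(END_OF_FILE))
```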

Incremental Updates: The team that developed the PDF specification was smart enough to include a special feature. When a PDF gets u

need not be modified). However, this also makes me wonder what happens when many changes take place. Does the size of the PDF file

When a PDF gets an incremental update, in addition to the data being added, a new cross-reference section is created. This new section

of the cross-reference entry. This means that if say object 5, existed before and was deleted during the update the new cross section will

When the PDF file gets updated, along with a new cross-reference section a new trailer is added. This contains all the entries from the pr

cross-reference section.

%%EOF will continue to be the last line for the new trailer as well. Hopefully we will discuss this in detail later.
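One way to picture the effect of an update is to treat each cross-reference section as a small table and let newer entries replace older ones. The Python sketch below does just that. The function names and the sample offsets are invented for illustration, and a real reader finds the older sections by following the trailers backwards from the end of the file rather than receiving them as a ready-made list.

```python
def merge_xref_sections(sections: list) -> dict:
    """Combine sections given oldest first; later entries win.

    Each section maps an object number to (value, generation, marker).
    """
    merged = {}
    for section in sections:
        merged.update(section)
    return merged


def objects_in_use(table: dict) -> list:
    """List the object numbers whose entry is marked 'n'."""
    return sorted(n for n, (_, _, marker) in table.items() if marker == "n")


original = {
    0: (0, 65535, "f"),
    1: (15, 0, "n"),
    2: (75, 0, "n"),
    5: (300, 0, "n"),
}
# The update rewrites object 2 further down the file and deletes object 5.
update = {
    0: (5, 65535, "f"),
    2: (420, 0, "n"),
    5: (0, 1, "f"),
}

current = merge_xref_sections([original, update])
print(objects_in_use(current))
print(current[2])
```

Object 1 is untouched by the update, so its entry from the original section is still the one in force.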

PDF Document Structure:

The structure of a PDF file is like the different levels of hierarchy found in a typical company. Similar to the CEO, the Document Catalog d

details in this webpage http://www.printmyfolders.com/view-structure-of-a-pdf

As we saw earlier a PDF reading application will look at the trailer of the PDF first. The trailer will have a Root entry that has the location o

details via the contact section (Trailer) on the company's website (PDF file).

Document Catalog: The Document Catalog is a dictionary that refers to other objects that define the PDF file. Basically, the Document C

will for the time being only look at the mandatory keys.

Type - will always be Catalog (type Name)

Pages - An indirect reference to the object that is the root of the page tree (will look at this later)
A PDF file that I created using free PDF-creation software has this Catalog dictionary:

1 0 obj

<</Type /Catalog /Pages 3 0 R

>>

endobj

You will notice that each of these dictionaries always starts with a '/Type' entry that describes what type of dictionary it is. In this case, it is

An application that reads the above Catalog dictionary will know that it needs to read the 'Pages' dictionary (indirect object 3) to get info

Page Tree: Page Tree is the name of the structure used to describe the pages in a PDF file. It has two types of nodes - page tree nodes a

Page Tree Nodes: The mandatory keys are

Type - will always be Pages for a Page Tree node

Parent - the page tree node which is this node's parent. Not allowed in root node.

Kids - an array referring to the children of this node. The children can only be page tree nodes or page objects

Count - the number of page objects that are descendants of this node

The PDF that I had created earlier has this page tree (remember that the Catalog Dictionary was pointing to indirect object 3).

3 0 obj

<< /Type /Pages /Kids [

4 0 R

] /Count 1

/Rotate 0>>

endobj

This Page tree node has only one kid which is object 4. The Parent key is missing and therefore this is the root node.

As the /Count is 1, we can safely assume that there is only 1 page under this Page tree (which based on the /Kids array is indirect object 4

As mentioned earlier, you will notice that this dictionary too has an entry '/Type' that reveals what type of dictionary it is.
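Walking the page tree is a natural fit for recursion. In the Python sketch below the objects are modelled as plain dictionaries keyed by object number, which is my own simplification (a real reader would first parse them out of the file), and the function name collect_pages is hypothetical.

```python
def collect_pages(objects: dict, node_number: int) -> list:
    """Return the object numbers of all page objects under a node."""
    node = objects[node_number]
    if node["Type"] == "Page":
        return [node_number]
    pages = []
    # A page tree node: visit every kid in the order given by Kids.
    for kid in node["Kids"]:
        pages.extend(collect_pages(objects, kid))
    return pages


# The page tree from the example above: root node 3 with one kid, page 4.
objects = {
    3: {"Type": "Pages", "Kids": [4], "Count": 1},
    4: {"Type": "Page", "Parent": 3},
}

pages = collect_pages(objects, 3)
print(pages)
print(len(pages) == objects[3]["Count"])
```

The number of pages found under a node should match that node's Count, which makes Count a handy consistency check.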
Page Objects: This is a dictionary that describes the characteristics of the page itself. Some of the keys are

Note: Most of the keys are new to me. I have purposefully left out keys that make no sense to me at this moment. As I learn more about

Type - Will always be Page

Parent - An indirect reference to the parent of this page

LastModified - Date and time when this page was last modified

Resources - The resources required by this page. This usually refers to the font used on this page and other info.

MediaBox - A rectangle that defines the boundary inside which the page has to be displayed.

Contents - A content stream that describes the contents of this page.

Rotate - In multiples of 90. Rotates the page by the number of degrees before displaying.

Thumb - A stream object that gives the thumbnail image for this page.

Dur - the number of seconds the page will be displayed in presentations before automatically moving on to the next page.

Trans - A dictionary advising what transition to use when displaying the page during presentation.

Annots - This is an array of dictionaries containing references to all the annotations for this page

AA - This is the short form for additional-actions. This dictionary defines the actions that need to be taken when the file is open or closed

Metadata - A stream that contains metadata for this page

Here is a grab from a sample PDF that I created using free PDF-creation software.

4 0 obj

<</Type/Page/MediaBox [0 0 595 842]

/Rotate 0/Parent 3 0 R

/Resources<</ProcSet[/PDF /Text]

/ExtGState 10 0 R

/Font 11 0 R

>>

/Contents 5 0 R

>>

endobj

3 0 obj

<< /Type /Pages /Kids [

4 0 R

] /Count 1

/Rotate 0>>
endobj

1 0 obj

<</Type /Catalog /Pages 3 0 R

>>

endobj

As you can see Object 1 is the catalog that directs the PDF reading application to the root of the page tree (Object 3). Object 3, the root

rotated (Rotate 0) and has Object 3 as its parent. Its 'resources' as well as its contents (Object 5) are included. Here is Object 5 from my f

As we had discussed earlier, the stream in this object starts with a dictionary that shows the length of the stream (which is stored in Obje

5 0 obj

<</Length 6 0 R/Filter /FlateDecode>>

stream

xMK.1.鶵 u*j.czi2& 7KnSK..Z?]."6.3w>^&s@MQ...K.d\>}q... ).|.ѣ'o1lA.‫ۥ‬

S-lE,.C.W.&#xf01d2;YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream

endobj

Page attributes are inherited: Here is an interesting fact. Certain attributes in a page can be inherited from its parent or any of its ances

value for an attribute, that value can be replaced or changed by the child.
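Inheritance can be sketched in Python as a walk up the Parent chain until some node supplies the value. As before, the objects are modelled as plain dictionaries keyed by object number, and the function name inherited_value is hypothetical. The set of inheritable keys below (Resources, MediaBox, CropBox and Rotate) is taken from my reading of the specification.

```python
INHERITABLE = {"Resources", "MediaBox", "CropBox", "Rotate"}


def inherited_value(objects: dict, page_number: int, key: str):
    """Find a page attribute, looking at ancestors when the page lacks it."""
    if key not in INHERITABLE:
        # Attributes that are not inheritable must be on the page itself.
        return objects[page_number].get(key)
    number = page_number
    while number is not None:
        node = objects[number]
        if key in node:
            # The nearest value wins, so a child replaces its parent's value.
            return node[key]
        number = node.get("Parent")
    return None


objects = {
    3: {"Type": "Pages", "Kids": [4, 5], "Count": 2,
        "Rotate": 0, "MediaBox": [0, 0, 595, 842]},
    4: {"Type": "Page", "Parent": 3, "Rotate": 90},
    5: {"Type": "Page", "Parent": 3},
}

print(inherited_value(objects, 4, "Rotate"))
print(inherited_value(objects, 5, "Rotate"))
print(inherited_value(objects, 5, "MediaBox"))
```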

Name Dictionary: Rather than referring to the objects by their references, some objects can be referred to by their names. The link betw

used to specify the Name Dictionary. Please refer to the PDF specification for more details.

Content Streams: This is a stream (an object in PDF, if you remember) that has instructions on how to display text & graphics on the cor

(for instance Adobe Reader) the instructions on how to display.

5 0 obj

<</Length 6 0 R/Filter /FlateDecode>>

stream

xMK.1.鶵 u*j.czi2& 7KnSK..Z?]."6.3w>^&s@MQ...K.d\>}q... ).|.ѣ'o1lA.‫ۥ‬

S-lE,.C.W.&#xf01d2;YGKM\vjEG.'|F[j.:..2f2Ź^.uujNWnjY::si/.L9,ČGPY1k/.%'f!endstream

endobj

The data in the stream makes no sense because the data has been encoded (converted from its original form to another). In the followin

application to display the page.


Note that unlike other objects in a PDF file, the instructions in the content stream are read and followed sequentially (one after the other).

Before proceeding further we will try to create a simple PDF file from what we have learnt so far.

Sample PDF file: Here is a sample PDF file that I created with help from the specification. You can copy this file from here and save it in a

inclusive). You can then view it with a PDF reader (for instance using Acrobat Reader). Note: Not all PDF files are as simple as this. This PD

Article continues at Understanding the Portable Document Format (PDF): Part 2

I love your feedback and suggestions. Please leave a comment below or contact me at steve@printmyfolders.com.