

Challenges in generating bookmarks from TOC entries in e-books

Article · September 2012


DOI: 10.1145/2361354.2361363



Challenges in generating bookmarks from TOC entries in e-books

Chandrashekar Ramanathan, International Institute of Information Technology Bangalore, India, rc@iiitb.ac.in
Mehul Sheth, International Institute of Information Technology Bangalore, India, mehulkumarjayprakash.sheth@iiitb.net
Yogalakshmi J, International Institute of Information Technology Bangalore, India, j.yogalakshmi@iiitb.org

ABSTRACT
The task of extracting document structures from a digital e-book is difficult and is an active area of research. On the other hand, many e-books already have a table of contents (TOC) at the beginning of the document. This may lead us to believe that adding bookmarks to a digital document (e-book) based on the existing TOC would be trivial. In this paper, we highlight the challenges involved in automatically adding bookmarks to an existing e-book based on the TOC that exists within the document. If we are able to reliably identify the specific location of each TOC entry within the document, the algorithms can be easily extended to identify document structures within e-books that have a TOC. We describe a tool we have built, called Booky, that tries to add PDF bookmarks automatically to existing PDF-based e-books that have a TOC as part of the document content. The tool addresses most of the challenges that have been identified, while leaving a few tricky scenarios still open.

Categories and Subject Descriptors
I.7.4 [Document And Text Processing]: Electronic Publishing.

General Terms
Documentation

Keywords
PDF, bookmarks, structure analysis, table of contents, TOC.

1. INTRODUCTION
Document structure extraction and hierarchy detection is an active area of ongoing research. In view of that, the table of contents (TOC) is important, as it provides the index into the various portions of books, journals, magazines, reports and so on. The TOC contains references to the different parts of the book and reflects the natural and logical structure of the book. Automatic extraction of the TOC is important because it makes any further processing of the document content easier: it eases the way to obtain the appropriate division information from the TOC. The tasks of adding bookmarks, document categorization, and document content analysis can also be carried out using the TOC of the books. Specifically, extraction of the TOC is an important prerequisite for the task of adding bookmarks automatically. Most PDF-based e-books come with built-in bookmarks that point only to the page on which the content is present, rather than to the exact location of the content within the page, which is what helps us identify structure later on. In our approach we consider the case where PDF-based e-books have a single-column layout. Once we identify the correct location of each TOC entry in a page, we can use this information to extract structure by treating the content between two subsequent section or chapter entries within a page as the content of that section or chapter. Earlier work on the extraction of TOC from digital documents focused more on scanned images, and work on bookmark creation was targeted only at e-book readers. Our goal is to add bookmarks automatically, using the information extracted from the TOC, for any PDF-based e-book. The bookmarks are placed at the exact location of the text inside the book, as opposed to merely linking to the beginning of the page. Also, built-in bookmarks are not always present for all sections and subsections; since the TOC is used, our approach creates bookmarks for all sections and subsections.

The task of adding bookmarks automatically is divided into two major sub-tasks: 1) extraction of TOC content and 2) adding bookmarks (links).

The paper is organized as follows. In section 2, we discuss the related work. In section 3, we discuss the proposed approach and the challenges involved in the extraction of TOC entries and the creation of bookmarks. In section 4, we discuss the implementation details. In section 5, we discuss the open issues, and in section 6 we give a summary and draw our conclusions.

2. RELATED WORK
A number of methods for TOC extraction have been reported. Mandal et al. [1] proposed to extract the TOC from scanned documents. Their approach is primarily based on optical character recognition (OCR), page heuristics and related techniques. Young-Bin Kwon and Jaehwa Park [2] proposed a segmentation algorithm to extract the author information, title and page number from the table of contents of journals. From the images, the segmentation algorithm identifies regions of interest with some threshold and extracts the table of contents without the use of character recognition. Liangcai et al. [3] propose to extract the table of contents by generating a statistical model for the same and using the model to analyze the other TOC entries
with the use of clustering algorithms. This paper also gives a brief analysis of the existing attempts at extraction of the TOC and categorizes them. Prateek Sarkar and Eric Saund [4] extensively study the structure and styles of TOCs. They propose a framework for understanding the TOC of books, journals and magazines, and a universal logical structure for TOC entries as a combination of a descriptor with its locator. They explain the triplets that can be obtained from the TOC entries and the generation of a hierarchical tree from them. They have also proposed possible methods to retrieve the book’s contents based on the retrieved TOC. Belaïd [5] proposes the recognition of TOCs with part-of-speech tagging. Bourgeois et al. [6] propose a stochastic model based on text attributes and spatial relations. William S. Lovegrove et al. [11] proposed to extract the logical structure and reading order using the geometric layout of the content in the document. This mainly addresses document classification and understanding. The approach helps in identifying the TOC pages but still requires additional logic, as it mainly identifies Text, Titles, Images, Header, Footer, Caption, and Unknown. To retrieve the actual titles from the TOC pages, additional parsing of the content is required. S. Marinai et al. [12] propose heuristic approaches to retrieve TOC pages. The title corresponding to a page number is built by working backwards until the previous page number is reached. This works for bookmark creation only when the potential title consists of the title alone, without any division number or the separator that separates the title from the page number. Whether any further processing is done on the potential title is not clearly stated in that paper. The approach used to match these titles with the ones inside the content of the book uses word similarity with tri-gram indexing and the titles' bounding box numbers, which is different from our proposed approach below. Hervé Déjean and Jean-Luc Meunier [7] proposed a functional approach that relies on the functional properties of a TOC. These properties are: Contiguity, Textual similarity, Ordering, Optional elements, No self-reference. Their hypothesis is that these five properties are sufficient for the entire characterization of a TOC, independently of the document class and language.

Attempts have been made to create bookmarks in e-books, but none of them targets automatic creation of bookmarks. They are either device oriented or require manual activities. Johannes Schöning et al. [9] have proposed bookmark creation for e-readers. Dongwook Yoon et al. [10] propose bookmark creation for touch-based devices. Jürgen Steimle et al. [8] propose bookmark creation using an electronic pen and physical color-coded paper. Automatic bookmark creation is not widely reported, partly because most e-books come with bookmarks at least at the chapter level. Even those existing bookmarks are only at the page level.

3. PROPOSED APPROACH
Adding bookmarks automatically to e-books is a two-step process. The first step extracts the table of contents (TOC) entries. The second step uses them to create the bookmarks. Here, we describe these tasks in detail.

3.1 Extraction of TOC Entries
Automatic parsing of the TOC is of great use for extracting the structure of the document, document recognition and so on. Extracting TOC entries poses a big challenge, since TOCs appear in a wide variety of styles and structures. Parsing a TOC of a particular known structure and style is less challenging than parsing any given TOC in a systematic way. An entry in a TOC can be characterized by a triplet comprising <division, title, page number>. For example, <“Chapter 1”, “Introduction”, 5> is a TOC triplet. The extraction of these triplets can be subdivided into the following tasks: 1) identification of the start of the TOC, 2) identification of the end of the TOC, and 3) parsing the TOC entries to construct triplets. Not all books contain full triplets; a TOC could instead contain just <“Introduction”, 5> without any division number, or a combination of both forms. For bookmark creation we rely mainly on the title and, to a lesser extent, on the page number.

3.1.1 Challenges in extraction of TOC entries
The challenges are described below:
1. Identification of beginning of TOC: The words “Contents” or “Table of Contents” on a single line indicate the potential start of the table of contents. To confirm this, the occurrence of the <division, title, page number> triplet throughout that page is checked. This shows that a particular page contains the TOC entries.
2. Identification of end of TOC: The first line item of the TOC is stored as the “End Indicator” for TOC entries. For example, if the first TOC entry is “Chapter 1. Introduction”, then all the content until the next occurrence of “Chapter 1. Introduction” is treated as the full table of contents. As this is not always true, identification of the pattern between title and page number in the TOC entry is also important.
3. Identification of TOC triplet: The TOC entries are parsed to separate the page number, the title, and the division names and numbers into a separate data structure. The parsing makes use of regular expressions to populate the triplets. (i) Separating roman and Arabic numerals for the division number is comparatively easy, and regular expressions are created to distinguish the two. (ii) The appearance of alphanumeric text in the division part, as in “Chapter 1: Introduction …….5”, is a little trickier. This is also handled in the tool through regular expressions with some additional constraints.
4. Identification of separator between title and page number: The catch here is that the separator must be unambiguously distinguished from any special character that appears as part of the title. Regular expressions come in handy to resolve these ambiguous situations. For example, a TOC entry like “Java 1.1 and Earlier …….305” is parsed and results in <“Java 1.1 and Earlier”, 305>. Here, the “.” in the title (i.e., “Java 1.1”) needs to be distinguished from the actual separator “.” that appears later in the line before the page number.
5. Identification of roman and Arabic numerals for page numbers: The page numbers could be roman numerals or Arabic numerals. Regular expressions are used to differentiate these.
6. Identification of TOC triplets in multi-line entries: Most TOC entries are single line, but some could span more than a single line. An additional constraint validates this. A check is made for the presence of the separator pattern. If a separator pattern is not found, the parser assumes that the line could be a potential candidate for a multi-line TOC entry and therefore expects the immediate next line to have the separator pattern. If it is found, the two lines are concatenated to form a single-line TOC entry. From this, the actual triplet is extracted.
Note that when a separator pattern is not found, there are two possibilities: (a) the line could be a page header, something like “Part I”, indicating the beginning of Part I; or (b) the line is actually part of a multi-line TOC entry.
7. Identification of non-TOC entries: Non-TOC entries such as headers, footers and decorative elements can appear in between TOC entries, at the start and at the end of the page respectively. These are ignored using a regular expression for the same.

3.2 Adding Bookmarks
Once the table of contents is extracted in the form of triplets, it is used to create bookmarks. The source file might already have bookmarks, in which case bookmark creation is skipped. Alternatively, those bookmarks can be deleted and this approach used to create bookmarks for the entire TOC. Bookmarks are placed at the exact location of the text inside the document.

3.2.1 Challenges in bookmark creation
Bookmark creation starts by matching the exact text, followed by finding the page that has the matching text and the position of the matching entry. The challenges faced are summarized below:
1. Identification of physical page number based on page number from triplet: The page number extracted from the TOC as part of the triplet might not be the exact page number, because the physical page number might differ from the logical page number. So we cannot rely only on the page number to look for the exact match. It requires intelligently finding the page with the matching text.
2. Extracting meaningful content from PDF: In the PDF format, content is stored in the form of glyphs, and there is no notion of line, paragraph and so on. To match a TOC title within the page content, we need to extract the content one line at a time, which helps in identifying entries that span multiple lines. The iText API, which we used in our implementation, extracts text but not as lines. We use the positional information of the extracted text to form lines in our algorithm.
3. Variations in TOC entries and corresponding actual entries in a page: Apart from the exact matches, there are other cases which are quite interesting. (i) In some cases, only chapters are associated with division numbers, while the subsections are not. (ii) There are cases where the actual entry corresponding to the TOC entry spans multiple lines. For example, “Introducing AWT, Windows…” could actually span two or more lines, which need to be matched exactly.
4. Discarding entries which are not candidates for bookmarks: Sometimes a TOC title appears multiple times inside the document. For example, in many books the page header contains the chapter title, which exactly matches the TOC title of the entry for that chapter. Obviously, not all of those occurrences are candidates for the bookmark target for that TOC entry.
5. Finding exact position of a matching entry: Once we find an exact match, the next step is to find the position of the matching entry in the page. If the actual content spans multiple lines, we need the position of the first match of a substring of the TOC title. Once we have found the exact match as well as its position, we can create bookmarks that link to a specific position within a page.

4. IMPLEMENTATION DETAILS

4.1 System Architecture

[Figure: Source PDF -> Extract TOC triplet <Division, Title, Page number> -> Create Bookmarks -> Target PDF with bookmarks]
Figure 13: System Architecture of BOOKY

The tool consists of two modules: 1) Extract TOC Triplets and 2) Create Bookmarks. Both the input to the tool and the output from it are PDF documents. The tool is implemented in Java. iText [http://itextpdf.com], an open source library with good documentation, is used to extract text from the PDF. Any other tool can be used to extract the text. iText retrieves the text and the images in the order in which they are present in the PDF, which is useful during both the extraction of the TOC and the creation of bookmarks. The proposed algorithm is presented in plain English.

Extract TOC Entries:
1. Extract the content of the book one page at a time.
2. For each page, extract the content line by line. Forming a line: iText gives glyph information for the extracted text. PDF co-ordinates are like a graph whose origin is at the bottom-left corner. We form a line from content having the same y-co-ordinate.
3. To check for the start of the TOC, check whether the line contains:
   a) A match for the TOC regex pattern, like "Content".
   b) If it matches, check that the line has only other words like "table" and "of", and nothing else.
   c) It could be a detailed TOC if it satisfies the following condition:
      i) The triplet pattern occurs throughout the page.
   d) Store the first item of the TOC in a global variable.
   e) For a TOC candidate, divide the line into a group of 3. Division: the section number, which is not present in all books, in which case it is left blank; Title: the chapter or section name; Page No: the page number.
   f) Add the triplet entry to a file.
4. The end of the TOC is identified when an entry matching the global variable is encountered. An enhancement is to check for the pattern in subsequent pages: if a subsequent page does not have the pattern (dots, spaces, dash) that is present in the TOC, assume the end of the TOC.
5. Stop when the condition in step 4 is satisfied.

Create Bookmarks:
1. Retrieve the triplets formed in the above procedure one by one.
2. Retrieve the title and page number from the triplet.
3. Start the search from the given page number if it is a proper numeric entry, or else from page 1. Search for the entry matching the triplet's title line by line.
4. Check for a multi-line title while matching the TOC entry: if the entire line is part of the triplet's title, check the subsequent line for the remaining part of the title, and continue until we get a full match or an unmatched entry. During this process, store the position of the first matching line, which is used to create a bookmark that links to that particular position.
5. If it is a single-line entry, keep track of its position.
6. Use the bookmark creation feature with the stored positions.
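Step 3(e) of the TOC extraction procedure, together with the separator and multi-line challenges of section 3.1.1, can be illustrated with a minimal Java sketch. This is not Booky's actual code: the class and method names (TocTripletSketch, TocEntry, parseLine, parseLines), the division keywords (Chapter, Part, Section, Appendix), and the rule that a separator is a run of at least two dots, ellipses, dashes or spaces are all assumptions made for illustration. The second half of the multi-line example ("Graphics and Text") is also made up, since the paper gives only a truncated title.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; all names here are hypothetical, not Booky's code. */
final class TocTripletSketch {

    /** One <division, title, page number> triplet; division is "" when absent. */
    static final class TocEntry {
        final String division;
        final String title;
        final String page;

        TocEntry(String division, String title, String page) {
            this.division = division;
            this.title = title;
            this.page = page;
        }

        @Override
        public String toString() {
            return "<\"" + division + "\", \"" + title + "\", " + page + ">";
        }
    }

    // Assumed division forms: keyword + Arabic/Roman/single-letter id, or a bare "1.2"-style number.
    private static final String DIVISION =
        "(?:Chapter|Part|Section|Appendix)\\s+(?:\\d+(?:\\.\\d+)*|[ivxlcdm]+|[a-z])|\\d+(?:\\.\\d+)*";

    // Assumed separator: a run of two or more dots, ellipses, dashes, or spaces.
    // A single "." inside a title such as "Java 1.1" is too short to qualify.
    private static final String SEPARATOR = "[.\\u2026\\-\\s]{2,}";

    // Page number: Arabic digits, or a loose (not strictly validated) Roman numeral.
    private static final String PAGE = "\\d+|[ivxlcdm]+";

    private static final Pattern TOC_LINE = Pattern.compile(
        "^\\s*(?:(" + DIVISION + ")[.:]?\\s+)?(.+?)" + SEPARATOR + "(" + PAGE + ")\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Parses one line into a triplet, or returns empty if no separator and page number are found. */
    static Optional<TocEntry> parseLine(String line) {
        Matcher m = TOC_LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        String division = m.group(1) == null ? "" : m.group(1);
        return Optional.of(new TocEntry(division, m.group(2).trim(), m.group(3)));
    }

    /** Parses TOC lines, joining a line that lacks a separator with the line that follows it. */
    static List<TocEntry> parseLines(List<String> lines) {
        List<TocEntry> entries = new ArrayList<>();
        String pending = null; // a line without a separator, possibly the first half of an entry
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            String candidate = pending == null ? line : pending + " " + line;
            Optional<TocEntry> entry = parseLine(candidate);
            if (entry.isPresent()) {
                entries.add(entry.get());
                pending = null;
            } else {
                // No separator yet: remember only the current line and expect the next one to finish it.
                pending = line;
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> tocLines = Arrays.asList(
            "Chapter 1: Introduction .......5",
            "Java 1.1 and Earlier .......305",
            "Introducing AWT, Windows,",
            "Graphics and Text ...... 299");
        for (TocEntry entry : parseLines(tocLines)) {
            System.out.println(entry);
        }
    }
}
```

Because a line without a separator is always joined with the next line, this sketch would also merge a heading such as "Part I" into the entry that follows it. That is the disambiguation problem the paper lists as an open issue, and the sketch makes no attempt to solve it.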
The output is a PDF with bookmarks pointing to the exact location of the titles present in the TOC. Solutions to the challenges in the extraction of TOC entries and in the creation of bookmarks, which are discussed in detail in section 3, are implemented in Booky. There are a few open issues in this context, which are discussed in section 5. The tool was tested with over 2000 bookmarks spanning several books in PDF format. The results are tabulated in Table 1.

Table 1. Results of bookmark creation using TOC entries

Book     No. of TOC entries   Correctly extracted TOC entries   Correctly created bookmarks   Errors
Book 1   781                  776                               776                           5 (0.6%)
Book 2   398                  393                               393                           5 (1.2%)
Book 3   279                  278                               278                           1 (0.3%)
Book 4   72                   71                                71                            1 (1.3%)
Book 5   375                  373                               373                           2 (0.5%)
Book 6   178                  166                               166                           12 (6.7%)
Total    2083                 2057                              2057                          26 (1.2%)

The results indicate that over 98% of the bookmarks were created correctly using this approach. However, there are a few open issues, which are discussed in the next section. An analysis shows that the failures were caused by erroneous extraction of TOC triplets. In our assessment, if we are able to increase the reliability of TOC triplet extraction, automatic creation of bookmarks will have even fewer errors.

5. OPEN ISSUES
Some issues that are still open are discussed here:
1. Disambiguation issue: Discriminating headings (main headings like PART I) from the actual TOC line items is tricky. Such headings are characterized by the lack of any page number information, and must be distinguished from a decorative element. This issue is yet to be fully addressed in our solution.
2. Identification of multi-pattern TOC entries: While extracting TOC entries, if the TOC contains more than one type of separator, each type must be matched accordingly.
3. Matching of symbols: A PDF may contain symbols like α, β, Ω as part of a TOC entry, and these should be matched with the actual entry in a page. We have used the iText API for our implementation, which does not extract symbols. While this is an open issue in our solution, some other API might support extraction of symbols along with text.
4. Identification of cases where division or title might be in the form of an image: There are cases where the division from the TOC triplet appears as an image in the actual entry. For example, “Chapter 1” might appear as an image. Sometimes the TOC title along with the division appears as an image. Our current implementation relies only on text entries and not images. Along with basic text matching, we might need to perform OCR for such cases, but the bookmark in that case will link to an image rather than text.
5. Discarding matching entries that are not candidates for bookmarks: As discussed in the challenges, we are able to discard matching entries which appear as a header. If the same entry appears somewhere else apart from the running text and the header (e.g., as part of a decorative element), then some intelligence is required to discard those entries.

6. SUMMARY AND CONCLUSION
Automatic bookmark creation in e-books using the TOC entries is discussed at length in this paper. The approach is novel and has the significant advantage that the extracted TOC entries can be used for extracting different parts of the book with the use of bookmarks. Bookmarks give the exact location of the content, which could be used in the extraction of a particular Chapter / Section / Sub-section of an e-book. Apart from this, they could guide the extraction of structure information and hierarchy in e-books (e.g., Chapter → Section → Sub-section). The various challenges involved are discussed in detail. Solutions to the challenges identified were demonstrated using a PDF-specific tool named Booky. The solution, however, could be generalized to any digital format. Some relatively minor issues are still open. However, we have been able to achieve about a 98% success rate in the accurate extraction and linking of bookmarks automatically.

7. REFERENCES
[1] S. Mandal et al. “Automated detection and segmentation of table of contents page from document images”. In ICDAR ’03, page 398, Washington, DC, USA, 2003.
[2] Young-Bin Kwon; Jaehwa Park. “Implementation of Content Analysis System for Recognition of Journals’ Table of Contents”, ICDAR ’07, pp. 1018-1022.
[3] Liangcai Gao; Zhi Tang; Xiaofan Lin; Xin Tao; Yimin Chu. “Analysis of Book Documents’ Table of Content Based on Clustering”, ICDAR ’09, pp. 911-915.
[4] Prateek Sarkar and Eric Saund. “On the Reading of Table of Contents”, The Eighth IAPR Workshop on Document Analysis Systems, 2008, pp. 386-393.
[5] A. Belaïd. “Recognition of table of contents for electronic library consulting”. IJDAR, 4(1):35–45, 2001.
[6] F. L. Bourgeois et al. “Document understanding using probabilistic relaxation: Application on tables of contents of periodicals”. In ICDAR ’01, page 508, Washington, DC, USA, 2001. IEEE Computer Society.
[7] Hervé Déjean, Jean-Luc Meunier. “Structuring Documents According to Their Table of Contents”, In Proc. of DocEng 2005, pp. 2-9.
[8] Jürgen Steimle et al. “Digital Paper Bookmarks: Collaborative Structuring, Indexing and Tagging of Paper Documents”, In CHI EA ’08, pp. 2895-2900.
[9] Johannes Schöning et al. “iBookmark: Locative Texts and Place-based Authoring”, In CHI EA ’09, pp. 3775-3780.
[10] Dongwook Yoon et al. “Touch-Bookmark: A Lightweight Navigation and Bookmarking Technique for E-Books”, In CHI EA ’11, pp. 1189-1194.
[11] William S. Lovegrove et al. “Document analysis of PDF files: methods, results and implications”, Electronic Publishing, Vol. 8 (2 and 3), 207-220 (June & September 1995).
[12] S. Marinai et al. “Table of contents Recognition for converting PDF Documents in E-book formats”, In Proc. of DocEng ’10, pp. 73-76.
