Text extraction from PDF files - part 1 « iPDFdev - Tips & Tricks for PDF development

Text extraction from PDF files – part 1

April 05th, 2016

Text extraction from PDF files is a requirement that many developers encounter in their software projects. While some people prefer to use a third-party library for this task (PDFKitten, for example), others want to implement it from scratch.

This article is the first in a series that shows how to implement this feature from scratch. The code in the articles uses the CoreGraphics CGPDF* API to parse the PDF files, but the general concepts apply to any programming language. Basic knowledge of the PDF file structure will help.

Text showing operators

The PDF specification defines several page content operators for displaying text on a PDF page.

Tj - shows a text string. It has a single operand: stringObject Tj
TJ - shows an array of strings. It has a single operand: arrayObject TJ
- the array can also contain numbers that adjust the spacing between characters
' (single quote) - moves to the next line and shows a text string. It has a single operand: stringObject '
" (double quote) - moves to the next line and shows a text string while setting the word and character spacing. It has 3 operands: wordSpacing characterSpacing stringObject "

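To make the TJ operand more concrete, here is a minimal, library-independent sketch in C. The TJElement type, the tj_collect function and the -200 threshold are assumptions made for illustration only; they are not part of CoreGraphics or of the PDF specification. The numbers in a TJ array are expressed in thousandths of a text space unit and are subtracted from the current position, so a large negative number widens the gap between glyphs, which many extractors treat as a hint for a word break.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one TJ array element: either a string or a number.
   Example content stream fragment: [(Hel) -20 (lo) -300 (PDF)] TJ */
typedef enum { TJ_STRING, TJ_NUMBER } TJKind;

typedef struct {
    TJKind kind;
    const char *bytes;   /* raw string bytes, valid when kind == TJ_STRING */
    size_t length;       /* number of bytes in `bytes` */
    double adjustment;   /* valid when kind == TJ_NUMBER, in 1/1000 text space units */
} TJElement;

/* Assumed heuristic: an adjustment below this value is treated as a word gap. */
#define TJ_WORD_GAP_THRESHOLD (-200.0)

/* Copies the raw bytes of the string elements into `out` and inserts a space
   for every adjustment below the threshold. The bytes are copied as-is;
   decoding them through the font is a separate step.
   Returns the number of bytes written, excluding the terminator. */
size_t tj_collect(const TJElement *elements, size_t count,
                  char *out, size_t capacity)
{
    assert(out != NULL && capacity > 0);
    size_t written = 0;
    for (size_t i = 0; i < count; i++) {
        const TJElement *e = &elements[i];
        if (e->kind == TJ_STRING) {
            /* Always leave room for the terminating NUL. */
            for (size_t j = 0; j < e->length && written + 1 < capacity; j++)
                out[written++] = e->bytes[j];
        } else if (e->adjustment < TJ_WORD_GAP_THRESHOLD &&
                   written + 1 < capacity) {
            out[written++] = ' ';
        }
    }
    out[written] = '\0';
    return written;
}
```

With the fragment from the comment, the small -20 adjustment is treated as kerning and ignored, while the -300 adjustment becomes a space. A real extractor would more likely derive the threshold from the width of the space glyph in the current font than use a fixed value.
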
The stringObject operand is a sequence of bytes, not a string in WinAnsi or UTF-8 encoding. These bytes are converted into an actual string using the current font's encoding and its ToUnicode CMap. This leads to another operator that needs to be handled:

Tf - sets the current font and font size. It has 2 operands: fontResourceName fontSize Tf
The fontResourceName is a name object that we'll use to locate the font object in the Resources dictionary.
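
The Tf handling can be sketched independently of any PDF library. The operand stack, the TextState structure and the handle_Tf function below are hypothetical names used only for illustration. The sketch assumes a stack-based scanner, where operands are popped in the reverse of the order in which they appear in the content stream, so for /F1 12 Tf the size comes off the stack before the name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operand model: a name such as /F1 or a number such as 12. */
typedef enum { OPERAND_NAME, OPERAND_NUMBER } OperandKind;

typedef struct {
    OperandKind kind;
    const char *name;   /* valid when kind == OPERAND_NAME */
    double number;      /* valid when kind == OPERAND_NUMBER */
} Operand;

typedef struct {
    Operand items[8];
    size_t count;
} OperandStack;

/* The part of the text state that Tf changes. */
typedef struct {
    char fontName[64];
    double fontSize;
} TextState;

/* Pops the top operand if it has the expected kind. */
static bool pop_operand(OperandStack *stack, OperandKind expected, Operand *out)
{
    if (stack->count == 0 || stack->items[stack->count - 1].kind != expected)
        return false;
    *out = stack->items[--stack->count];
    return true;
}

/* Handles `fontResourceName fontSize Tf`. */
bool handle_Tf(OperandStack *stack, TextState *state)
{
    assert(stack != NULL && state != NULL);
    Operand size, name;
    /* Operands come off the stack in reverse order: size first, then name. */
    if (!pop_operand(stack, OPERAND_NUMBER, &size)) return false;
    if (!pop_operand(stack, OPERAND_NAME, &name)) return false;
    strncpy(state->fontName, name.name, sizeof state->fontName - 1);
    state->fontName[sizeof state->fontName - 1] = '\0';
    state->fontSize = size.number;
    return true;
}
```

The stored font name is only a key; it still has to be resolved against the Resources dictionary to get the actual font object.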

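Once the current font is known, the bytes of a string operand can be mapped to Unicode. The following sketch shows the idea for a simple font with single-byte character codes. The 256-entry to_unicode table is a hypothetical stand-in for what the font's encoding and ToUnicode CMap would provide, and decode_codes is an illustrative name, not an existing API. Composite fonts with multi-byte codes are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes code point `cp` as UTF-8 into `out` (at least 4 bytes).
   Returns the number of bytes written. */
static size_t utf8_encode(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes single-byte character codes through a 256-entry table that maps
   each code to a Unicode code point. Entries of 0 mean "unmapped" and are
   skipped. Returns the number of UTF-8 bytes written, excluding the
   terminator. */
size_t decode_codes(const unsigned char *codes, size_t length,
                    const uint32_t to_unicode[256],
                    char *out, size_t capacity)
{
    assert(out != NULL && capacity > 0);
    size_t written = 0;
    for (size_t i = 0; i < length; i++) {
        uint32_t cp = to_unicode[codes[i]];
        if (cp == 0)
            continue;
        char tmp[4];
        size_t n = utf8_encode(cp, tmp);
        if (written + n >= capacity)
            break;  /* keep room for the terminating NUL */
        memcpy(out + written, tmp, n);
        written += n;
    }
    out[written] = '\0';
    return written;
}
```

With a table that maps code 0x01 to 'H' and 0x02 to 'i', the byte sequence 01 02 decodes to "Hi", even though neither byte is a printable character in WinAnsi or UTF-8. This is why the raw bytes cannot be used as text directly.
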
The page content is parsed using the CGPDFScanner* functions. The operator table for the operators above is set up like this:

// pdfPage is a CGPDFPageRef representing the page from which we want to extract the text
CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pdfPage);

// Register one callback for each of the operators described above
CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback(operatorTable, "Tf", &op_Tf);
CGPDFOperatorTableSetCallback(operatorTable, "Tj", &op_Tj);
CGPDFOperatorTableSetCallback(operatorTable, "TJ", &op_TJ);
CGPDFOperatorTableSetCallback(operatorTable, "'", &op_singleQuote);
CGPDFOperatorTableSetCallback(operatorTable, "\"", &op_doubleQuote);

// The last argument is handed to every callback as its info pointer (the __bridge cast is needed under ARC)
CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, (__bridge void *)self);
CGPDFScannerScan(contentStreamScanner);

// Release everything created above, including the content stream
CGPDFScannerRelease(contentStreamScanner);
CGPDFOperatorTableRelease(operatorTable);
CGPDFContentStreamRelease(contentStream);

In the next article I'll show how to find the font object in the resources based on the name provided to the Tf operator.

Tagged as: pdf

Comments (2)

Mahroof
April 12th, 2016 - 11:05

Very informative. Please post part two.

Tim
May 13th, 2016 - 18:58

Great info as always. Can’t wait for Part 2.
