iPDFdev Tips & Tricks for PDF development

Text extraction from PDF files – part 1

April 05th, 2016

Text extraction from PDF files is a requirement that many developers encounter in their software projects. While some people prefer to use a third-party library (PDFkitten, for example) for this task, others want to implement it from scratch.

This article is the first in a series that will show how to implement this feature from scratch. Although the code in the articles uses the CoreGraphics CGPDF* API to parse PDF files, the general concepts apply to any programming language. Basic knowledge of the PDF file structure will help.

Text showing operators

The PDF specification defines four page content operators for displaying text on a PDF page:

  • Tj - shows a text string. It has a single operand: stringObject Tj
  • TJ - shows an array of strings. It has a single operand: arrayObject TJ
    - the array can also contain numbers that adjust the spacing between characters
  • ' (single quote) - moves to the next line and shows a text string. It has a single operand: stringObject '
  • " (double quote) - moves to the next line and shows a text string while setting the word and character spacing. It has 3 operands: wordSpacing characterSpacing stringObject "
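
To make the TJ operand more concrete, here is a minimal sketch in portable C of how the strings and numbers in the array could be combined. The TJElement type, the collect_tj_text function and the -200 threshold are assumptions made up for this example; they are not part of the PDF specification or of CoreGraphics. The numbers are expressed in thousandths of a text space unit and are subtracted from the current position, so in horizontal writing a large negative number widens the gap before the next string. The sketch also treats the strings as already decoded text, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one TJ array element: a string or a number. */
typedef enum { TJ_STRING, TJ_NUMBER } TJKind;

typedef struct {
    TJKind kind;
    const char *bytes;   /* used when kind == TJ_STRING (assumed already decoded) */
    double adjustment;   /* used when kind == TJ_NUMBER, in 1/1000 text space units */
} TJElement;

/* Assumed heuristic: an adjustment at or below this value counts as a word gap. */
#define WORD_GAP_THRESHOLD (-200.0)

/*
 * Concatenates the strings of a TJ array into out (capacity out_size) and
 * inserts one space wherever a number opens a gap at least as wide as the
 * threshold. Text that does not fit is truncated.
 * Returns the number of characters written, excluding the terminator.
 */
size_t collect_tj_text(const TJElement *items, size_t count,
                       char *out, size_t out_size)
{
    assert(items != NULL || count == 0);
    assert(out != NULL && out_size > 0);

    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (items[i].kind == TJ_STRING) {
            size_t len = strlen(items[i].bytes);
            size_t room = out_size - 1 - used;
            if (len > room) {
                len = room;
            }
            memcpy(out + used, items[i].bytes, len);
            used += len;
        } else if (items[i].adjustment <= WORD_GAP_THRESHOLD &&
                   used < out_size - 1) {
            out[used++] = ' ';
        }
    }
    out[used] = '\0';
    return used;
}
```

With this sketch, the array [(Hel) -20 (lo) -300 (World)] produces "Hello World": the -20 is treated as kerning and ignored, while the -300 is treated as a word gap. A real extractor would compare the adjustment with the width of the font's space glyph instead of a fixed threshold.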

The stringObject operand is a sequence of bytes, not an actual string in WinAnsi or UTF-8 encoding. These bytes are converted into actual text using the current font's encoding and its ToUnicode CMap. This means there is one more operator that needs to be handled:

  • Tf - sets the current font and font size. It has 2 operands: fontResourceName fontSize Tf
    - the fontResourceName is a name object that we'll use to locate the font object in the Resources dictionary
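
The relationship between Tf and the string bytes can be sketched in portable C as follows. Everything in this example is hypothetical: the Font and TextState types, the select_font and decode_bytes functions, and the idea of a ready-made 256-entry table per font. In a real extractor the table would have to be built from the font's encoding and ToUnicode CMap, and fonts with multi-byte codes would need more than a single-byte lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical font: its resource name plus a byte-to-Unicode table (0 = unmapped). */
typedef struct {
    const char *resource_name;
    uint32_t code_point[256];
} Font;

/* Hypothetical text state that a Tf handler would update. */
typedef struct {
    const Font *font;   /* NULL until the first Tf is seen */
    double size;
} TextState;

/* Models "fontResourceName fontSize Tf": finds the named font and makes it current.
 * Returns false and keeps the previous font when the name is unknown. */
bool select_font(TextState *state, const Font *fonts, size_t font_count,
                 const char *resource_name, double size)
{
    assert(state != NULL && resource_name != NULL);
    for (size_t i = 0; i < font_count; i++) {
        if (strcmp(fonts[i].resource_name, resource_name) == 0) {
            state->font = &fonts[i];
            state->size = size;
            return true;
        }
    }
    return false;
}

/* Decodes raw string bytes with the current font's table.
 * Unmapped bytes become U+FFFD. Returns the number of code points written,
 * or 0 when no font has been selected yet. */
size_t decode_bytes(const TextState *state, const unsigned char *bytes, size_t len,
                    uint32_t *out, size_t out_capacity)
{
    assert(state != NULL && out != NULL);
    if (state->font == NULL) {
        return 0;
    }
    size_t written = 0;
    for (size_t i = 0; i < len && written < out_capacity; i++) {
        uint32_t cp = state->font->code_point[bytes[i]];
        out[written++] = (cp != 0) ? cp : 0xFFFD;
    }
    return written;
}
```

The point of the sketch is that the same byte can mean different characters depending on which font is current, which is why Tf cannot be skipped even when only the text is of interest.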

The page content is parsed using the CGPDFScanner* functions. The operator table for the operators above is set up like this:

// pdfPage is a CGPDFPageRef representing the page from which we want to extract the text
CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pdfPage);

// Register one callback for each operator we want to handle
CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback(operatorTable, "Tf", &op_Tf);
CGPDFOperatorTableSetCallback(operatorTable, "Tj", &op_Tj);
CGPDFOperatorTableSetCallback(operatorTable, "TJ", &op_TJ);
CGPDFOperatorTableSetCallback(operatorTable, "'", &op_singleQuote);
CGPDFOperatorTableSetCallback(operatorTable, "\"", &op_doubleQuote);

// The last argument is handed to every callback as its info pointer;
// the __bridge cast is required when compiling with ARC
CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, (__bridge void *)self);
CGPDFScannerScan(contentStreamScanner);

// Release everything created above, including the content stream
CGPDFScannerRelease(contentStreamScanner);
CGPDFOperatorTableRelease(operatorTable);
CGPDFContentStreamRelease(contentStream);
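
Since the general concepts are not tied to CoreGraphics, the idea of registering callbacks by operator name can also be sketched in portable C. The OperatorTable type and the operator_table_set, operator_table_dispatch and count_text_operator functions below are hypothetical names invented for this example; this is not how CGPDFOperatorTable is implemented, only a small model of the same pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: info is context supplied by the caller. */
typedef void (*OperatorCallback)(void *info);

typedef struct {
    const char *name;
    OperatorCallback callback;
} OperatorEntry;

/* Hypothetical fixed-capacity operator table. */
typedef struct {
    OperatorEntry entries[16];
    size_t count;
} OperatorTable;

/* Registers a callback for an operator name.
 * Returns false when the table is full. Registering a name twice is not
 * detected; the first registration wins during dispatch. */
bool operator_table_set(OperatorTable *table, const char *name,
                        OperatorCallback callback)
{
    assert(table != NULL && name != NULL && callback != NULL);
    size_t capacity = sizeof table->entries / sizeof table->entries[0];
    if (table->count == capacity) {
        return false;
    }
    table->entries[table->count].name = name;
    table->entries[table->count].callback = callback;
    table->count++;
    return true;
}

/* Invokes the callback registered for name, if any.
 * Names are compared exactly, so "Tj" and "TJ" are different operators.
 * Returns false when no callback is registered, which lets the caller
 * skip operators it does not care about. */
bool operator_table_dispatch(const OperatorTable *table, const char *name,
                             void *info)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < table->count; i++) {
        if (strcmp(table->entries[i].name, name) == 0) {
            table->entries[i].callback(info);
            return true;
        }
    }
    return false;
}

/* Sample callback: counts text showing operators; info must point to an int. */
void count_text_operator(void *info)
{
    (*(int *)info)++;
}
```

A parser built on this sketch would call operator_table_dispatch for every operator it reads from the content stream. Graphics operators such as re or f would simply find no callback, so only the text-related operators reach the extraction code.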

In the next article I'll show how to find the font object in the resources based on the name passed to the Tf operator.
