Text extraction from PDF files – part 1
April 05th, 2016
Text extraction from PDF files is a requirement that many developers encounter in their software projects. While some people prefer to use a third-party library (PDFKitten, for example) for this task, others want to implement it from scratch.
This article is the first in a series that will show how to implement this feature from scratch. While the code in the articles uses the CoreGraphics CGPDF* API to parse the PDF files, the general concepts apply to any programming language. Basic knowledge of the PDF file structure will help.
Text showing operators
The PDF specification includes several page content operators for displaying text on a PDF page:
- Tj - shows a text string. It has a single operand: stringObject Tj
- TJ - shows an array of strings. It has a single operand: arrayObject TJ
  - the array can also contain numbers that let you adjust the spacing between characters
- ' (single quote) - moves to the next line and shows a text string. It has a single operand: stringObject '
- " (double quote) - moves to the next line and shows a text string while setting the word and character spacing. It has 3 operands: wordSpacing characterSpacing stringObject "
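To make the TJ case more concrete, here is a minimal sketch in plain C of how the elements of a TJ array could be collected into one run of text. The TJElement type, the tj_collect function and the -200 threshold are illustrative assumptions of mine, not part of CoreGraphics or the PDF specification. The numbers in the array are expressed in thousandths of a text space unit and are subtracted from the current position, so a large negative value widens the gap; treating such a value as a word break is a common heuristic, not a rule. The sketch copies the raw bytes unchanged, which only yields readable text when the font's encoding happens to be ASCII-compatible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One element of a TJ array: either a string (raw bytes) or a number.
   This type is hypothetical and exists only for this sketch. */
typedef struct {
    int is_number;       /* nonzero: use `number`; zero: use `bytes`/`length` */
    double number;       /* adjustment in thousandths of a text space unit */
    const char *bytes;   /* raw string bytes, not yet decoded */
    size_t length;
} TJElement;

/* Assumed heuristic: adjustments at or below this value count as a word gap. */
#define WORD_GAP_THRESHOLD (-200.0)

/* Concatenates the string elements into `out` (always NUL-terminated) and
   inserts a space wherever a number suggests a word gap.
   Returns the number of bytes written, excluding the terminator. */
size_t tj_collect(const TJElement *elems, size_t count, char *out, size_t out_size)
{
    assert(elems != NULL || count == 0);
    assert(out != NULL && out_size > 0);

    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (elems[i].is_number) {
            /* The number is subtracted from the position, so a negative
               value moves the next glyph further away. */
            if (elems[i].number <= WORD_GAP_THRESHOLD && used > 0 && used + 1 < out_size) {
                out[used++] = ' ';
            }
        } else {
            /* Copy as many bytes as still fit, leaving room for the terminator. */
            size_t room = out_size - 1 - used;
            size_t n = elems[i].length < room ? elems[i].length : room;
            memcpy(out + used, elems[i].bytes, n);
            used += n;
        }
    }
    out[used] = '\0';
    return used;
}
```

For the array [(Hel) -20 (lo) -300 (World)] this sketch ignores the small kerning adjustment and inserts a space for the large one. The right threshold depends on the font and the document, so treat the value as a starting point.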
The stringObject operand is a sequence of bytes, not an actual string in WinAnsi or UTF-8 encoding. This sequence of bytes is transformed into an actual string using the current font's encoding and its ToUnicode CMap. This leads to another operator that needs to be handled:
- Tf - sets current font and size. It has 2 operands: fontResourceName fontSize Tf
The fontResourceName is a name object that we'll use to locate the font object in the Resources dictionary.
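Going back to the string operand for a moment: the byte-to-text step described above can be sketched in plain C as a lookup through a per-font table. FontMap, utf8_encode and decode_string are hypothetical names used only for illustration; in a real implementation the table would be built from the font's encoding and its ToUnicode CMap. The sketch also assumes one-byte character codes, which does not hold for composite fonts where a code can span several bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-font table: maps each one-byte character code to a
   Unicode code point. A value of 0 means "no mapping known". */
typedef struct {
    uint32_t code_to_unicode[256];
} FontMap;

/* Writes `cp` as UTF-8 into `out`. Returns the number of bytes written,
   or 0 if fewer than the required bytes fit in `room`. */
static size_t utf8_encode(uint32_t cp, char *out, size_t room)
{
    if (cp < 0x80) {
        if (room < 1) return 0;
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        if (room < 2) return 0;
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (room < 3) return 0;
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (room < 4) return 0;
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decodes the raw bytes of a PDF string into NUL-terminated UTF-8 using the
   font's table. Unmapped codes become U+FFFD (replacement character).
   Returns the number of bytes written, excluding the terminator. */
size_t decode_string(const FontMap *font, const unsigned char *bytes, size_t length,
                     char *out, size_t out_size)
{
    assert(font != NULL && out != NULL && out_size > 0);

    size_t used = 0;
    for (size_t i = 0; i < length; i++) {
        uint32_t cp = font->code_to_unicode[bytes[i]];
        if (cp == 0) cp = 0xFFFD;
        size_t n = utf8_encode(cp, out + used, out_size - 1 - used);
        if (n == 0) break;   /* output buffer is full */
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

The point of the sketch is that the same bytes can decode to different text depending on which font is current, which is why the font has to be tracked while scanning the page content.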
The page content is parsed using the CGPDFScanner* functions. The operator table for the operators above is set up like this:
    // pdfPage is the CGPDFPageRef for the page from which we want to extract the text
    CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pdfPage);

    // Register one callback per operator we care about
    CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(operatorTable, "Tf", &op_Tf);
    CGPDFOperatorTableSetCallback(operatorTable, "Tj", &op_Tj);
    CGPDFOperatorTableSetCallback(operatorTable, "TJ", &op_TJ);
    CGPDFOperatorTableSetCallback(operatorTable, "'", &op_singleQuote);
    CGPDFOperatorTableSetCallback(operatorTable, "\"", &op_doubleQuote);

    // self is passed as the info pointer and is handed back to every callback
    // (under ARC it has to be cast: (__bridge void *)self)
    CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, self);
    CGPDFScannerScan(contentStreamScanner);

    // Release everything that was created above, including the content stream
    CGPDFScannerRelease(contentStreamScanner);
    CGPDFOperatorTableRelease(operatorTable);
    CGPDFContentStreamRelease(contentStream);
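Each registered callback reads its operands from the scanner's operand stack. With CoreGraphics that is done with functions such as CGPDFScannerPopNumber and CGPDFScannerPopName, and because operands are written before the operator, the last operand written is the first one popped. The sketch below mimics this for Tf in plain C, using a tiny stack of my own so that it compiles without CoreGraphics. OperandStack, TextState, operand_push and handle_Tf are hypothetical names; this is not how CGPDFScanner is implemented, only an illustration of the pop order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OPERAND_NAME, OPERAND_NUMBER } OperandKind;

/* A single operand: a name object or a number (hypothetical type). */
typedef struct {
    OperandKind kind;
    const char *name;
    double number;
} Operand;

#define STACK_MAX 16

typedef struct {
    Operand items[STACK_MAX];
    size_t count;
} OperandStack;

/* The part of the text state that Tf changes. */
typedef struct {
    char font_name[32];
    double font_size;
} TextState;

/* Pushes an operand; returns 1 on success, 0 if the stack is full. */
int operand_push(OperandStack *s, Operand op)
{
    assert(s != NULL);
    if (s->count == STACK_MAX) return 0;
    s->items[s->count++] = op;
    return 1;
}

/* Pops the top operand if it is a number; leaves the stack untouched otherwise. */
static int pop_number(OperandStack *s, double *value)
{
    if (s->count == 0 || s->items[s->count - 1].kind != OPERAND_NUMBER) return 0;
    *value = s->items[--s->count].number;
    return 1;
}

/* Pops the top operand if it is a name; leaves the stack untouched otherwise. */
static int pop_name(OperandStack *s, const char **value)
{
    if (s->count == 0 || s->items[s->count - 1].kind != OPERAND_NAME) return 0;
    if (s->items[s->count - 1].name == NULL) return 0;
    *value = s->items[--s->count].name;
    return 1;
}

/* Handles "fontResourceName fontSize Tf".
   Returns 1 if both operands were present and the state was updated. */
int handle_Tf(OperandStack *s, TextState *state)
{
    assert(s != NULL && state != NULL);
    double size = 0.0;
    const char *name = NULL;

    /* fontSize was written last, so it sits on top and is popped first. */
    if (!pop_number(s, &size)) return 0;
    if (!pop_name(s, &name)) return 0;

    strncpy(state->font_name, name, sizeof state->font_name - 1);
    state->font_name[sizeof state->font_name - 1] = '\0';
    state->font_size = size;
    return 1;
}
```

For a content stream fragment such as /F1 12 Tf, the name F1 is pushed first and the number 12 second, so the handler ends up with F1 as the current font resource name and 12 as the size. The string-showing callbacks would then decode their bytes with whatever font this state points to.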
In the next article I'll show how to find the font object in the page resources based on the name provided to the Tf operator.
April 12th, 2016 - 11:05
Very informative. Please post part two.
May 13th, 2016 - 18:58
Great info as always. Can’t wait for Part 2.