PdfGetTextWithOcr

Description

This activity takes a PDF file as input and returns a string representing the content of the PDF document. This activity must be used when the PDF file was created from an image.

Examples of this type of file are files created by scanners or images converted to PDF.
The activity uses Tesseract, an open-source OCR engine. For it to work properly, you need to create a folder called 'tessdata' in the same directory as your Workflow Service and put the desired language models inside it; they are available at this link.
Example: to extract Italian text from a PDF, put all ita.* files inside the 'tessdata' directory.
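
The folder layout described above can be checked with a short script before running the workflow. The following is a minimal Python sketch, not part of DocsMarshal: the function name language_files and the example install path are hypothetical, and the only assumption taken from this page is that model files are named <lang>.* and sit directly inside 'tessdata'.

```python
from pathlib import Path


def language_files(service_dir, lang):
    """Return the sorted names of the <lang>.* files found in <service_dir>/tessdata."""
    tessdata = Path(service_dir) / "tessdata"
    if not tessdata.is_dir():
        # The 'tessdata' folder has not been created yet.
        return []
    return sorted(p.name for p in tessdata.glob(f"{lang}.*") if p.is_file())


if __name__ == "__main__":
    # Hypothetical install path; replace it with the directory of your Workflow Service.
    found = language_files(r"C:\DocsMarshal\WorkflowService", "ita")
    print(found if found else "No ita.* files found in 'tessdata'")
```

An empty result suggests that either the 'tessdata' folder is missing or no model for that language code has been copied into it.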

Input

Lang InArgument<String>

The language of the text in the file. For example, the Italian language code is 'ita'.

The list of accepted values is available here.

PdfDmFile InArgument<IDmFile> REQUIRED

The PDF document from which you want to extract the text.

Misc

Result OutArgument<String>

A string representing the content of the PDF document.

TextInPage OutArgument<List<String>>

A list of strings representing the content of each page in the document.
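
As an illustration of how the per-page output might be consumed downstream, here is a minimal Python sketch that reports which pages contain a given term. The pages list is a made-up stand-in for a TextInPage value, and pages_containing is a hypothetical helper, not a DocsMarshal API.

```python
def pages_containing(pages, term):
    """Return the 1-based page numbers whose text contains term, ignoring case."""
    needle = term.casefold()
    return [i for i, text in enumerate(pages, start=1) if needle in text.casefold()]


# Made-up stand-in for a TextInPage value: one string per page.
pages = ["Fattura n. 12", "Totale: 100 EUR", "Note sulla fattura"]
print(pages_containing(pages, "fattura"))  # prints [1, 3]
```

Keeping the text split by page makes this kind of lookup possible without re-running OCR on the document.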