PDF Extract Text



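The PDF file format is popular because its contents cannot be modified as easily as those of word processing formats. However, you can still extract text from PDF documents with the Aspose.PDF for .NET API. This article explores a few of the many possibilities for extracting text from PDF files using C#. It covers the following features:

  • Extract text from a whole PDF document
  • Search and extract text from a specific page
  • Search and extract text from a specific region of a page
  • Search or extract specific text with regular expressions
  • Extract text from table cells
  • Search and extract highlighted text
  • Extract text with low memory consumption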

Extract Text from a Whole PDF Document using C#

Reading text contents from a PDF document is a frequently used feature. You can extract all text from all pages of the document by following the steps below:

  1. Load PDF document
  2. Create TextAbsorber object
  3. Accept the absorber for all pages
  4. Write extracted text to a file and save it

The code snippet below follows these steps to extract text from a whole PDF document using C#:

View the code on Gist.
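
As a rough illustration of the four steps above, here is a minimal sketch. It is not the original Gist code; the file names input.pdf and extracted-text.txt are placeholders chosen for this example.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractAllText
{
    static void Main()
    {
        // "input.pdf" and "extracted-text.txt" are placeholder paths (assumptions).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // TextAbsorber collects the text of every page it visits.
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for all pages of the document.
            pdfDocument.Pages.Accept(textAbsorber);

            // Write the collected text to a plain text file.
            System.IO.File.WriteAllText("extracted-text.txt", textAbsorber.Text);
        }
    }
}
```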

Search and Extract Text from a Specific Page in PDF Document using C#

After exploring text extraction at the document level, let us extract text from a specific page of a PDF document. Follow the steps below:

  1. Load PDF document
  2. Create TextAbsorber object
  3. Accept the absorber for a specific page
  4. Write extracted text to a file and save it

The following code snippet applies these steps to read text from any page of a PDF file using C#:

View the code on Gist.
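
A minimal sketch of the page-level variant might look like the following. The page number and file names are placeholders for this example. Note that the Pages collection is 1-based, so Pages[1] is the first page.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractPageText
{
    static void Main()
    {
        // Placeholder paths and page number (assumptions for this sketch).
        const int pageNumber = 1;

        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber textAbsorber = new TextAbsorber();

            // Accept the absorber for a single page only (1-based index).
            pdfDocument.Pages[pageNumber].Accept(textAbsorber);

            System.IO.File.WriteAllText("page-text.txt", textAbsorber.Text);
        }
    }
}
```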

Search and Extract Text from Specific Region of PDF Page using C#

Let us take text extraction a step further and find text within a specific area of a page. It is important to understand that the basic measuring unit in Aspose.PDF for .NET is the point, where 72 points equal 1 inch. Moreover, the bottom-left corner of the page is the origin (0,0). Follow these steps to read text from a specific area of a page in a PDF document:

  1. Load source PDF document
  2. Create TextAbsorber object
  3. Set TextSearchOptions and specify rectangular coordinates
  4. Accept the absorber for specific page
  5. Write extracted text to a file

The code snippet below demonstrates these steps to extract text from a specific region using C#:

View the code on Gist.
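
The region-based steps can be sketched as below. The rectangle coordinates are arbitrary example values in points, measured from the bottom-left origin (lower-left x, lower-left y, upper-right x, upper-right y). The file names are placeholders as well.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractRegionText
{
    static void Main()
    {
        // Placeholder paths (assumptions for this sketch).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextAbsorber absorber = new TextAbsorber();

            // Restrict the search to text inside the given rectangle.
            absorber.TextSearchOptions.LimitToPageBounds = true;

            // Example region: from (100, 200) to (250, 350) in points,
            // i.e. roughly 1.4 to 3.5 inches from the left edge of the page.
            absorber.TextSearchOptions.Rectangle =
                new Aspose.Pdf.Rectangle(100, 200, 250, 350);

            // Accept the absorber for the first page only.
            pdfDocument.Pages[1].Accept(absorber);

            System.IO.File.WriteAllText("region-text.txt", absorber.Text);
        }
    }
}
```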

Search or Extract Specific Text from PDF file with Regular Expressions using C#

You can extract specific text from a PDF document that matches a pattern. For instance, you might want to extract particular words or numbers. For this purpose, you need to design a regular expression, which the API then uses to find matching text in the PDF document. Follow these steps to search for and extract specific text from PDF files:

  • Load input PDF document
  • Create TextAbsorber object
  • Design and set a regular expression to match some text
  • Loop through each extracted text fragment

The following C# code snippet uses a regex that searches for text containing 4 digits, for instance 1999 or 2000, across all pages of the document.

View the code on Gist.
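
A sketch of the regex search could look like this. It uses TextFragmentAbsorber rather than the TextAbsorber named in the steps, because TextFragmentAbsorber is the class that exposes the matched fragments through its TextFragments collection. The pattern \d{4} and the file name are illustrative choices.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractWithRegex
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption for this sketch).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Match any run of four digits, such as 1999 or 2000.
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"\d{4}");

            // Passing true tells the absorber to treat the phrase as a regular expression.
            absorber.TextSearchOptions = new TextSearchOptions(true);

            // Search all pages of the document.
            pdfDocument.Pages.Accept(absorber);

            // Loop through each matched text fragment.
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                Console.WriteLine(fragment.Text);
            }
        }
    }
}
```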

Extract Text from Table Cells in PDF Document using C#

Extracting text from tables on a PDF page works a little differently. The previous examples used the TextAbsorber class, but tables are read with a dedicated absorber. Follow these steps to extract text from table objects:

  1. Load source PDF file
  2. Create an object of TableAbsorber class
  3. Iterate through AbsorbedTable
  4. Loop through each cell in each row of Table

The code snippet below follows these steps to extract the text from table cells in a PDF document using C#:

View the code on Gist.
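
The table steps might be sketched as follows. This example visits only the first page and prints each cell's text to the console. Both of those choices, and the file name, are assumptions made for illustration.

```csharp
using System;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractTableText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption for this sketch).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TableAbsorber absorber = new TableAbsorber();

            // Detect tables on the first page.
            absorber.Visit(pdfDocument.Pages[1]);

            foreach (AbsorbedTable table in absorber.TableList)
            {
                foreach (AbsorbedRow row in table.RowList)
                {
                    foreach (AbsorbedCell cell in row.CellList)
                    {
                        // A cell may hold several fragments; join their segments.
                        StringBuilder cellText = new StringBuilder();
                        foreach (TextFragment fragment in cell.TextFragments)
                        {
                            foreach (TextSegment segment in fragment.Segments)
                            {
                                cellText.Append(segment.Text);
                            }
                        }
                        Console.WriteLine(cellText.ToString());
                    }
                }
            }
        }
    }
}
```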

Search & Extract Highlighted Text from PDF Document using C#

Highlighted text is stored as an annotation in PDF files. Such annotations contain marked text, which distinguishes them from the conventional text in a document. The steps below describe how to read highlighted text using C#:

  1. Load source document
  2. Loop through all the annotations
  3. Filter TextMarkupAnnotation
  4. Retrieve highlighted text fragments

The following code snippet is based on the steps above and can be used to get highlighted text from PDF files:

View the code on Gist.
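
A possible sketch of these steps is shown below. It walks every page, keeps only annotations of type TextMarkupAnnotation, and prints the fragments each one marks. The file name is a placeholder.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Annotations;
using Aspose.Pdf.Text;

class ExtractHighlightedText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption for this sketch).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                foreach (Annotation annotation in page.Annotations)
                {
                    // Skip everything that is not a text markup annotation.
                    if (annotation is TextMarkupAnnotation markup)
                    {
                        // Retrieve the text fragments covered by the markup.
                        TextFragmentCollection fragments = markup.GetMarkedTextFragments();
                        foreach (TextFragment fragment in fragments)
                        {
                            Console.WriteLine(fragment.Text);
                        }
                    }
                }
            }
        }
    }
}
```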

Extract PDF Text with Low Memory Consumption using C#

The following are two different approaches to optimize memory consumption while extracting text from PDF documents using C# language.

i) Using Reset() and FreeMemory() Method

Text extraction can sometimes consume a lot of memory and processor time, typically when the input file is large and contains a lot of text. This happens because the TextFragmentAbsorber object stores all found text fragments in memory.

The solution we recommend is to call the absorber.Reset() method after processing each page. Moreover, if you are performing read operations only, you can also free the memory held by page objects with the page.FreeMemory() method. Follow these steps to use minimal resources:

  1. Load source PDF document
  2. Specify TextEditOptions
  3. Read TextFragment
  4. Call Reset() method
  5. Call FreeMemory() method

The following code snippet demonstrates text extraction from PDF documents using C#:

View the code on Gist.
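
The per-page loop might be sketched as below. The specific TextEditOptions value (FontReplace.RemoveUnusedFonts) is an assumption about which option the steps refer to, and the file name is a placeholder.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractWithLowMemory
{
    static void Main()
    {
        // "input.pdf" is a placeholder path (assumption for this sketch).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Assumed option: drop unused fonts while absorbing text.
            TextEditOptions editOptions =
                new TextEditOptions(TextEditOptions.FontReplace.RemoveUnusedFonts);
            TextFragmentAbsorber absorber = new TextFragmentAbsorber(editOptions);

            foreach (Page page in pdfDocument.Pages)
            {
                page.Accept(absorber);

                // Read the fragments found on this page.
                foreach (TextFragment fragment in absorber.TextFragments)
                {
                    Console.WriteLine(fragment.Text);
                }

                // Clear the fragments collected so far before the next page.
                absorber.Reset();

                // Release cached data held by the page (read-only scenarios).
                page.FreeMemory();
            }
        }
    }
}
```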

We tested this code snippet with a large sample file containing 450 pages, 24500 text fragments, and many raster and vector images. The process consumed about 500MB of memory. As a further tip, you can invoke the .NET garbage collector explicitly to reduce maximum memory consumption to around 250MB, at an additional cost of about 10 seconds of processing time.

ii) Using MemorySaving enumeration

The Aspose.PDF for .NET API supports a memory saving mode for reading text from a PDF file. The TextExtractionOptions.TextFormattingMode enumeration includes a MemorySaving value for this purpose. The steps for this approach are:

  1. Load source PDF file
  2. Create TextDevice object
  3. Set TextExtractionOptions
  4. Extract text and write text to a file

The code snippet below demonstrates the memory saving approach using C#:

View the code on Gist.
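
A minimal sketch of the memory saving mode follows. It writes each page to its own text file. That layout and the file names are choices made for this example rather than anything the API requires.

```csharp
using Aspose.Pdf;
using Aspose.Pdf.Devices;
using Aspose.Pdf.Text;

class ExtractWithMemorySaving
{
    static void Main()
    {
        // "input.pdf" and the output names are placeholders (assumptions).
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Select the MemorySaving formatting mode.
            TextExtractionOptions options = new TextExtractionOptions(
                TextExtractionOptions.TextFormattingMode.MemorySaving);

            // TextDevice renders the text of a page to a file or stream.
            TextDevice textDevice = new TextDevice(options);

            foreach (Page page in pdfDocument.Pages)
            {
                // One output file per page, e.g. page-1.txt, page-2.txt, ...
                textDevice.Process(page, "page-" + page.Number + ".txt");
            }
        }
    }
}
```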

Conclusion

This article has explored a few of the many use cases for extracting text from PDF documents using C# with the Aspose.PDF for .NET API. You can learn more from the API documentation. If you have any questions, please reach out to us at the Free Support Forums.
