Load a scanned PDF document with the help of Tesseract OCR (a free OCR library) and save the result as a DOCX document using C# and .NET

Processing scanned PDF files is a common task in document management automation, OCR projects, and digital archiving systems. Such documents usually contain page images rather than real text, so the text must be recognized before it can be searched, edited, or reused. The Document .Net component from SautinSoft handles this by loading the PDF through an OCR engine.

Companies are increasingly converting paper documents to digital form so they can be searched, edited, and shared quickly. Optical Character Recognition (OCR) is an integral part of automating this work, and demand for it grows with the volume of digitization. OCR integration is particularly in demand in corporate systems, knowledge management systems, and the public sector.

Who can use this sample code:

  • Developers of electronic document management systems.
  • Archivists and system administrators.
  • Companies that automate the processing of paper documents.
  • Creators of digital data conversion solutions.

Step-by-step guide:

  1. Add SautinSoft.Document from NuGet.
  2. Set the OCR options to use Tesseract.
  3. Load the scanned PDF document.
  4. Save the document in DOCX format.
PDF to DOCX using OCR
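
The sample below expects a "tessdata" folder that holds one .traineddata file per OCR language plus the "pdf.ttf" font file. As a minimal sketch, a pre-flight check along these lines could report missing files before OCR starts. The names TessdataCheck and FindMissingFiles are hypothetical helpers, not part of Document .Net or Tesseract, and the check relies only on System.IO.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TessdataCheck
{
    // Hypothetical helper: returns the names of required files that are
    // missing from tessdataDir. "languages" uses the same "eng+rus+vie"
    // notation that the sample passes to Tesseract.
    public static List<string> FindMissingFiles(string tessdataDir, string languages)
    {
        IEnumerable<string> required = languages
            .Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(lang => lang.Trim() + ".traineddata")
            .Concat(new[] { "pdf.ttf" }); // font file the sample says must be present

        return required
            .Where(name => !File.Exists(Path.Combine(tessdataDir, name)))
            .ToList();
    }
}
```

Calling TessdataCheck.FindMissingFiles(tessdataFolder, "eng+rus+vie") before loading the PDF and printing any returned names gives a clearer message than an exception raised in the middle of OCR.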

Complete code

using System.IO;
using SautinSoft.Document;
using System;

namespace Example
{
    class Program
    {
        static void Main(string[] args)
        {
            // Get your free trial key here:   
            // https://sautinsoft.com/start-for-free/

            LoadScannedPdf();
        }

        /// <summary>
        /// Load a scanned PDF document with the help of Tesseract OCR (a free OCR library) and save the result as a DOCX document.
        /// </summary>
        /// <remarks>
        /// Details: https://www.sautinsoft.com/products/document/help/net/developer-guide/ocr-load-scanned-pdf-using-tesseract-and-save-as-docx-net-csharp-vb.php
        /// </remarks>
        static void LoadScannedPdf()
        {
            // Here we'll load a scanned PDF document (perform OCR) containing text in English, Russian and Vietnamese.
            // Next, we'll save the OCR result as a new DOCX document.

            // First steps:

            // 1. Download data files for English, Russian and Vietnamese languages.
            // Please download the files: eng.traineddata, rus.traineddata and vie.traineddata.
            // From here (good and fast): https://github.com/tesseract-ocr/tessdata_fast
            // or (best and slow): https://github.com/tesseract-ocr/tessdata_best

            // 2. Copy the files: eng.traineddata, rus.traineddata and vie.traineddata to
            // the folder "tessdata" in the Project root.

            // 3. Be sure that the folder "tessdata" also contains "pdf.ttf" file.

            // Let's start:
            string inpFile = @"..\..\..\scan.pdf";
            string outFile = "Result.docx";

            PdfLoadOptions lo = new PdfLoadOptions();
            lo.OCROptions.OCRMode = OCRMode.Enabled;
            // 'false' - Never load embedded fonts in PDF. Use the fonts with the same name installed at the system or similar by font metrics.
            // 'true' - Always load embedded fonts in PDF.
            lo.PreserveEmbeddedFonts = true;

            // You can specify all Tesseract parameters inside the method PerformOCRTesseract.
            lo.OCROptions.Method = PerformOCRTesseract;
            DocumentCore dc = DocumentCore.Load(inpFile, lo);

            // Make all text visible after Tesseract OCR (change the font color to black).
            // Tesseract returns the OCR result as a PDF document with invisible text,
            // but with Document .Net we can set the text color,
            // character scaling and spacing to the desired values.
            foreach (Run r in dc.GetChildElements(true, ElementType.Run))
            {
                r.CharacterFormat.FontColor = SautinSoft.Document.Color.Black;
                r.CharacterFormat.Scaling = 100;
                r.CharacterFormat.Spacing = 0;
            }

            dc.Save(outFile);

            // Open the result for demonstration purposes.
            System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(outFile) { UseShellExecute = true });
        }
        public static byte[] PerformOCRTesseract(byte[] image)
        {
            // Specify that Tesseract uses three languages: English, Russian and Vietnamese.
            string tesseractLanguages = "eng+rus+vie";

            // A path to the folder that contains the language data files and the font file "pdf.ttf".
            // Language data files can be found here:
            // Good and fast: https://github.com/tesseract-ocr/tessdata_fast
            // or
            // Best and slow: https://github.com/tesseract-ocr/tessdata_best
            // Also this folder must have write permissions.
            string tesseractData = Path.GetFullPath(@"..\..\..\tessdata\");

            // A path for a temporary PDF file (because Tesseract returns OCR result as PDF document)
            string tempFile = Path.Combine(tesseractData, Path.GetRandomFileName());            

            try
            {
                using (Tesseract.IResultRenderer renderer = Tesseract.PdfResultRenderer.CreatePdfRenderer(tempFile, tesseractData, true))
                {
                    using (renderer.BeginDocument("Searchablepdf"))
                    {
                        using (Tesseract.TesseractEngine engine = new Tesseract.TesseractEngine(tesseractData, tesseractLanguages, Tesseract.EngineMode.Default))
                        {
                            engine.DefaultPageSegMode = Tesseract.PageSegMode.Auto;
                            
                            byte[] imgBytes = image;
                            using (Tesseract.Pix img = Tesseract.Pix.LoadFromMemory(imgBytes))
                            {
                                using (var page = engine.Process(img, "Searchablepdf"))
                                {
                                    renderer.AddPage(page);
                                }
                            }
                        }
                    }
                }

                return File.ReadAllBytes(tempFile + ".pdf");
            }
            catch (Exception e)
            {
                Console.WriteLine();
                Console.WriteLine("Please be sure that you have language data files (*.traineddata) in your folder \"tessdata\".");
                Console.WriteLine("The language data files can be downloaded from here: https://github.com/tesseract-ocr/tessdata_fast");
                Console.ReadKey();
                // Keep the original exception as InnerException so its stack trace is not lost.
                throw new Exception("Error Tesseract: " + e.Message, e);
            }
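            finally
            {
                // Remove the temporary PDF that Tesseract wrote into the tessdata folder.
                string tempPdf = tempFile + ".pdf";
                if (File.Exists(tempPdf))
                    File.Delete(tempPdf);
            }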
        }        
    }
}
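
The relative paths in the C# sample ("..\..\..\scan.pdf" and "..\..\..\tessdata\") use Windows separators and are resolved against the current working directory, so they may not resolve when the program is started from another folder or on Linux or macOS. A minimal sketch of one alternative is shown here. ProjectPaths and FromProjectRoot are hypothetical names, and the sketch assumes the executable sits three levels below the project root (for example bin/Debug/<framework>), as the sample's paths imply.

```csharp
using System;
using System.IO;

public static class ProjectPaths
{
    // Hypothetical helper: walks "levelsUp" directories up from baseDir,
    // appends relativeName, and returns a normalized absolute path.
    // Path.Combine inserts the separator that is correct for the current OS.
    public static string FromProjectRoot(string baseDir, int levelsUp, string relativeName)
    {
        string dir = baseDir;
        for (int i = 0; i < levelsUp; i++)
        {
            dir = Path.Combine(dir, "..");
        }
        return Path.GetFullPath(Path.Combine(dir, relativeName));
    }
}
```

With this helper, the tessdata folder could be located as ProjectPaths.FromProjectRoot(AppContext.BaseDirectory, 3, "tessdata"), which does not depend on where the process was launched from.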

Imports System.IO
Imports SautinSoft.Document
Imports System

Namespace Example
    Friend Class Program
        Public Shared Sub Main(args As String())
            ' Get your free trial key here:   
            ' https://sautinsoft.com/start-for-free/

            Call LoadScannedPdf()
        End Sub

        ''' <summary>
        ''' Load a scanned PDF document with the help of Tesseract OCR (a free OCR library) and save the result as a DOCX document.
        ''' </summary>
        ''' <remarks>
        ''' Details: https://www.sautinsoft.com/products/document/help/net/developer-guide/ocr-load-scanned-pdf-using-tesseract-and-save-as-docx-net-csharp-vb.php
        ''' </remarks>
        Private Shared Sub LoadScannedPdf()
            ' Here we'll load a scanned PDF document (perform OCR) containing text in English, Russian and Vietnamese.
            ' Next, we'll save the OCR result as a new DOCX document.

            ' First steps:

            ' 1. Download data files for English, Russian and Vietnamese languages.
            ' Please download the files: eng.traineddata, rus.traineddata and vie.traineddata.
            ' From here (good and fast): https://github.com/tesseract-ocr/tessdata_fast
            ' or (best and slow): https://github.com/tesseract-ocr/tessdata_best

            ' 2. Copy the files: eng.traineddata, rus.traineddata and vie.traineddata to
            ' the folder "tessdata" in the Project root.

            ' 3. Be sure that the folder "tessdata" also contains "pdf.ttf" file.

            ' Let's start:
            Dim inpFile = "..\..\..\scan.pdf"
            Dim outFile = "Result.docx"

            Dim lo As PdfLoadOptions = New PdfLoadOptions()
            lo.OCROptions.OCRMode = OCRMode.Enabled
            ' 'false' - Never load embedded fonts in PDF. Use the fonts with the same name installed at the system or similar by font metrics.
            ' 'true' - Always load embedded fonts in PDF.
            lo.PreserveEmbeddedFonts = True

            ' You can specify all Tesseract parameters inside the method PerformOCRTesseract.
            lo.OCROptions.Method = AddressOf PerformOCRTesseract
            Dim dc = DocumentCore.Load(inpFile, lo)

            ' Make all text visible after Tesseract OCR (change the font color to black).
            ' Tesseract returns the OCR result as a PDF document with invisible text,
            ' but with Document .Net we can set the text color,
            ' character scaling and spacing to the desired values.
            For Each r As Run In dc.GetChildElements(True, ElementType.Run)
                r.CharacterFormat.FontColor = Color.Black
                r.CharacterFormat.Scaling = 100
                r.CharacterFormat.Spacing = 0
            Next

            dc.Save(outFile)

            ' Open the result for demonstration purposes.
            System.Diagnostics.Process.Start(New System.Diagnostics.ProcessStartInfo(outFile) With {.UseShellExecute = True})
        End Sub
        Public Shared Function PerformOCRTesseract(image As Byte()) As Byte()
            ' Specify that Tesseract uses three languages: English, Russian and Vietnamese.
            Dim tesseractLanguages = "eng+rus+vie"

            ' A path to the folder that contains the language data files and the font file "pdf.ttf".
            ' Language data files can be found here:
            ' Good and fast: https://github.com/tesseract-ocr/tessdata_fast
            ' or
            ' Best and slow: https://github.com/tesseract-ocr/tessdata_best
            ' Also this folder must have write permissions.
            Dim tesseractData = Path.GetFullPath("..\..\..\tessdata\")

            ' A path for a temporary PDF file (because Tesseract returns OCR result as PDF document)
            Dim tempFile As String = Path.Combine(tesseractData, Path.GetRandomFileName())

            Try
                Using renderer = Tesseract.ResultRenderer.CreatePdfRenderer(tempFile, tesseractData, True)
                    Using renderer.BeginDocument("Searchablepdf")
                        Using engine As Tesseract.TesseractEngine = New Tesseract.TesseractEngine(tesseractData, tesseractLanguages, Tesseract.EngineMode.Default)
                            engine.DefaultPageSegMode = Tesseract.PageSegMode.Auto

                            Dim imgBytes = image
                            Using img = Tesseract.Pix.LoadFromMemory(imgBytes)
                                Using page = engine.Process(img, "Searchablepdf")
                                    renderer.AddPage(page)
                                End Using
                            End Using
                        End Using
                    End Using
                End Using

                Return File.ReadAllBytes(tempFile & ".pdf")
            Catch e As Exception
                Console.WriteLine()
                Console.WriteLine("Please be sure that you have language data files (*.traineddata) in your folder ""tessdata"".")
                Console.WriteLine("The language data files can be downloaded from here: https://github.com/tesseract-ocr/tessdata_fast")
                Console.ReadKey()
                ' Keep the original exception as InnerException so its stack trace is not lost.
                Throw New Exception("Error Tesseract: " & e.Message, e)
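            Finally
                ' Remove the temporary PDF that Tesseract wrote into the tessdata folder.
                Dim tempPdf As String = tempFile & ".pdf"
                If File.Exists(tempPdf) Then File.Delete(tempPdf)
            End Try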
        End Function
    End Class
End Namespace


If you need a new code example or have a question, email us at support@sautinsoft.com.


Questions and suggestions from you are always welcome!

We have been developing .NET components since 2002. We know the PDF, DOCX, RTF, HTML, XLSX and image formats. If you need any assistance with creating, modifying or converting documents in various formats, we can help you. We will write any code example for you absolutely free.