Extracting text from a PDF using the latest iText from NuGet:
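using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

MemoryStream pdfStream = ...
pdfStream.Position = 0; // rewind so PdfReader starts at the beginning of the stream
var reader = new PdfReader(pdfStream);
using var pdfDocument = new PdfDocument(reader);
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    // Fresh strategy per page: a reused LocationTextExtractionStrategy
    // keeps accumulating text from the pages processed before.
    var strategy = new LocationTextExtractionStrategy();
    var text = PdfTextExtractor.GetTextFromPage(page, strategy);
}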
This throws an exception at GetTextFromPage():
iText.IO.Exceptions.IOException: Error at file pointer 39747.
 ---> iText.IO.Exceptions.IOException: '>' not expected.
   --- End of inner exception stack trace ---
   at iText.IO.Source.PdfTokenizer.ThrowError(String error, Object[] messageParams)
   at iText.IO.Source.PdfTokenizer.NextToken()
   at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.NextValidToken()
   at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.ReadObject()
   at iText.Kernel.Pdf.Canvas.Parser.Util.PdfCanvasParser.Parse(IList`1 ls)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
   at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessPageContent(PdfPage page)
   at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
   at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy)
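The trace shows the failure inside ProcessPageContent, i.e. while tokenizing a page's content stream, so the problem may be confined to particular pages. A minimal diagnostic sketch, assuming that is the case: it wraps each page in its own try/catch so the failing page numbers are reported and the remaining pages are still extracted. The names PageTextProbe, ExtractPerPage and failedPages are hypothetical and only for illustration; this does not repair the malformed content, it only isolates it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical helper, not part of iText.
public static class PageTextProbe
{
    // Returns page number -> extracted text for every page that parses.
    // Page numbers whose content stream fails to parse are appended to failedPages.
    public static IDictionary<int, string> ExtractPerPage(Stream pdfStream, IList<int> failedPages)
    {
        var texts = new Dictionary<int, string>();
        var reader = new PdfReader(pdfStream);
        using var pdfDocument = new PdfDocument(reader);

        for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
        {
            try
            {
                // Fresh strategy per page so text from earlier pages is not carried over.
                var strategy = new LocationTextExtractionStrategy();
                texts[i] = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
            }
            catch (iText.IO.Exceptions.IOException ex)
            {
                // Same exception type as in the stack trace above.
                failedPages.Add(i);
                Console.Error.WriteLine($"Page {i}: {ex.Message}");
            }
        }
        return texts;
    }
}
```

Called as `PageTextProbe.ExtractPerPage(pdfStream, failed)` with an empty `List<int>`, this would at least show whether every page fails or only some of them.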
The PDF is at
https://wetransfer.com/downloads/340cd26ec3b600354b36f368ac4bf9af20240517055159/56d39d
How can I extract text from this PDF?