Trying to use iText 7 to extract text from a PDF file outputs only question marks:
???????? ??????????
?
???????????????????????? ???????????????????
???????? ????????????????????????????
???????????????????????? ????????????????????????????
?????????????? ?????????????????????
???????????? ????????????????????
??????????
????? ??????????
????????????????????? ???????????????????????????????????
????????????? ??????????
????????????????? ??????????????????????????????????
??????? ???????????????? ????????????????
...
In Adobe and Chrome, the text can be copied from the PDF properly. How can this text be extracted in C#?
Code used:
MemoryStream pdfStream = ...
pdfStream.Position = 0;
var reader = new PdfReader(pdfStream);
using var pdfDocument = new PdfDocument(reader);
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
    var page = pdfDocument.GetPage(i);
    // A fresh strategy per page; a reused instance accumulates text from earlier pages.
    var strategy = new LocationTextExtractionStrategy();
    var text = PdfTextExtractor.GetTextFromPage(page, strategy);
}
The PDF is at
https://wetransfer.com/downloads/d0c0915f3416b4d9cc735a5b8a36443f20240604123056/c0768c
According to "How to extract text instead of question marks from PDF file", the PDF file has a bad Unicode map for its font.
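To see what the font's Unicode map actually contains, something like the following could dump it. This is only a sketch: it assumes the affected fonts are listed in the page-level resources (fonts used inside form XObjects are not visited), and the names `FontMapDump` and `DumpToUnicodeMaps` are made up for illustration.

```csharp
using System;
using System.Text;
using iText.Kernel.Pdf;

public static class FontMapDump
{
    // Hypothetical helper: prints the /ToUnicode CMap of every font found
    // in the page-level resources of the given document.
    public static void DumpToUnicodeMaps(PdfDocument pdfDocument)
    {
        for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
        {
            // /Font sub-dictionary of the page resources; null if the page has no fonts.
            PdfDictionary fonts = pdfDocument.GetPage(i).GetResources().GetResource(PdfName.Font);
            if (fonts == null)
            {
                continue;
            }

            foreach (PdfName fontName in fonts.KeySet())
            {
                PdfDictionary font = fonts.GetAsDictionary(fontName);
                if (font == null)
                {
                    continue;
                }

                Console.WriteLine($"Page {i}, font {fontName}, BaseFont {font.GetAsName(PdfName.BaseFont)}");

                // The ToUnicode entry is a stream holding a CMap in plain text.
                PdfStream toUnicode = font.GetAsStream(PdfName.ToUnicode);
                if (toUnicode == null)
                {
                    Console.WriteLine("  no ToUnicode map");
                    continue;
                }

                // GetBytes() returns the decoded stream content.
                Console.WriteLine(Encoding.ASCII.GetString(toUnicode.GetBytes()));
            }
        }
    }
}
```

Calling `FontMapDump.DumpToUnicodeMaps(pdfDocument)` on the document opened above prints each map, so the `bfchar`/`bfrange` entries can be checked by eye for the codes that end up as `?`.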
How can iText, or another .NET PDF text extractor, be forced to extract the text properly?
Is it possible to pre-process the PDF before text extraction to add proper maps, or to use a proper mapping instead of mapping every character to "?"?