I am trying to use textract to extract text from docx files in an AWS Lambda function using Python. The textract library is included in the package, as is its dependency, docx2txt. When I try to get the text out of the file, I still get an ExtensionNotSupported error stating that docx is not supported. I also tried putting the docx2txt library in the parsers folder; that didn’t help.
All the information I could find only suggests “installing the dependency”, which I have already done, so I’m at a loss here.
Using:
- Textract version 1.6.3
- Python version 3.11
Are you using Windows or Linux? According to the solution suggested here, for doc and docx files you have to set a path under Program Files (x86) in the environment variables.
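One more thing that may be worth checking: as far as I recall, textract picks its parser by dynamically importing a module named after the file extension, and an ImportError raised during that import is reported as ExtensionNotSupported. If that holds for 1.6.3, the message could be hiding a failed import inside the Lambda environment rather than a genuinely unsupported format. Below is a minimal sketch for surfacing the underlying error. The helper name `import_error_for` is made up for illustration, and the module path `textract.parsers.docx_parser` is my assumption about the package layout, so adjust it if your copy differs.

```python
import importlib


def import_error_for(module_name):
    """Return None if the module imports cleanly, else a short error description.

    Hypothetical helper for illustration; it is not part of textract.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # report whatever the import raised
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Module names below are assumptions about what the docx path needs.
    for name in ("docx2txt", "textract", "textract.parsers.docx_parser"):
        error = import_error_for(name)
        print(f"{name}: {'OK' if error is None else error}")
```

Running the same loop from inside the Lambda handler and logging the output should show whether docx2txt, or something it imports, fails to load there.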