Does anyone know a Python library to read .docx files?
I have a word document that I am trying to read data from.
There are a couple of packages that let you do this.
Check:
- python-docx.
- docx2txt (note that it does not seem to work with .doc). As per this, it seems to get more info than python-docx.
From original documentation:
import docx2txt
# extract text
text = docx2txt.process("file.docx")
# extract text and write images in /tmp/img_dir
text = docx2txt.process("file.docx", "/tmp/img_dir")
- textract (which works via docx2txt).
- Since .docx files are simply .zip files with a changed extension, this shows how to access the contents. This is a significant difference with .doc files, and the reason why some (or all) of the above do not work with .doc files. In this case, you would likely have to convert doc -> docx first. antiword is an option.
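To illustrate the point that a .docx file is a zip archive, here is a minimal sketch that uses only the standard library. It assumes the main body lives in the archive member word/document.xml and that paragraphs and text runs use the standard WordprocessingML namespace. The function name read_docx_text is hypothetical, chosen for this example. It is not a replacement for the packages above: it ignores headers, footers, and images, and text inside nested structures such as text boxes may be repeated.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the text of each paragraph in a .docx file as a list of strings.

    Hypothetical helper for illustration; assumes the standard archive layout.
    """
    # A .docx is a zip archive; the document body is stored as XML inside it.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every w:t element.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs


if __name__ == "__main__":
    import sys

    # Print the text of every file named on the command line.
    for docx_path in sys.argv[1:]:
        print("\n".join(read_docx_text(docx_path)))
```

Running zipfile.ZipFile(path).namelist() on the same file lists the other members of the archive, such as any embedded images.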
python-docx can read as well as write.
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
Now all paragraphs will be in the list allText.
Thanks to Automate the Boring Stuff with Python by Al Sweigart for the pointer.
See this library, which allows reading docx files: https://python-docx.readthedocs.io/en/latest/
You should use the python-docx library available on PyPI. Then you can use the following:
import docx
doc = docx.Document('myfile.docx')
allText = []
for docpara in doc.paragraphs:
    allText.append(docpara.text)
A quick search of PyPI turns up the docx package.
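import docx
from docx.opc.exceptions import PackageNotFoundError

def main():
    try:
        doc = docx.Document('test.docx')  # Create the Word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)  # Join the paragraphs with newlines.
        print(data)
    except (IOError, PackageNotFoundError):
        # python-docx raises PackageNotFoundError if the path is missing or not a .docx.
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()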
And don't forget to install python-docx first, using pip install python-docx.