I’m extracting information from DOCX documents via Python, and sometimes I have to take into account the layout of textual content as it appears in MS Word.
When I extract content that is represented, for example, like this
<w:p w14:paraId="7F6EC7CF" w14:textId="77777777" w:rsidR="004B648B" w:rsidRDefault="00D36BEF">
  <w:pPr>
    <w:pStyle w:val="TableParagraph"/>
    <w:spacing w:before="30"/>
    <w:ind w:right="3218"/>
    <w:rPr>
      <w:sz w:val="18"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:sz w:val="18"/>
    </w:rPr>
    <w:t>ABUNDANT SELECT LIMITED</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:spacing w:val="40"/>
      <w:sz w:val="18"/>
    </w:rPr>
    <w:t xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:sz w:val="18"/>
    </w:rPr>
    <w:t>VISTRA CORPORATE SERVICES CENTRE, WICKHAMS CAY II, ROAD TOWN,</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:spacing w:val="-11"/>
      <w:sz w:val="18"/>
    </w:rPr>
    <w:t xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:sz w:val="18"/>
    </w:rPr>
    <w:t>TORTOLA,</w:t>
  </w:r>
it comes back as
ABUNDANT SELECT LIMITED VISTRA CORPORATE SERVICES CENTRE, WICKHAMS CAY II, ROAD TOWN, TORTOLA, VG1110, BRITISH VIRGIN ISLANDS
But I would really like to receive it as
ABUNDANT SELECT LIMITED
VISTRA CORPORATE SERVICES CENTRE, WICKHAMS CAY II, ROAD TOWN,
TORTOLA,
the way it appears in the word processor (because then it’s easy to pick the first intended line).
Can I pass an option to textract so that it returns one run per line instead of one paragraph per line?
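
To make the desired behaviour concrete, here is a minimal sketch of "one run per line" that works on the raw paragraph XML directly rather than through textract. It is not a textract option: the function name runs_as_lines and the SAMPLE string are hypothetical, and it assumes that whitespace-only runs are mere separators and that run boundaries happen to coincide with the visual lines, which Word does not guarantee in general.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix is bound to in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed-down version of the paragraph above (formatting removed)
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:t>ABUNDANT SELECT LIMITED</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:t>VISTRA CORPORATE SERVICES CENTRE, WICKHAMS CAY II, ROAD TOWN,</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:t>TORTOLA,</w:t></w:r>'
    '</w:p>'
)


def runs_as_lines(paragraph_xml: str) -> list[str]:
    """Return the text of every non-blank run (w:r), one list entry per run."""
    paragraph = ET.fromstring(paragraph_xml)
    lines = []
    for run in paragraph.iter(f"{{{W_NS}}}r"):
        # A run may hold several w:t elements; join them into one string
        text = "".join(t.text or "" for t in run.iter(f"{{{W_NS}}}t"))
        # Assumption: whitespace-only runs are separators, not content
        if text.strip():
            lines.append(text)
    return lines


if __name__ == "__main__":
    print("\n".join(runs_as_lines(SAMPLE)))
```

For a real file, the same XML can be read with the standard library via zipfile.ZipFile(path).read("word/document.xml"), since a DOCX file is a ZIP archive; the paragraphs are the w:p elements inside it.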