I’m trying to extract some content from a txt file (converted from a PDF).
Note that there may or may not be a space after the label or before the page number, and sometimes there is no \n between a page number and the next X.Y.Z code.
Here is a sample:
Summary
30.1.3.1. Boite à eau………………………………………………………………………………………………………………………………. 29
30.1.3.2. Descentes d’eaux pluviales en façades ……………………………………………………………………………………….30
30.1.3.3. Lanterneau de désenfumage………………………………………………………………………………………………………30 30.1.3.4. Etanchéité résine………………………………………………………………………………………………………………………31
and later in the same doc :
The structure of a summary entry is:
X.Y.Z. Label …………………………….. Page_number
Elsewhere in the same document, the body text contains entries like:
X.Y.Z. Label : Description etc ……
My use case is to get only "X.Y.Z Label", and only from the summary.
I tried this regex, which is the best result I could get, but it is still not right:
(\d+[.])(.*)?[.]*\s*\d+
My problems are handling the dots, extracting the label, and the missing \n.
Could you please help me?
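To make the goal concrete, here is a minimal sketch in Python of the kind of pattern I have in mind. It is not a tested solution for the whole document. It assumes that a summary entry is a dotted numeric code, a label, a leader made of at least two "." or "…" characters, and a page number on the same line, and that body entries never end with a leader followed by a number. The names TOC_ENTRY and extract_toc_entries are made up for this example.

```python
import re

# Hypothetical pattern, built on the assumptions stated above.
TOC_ENTRY = re.compile(
    r"(\d+(?:\.\d+)*)\."   # code such as 30.1.3.1, then its trailing dot
    r"[ \t]*(.+?)"         # label, matched lazily so it stops before the leader
    r"[ \t]*[.…]{2,}"      # leader: two or more '.' or '…' characters
    r"[ \t]*\d+"           # page number, which must be on the same line
)


def extract_toc_entries(text):
    """Return (code, label) tuples for every summary-like entry in text."""
    return TOC_ENTRY.findall(text)


if __name__ == "__main__":
    sample = (
        "30.1.3.1. Boite à eau…………. 29\n"
        "30.1.3.2. Descentes d’eaux pluviales en façades ………….30\n"
        "30.1.3.3. Lanterneau de désenfumage…………30 "
        "30.1.3.4. Etanchéité résine…………31\n"
    )
    for code, label in extract_toc_entries(sample):
        print(code, label)
```

Because the pattern uses [ \t]* rather than \s* around the leader, a match cannot run across a line break, so a body line that ends in dots does not pick up digits from the following line. Two entries on the same line are found separately because each match stops right after its page number. A page number glued directly to the next code with no space (for example 3030.1.3.4.) would still be ambiguous with this approach.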