I am building a .pdf data scraper for technical reports.
The original data is mostly in multi-page .pdf files where the useful data is only on the first page.
With the PyPDF2 module I merged the first pages of all the tech reports into a single .pdf file.
I used PdfReader to extract the text of each page and append it as a string to a list.
For illustration the list will look like this =>
list_o_text = ['Random string 1 2 3 45 6789 999999 22222', 'Example tech report 444444']
Every string in list_o_text definitely contains one or more 5- or 6-digit numbers.
I recently found the re module, yet I’m having problems finding the proper function to search for them.
I would appreciate help.
#########################################################################
Attempt with findall()
IDLE INPUT:_______
import re
list_o_text = ['Random string 1 2 3 45 6789 999999 22222', 'Example tech report 444444']
for n in range(len(list_o_text)):
    find = re.findall(r'\d{5}+', list_o_text[n])
    print(find)
IDLE SHELL OUTPUT:___
['99999', '22222']
['44444']
Note: the six-digit number '999999' is not found in its entirety
Attempt with search()
IDLE INPUT:_______
import re
list_o_text = ['Random string 1 2 3 45 6789 999999 22222', 'Example tech report 444444']
for n in range(len(list_o_text)):
    find = re.search(r'\d{5}+', list_o_text[n])
    print(find)
IDLE SHELL OUTPUT:___
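<re.Match object; span=(28, 33), match='99999'>
<re.Match object; span=(20, 25), match='44444'>
Note: this returns a match object with positions rather than the text, only the first match in each string is reported, and the spans still cover only five digits of the 6-digit numbers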
Attempt with search().group()
import re
list_o_text = ['Random string 1 2 3 45 6789 999999 22222', 'Example tech report 444444']
for n in range(len(list_o_text)):
    find = re.search(r'\d{5}+', list_o_text[n]).group()
    print(find)
IDLE SHELL OUTPUT:___
99999
44444
Note: the six-digit number '999999' is not found in its entirety
#################################################################
CONVOLUTED SOLUTION
I used all three methods, yet can’t shake the feeling that it could be simpler
IDLE INPUT:_______
import re
list_o_text = ['Random string 1 2 3 45 6789 999999 22222', 'Example tech report 444444']
for n in range(len(list_o_text)):
    find_all = re.findall(r'\d{5}+', list_o_text[n])
    # 1st loop result is ['99999', '22222']
    for five_d_num in find_all:
        # locate where the five-digit match starts in the string
        find_start = re.search(five_d_num, list_o_text[n]).start()
        # from that position, take the whole run of digits
        find = re.search(r'\d+', list_o_text[n][find_start:]).group()
        print(find)
IDLE SHELL OUTPUT:___
999999
22222
444444
That’s that.
Cenc is a new contributor to this site.
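The pattern \d{5}+ is not what you need: it matches exactly five digits, and the trailing + does not let the match grow to a sixth. You want \d{5,6}, which matches five or six digits and takes six whenever they are available.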
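As a minimal sketch of that fix applied to the sample list from the question (the names PATTERN and results are only illustrative, not part of the original code), a single findall() call per string should be enough:

```python
import re

# Five or six consecutive digits; the quantifier is greedy,
# so six digits are taken whenever six are available.
PATTERN = re.compile(r"\d{5,6}")

list_o_text = [
    "Random string 1 2 3 45 6789 999999 22222",
    "Example tech report 444444",
]

# Iterate over the strings directly instead of indexing with range(len(...)).
results = [PATTERN.findall(text) for text in list_o_text]

for found in results:
    print(found)
# ['999999', '22222']
# ['444444']
```

This replaces the findall() / search() / start() chain with one call, since findall() already returns every non-overlapping match as a string.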
I highly recommend regex101.com for constructing and testing regex patterns. The site comes with a detailed breakdown of each pattern’s components.
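Alongside an online tester, a few edge cases can be checked locally. The sketch below assumes something the question does not state: that a run of seven or more digits should be rejected rather than truncated. Under that assumption, lookarounds can require that no digit sits directly before or after the match. STRICT and samples are hypothetical names used only for this illustration.

```python
import re

# Plain pattern: will cut a longer run down to its first six digits.
LOOSE = re.compile(r"\d{5,6}")

# Stricter variant (assumption: longer digit runs are unwanted):
# no digit immediately before or after the 5-6 digit match.
STRICT = re.compile(r"(?<!\d)\d{5,6}(?!\d)")

samples = {
    "Random string 1 2 3 45 6789 999999 22222": ["999999", "22222"],
    "Example tech report 444444": ["444444"],
    "serial 1234567 is too long": [],
    "ID12345x": ["12345"],
}

for text, expected in samples.items():
    found = STRICT.findall(text)
    assert found == expected, (text, found)

# For comparison, the plain pattern truncates the seven-digit run.
print(LOOSE.findall("serial 1234567 is too long"))   # ['123456']
print(STRICT.findall("serial 1234567 is too long"))  # []
```

If the reports never contain longer digit runs, the plain \d{5,6} pattern is sufficient and easier to read.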