I have a huge file containing lines like DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.
Here’s my code:
x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
    print(line.find(x))
    print(line[36:31 + len(x)])
The problem is that line.find(x) returns 10 for some lines and 26 for others, and my slice only grabs the complete number when it is 26. I’m new to programming, so I’m looking for a way to grab the complete number after the word, something like this:
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        print the number after x
Use regular expressions:
import re
for line in open('m.txt'):
    # \d+ matches one or more digits; the group captures the digits after "uniprotkb:P"
    match = re.search(r'uniprotkb:P(\d+)', line)
    if match:
        print(match.group(1))
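The pattern in this answer is written to capture only the digits after a leading P. If the whole accession is wanted, letter included, a variation along these lines might do. This is only a sketch: the helper name extract_accession is made up for illustration, and it assumes the accession is the run of letters and digits directly after "uniprotkb:". It is written with print() calls so it runs on Python 3.

```python
import re

# Assumption: the accession is the run of word characters right after "uniprotkb:".
PATTERN = re.compile(r"uniprotkb:(\w+)")

def extract_accession(line):
    """Return the text after 'uniprotkb:', or None if the line has no such field."""
    match = PATTERN.search(line)
    return match.group(1) if match else None

# The two sample lines from the question.
samples = [
    "DDD-1126N|refseq:NP_285726|uniprotkb:P00112",
    "DDD-1081N|uniprotkb:P12121",
]
for sample in samples:
    print(extract_accession(sample))
```

Because the pattern is searched for anywhere in the line, it does not matter whether "uniprotkb:" starts at position 10 or 26.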
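Alternatively, compile the pattern once and use findall:
import re
regex = re.compile('uniprotkb:P([0-9]*)')
# string is the text to search, e.g. one line of the file
print(regex.findall(string))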
The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        # Slice from just past x to the end of the line, dropping the trailing newline
        print(line[line.find(x) + len(x):].rstrip())
Edit:
To answer your comment: if the fields are separated by the pipe character (|), then you could do this:
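sep = "|"
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        # Keep only the fields that contain x, and take the text after it.
        # Testing "x in l" avoids slicing fields where find() returns -1.
        matches = [l[l.find(x) + len(x):] for l in line.strip().split(sep) if x in l]
        print(matches)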
If m.txt has the following line:
DDD-1126N|uniprotkb:285726|uniprotkb:P00112
Then the above will output:
['285726', 'P00112']
Replace sep = "|" with whatever the column separator would be.
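If other fields such as refseq are needed too, the same split-on-separator idea can be taken one step further by collecting every key:value field into a dictionary. The following is a rough sketch rather than part of the answer above: parse_fields is a hypothetical helper name, and it assumes each field has the form key:value, with fields lacking a colon (like the leading ID) skipped.

```python
def parse_fields(line, sep="|"):
    """Map each 'key:value' field of a line to the list of its values."""
    fields = {}
    for field in line.strip().split(sep):
        key, colon, value = field.partition(":")
        if colon:  # skip fields without a colon, such as the leading ID
            fields.setdefault(key, []).append(value)
    return fields

# Sample line from the question, with the trailing newline a file would give.
record = parse_fields("DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n")
print(record.get("uniprotkb", []))
```

Values are kept in lists so that a line with two uniprotkb fields, like the example above, keeps both.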
For one thing, I’d suggest you use the csv module to read a TSV file.
But generally, you can use a regular expression:
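import re
# (?<=\buniprotkb:) is a lookbehind: the match must be preceded by the word "uniprotkb:"
# \w+ then matches the run of alphanumeric characters that follows
regex = re.compile(r"(?<=\buniprotkb:)\w+")
f = open('m.txt')
for line in f:
    match = regex.search(line)
    if match:
        print(match.group())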
The regular expression matches a string of alphanumeric characters if it’s preceded by uniprotkb:.
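The csv suggestion above could look roughly like the sketch below. It assumes the columns are separated by a pipe, as in the sample lines from the question; for a real TSV file the delimiter would be "\t" instead. The name uniprot_ids is made up for illustration, and an in-memory io.StringIO stands in for open('m.txt') so the snippet runs on its own (Python 3).

```python
import csv
import io

PREFIX = "uniprotkb:"

def uniprot_ids(rows):
    """Yield the text after PREFIX for every column that starts with it."""
    for row in rows:
        for column in row:
            if column.startswith(PREFIX):
                yield column[len(PREFIX):]

# Stand-in for open('m.txt'); the lines are the samples from the question.
data = io.StringIO(
    "DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n"
    "DDD-1081N|uniprotkb:P12121\n"
)
# Assumption: pipe-delimited columns; use delimiter="\t" for tab-separated data.
reader = csv.reader(data, delimiter="|")
ids = list(uniprot_ids(reader))
print(ids)
```

Letting csv.reader do the splitting also takes care of line endings, so no manual strip() is needed.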