I am trying to find a DNA sequence homologous to a given coding sequence in a specific organism (E. fergusoni, to root an E. coli genes phylogeny)
I want to blast one of my focal species’ sequence and get back all sequence from E. fergusoni that have between 80% and 98% similarity on the whole segment.
This work prety well on the NCBI online version of blast, by setting “Escherichia fergusonii (taxid:564)” in the “Organism” field. However, I want to automatise this process to be able to run it quickly on a large set of disinct E. coli genes.
I want to use the NCBIWWW.qblast function from the Blast module with argument “entrez_query” to specify the organism I want to restrain my research to.
This is what I tried :
Blast.email = [email protected]
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
sequence_data = open("gene0.fasta").read()
result_handle = NCBIWWW.qblast(
program="blastn",
database="nt",
sequence=sequence_data,
# entrez_query = "txid564[Organism]",
# entrez_query = "Escherichia fergusonii (taxid:564)[Organism]",
# entrez_query = "txid564[ORGN]",
# entrez_query = "564[Taxid]",
format_type="XML",
)
with open("gene0_results.xml", "w") as out_handle:
out_handle.write(result_handle.read())
for record in NCBIXML.parse(open("gene0_results.xml")):
if record.alignments:
print("n")
print("query: %s" % record.query[:100])
for align in record.alignments:
for hsp in align.hsps:
if (hsp.identities/len(hsp.query) < 0.98) and (hsp.identities/len(hsp.query) >= 0.8) :
print("match: %s " % align.title[:100])
print(f"length: {align.length}")
print(f"e value: {hsp.expect}")
print(f"identity : {hsp.identities/len(hsp.query)} ({hsp.identities} nucleotides over {len(hsp.query)})")
print(hsp.query[0:75] + "...")
print(hsp.match[0:75] + "...")
print(hsp.sbjct[0:75] + "...")
When I run this script without uncommenting the entrez_query arguments, I get the E. coli sequences that are 100% similar to my query, in about 1 mn.
But when I uncomment one of them, I don’t get any result at all (I also checked the XML file to see if my later filtering was not the reason), and the request can take up to 20mn. I am looking for the correct syntax to specify the organism with entrez_query (I tried different ones based on what I saw on other posts or what seemed intuitive) but none of them give any result, contrarily to the internet blast version.
I am using Python 3.11 and Biopython 1.83 in a Jupyter Notebook with VSC 1.89.1
mmischler is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.