I am trying to fetch SNP data (MAF, rs numbers, alternate and reference alleles) from this file:
1000GENOMES-phase_3.gvf
This is what my input looks like:
Primer_0|CYP2C19 NC_000010.11 100.000 23 0 0 1 23 94781749 94781771 2.65e-04 46.1 23 CCAGAGCTTGGCATATTATCT Within specified genomic region
And this is the code snippet that is supposed to do the job:
import gzip


def fetch_snps_for_primer(chrom, start, end, sequence, gvf_file):
    """Fetch SNPs from the compressed GVF file by processing it line by line in Python."""
    results = []
    # Open the compressed GVF file using gzip
    with gzip.open(gvf_file, 'rt') as gvf:  # 'rt' means read text mode
        for line in gvf:
            if line.startswith("#"):  # Skip header lines
                continue
            parts = line.strip().split("\t")  # GVF columns are tab-separated
            if len(parts) < 9:
                continue  # Skip malformed lines
            chrom_gvf = parts[0]
            pos_start = int(parts[3])  # SNP start position
            info_field = parts[8]
            # Check if this SNP is in the current primer region
            if chrom_gvf == chrom and start <= pos_start <= end:
                snp_id = None
                ref = None
                alts = []
                # Extract SNP ID, reference, and variant sequences
                for field in info_field.split(";"):
                    if field.startswith("Dbxref=dbSNP_"):
                        snp_id = field.split(":")[1]  # Extract rsID from Dbxref
                    elif field.startswith("Reference_seq="):
                        ref = field.split("=")[1]
                    elif field.startswith("Variant_seq="):
                        alts = field.split("=")[1].split(",")
                if snp_id:
                    results.append({
                        'Chromosome': chrom_gvf,
                        'Start': pos_start,
                        'End': pos_start,  # SNPs typically are a single base
                        'Primer Sequence': sequence,
                        'SNP ID': snp_id,
                        'Reference Allele': ref,
                        'Alternate Alleles': ','.join(alts)
                    })
    return results
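The attribute parsing in the loop above could also be done by first turning the ninth column into a dictionary. This is only a sketch: the helper names `parse_attributes` and `extract_rsid` are hypothetical, and the example attribute string is made up for illustration (it is not a line from the real file). It assumes the attributes are `key=value` pairs separated by `;`, and that `Dbxref` may hold several comma-separated entries.

```python
def parse_attributes(info_field):
    """Turn a 'key=value;key=value' attribute column into a dict."""
    attrs = {}
    for field in info_field.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        # partition splits on the first '=' only, so values may contain '='
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def extract_rsid(attrs):
    """Return the first dbSNP accession found in Dbxref, or None."""
    for xref in attrs.get("Dbxref", "").split(","):
        source, _, accession = xref.partition(":")
        if source.startswith("dbSNP") and accession:
            return accession
    return None


# Made-up attribute column, for illustration only
example = "ID=1;Variant_seq=T,C;Dbxref=dbSNP_151:rs0000001;Reference_seq=A"
attrs = parse_attributes(example)
print(extract_rsid(attrs), attrs["Reference_seq"], attrs["Variant_seq"].split(","))
```

Looking the keys up by name means the result does not depend on the order of the attributes within the column.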
Note: I have 29 different primer sequences, and all of them should have SNPs inside their range, but the script only finds SNPs for 20 of them.
However, when I query the file from the command line for the primers that the script reports no SNPs for, it does find data.
It would be amazing if somebody could help me!
Thanks already in advance 🙂
I have tried changing the code in many different ways, but no matter what I did, SNP data was not found for all primer sequences.
I expect SNP data for all primer sequences.
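One possible cause worth ruling out (an assumption, not confirmed against this data): if the input lines come from a BLAST-style tabular report, hits on the minus strand are usually written with the start coordinate larger than the end coordinate. In that case `start <= pos_start <= end` can never be true for those primers, which would explain why only some of them return SNPs. The sketch below normalizes the interval before comparing. The helper names `normalize_region` and `snp_in_region` are hypothetical, and the SNP position used in the example is invented.

```python
def normalize_region(start, end):
    """Return (low, high) so that low <= high, whatever the hit orientation."""
    return (start, end) if start <= end else (end, start)


def snp_in_region(pos, start, end):
    """True if pos lies inside the region, accepting reversed coordinates."""
    low, high = normalize_region(start, end)
    return low <= pos <= high


# The region is taken from the input line above; 94781760 is an invented position.
print(snp_in_region(94781760, 94781749, 94781771))  # coordinates in forward order
print(snp_in_region(94781760, 94781771, 94781749))  # same region, reversed order
```

It may also be worth printing the `chrom`, `start` and `end` values for the nine primers that return nothing, to see whether the chromosome name passed in matches the first column of the GVF file exactly.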