I have a GFF file and want to extract the gene IDs (column 9) and the strand (+ or -, column 7) in bash, but I don't know how to do this. I only want the lines whose feature type (column 3) is gene.
I tried this:
#!/bin/bash
gff_file="$1"
sequence_id="$2"
gene_ids=$(zcat "$gff_file" | awk -F'\t' -v seq_id="$sequence_id" '$1 == seq_id && $3 == "gene" {match($9, /ID=([^;]+)/, arr); print arr[1]$7}')
if [ -z "$gene_ids" ]; then
exit 0
fi
sorted_gene_ids=$(echo "$gene_ids" | sort -nk2 | awk '{print $1}')
echo "$sorted_gene_ids"
My GFF file looks like this:
[image: gff_file]
When I call the .sh file I also want to filter on the chromosome, so for example I would use it like this:
gene_id_extracter.sh example.gff.gz "chr2"
I expected the output to look like this:
Oeu046640.1+
To extract gene IDs and strands from a GFF file using bash, focusing only on lines with the feature type “gene” in column 3, you can use awk and sort.
#!/bin/bash
# Ensure a GFF file is provided
if [ $# -lt 1 ]; then
echo "Usage: $0 <gff_file>"
exit 1
fi
gff_file="$1"
# Extract gene IDs and strands from the GFF file
# (the three-argument match() used here is a GNU awk extension)
gene_ids=$(awk -F'\t' '$3 == "gene" {match($9, /ID=([^;]+)/, arr); print arr[1] $7}' "$gff_file")
# Check if gene_ids is empty
if [ -z "$gene_ids" ]; then
exit 0
fi
# Sort and print the gene IDs with their strands
sorted_gene_ids=$(echo "$gene_ids" | sort)
echo "$sorted_gene_ids"
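The script above reads an uncompressed file and does not filter on the chromosome, while the question passes a gzipped file and a sequence ID. Here is a minimal sketch of one way to cover both. It assumes the input is gzip-compressed and that the ID is stored as an `ID=` attribute in column 9, separated from other attributes by `;`. The function name `extract_gene_strands` is only an illustrative name. It uses `split` and `substr` instead of the three-argument `match`, so it should not depend on GNU awk.

```shell
#!/bin/bash
# Sketch: print "<gene ID><strand>" for every gene feature on one sequence.
# extract_gene_strands is a hypothetical helper name, not from the original scripts.
extract_gene_strands() {
    local gff_file="$1" sequence_id="$2"
    # Assumption: the GFF file is gzip-compressed (e.g. example.gff.gz)
    gzip -dc "$gff_file" |
    awk -F'\t' -v seq_id="$sequence_id" '
        # Keep only gene features on the requested sequence
        $1 == seq_id && $3 == "gene" {
            # Column 9 holds ";"-separated attributes; find the one starting with ID=
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) {
                    # Drop the leading "ID=" and append the strand from column 7
                    print substr(attrs[i], 4) $7
                    break
                }
            }
        }' | sort
}

# Run only when both arguments are given, e.g.:
#   gene_id_extracter.sh example.gff.gz "chr2"
if [ "$#" -ge 2 ]; then
    extract_gene_strands "$1" "$2"
fi
```

If no gene matches, the function prints nothing, so the explicit empty check is not needed in this version.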