I have the following bash
script that randomly subsamples a zipped FASTA file based on specifying the number of DNA sequences to sample num_sequences
. However, the code is buggy. (1) It returns more than num_sequences
sequences and (2) the FASTA headers are corrupt.
#!/bin/bash
# Check if input arguments are provided
if [ $# -ne 3 ]; then
echo "Usage: $0 input_file output_file num_sequences"
exit 1
fi
# Input arguments
input_file=$1
output_file=$2
num_sequences=$3
# Check if input file exists
if [ ! -f "$input_file" ]; then
echo "Input file not found: $input_file"
exit 1
fi
# Check if num_sequences is a positive integer
if ! [[ $num_sequences =~ ^[0-9]+$ ]]; then
echo "Invalid number of sequences: $num_sequences"
exit 1
fi
# Downsizing the FASTA file
echo "Downsizing $input_file to $output_file by randomly selecting $num_sequences sequences..."
# Determine unzip command based on input file type
if [[ "$input_file" == *.gz ]]; then
unzip_command="zcat"
else
unzip_command="cat"
fi
# Count the total number of sequences in the input file
total_sequences=$($unzip_command "$input_file" | grep -c "^>")
# Check if the requested number of sequences is greater than the total number of sequences
if [ "$num_sequences" -gt "$total_sequences" ]; then
echo "Requested number of sequences exceeds total number of sequences in input file."
exit 1
fi
# Generate a random sample of sequence indices using awk and extract sequences
$unzip_command "$input_file" | awk -v max="$total_sequences" -v n="$num_sequences" '
BEGIN {
srand(); # Initialize random number generator
RS = "n>"; # Set record separator to FASTA header format
}
NR > 1 { # Skip first empty record
seq_index = 1 + int(rand() * max); # Generate random sequence index
if (seq_index in selected || length(selected) < n) { # Select sequences
print ">" $0; # Output the complete FASTA record
selected[seq_index] = 1; # Mark this sequence index as selected
}
}
' | $unzip_command > "$output_file"
echo "Downsizing complete. Output saved to $output_file"
The output looks like (only first sequence show for brevity):
>090135ec-5580-4402-89a6-dcaa0e42b9a9
AAGGTTAACCTGGTAACTAGGACACAAGACTCCAGCACCTGGTAGTACTCACTGCGCTCACTAAACTTCTGGATGTCCAAAAAATCAAAATAAGTGTTGGTATAAAATTGGGTCTCCTCCTCCAGCTGGGTCAAAAAAAGATGTATTTAAATTTCGATCAGTTAAAAGTATTGTAATAGCTCCTGCTAAAACTGGAAGGGATAGAAGAAGAAGGAATGCAGTAATTATAACTGATCATACAAATAATGGTATTCGATCATAGAGAATAGCTCGTATATTAATAATAGTTGTAATAAAATTTACTGCACCTAAAATTGATGAAATTCCTGCTAAATGAAGAGAAAAAATAGCTAAATCTACAGATGCTCCTCTATGAGCAATTACTGAAGAAAGAGGTGGGTATACTGTTCATCCTGTTCCAGCTCCATTATTAACTATACTTCTAACAAGGAGAAGGGTTAAAGATGGAGGTAAAAGTCAAAATCTTATATTATTTATTCGTGGGAATGCTATATCGGGAGCTCCTAATATTAATGGAACTAATCAATTTCCAAATCCCCCAATTATTATTGGTATAACTATAAAAAAAATTATAATAAAAGCATGTGCTGTTACAATTACATTATAAATTTGATCATCTCCAATTAAAGATCCTGGGTGTCCAAGTTCTCTTCGAATTAAAATTCTTAAAGAGGTTCCAACTATTCCTGCTCATGCTCCGAAAATAAAATATAATGTACCAATATCTTTATGATTGGTTGACTACACACGTGTCATGCGCTACCAGGTGCTGGAGTCTTGTGTCCCAGTTACCAGGTTACA
Any ideas on what might be causing this and how to fix it?