I have a cluster file produced by clustering sequences at 100% sequence identity. Each cluster is introduced by a header line with its cluster number, followed by one line per member sequence giving its ID. Here’s an example of the file format:
>Cluster 107
0 410aa, >TRINITY_DN9528_c0_g1_i1_30... *
1 410aa, >TRINITY_DN9528_c0_g1_i2_36... at 100.00%
2 404aa, >crgi_XP_011414097.1... at 100.00%
>Cluster 108
0 410aa, >TRINITY_DN11082_c0_g1_i1_69... *
1 410aa, >TRINITY_DN11082_c0_g1_i2_69... at 100.00%
>Cluster 109
0 410aa, >crgi_XP_011450995.2... *
>Cluster 110
0 407aa, >TRINITY_DN4674_c0_g1_i3_24... *
I want to write a bash script that extracts the clusters in which at least one ID contains a given string, but only when the cluster also contains other sequences. For example, for the input string “crgi”, the script should report every cluster with “crgi” in one of its IDs, except clusters where that sequence is the only member.
Here’s an example of the expected output for the input string “crgi”:
Clusters in file1.ids containing 'crgi':
107
I’ve tried using grep, awk, and cut, but I can’t get them to return only the clusters I want.
Could someone please show me how to write a bash script that does this efficiently? Any help would be greatly appreciated! Thank you.
I’ve tried with the following script but it didn’t work:
#!/bin/bash
# Define the search string
search_string="crgi"
# Loop through each file
for file in *.ids; do
    # Extract cluster numbers containing the search string
    clusters=$(awk -v search="$search_string" '$0 ~ search {print $1}' "$file")
    # Initialize a flag for presence of other sequences
    other_sequences_found=false
    # Check if clusters contain other sequences
    while read -r cluster; do
        # Check if the cluster contains other sequences besides the search string
        if [ "$(grep -c ">$search_string" "$file")" -gt 1 ]; then
            other_sequences_found=true
            break
        fi
    done <<< "$clusters"
    # If other sequences found, print the cluster numbers
    if [ "$other_sequences_found" = true ]; then
        echo "Clusters in $file containing '$search_string':"
        echo "$clusters"
        echo
    fi
done
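
For comparison, here is a minimal sketch of the behaviour described above. It is not a definitive solution. It relies on several assumptions: every cluster starts with a ">Cluster N" header line, the sequence ID is the third whitespace-separated field of each member line, the search string is matched literally (not as a regex), and a cluster qualifies when it has at least one matching ID and more than one member in total. The function name find_shared_clusters is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the numbers of clusters that have at least one
# ID containing the search string AND more than one member in total.
find_shared_clusters() {
    local search="$1" file="$2"
    awk -v search="$search" '
        # Print the cluster collected so far if it qualifies.
        function report() {
            if (cluster != "" && hits > 0 && members > 1) print cluster
        }
        # A header line closes the previous cluster and starts a new one.
        /^>Cluster/ { report(); cluster = $2; members = 0; hits = 0; next }
        # Member line: count it; assumes the ID is the third field.
        NF { members++; if (index($3, search)) hits++ }
        # The last cluster has no following header, so report it here.
        END { report() }
    ' "$file"
}

search_string="crgi"
for file in *.ids; do
    [ -e "$file" ] || continue   # skip if the glob matched no files
    clusters=$(find_shared_clusters "$search_string" "$file")
    if [ -n "$clusters" ]; then
        echo "Clusters in $file containing '$search_string':"
        echo "$clusters"
        echo
    fi
done
```

This reads each file once and tracks two counters per cluster, rather than grepping the whole file again for every match. If "other sequences" should instead mean at least one member whose ID does not contain the string, changing the condition from members > 1 to members > hits would express that.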