I tried coding in Bash with the Awk and sed commands but didn’t get the desired output. I have a text file with the following contents:
>AC201869.46386.47908 Regiella insecticola
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGCAGCGGGGAGTAGCTTGCTACTCTGCCGGCGAGCGGC
>JQ765428.1.1430 Pantoea dispersa
GCAGCTACACATGCAAGTCGAACGGCAGCACAGAAGAGCTTGCTCTTTGGGTGGCGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCCGATGGA
I need output like the following.
>Regiella insecticola
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGCAGCGGGGAGTAGCTTGCTACTCTGCCGGCGAGCGGC
>Pantoea dispersa
GCAGCTACACATGCAAGTCGAACGGCAGCACAGAAGAGCTTGCTCTTTGGGTGGCGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCCGATGGA
I tried some commands like these but they didn’t work.
sed 's/[1-9]/./g' silva_species_assignment_v132.fa -> textfile.txt
The following simple sed script will remove everything after > up through the first space.
sed 's/^>[^ ]* />/' file.fasta >newfile.fasta
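As a quick check, the same substitution can be run on the two headers from the question. This is only a sketch: the function name strip_accession is made up for illustration, the input is fed through printf instead of a file, and the sequences are shortened for readability.

```shell
# Hypothetical helper: drop everything between ">" and the first space.
strip_accession() {
  sed 's/^>[^ ]* />/'
}

# Sample records (headers from the question, sequences shortened).
printf '%s\n' \
  '>AC201869.46386.47908 Regiella insecticola' \
  'AGAGTTTGATCATGGCTCAG' \
  '>JQ765428.1.1430 Pantoea dispersa' \
  'GCAGCTACACATGCAAGTCG' | strip_accession
```

Sequence lines do not start with >, so the pattern cannot match them and they pass through unchanged.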
In your attempt, [1-9] matches any single non-zero digit, so the command replaces each such digit with a dot, which is not a useful pattern for this particular task. If you wanted to target stretches of uppercase alphabetics followed by digits and dots and ending with a space, that would be something like
sed -E 's/[A-Z]+[0-9.]+ //'
where the -E selects a more modern regex dialect so that you can use + for “one or more repetitions, as many as possible”.
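Because that pattern is not anchored, one possible refinement (my own suggestion, not something the question requires) is to put a /^>/ address in front of the substitution so that it only ever runs on header lines. The function name below is hypothetical, and the sample input reuses a header from the question with a shortened sequence.

```shell
# Hypothetical helper: apply the substitution only on lines starting with ">".
strip_id_headers_only() {
  sed -E '/^>/ s/[A-Z]+[0-9.]+ //'
}

printf '%s\n' \
  '>JQ765428.1.1430 Pantoea dispersa' \
  'GCAGCTACACATGCAAGTCG' | strip_id_headers_only
```

With the address in place, a non-header line that happened to contain uppercase letters, digits, and a space would be left alone.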
If you want an Awk solution, maybe something like
awk '/^>/ { $1=">" } 1' file.fasta >newfile.fasta
(though this will leave a space after >).
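If that extra space is a problem, a variant using Awk's sub() function should avoid it, since it edits the line directly and does not reassign $1. This is a sketch with a made-up function name, and it assumes, as in the examples, that the ID is everything between > and the first space.

```shell
# Hypothetical helper: rewrite header lines with sub(), leaving no stray space.
strip_accession_awk() {
  awk '/^>/ { sub(/^>[^ ]* /, ">") } 1'
}

printf '%s\n' \
  '>AC201869.46386.47908 Regiella insecticola' \
  'AGAGTTTGATCATGGCTCAG' | strip_accession_awk
```

Since no field is assigned, Awk does not rebuild the record, so any spacing inside the species name is preserved as written.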
All three of these work with the examples you have provided; if you have more complex examples where the species name is not simply the text after the first space, it is probably best to ask a new question with more precise requirements.