In the Perl Tutorial there is this example with the /g
modifier:
$dna = "ATCGTTGAATGCAAATGACATGAC";
while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
I translated it to Common Lisp like this:
(let ((dna "ATCGTTGAATGCAAATGACATGAC"))
  (ppcre:do-matches (m-s m-e ; match-start match-end
                     "(\\w\\w\\w)*?TGA"
                     dna
                     nil
                     :start 0 :end (length dna))
    (format t "~&;; Got a TGA stop codon at position ~d"
            m-e)))
which behaves like the Perl example:
;; Got a TGA stop codon at position 18
;; Got a TGA stop codon at position 23
NIL
From the Perl tutorial:
Position 18 is good, but position 23 is bogus. What happened?
The answer is that our regexp works well until we get past the
last real match. Then the regexp will fail to match a
synchronized TGA and start stepping ahead one character position
at a time, not what we want. The solution is to use \G to anchor the match to the codon
alignment …
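To make "codon alignment" concrete, here is a minimal library-free sketch of the result an anchored search should give. It does not use CL-PPCRE at all; it simply walks the string in steps of three from position 0. The function name aligned-stop-codon-ends is made up for illustration.

```lisp
;; Library-free sketch: report the end position of every in-frame "TGA".
;; ALIGNED-STOP-CODON-ENDS is a hypothetical name, not part of any library.
(defun aligned-stop-codon-ends (dna &optional (codon "TGA"))
  "Return the end positions of every in-frame occurrence of CODON in DNA."
  (loop for start from 0 to (- (length dna) 3) by 3
        ;; compare CODON against the three characters at START
        when (string= codon dna :start2 start :end2 (+ start 3))
          collect (+ start 3)))

(aligned-stop-codon-ends "ATCGTTGAATGCAAATGACATGAC")
;; => (18)
```

Only position 18 is reported for the tutorial string, because the TGA ending at 23 starts at offset 20, which is not a multiple of three.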
The documentation of CL-PPCRE says
The following Perl features are (currently) not supported: …
\G for Perl’s pos() because we don’t have it.
But of course I want the functionality, at least so that I can cover most of the tutorial with cl-ppcre. So I tried to figure out what happens and why the regexp starts stepping ahead one character position at a time. First I looked at it with ppcre:do-scans:
(let ((dna "ATCGTTGAATGCAAATGACATGAC"))
  (ppcre:do-scans (m-s m-e
                   r-s r-e
                   "(\\w\\w\\w)*?TGA"
                   dna
                   nil
                   :start 0 :end (length dna))
    (format t "~&;; ~a ~a ~a ~a" m-s m-e r-s r-e)))
which outputs:
;; 0 18 #(12) #(15)
;; 20 23 #(NIL) #(NIL) ; <<<< Why NIL?
NIL
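A likely reading of the NIL: the second match starts directly on the TGA at position 20, so the lazy group (\w\w\w)*? is repeated zero times and its register never takes part in the match. The sketch below isolates that effect on two tiny strings; it assumes CL-PPCRE is already loaded, for example with (ql:quickload :cl-ppcre).

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).

;; The group iterates zero times, so the register is unset.
(ppcre:scan "(\\w\\w\\w)*?TGA" "TGA")
;; => 0, 3, #(NIL), #(NIL)

;; The group iterates once (over "AAA"), so the register is set.
(ppcre:scan "(\\w\\w\\w)*?TGA" "AAATGA")
;; => 0, 6, #(0), #(3)
```

So a NIL register only says that the match began right at a TGA; it says nothing about whether that TGA is in frame.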
Naive as I am, I first tried to exploit this:
(let ((dna "ATCGTTGAATGCAAATGACATGAC"))
  (ppcre:do-scans (m-s m-e
                   r-s r-e
                   "(\\w\\w\\w)*?TGA"
                   dna
                   nil
                   :start 0 :end (length dna))
    (if (null (elt r-s 0))
        (return)
        (format t "~&;; Got a TGA stop codon at position ~d"
                m-e))))
That did the job:
;; Got a TGA stop codon at position 18
NIL
But I did not expect that this would be a general solution.
Adding triplets to the “DNA” changed the picture again (of course, Frankenstein):
(let ((dna "ATCGTTGAATGCAAATGACATGACTGCTGAGTTATGAAATGCATC"))
  (ppcre:do-scans (m-s m-e
                   r-s r-e
                   "(\\w\\w\\w)*?TGA"
                   dna
                   nil
                   :start 0 :end (length dna))
    (format t "~&;; ~a ~a ~a ~a" m-s m-e r-s r-e)))
results in:
;; 0 18 #(12) #(15)
;; 18 30 #(24) #(27) <<<<<< end 30
;; 31 37 #(31) #(34) <<<<<< start 31 results in a bogus
NIL
To be sure, I visualised it like this:
;; 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17|18
;; A T C G T T G A A T G C A A A T G A|
;; | | | | | |
;; v
;; 18 19 20 21 22 23 24 25 26|27 28 29|30<
;; C A T G A C T G C T G A| G
;; | | | | ∧
;; v
;; 31 32 33 34 35 36 37 38 39 40 41 42 43 44
;; T T A T G A| A A T G C A T C
;; | | | | |
;; |-------|--------|
;; bogus
Since I thought I might get more information out of a longer
sequence, I tried again:
(let ((dna "ATCGTTGAATGCAAATGACATGACTGCTGAGTTATGAAATGCATCTGCTGAATCAAACTGAAATGAATC"))
  (ppcre:do-scans (m-s m-e
                   r-s r-e
                   "(\\w\\w\\w)*?TGA"
                   dna
                   nil
                   :start 0 :end (length dna))
    (format t "~&;; ~a ~a ~a ~a" m-s m-e r-s r-e)))
which resulted in:
;; 0 18 #(12) #(15)
;; 18 30 #(24) #(27) end pos 30
;; 30 51 #(45) #(48) <<<<<<< oops, start pos 30 and the match is correct
;; 51 66 #(60) #(63) <<<<<<< again correct.
NIL
(Again, I wanted to be really sure and completed my visualisation. But I have no doubt that you do not want to see it again.)
The picture stayed the same with an even longer “dna” of
"ATCGTTGAATGCAAATGACATGACTGCTGAGTTATGAAATGCATCTGCTGAATCAAACTGAAATGAATCAAATGCTGACCC"
Output:
;; 0 18 #(12) #(15)
;; 18 30 #(24) #(27)
;; 30 51 #(45) #(48)
;; 51 66 #(60) #(63)
;; 66 78 #(72) #(75)
NIL
So I think that imitating the \G anchor would mean telling ppcre:do-scans
that, for this search, it should always proceed the way it does in
the last two cases with the very long strings, where every match starts exactly where the previous one ended.
But right now I do not see how to tell this to ppcre:do-scans
and its relatives.
Could you help me?
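One possible direction, offered as a sketch rather than a definitive answer: since \G pins a match to the end of the previous one, the same effect can be approximated by remembering where the last match ended and leaving the loop as soon as a match starts anywhere else. The code below uses ppcre:do-matches and assumes CL-PPCRE is already loaded; the helper name anchored-stop-codon-ends and the variable expected-start are made up for illustration.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).
;; ANCHORED-STOP-CODON-ENDS is a hypothetical helper, not part of CL-PPCRE.
(defun anchored-stop-codon-ends (dna)
  "Collect match ends as long as each match starts where the previous one ended."
  (let ((expected-start 0) ; stands in for Perl's pos()
        (ends '()))
    (ppcre:do-matches (m-s m-e "(\\w\\w\\w)*?TGA" dna)
      ;; A match that starts later than expected means the scanner has
      ;; stepped ahead and lost the codon alignment, so stop here.
      (unless (= m-s expected-start)
        (return))
      (push m-e ends)
      (setf expected-start m-e))
    (nreverse ends)))

(anchored-stop-codon-ends "ATCGTTGAATGCAAATGACATGAC")
;; => (18)
```

This accepts exactly the matches shown above whose start equals the previous end (0 18, then 18 30) and rejects the ones that start at 20 or 31. It does not make the scanner itself anchored: the final, rejected scan still searches ahead before the loop exits.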