I am building a text editor which makes use of a Ragel based tokenizer to support syntax highlighting. I am considering the use of a rope data structure to support efficient modifications and undo/redo operations. Is there a standard approach for tokenizing or searching text contained in this type of data structure? Some characters can cause the tokenizer to consume the rest of the stream.
2
I’m familiar with the underlying state-machine approach described in your link — it’s been around for decades. It can tokenise/categorise any stream of text that supports a get-next-character operation.
I’m familiar with ropes in the context of a text editor. The usual purpose (as I know it) is to break the strings into portions that have the same display attributes: colour, font, link, line breaks etc. This works well for both editing and display. The major operations are: insert character; delete character; delete token; cut and paste. Maintaining the rope is not easy.
It’s not obvious from your question whether you expect the tokeniser to generate the rope. Is it a rope of tokens, where each token has its own display attributes? I would be troubled that editing and tokenisation could interfere with each other. Things like quotes and comments can reach a long way.
No, I don’t there is a standard way of combining these ideas. I suspect the right idea is for the tokeniser to generate/regenerate the rope, immediately for the on-screen portion and in the background for the rest. Edits should affect the rope (only) with a short delay before retokenising. The tokenising needs to be interruptible too.
It feels like a reasonable approach, but I’m sure there are many challenges in making it work well. You might want to read some source code (Eclipse? Netbeans?) to see how others do it.