A table of contents can look like:
Preface
Table of Content
Chapter 1 ...
1.1 ...
1.1.1 ...
1.1.2 ....
1.2 ...
Summary
Exercises
Chapter 2 ...
...
Appendix ...
A ...
A.1 ...
A.2 ...
B ...
References
Index
Its logical structure is a tree of multiple levels:

    Preface
    Table of Content
    Chapter 1 ...
        1.1 ...
            1.1.1 ...
            1.1.2 ....
        1.2 ...
        Summary
        Exercises
    Chapter 2 ...
    ...
    Appendix ...
        A ...
            A.1 ...
            A.2 ...
        B ...
    References
    Index

- I wonder whether parsing a table of contents into a tree is a parsing problem with respect to some grammar (e.g. a regular grammar, a context-free grammar, or some other kind of grammar)?
- If so, how can we specify the grammar of a table of contents?
- Can your parsing method deal with an ambiguous case, e.g.

      Preface
      Table of Content
      Chapter 1 ...
      1.1 ...
      1.1.1 ...
      1.1.2 ....
      1.2 ...
      Summary
      Exercises
      Chapter 2 ...
      2.1.1 ...
      2.1.2 ...
      Appendix ...
      A ...
      A.1 ...
      A.2 ...
      B ...
      References
      Index

  where 2.1.1 ... is one level lower than Chapter 2 ..., while 1.1.1 ... is two levels lower than Chapter 1?

Thanks.
This is neither complete nor tested, but it should give you the general idea.

    start
      = outermost_line+

    outermost_line
      = no_dot_word description? '\n' one_dot_line*

    one_dot_line
      = one_dot_word description? '\n' two_dot_line*
      | two_dot_line

    two_dot_line
      = two_dot_word description? '\n' three_dot_line*
      | three_dot_line

The outermost_line rule contains any number of one_dot_line rules within it. Skipping straight to two-dot lines is handled by the | two_dot_line alternative, which passes the line through to the next layer. I wouldn't verify in the parser that the chapter numbers match up with the section numbers; I would do that in the next layer up.
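
For comparison, here is a rough Python sketch of the same layering idea without a parser generator. It is an illustration under my own assumptions, not a tested implementation of the grammar above: the names Node, depth_of, parse_toc and render are made up for this example, and the depth of a line is assumed to be the number of dots in its leading label, so unnumbered lines such as Summary end up at the outermost level, just like no_dot_word does in the grammar. A line that skips a level (2.1.1 directly after Chapter 2) simply attaches to the nearest shallower line, which plays the role of the | two_dot_line pass-through.

```python
import re
from dataclasses import dataclass, field

# Leading label of a line, e.g. "Chapter", "1.1", "A.2", "1.1.1".
LABEL = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


@dataclass
class Node:
    title: str
    depth: int
    children: list = field(default_factory=list)


def depth_of(line):
    """Assumed rule: depth = number of dots in the leading label."""
    m = LABEL.match(line.strip())
    return m.group(0).count(".") if m else 0


def parse_toc(lines):
    """Build a tree from flat TOC lines using a stack of open ancestors."""
    root = Node("<root>", -1)  # sentinel, shallower than any real line
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        node = Node(line, depth_of(line))
        # Close every open entry that is not strictly shallower than this one.
        while stack[-1].depth >= node.depth:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, indent=0):
    """Return the tree as a list of indented lines."""
    out = []
    for child in node.children:
        out.append("    " * indent + child.title)
        out.extend(render(child, indent + 1))
    return out


toc = [
    "Chapter 1 ...", "1.1 ...", "1.1.1 ...", "1.1.2 ...", "1.2 ...",
    "Chapter 2 ...", "2.1.1 ...", "2.1.2 ...",
]
print("\n".join(render(parse_toc(toc))))
```

With this input, 2.1.1 and 2.1.2 come out as direct children of Chapter 2, while 1.1.1 and 1.1.2 sit two levels below Chapter 1. Checking that a section number actually belongs to its chapter is deliberately left out here and would be a separate pass over the resulting tree.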