I am defining a Bison grammar for HTTP and am running into a number of shift/reduce conflicts in how optional whitespace is handled at the end of a header field value. Here is a simplified version of the grammar that deals only with the header field portion:
%{
#define YYDEBUG 1
int yylex(void);
void yyerror(const char *s);
%}
%token X00
%%
/* Reference - RFC 7230 */
headerField : fieldName ':' OWS fieldValue OWS CRLF | fieldName ':' OWS CRLF ;
fieldValue : fieldValueSS | fieldValue fieldValueSS ;
fieldValueSS : fieldContent | obsFold ;
fieldContent : fieldVChar optFieldContent ;
optFieldContent : /* empty */ | RWS fieldVChar ;
fieldVChar : obsText | VCHAR ;
/* Even though this line fold is obsolete still account for it for consistency with the RFC */
obsFold : CRLF RWS ;
fieldName : token ;
token : tChar | token tChar ;
tChar : '!' | '#' | '$' | '%' | '&' | '\x27' | '*' | '+' | '-' | '.' | '^' | '_' | '`' | '|' | '~' | DIGIT | ALPHA ;
obsText : r__1 ;
/* Tokens */
ALPHA : LOALPHA | UPALPHA ;
LOALPHA : r__2 ;
UPALPHA : r__3 ;
DIGIT : r__4 ;
SP : '\x20' ;
HTAB : '\x09' ;
WS : SP | HTAB ;
OWS : /* empty */ | OWS WS ;
RWS : WS | RWS WS ;
VCHAR : r__5 ;
LF : '\x0a' ;
CR : '\x0d' ;
CRLF : CR LF ;
/* Range Expansions */
r__1 :
'\x80' | '\x81' | '\x82' | '\x83' | '\x84' | '\x85' | '\x86' | '\x87' |
'\x88' | '\x89' | '\x8a' | '\x8b' | '\x8c' | '\x8d' | '\x8e' | '\x8f' |
'\x90' | '\x91' | '\x92' | '\x93' | '\x94' | '\x95' | '\x96' | '\x97' |
'\x98' | '\x99' | '\x9a' | '\x9b' | '\x9c' | '\x9d' | '\x9e' | '\x9f' |
'\xa0' | '\xa1' | '\xa2' | '\xa3' | '\xa4' | '\xa5' | '\xa6' | '\xa7' |
'\xa8' | '\xa9' | '\xaa' | '\xab' | '\xac' | '\xad' | '\xae' | '\xaf' |
'\xb0' | '\xb1' | '\xb2' | '\xb3' | '\xb4' | '\xb5' | '\xb6' | '\xb7' |
'\xb8' | '\xb9' | '\xba' | '\xbb' | '\xbc' | '\xbd' | '\xbe' | '\xbf' |
'\xc0' | '\xc1' | '\xc2' | '\xc3' | '\xc4' | '\xc5' | '\xc6' | '\xc7' |
'\xc8' | '\xc9' | '\xca' | '\xcb' | '\xcc' | '\xcd' | '\xce' | '\xcf' |
'\xd0' | '\xd1' | '\xd2' | '\xd3' | '\xd4' | '\xd5' | '\xd6' | '\xd7' |
'\xd8' | '\xd9' | '\xda' | '\xdb' | '\xdc' | '\xdd' | '\xde' | '\xdf' |
'\xe0' | '\xe1' | '\xe2' | '\xe3' | '\xe4' | '\xe5' | '\xe6' | '\xe7' |
'\xe8' | '\xe9' | '\xea' | '\xeb' | '\xec' | '\xed' | '\xee' | '\xef' |
'\xf0' | '\xf1' | '\xf2' | '\xf3' | '\xf4' | '\xf5' | '\xf6' | '\xf7' |
'\xf8' | '\xf9' | '\xfa' | '\xfb' | '\xfc' | '\xfd' | '\xfe' | '\xff' ;
r__2 :
'\x61' | '\x62' | '\x63' | '\x64' | '\x65' | '\x66' | '\x67' | '\x68' |
'\x69' | '\x6a' | '\x6b' | '\x6c' | '\x6d' | '\x6e' | '\x6f' | '\x70' |
'\x71' | '\x72' | '\x73' | '\x74' | '\x75' | '\x76' | '\x77' | '\x78' |
'\x79' | '\x7a' ;
r__3 :
'\x41' | '\x42' | '\x43' | '\x44' | '\x45' | '\x46' | '\x47' | '\x48' |
'\x49' | '\x4a' | '\x4b' | '\x4c' | '\x4d' | '\x4e' | '\x4f' | '\x50' |
'\x51' | '\x52' | '\x53' | '\x54' | '\x55' | '\x56' | '\x57' | '\x58' |
'\x59' | '\x5a' ;
r__4 :
'\x30' | '\x31' | '\x32' | '\x33' | '\x34' | '\x35' | '\x36' | '\x37' |
'\x38' | '\x39' ;
r__5 :
'\x21' | '\x22' | '\x23' | '\x24' | '\x25' | '\x26' | '\x27' | '\x28' |
'\x29' | '\x2a' | '\x2b' | '\x2c' | '\x2d' | '\x2e' | '\x2f' | '\x30' |
'\x31' | '\x32' | '\x33' | '\x34' | '\x35' | '\x36' | '\x37' | '\x38' |
'\x39' | '\x3a' | '\x3b' | '\x3c' | '\x3d' | '\x3e' | '\x3f' | '\x40' |
'\x41' | '\x42' | '\x43' | '\x44' | '\x45' | '\x46' | '\x47' | '\x48' |
'\x49' | '\x4a' | '\x4b' | '\x4c' | '\x4d' | '\x4e' | '\x4f' | '\x50' |
'\x51' | '\x52' | '\x53' | '\x54' | '\x55' | '\x56' | '\x57' | '\x58' |
'\x59' | '\x5a' | '\x5b' | '\x5c' | '\x5d' | '\x5e' | '\x5f' | '\x60' |
'\x61' | '\x62' | '\x63' | '\x64' | '\x65' | '\x66' | '\x67' | '\x68' |
'\x69' | '\x6a' | '\x6b' | '\x6c' | '\x6d' | '\x6e' | '\x6f' | '\x70' |
'\x71' | '\x72' | '\x73' | '\x74' | '\x75' | '\x76' | '\x77' | '\x78' |
'\x79' | '\x7a' | '\x7b' | '\x7c' | '\x7d' | '\x7e' ;
Looking at the .output file I can see that there are 2 conflicts in State 321:
State 321
7 fieldContent: fieldVChar • optFieldContent
' ' shift, and go to state 109
'\t' shift, and go to state 110
' ' [reduce using rule 8 (optFieldContent)]
'\t' [reduce using rule 8 (optFieldContent)]
$default reduce using rule 8 (optFieldContent)
optFieldContent go to state 335
SP go to state 324
HTAB go to state 325
WS go to state 336
RWS go to state 337
It appears to me that this conflict occurs when the parser has read a few visible characters and then sees a whitespace character. It does not know whether to continue accumulating characters in fieldVChar or to reduce an empty optFieldContent.
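The lookahead problem in State 321 can be illustrated outside the grammar. The C sketch below is not part of the parser: classify_ws_run and enum ws_role are hypothetical names introduced only for illustration, and the sketch assumes the text following a field-vchar is available as a NUL-terminated string. It scans a whitespace run and reports how many characters had to be examined before the run could be classified as inner whitespace (followed by another field-vchar) or trailing OWS (followed by CR). Since the run can be arbitrarily long, that count is not bounded by one, which is presumably why a single token of lookahead leaves Bison with a shift/reduce choice here.

```c
#include <stddef.h>

/* Hypothetical classification of a whitespace run that follows a field-vchar. */
enum ws_role {
    WS_NONE,     /* no whitespace at this position */
    WS_INNER,    /* run is followed by another field-vchar */
    WS_TRAILING, /* run is followed by CR, i.e. it is the trailing OWS */
    WS_UNKNOWN   /* run is followed by something else (or end of input) */
};

static int is_ws(char c)
{
    return c == ' ' || c == '\t';
}

/* field-vchar = VCHAR / obs-text, i.e. 0x21-0x7e or 0x80-0xff */
static int is_field_vchar(char c)
{
    unsigned char u = (unsigned char)c;
    return u >= 0x21 && u != 0x7f;
}

/* Classify the whitespace run starting at p. *examined receives the number
 * of characters inspected: the run itself plus the one character after it. */
enum ws_role classify_ws_run(const char *p, size_t *examined)
{
    size_t i = 0;

    while (is_ws(p[i]))
        i++;
    *examined = i + 1;

    if (i == 0)
        return WS_NONE;
    if (p[i] == '\r')
        return WS_TRAILING;
    if (is_field_vchar(p[i]))
        return WS_INNER;
    return WS_UNKNOWN;
}
```

For the input " b" the function has to examine 2 characters before answering WS_INNER, and for three spaces followed by CR LF it has to examine 4 before answering WS_TRAILING; in both cases the first whitespace character alone does not settle the question.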
There are also 2 conflicts in State 340:
State 340
12 obsFold: CRLF RWS •
46 RWS: RWS • WS
' ' shift, and go to state 109
'\t' shift, and go to state 110
' ' [reduce using rule 12 (obsFold)]
'\t' [reduce using rule 12 (obsFold)]
$default reduce using rule 12 (obsFold)
SP go to state 324
HTAB go to state 325
WS go to state 343
It seems as though these conflicts arise after accepting a CRLF and at least 1 whitespace token. When the parser receives another whitespace token, it does not know whether to continue accumulating whitespace or to reduce a completed obsFold rule.
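If I read the grammar correctly, this second case may be a true ambiguity rather than only a lookahead limit: when an obsFold is the last element of fieldValue, the whitespace after its CRLF can be divided between the RWS of obsFold (at least one character) and the OWS of headerField (zero or more characters). The C sketch below simply counts those divisions for a run of n whitespace characters; count_fold_splits is a hypothetical name used only for illustration and is not part of the parser.

```c
/* Count the ways n whitespace characters that sit between the CRLF of an
 * obsFold and the final CRLF of the header field can be split into
 *   RWS (part of obsFold, length >= 1) followed by
 *   OWS (part of headerField, length >= 0). */
int count_fold_splits(int n)
{
    int count = 0;

    for (int rws_len = 1; rws_len <= n; rws_len++) {
        int ows_len = n - rws_len; /* always >= 0 inside this loop */
        (void)ows_len;             /* every such pair is a valid split */
        count++;
    }
    return count;
}
```

Under that reading, a fold followed by two whitespace characters and then CRLF has 2 possible parses and three whitespace characters give 3, so the shift (keep extending RWS) and the reduce (finish obsFold and leave the rest to OWS) are both valid continuations.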
I tried fixing the first of these conflicts (State 321) by disallowing an empty optFieldContent and instead incorporating it into the rule for fieldContent as follows:
fieldContent : fieldVChar | fieldVChar fieldContentSS ;
fieldContentSS : RWS fieldVChar ;
But this did not seem to do anything to resolve it.
I tried fixing the second of the conflicts (State 340) by defining separate rules for a field value that contains an obsFold and one that does not, as follows:
headerField : fieldName ':' OWS fieldValueWS CRLF | fieldName ':' OWS CRLF | fieldName ':' OWS fieldValue OWS CRLF ;
fieldValue : fieldValueSS | fieldValue fieldValueSS ;
fieldValueSS : fieldContent ;
fieldValueWS : fieldValueSSWS | fieldValueWS fieldValueSSWS ;
fieldValueSSWS : fieldContent | obsFold ;
This seemed to resolve the shift/reduce conflicts, but it led to a host of reduce/reduce conflicts where the parser did not know whether to reduce a fieldValueSSWS or a fieldValueSS upon reading in a character.