This is possibly a duplicate of the following questions, but I can't for the life of me figure this out.
How does Git store tree objects?
What is the internal format of a Git tree object?
Basically, I'm writing a tool to parse Git's object files, and I can't seem to parse the tree objects. Git objects are compressed with zlib, and I have that part working. I even have a parser somewhat working that uses a regex to look through the data. I've tried different regexes as well as walking through the bytes manually, but I can't seem to produce anything that works.
What I’ve tried
I won't go into specific implementation details, but basically I'm using this regex on the decompressed tree files:
let re = Regex::new("1?[0-7]{5} .+?.{20}")?;
I then do some processing to create the following tuples:
(mode, filename, file_index)
(100644, ".gitignore", [239, 191, 189, 239, 191, 189, 79, 103, 32, 108, 109, 85, 239, 191, 189, 70, 239, 191, 189, 239, 191, 189, 116, 82, 108, 239, 191, 189, 239, 191, 189, 45, 239, 191, 189, 49])
(644, "Cargo.toml", [239, 191, 189, 19, 239, 191, 189, 112, 239, 191, 189, 74, 239, 191, 189, 108, 38, 239, 191, 189, 239, 191, 189, 239, 191, 189, 52, 239, 191, 189, 5, 101, 239, 191, 189, 74, 35, 49])
(644, "README.md", [70, 239, 191, 189, 8, 239, 191, 189, 102, 239, 191, 189, 110, 25, 220, 138, 85, 239, 191, 189, 68, 84, 239, 191, 189, 23, 107, 16, 52, 48, 48])
You'll notice that the length of file_index is different for each entry and is never 20. You'll also notice the mode is cut off on the 2nd and 3rd entries.
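The triple 239, 191, 189 that keeps recurring in these arrays is the UTF-8 encoding of U+FFFD, the replacement character. One possible explanation, assuming the decompressed bytes were converted to a `String` (for example with `String::from_utf8_lossy`) before the regex ran, is that each invalid byte of the raw hash was expanded to three bytes, which would account for the varying lengths. A minimal sketch of that effect follows; `lossy_round_trip` is a hypothetical helper, not part of the code in question.

```rust
/// Hypothetical helper: returns the bytes left after a lossy round-trip
/// through `String`. Invalid UTF-8 is replaced by U+FFFD, which is
/// encoded as the three bytes 0xEF 0xBF 0xBD (239, 191, 189).
fn lossy_round_trip(raw: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(raw).into_owned().into_bytes()
}
```

For instance, `lossy_round_trip(&[0xE6, 0x4F, 0x67])` yields `[239, 191, 189, 79, 103]`: three input bytes become five, so a 20-byte hash no longer has a predictable length.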
After changing the regex a bit to use a look-ahead:
let re = Regex::new(r"[\d]{5,6} .+?.+?(?=[\d]{5,6}|$)")?;
This is the output.
Data Length: 35 (100644, ".gitignore", [239, 191, 189, 239, 191, 189, 79, 103, 32, 108, 109, 85, 239, 191, 189, 70, 239, 191, 189, 239, 191, 189, 116, 82, 108, 239, 191, 189, 239, 191, 189, 45, 239, 191, 189])
Data Length: 37 (100644, "Cargo.toml", [239, 191, 189, 19, 239, 191, 189, 112, 239, 191, 189, 74, 239, 191, 189, 108, 38, 239, 191, 189, 239, 191, 189, 239, 191, 189, 52, 239, 191, 189, 5, 101, 239, 191, 189, 74, 35])
Data Length: 28 (100644, "README.md", [70, 239, 191, 189, 8, 239, 191, 189, 102, 239, 191, 189, 110, 25, 220, 138, 85, 239, 191, 189, 68, 84, 239, 191, 189, 23, 107, 16])
Data Length: 46 (40000, "git_stats", [239, 191, 189, 17, 239, 191, 189, 239, 191, 189, 37, 93, 239, 191, 189, 37, 239, 191, 189, 105, 239, 191, 189, 239, 191, 189, 87, 107, 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189])
The SHA-1 hash really should be 20 bytes long (that is, 20 u8s that get converted to 40 hex characters), but no two values have the same length, even though the length is specified in the regex!
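For comparison, here is a minimal sketch of a parser that works on the raw bytes instead of a string. It assumes the decompressed object has the layout `tree <size>\0` followed by repeated `<mode> <name>\0<20 raw hash bytes>` entries. The names `TreeEntry`, `strip_header`, and `parse_tree_entries` are hypothetical, and the sketch does not validate the size in the header.

```rust
/// One entry of a tree object: (mode, name, raw 20-byte SHA-1).
type TreeEntry = (String, String, [u8; 20]);

/// Skips the leading `tree <size>\0` header, if present.
fn strip_header(data: &[u8]) -> &[u8] {
    if data.starts_with(b"tree ") {
        if let Some(nul) = data.iter().position(|&b| b == 0) {
            return &data[nul + 1..];
        }
    }
    data
}

/// Parses entries laid out as `<mode> <name>\0<20 raw bytes>`.
/// Returns `None` if the data is truncated or malformed.
fn parse_tree_entries(mut body: &[u8]) -> Option<Vec<TreeEntry>> {
    let mut entries = Vec::new();
    while !body.is_empty() {
        // The mode is ASCII digits terminated by a single space.
        let space = body.iter().position(|&b| b == b' ')?;
        let mode = std::str::from_utf8(&body[..space]).ok()?.to_owned();
        body = &body[space + 1..];

        // The name runs up to the next NUL byte.
        let nul = body.iter().position(|&b| b == 0)?;
        let name = String::from_utf8_lossy(&body[..nul]).into_owned();
        body = &body[nul + 1..];

        // The hash is exactly 20 raw bytes and is never treated as text.
        if body.len() < 20 {
            return None;
        }
        let (hash, rest) = body.split_at(20);
        let mut sha = [0u8; 20];
        sha.copy_from_slice(hash);
        entries.push((mode, name, sha));
        body = rest;
    }
    Some(entries)
}
```

Because the hash is sliced by length rather than matched as text, bytes that happen to be invalid UTF-8, a space, or a NUL cannot shift the entry boundaries.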
I've also tried parsing without regex and even resorted to asking ChatGPT, but nothing I tried worked. I assume it's something to do with strings being weird, but I have no idea how to implement working code for this. This is especially frustrating because the code mentioned here /a/33039114/15474643 actually works.
Also, I do understand that SHA-1 hashes are usually shown hex encoded, meaning each byte becomes two hex characters, but I still have no idea where I'm going wrong.
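To make the byte-to-hex relationship concrete: inside a tree object the hash is stored as 20 raw bytes, and the 40-character hex form is only how it is displayed. A minimal sketch of that conversion, with `to_hex` as a hypothetical helper name:

```rust
/// Hypothetical helper: formats raw bytes as lowercase hex,
/// two characters per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

Calling `to_hex` on a 20-byte hash gives a 40-character string, for example `to_hex(&[0x00, 0x0f, 0xab])` gives `"000fab"`.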
Also, I'm using the Fancy Regex crate for the parsing.