I have 17 WARC files from a crawl of one website, each about 1 GB in size. How can I merge them into a single .warc file so that the whole content of the website is in one file?
I tried the 'type' command on Windows, but the output file only refers to about 10 HTML pages, far fewer than the original files contain.
Also, I unzipped the .warc.gz files to get the .warc files, which ReplayWeb.page recognizes correctly.
I'm a complete beginner at this. Thank you for your help!
Just concatenate the WARC files, whether they are still gzipped or not:
cat 1st.warc.gz 2nd.warc.gz >all.warc.gz
cat 1st.warc 2nd.warc >all.warc
Both – all.warc.gz and all.warc – are valid WARC files.
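Since the question mentions Windows, where cat is not available by default, here is a minimal Python sketch of the same byte-for-byte concatenation. The function name concat_warcs and the file names in the usage note are placeholders for illustration, not part of any WARC tool. The files are opened in binary mode so that no bytes are translated or dropped along the way.

```python
import shutil
from pathlib import Path


def concat_warcs(inputs, output):
    """Append each input file byte-for-byte to `output`, in the given order.

    `inputs` should be either all .warc or all .warc.gz files; mixing the
    two would produce a file that is partly compressed and partly not.
    Returns the size of the output file in bytes.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Copy in 16 MiB chunks so 1 GB inputs never sit in memory whole.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
    return Path(output).stat().st_size
```

For example, with the 17 files in a hypothetical folder named warcs, calling concat_warcs(sorted(Path("warcs").glob("*.warc")), "all.warc") would write all.warc to the current directory. Keeping the output outside the input folder avoids picking it up as an input on a second run.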
One note about gzipped WARC files: the WARC standard recommends record-level compression, meaning each record is compressed as its own gzip member. As a consequence, compressing an uncompressed WARC file requires specific tools rather than a plain gzip of the whole file. For example, warcio and FastWARC each provide a "recompress" utility. All WARC tools should support reading record-level compressed WARC files, including ReplayWeb.page.
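To illustrate why concatenating .warc.gz files works, here is a small sketch using only Python's standard gzip module. The two byte strings are simplified stand-ins for WARC records, not complete valid records. Each is compressed as its own gzip member, and the members are then joined, which is what record-level compression and plain concatenation both amount to.

```python
import gzip

# Simplified placeholder "records"; real WARC records carry more headers.
record_a = b"WARC/1.0\r\nWARC-Type: resource\r\n\r\nfirst payload\r\n\r\n"
record_b = b"WARC/1.0\r\nWARC-Type: resource\r\n\r\nsecond payload\r\n\r\n"

# Record-level compression: every record becomes a separate gzip member.
member_a = gzip.compress(record_a)
member_b = gzip.compress(record_b)

# Joining the members is the same as running cat on two .warc.gz files.
combined = member_a + member_b

# A multi-member gzip stream decompresses to the members' contents in order.
restored = gzip.decompress(combined)
```

Here restored holds the two records back to back, the same bytes you would get by concatenating the uncompressed records directly.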