About me:
I have some Python experience, but am pretty new to HTML files and parsing them, so I’m looking for pointers and tips.
About the question:
A friend of mine has asked me for coding help. He has the data from an old website saved as .html text files (this is not a scraping question). He wants me to read the files as HTML, find the sections, break each one out (time tags, data tags, etc.), and then save them into a database for later retrieval as web pages (so the HTML tags must NOT be stripped).
Python has a ton of libraries for HTML, but most appear to be specifically for scraping from a live web page. Ideally, the library will be able to read directly from the file and pull information from a specified starting tag to the matching closing tag.
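To make "from a specified starting tag to the matching closing tag" concrete, here is a rough sketch using only the standard library's `html.parser`. It assumes each record is wrapped in a single known tag (`div` in the usage note is a guess; the real files may use something else), and the names `BlockExtractor` and `extract_blocks` are made up for illustration. Tag names come back lowercased and closing tags are rebuilt rather than copied, so the output may not be byte-for-byte identical to the input.

```python
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collect the HTML of every <tag>...</tag> block, keeping the tags."""

    def __init__(self, tag):
        # convert_charrefs=False keeps entities such as &amp; unconverted.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.depth = 0      # nesting depth of the target tag
        self.parts = []     # pieces of the block being collected
        self.blocks = []    # finished blocks

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.parts.append(f"</{tag}>")
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.parts))
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self.depth:
            self.parts.append(f"<!--{data}-->")


def extract_blocks(html_text, tag):
    """Return a list of strings, one per outermost <tag> block."""
    parser = BlockExtractor(tag)
    parser.feed(html_text)
    parser.close()
    return parser.blocks
```

Reading from a saved file would then be something like `extract_blocks(open("page.html", encoding="utf-8").read(), "div")`, where `page.html` is a placeholder name. The same function could be run again on each returned block with `"time"` or `"data"` to pull out the inner pieces.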
Can anyone recommend an HTML parser that can read from a file and find the beginning and end of each block, then, for each block, parse out the time tags, data tags, etc., and hand them back to me as text (still with the HTML tags) so I can put them into the database?
I’m pretty sure this is a known and solved problem, but searching for existing questions didn’t turn up a match. Feel free to include pointers to videos that explain things I (clearly) don’t yet understand. Thanks.
As I am at the beginning of the project, I haven’t written any code yet. Really basic pseudocode for the program is shown below. The command-line parsing is for debugging flags in early development, allowing me to dump data to the screen or to other files to make sure I’m doing what my friend wants. Ultimately, the data will go into a database (the default output).
### pseudo code ###
# parse command line input (debug flags, input file name)
# open specified file(s)
# use HTML parser (from HTML lib) to strip page title off (if present)
# while !EOF
#     read a full record to a buffer (HTML parser?)
#     use HTML parser to parse each section of the record into local variables
#     depending on parameters, write the data to:
#         console
#         an HTML file
#         a CSV file (tab delimited)
#         SQL (using SQL library)
# close all files
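
The command-line and output parts of the outline above might look roughly like this. It is only a sketch under assumptions: the option names `--out` and `--dest`, the table name `records` with its two columns, the choice of SQLite as the database, and the placeholder `parse_file` are all invented for illustration, and the actual record splitting is left out.

```python
import argparse
import csv
import sqlite3


def build_arg_parser():
    """Command-line options: input files plus a (debug) output target."""
    parser = argparse.ArgumentParser(description="Split saved HTML pages into records.")
    parser.add_argument("files", nargs="+", help="saved .html files to read")
    parser.add_argument("--out", choices=["console", "html", "tsv", "sql"],
                        default="sql", help="where to write records (default: sql)")
    parser.add_argument("--dest", default="records.db",
                        help="output file or database path (unused for console)")
    return parser


def write_records(records, out, dest):
    """Write (time_text, html_block) pairs to the chosen target."""
    if out == "console":
        for time_text, block in records:
            print(time_text, block, sep="\t")
    elif out == "html":
        with open(dest, "w", encoding="utf-8") as fh:
            fh.write("\n".join(block for _, block in records))
    elif out == "tsv":
        # newline="" lets the csv module control line endings itself.
        with open(dest, "w", encoding="utf-8", newline="") as fh:
            csv.writer(fh, delimiter="\t").writerows(records)
    elif out == "sql":
        conn = sqlite3.connect(dest)
        with conn:  # commits if the inserts succeed, rolls back otherwise
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records (time_text TEXT, html TEXT)")
            conn.executemany("INSERT INTO records VALUES (?, ?)", records)
        conn.close()
    else:
        raise ValueError(f"unknown output target: {out}")


def parse_file(path):
    """Placeholder: returns the whole file as one record with no time value.

    The real version would split the page into records with an HTML parser.
    """
    with open(path, encoding="utf-8") as fh:
        return [("", fh.read())]


def main(argv=None):
    """Parse arguments, read every input file, write all records."""
    args = build_arg_parser().parse_args(argv)
    records = []
    for path in args.files:
        records.extend(parse_file(path))
    write_records(records, args.out, args.dest)
```

Calling `main(["page.html", "--out", "console"])` (with `page.html` standing in for a real file) would dump the records to the screen. With the default `--out sql` they would go into the SQLite file named by `--dest`.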