While researching archiving systems like archive.org, I found that the main issue for such systems is dynamic content.
Initial analysis shows that a page's ‘dynamicity’ can be assigned to one of the following levels:
- Level 1: Static HTML content – a plain old web page represented only by HTML markup with auxiliary CSS-referenced resources (usually images).
- Level 2: Static HTML powered by JavaScript – same as Level 1, but with JavaScript code that only manipulates existing markup (such as expand/collapse).
- Level 3: “Onload” page construction – a web page whose JavaScript code makes additional requests during the page load phase. After the load phase, the page content is fully constructed.
- Level 4: Dynamic client-side content – UI elements are modified by JavaScript code on the go as the user traverses the interface. Typical examples are modern SPAs (single-page applications, like gmail.com), “endless” lists (the list tail is loaded when the user scrolls to the bottom), content loaded on demand (smart expanders), and so on.
So I assume that Levels 1 and 2 can be archived pretty easily. Could you please suggest how to handle Levels 3 and 4? It looks like it should involve page rendering, but some details would be helpful.
Update: To clarify the question: ideally, the offline version should be fully functional, at least within the site (ignoring content from external domains). Also, if Level 4 is too hard to automate fully, is there an approach involving a human operator who gives the system hints about the content?
It is doable: integrate a WebKit browser into your crawler. Then fetch all static pages first, and log the requests made by each page (you can do this because the browser in your crawler actually renders the page).
That will give you an overview of the responses involved.
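As a minimal sketch of that idea (not the exact setup described above), here is how you could render a page in headless WebKit with Playwright and log every request the page makes while loading; the library choice, function name, and `networkidle` wait are my assumptions.

```python
# Sketch: render a page in headless WebKit and log the requests it makes.
# Library (Playwright) and function name are illustrative assumptions.
from playwright.sync_api import sync_playwright

def log_page_requests(url: str) -> list[str]:
    requested_urls: list[str] = []
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)   # WebKit engine, as suggested above
        page = browser.new_page()
        # Record every request the rendered page issues (XHRs, scripts, images, ...).
        page.on("request", lambda request: requested_urls.append(request.url))
        # Wait until the network is quiet so "onload" requests are captured too.
        page.goto(url, wait_until="networkidle")
        browser.close()
    return requested_urls

if __name__ == "__main__":
    for u in log_page_requests("https://example.com"):
        print(u)
```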
The onload requests are easy because they are made directly during page load. The harder part is clickable elements which load additional content. To handle those, find everything with a custom event attached and execute the event to see what happens; if the content changes, you will know.
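A rough sketch of the "execute the event and see what happens" step follows. Detecting every attached listener is not generally possible from page scripts, so this falls back to a heuristic I chose myself: click elements that look interactive and compare the DOM before and after. The selector, timeouts, and function name are assumptions, not a prescribed method.

```python
# Sketch: click likely-interactive elements and detect DOM changes.
from playwright.sync_api import sync_playwright

CLICKABLE = "button, [onclick], [role=button]"  # heuristic selector, not exhaustive

def probe_clickables(url: str) -> None:
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        count = page.locator(CLICKABLE).count()
        for i in range(count):
            element = page.locator(CLICKABLE).nth(i)
            before = page.content()            # DOM snapshot before the click
            try:
                element.click(timeout=1000)
                page.wait_for_timeout(500)     # give any triggered XHR a moment
            except Exception:
                continue                       # hidden/detached elements, timeouts
            if page.content() != before:
                print(f"element {i} changed the page content")
            # A real crawler would reload or restore the page here so that
            # one click does not affect the probing of the next element.
        browser.close()
```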
The responses can be cached, so you can create a fully working offline version.
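Below is a sketch of caching the logged responses so an offline copy can serve them later; the file layout, hashing scheme, and index format are my own assumptions, not a prescribed archive format.

```python
# Sketch: snapshot a page's responses to disk, keyed by a hash of the URL.
import hashlib
import pathlib
from playwright.sync_api import sync_playwright

def snapshot(url: str, out_dir: str = "archive") -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    responses = []
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        page = browser.new_page()
        # Collect response objects; bodies are read after the page settles.
        page.on("response", lambda r: responses.append(r))
        page.goto(url, wait_until="networkidle")
        index_lines = []
        for resp in responses:
            try:
                body = resp.body()
            except Exception:
                continue                       # e.g. redirects have no body
            name = hashlib.sha256(resp.url.encode()).hexdigest()
            (out / name).write_bytes(body)
            index_lines.append(f"{name}\t{resp.url}")
        # Simple index mapping cached files back to their original URLs.
        (out / "index.tsv").write_text("\n".join(index_lines))
        browser.close()
```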
The main considerations are processing time and browser issues. Crawling this way is much slower than textual parsing, since you actually need to load and render each page.