What are the best ways of gathering information on events (of any type) from the internet, keeping in mind that different websites present information in different ways?
I was thinking ‘smart’ web crawlers, but that can turn out to be extremely challenging, simply because of the hugely varied ways that different sites present their information.
Then I was thinking of sifting through the official Twitter feeds of organisations and of people with knowledge of events, looking for the event hashtag, grabbing the tweet and dissecting it to extract the relevant information about the event.
The information I am interested in gathering is: the date and time of the event, the address where the event is being held, and any celebrities (or other famous people) attending the event.
The reason to ask here is my hope that experienced folk will open my eyes to things I’ve missed, which I am sure I have.
I would approach this as a program that screen-scrapes certain sites. So if you're looking for celebrities, you would want to find a site to use as a source, start with its home page, and download the HTML. Then you would have to teach your program to parse the HTML and pull out only the information you want.
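As a rough illustration of the parsing step, here is a minimal sketch using only Python's standard library. It assumes, purely for the example, that the source site wraps each event in an element with the CSS class "event"; the class name, the sample HTML and the helper name extract_events are all made up, and a real site will need its own rules.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EventTextParser(HTMLParser):
    """Collects the text inside every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class  # hypothetical class name to look for
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element being collected
        self.results = []   # one cleaned-up string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1            # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1             # a new match starts here
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the pieces and collapse runs of whitespace.
            self.results.append(" ".join(" ".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_events(html, target_class="event"):
    parser = EventTextParser(target_class)
    parser.feed(html)
    return parser.results


# Made-up page standing in for the HTML you downloaded.
SAMPLE_HTML = """
<html><body>
  <div class="event"><h2>Charity Gala</h2><p>12 May 2012, 7pm<br>Town Hall</p></div>
  <div class="advert">Buy tickets now</div>
  <div class="event"><h2>Film Premiere</h2></div>
</body></html>
"""

if __name__ == "__main__":
    for text in extract_events(SAMPLE_HTML):
        print(text)
```

This only gets you the raw text of each event block. Pulling the date, address and names out of that text is a separate step, and this simple depth counting assumes reasonably well-formed HTML.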
You will also want to look for URLs, as most sites will have both internal and external links. This will probably be the hardest part to filter, as you will not know what is on the other end of a URL until your program gets there.
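One way to at least tell internal links from external ones is to resolve every href against the page's own URL and compare host names. The sketch below does that with the standard library; the function name split_links and the example URLs are hypothetical, and it treats "same host" as "internal", which is an assumption that may be too strict for sites spread over several subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(page_url, html):
    """Return (internal, external) lists of absolute http(s) URLs."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolves relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        if parts.netloc == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


# Made-up page and URL for illustration only.
SAMPLE_PAGE_URL = "http://example.com/events/"
SAMPLE_LINKS_HTML = """
<a href="/about">About</a>
<a href="gala.html">Gala</a>
<a href="http://other.example.org/x">Partner site</a>
<a href="mailto:someone@example.com">Contact</a>
"""

if __name__ == "__main__":
    inside, outside = split_links(SAMPLE_PAGE_URL, SAMPLE_LINKS_HTML)
    print("internal:", inside)
    print("external:", outside)
```

It still cannot tell you whether the page behind a link is useful. You only find that out by fetching it, so you may want to follow internal links by default and be much pickier about external ones.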
The basic concept here is a web crawler, and believe me, you will need either a way to throttle the application or a very beefy server to handle the information flow. It has been several years at least since I wrote a crawler, and I was after very specific information on a very specific site.
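To give an idea of what throttling can look like, here is a minimal sketch of a breadth-first crawl loop with a fixed pause between requests and a cap on the number of pages. It is not the crawler I wrote back then. The function crawl, its parameters and the fetch_links callback are all hypothetical; fetch_links stands for whatever code downloads a page and returns the links found on it.

```python
import time
from collections import deque


def crawl(start_url, fetch_links, delay_seconds=1.0, max_pages=50,
          sleep=time.sleep):
    """Visit pages breadth-first, pausing between requests.

    fetch_links(url) is supplied by the caller and must return an
    iterable of URLs found on that page.
    """
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    visited = []                # URLs in the order they were fetched
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if visited:
            sleep(delay_seconds)    # throttle every request after the first
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny made-up link graph standing in for real pages.
FAKE_SITE = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": [],
}

if __name__ == "__main__":
    order = crawl("a", lambda url: FAKE_SITE.get(url, []),
                  delay_seconds=0.1, max_pages=3)
    print(order)
```

The seen set keeps the crawler from going round in circles, max_pages keeps a run bounded, and the delay keeps you from hammering the site. A real crawler would probably also want per-host delays and a check of the site's robots.txt.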