Wednesday, May 23, 2012

How to parse bad html?

I am writing a search engine that goes to all my company affiliates websites parse html and stores them in database. These websites are really old and are not html compliant out of 100000 websites around 25% have bad html that makes it difficult to parse. I need to write a c# code that might fix bad html and then parse the contents or come up with a solution that will address above said issue. If you are sitting on idea, an actual hint or code snippet would help.





No comments:

Post a Comment