I have the following HTML in a string and need to extract the content inside the paragraph tags. Any ideas?
Link: http://www.public-domain-content.com/books/coming_race/c1p1.shtml
I have tried:
    const string html_tag_pattern = "<[^>]+.*?>";

    static string striphtml(string inputstring)
    {
        return Regex.Replace(inputstring, html_tag_pattern, string.Empty);
    }
It removes all the HTML tags, but I don't want to remove every tag, because then there is no way to tell which content belongs to the paragraph tags.
Secondly, it leaves line breaks (\n) in the text, and applying Replace("\n", "") does not help. There is one more problem. When I apply

    int urlstart = e.Result.IndexOf("<p>"),
        urlend = e.Result.IndexOf("<p> </p></td>\r");
    string paragraph = e.Result.Substring(urlstart, urlend);
    extractedcontent.Text = paragraph.Replace(Environment.NewLine, "");

the marker
<p> </p></td>\r
appears at the end of the paragraph, but urlend does not make sure that only the paragraph is shown.
The extracted string, as shown in Visual Studio, is below.
The page is downloaded with WebClient. This is the end of the HTML page:
we provide ourselves ropes of\rsuitable length , strength- and- pardon me- must not\rdrink more to-night. our hands , feet must steady and\rfirm tomorrow.\"\r<p> </p> </td>\r </tr>\r\r <tr>\r <td height=\"25\" width=\"10%\">\r \r </td><td height=\"25\" width=\"80%\" align=\"center\">\r <font color=\"#ffffff\">\r <font size=\"4\">1</font> \r </font></td>\r <td height=\"25\" width=\"10%\" align=\"right\"><a href=\"c2p1.shtml\">next</a></td>\r </tr>\r </table>\r </center>\r</div>\r<p align=\"center\"><a href=\"index.shtml\"><b>the coming race -by- edward bulwer lytton</b></a></p>\r<p><b><center><a href=\"http://www.public-domain-content.com/encyclopedia.shtml\">encyclopedia</a> - <a href=\"http://www.public-domain-content.com/books.shtml\">books</a> - <a href=\"http://www.public-domain-content.com/religion.shtml\">religion<a/> - <a href=\"http://www.public-domain-content.com/links2.shtml\">links</a> - <a href=\"http://www.public-domain-content.com/\">home</a> - <a href=\"http://www.webmaster-headquarters.com/mb/\">message boards</a></b><br>this <a href=\"http://www.wikipedia.org/\">wikipedia</a> content licensed under <a href=\"http://www.gnu.org/copyleft/fdl.html\">gnu fr
Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or similar) instead.
A quick example:
    HtmlDocument document = new HtmlDocument();
    document.Load("your_file_here.htm");

    // Note: SelectNodes returns null when no node matches the XPath.
    foreach (HtmlNode paragraph in document.DocumentNode.SelectNodes("//p"))
    {
        // Each paragraph node is available here.
        string content = paragraph.InnerText; // or similar
    }
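On the line-break part of the question: the dump shown in the question appears to contain bare \r characters, which would explain why Replace("\n", "") and Replace(Environment.NewLine, "") find nothing to replace (Environment.NewLine is "\r\n" on Windows). A minimal sketch of one way to normalize the extracted text, assuming you want each line break turned into a single space; the names TextCleanup and CollapseLineBreaks are hypothetical and not part of any library:

```csharp
using System;
using System.Text.RegularExpressions;

static class TextCleanup
{
    // Hypothetical helper: replaces every run of \r and/or \n, together with
    // any spaces or tabs around it, with a single space, then trims the ends.
    public static string CollapseLineBreaks(string text)
    {
        return Regex.Replace(text, @"[ \t]*[\r\n]+[ \t]*", " ").Trim();
    }

    static void Main()
    {
        // Sample text mixing bare \r, \r\n and \n, as a downloaded page might.
        string sample = "ropes of\rsuitable length\r\nand strength\n";
        Console.WriteLine(CollapseLineBreaks(sample));
        // prints "ropes of suitable length and strength"
    }
}
```

The same helper could be applied to the InnerText of each paragraph node before the text is displayed.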