Featured post

c# - Usage of Server Side Controls in MVC Framework -

I am using an ASP.NET 4.0, MVC 2.0 web application. The project requirement is to use server-side controls in the application, which is not possible in the normal case. Ideally I want to use the AdRotator and DataList controls. I saw a few samples and references in the CodePlex MVC control libraries; however, I found them less useful. Can anyone tell me how to utilize these controls in an ASP.NET application along with MVC? Note: please provide functionality related to the AdRotator and DataList controls, not equivalent functionality. Thanks in advance.

MVC pages do not work like normal .NET pages, which makes the use of normal .NET components impossible. A normal .NET page uses an event-driven model to call different methods on the server side; MVC uses actions and views, a completely different way of handling things. Also, MVC does not use the ViewState that normal .NET controls require. I found an article discussing the mixing of normal .NET and MVC.
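That said, if the requirement is absolute, the WebForms (.aspx) view engine that MVC 2 uses can host server controls that merely render output. A minimal sketch, assuming a hypothetical Product model and a render-only DataList (anything that depends on ViewState, postback events, or a server-side form will still not work):

    <%@ Page Language="C#" Inherits="System.Web.Mvc.ViewPage<IEnumerable<Product>>" %>
    <script runat="server">
        protected void Page_Load(object sender, EventArgs e)
        {
            // Bind the DataList straight from the MVC model; this is
            // one-way rendering only - no ViewState, no postback.
            ProductList.DataSource = Model;
            ProductList.DataBind();
        }
    </script>
    <asp:DataList ID="ProductList" runat="server">
        <ItemTemplate>
            <span><%# Eval("Name") %></span>
        </ItemTemplate>
    </asp:DataList>

The AdRotator control can likely be dropped into a view the same way, since it only emits an image link when it renders.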

user agent - How to maintain web-politeness (avoid being banned) during web-crawl? -


  • A web bot crawling a site is using that site's bandwidth and resources.

  • Bots are numerous and serve many purposes, ranging from homemade bots, university research, and scrapers to new startups and established search engines (and probably many more categories).

Apart from the large search engines, which can potentially send traffic to a site, why do webmasters allow other bots whose purpose they do not know? What incentives do webmasters have to allow these bots?

My 2nd question is:

Should a distributed crawler with multiple crawl-agent nodes on the internet use a different user-agent string for each agent? If they all use the same UA, the benefit of scaling via multiple agents is greatly reduced, because large websites with a high crawl-delay set may take weeks or months to crawl fully.

3rd question: since robots.txt (the defined crawl-control method) works at the domain level, should a crawler have a politeness policy per domain or per IP (sometimes many websites are hosted on the same IP)?

How do you tackle such web-politeness problems? What other related things should I keep in mind?

  1. There are many useful bots besides search-engine bots, and there is a growing number of search engines. In any case, the bots you want to block are the ones using incorrect user-agent strings and ignoring robots.txt files, so how are you going to stop them? You can block them at the IP level once you detect them, but for the others it's hard.

  2. The user-agent string has nothing to do with crawl rate. Millions of browser users share the same user-agent string. Websites throttle access based on IP address. If you want to crawl a site faster you'll need more agents, but really, you shouldn't be doing that - your crawler should be polite: it should crawl each individual site slowly whilst making progress on many other sites (see the scheduler sketch after this list).

  3. A crawler should be polite per domain. A single IP may serve many different sites, but that's no sweat for the router that's passing packets to and fro; it is each individual server that limits your ability to maintain multiple connections and how much bandwidth you can consume. There's also the one-web-site-served-by-many-IP-addresses scenario (e.g. round-robin DNS or something smarter): bandwidth and connection limits on sites like these happen at the router level, so once again, be polite per domain (see the second sketch below).
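For point 2, here is a minimal sketch of that scheduling idea in C# (the class and its names are hypothetical, and it assumes a single-threaded, in-memory frontier): one queue per host, and a host is only fetched from again once its per-host delay has elapsed, so the crawler stays slow on any single site while staying busy across many.

    using System;
    using System.Collections.Generic;

    class PoliteFrontier
    {
        // One pending-URL queue per host, plus the earliest time each host
        // may be contacted again.
        private readonly Dictionary<string, Queue<Uri>> queues = new Dictionary<string, Queue<Uri>>();
        private readonly Dictionary<string, DateTime> nextAllowed = new Dictionary<string, DateTime>();
        private readonly TimeSpan delay = TimeSpan.FromSeconds(10); // per-host crawl delay

        public void Enqueue(Uri url)
        {
            if (!queues.TryGetValue(url.Host, out Queue<Uri> q))
                queues[url.Host] = q = new Queue<Uri>();
            q.Enqueue(url);
        }

        // Returns a URL from any host whose delay has elapsed, or null if
        // every host is still cooling down (or has nothing queued).
        public Uri TryDequeue()
        {
            foreach (KeyValuePair<string, Queue<Uri>> kv in queues)
            {
                nextAllowed.TryGetValue(kv.Key, out DateTime readyAt);
                if (kv.Value.Count > 0 && DateTime.UtcNow >= readyAt)
                {
                    nextAllowed[kv.Key] = DateTime.UtcNow + delay;
                    return kv.Value.Dequeue();
                }
            }
            return null;
        }
    }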
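And for point 3, in code the whole decision reduces to what you key that politeness state by. A hedged sketch (the helper name is made up):

    // Key politeness state by the host in the URL, never by the IP it
    // resolves to: two sites sharing one IP (virtual hosting) get
    // independent budgets, and one site spread across many IPs
    // (round-robin DNS) still shares a single budget.
    static string PolitenessKey(Uri url)
    {
        return url.Host.ToLowerInvariant(); // e.g. "example.com", not "93.184.216.34"
    }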

