user agent - How to maintain web-politeness (avoid being banned) during a web crawl?

- August 15, 2012

A web bot crawling your site is using your bandwidth resources.
Bots are numerous and serve many purposes, ranging from homemade bots, university research projects, and scrapers to new startups and established search engines (and probably many more categories).

Apart from the large search engines, which can potentially send traffic to a site, why do webmasters allow other bots whose purpose they do not know? What are the incentives for webmasters to allow these bots?

2nd question:

Should a distributed crawler with multiple crawl-agent nodes on the internet use a different user-agent string for each agent? If they all use the same UA, the benefit of scaling via multiple agents is greatly reduced, because large websites with a high crawl-delay set may take weeks or months to crawl fully.

3rd question: since robots.txt (the defined crawl-control method) works at the domain level, should a crawler have a politeness policy per domain or per IP (sometimes many websites are hosted on the same IP)?

How should one tackle such web-politeness problems? Are there other related things to keep in mind?

There are many useful bots besides search engine bots, and there is a growing number of search engines. In any case, if the bots you want to block are using incorrect user-agent strings and ignoring robots.txt files, how are you going to stop them? You can block some at the IP level once you detect them, but for others it's hard.
The user agent string has nothing to do with crawl rate. Millions of browser users are all using the same user agent string. Web sites throttle access based on IP address. If you want to crawl a site faster you'll need more agents, but really, you shouldn't be doing that: your crawler should be polite and should be crawling each individual site slowly whilst making progress on many other sites.
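One way to picture "slow on each site, busy across many sites" is a URL frontier that tracks the earliest time each host may be hit again. The following is only a minimal sketch: the class name `PoliteFrontier`, the 5-second default delay, and the caller-supplied `now` clock are all assumptions made for illustration, not anything prescribed above.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hypothetical URL frontier that spaces out requests to the same host."""

    def __init__(self, default_delay=5.0):
        self.default_delay = default_delay  # seconds between hits on one host (assumed value)
        self.queues = defaultdict(deque)    # host -> pending URLs
        self.ready_at = []                  # min-heap of (earliest_allowed_time, host)
        self.scheduled = set()              # hosts that currently have a heap entry
        self.next_allowed = {}              # host -> earliest time of the next request

    def add(self, url, now=0.0):
        host = urlsplit(url).hostname
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Never schedule a host earlier than its remembered cool-down.
            when = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_at, (when, host))
            self.scheduled.add(host)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_at)
        url = self.queues[host].popleft()
        self.next_allowed[host] = now + self.default_delay
        if self.queues[host]:
            heapq.heappush(self.ready_at, (self.next_allowed[host], host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url
```

With URLs queued for several hosts, repeated calls to `next_url` hand out work for other hosts while any one host is still cooling down, so overall throughput comes from breadth rather than from hammering a single site.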
A crawler should be polite per domain. A single IP may serve many different servers, but that's no sweat for the router that's passing packets to and fro. Each individual server will limit your ability to maintain multiple connections and how much bandwidth you can consume. There's also the one-web-site-served-by-many-IP-addresses scenario (e.g. round-robin DNS or something smarter): sometimes the bandwidth and connection limits on sites like these happen at the router level, so once again, be polite per domain.
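As a rough illustration of keeping politeness state per domain rather than per IP, the sketch below stores one parsed robots.txt policy per hostname using Python's standard `urllib.robotparser`. The helper `load_policy`, the bot name `examplebot`, the hostnames, and the 5-second fallback delay are assumptions chosen for the example; a real crawler would fetch each site's robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser


def load_policy(robots_txt, user_agent, default_delay=5.0):
    """Parse robots.txt text; return (parser, delay in seconds) for user_agent.

    default_delay is an assumed fallback for sites that set no Crawl-delay.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return parser, (float(delay) if delay is not None else default_delay)


# Hypothetical robots.txt contents for two sites that might share one IP.
SITE_A = """User-agent: *
Crawl-delay: 10
Disallow: /private/
"""
SITE_B = """User-agent: *
Disallow:
"""

# Policies are keyed by hostname, not by IP address.
policies = {
    "a.example": load_policy(SITE_A, "examplebot"),
    "b.example": load_policy(SITE_B, "examplebot"),
}

parser_a, delay_a = policies["a.example"]
parser_b, delay_b = policies["b.example"]
allowed_private = parser_a.can_fetch("examplebot", "http://a.example/private/x")
allowed_public = parser_a.can_fetch("examplebot", "http://a.example/index.html")
```

Keying by hostname means two sites on one shared IP each get their own delay and their own rules, which matches how robots.txt itself is scoped.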
