python - Scrapy Endless Crawling


I have built a crawling spider using Python Scrapy against a distributor's website. I am trying to collect the URLs under the domain and, for each page, the URLs listed on that page. I then want to use Gephi to visualize the network connections of the domain.

(1) How are crawled URLs stored (in memory or on disk), and is there a crawl limit? The crawler has been running for about 4 days now and has crawled roughly 700k pages. I know Scrapy will not re-crawl a page it has already crawled, but I am wondering: as the number of pages increases, is there a limit to how many pages Scrapy can "remember"? Do the crawled URLs stay in memory, or what is the mechanism behind this?
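As far as I can tell from the docs, the default duplicate filter (RFPDupeFilter) keeps a fingerprint of every request seen in an in-memory set, so memory use grows with the number of pages and the only hard limit is RAM. A minimal sketch of persisting that state to disk instead, assuming a hypothetical job directory name:

    # settings.py -- a minimal sketch; the directory name is a placeholder.
    # With JOBDIR set, Scrapy writes the seen-request fingerprints
    # (requests.seen) and the pending request queue to this directory,
    # so the crawl can be paused (Ctrl-C once) and resumed later.
    JOBDIR = 'crawls/distributor-run'

The same setting can also be passed on the command line, e.g. scrapy crawl myspider -s JOBDIR=crawls/distributor-run (the spider name here is a placeholder).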

(2) Is there always an end when crawling a single domain? What if there isn't? By the way, should I stop the crawl now, given that I don't know when the spider will end? I'm not sure whether, with dynamic pages, "domain crawling" can become an endless task. For example, the site has a parametric search box, and every combination of search options leads to a new page (via a JavaScript call), which produces huge redundancy.
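In case it is relevant, a sketch of settings that would bound an otherwise open-ended crawl, assuming the stock CloseSpider extension and depth middleware; the numbers are placeholders to tune per site:

    # settings.py -- a sketch using built-in stop conditions.
    CLOSESPIDER_PAGECOUNT = 1000000   # stop after this many responses
    CLOSESPIDER_TIMEOUT = 86400       # ...or after 24 hours, whichever comes first
    DEPTH_LIMIT = 10                  # ignore links more than 10 hops from the start URLs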

Before I knew about Scrapy, I tried to figure out the URL patterns first and populate a list of URLs; after that, I went to each URL and scraped it using urllib2 + BeautifulSoup 4. With Scrapy, I am not quite sure how controllable this kind of "blind" crawling is.
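For reference, here is a minimal sketch of how crawl rules could make such a "blind" crawl controllable, written against the modern CrawlSpider API; the domain, the deny patterns for the search-box URLs, and the callback name are all hypothetical:

    # A minimal CrawlSpider sketch; domain, deny patterns and the
    # callback name are hypothetical placeholders.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DistributorSpider(CrawlSpider):
        name = 'distributor'
        allowed_domains = ['example.com']    # never leave the target domain
        start_urls = ['http://example.com/']

        rules = (
            # Follow ordinary links but skip URLs generated by the
            # parametric search box, which would otherwise yield an
            # endless stream of near-duplicate pages.
            Rule(LinkExtractor(deny=(r'\?search=', r'\?page=')),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Record the page and its outgoing links, e.g. to build
            # an edge list for Gephi later.
            yield {
                'url': response.url,
                'links': response.css('a::attr(href)').extract(),
            }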

there might "philosophical" questions here instead of specific questions but... appreciate thought or idea.

