Why does my Scrapy crawler stop?
I have written a crawler using the Scrapy framework to parse a products site. The crawler stops partway through, without completing the full parsing process. I have researched this a lot, and most of the answers indicate that my crawler is being blocked by the website. Is there a mechanism by which I can detect whether my spider is being stopped by the website, or whether it stops on its own?
Below is the INFO-level log of the spider:
    2013-09-23 09:59:07+0000 [scrapy] INFO: Scrapy 0.18.0 started (bot: crawler)
    2013-09-23 09:59:08+0000 [spider] INFO: Spider opened
    2013-09-23 09:59:08+0000 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-09-23 10:00:08+0000 [spider] INFO: Crawled 10 pages (at 10 pages/min), scraped 7 items (at 7 items/min)
    2013-09-23 10:01:08+0000 [spider] INFO: Crawled 22 pages (at 12 pages/min), scraped 19 items (at 12 items/min)
    2013-09-23 10:02:08+0000 [spider] INFO: Crawled 31 pages (at 9 pages/min), scraped 28 items (at 9 items/min)
    2013-09-23 10:03:08+0000 [spider] INFO: Crawled 40 pages (at 9 pages/min), scraped 37 items (at 9 items/min)
    2013-09-23 10:04:08+0000 [spider] INFO: Crawled 49 pages (at 9 pages/min), scraped 46 items (at 9 items/min)
    2013-09-23 10:05:08+0000 [spider] INFO: Crawled 59 pages (at 10 pages/min), scraped 56 items (at 10 items/min)
Below is the last part of the DEBUG-level entries in the log file before the spider closed:
    2013-09-25 11:33:24+0000 [spider] DEBUG: Crawled (200) <GET http://url.html> (referer: http://site_name)
    2013-09-25 11:33:24+0000 [spider] DEBUG: Scraped from <200 http://url.html>
        // scraped data in JSON form
    2013-09-25 11:33:25+0000 [spider] INFO: Closing spider (finished)
    2013-09-25 11:33:25+0000 [spider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 36754,
         'downloader/request_count': 103,
         'downloader/request_method_count/GET': 103,
         'downloader/response_bytes': 390792,
         'downloader/response_count': 103,
         'downloader/response_status_count/200': 102,
         'downloader/response_status_count/302': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 9, 25, 11, 33, 25, 1359),
         'item_scraped_count': 99,
         'log_count/DEBUG': 310,
         'log_count/INFO': 14,
         'request_depth_max': 1,
         'response_received_count': 102,
         'scheduler/dequeued': 100,
         'scheduler/dequeued/disk': 100,
         'scheduler/enqueued': 100,
         'scheduler/enqueued/disk': 100,
         'start_time': datetime.datetime(2013, 9, 25, 11, 23, 3, 869392)}
    2013-09-25 11:33:25+0000 [spider] INFO: Spider closed (finished)
There are still pages remaining to be parsed, but the spider stops.
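For reference, one way to surface why Scrapy closed the spider is to hook the spider_closed signal and log the reason it receives. Below is a minimal sketch against the 0.18-era API (the spider name and URL are placeholders). Note that being blocked by a site usually shows up as 403/503 counts under downloader/response_status_count in the dumped stats rather than as a different close reason; the stats above show only 200s and a single 302.

    from scrapy import signals
    from scrapy.spider import BaseSpider
    from scrapy.xlib.pydispatch import dispatcher

    class ProductsSpider(BaseSpider):
        name = "spider"                    # placeholder
        start_urls = ["http://site_name"]  # placeholder

        def __init__(self, *args, **kwargs):
            super(ProductsSpider, self).__init__(*args, **kwargs)
            # run self.on_closed once when the spider closes
            dispatcher.connect(self.on_closed, signals.spider_closed)

        def on_closed(self, spider, reason):
            # 'finished' means the request pool drained on its own;
            # 'shutdown'/'cancelled' indicate an external stop instead
            spider.log("Spider closed, reason: %s" % reason)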
So far, this is what I know about how a spider works:
- There is a queue or pool of URLs to be scraped/parsed by the parsing methods. You can specify and bind a URL to a specific method, or let the default 'parse' callback do the job.
- From the parsing methods you must return/yield other request(s) to feed the pool, or item(s).
- When the pool runs out of URLs, or a stop signal is sent, the spider stops crawling (see the sketch after this list).
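A minimal sketch of that flow, using the 0.18-era API (the URLs, XPath selectors, and item fields are made up for illustration):

    import urlparse

    from scrapy.http import Request
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider

    class ProductItem(Item):
        name = Field()
        url = Field()

    class ProductsSpider(BaseSpider):
        name = "spider"
        start_urls = ["http://site_name"]  # placeholder

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # each yielded Request feeds the pool; callback= binds the URL
            # to a specific method instead of the default parse()
            for href in hxs.select('//a[@class="product"]/@href').extract():
                yield Request(urlparse.urljoin(response.url, href),
                              callback=self.parse_product)
            # follow pagination; a Request without callback= falls back to
            # parse(). If this XPath matches nothing, the pool drains and
            # the spider closes with finish_reason 'finished'
            for href in hxs.select('//a[@class="next"]/@href').extract():
                yield Request(urlparse.urljoin(response.url, href))

        def parse_product(self, response):
            hxs = HtmlXPathSelector(response)
            item = ProductItem()
            item['name'] = hxs.select('//h1/text()').extract()
            item['url'] = response.url
            yield item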
It would be nice if you shared your spider's code so we could check whether those bindings are correct. It is easy to make a binding mistake when using SgmlLinkExtractor, for example.
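For example, with a CrawlSpider the rules below look plausible but can end a crawl early (the allow patterns are made up):

    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    class ProductsSpider(CrawlSpider):
        name = "spider"
        start_urls = ["http://site_name"]  # placeholder

        rules = (
            # a Rule with a callback does NOT follow links by default;
            # without follow=True, product pages are scraped but the links
            # on them are never fed back into the pool
            Rule(SgmlLinkExtractor(allow=(r'/product/',)),
                 callback='parse_item'),
            # an allow pattern that is too narrow silently drops pagination
            # links, so the pool drains and the spider finishes early
            Rule(SgmlLinkExtractor(allow=(r'/category/\d+$',)),
                 follow=True),
        )

        def parse_item(self, response):
            pass  # extraction logic goes here

Also note that a CrawlSpider must not override the parse() method itself; doing so disables the rules entirely, which is another easy way to lose the bindings.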