python - Scrapy spider memory leak
My spider has a serious memory leak. After 15 minutes of running, memory usage reaches 5 GB and Scrapy reports (using prefs()) that there are about 900k Request objects alive, and that's all. What can be the reason for such a high number of living Request objects? The Request count only goes up and never goes down. All other object counts are close to zero.
My spider looks like this:
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from scrapy.http import HtmlResponse

    # LinkCrawlItem is a custom Item defined elsewhere in the project.

    class ExternalLinkSpider(CrawlSpider):
        name = 'external_link_spider'
        allowed_domains = ['']
        start_urls = ['']
        rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

        def parse_obj(self, response):
            if not isinstance(response, HtmlResponse):
                return
            for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
                if not link.nofollow:
                    yield LinkCrawlItem(domain=link.url)
Here is the output of prefs():
    HtmlResponse                         2   oldest: 0s ago
    ExternalLinkSpider                   1   oldest: 3285s ago
    LinkCrawlItem                        2   oldest: 0s ago
    Request                        1663405   oldest: 3284s ago
UPD: Memory for 100k scraped pages can hit the 40 GB mark on some sites (for example, victorinox.com reaches 35 GB of memory at the 100k scraped pages mark). On other sites it is lower.
There are a few possible issues I can see right away.
Before starting, though, I wanted to mention that prefs() doesn't show the number of requests queued; it shows the number of Request() objects that are alive. It's possible to hold a reference to a Request object and keep it alive even if it's no longer queued to be downloaded.
I don't see anything in the code you've provided that would cause this, though you should keep it in mind.
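If you want to see which requests are staying alive, the same trackref machinery behind prefs() can be queried from the telnet console. A minimal sketch, where grouping by domain is just one convenient way to slice the data:

    from collections import Counter
    from urllib.parse import urlparse

    from scrapy.utils.trackref import get_oldest, iter_all

    # The oldest live Request often points at where the backlog started.
    oldest = get_oldest('Request')
    print(oldest.url if oldest else 'no live requests')

    # Count live Requests per domain to spot a runaway site.
    counts = Counter(urlparse(r.url).netloc for r in iter_all('Request'))
    print(counts.most_common(10))

A single domain dominating those counts is usually a good hint about where the crawl is running away.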
Right off the bat, I'd ask: are you using cookies? If not, sites that pass around a session ID as a URL variable will generate a new session ID for each page visit, so you'll keep queuing up the same pages over and over again. For instance, victorinox.com has "jsessionid=18537cba2f198e3c1a5c9ee17b6c63ad" in its URL string, with the ID changing on every new page load.
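If missing cookies do turn out to be the cause, one option (besides leaving Scrapy's cookie handling enabled, which it is by default) is to normalize URLs before they are queued so the duplicate filter can recognize revisits. A minimal sketch, where strip_session_id is a hypothetical helper and the jsessionid patterns are only examples:

    import re

    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

    def strip_session_id(url):
        # Hypothetical helper: drop ";jsessionid=..." path parameters and
        # "jsessionid=..." query arguments so revisited pages look identical.
        url = re.sub(r';jsessionid=[^?#]*', '', url, flags=re.IGNORECASE)
        url = re.sub(r'([?&])jsessionid=[^&#]*&?', r'\1', url, flags=re.IGNORECASE)
        return url.rstrip('?&')

    # process_value runs on every extracted link before it becomes a Request.
    link_extractor = LxmlLinkExtractor(allow=(), process_value=strip_session_id)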
Second, you may be hitting a spider trap. That is, a page that keeps reloading itself with a new, infinite set of links. Think of a calendar with "next month" and "previous month" links. I'm not directly seeing one on victorinox.com, though.
Third, from the code you provided, your spider is not constrained to any specific domain. It will extract every link it finds on every page and run parse_obj on each one. The main page of victorinox.com, for instance, has a link to http://www.youtube.com/victorinoxswissarmy, which will in turn fill your request queue with tons of YouTube links.
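One way to rein that in while still recording external links is to constrain the Rule's link extractor to the site being crawled; filling in allowed_domains properly also lets Scrapy's offsite filtering drop stray requests. A minimal sketch, using www.victorinox.com purely as an example domain:

    from scrapy.spiders import Rule
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

    # Inside ExternalLinkSpider, replacing the rules attribute shown above:
    # only follow links that stay on the target site; external links are still
    # collected by parse_obj, they just never get scheduled as new requests.
    rules = (
        Rule(
            LxmlLinkExtractor(allow_domains=['www.victorinox.com']),
            callback='parse_obj',
            follow=True,
        ),
    )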
You'll need to troubleshoot more to find out exactly what's going on, though.
Some strategies you may want to use:
- Create a new downloader middleware and log all of your requests (to a file or a database), then review them for odd behaviour (a sketch follows this list).
- Limit the crawl depth to prevent the spider from continuing down a rabbit hole infinitely.
- Limit the crawl to a single domain to test whether the problem persists.
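Here is a minimal sketch of the first strategy, together with the settings that cover the other two. RequestLogMiddleware is a hypothetical name, and the module path, priority and depth value are placeholders:

    import logging

    logger = logging.getLogger(__name__)

    class RequestLogMiddleware:
        # Downloader middleware that logs every outgoing request so the crawl
        # pattern can be reviewed afterwards.
        def process_request(self, request, spider):
            logger.info('Fetching %s (depth %s)', request.url, request.meta.get('depth'))
            return None  # continue normal processing

    # settings.py (placeholder module path and values):
    # DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RequestLogMiddleware': 543}
    # DEPTH_LIMIT = 5            # stop runaway recursion
    # Restricting the crawl to one site can be done via allowed_domains.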
If you find that you're legitimately generating that many requests and memory is the issue, enable the persistent job queue and save the requests to disk instead. I'd recommend against that as a first step, though; it's more likely that your crawler isn't working the way you wanted it to.
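For reference, the persistent job queue is enabled by pointing the JOBDIR setting at a directory (the name below is just a placeholder); pending requests are then serialized to disk instead of being held in memory, and the crawl can be paused and resumed:

    # settings.py -- or pass it per run:
    #   scrapy crawl external_link_spider -s JOBDIR=crawls/external_link_spider-1
    JOBDIR = 'crawls/external_link_spider-1'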