python - Scrapy spider memory leak


My spider has a serious memory leak. After 15 minutes of running it uses 5 GB of memory, and Scrapy reports (using prefs()) around 900k Request objects alive, and that's about all. What could be the reason for such a high number of living Request objects? The request count only goes up and never goes down. All other objects are close to zero.

My spider looks like this:

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# LinkCrawlItem is the item class defined in the project's items module

class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)

Here is the output of prefs():

HtmlResponse                           2   oldest: 0s ago
ExternalLinkSpider                     1   oldest: 3285s ago
LinkCrawlItem                          2   oldest: 0s ago
Request                          1663405   oldest: 3284s ago

Memory usage at the 100k scraped-pages mark can hit 40 GB on some sites (for example, on victorinox.com it reaches 35 GB of memory at the 100k scraped pages mark). On other sites it is less.

UPD:

Here is what objgraph shows for the oldest Request after some time of running:

(image: objgraph output for the oldest Request)

There are a few possible issues I can see right away.

Before starting, though, I wanted to mention that prefs() doesn't show the number of requests queued; it shows the number of Request() objects alive. It's possible to reference a Request object and keep it alive even though it's no longer queued to be downloaded.

I don't see anything in the code you've provided that would cause this, though you should keep it in mind.
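
If you want to see what is actually keeping the oldest Request alive, here is a quick sketch of how you could check, assuming objgraph is installed and you run this from the telnet console or inside the spider:

from scrapy.utils.trackref import get_oldest
import objgraph

# The oldest live Request that Scrapy's trackref machinery knows about
oldest = get_oldest('Request')

# Draw whatever still holds a reference to it; those referrers are what
# keep the object (and the memory behind it) alive
objgraph.show_backrefs([oldest], max_depth=5, filename='request_backrefs.png')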

Right off the bat I'd ask: are you using cookies? If you are not, sites that pass around a session ID variable will generate a new session ID for each page visit, so you'll keep queuing the same pages over and over again. For instance, victorinox.com has "jsessionid=18537cba2f198e3c1a5c9ee17b6c63ad" in its URL string, with the ID changing on every new page load.
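
If that turns out to be the problem, one option (besides leaving COOKIES_ENABLED at its default of True) is to strip the session ID from extracted links before they are queued. A rough sketch, assuming the ID shows up as a ;jsessionid=... path parameter, using the Rule's process_links hook (the function name here is just an example):

import re

def strip_session_id(links):
    # Drop the ;jsessionid=... path parameter so the same page is not
    # queued again under a different session ID
    for link in links:
        link.url = re.sub(r';jsessionid=[^?#;]*', '', link.url, flags=re.IGNORECASE)
    return links

# then, inside the spider class:
rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj',
              follow=True, process_links=strip_session_id),)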

Second, you may be hitting a spider trap. That is, a page that keeps reloading itself with an infinite number of new links. Think of a calendar with a "next month" and "previous month" link. I'm not directly seeing one on victorinox.com, though.
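
If you do find a trap like that, the usual fix is to deny those URL patterns in the link extractor so they are never queued. A sketch with made-up patterns:

# The regexes here are only examples; use whatever pattern the trap follows
trap_patterns = (r'/calendar/', r'[?&](month|year)=')

# inside the spider class:
rules = (Rule(LxmlLinkExtractor(allow=(), deny=trap_patterns),
              callback='parse_obj', follow=True),)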

Third, the spider in the code you provided is not constrained to a specific domain. It will extract every link it finds on every page and run parse_obj on each one. The main page of victorinox.com, for instance, has a link to http://www.youtube.com/victorinoxswissarmy, which will in turn fill your request queue with tons of YouTube links.
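
Here is a sketch of how you could keep the crawl itself on one site while still collecting external links in parse_obj (victorinox.com is only an example domain here):

class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    # With a real domain here, Scrapy's offsite filtering drops requests
    # to other sites before they pile up in the scheduler
    allowed_domains = ['victorinox.com']
    start_urls = ['http://www.victorinox.com/']

    # Only follow links that stay on the allowed domain
    rules = (Rule(LxmlLinkExtractor(allow_domains=['victorinox.com']),
                  callback='parse_obj', follow=True),)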

You'll need to troubleshoot more to find out what's going on, though.

Some strategies you may want to use:

  1. Create a new downloader middleware and log all of your requests (to a file or database). Review the requests for odd behaviour (see the sketch after this list).
  2. Limit the depth to prevent the crawl from continuing down a rabbit hole infinitely.
  3. Limit the spider to a single domain to test whether it's still a problem.
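
A sketch of the first two strategies; the class name, module path, priority, and depth value are all placeholders:

import logging

logger = logging.getLogger(__name__)

class RequestLogMiddleware(object):
    # Logs every request handed to the downloader so you can review the
    # URLs for session IDs, traps, or off-site domains
    def process_request(self, request, spider):
        logger.info('Downloading: %s (depth %s)', request.url, request.meta.get('depth'))
        return None  # None means "continue processing this request normally"

Then register it and cap the depth in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RequestLogMiddleware': 543,
}
DEPTH_LIMIT = 5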

If you find that you're legitimately generating that many requests and memory is the issue, enable the persistent job queue and save the pending requests to disk instead. I'd recommend against that as a first step, though; it's more likely the crawler isn't working the way you wanted it to.
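
For reference, turning on the persistent job queue is just a matter of setting JOBDIR, e.g. scrapy crawl external_link_spider -s JOBDIR=crawls/run-1, or in settings.py:

# With JOBDIR set, the scheduler keeps its pending request queues on disk
# instead of holding every Request object in memory
JOBDIR = 'crawls/external_link_spider-1'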

