Ticket #12 (closed defect: fixed)

Opened 3 months ago

Last modified 3 months ago

tarantula memory usage is excessive

Reported by: someone23 Assigned to: stu
Priority: major Component: tarantula
Keywords: Cc: someone23

Description

Perhaps this isn't standard usage, but it's what I need. I've got a test that crawls my project, and it ended up crawling over 13000 links on the last run. top reports the resident memory usage as 637M. Previously I had a test that had queued up ~50000 links, but it wouldn't finish because around link 40000 or so my box ran out of memory (it's got 2 GB).

I looked into it, and it looks like most of the space is taken up by response bodies that tarantula keeps references to, since it waits until the end to write everything to disk. I'm working on a patch to make it write the detail pages as they are crawled and store only the details for the index page until the end, but I wanted to write this stuff down before I forgot. I'll post the patch here when I get it cleaned up.

Attachments

tarantula_less_mem.patch (7.0 kB) - added by someone23 on 05/02/08 17:36:28.
reduce memory usage by writing html detail pages incrementally, also made console reporting more Reporter-like
tarantula_less_mem_with_tests2.patch.txt (13.7 kB) - added by someone23 on 05/04/08 00:03:40.
updated patch with test fixes

Change History

05/02/08 02:57:04 changed by someone23

  • owner changed from muness to stu.
  • component changed from unassigned to tarantula.

05/02/08 17:36:28 changed by someone23

  • attachment tarantula_less_mem.patch added.

reduce memory usage by writing html detail pages incrementally, also made console reporting more Reporter-like

05/02/08 17:48:47 changed by someone23

I just added a patch that reduces memory usage for the above ~13K link test from ~630M to ~100M. In my case it also reduced the time taken by the test from ~24 minutes to ~12 minutes.

It works by writing the html detail pages as the results are generated and saving the details for the index page until the end. I changed it so the report object responds to 'report' and 'finish_report', where 'report' gets called with each result (through save_result) and 'finish_report' gets called at the end (through generate_reports). For HtmlReporter finish_report writes the index file using the accumulated details from the results that were passed through 'report'.
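The report/finish_report protocol described above can be illustrated with a rough Ruby sketch. This is not the attached patch: SketchResult, SketchHtmlReporter, SketchCrawler, and the detail_N.html file naming are hypothetical names made up for illustration, and only the method names report, finish_report, save_result, and generate_reports come from the comment. The point it shows is that each response body goes to disk as soon as its result arrives, so only a small per-result summary stays in memory for the index.

```ruby
require "erb"
require "fileutils"
require "tmpdir"

# Hypothetical stand-in for a crawl result; the real result object has more fields.
SketchResult = Struct.new(:url, :code, :body)

# Hypothetical reporter: writes each detail page immediately and keeps only
# a small summary hash per result for the index page.
class SketchHtmlReporter
  attr_reader :summaries

  def initialize(dir)
    @dir = dir
    @summaries = []
    FileUtils.mkdir_p(@dir)
  end

  # Called once per result. The body is written out here, so this reporter
  # holds no reference to it afterwards.
  def report(result)
    file = "detail_#{@summaries.size}.html"
    File.write(File.join(@dir, file), result.body)
    @summaries << { url: result.url, code: result.code, file: file }
  end

  # Called once at the end. Builds the index from the accumulated summaries.
  def finish_report
    rows = @summaries.map do |s|
      label = ERB::Util.html_escape("#{s[:code]} #{s[:url]}")
      %(<li><a href="#{s[:file]}">#{label}</a></li>)
    end
    File.write(File.join(@dir, "index.html"), "<ul>\n#{rows.join("\n")}\n</ul>\n")
  end
end

# Hypothetical crawler skeleton showing where the two hooks are invoked.
class SketchCrawler
  def initialize(reporters)
    @reporters = reporters
  end

  # Invoked after each request: every reporter sees the result right away.
  def save_result(result)
    @reporters.each { |reporter| reporter.report(result) }
  end

  # Invoked once after the crawl finishes.
  def generate_reports
    @reporters.each(&:finish_report)
  end
end
```

With this shape, the memory held by a reporter grows with the number of links (one small hash each) rather than with the total size of the response bodies.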

I also changed console reporting to conform more to the 'Report' way of doing things by making an IOReport that takes an IO object on initialization ($stderr in the crawler initialization). It then writes the results to the console in 'finish_report'. Perhaps I should have made that a separate patch, but I didn't think about it until just now. Hopefully it isn't too much hassle.
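A console reporter along those lines might look roughly like the sketch below. It is an assumption-laden illustration, not the code from the patch: the class name SketchIOReporter, the hash-shaped results, and the output format are all hypothetical. It only demonstrates the idea of handing an IO object to the reporter on initialization and deferring all output to finish_report.

```ruby
require "stringio"

# Hypothetical console reporter: collects one line per result and writes
# them all to the given IO object when the crawl is finished.
class SketchIOReporter
  def initialize(io)
    @io = io
    @lines = []
  end

  # Called once per result. Results are assumed to be hashes with
  # :code and :url keys for the purposes of this sketch.
  def report(result)
    @lines << "#{result[:code]} #{result[:url]}"
  end

  # Called once at the end. Nothing is printed before this point.
  def finish_report
    @lines.each { |line| @io.puts(line) }
    @io.puts("#{@lines.size} links crawled")
  end
end

# Usage: a StringIO stands in for $stderr so the output can be inspected.
out = StringIO.new
reporter = SketchIOReporter.new(out)
reporter.report({ code: 200, url: "/" })
reporter.report({ code: 404, url: "/missing" })
reporter.finish_report
```

Taking the IO object as a constructor argument means the crawler can pass $stderr while tests pass a StringIO and assert on its contents.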

05/03/08 01:45:02 changed by stu

  • status changed from new to assigned.

This patch looks pretty good, but it has no tests (and it breaks a bunch of existing tests). someone23, can you please resubmit with tests?

05/04/08 00:03:40 changed by someone23

  • attachment tarantula_less_mem_with_tests2.patch.txt added.

updated patch with test fixes

05/04/08 00:11:00 changed by someone23

Try this one on for size.

05/09/08 09:44:23 changed by jgehtland

  • status changed from assigned to closed.
  • resolution set to fixed.

Applied the patch in trunk.