Jobs: pausing and resuming crawls
Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.
Scrapy supports this functionality out of the box by providing the following facilities:
- a scheduler that persists scheduled requests on disk
- a duplicates filter that persists visited requests on disk
- an extension that keeps some spider state (key/value pairs) persistent between batches
Job directory
To enable persistence support, define a job directory through the
JOBDIR setting.
The job directory stores all the data needed to keep the state of a single job (i.e. a spider run), so that the job can be resumed later if it is stopped cleanly.
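JOBDIR is a regular Scrapy setting, so as a sketch it can also be defined in the project's settings module instead of on the command line (the path below is illustrative); note that every run would then resume the same job until the directory is changed or removed:

# settings.py
JOBDIR = "crawls/somespider-1"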
Warning
This directory must not be shared by different spiders, or even different jobs of the same spider.
Warning
Treat the job directory with the same security care as your
Scrapy project source code. Do not point JOBDIR to a path that
untrusted parties can write to.
See also Job directory contents.
How to use it
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time by pressing Ctrl-C once or sending a signal such as SIGTERM (a second Ctrl-C forces an unclean shutdown), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
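A job directory belongs to a single job, so to start a separate, fresh crawl of the same spider, pass a different directory (name illustrative):

scrapy crawl somespider -s JOBDIR=crawls/somespider-2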
Keeping persistent state between batches
Sometimes you’ll want to keep some persistent spider state between pause/resume
batches. You can use the spider.state attribute for that, which should be a
dict. There’s a built-in extension
that takes care of serializing, storing and loading that attribute from the job
directory, when the spider starts and stops.
Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):
def parse_item(self, response):
    # parse item here
    self.state["items_count"] = self.state.get("items_count", 0) + 1
Persistence gotchas
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Pause limitations
Job pausing and resuming is only supported when the spider is stopped cleanly. A forced, sudden or otherwise unclean shutdown (e.g. killing the process, or pressing Ctrl-C a second time while the spider is already shutting down) can corrupt data in the job directory, which may prevent the spider from resuming correctly.
Request serialization
For persistence to work, Request objects must be
serializable with pickle, except for the callback and errback
values passed to their __init__ method, which must be methods of the
running Spider class.
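As a sketch of the difference (spider and URLs illustrative): a request whose callback is a method of the running spider can be written to disk and restored, while one with, say, a lambda callback cannot:

import scrapy


class SomeSpider(scrapy.Spider):
    name = "somespider"

    def parse(self, response):
        # Serializable: the callback is a method of the running spider,
        # so it can be restored by name when the job resumes.
        yield scrapy.Request(response.urljoin("/next"), callback=self.parse_item)

        # NOT serializable with JOBDIR: a lambda (or any callable that is
        # not a method of this spider) cannot be mapped back on resume.
        # yield scrapy.Request(response.urljoin("/bad"), callback=lambda r: None)

    def parse_item(self, response):
        yield {"url": response.url}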
If you wish to log the requests that could not be serialized, set the SCHEDULER_DEBUG setting to True in the project's settings module; it is False by default.
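For instance:

# settings.py
SCHEDULER_DEBUG = True  # log requests that fail to serialize (default: False)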
Job directory contents
The contents of a job directory depend on the components used during the job.
Components known to write in the job directory include the scheduler and the SpiderState
extension. See the reference documentation of the corresponding components for
details.
For example, with default settings, the job directory may look like this:
├── requests.queue
│   ├── active.json
│   └── {hostname}-{hash}
│       └── {priority}{s?}
│           ├── q{00000}
│           └── info.json
├── requests.seen
└── spider.state
Where:
- Scheduler creates the requests.queue/ directory and the active.json file, the latter containing the state data returned by DownloaderAwarePriorityQueue.close() the last time the job was paused.
- DownloaderAwarePriorityQueue creates the {hostname}-{hash} directories.
- ScrapyPriorityQueue creates the {priority}{s?} directories.
- scrapy.squeues.PickleLifoDiskQueue, a subclass of queuelib.LifoDiskQueue that uses pickle to serialize dict representations of scrapy.Request objects, creates the info.json and q{00000} files.
- RFPDupeFilter creates the requests.seen file.
- SpiderState creates the spider.state file.