Data Storage Proposal
CKAN is becoming less of a pure metadata store: we are increasingly looking at the contents of datasets as well as their descriptions. Since we have begun to duplicate functionality between the QA and Storage modules, and many other services will come to depend on storage, this should be properly architected, with at least common naming for all involved components.
Comments:
- A queue processing system is central to this, as many components need to be notified of events. I propose we use Celery: it's a monster, but the alternative is usually re-inventing it with a huge mess of cronjobs, IPC, temporary tables, data files, etc. (How do you repeat a failed job? How do you programmatically schedule a task to run at a given interval?)
- The archiver has a few functions: enforce a common naming scheme for uploaded resources, notify the processing system of resource updates, deal with locking etc. It is neither part of the crawler nor the upload form, but both should write through it. It is different from OFS in that it knows about CKAN.
- The header cache is used to support HTTP caching (304 Not Modified responses). We can base the crawler on a modified version of httplib2, but we don't need to cache response bodies, as this is done in storage.
- The crawler is controlled entirely via the queue, not cron. It will be notified of new packages quickly and, for small resources, will run a loading cycle within a few seconds of package addition.
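
To make the header cache idea concrete, here is a minimal sketch of how the crawler could keep only HTTP validators (not bodies) and use them for conditional requests. It is an illustration under assumptions, not the proposed implementation: the in-memory dict and the function names conditional_headers and record_response are hypothetical, and a real header cache would live in persistent storage and sit inside the modified HTTP client.

```python
from typing import Dict

# Hypothetical in-memory header cache: URL -> validators seen on the last fetch.
# Only headers are kept; response bodies are assumed to live in storage.
HeaderCache = Dict[str, Dict[str, str]]


def conditional_headers(cache: HeaderCache, url: str) -> Dict[str, str]:
    """Build request headers that allow the server to answer 304 Not Modified."""
    cached = cache.get(url, {})
    headers: Dict[str, str] = {}
    if "etag" in cached:
        headers["If-None-Match"] = cached["etag"]
    if "last-modified" in cached:
        headers["If-Modified-Since"] = cached["last-modified"]
    return headers


def record_response(cache: HeaderCache, url: str, status: int,
                    response_headers: Dict[str, str]) -> bool:
    """Update the cache from a response.

    Returns True when a new body arrived (status 200) and should be handed
    to the archiver; returns False for 304 and for any other status, in
    which case the cached validators are left untouched.
    """
    if status != 200:
        return False
    validators: Dict[str, str] = {}
    for name, value in response_headers.items():
        key = name.lower()  # header names are case-insensitive
        if key in ("etag", "last-modified"):
            validators[key] = value
    cache[url] = validators
    return True
```

On the first fetch of a resource, conditional_headers returns an empty dict and the crawler downloads the body. On later fetches the stored ETag and Last-Modified values are sent back, and a 304 response means the crawler can skip the archiver entirely.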
