Managing Datasets
Do I just register a dataset, or can I upload and store data online?
CKAN now allows you both to register links to datasets and to store data directly online. For more information, please see this blog post.
I've spotted some spam on CKAN
This happens occasionally. To alert the CKAN admins to a dataset that is spam or has been spammed, edit the dataset and add the tag meta.spam.
A CKAN system administrator can delete the spam revisions from the history page (/revision) and then purge them at /ckan-admin/trash, which is provided by the admin extension.
How Do I Install the CKAN Bookmarklet?
Want to automate creating a dataset from the information in a webpage? You can use the CKAN bookmarklet. It automatically extracts information such as the URL, title and description of the page and sends it to the new dataset form, where you can edit it further.
To install the bookmarklet just create a new bookmark with the following text as the location:
javascript:(function(){f='http://thedatahub.org/dataset/new?url='+encodeURIComponent(window.location.href)+'&title='+encodeURIComponent(document.title);if((n=document.getElementsByName('description')[0])&&(d=n.content)){f+='&notes='+encodeURIComponent(d);}a=function(){if(!window.open(f)){location.href=f;}};if(/Firefox/.test(navigator.userAgent)){setTimeout(a,0)}else{a()}})()
You could title the bookmark 'Add to CKAN'.
For more details on installing bookmarklets see the del.icio.us help page.
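As a readable sketch of what the bookmarklet does, the function below builds the same kind of "new dataset" URL from a page's address, title and description. The function name `buildNewDatasetUrl` and its `baseUrl` parameter are illustrative and not part of CKAN; the query parameters (`url`, `title`, `notes`) are the ones the bookmarklet itself sends.

```javascript
// Illustrative helper (not part of CKAN): builds the URL that the
// bookmarklet opens, pre-filling the new dataset form.
function buildNewDatasetUrl(pageUrl, pageTitle, description, baseUrl) {
  // Default to the address used in the bookmarklet.
  baseUrl = baseUrl || "http://thedatahub.org/dataset/new";
  let target = baseUrl +
    "?url=" + encodeURIComponent(pageUrl) +
    "&title=" + encodeURIComponent(pageTitle);
  // The description is optional: the bookmarklet only adds it when the
  // page has a <meta name="description"> element with content.
  if (description) {
    target += "&notes=" + encodeURIComponent(description);
  }
  return target;
}

console.log(buildNewDatasetUrl("http://example.org/data?id=1", "Example Data", "A test set"));
```

Encoding each value with `encodeURIComponent` keeps characters such as `?` and `&` in the page address from being read as part of the form URL.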
What's the Difference Between Deletion and Purging?
'Deleting' a dataset is not permanent (it can be resurrected), whereas 'purging' a dataset loses it forever. The recommended policy is to delete datasets that are wrongly added or duplicates, and purge datasets which are spam.
How do I Delete a dataset?
NB: You need privileges to do this: you must be either an administrator for the dataset or a CKAN system administrator. If you were the person who created the dataset (whilst logged in) then you are automatically an administrator for the dataset and so can delete it.
- When viewing the dataset, click the "Edit" tab.
- Close to the bottom of the form should be the field called "State". Change the value from "active" to "deleted".
- Click "Save".
Alternatively, you can mark the dataset for deletion by editing it and adding a suitable tag, e.g. meta.duplicate or meta.not-data, and a system administrator should clean it up in time (or ask us on the CKAN IRC channel or the ckan-discuss email list).
When a dataset is in the "deleted" state then it is still visible to its admin and sysadmins, but for general users it will not be listed or searchable.
How do deleted datasets work?
In CKAN, changes are "revisioned" so that the history of changes can be viewed and any change can be reversed if desired. 'Deleting' a dataset actually means changing the dataset's 'state' value from 'active' to 'deleted', which can then be changed back to resurrect it.
The average user will not be able to browse or search for a 'deleted' dataset, but the dataset's administrator, or a system administrator can. These particular users should check the 'state' field when viewing the dataset, to see if it is deleted or not.
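The behaviour described above can be modelled in a few lines. This is a hypothetical sketch, not CKAN's implementation: the object shapes (`state`, `admins`, `sysadmin`) and the function names are assumptions made for illustration.

```javascript
// Hypothetical model of a soft delete: only the `state` field changes,
// so the dataset can be resurrected later.
function deleteDataset(dataset) {
  return { ...dataset, state: "deleted" };
}

function restoreDataset(dataset) {
  return { ...dataset, state: "active" };
}

// Assumed shapes: dataset = { name, state, admins: [userName, ...] },
// user = { name, sysadmin } or null for an anonymous visitor.
function canSee(dataset, user) {
  if (dataset.state !== "deleted") {
    return true; // active datasets are visible to everyone
  }
  // Deleted datasets stay visible to the dataset's admins and to sysadmins.
  return Boolean(user) &&
    (user.sysadmin === true || dataset.admins.includes(user.name));
}

const original = { name: "gold-prices", state: "active", admins: ["alice"] };
const removed = deleteDataset(original);
console.log(canSee(removed, { name: "alice", sysadmin: false })); // dataset admin
console.log(canSee(removed, { name: "bob", sysadmin: false }));   // general user
console.log(restoreDataset(removed).state);
```

Purging, by contrast, would correspond to discarding the record and its history altogether, which is why it cannot be undone.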
How Do I Purge a Dataset?
(Ideal for a dataset created by a spammer: you get rid of its revision history.)
NB: You need to be a CKAN system administrator to do this.
If you are a system administrator:
- Visit the dataset's history page (/dataset/NAME/history)
- Find the earliest revision that has been done by the spammer (furthest down the list). Click on this revision (link in the Revision column).
- This revision page has a 'delete' button on the right-hand side (if you are an administrator). Click it to flag the revision for deletion.
- Go to the trash in the Administrator dashboard: /ckan-admin/trash
- Under 'Deleted revisions' click "Purge them all"
How Do I Purge a Revision?
You need to be a CKAN system administrator. If you are a system administrator:
- Visit the Revision Log page (/revision)
- There will be a "delete" button next to each revision
- Clicking on that will flag the revision for deletion
- Go to the trash in the Administrator dashboard: /ckan-admin/trash
- Under 'Deleted revisions' click "Purge them all"
How do I purge a group?
(Ideal for a group created by a spammer: you get rid of its revision history.)
NB: You need to be a CKAN system administrator.
If you are a system administrator:
- Visit the group's history page (/group/NAME/history)
- Find the earliest revision that has been done by the spammer (furthest down the list). Click on this revision (link in the Revision column).
- This revision page has a 'delete' button on the right-hand side (if you are an administrator). Click it to flag the revision for deletion.
- Go to the trash in the Administrator dashboard: /ckan-admin/trash
- Under 'Deleted revisions' click "Purge them all"
How Do I Deal with Duplicate Datasets?
- Pick one of the datasets as the dataset to keep. All other versions will be considered the "duplicates".
- For each duplicate:
  - Merge any relevant material in the duplicate into the kept dataset.
  - Optimal: put the duplicate dataset in a deleted state (requires appropriate permissions, usually that you are the dataset owner or a sysadmin):
    - Go to the dataset.
    - Find the state field dropdown and select deleted.
    - Save the dataset.
  - Less optimal:
    - Tag the duplicate with the `meta.duplicate` tag.
    - At the top of the notes field insert this, where `main-dataset` is replaced with the name of the dataset that is being kept: **Duplicate of dataset:main-dataset**
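The tag-and-note route can be sketched as a small function. The dataset shape (`tags`, `notes`) and the name `markAsDuplicate` are assumptions for illustration only; in practice you would make the same two changes through the dataset's edit form.

```javascript
// Illustrative only: applies the two manual steps (tag + note marker)
// to a plain object standing in for a dataset.
function markAsDuplicate(dataset, mainDatasetName) {
  const marker = "**Duplicate of dataset:" + mainDatasetName + "**";
  // Add the tag only once.
  const tags = dataset.tags.includes("meta.duplicate")
    ? dataset.tags
    : [...dataset.tags, "meta.duplicate"];
  // Put the marker at the top of the notes, keeping any existing notes.
  const notes = dataset.notes ? marker + "\n\n" + dataset.notes : marker;
  return { ...dataset, tags, notes };
}

const duplicate = { name: "gold-prices-copy", tags: ["economics"], notes: "Monthly prices." };
console.log(markAsDuplicate(duplicate, "gold-prices"));
```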
What about Spam and Permissions?
We favour a low barrier to editing in CKAN, to encourage and benefit from many casual additions and improvements, just as Wikipedia does. But we tend to set up CKAN to require a user to register or log in to create or edit datasets, because this is a defence against automated spam. If spam does get through, it is easy to remove it or other bad edits, since the database is versioned.
On theDataHub.org we have set two conditions:
- you need to register to create or edit a dataset
- registering a user requires completing a captcha
Users who log in to CKAN can set edit permissions on datasets and groups that they create, if it is necessary to 'lock down' or 'open up' a particular dataset.
Dataset Notes Markup
In addition to markdown syntax, dataset notes support CKAN-specific markup for linking to datasets, tags, and groups:
- dataset:dataset-name
- tag:tag-name
- group:group-name
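To show how these references might be picked out of a notes field, here is a minimal sketch. The regular expression and the function name `findReferences` are assumptions, not CKAN's own parser; in particular, the set of characters allowed in a name is a guess.

```javascript
// Assumed pattern: a type keyword, a colon, then a name made of letters,
// digits, "_" and "-", optionally with "." separators inside the name
// (so tag:meta.spam works, but a sentence-ending "." is not captured).
const REFERENCE = /\b(dataset|tag|group):([A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*)/g;

function findReferences(notes) {
  return Array.from(notes.matchAll(REFERENCE), (match) => ({
    type: match[1],
    name: match[2],
  }));
}

console.log(findReferences("See dataset:gold-prices and tag:meta.spam."));
```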
Tag Conventions
What Are Tag Families?
There is a convention of using the '.' character to create groupings of tags, for example: the meta group of tags described in the next item. Current known tag families are:
- meta e.g. meta.duplicate, meta.spam
- todo e.g. todo.list-datasets
NB: at an earlier stage there was a convention of using '-' as grouping separator. However, '-' is also used to separate words in multi-word tags so this practice is now discouraged.
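Under this convention, a tag's family is whatever precedes the first '.', and hyphens only separate words. A minimal sketch, with the illustrative (non-CKAN) function name `splitTag`:

```javascript
// Splits a tag into its family and the rest, using the first "." as the
// separator. Tags without a "." have no family.
function splitTag(tag) {
  const dot = tag.indexOf(".");
  if (dot <= 0) {
    return { family: null, name: tag };
  }
  return { family: tag.slice(0, dot), name: tag.slice(dot + 1) };
}

console.log(splitTag("meta.duplicate"));     // family "meta"
console.log(splitTag("todo.list-datasets")); // family "todo", hyphen kept in the name
console.log(splitTag("open-data"));          // no family
```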
What Are Meta Tags?
Meta tags have the prefix "meta" and are used especially for "house-keeping"-type activities around datasets, such as marking datasets as spam. Current standard meta tags are:
- meta.duplicate - duplicate of another dataset. See the entry in the dataset section on handling duplicates for this tag's use.
- meta.spam - the dataset is spam or has been spammed.
Dataset Resources
What Form Should the Hash Field Have?
Options:
- The urn:hash scheme set out in the draft (not approved) at http://tools.ietf.org/id/draft-thiemann-hash-urn-01.txt
- urn:sha1:hash or urn:tree:hash (this seems to be what magnet supports, but these seem to have no RFC associated with them)
What Hash Function Should I Use?
We recommend using sha1.
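As a sketch, a SHA-1 value for a resource's contents can be computed with Node.js's built-in `crypto` module. The `urn:sha1:` prefix follows the second option listed above; the hex encoding of the digest and the function name `sha1Urn` are assumptions, since no encoding is specified here.

```javascript
const crypto = require("crypto");

// Returns a "urn:sha1:<digest>" string for the given data (string or Buffer).
// Hex encoding is an assumption made for this example.
function sha1Urn(data) {
  const hex = crypto.createHash("sha1").update(data).digest("hex");
  return "urn:sha1:" + hex;
}

console.log(sha1Urn("abc")); // urn:sha1:a9993e364706816aba3e25717850c26c9cd0d89d
```

For a real resource you would pass the file's bytes, for example a Buffer read from disk, rather than a short string.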