Guide Economics

A Guide to Working on Economics Data in CKAN

Introduction

CKAN offers a platform for organising economic data of use to academics, policy makers and industry practitioners.

How can producers and users of economic data make use of CKAN? What are the steps?

This page is a work in progress as of early 2011. There are several sections which have [to be added] comments, and if you have any suggestions for how to add to these, please feel free to make an edit.

Overview

The front page of CKAN (http://ckan.net/) looks, at first glance, somewhat formidable. However, it is merely the front end to a powerful collection of data references from a wide array of disciplines. CKAN aims to link these data references in meaningful and useful ways.

Placing a reference to a dataset on CKAN and then adding tags which describe its characteristics opens up a world of possibilities for comparing disparate data sources, which are currently scattered around the web.

What does CKAN achieve for economic data?

The data entry process on CKAN achieves the following:

1. It stores a reference to the dataset, that is, a way for other people to find the original data set.

2. Added "tags" give other people an idea of what the data refers to in the real world. For example, tagging a dataset with "house", "prices", "uk" and "monthly" narrows things down quite nicely.

3. It makes the original data searchable from a single search portal, namely CKAN. This enables you to compare it with other similar data found through the same portal.
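To see how a handful of tags narrows a search, here is a minimal sketch in Python. It is not CKAN's own search; it filters a small hypothetical in-memory catalogue. The package names other than "econ-uk-house-prices-monthly", and the function name find_by_tags, are assumptions made up for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical catalogue: each entry stands in for a CKAN package reference.
# Only the first name appears in this guide; the others are invented examples.
PACKAGES: List[Dict] = [
    {"name": "econ-uk-house-prices-monthly",
     "tags": {"house", "prices", "uk", "monthly"}},
    {"name": "econ-uk-gdp-quarterly",
     "tags": {"gdp", "uk", "quarterly"}},
    {"name": "econ-us-house-prices-monthly",
     "tags": {"house", "prices", "us", "monthly"}},
]


def find_by_tags(packages: List[Dict], wanted: Iterable[str]) -> List[str]:
    """Return the names of packages that carry every one of the wanted tags."""
    wanted_set = {tag.lower() for tag in wanted}
    return [p["name"] for p in packages if wanted_set <= p["tags"]]


# One broad tag matches several packages; four specific tags match just one.
print(find_by_tags(PACKAGES, ["uk"]))
print(find_by_tags(PACKAGES, ["house", "prices", "uk", "monthly"]))
```

The more tags you supply, the shorter the list of matches, which is the "narrowing down" described in point 2.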

Economic data comparison could be a lot quicker

Wouldn't it be great if, instead of having to visit twenty different websites and compile local spreadsheets from them to analyse publicly available economic data, you could select the data from just one place on the fly?

The first part of the solution is that the data sets need to be compiled on a basis which makes them comparable. Ideally they should be 'machine readable'.

Machine readable means that the data are stored in such a way that a computer program can readily get at them, and knows enough about them to determine how to put the datasets together.

How to get data within reach and in a comparable format?

Google spreadsheets are one answer. They are tools offered by Google which enable many users to view and edit the same spreadsheet. If a Google spreadsheet is set to 'Publicly viewable' then anyone on the web can view it.

One approach therefore is to place publicly available economic data within a Google spreadsheet, laid out in a standard format. There would be a standardised heading section giving, say, the name of the data set and specifics such as exactly what it represents. If the data are a time series, then the periods can be written in a standard way. So, the first quarter of 2011 might be universally represented by 2011.1.
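As an illustration of the 2011.1 convention, the following Python sketch turns a calendar date into a "year.quarter" label. The function name quarter_label is an assumption, and the convention itself is only the one suggested above, not an established CKAN standard.

```python
from datetime import date


def quarter_label(d: date) -> str:
    """Format a date as 'YYYY.Q', e.g. any day in Jan-Mar 2011 gives '2011.1'."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return f"{d.year}.{quarter}"


print(quarter_label(date(2011, 2, 14)))  # 2011.1
```

If every contributor labels periods the same way, two spreadsheets from different sources can be lined up on the period column without manual relabelling.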

The key is the standard. With the standard in place, anyone could easily query CKAN-referenced economic data and draw out specific clusters of related data based on the tags.

The data could then easily be exported to econometric software in the right format.
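As a rough sketch of that export step, the Python below writes a time series keyed by standard period labels to CSV text, a format most econometric packages can import. The function name to_csv, the column name and the numbers are all placeholders invented for illustration; they are not real house price figures.

```python
import csv
import io
from typing import Dict


def to_csv(column_name: str, series: Dict[str, float]) -> str:
    """Return CSV text with a 'period' column and one value column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["period", column_name])
    # 'YYYY.Q' labels sort correctly as plain strings.
    for period in sorted(series):
        writer.writerow([period, series[period]])
    return buffer.getvalue()


# Placeholder values only, used to show the layout.
example = {"2011.1": 100.0, "2010.4": 99.0}
print(to_csv("uk_house_price_index", example))
```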

Some pointers on adding dataset references

1. Try to make it as easy as possible to find the data - CKAN does not actually store the data. It provides a way to find the data, an infrastructure for navigating the ocean of data. If you have chanced upon a great publicly available data set, it's possible to leave a way marker so that others can locate it themselves.

2. Perfection isn't required and the system cannot be broken - Boldness is welcome when adding data set references. The early cartographers provided rough approximations of key features of the geography they wanted to map, and others built upon their work. Similarly, economic data referencing is something which will evolve over time, until in due course policy makers, academics and practitioners have a better viewpoint of the economic phenomena they are interested in.

3. Understanding the technical terms isn't essential for adding packages - When you add a new package through http://www.ckan.net/package/new, the technical terms might discourage you from adding a reference: the myriad legal definitions of Open Data, and computing terms such as "keys", "urls" and "formats". However, it's not critical to understand what these refer to. What matters more is getting a package of data referenced, with some means for other people to find it.

What is a package?

[to be added]

How to add a package?

Navigate to http://ckan.net/package/new.

Title/ Name? What do these mean?

There might be some initial confusion about what the difference is between "Title" and "Name". After all, is not a title a name, of sorts? In CKAN, "Title" refers to the more wordy way to identify what the data refers to. It is longhand and more descriptive. "Name" is the shorter form, the snappier shorthand which saves you lots of words if you want to refer to the data package quickly.

For "Title", there is some free rein about how best to describe the data. It might be something like "Monthly UK house prices 1980-2011".

For "Name" it is more regimented, for the sake of keeping things consistent. Names of economics data packages start with the word "econ", followed by a short series of focused descriptive words separated by hyphens. An example might be "econ-uk-house-prices-monthly".
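The naming convention can be sketched in a few lines of Python. This is only an illustration of the "econ-" pattern described above; the function name make_package_name is an assumption, and CKAN may apply its own rules about which characters a Name can contain.

```python
import re


def make_package_name(*words: str) -> str:
    """Build a lower-case, hyphen-separated name that starts with 'econ'."""
    parts = []
    for word in words:
        # Keep runs of letters and digits; spaces and punctuation split words.
        parts.extend(re.findall(r"[a-z0-9]+", word.lower()))
    if parts[:1] != ["econ"]:
        parts.insert(0, "econ")
    return "-".join(parts)


print(make_package_name("UK", "house prices", "monthly"))
# econ-uk-house-prices-monthly
```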

URL - what does this mean?

CKAN is all about linking data scattered around the web. In some ways, therefore, this is the key field. It says to other people: based on everything else you've read about this package (its attributes, tags, descriptions and so on), here is where you go to find the authoritative original publication of this information. Note that this is not the reference to the actual data, which might at first seem confusing. It might therefore in time be useful to rename this field from "URL" to "Data description" or suchlike, as it is easy to confuse with "Resources" later on, which are the actual references to the data. In short, this field points to an authoritative web-based source of information, probably from the data providers themselves, which gives a person browsing key facts about the data "from the horse's mouth", as it were. It is effectively what gives the data its credibility as a data source and a central point of reference.

Notes - what to add here?

At first glance, "Notes" appears a forbiddingly blank canvas. What should be added here? This section provides a user-friendly outline of what the data refers to. Ideally, it should be descriptive, and in plain language where the source permits. If there are any caveats about how to interpret the data, this is a good place to flag them up. Are there any seasonality issues, or any breaks in the time series? Any information about how the data provider disseminates the data is also helpful, such as whether the provider sends out the information on a regular basis in a table with a particular name or number.

License description - navigating a confusing list.

For those not versed in the rapidly developing field of intellectual property surrounding data of all forms, the drop-down list presented here may be the point at which a user stops and says, "this is all a bit difficult". However, it doesn't matter if you leave this blank. There are some great resources on CKAN for leading into this topic. The process of working out the provenance of economic data is valuable, as it helps practitioners to better appreciate the background characteristics of the data sets they are working with, which may inform their interpretation of them.

What is a tag?

Tags are words which describe what a package refers to. The more specific the tags are, the more uniquely identifiable a data package is. It is a form of classification. As anyone who has played the game twenty questions can verify, it is incredible how quickly a few simple pointers can narrow down a topic of interest to something very specific. There are a few conventions on how tags are added. They are described here: http://wiki.okfn.org/ckan/doc/faq#TagConventions. There is a huge number of technical terms around this field: what is markup? what is a tag family? As tagging is central to the functionality and usefulness of CKAN, it would be useful to simplify this in some way. Is it possible to add too many tags? Or too few? In the spirit of twenty questions, it might be helpful to focus on just five key tags which really get to the heart of the matter.

How to name a package?

A central source of potential confusion is how to name a package. Again, the key is that the exact choice doesn't matter too much. The important thing is that the name is unique and at least reasonably representative of the underlying data it refers to. Starting it with "econ-" instantly identifies it as an economics data set.

How to link to a Google spreadsheet

Google Spreadsheets (https://spreadsheets.google.com) are a publicly accessible way to store tabular data. To add a new spreadsheet, you will first need a Google account. Google spreadsheets can be used as an intermediate parking place for data, enabling the data to be compiled together and stored in a standard format. The standard formatting improves the machine-readability of the data. You can then link to this Google Spreadsheet by naming it as a resource from a CKAN package. When you describe the resource under "Resources" on the CKAN package adder page, instead of calling it "xls" as you might for an MS Excel Spreadsheet, you can refer to it as "[format to be added]"
