Data Packages

From CKAN

The official data package specification is now at http://www.dataprotocols.org/en/latest/packages.html

This page is now deprecated.

A Data Package (or DataPackage) is a cohesive wrapping of a collection of data and possibly other assets into a single form. It provides the basis for convenient delivery, installation and management of datasets.

Those familiar with code packages can think of a Data Package as an analogous distribution format for datasets. More discussion is in Data Package Background.

How It Fits into the Ecosystem

[Figure: Data package system.png]

Concepts

Specification

Aims:

Current Format

Each package must provide package descriptor metadata. As a file this should be named "datapackage.json" and placed in the top-level directory.
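As a rough illustration of the rule above, the following Python sketch looks for the descriptor in the top-level directory of a package and parses it. The function name load_descriptor is hypothetical, and the sketch assumes only that the file is valid JSON; it does not check any particular attributes.

```python
import json
from pathlib import Path

DESCRIPTOR_NAME = "datapackage.json"  # file name given in the text above


def load_descriptor(package_dir):
    """Return the parsed descriptor found at the top level of package_dir.

    Hypothetical helper: it only checks that the file exists and is valid
    JSON, and does not validate any attributes.
    """
    descriptor_path = Path(package_dir) / DESCRIPTOR_NAME
    if not descriptor_path.is_file():
        raise FileNotFoundError(f"no {DESCRIPTOR_NAME} in {package_dir}")
    with descriptor_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A tool could call load_descriptor on a directory before doing anything else with it, and treat a FileNotFoundError as "this directory is not a Data Package".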

Package Metadata

The current spec has the following attributes (this is heavily based on Python distributions):

Future spec

This is very closely based on the CommonJS spec (also a JSON-based format). That spec in turn shares many attributes with Debs, Python distributions, etc.

Optional attributes:

Package Bundle Format

This is relevant when the Package is serialized to disk. A serialization of a Package on disk is known as a Package Bundle. Note that not all packages will be serialized to disk. For example, some packages could provide their data via an API.

Directory Layout

A Data Package Bundle will observe the following:

To illustrate:

 /{data-package-name}
     /datapackage.json <-- data package metadata
     /data/ <--- where resources are downloaded to
     # optionally
     /README.txt
     /.dp/dprc <--- ini style information about where the data package came from

dprc format

 [paths]
 default = http://thedatahub.org/{username}/{dataset}
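Since the dprc file is described as ini style, it can probably be read with a standard INI parser. The sketch below uses Python's configparser on the sample shown above. The helper name read_default_path is hypothetical, and treating {username} and {dataset} as literal placeholder text is an assumption, since this page does not say how they are filled in.

```python
import configparser

# Sample taken from the dprc example above; {username} and {dataset} are
# kept as literal placeholders (how they get expanded is not specified here).
SAMPLE_DPRC = """\
[paths]
default = http://thedatahub.org/{username}/{dataset}
"""


def read_default_path(dprc_text):
    """Return the 'default' entry of the [paths] section, or None if absent."""
    # Interpolation is disabled so that any '%' in a URL is kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(dprc_text)
    return parser.get("paths", "default", fallback=None)
```

Returning None instead of raising keeps the helper usable for bundles that have no recorded origin.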

Catalogs and Discovery

To find Data Packages, tools may use a "consolidated" catalog, usually at "HOME/.dp/catalog.json".

catalog.json has the following format:

{
   version: ...
   packages: {
     {name}: {
       {version}: {
         metadata: {metadata},
         bundles: [
           {
             url: ...
             type: file, url, ckan, zip, tgz
           }
         ]
       }
     }
   }
}

When Package metadata is added to the catalog, a field called bundle is added that points to a bundle source for this item.
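To make the catalog structure more concrete, here is a small Python sketch that looks up the bundle sources for one package version. It assumes the catalog has been parsed into nested dictionaries keyed by package name and then version, and that each bundle is an object with url and type keys; the format above leaves that last point open. The function name, package name and URL are all made up for illustration.

```python
def find_bundles(catalog, name, version):
    """Return the bundle sources recorded for name/version, or [] if none."""
    versions = catalog.get("packages", {}).get(name, {})
    entry = versions.get(version, {})
    return entry.get("bundles", [])


# Hypothetical catalog content, shaped like the format sketched above.
example_catalog = {
    "version": "1",
    "packages": {
        "example-package": {
            "0.1": {
                "metadata": {"name": "example-package"},
                "bundles": [
                    {"url": "http://example.org/example-package.zip", "type": "zip"}
                ],
            }
        }
    },
}

bundles = find_bundles(example_catalog, "example-package", "0.1")
```

An unknown package name or version simply yields an empty list, which lets a tool fall back to another discovery mechanism.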

Deferred

To be decided

Data Package Manager

Inspiration: Existing Package Formats

The specification is heavily inspired by various software packaging formats, including the Debian 'Debs' format, Python Distributions and CommonJS Packages. More background on these other formats can be found below.

Debs

http://www.debian.org/doc/debian-policy/ch-controlfields.html

The fields in the binary package paragraphs are:

5.6.2 Maintainer

The package maintainer's name and email address. The name must come first, then the email address inside angle brackets <> (in RFC822 format).
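The "name first, then email in angle brackets" layout is the usual mail address form, so a value of this kind can be split with Python's standard email.utils.parseaddr. The sketch below is only an illustration of the format, not of how Debian tools parse it; the helper name and the sample address are hypothetical.

```python
from email.utils import parseaddr


def parse_maintainer(field_value):
    """Split a Maintainer value such as 'Name <user@host>' into (name, email)."""
    name, email = parseaddr(field_value)
    # Rough sanity check only: anything without an '@' is rejected.
    if "@" not in email:
        raise ValueError(f"no email address found in {field_value!r}")
    return name, email


name, email = parse_maintainer("Jane Doe <jane@example.org>")
```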

5.6.13 Description

In a source or binary control file, the Description field contains a description of the binary package, consisting of two parts, the synopsis or the short description, and the long description. The field's format is as follows:

5.6.5 Section

This field specifies an application area into which the package has been classified. See Sections, Section 2.4.

JARs

http://java.sun.com/j2se/1.3/docs/guide/jar/jar.html

The META-INF directory

The following files/directories in the META-INF directory are recognized and interpreted by the Java 2 Platform to configure applications, extensions, class loaders and services:

MANIFEST.MF - The manifest file that is used to define extension and package related data.

INDEX.LIST

CommonJS JavaScript packages

http://wiki.commonjs.org/wiki/Packages/1.0

The following is an extract:

Packages

This specification describes the CommonJS package format for distributing CommonJS programs and libraries. A CommonJS package is a cohesive wrapping of a collection of modules, code and other assets into a single form. It provides the basis for convenient delivery, installation and management of CommonJS components.

This specifies the CommonJS package descriptor file and package file format. It does not specify a package catalogue file or format; this is an exercise for future specifications. The package descriptor file is a statement of known fact at the time the package is published and may not be modified without publishing a new release.

Package Descriptor File

Each package must provide a top-level package descriptor file called "package.json". This file is a JSON format file. Each package must provide all the following fields in its package descriptor file.

Catalog Properties

When a package.json is included in a catalog of packages, the following fields should be present for each package.

checksums: {
  "md5": "841959b03e98c92d938cdeade9e0784d",
  "sha1": "f8919b549295a259a6cef5b06e7c86607a3c3ab7",
  "sha256": "1abb530034bc88162e8427245839ec17c5515e01a5dede6e702932bbebbfe8a7"
}

This checksum is meant to be automatically added by the catalog service.
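A catalog service could produce a checksums entry of this shape with Python's hashlib, as in the minimal sketch below. It assumes the digests are taken over the raw bytes of the package file, which the extract does not state explicitly; the function name is hypothetical.

```python
import hashlib


def compute_checksums(data):
    """Return md5, sha1 and sha256 hex digests for a bytes object."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Digests of empty input, used here only as a well-known reference case.
empty_checksums = compute_checksums(b"")
```

For a real package the bytes would typically be read from the bundle file in binary mode before being passed in.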

Open Document Format

http://en.wikipedia.org/wiki/OpenDocument_technical_specification#Format_internals

 meta.xml
 META-INF/
   manifest.xml

meta.xml contains the file metadata. For example, Author, "Last modified by", date of last modification, etc. The contents look somewhat like this:

   <meta:creation-date>2003-09-10T15:31:11</meta:creation-date>
   <dc:creator>Daniel Carrera</dc:creator>
   <dc:date>2005-06-29T22:02:06</dc:date>
   <dc:language>es-ES</dc:language>
   <meta:document-statistic  table-count="6" object-count="0"
     page-count="59" paragraph-count="676"
     image-count="2" word-count="16701"
     character-count="98757"/>

META-INF is a separate folder. Information about the files contained in the OpenDocument package is stored in an XML file called the manifest file. The manifest file is always stored at the pathname META-INF/manifest.xml. The main pieces of information stored in the manifest are:

Other Information

Matthew Brett's thoughts:

Discussion

FL / Since this system is quite complicated, here are some Data Package Simplification Proposals.
