Helmut
This page is 50% sketch, 50% documentation. Needs to be improved.
Helmut is a generic reconciliation service.
How this relates to OpenSpending/CKAN:
- The service could be run at entities.thedatahub.org and partially operated through CKAN (hook into its processing architecture etc.)
- We would no longer create alias mappings in Google Docs, which has various problems: the GDocs API isn't that great for writing, it needs Google credentials, and the UI, while fantastic, does not suggest the task.
- URLs generated by Helmut could be used alongside OpenCorporates URIs to identify entities such as public bodies and taxonomy items uniquely.
Helmut could potentially provide the following services:
- URIs for entities, composed from a type name and an entity key (e.g. "Department of Health" -> "/departments/uk/Department_of_Health", where "departments" is the type and "uk/Department_of_Health" is the key).
- A reconciliation API that performs fuzzy matching on these entities, based both on a general query string and on specified properties (following the Google Refine reconciliation spec).
- An alias normalization service in which multiple aliases can be defined for each entity. These aliases will be used for reconciliation, and also in a strict mode that simply forwards an alias to its entity. This effectively replaces the normalization spreadsheets we've been using, e.g. for UK departments, CKAN MIME types and publicdata.eu package categories.
- A web interface for creating alias mappings (and possibly entities) to manually refine matching.
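The URI scheme, the strict alias mode and the fuzzy reconciliation mode listed above can be illustrated with a minimal sketch. This is not Helmut's implementation: the sample entities and aliases, the function names and the use of `difflib` as the similarity measure are all assumptions made for illustration. Only the URI layout and the candidate fields (`id`, `name`, `type`, `score`, `match`, as used by Google Refine reconciliation results) come from the description above.

```python
from difflib import SequenceMatcher

# Hypothetical sample data: entity key -> label, and alias -> entity key.
ENTITIES = {
    "uk/Department_of_Health": "Department of Health",
    "uk/Home_Office": "Home Office",
}
ALIASES = {
    "DoH": "uk/Department_of_Health",
    "Dept. of Health": "uk/Department_of_Health",
}


def entity_uri(type_name, key):
    """Compose an entity URI path from a type name and an entity key."""
    return "/%s/%s" % (type_name.strip("/"), key.strip("/"))


def normalize(text):
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def names_for(key):
    """Return the label and all aliases known for an entity key."""
    return [ENTITIES[key]] + [a for a, k in ALIASES.items() if k == key]


def resolve_strict(type_name, text):
    """Strict mode: forward an exact label or alias to its entity URI."""
    wanted = normalize(text)
    for key in ENTITIES:
        if any(normalize(name) == wanted for name in names_for(key)):
            return entity_uri(type_name, key)
    return None


def reconcile(type_name, query, limit=3):
    """Fuzzy mode: rank entities by their best-scoring label or alias."""
    wanted = normalize(query)
    candidates = []
    for key, label in ENTITIES.items():
        score = max(
            SequenceMatcher(None, wanted, normalize(name)).ratio()
            for name in names_for(key)
        )
        candidates.append({
            "id": entity_uri(type_name, key),
            "name": label,
            "type": [type_name],
            "score": round(score * 100, 1),
            "match": score == 1.0,  # only exact hits count as a sure match
        })
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:limit]
```

In this sketch, `resolve_strict("departments", "DoH")` forwards the alias to "/departments/uk/Department_of_Health", while a misspelled query such as "Departmnt of Health" only yields ranked candidates from `reconcile`, which a user could then confirm or turn into a new alias.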
The general workflow will be the following:
- A user can use webstore credentials to register a type name with the system, giving the following information:
  - http://webstore.thedatahub.org/pudo/helmut/types
  - A webstore table name for the entities table and the column name for the key (eventually also: for the entity label etc.)
  - A second table for the alias table and the column name for the entity FK
- Helmut will load the entities and alias tables into Solr, and will continue to do so periodically, to allow for wildcard retrieval etc. Helmut will also build a sitemap for SEO purposes.
- Users can begin to query the service and add aliases which will be directly saved to webstore.
- V2: Users will be able to customize the presentation of entities and thereby create white-label Helmut instances, e.g. publicbodies.org
- V8: add RDF ;-)
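The registration and loading steps of the workflow above could look roughly like the following sketch. It assumes a type registration is a small record naming the two webstore tables and their key columns, and that loading means joining the alias rows onto their entities before indexing. The class, field and function names (including the `alias_text_column` default) are hypothetical, and no webstore or Solr calls are made: rows are passed in as plain dictionaries.

```python
from dataclasses import dataclass


@dataclass
class TypeRegistration:
    """What a user supplies when registering a type (names are assumptions)."""
    type_name: str          # e.g. "departments"
    entity_table: str       # webstore table holding the entities
    key_column: str         # column in entity_table holding the entity key
    alias_table: str        # webstore table holding the aliases
    alias_fk_column: str    # column in alias_table pointing at the entity key
    alias_text_column: str = "alias"  # assumed column holding the alias text


def build_index_docs(reg, entity_rows, alias_rows):
    """Join alias rows onto entity rows, producing one document per entity."""
    aliases = {}
    for row in alias_rows:
        entity_key = row[reg.alias_fk_column]
        aliases.setdefault(entity_key, []).append(row[reg.alias_text_column])

    docs = []
    for row in entity_rows:
        key = row[reg.key_column]
        docs.append({
            "id": "/%s/%s" % (reg.type_name, key),
            "type": reg.type_name,
            "key": key,
            "aliases": aliases.get(key, []),
        })
    return docs
```

A periodic job could rebuild these documents from the current webstore tables and hand them to the search index, so that aliases added by users show up in later queries.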
References
- Helmut code: https://github.com/okfn/helmut
- Reconciliation API documentation: http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi
- OpenCorporates.com API endpoint: http://opencorporates.com/reconcile
- SILK engine basics: http://www.assembla.com/spaces/silk/wiki/Silk_Link_Discovery_Engine
- SILK metrics: http://www.assembla.com/code/silk/git/nodes/silk2/silk-core/src/main/scala/de/fuberlin/wiwiss/silk/impl/metric