ZipcodeZoo.com

InfoTags

A Step toward Self-Cleaning Data Distribution Systems

This page describes a project to capture user input on errors in data elements, efficiently share that user input, and correct those data elements.

Overview

A unit of information (a “data element”) is added to a central database, along with metadata that helps readers understand its original context. Comments about the data element, typically about its accuracy, can be attached to it, and each comment might carry information about its author. Interested parties could access the database, view data elements and their comments, and possibly agree that a data element is incorrect, recording this in the database as well. At a later time, the party responsible for that data element might issue a new one to replace it.
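The overview above can be sketched as a small data model. This is only an illustration, not the project's actual schema: the class names, field names, and the `issue_replacement` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Comment:
    text: str
    author: Optional[str] = None      # info about the commenter, if supplied
    disputes_accuracy: bool = False   # True when the comment questions the data


@dataclass
class DataElement:
    element_id: str                   # hypothetical identifier field
    value: str
    context: dict = field(default_factory=dict)   # metadata on the original context
    comments: list = field(default_factory=list)
    marked_incorrect: bool = False
    superseded_by: Optional[str] = None           # id of a later replacement, if any

    def add_comment(self, comment: Comment) -> None:
        self.comments.append(comment)

    def mark_incorrect(self) -> None:
        self.marked_incorrect = True


def issue_replacement(old: DataElement, new_id: str, new_value: str) -> DataElement:
    """Issue a new element instead of editing the old one in place."""
    old.superseded_by = new_id
    return DataElement(element_id=new_id, value=new_value, context=dict(old.context))
```

Keeping the old element and pointing it at its replacement preserves the comment history that led to the correction.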

Problems Addressed

The project is expected to solve three problems:

  1. DDT in the Food Chain. One set of facts that is of great importance to ZipcodeZoo is the observation data that we will be acquiring from GBIF and using for our maps. A published map of Panthera leo (African lion) makes it clear that we should question the accuracy of information about the geographic distribution of species in the wild. GBIF created none of these errors; they came up the food chain from GBIF's providers. If it is true that up to 30% of published observation records are wrong, that’s a lot of DDT in the food chain. The ideal system would allow data cleaning to move through the provider network even more efficiently than errors move through that network. And the ideal system would allow corrections to happen at any time, from any point in the network.
  2. Recurring Nightmares. In a partnership between a web site such as ZipcodeZoo or EOL and a data provider, regular transfers of data from one level of the system to the next are expected. ZipcodeZoo, for instance, will be importing 158 million GBIF records 6-12 times a year. A properly designed information tagging system could prevent errors from returning. Rather than manually clean an imported database of these returning errors, a site such as EOL or ZipcodeZoo could elect to not import or use any record with pending comments, or any record marked as “incorrect” by an appropriate party.
  3. Communication Mayhem. In a data provider network, one provider gathers information from other providers, organizes and aggregates it, and passes it on to yet another provider. GBIF receives data from a provider such as a museum, and passes it on to a provider who will display it on a web site or otherwise use it. At the end of this process is a fourth party, a user viewing a web site such as EOL.org or ZipcodeZoo.com. That user might spot an error and report it to someone, as happened recently. A Peace Corps volunteer in Mexico wrote to ZipcodeZoo with this message: “University of Kansas has 2 specimens of Phrynosoma, one braconnieri and one taurus, from Puebla Mexico listed as being from Namibia. I have sent them a message about it but your database has picked up that error.” So she sent two messages, and ZipcodeZoo should now communicate with her, with GBIF, and with someone at the University of Kansas. The communication problem could be solved by allowing the user to send a comment directly to the central database, with the comment automatically attached to the questionable data element. At intervals, participants would review their commented records, and mark some or all of them for removal. Such a record would be replaced, rather than edited.
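The import policy described in item 2 above (skip any record with pending comments or an “incorrect” mark) could look roughly like the following. It is a minimal sketch: the function name, the `"id"` key, and the two sets of identifiers are hypothetical, and it assumes the central database can supply those sets before each import.

```python
def importable(records, pending_comment_ids, incorrect_ids):
    """Yield only records with no pending comments and no 'incorrect' mark."""
    blocked = set(pending_comment_ids) | set(incorrect_ids)
    for record in records:
        if record["id"] not in blocked:
            yield record


# Example: three incoming records, one commented on and one marked incorrect.
incoming = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
clean = list(importable(incoming, pending_comment_ids={"b"}, incorrect_ids={"c"}))
```

Because the check runs on every import, a record rejected once stays out of later imports without any manual cleaning.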

Plan for a Prototype

Here are preliminary notes on what we plan to do at ZipcodeZoo to build a working prototype of a self-cleaning system.

The Prototype at ZipcodeZoo

We have built a prototype for a simple version of InfoTags. In this version, we allow users to reject specific observations used in constructing our distribution maps. More elaborate versions of this would allow users to comment on the accuracy of various factual claims about a species, or offer corrections. And more elaborate versions would show users all comments received on some data element, should the user request that level of detail.
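As a rough illustration of the simple version, the sketch below records user rejections of individual observations and leaves rejected ones off a distribution map. The class, its method names, the `lat`/`lon` keys, and the `threshold` parameter are assumptions for the example, not a description of the ZipcodeZoo code.

```python
from collections import defaultdict


class RejectionLog:
    """Tracks which users have rejected which observations (hypothetical)."""

    def __init__(self):
        self._rejections = defaultdict(set)   # observation id -> set of user ids

    def reject(self, observation_id, user_id):
        # A set means repeated rejections by one user count only once.
        self._rejections[observation_id].add(user_id)

    def count(self, observation_id):
        return len(self._rejections.get(observation_id, ()))

    def map_points(self, observations, threshold=1):
        """Return (lat, lon) for observations with fewer than `threshold` rejections."""
        return [
            (o["lat"], o["lon"])
            for o in observations
            if self.count(o["id"]) < threshold
        ]
```

Raising `threshold` would require several independent users to reject an observation before it disappears from the map; whether that is desirable is a design decision the source leaves open.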

Our prototype is a mythical excerpt from the Data Table that will be used for observation data in ZipcodeZoo version 2.0. View it by clicking here.

Identifiers: Record ID, GUID or MD5

In the first prototype, each observation would have its own Record ID, created by MySQL. For a system shared across providers, using a GUID as the identifier makes great sense, particularly if there is an accepted issuing agency (such as Google) that would create it. Google has created a GUID generator, should the future design of an InfoTags system include local GUID generation. See http://www.newguid.net/iGoogle_CreateGuid.aspx
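The two non-MySQL options in the heading might be generated locally as sketched below. This assumes that the “MD5” option means a hash of the record's own content, which the text does not spell out; the function names and the choice of fields are hypothetical.

```python
import hashlib
import uuid


def new_guid() -> str:
    """Generate a random GUID locally (no issuing agency involved)."""
    return str(uuid.uuid4())


def content_md5(record: dict, fields: tuple) -> str:
    """Derive an identifier from the record's content.

    The same field values always give the same identifier, so a record
    that returns unchanged in a later import can be recognized.
    """
    canonical = "|".join(str(record.get(name, "")).strip().lower() for name in fields)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

A content-derived identifier changes whenever any hashed field changes, so a corrected record would receive a new identifier, while a random GUID carries no information about the record at all.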

Extensions

We hope that others in data distribution systems will explore this idea with us, and that we can work out a system that can serve us all, particularly our users.

Credits

This idea was initially developed at the EOL Informatics Advisory Group meeting of April 14-15, 2008 in Woods Hole, MA.

 

First Created: April 16, 2008. Last Revised: April 22, 2008.