A Step toward Self-Cleaning Data Distribution Systems
This page describes a project to capture user reports of errors in data elements, to share those reports efficiently, and to correct the data elements in question.
A unit of information (a “data element”) is added to a central database along with comments about it. Attached to the data element is metadata that helps preserve its original context; attached to each comment may be information about the comment’s author. The comments might pertain to the accuracy of the element. Interested parties could access the database, view data elements and comments, and possibly agree that a data element is incorrect, recording this in the database as well. At a later time, a party might issue a new data element to replace it.
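As a rough sketch of that lifecycle in SQL (the table and column names here are illustrative assumptions, elaborated under “Plan for a Prototype” below, not a finished design):

    -- A data element enters the central database along with metadata about its
    -- original context (species observed, provider, page on which it appeared).
    INSERT INTO DataElements (ElementGUID, Latitude, Longitude, ObservedDate,
                              Species, Provider, SourcePage)
    VALUES ('00000000-0000-0000-0000-000000000001', 39.05, -95.68, '2007-06-01',
            'Panthera leo', 'Example Museum', 'http://example.org/observation-page');

    -- A user questions its accuracy; info about the comment's author accompanies it.
    -- (assumes the element and commenter above each received id 1)
    INSERT INTO Commenters (Name, Email) VALUES ('A. Reader', 'reader@example.org');
    INSERT INTO Comments (RecordID, CommenterID, CommentText)
    VALUES (1, 1, 'The locality looks wrong for this species.');

    -- Interested parties review the comment and agree the element is incorrect.
    UPDATE DataElements SET Status = 'incorrect' WHERE RecordID = 1;

    -- At a later time, a party issues a new data element rather than editing the old one.
    INSERT INTO DataElements (ElementGUID, Latitude, Longitude, ObservedDate,
                              Species, Provider, SourcePage, Status)
    VALUES ('00000000-0000-0000-0000-000000000002', -22.56, 17.08, '2007-06-01',
            'Panthera leo', 'Example Museum', 'http://example.org/observation-page', 'ok');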
The project is expected to solve 3 problems:
- DDT in the Food Chain. One set of facts of great importance to ZipcodeZoo is the observation data we will be acquiring from GBIF and using for our maps. A published map of Panthera leo (African Lion) makes it clear that we should question the accuracy of information about the geographic distribution of species in the wild. GBIF created none of these errors – they came up the food chain from their providers. If it is true that up to 30% of published observation records are wrong, that’s a lot of DDT in the food chain. The ideal system would allow data cleaning to move through the provider network even more efficiently than errors move through it, and would allow corrections to happen at any time, from any point in the network.
- Recurring Nightmares. In a partnership between a web site such as ZipcodeZoo or EOL and a data provider, regular transfers of data from one level of the system to the next are expected. ZipcodeZoo, for instance, will be importing 158 million GBIF records 6-12 times a year. A properly designed information tagging system could prevent previously corrected errors from returning with each import. Rather than manually clean the imported database of these recurring errors, a site such as EOL or ZipcodeZoo could elect not to import or use any record with pending comments, or any record marked as “incorrect” by an appropriate party.
- Communication Mayhem. In a data provider network, one provider gathers information from other providers, organizes and aggregates it, and passes it on to yet another provider. GBIF receives data from a provider such as a museum, and passes it on to a provider that will display it on a web site or otherwise use it. At the end of this process is a fourth party, a user viewing a web site such as EOL.org or ZipcodeZoo.com. That user might spot an error, and report it to someone, as happened recently. A Peace Corps volunteer in Mexico wrote ZipcodeZoo with this message: “University of Kansas has 2 specimens of Phrynosoma, one braconnieri and one taurus, from Puebla Mexico listed as being from Namibia. I have sent them a message about it but your database has picked up that error.” So she sent two messages, and ZipcodeZoo should now communicate with her, with GBIF, and with someone at the University of Kansas. The communication problem could be solved by allowing the user to send a comment directly to the central database, with the comment automatically attached to the questionable data element. At intervals, participants would review their commented records, and mark some or all for removal. Such a record would be replaced, rather than edited.
Plan for a Prototype
Here are preliminary notes on what we plan to do at ZipcodeZoo to build a working prototype of a self-cleaning system.
- Units of information. Our prototype covers observation records, with each observation record being the unit of information that a user can comment on. When the prototype is extended, other data types, such as taxonomic information, can also be given InfoTags.
- User Interface. In the prototype, a clickable icon appears after each displayed record. Clicking the icon takes the user to a comment page, where the claim is presented and the user can enter a comment. Users without a registration cookie must register before leaving a comment; for users with a registration cookie, the comment is added to the database along with their contact info (already stored in a database table and identified by the id in the cookie). Any page containing a fact that has been commented on would be able to hide or display the comments. In our prototype, this control is missing, and all comments are displayed.
- The Database. A table holding information for every data element at ZipcodeZoo would be monstrous. It is therefore proposed to assume that all is well until an element is commented upon or corrected, and to add it to the database only at that time. The first user commenting on an observation would thus create the first records in the database for that observation. The database would minimally have a DataElements table (all elements which have received comments), a Comments table (all comments received), and a Commenters table (name, email address, or other info that the commenter might provide); a schema along these lines is sketched after this list.
- MetaData. To be useful, metadata must accompany the data element when it is first added to the DataElements table. For instance, an observation record with latitude, longitude, and date would have metadata for the species observed, the data provider, and the page on which the data element appeared.
- Adding more comments. When a second user comments on a data element, the second comment must be linked from the Comments table to the DataElements table. In the prototype, this can be done by storing the appropriate recordID from the DataElements table with each comment.
- Limited Functionality. The prototype ZipcodeZoo builds could serve ZipcodeZoo’s needs by using a local database to store comments and related information. All page builds and rebuilds would consult this database to determine which records to exclude (a query along these lines is sketched below), so that we can at least prevent previously detected DDT from reaching our users. Ideally, the database would also be accessible to the outside world, so that others could make corrections based on the comments.
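To make these notes concrete, here is a minimal sketch of the three tables in MySQL. The column names and types are illustrative assumptions rather than a finished design; per the plan above, rows are created only when an element receives its first comment.

    -- Elements that have received at least one comment (nothing is stored until then).
    CREATE TABLE DataElements (
      RecordID      INT AUTO_INCREMENT PRIMARY KEY,
      ElementGUID   CHAR(36),                  -- see the identifier discussion below
      Latitude      DECIMAL(9,6),              -- the observation itself
      Longitude     DECIMAL(9,6),
      ObservedDate  DATE,
      Species       VARCHAR(255),              -- metadata: species observed
      Provider      VARCHAR(255),              -- metadata: data provider
      SourcePage    VARCHAR(255),              -- metadata: page on which it appeared
      Status        ENUM('pending','incorrect','ok') NOT NULL DEFAULT 'pending'
    );

    -- People who have commented, identified by the id carried in their registration cookie.
    CREATE TABLE Commenters (
      CommenterID   INT AUTO_INCREMENT PRIMARY KEY,
      Name          VARCHAR(255),
      Email         VARCHAR(255)               -- or other info the commenter provides
    );

    -- All comments received; a second (or hundredth) comment on an element simply
    -- reuses the same DataElements RecordID.
    CREATE TABLE Comments (
      CommentID     INT AUTO_INCREMENT PRIMARY KEY,
      RecordID      INT NOT NULL,
      CommenterID   INT NOT NULL,
      CommentText   TEXT,
      CreatedAt     DATETIME,
      FOREIGN KEY (RecordID)    REFERENCES DataElements (RecordID),
      FOREIGN KEY (CommenterID) REFERENCES Commenters (CommenterID)
    );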
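A page build or bulk import could then run an exclusion check along these lines; again an assumed sketch, with ImportedObservations standing in for the table of freshly imported GBIF records and ObservationGUID assumed to match ElementGUID above:

    -- Keep only observations that were never commented on, or that were reviewed
    -- and found acceptable; records with pending comments or marked incorrect are
    -- excluded from the import or page build.
    SELECT o.*
    FROM ImportedObservations AS o
    LEFT JOIN DataElements AS d ON d.ElementGUID = o.ObservationGUID
    WHERE d.RecordID IS NULL
       OR d.Status = 'ok';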
The Prototype at ZipcodeZoo
We have built a prototype for a simple version of InfoTags. In this version, we allow users to reject specific observations used in constructing our distribution maps. More elaborate versions would allow users to comment on the accuracy of various factual claims about a species, or to offer corrections, and would show users all comments received on a data element, should the user request that level of detail.
Our prototype uses a fictitious excerpt from the Data Table that will hold observation data in ZipcodeZoo version 2.0. View it by clicking here.
Identifiers: Record ID, GUID or MD5
In the first prototype, each observation would have its own Record ID, assigned by MySQL. Using a GUID as the identifier makes great sense, particularly if there is an accepted issuing agency (such as Google) that would create it. A GUID-generator gadget for iGoogle already exists, should the future design of an InfoTagging system include local GUID generation; see http://www.newguid.net/iGoogle_CreateGuid.aspx
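To make the heading’s three options concrete, any of them could be produced in MySQL itself; the ImportedObservations table and its columns here are assumptions carried over from the sketches above:

    -- Record ID: the auto-increment integer MySQL assigns locally
    -- (the RecordID column in the DataElements sketch above).

    -- GUID: MySQL can generate one directly.
    SELECT UUID();

    -- MD5: a content-derived identifier hashed from the fields that define the
    -- observation, so the same record yields the same identifier at every provider.
    SELECT MD5(CONCAT_WS('|', Provider, Species, Latitude, Longitude, ObservedDate))
    FROM ImportedObservations;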
We hope that others in data distribution systems will explore this idea with us, and that we can work out a system that can serve us all, particularly our users.
This idea was initially developed at the EOL Informatics Advisory Group meeting of April 14-15, 2008 in Woods Hole, MA.
First Created: April 16, 2008. Last Revised: April 22, 2008.