Speaking Notes from a Tech Talks Presentation by Kurt Nordstrom, July 13, 2005
- OAI stands for "Open Archives Initiative." This is actually the name of the organization that develops OAI-PMH, which in turn stands for OAI Protocol for Metadata Harvesting. When most people talk about "OAI," they usually mean "OAI-PMH," since the technology is more fun to talk about than a bunch of folks getting together to hash out standards. For the remainder of this talk, I'll use "OAI" and "OAI-PMH" interchangeably.
- In a nutshell, OAI is a simple, structured way for organizations to expose content for harvesting, or to harvest exposed content. This would be the "open" part of the name. The "archives" refers to whatever content is being harvested, particularly metadata. I could explain "initiative," but y'all are librarians for the most part, and pretty smart, so I expect you to know that one.
- Historically, OAI has its roots in the problem of transferring abstracts for scientific and academic papers for the purpose of journal publication and peer review. Prior to the current version of OAI-PMH, which is 2.0, there was OAI-PMH v. 1.x, and prior to that, something called the Santa Fe Convention. These don't have much to do with my talk, but they can score you big points if you drop the names in casual library tech conversation. So boys, have you got interoperability support on your system yet? The Santa Fe Convention is outdated now, you know...
- The OAI guidelines specify both service providers and data providers. These are important concepts, so we'll define them right away.
- Service providers are systems that go out and harvest available metadata and let you do something useful with it. An example would be OAIster, which harvests records from a number of sources and makes them publicly searchable.
- Data providers, on the other hand, are systems that make metadata available for harvesting. They are the ones that are like "this is what I have, come get it." An example would be a particular collection or database that has made its metadata OAI harvestable.
- In order to understand how OAI works, there are a few underlying concepts that are worth briefly covering. OAI, like most anything useful, is built on top of a few existing technologies.
- HTTP is the Hyper Text Transfer Protocol. This is the protocol that powers the World Wide Web. When you open up your Web browser and go to a page beginning with <http://>, you're using it. HTTP specifies ways to pass more than just addresses within URLs, however. There is also a mechanism to pass commands and information which can be interpreted by specialized programs on the server. OAI "piggybacks" onto HTTP, and makes use of these information-passing mechanisms (called POST and GET) to do its job.
- XML stands for eXtensible Markup Language. While XML is a popular buzzword today, and often seems touted as a solution to any problem, from cancer to world hunger, it is really just a set of rules for formatting computer data in a way that is easily usable and readable. There are a number of XML-based technologies out there that do varied and wonderful things with XML data, but it is important to remember that XML is just a way to format data. Anybody who tells you otherwise is trying to sell you something. Regarding OAI, the thing to know is that OAI uses an XML metadata format.
- Dublin Core is the XML metadata format that OAI uses. I'd never heard of it before I came to work for the library, and the name seemed weird to me. It still seems weird to me. I imagined that it was developed in Ireland, but that isn't the case. As it turns out, there are a few Dublins in the world. Dublin, Ireland, as I mentioned, is known for its redheads, and for producing Guinness Stout. Dublin, Texas, is known for its rednecks and for producing cane-sugar Dr. Pepper. The Dublin from which Dublin Core originated, however, is Dublin, Ohio, which can claim the headquarters of the Wendy's chain to its name. You might consider this trivia, but you won't if you go to Dublin expecting a Guinness and somebody hands you a Wendy's Frosty or a Dr. Pepper instead.
- Dublin Core is designed to be an extremely flexible format, which makes it a good "lingua franca" for OAI to use for exchanging metadata. It defines 15 main elements that can be mapped to almost any data set imaginable. All of the records harvested through OAI will need to be available in Dublin Core format to be compliant with standards.
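
To make the Dublin Core idea a little more concrete, here is a minimal Python sketch that builds a record using a few of the 15 elements. The two namespace URIs are the ones used for `oai_dc` records; the function name `to_oai_dc` and the sample values are made up for illustration, and a real data provider would generate this from its own catalog.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of unqualified Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(fields):
    """Build an oai_dc element from a dict of {element name: [values]}.

    Hypothetical helper: every element is optional and repeatable,
    which is why each name maps to a list of values.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return root


# Made-up sample values, purely for illustration.
record = to_oai_dc({
    "title": ["An Example Title"],
    "creator": ["Jane Example"],
    "date": ["1954"],
})
print(ET.tostring(record, encoding="unicode"))
```

Because every element is optional and repeatable, almost any local schema can be squeezed into this shape, which is a big part of why it works as a lowest common denominator.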
- OAI defines a set of verbs that can be used to accomplish its tasks. By using combinations of these verbs, a service provider can contact a data provider in order to list and retrieve records. Here is a list of the verbs, with brief descriptions of what they do.
- GetRecord - Not surprisingly, this verb tells the data provider to send over a particular record, identified by a unique identifier.
- Identify - This verb is used to let the service provider learn the OAI capabilities of the data provider.
- ListIdentifiers - Get a listing of unique record IDs on a particular data provider. The request names a metadata format, and the list can be filtered by date or by set.
- ListMetadataFormats - Get a listing of the metadata formats offered by the data provider. Dublin Core must be supported, though others can be supported as well.
- ListRecords - This is similar to ListIdentifiers, only it returns the full records instead of just the identifiers. It can be filtered by the same criteria as ListIdentifiers.
- ListSets - The data provider may have its records broken into arbitrary named sets for organizational purposes. This verb will retrieve information on what particular sets may exist.
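
Since each verb is just a parameter passed over HTTP, a request is nothing more than a URL with a query string. The following is a rough Python sketch of how a harvester might assemble one. The function name `build_request` and the `example.org` address are placeholders for illustration; only the six verb names come from the protocol.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH.
OAI_VERBS = {
    "GetRecord", "Identify", "ListIdentifiers",
    "ListMetadataFormats", "ListRecords", "ListSets",
}


def build_request(base_url, verb, arguments=None):
    """Return the GET URL for an OAI-PMH request (hypothetical helper).

    `arguments` is a plain dict because one legal argument name,
    "from", is a reserved word in Python and cannot be a keyword.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments or {})
    return f"{base_url}?{urlencode(params)}"


print(build_request("http://example.org/oai", "Identify"))
print(build_request("http://example.org/oai", "ListRecords",
                    {"metadataPrefix": "oai_dc"}))
```

Note that the verb names are case-sensitive, so the sketch rejects anything that is not spelled exactly as in the list.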
I've thrown a bit of information out now, but it might not make a whole lot of sense just by itself. What I want to do is to paint a picture of how this all might be used, so a "big picture" can be seen.
Let's start with an imaginary data provider. We have an organization, "Cool Old Comics" that catalogs vintage comic books. They keep this catalog in a database on a Web content management system. Users can connect to this site and browse the catalog to retrieve information about these old comics.
One day, COC gets an email from another group, "Super Comics Catalog," which seeks to create a huge catalog of comics. They ask COC if they would be willing to share the information in their catalog with them, so that it can be absorbed into the super database. However, the Super Comics Catalog is using a different platform than COC is.
Normally, there would be a problem inherent here. How to get the records from COC into the database of SCC? This could be a daunting task. Somebody would have to send the files from COC over to SCC, and then SCC would have to figure out what to do with them. Perhaps an importing script could be written, or maybe somebody would have to enter them by hand? And what if they don't use the same terms to describe things? What a mess.
OAI-PMH to the rescue. Fortunately, both COC and SCC use systems that, while different, support OAI for metadata transfer. This might actually work! What we are seeing here is COC acting as the data provider, while SCC will be playing the role of service provider.
We'll say that COC's site resides at <http://www.cool-old-comics.org>, with the OAI-PMH section at <http://www.cool-old-comics.org/oai.php>. This is where SCC will point their harvester.
The harvester first needs to gather some information, so it starts with the Identify verb. The address that the harvester contacts is <http://www.cool-old-comics.org/oai.php?verb=Identify>. The verb=Identify indicates that the harvester is passing the Identify command to the COC server via the "GET" mechanism in HTTP. In response, the server sends the harvester back an XML formatted page of data about itself.
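
For a feel of what comes back, here is a small Python sketch that reads an Identify response. The sample XML is invented for our imaginary COC site (repository name, email address, and dates are all made up) and is trimmed down; the element names and the namespace follow OAI-PMH 2.0. The function name `parse_identify` is just a placeholder.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented, abbreviated Identify response for the imaginary COC site.
SAMPLE_IDENTIFY = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2005-07-13T12:00:00Z</responseDate>
  <request verb="Identify">http://www.cool-old-comics.org/oai.php</request>
  <Identify>
    <repositoryName>Cool Old Comics</repositoryName>
    <baseURL>http://www.cool-old-comics.org/oai.php</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@cool-old-comics.org</adminEmail>
    <earliestDatestamp>1990-01-01</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>
""".strip()


def parse_identify(xml_text):
    """Return the Identify fields as a dict of {element name: text}.

    Simplification: a repeated element (such as a second adminEmail)
    would overwrite the earlier value.
    """
    root = ET.fromstring(xml_text)
    identify = root.find(f"{OAI}Identify")
    return {child.tag.replace(OAI, ""): child.text for child in identify}


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["protocolVersion"], info["granularity"])
```

A harvester would typically look at fields such as `earliestDatestamp` and `granularity` here to decide how to phrase its later date-based requests.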
Next, the harvester wants to get down to business and get some metadata records. Now, it could employ the ListRecords verb, but instead we'll say it uses ListIdentifiers. The URL that it sends is <http://www.cool-old-comics.org/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc>. The metadataPrefix=oai_dc part, which OAI-PMH 2.0 requires for this verb, asks for the records that are available in Dublin Core. In response, COC sends back an XML page that contains the identifiers of all of the records that it currently has available.
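
The ListIdentifiers response is a list of record headers, each carrying an identifier and a datestamp. A minimal Python sketch of pulling those out might look like the following. The sample XML is invented and abbreviated (a real response also carries a response date and an echo of the request), and `parse_identifiers` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented, abbreviated ListIdentifiers response.
SAMPLE_LIST = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:cool-old-comics.org:BM0001</identifier>
      <datestamp>2005-06-01</datestamp>
    </header>
    <header>
      <identifier>oai:cool-old-comics.org:BM0002</identifier>
      <datestamp>2005-06-15</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
""".strip()


def parse_identifiers(xml_text):
    """Return a list of (identifier, datestamp) pairs from the headers."""
    root = ET.fromstring(xml_text)
    return [
        (header.findtext(f"{OAI}identifier"),
         header.findtext(f"{OAI}datestamp"))
        for header in root.iter(f"{OAI}header")
    ]


for identifier, datestamp in parse_identifiers(SAMPLE_LIST):
    print(identifier, datestamp)
```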
Finally, the harvester can retrieve the records. Again, it could retrieve them all with ListRecords, but let's assume that bandwidth is somewhat limited and we want to grab records in a more conservative manner. The harvester could possibly retrieve one record per minute or whatever works. To retrieve, say, a record identified by oai:cool-old-comics.org:BM0001, the harvester would send the following: <http://www.cool-old-comics.org/oai.php?verb=GetRecord&identifier=oai:cool-old-comics.org:BM0001&metadataPrefix=oai_dc>. This is a mouthful, but it really is just saying "give me the single record identified by oai:cool-old-comics.org:BM0001 in the Dublin Core format." The data provider responds by sending the record (in Dublin Core format) to the harvester, where it can be saved into the other database.
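
Putting the slow-and-steady approach into code, a harvester might look roughly like this Python sketch. Everything here is illustrative: the function names, the sample record contents, and the `fetch` callable (which stands in for whatever actually performs the HTTP GET) are assumptions, not part of any particular harvester.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC = "{http://purl.org/dc/elements/1.1/}"

# Invented, abbreviated GetRecord response.
SAMPLE_RECORD = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:cool-old-comics.org:BM0001</identifier>
        <datestamp>2005-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Comic #1</dc:title>
          <dc:creator>Jane Example</dc:creator>
          <dc:date>1954</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
""".strip()


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build the GetRecord URL; urlencode percent-encodes the colons."""
    params = {"verb": "GetRecord", "identifier": identifier,
              "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_dublin_core(xml_text):
    """Collect Dublin Core elements as {name: [values]}."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC):
            fields.setdefault(element.tag[len(DC):], []).append(element.text)
    return fields


def harvest_one_by_one(base_url, identifiers, fetch, delay_seconds=60):
    """Yield (identifier, fields), pausing between requests."""
    for position, identifier in enumerate(identifiers):
        if position > 0:
            time.sleep(delay_seconds)
        yield identifier, parse_dublin_core(
            fetch(get_record_url(base_url, identifier)))


# Stand-in fetch that returns the canned sample instead of using the network.
for identifier, fields in harvest_one_by_one(
        "http://www.cool-old-comics.org/oai.php",
        ["oai:cool-old-comics.org:BM0001"],
        fetch=lambda url: SAMPLE_RECORD, delay_seconds=0):
    print(identifier, fields["title"])
```

In a real harvester, `fetch` could be something like a thin wrapper around `urllib.request.urlopen`. The generated URL writes each colon in the identifier as `%3A`, which is the percent-encoded form of the same character.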
If, further down the line, SCC would like to continue to harvest from COC, it may do so, but refine its retrieval to new records since a certain date (most likely the date of the last harvest).
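
A follow-up harvest of that kind can be sketched in Python as below. The `from` argument is the protocol's way of saying "changed on or after this date"; the function name `incremental_request` and the example date are assumptions for illustration.

```python
from datetime import date
from urllib.parse import urlencode


def incremental_request(base_url, last_harvest, metadata_prefix="oai_dc"):
    """Build a ListRecords URL limited to records changed since last_harvest.

    Assumes the data provider uses day-level granularity (YYYY-MM-DD).
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": last_harvest.isoformat(),
    }
    return f"{base_url}?{urlencode(params)}"


print(incremental_request("http://www.cool-old-comics.org/oai.php",
                          date(2005, 7, 13)))
```

The harvester would then store the date of each successful run and feed it back in as `last_harvest` the next time around.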
This was intended to be a very simple example. The verbs do have more qualifiers that can be used to do more precise or refined harvesting. Such is beyond the scope of this overview, though.
One important thing to note is that OAI is not a search technology. The two organizations had to be aware of each other's existence before harvesting could take place. OAI is designed to allow site A to efficiently harvest the metadata of site B. There is no mechanism for extensive searching of site B's content or for searching for valid sites at all. That is more the territory of search technologies like Z39.50.
The Digital Projects Lab currently is focusing the lion's share of our work on the Portal to Texas History, which is a content system based on the Keystone DLS. We use a custom XML metadata set to describe our information. Keystone has built-in OAI compatibility, and can convert metadata to Dublin Core (through the use of XSLT stylesheets) and expose it for harvesting. This means that all of our records that we host on the Portal will be able to be harvested automatically by any site that wishes to.
In addition, we have plans to extend the Portal architecture into a more generic library-wide system to allow collections within the Libraries to be hosted on our servers. These too will benefit from OAI compliance. Because the same metadata schema will be used across the collections which, in turn, can be converted to Dublin Core for export, there will be no additional work required to make these collections harvestable.
About the Author: Programmer Kurt Nordstrom resolves technical issues for the Portal to Texas History and the UNT Libraries' Digital Collections.