Internet Archive


I received a special request for an Internet Archive pipe. The advanced search page gave me plenty to work with. The Advanced XML Search form returns whichever record fields you want as XML, JSON, CSV, or an HTML table. The form exposes all of the passed parameters in the search response URL, making it straightforward to rig a pipe that creates a well-formed query.

Internet Archive Search Pipe

The pipe has some convenient details. Adding long strings in the Pipes interface is annoying because the textboxes are short, so adding each record field to an array, fl[], makes it easy to see all the parameters. The Pipes team has also added a Create RSS module, which makes converting the returned fields to an RSS feed much cleaner.
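For anyone who wants to see the same idea outside of Pipes, here is a rough Python sketch of assembling the search URL with one fl[] entry per record field. The endpoint URL, the rows and output parameter names, and the build_search_url helper are my assumptions for illustration, not something taken from the pipe itself, so check them against the URL the form actually produces.

```python
from urllib.parse import urlencode

# Assumed endpoint for the advanced search form; verify against the form's own URL.
BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields, rows=50, output="json"):
    """Build a search URL, repeating fl[] once per requested record field.

    build_search_url, rows and output are illustrative names and defaults.
    """
    params = [("q", query)]
    # One fl[] pair per field keeps every requested field visible on its own.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("output", output)]
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("John Singer Sargent", ["identifier", "title", "mediatype"])
```

Passing the fields as a list mirrors the array in the pipe: adding or dropping a field is a one-item change instead of an edit to a long string.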

One quibble with the Internet Archive's data format: in XML, the record fields are returned as repeated elements, which makes them a little harder to manipulate. The JSON response is easier to work with, since every field is placed in a distinct element.

I tried tuning the quality of results. By default the search string John Singer Sargent gets translated into this baroque query:

(title:john^100 OR description:john^15 OR collection:john^10 OR language:john^10 OR text:john^1) (title:singer^100 OR description:singer^15 OR collection:singer^10 OR language:singer^10 OR text:singer^1) (title:sargent^100 OR description:sargent^15 OR collection:sargent^10 OR language:sargent^10 OR text:sargent^1)

The ^s are Lucene boosting operators, with the numbers setting the relevancy weights. Exact phrase matching did not work as well, since the artist name can be formatted differently. Compare John Singer Sargent to “John Singer Sargent”. Lucene’s proximity search operator could do the trick, though it can cast the net too wide.
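To make the expansion concrete, here is a small Python sketch that rebuilds the boosted query shown earlier from a plain search string, plus a helper for the proximity form. The field weights are copied from that query; the function names and the slop value of 5 are my own choices for illustration, and I am not claiming this is how the Internet Archive generates the query internally.

```python
# Field weights as they appear in the expanded query above.
FIELD_BOOSTS = [
    ("title", 100),
    ("description", 15),
    ("collection", 10),
    ("language", 10),
    ("text", 1),
]


def expand_term(term):
    """Return one parenthesised group: the term OR-ed across fields with ^boosts."""
    clauses = " OR ".join(
        f"{field}:{term.lower()}^{boost}" for field, boost in FIELD_BOOSTS
    )
    return f"({clauses})"


def expand_query(search):
    """Expand each whitespace-separated word into its own boosted group."""
    return " ".join(expand_term(term) for term in search.split())


def proximity_query(search, slop):
    """Lucene proximity search: the words must occur within `slop` positions."""
    return f'"{search}"~{slop}'


boosted = expand_query("John Singer Sargent")
nearby = proximity_query("John Singer Sargent", 5)  # slop of 5 is an arbitrary example
```

A small slop stays close to exact phrase matching while tolerating an initial or a reordered name; a large one drifts back toward the loose behaviour of the default query.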

I’ve added the pipe to the Met object information aggregator.

Internet Archive results for John Singer Sargent

//TODO: Format the results to include the media type icon that the Internet Archive provides – book, audio, video.

Posted in Metropolitan Museum of Art Tagged: books, search