Miskatonic University Press

Code4Lib 2010: Tuesday 23 February


More notes from Code4Lib 2010, some brief, some so brief as to be nonexistent.

Cathy Marshall: People, Their Digital Stuff, and Time: Opportunities, Challenges and Life-Logging Barbie

Three or four things to think about when we mix people, stuff and time. Ruminations about personal digital archiving. "Feral ethnography."

"People rely on benign neglect as a de facto stewardship technique and collection policy."

"Personal digital archiving != archiving a personal digital collection."

Lots of laughing. Quite funny.

One can keep everything. Why might one do that? Don't know an item's future worth. It's hard thankless work to delete. "Filtering and searching can locate the gems among the gravel."

"It's easier to keep than to cull." Loss as a means of culling collections.

Personal scholarly archives as an example. One researcher thought he'd have lots of stuff, but now has little: only things since 2001, mostly PDFs.

"Implicaton: not all long-term stores ned to perform with the same level of reliability."

Use-based heuristics help assess value.

Second point. No single preservation technology/repository/etc. will win the battle for your stuff.

Instead of centralizing, we'll be knitting together stores and services: mobile devices, email, web sites, etc.

No single archive. The catalogue is the answer. Some things (dental records, high school photos, tax records) SHOULD be in different places. You can find them later when you need them.

Third point: Forget about digital originals or reference copies.

People have local copies of images etc. that they think of as the master copies, but there's a lot of useful added metadata in the online versions (eg Flickr) that makes it more valuable.

Example of a picture of an enormous catfish: the image has been resized and rescaled and had its quality changed, and it appears in lots of different places online in different formats, with lots of different explanations: it's a record-size fish, it's this, it's that, it was caught by X, it was caught by Y, it was caught here, it was caught there.

Where are the tools that will let us harvest the metadata that copies have grown? Where's the search tool for gathering copies, not deduping them?

Fourth: Given 1-3 there will be some interesting opportunities to take a fresh look at searching and browsing.

Techniques for re-encounter: stable personal geography; value-based organization; better presentation of surrogates.

Possible interfaces to do all this: faceted browsing, eg LifeBits approach. Annotated timeline (also LifeBits). Hard to do.

Bottom-up efforts: lots of digitization happening, policies, tools, practices. Personal archiving as cottage industry. SALT project at Stanford.

New opportunistic uses of massed data. She did a study of photos in Flickr of the same thing, a mosaic in Milan. Superstition: you stand on it, on the bull, spin on it, and then take a picture and post it online. In the pictures you can actually see the mosaic being eroded over time!


http://www.csdl.tamu.edu/~marshall/, http://research.microsoft.com/~cathymar/

Blog: http://ccmarshall.blogspot.com/

Twitter: http://twitter.com/ccmarshall

Jeremy Frumkin and Terry Reese, Cloud4Lib

Idea: Cloud4Lib = an open library platform.

Lots of different projects out there doing different or the same things. Need to glue them all together, put all the Code4Lib work and energy into one thing. "Enable libraries to truly and collaboratively build and use common infrastructure." Development efforts should enhance an entire platform, not just one piece of it all.


They set up a wiki. Interesting: some Amazon EC2 servers where people can experiment. Sponsored by someone or other.

Use cases, mentioning University of Hobbitville

Ross Singer, The Linked Library Data Cloud: Stop Talking and Start Doing

Tim Berners-Lee's Four Rules of Linked Data

Library linked data cloud was amazingly empty but has been growing slowly. id.loc.gov, LIBRIS and DBPedia.

Ross matched up lcsubjects.org and http://marccodes.heroku.com/gacs/.

Chronicling America.

VIAF connects to DBPedia. Can search with SRU.

Ross made LinkedLCCN
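Resolving LCCNs as linked data depends on LCCN normalization, which the Library of Congress documents for its lccn namespace. A sketch following that spec (background I'm adding here, not something from the talk):

```python
# LCCN normalization, per the Library of Congress lccn: namespace spec:
# strip blanks, drop any suffix after "/", and left-pad the serial
# portion after a hyphen to six digits.
def normalize_lccn(lccn):
    lccn = lccn.replace(" ", "").split("/")[0]
    if "-" in lccn:
        prefix, _, serial = lccn.partition("-")
        lccn = prefix + serial.zfill(6)
    return lccn

print(normalize_lccn("n78-890"))  # n78000890
```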

He also hacked on VuFind to add RDFa. http://dilettantes.code4lib.org/vufind/ Explore button.

mirlyn.lib.umich.edu Bill Dueber added links to his VuFind.


What's still needed:
  • Agree on data models. FRBR or something like it. Aboutness vs isness.
  • More linked data available from very common identifiers
  • More linkages to resources outside the library domain. Who will do that? How? Tools.
  • Sustainability and preservation

Good talk. People inspired.

Harrison Dekker, Do-It-Yourself Cloud Computing with Apache and R

R. Rapache.

Good blog about R that I follow: Revolutions

Rosalyn Metz and Michael B. Klein, Public Datasets in the Cloud

Infrastructure as a Service: Amazon EC2

Platform as a Service: Google Apps, Heroku

Software as a Service: Zoho, Google Docs

Not talking about data you can download eg in a CSV.

Did a video of setting up an EC2 instance (took seconds) and attaching (mounting) a volume to it (the volume being a big set of census data in this case). Very cool.

Socrata. (Expensive.)

Did an example of loading census data into Google Fusion Tables. Really wild stuff: a 200 GB dataset copied into place and ready to be analyzed in three minutes. Looked like great tools for analyzing the data: visualizing, cross-tabulating, etc.

Michael Klein talked about issues and problems with all of this. Authority, provenance, preservation, access, etc.

Pushing Library Data. Secondary uses of it: research, testing, unified indexing. Not just bib data: anonymized borrowing data, etc.

Karen Coombs, 7++ Ways to Improve Library UIs with OCLC Web Services

  1. Crosslisting print and electronic materials. Can't see all the formats all in one thing. Use WorldCat Search API to see if you have the print version of the book and add record to ebook record.

  2. Linking to libraries nearby in case a book is out.

  3. Providing journal table of contents. Use xISSN to see if feed of recent TOCs is available.

  4. Peer review indicators.

  5. Provide information about authors. Use Identities and Wikipedia API to insert author information into dialogue box in UI.

  6. Link to free online versions of books. Get OCLCnum, then use ?'s APIs to see if they have it and then link to it.

  7. Adding similar items on same screen. Use DDC classification. Makes a lot more sense than just a Title keyword match like VuFind does.

  8. Bonus: Creating a mobile catalogue.
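Several of these boil down to a simple REST call against an OCLC endpoint. As one sketch, building an xISSN request URL for item 3 (the endpoint shape here is my recollection of OCLC's xISSN documentation, not something shown in the talk):

```python
# Sketch: build an xISSN web service URL for a journal's metadata.
# The base URL and parameters are assumptions from OCLC's xISSN docs.
from urllib.parse import urlencode

def xissn_url(issn, method="getMetadata", fmt="json"):
    """Return an xISSN request URL for the given ISSN."""
    base = "http://xissn.worldcat.org/webservices/xid/issn/"
    return base + issn + "?" + urlencode({"method": method, "format": fmt})

print(xissn_url("0028-0836"))  # Nature's ISSN
```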


Blog post: New York Times Mashups

Jennifer Bowen, Taking Control of Library Metadata & Websites Using the eXtensible Catalog

Extensible Catalog

Four components that can be used individually or together.

  1. User interface built on Drupal. Faceting. FRBRized. Customizable search interface.

Metadata has been restructured in a new way, FRBRy, but she didn't have time to get into that.

In search results there's a place to show the matching text (from record or whatever) so that users know why they're getting the result they did. They found in research that users want that.

Web forms to let you create custom search boxes, for just journals, databases, etc. (Widget-maker, I guess.) Showed how they automatically generate a DVD browser, with no special programming.

They've made about 20 Drupal modules.

  2. Metadata tools. Automated processing of large batches of metadata.

Metadata Services Toolkit. Harvest, process, dedupe, clean up, aggregate, synchronize metadata.


Version tracking of metadata through changes, so you can track the history.

  3. Connectivity tools to match up XC and the ILS.

NCIP and OAI. The OAI Toolkit works with virtually any ILS that exports MARC. The NCIP Toolkit lets XC talk to ILS auth, circ and patron services. Real-time.


Anjanette Young and Jeff Sherwood, Matching Dirty Data---Yet Another Wheel

Goal: Ingest metadata and PDFs for ETDs from UMI into DSpace.


Match on exact title: find the intersection of the two sets. Verify the intersection with an exact author match. Normalize author names to remove punctuation etc.
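The exact-match pass might be sketched like this (the record structure and field names are hypothetical):

```python
# Sketch of the exact-match pass: normalize titles, intersect the
# two sets, then verify candidates by normalized author name.
import string

def normalize(s):
    """Lowercase, strip punctuation, collapse whitespace."""
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())

def match(umi_records, dspace_records):
    """Yield (umi, dspace) pairs with equal titles and authors."""
    by_title = {normalize(r["title"]): r for r in dspace_records}
    for r in umi_records:
        cand = by_title.get(normalize(r["title"]))
        if cand and normalize(r["author"]) == normalize(cand["author"]):
            yield r, cand

umi = [{"title": "Models of Data!", "author": "Smith, J."}]
dsp = [{"title": "Models of Data", "author": "Smith J"}]
print(list(match(umi, dsp)))
```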

Examples of titles and names that are tricky to match.

Levenshtein Edit Distance.

Reminded me of this poem by bp Nichol that's carved in the pavement of bp Nichol Lane:

A lake
A lane
A line

Similarity = 1 - d_L(s1, s2) / max(|s1|, |s2|)

Fuzzy query in Solr/Lucene uses Levenshtein distance.
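A minimal sketch of the edit distance and the similarity score above (using words from the poem):

```python
# Levenshtein edit distance via dynamic programming, plus the
# similarity score 1 - d / max(len(s1), len(s2)).
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

def similarity(s1, s2):
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(levenshtein("lake", "lane"))  # 1
print(similarity("lake", "line"))   # 0.5
```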

Reduce search space. Identify stop words. Throw out common words (eg "models" and "data" in their diss titles).

Got a bit lost here.

Jaro-Winkler Algorithm. (Solr spellchecker uses it.) Works best for short strings. Developed by US Census.
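For comparison, a sketch of the standard Jaro-Winkler formulation (not necessarily the exact variant Solr's spellchecker implements):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, f in zip(s1, m1) if f]
    b = [c for c, f in zip(s2, m2) if f]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```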

Code: pypi editdist

http://bit.ly/ZGSmF String Comparison Tutorial

What they were looking for but was released after they'd done all their own work: MarcXimiL: The Bibliographic Similarity Analysis Framework

Slides: http://www.slideshare.net/ghostmob/matching-dirty-data

Ryan Scherle and Jose Aguera, HIVE: A New Tool for Working With Vocabularies

HIVE = Helping Interdisciplinary Vocabulary Engineering. Jose Aguera wasn't here.


Site: http://hive.nescent.org:9090/

Code: http://hive-mrc.googlecode.com/


David Kennedy and David Chandek-Stark, Metadata Editing---A Truly Extensible Solution

Duke University Libraries Digital Collections.

Python, Django, Yahoo! Grids CSS, jQuery.


http://tridentproject.org/, http://blog.tridentproject.org/

Lightning Talks

UW Forward - Steve Meyer

UW Forward uses Blacklight.

Search for 'psychology' and they recommend the psychology subject librarian. They use the WorldCat API to get links to Wikipedia entries for authors. Challenges: 14 Voyager ILS instances in the U Wisconsin system! Serials licensed differently at different campuses. They're having the various kinds of problems such projects have, and he'd like to talk to people with similar ones.

MODS4Ruby & Opinionated XML - Matt Zumwalt

Prezi presentation was a bit zoomy, but lively.


The Digital Archaeological Record - Matt Cordial

The Digital Archaeological Record

Beta application

Archaeologists can submit data encoded in whatever way they want, and then connect it to other data.

Example: fauna, searching in a square in SW United States. Integrate data tables.

Hydra: Blacklight + ActiveFedora + Rails - Willy Mene

Stanford + U Virginia + Hull.

Hydrangea Project next: Blacklight, ActiveFedora, Shelver, in Rails.

Why CouchDB? - Benjamin Young

Data gets lonely. Often it depends on an APIless app. Web apps are a bit better. Open source apps better still, but the data can be stuck in an RDBMS.


Listed advantages to using it. Portable standalone apps. Imagine as CouchDB apps: openlibrary.org. 3.5 gig dump now. If it supported replication you could get updates and parts of it.

Subject guides: ad-hoc.

couch.io does hosting and you can get one free database.

Data integrity (cheap, fast, and easy) - Gwen Exner

HathiTrust Large Scale Search update - Tom Burton-West


5.4 million full-text books. Full-text search of each, average response time of 3 seconds. They're using Solr. Big-scale stuff.

EAD and MARC Sitting in a Tree: D-R-U-P-A-L - Mark Matienzo


When he was at NYPL they migrated to Drupal from static content, ColdFusion, other stuff. Snazzy new site: http://nypl.org/


http://nypl.org/shrew/b16185699/mods.xml, http://nypl.org/shrew/b16185699/marcxml.xml

EZproxy Wondertool - Paul Joseph

He's at UBC. Had a bunch of EZProxy work to do and did a thing that made all his work easier.

HathiTrust APIs - Albert Bertram


Repository of MARC Abominations - Simon Spero

Lovecraft meets MARC. Building a test set of the eldritch MARC records from non-Euclidean geometries. Send to marcthulhu@ibiblio.org.

Mystery Meat - Joe Atzberger

Stephen Abram's anti-open-source FUD. Sirsi's Workflows client ships with tar.exe and gzip.exe but does NOT come with a copy of the GPL, which its license says it must. Does the Software Freedom Law Center know about this?

Fuwatto Search - Masao Takaku

http://fuwat.to/worldcat, Twitter: http://twitter.com/tmasao