Miskatonic University Press

Access 2009. Saturday #1: Mark Leggott, Virtual Research Environment: Two Years Later, or, Islandora Takes Shape

11 October 2009 conferences

Almost done now. There were four talks at Access on the morning of the last day, which was just a half-day. The opener was by our host Mark Leggott, who runs the University of Prince Edward Island library system. See also Peter Zimmerman’s notes and the video of the talk.

http://islandora.ca/, customized Drupal-Fedora thing.

He wants to build capacity: staff capacity, capacity on the island, etc., with the library at the forefront. Data stewardship: more than curation.

People at UPEI tell them: “This is the first time someone has responded to the data challenges I am facing.”

“There is no more significant opportunity for academic libraries in the next few decades.”

“It’s all about the local. Google can’t do local like you.”

Open source/open data.

Islandora: seven people working more than 50% of their time on it, soon to be over twelve. Used in research at UPEI ($150K hardware and $200K staff), administration (senate, document management pilot), learning (digital collections, LOR).

In the listing of everything in a given repository, each item has a B beside it, which means “post to a blog about this.” It sets up everything they need so they can easily say something about the paper, with a link to it, etc.

Islandora External. People outside are using it: U North Texas, Guelph, Carleton, UNBB, Georgia Tech. Interest from others.

The framework: Fedora is the repository layer, holding all the data and metadata. Drupal is the front end. Islandora is the glue that joins them: a Drupal module, plus PHP and Java apps. Rule engine for flexible workflows. Drop-in support for new modules. XACML filter.
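To make the “glue” idea concrete, here’s my own rough sketch, not Islandora code (that’s a PHP Drupal module): the basic pattern is the front end fetching an object’s datastream out of Fedora over a REST-style interface. The base URL, PID, and datastream ID below are made up, and exact paths vary by Fedora version.

```python
# Rough sketch only: fetch a hypothetical object's MODS datastream
# from a Fedora repository over a Fedora 3-style REST path, the kind
# of call a front end makes to render repository content.
import urllib.request

FEDORA = "http://localhost:8080/fedora"    # assumed Fedora base URL
pid = "demo:example-object"                # hypothetical object PID
dsid = "MODS"                              # hypothetical datastream ID

url = f"{FEDORA}/objects/{pid}/datastreams/{dsid}/content"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))
```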

SHERPA. Djatoka. Going to start using R. Abbyy.

Workflows very important. Can help researchers automate them, which they like. Lots of drudge work in science labs and automating it saves everyone, including lab students, time.

Solution packs: “assemblages of collection policies, disseminators, workflows, apps, data.”

They’re working with Sun. Going to sell boxes with Islandora already installed on them.


Access 2009, Friday #8: Jonathan Jiras, The Extensible Catalog

11 October 2009 conferences

Jonathan Jiras is from the Rochester Institute of Technology (across Lake Ontario from me) and was talking about the eXtensible Catalog project. See the video of the talk. There’s all kinds of interesting stuff going on in Rochester and I’m really interested to see how XC turns out. My notes:

XC is a set of open source tools. 1.0 is scheduled for January 2010; early versions of the software are available for download now.

Three parts to project:

User interface: Faceted, FRBRized. How’re they doing that? Built on Drupal. FRBRization: group related resources by Work.

Metadata tools: Aggregate metadata from various sources, clean it up, and convert MARC and Dublin Core to the XC schema. The XC schema uses RDA elements and has a FRBRized structure. “First real world practical use of RDA.”

XC connectivity suite: OAI toolkit, synchronize data that’s in another repository. NCIP toolkit. Circ status, placing holds and recalls, etc.
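Not the XC toolkit itself, just my own bare-bones illustration of what an OAI-PMH exchange looks like underneath: a ListRecords request against a placeholder endpoint. A real harvester would also follow resumptionTokens to page through large result sets.

```python
# Minimal OAI-PMH harvest sketch: one ListRecords request, print titles.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://repository.example.org/oai"   # placeholder OAI-PMH endpoint
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}
for record in tree.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    print(title.text if title is not None else "(no title)")
```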

Showed how all the parts fit together. Customization etc. “XC is a suite of tools that work best for you.”


Access 2009, Friday lightning talks

11 October 2009 conferences

All of the lightning talks at Access 2009 on the morning of Friday 2 October were good. The video recording has them all, I think, or you can read Peter Zimmerman’s notes. I didn’t write down much.

Dalhousie’s WorldCat Local had problems with the FRBR implementation and they want some changes made to suit academic libraries. For example, if there are multiple editions grouped together, it shows at the top the one with the most holdings in WorldCat libraries. They always want the most recent edition to show, especially in the sciences.
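Just to illustrate the difference with a toy example of my own (invented records, not WorldCat data): the two behaviours come down to which key you sort the grouped editions by.

```python
# Toy example: picking which edition of a FRBR-grouped work to surface.
editions = [
    {"year": 1999, "holdings": 2400},   # older edition, widely held
    {"year": 2008, "holdings": 310},    # newest edition, fewer holdings
    {"year": 2005, "holdings": 1100},
]

most_held = max(editions, key=lambda e: e["holdings"])   # what shows now
most_recent = max(editions, key=lambda e: e["year"])     # what they want

print(most_held["year"], most_recent["year"])   # 1999 vs 2008
```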

Bess Sadler showed some great GIS stuff.

My York University colleague Ali Sadaqain talked about our implementation of VuFind.


Access 2009, Thursday #9 - Stevan Harnad, Grasping What is Already Within Immediate Reach: Universal Open Access Mandates

07 October 2009 conferences

Stevan Harnad wasn’t there, so he sent a screencast of him giving the talk over his slides. It worked quite well, and was lightened by a couple of interludes where he got up to let his cat in or out. We could hear it meow. After the video, he came up on screen in a live video feed and answered questions. He couldn’t see us but did very well considering. The videos are online.

Peter Zimmerman blogged the talk. Here are my notes.

Defined open access. Free, open access.

To what? Essential: all 2.5 million articles published each year in all 25,000 peer-reviewed journals, in all scholarly and scientific disciplines, worldwide.

Open access does this. 25% - 250% greater research impact. Lawrence (2001) in Nature showed big citation advantage for open access computer science papers. Later work showed same result in all other disciplines.

Golden way: Publishers convert their journals to OA. Depends on the publishers.

Green way: Researchers deposit all their published articles into their own institution’s repository. Depends only on the research community. The research community can’t make publishers change to Golden, but they can change themselves to Green.

EPrints repositories are OAI-harvestable (openarchives.org).

But IRs are a necessary but not sufficient condition for green OA. Most of them are empty. Need to have mandates that say everyone does OA.

U Southampton does it. Showed how their repository is growing. World’s first green OA mandate.

“You self-archive unto others as you would have others self-archive unto you.”

Self-archiving mandate is natural extension of publishing mandate, for the web era.

Vast majority of people willing to self-archive, but don’t bother to go to the trouble unless it’s mandated.

57 university mandates so far, 4 in US, three faculty-level and MIT’s university-wide one.

EUA recommended mandate for all 791 universities in 46 countries, but not adopted yet.

Discussed green OA self-archiving advocacy and how to show its advantages to people.

Mandate should require: deposit of all articles; in the institutional repository; immediately upon acceptance for publication.

(look for Which Green OA Mandate is Optimal? and other links)

63% of journals already endorse green OA self-archiving.

EPrints has a solution for the others: button so you can request a copy for research purposes. ID/OA mandate. Immediate Deposit + Open Access.

“Don’t succumb to Gold Fever. The fastest and surest road to OA is Green.”

Three points about why to use OA: maximize research output; measure and reward research output; collect, showcase, and manage a permanent record of that research output.


Access 2009, Thursday #7-8: Roy Tennant vs Mike Rylander

07 October 2009 conferences

I think my numbering scheme for Access talks doesn’t match anyone else’s. Hard cheese, I’m afraid. This was a twofer, first with Roy Tennant in OCLC vendor mode and then Mike Rylander in Equinox Software (of Evergreen fame) vendor mode. Peter Zimmerman blogged both talks.

They both described the same kind of thing, a distributed ILS out on the Internet and not running locally, but Rylander’s vision was people building it all themselves and owning it all themselves. This vibed much better with me than the OCLC idea, though the latter is certainly more likely and will be really interesting to see.

Roy Tennant, ILS In the Sky with Diamonds

Putting the ILS in the sky means “moving library data and applications to the network level at web scale.” Moving to network level: going to the cloud. Explained cloud computing. Advantages and disadvantages. Amazon.

OLE Project. oleproject.org. Showed workflow diagrams, the point at which their Mellon funding ended and they don’t have more funding. What next? Unknown. [This was clarified in IRC by someone involved with the project who said the first round of funding ended; the second is to come.]

eXtensible Catalog, out of Rochester.

Libraries are held back by: too many systems to support, too much invested in maintenance, a fragmented web presence, lost opportunities for leveraging data.

Putting an ILS on the network: boring. What can they do to use the infrastructure they have?

1,212,383 libraries worldwide. Transactions: 166,041,975,140 per year, or about 5,265 per second (166,041,975,140 ÷ 31,536,000 seconds in a year).

They say they could do that with a handful of commodity servers. How?

There’s data, public and private. Users. Libraries. Services.

Next-gen ILS: Do all library functions. Scalable. 100% web-based.

He stressed how much cheaper this will be. Selling the product. Showed timelines for what they’re doing. Targeting a rollout in 2011 for this new thing.

Mike Rylander, Open Source ILS

Open source matters. Open data matters.

People threaten their vendors by saying they’ll move to Evergreen. He’s cool with that.

Cloud computing: “The use of any computing resource that is not mine, AND that I don’t have to manage.” “Learning not to waste computing resource.”

Evergreen. SOA, SaaS-ish. PaaS-ish.

Could they scale Evergreen up to run a community-owned, community-run community-maintained Platform-as-a-Service cloud?


Access 2009, Thursday #5: Dan Chudnov, Repository Development at the Library of Congress

07 October 2009 conferences

A lively, interesting and inspiring talk from Dan Chudnov, as always. He gets right to the good stuff and leaves out the dull bits about how committees mandated their initial governance models. Peter Zimmerman blogged this talk. Here are my notes:

He works in the Repository Development Group at LC: 30 people, with dedicated developers, QA, systems operations, and project management. They are part of OSI, the Office of Strategic Initiatives. Their job: capture the digital artifact, register it at the Copyright Office, and pass it along into the digital collection for registration, cataloging, indexing, and preservation.

World Digital Library: wdl.org

Partners from all over the world, with data and metadata coming in from all over in all kinds of ways. It’s also available to people all over the world in their own languages: /ru/ for Russian, /zh/ for Chinese, /ar/ for Arabic, and so on.

Huge load on the site when it went live, with 9,000 requests a second. Bigger than any web thing LC had done before. Built on Solaris, Apache, Nginx, Django, etc.

Clean URIs that won’t change. Static pages, which helps global edge caching. They used Akamai to keep access working; they couldn’t have kept up with all that demand themselves.

Chronicling America: chroniclingamerica.loc.gov

Started with about 140,000 US newspaper title records. All of the data there is freely downloadable. Whole issues. 100 TB of data, growing quickly, at just 16 states so far. Petabytes in a few years. Built on Solaris, Apache, MySQL, Solr, Django. Again with the clean URIs, lots of local page caching because it has many more pages that are each used a lot less.

They use BagIt, and in fact helped put it together. It’s like a packing slip for data. Works across space, systems, organizations, and time. Bagger, a GUI for making Bags: http://sf.net/projects/loc-xferutils/ This was the first open-source licensed software that LC had released.
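For a sense of what that looks like in practice, here’s a minimal sketch of mine using LC’s bagit-python library (the directory name and metadata values are made up):

```python
# Minimal sketch: make and validate a bag with the bagit library
# (pip install bagit). Directory and metadata are hypothetical.
import bagit

# Turn an existing directory of files into a bag in place: this adds
# bagit.txt, bag-info.txt, and checksum manifests alongside a data/ dir.
bag = bagit.make_bag("my_transfer", {"Source-Organization": "Example Library"})

# Later, or on the receiving end, verify the manifests still match the payload.
print(bag.is_valid())
```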

They get 30,000 books in each day, lots of newspapers, etc: their mailroom is a fascinating place to watch, he says.

Showed Transfer UI, an inventory/workflow tool they use internally to manage all of the stuff coming in for Chronicling America.

Registering and depositing materials for copyright: they hope to support eDeposit with these tools. Also “Deposit Demand,” basically taking in databases of journals.

How does all this stuff get incorporated digitally into the collection? “Does anybody know what that means? I don’t know what it means either.”

Traditional approach: make catalogue records, make an exhibit site. Cost of integrating all of this is high. Building them costs money, maintaining them costs money. But cost of consistent web strategies is low.

Linked data: Use URIs as names for things. Use HTTP URIs. Provide useful information. Include links to other URIs. http://w3.org/DesignIssues/LinkedData.html

Two sites doing this:

http://id.loc.gov/ LCSH on the web for free. Embodies linked data principles. View source to see alternate formats: RDF/XML, N-Triples, JSON. Also linked at the bottom. “At this URI is a concept with a precise meaning.” Now there is a standard way to refer to an LCSH heading, instead of strings. All freely downloadable too. This was also new for LC.
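For example (my sketch, with a made-up identifier), dereferencing one of those URIs and asking for RDF should get you the machine-readable form:

```python
# Sketch: ask id.loc.gov for machine-readable data about a heading via
# HTTP content negotiation. The identifier below is a placeholder; in
# practice you'd take it from the page or its alternate-format links.
import urllib.request

uri = "http://id.loc.gov/authorities/subjects/sh00000000"  # hypothetical ID
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
with urllib.request.urlopen(req) as resp:
    print(resp.read()[:500])   # start of the RDF/XML describing the concept
```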

[ ] Idea: Get a dump of all items in our collection and their subject headings. Download LCSH. Compare and see what we have that has invalid or non-authoritative subject headings.
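A rough sketch of how that comparison could go, with hypothetical input files (one heading per line in each):

```python
# Compare our catalogue's subject headings against a downloaded list of
# authorized LCSH labels and report the ones that don't match.
def load(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

our_headings = load("our_subject_headings.txt")   # dump from our ILS
lcsh_labels = load("lcsh_labels.txt")             # extracted from the LCSH download

suspect = sorted(our_headings - lcsh_labels)
print(f"{len(suspect)} headings not found in LCSH")
for heading in suspect[:20]:
    print(" ", heading)
```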

At Chronicling America, view source to see OCR information and a resourcemap. Look at the resourcemap. OAI-ORE aggregation. “A constellation of stars in the sky.” Linked data, reusing other vocabularies. All exposed on the web. This was new for LC too.

Really interesting thing about this app: the web is the API. You visit the page, you’re using the API. The API documentation is just a bunch of links to pages on the site.

If we all do this in all our apps, we have distributed conceptual integration. The web is a universal collection. LC puts its artifacts on its web, we put ours on our web, it all fits together and there we go.

He summarizes: “Content that scales on the way in. Apps that scale on the way out. Movage movage movage. Transfer, inventory, workflow. BagIt. FLOSS. Free data you can use. Web of data, available and useful.”


Access 2009, Thursday #2: Richard Akerman - Will We Command Our Data?

07 October 2009 conferences

Richard Akerman blogs at Science Library Pad. Peter Zimmerman blogged this talk too.

My notes from his Access 2009 talk:

Talking about data. Storage has gotten incredibly amazing: you can store huge amounts of data in tiny spaces. Data floating around in the cloud doesn’t seem real, but it is, and it takes a lot of energy and hardware to store it: electricity, air conditioning, etc. Carbon emissions.

Four sources of data: research data, government data, library data, personal data.

Research data: lots done about storing it, giving access to it.

Government data: has really opened up in the last year.

Library data: in catalogues, access logs, id.loc.gov, etc.

Personal data: easy to share your GPS location, and all kinds of other personal data, online.

Everything about sharing data is growing: the value of it (more can be done with it by others), the ease of it, and the level of it.

OECD agreed: data from publicly-funded research should be released to the public. One reason this isn’t controversial is that publishers aren’t in, and never were in, the business of publishing data. So data is an easy way to get into open access.

Toronto statement on prepublication data sharing: http://www.nature.com/nature/journal/v461/n7261/full/461168a.html

Open up the data before any papers based on it come out. Say, I’m going to write about this, but go ahead and use this data however you want.

In libraries: Berkeley Accord (March 2008). Basic rights of access to data in library systems. All vendors but one (Innovative Interfaces?) signed on. Though how well have they implemented it?

Personal data. WIRED cover feature “Living By Numbers” and personal data tracking (July 2009).

Why libraries? Advocates, exemplars, experts.

If lots of data is made available, how is it made findable? Need solid metadata and classification to make it easy for people to find, otherwise it’s just a big mess of numbers.

http://datacite.org/ DOIs for data

NRC/CISTI: Gateway to Scientific Data Sets http://cisti-icist.nrc-cnrc.gc.ca/eng/services/cisti/scientific-data/data-sets/

Crown copyright in Canada makes it hard to give away government data (which is another example of how stupid it is), but this project is on: GeoGratis: http://geogratis.cgdi.gc.ca/

How can libraries connect to their patrons?

  • LibraryThing’s free covers
  • Open Library
  • Talis Connected Commons
  • MESUR
  • id.loc.gov

APIs vs raw data. APIs: always serve up the latest data, control over access, tracking/stats, complex functionality. Raw data: unconstrained access, not limited by an API, no metadata.

Book about recording of personal data: http://totalrecallbook.com/.


Access 2009, Thursday #1: Cory Doctorow

07 October 2009 conferences

The first talk at Access 2009, early in the morning of Thursday 1 October, was by Cory Doctorow. Like all the talks it was recorded, and for fun I’ll embed the video here.

You can also just download the audio.

Peter Zimmerman blogged this talk and all the others. My notes:

Old visions of networks and Internet as just hyper versions of what we already had, with more TV and movie stars.

“We have forever traded quality and reliability for price, access, and customizability.”

“Content isn’t king, conversation is.” That’s why the telecommunications industry is bigger than the entertainment industry.

Telcos calling the shots and setting the laws now, which means trouble.

Discussion of culture and industry and ownership. Rules vary from country to country. Parody right in US but not Canada/UK. South African laws inhibit making alternate versions of even out of copyright books for the blind. Search engines probably illegal in Europe.

Regulatory system for big companies, making lots of copies of things with big machines, being shoehorned to fit regular people who can’t get through a day without copying. His examples: finding NHS information about what it means when a kid has little pink polka-dots, birthday calls from relatives, all affected by copyright law. Obama doing fireside chats on YouTube, regulated by copyright.

Copyright continues to be made as industrial rules for industrial players. Copyright should regulate what industries do, not what you and I do.

Cory likes copyright, but he doesn’t want the same rules he uses with his publisher to apply to his readers.

Copyright’s purpose is to ensure that the largest number of people have the greatest amount of participation.

Librarians have powerful voices to speak out about this: very respected by everyone, our goal of universal access to all human knowledge is a fundamental human goal, everyone knows we’re not getting rich on it, so we speak as an “unimpeachable force for moral good.”

On a question about a phony trade war with China and how much we buy from them: “Our factories can’t be converted back from executive lofts.”

Shell isn’t an oil company, it’s an IT company that moves oil. Shell without the Internet is “just a hole in the ground with some guns around it.” McDonald’s isn’t a hamburger company, it’s an IT company that sells hamburgers.

The future is more IT and better supply chains.

“The coin through which you level up in the great game that is academe is citation.”