Friday, November 27, 2009

Mashlib Pipes Tutorial: Reading List Inspired Journal Watchlists

It's coming up to that Mashed Libraries time of year again (Middlemash, this time), which means it's time for me to put together a new Yahoo Pipes demo, ideally one that builds on something I've used at a previous mashed Library event (in particular, the Mashlib Pipes Tutorial: 2D Journal Search). So here's a little something that does exactly that...

(If you're new to Yahoo Pipes, check out the introductory post "Getting Started With Yahoo Pipes: Merging RSS Feeds", which shows how Pipes work, and leads you through a get-you-started activity to create a pipe that merges several RSS feeds to produce a single, aggregate feed.)

In particular, in this demonstration I'll show how it's possible to use a pipe that either you yourself, or someone else entirely, might have already built, within a new pipe of your own.

To set the scene, suppose you have a list... a list of resources... a reading list, for example. In and of itself, this might be a key resource in the delivery of a course, but might it also be more than that? Whenever I see a list of web resources, I ask myself: could this be the basis of a custom search engine that searches over just those resources, or just the websites those resources live on? So where might this thought take us in the context of a reading list?

Over on the Arcadia blog, I showed how Mendeley might be used to support the publication and syndication of reading lists (Reading List Management with Mendeley), using a list built around references to a series of journal articles as an example; this list, in fact: "Synthetic Biology" Collection, by Ricardo Vidal.

One of the nice things about Mendeley public collections is that they expose an RSS feed of the collection, so might we be able to use Yahoo Pipes to:

- extract a set of journal titles from a list, and then
- create a watchlist over the current tables of contents of those listed journals

to keep us up to date with some of the recent literature in that topic area?

To start with, let's look at how we might grab a list of current contents feeds, filtered by keyword, for the journals listed in the Mendeley Reading list identified above. The first thing to do is import the reading list feed into the pipes environment:

Pipes grab feed

The way this list has been constructed we can find a reference to the journal in the description element:

Inspecting a Mendeley reading list

We also note that a "journal" category has been used, which we could filter the items against:

Mendeley Pipe - filter on journal items

Extracting the journal title from the description requires the use of a heuristic (rule of thumb). Noticing that the description typically contains references of the form Journal title (year), volume, page reference etc., we can use a regular expression to strip out everything after and including the first bracket that opens on to a numeral 1 or 2:

Mendeley - extracting a journal title

The second step of the regular expression block simply groups the title as a search phrase within quotation marks. Using a Unique block removes duplicate entries in the feed so we don't search for the same journal title more than once.

dedupe in a pipe
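Outside the Pipes environment, the same three steps (strip, quote, dedupe) might be sketched in Python as follows. This is only an illustration of the heuristic described above, not what Pipes runs internally: the function names and the sample descriptions are invented, and the pattern assumes descriptions of the form Journal title (year), volume, pages.

```python
import re

def journal_search_phrase(description):
    """Cut the description at the first '(' that opens on to a 1 or 2, then quote it."""
    # Assumption: descriptions look like "Journal title (year), volume, pages".
    title = re.sub(r"\s*\([12].*", "", description, flags=re.DOTALL).strip()
    return f'"{title}"'

def unique(phrases):
    """Drop duplicates, keeping the first occurrence of each phrase."""
    seen = set()
    result = []
    for phrase in phrases:
        if phrase not in seen:
            seen.add(phrase)
            result.append(phrase)
    return result

# Invented sample descriptions, not taken from the real Mendeley feed.
descriptions = [
    "Nature Biotechnology (2008), 26, 787-793",
    "Molecular Systems Biology (2006), 2",
    "Nature Biotechnology (2009), 27, 1139-1150",
]
phrases = unique(journal_search_phrase(d) for d in descriptions)
print(phrases)
```

Because the pattern only fires on a bracket followed by 1 or 2, a title that itself contains a bracketed word, such as "(Online)", survives intact; a title containing a bracketed number starting with 1 or 2 would be cut short, which is the price of using a rule of thumb.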

To grab the filtered contents feeds, we use the Loop block to search Scott Wilson's JOPML service (which provides a search interface over journal titles indexed by the TicTocs current journal contents service) using the 2D Journal Search block I put together after the last mashlib:

Using an embedded pipe

So what do we have now? The ability to pull out a list of unique journal titles (hopefully!) and then use these as publication search terms in a JOPML search for publications indexed by TicTocs; a set of keyword/topic search terms are then applied to the titles of articles in the current issue of the listed journals, in order to provide a watchlist of articles on a particular topic from a list of journals as identified in a reading list. (To display all the content, simply use a space character as a search term.)
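The keyword step can be pictured as a simple substring test on each article title. The sketch below is a stand-in under stated assumptions: it does not call JOPML or TicTocs, the article dictionaries and their field names are hypothetical, and the case-insensitive matching is a guess at how the filter behaves.

```python
def keyword_watchlist(articles, keyword):
    """Keep articles whose title contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [a for a in articles if needle in a["title"].lower()]

# Hypothetical articles standing in for items from current contents feeds.
articles = [
    {"title": "Engineering a synthetic gene circuit", "journal": "Journal A"},
    {"title": "Protein folding revisited", "journal": "Journal B"},
]

print(keyword_watchlist(articles, "synthetic"))
# A single space matches any title of two or more words, so it shows everything.
print(len(keyword_watchlist(articles, " ")))
```

This also suggests why a space character works as a "show all" search term: practically every article title contains at least one space.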

By looking at the data returned by the 2D search pipe, we also get a few ideas for further possible refinements:

TicTocs metadata

For example, we might sort the results by publication title, or use a regular expression to annotate the title with the name of the journal it came from.
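As a rough sketch of those two refinements, assuming each item carries a "title" and a "journal" field (hypothetical names, not the actual TicTocs metadata keys):

```python
def annotate_and_sort(articles):
    """Append the journal name to each title, then sort by journal name."""
    annotated = [
        {**a, "title": f'{a["title"]} [{a["journal"]}]'} for a in articles
    ]
    return sorted(annotated, key=lambda a: a["journal"].lower())

# Invented sample items.
results = annotate_and_sort([
    {"title": "Protein folding revisited", "journal": "Journal B"},
    {"title": "Engineering a synthetic gene circuit", "journal": "Journal A"},
])
for item in results:
    print(item["title"])
```

In Pipes itself the equivalent would be a Sort block followed by a Regex block on the title element.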

You can find a copy of the pipe here: Mendeley Reading list journal watcher.

Tuesday, November 3, 2009

Open Library Training Materials and Custom Search Engines

Chatting to a fellow Visiting Fellow this morning about the best way of searching historical newspaper content (my vote, given the local context, was to use Factiva or LexisNexis (or is it called Lexis Library now?)), it struck me that it might be handy to be able to search across all the UK HEI Library websites for tutorials and generic training materials.

So how might we do this? One easy way I know of creating a site (or page) limited search engine is via a Google Custom Search Engine.

For a more detailed overview, see Google Developer Day US - Google Custom Search Engine.

Handily, @ostephens posted an application that screenscrapes UK HE Library details (including the URIs of Library websites) from Sconul the other day (Accessing Sconul Access), from which I took a dump of details for all UK Libraries that I've placed at, RSSified at Sconul RSS Pipe and geocoded/mapped at SCONUL Map.

Sconul Map

It's been some time since I created a Google Linked Custom Search Engine (that is, a custom search engine whose limits are set dynamically from a linked-to configuration file, that is, an annotation file), but I had a vague memory that it was possible to create an annotation file from an RSS feed (such as the one I created above from the Sconul data), and indeed it is possible: Tools for Linked CSEs

So for example, here is an annotation file from the Sconul data

However, wiring this feed into a new CSE didn't appear to work (maybe it takes some time for the feed to start powering the CSE?), so instead I ran a quick search and replace over the contents of a copy of the annotation file, and just pasted the literal URIs into the configuration page of a new CSE:

Creating a Google CSE
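The search and replace could equally be done with a few lines of Python. The sketch below assumes the annotation file is XML in which each site is an Annotation element carrying its URL pattern in an about attribute; the sample content, label name and function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

def site_patterns(annotation_xml):
    """Return the 'about' URL pattern of every Annotation element."""
    root = ET.fromstring(annotation_xml)
    return [a.get("about") for a in root.iter("Annotation") if a.get("about")]

# Invented sample; the real file would list the Sconul library sites.
sample = """<Annotations>
  <Annotation about="www.example.ac.uk/library/*">
    <Label name="_cse_example"/>
  </Annotation>
  <Annotation about="library.example.org/*">
    <Label name="_cse_example"/>
  </Annotation>
</Annotations>"""

# One pattern per line, ready to paste into the CSE configuration page.
print("\n".join(site_patterns(sample)))
```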

Anyway, here it is, such as it is: a custom search engine that searches over UK HEI Library websites (possibly;-)

Example UK HEI Library website CSE

PS Also related, @lorcand posted a couple of tweets over the last few days about Library tutorial videos on Youtube:
- U Glasgow library vids "Library On Demand" via @williamjnixon (URL fixed)
- Five (yes, five) video tutorials about using the library catalogue.

I'm not sure if there's a list of Library presences on Youtube (there is one for HEIs, which I used as the basis of a UK HEIs DeliTV channel) but if there is, it won't be too hard to create a custom search engine over those channels.

In the meantime, here are a couple more loosely library related custom search engines: How Do I? Instructional video search (about HowDoI), and my Science Experimental Protocols Video Search, which searches over several sites that collate scientific/experimental protocols.

See also: Brian Kelly on Opening Up Institutional Training Resources, me on Google('s) Training Resources

Cambridge Calendar Feeds (Part I) - Screenscraping with Yahoo Pipes

In the post Getting Started With Yahoo Pipes: Merging RSS Feeds, I described how it's possible to merge two or more RSS feeds within a Yahoo pipe in order to produce a single combined feed.

But what happens if there is no feed available from a website or a webpage? Is there any way we can bring that content into the Yahoo Pipes environment, perhaps so that we can combine it with a 'proper' RSS feed? Well it so happens that there are several other ways of bringing content into a Pipe, other than by subscribing to an RSS feed, and I'll describe one of them here: importing an HTML page directly, and turning a particular section of it into an RSS feed.

As an example, let's consider the following page on the talks@cam website:


Although many of the list types on the talks@cam website support a wide variety of output formats (including RSS), the daily full fat feed doesn't appear to. The only option is the HTML output (unless someone can tell me how to find a feed? That said, this post is all about making do with what we've got and consequently generating RSS from an HTML page...)

If we look at the HTML code that generates the page using the View Source option from the browser View menu, we can (eventually) see that the page is structured quite neatly.

View source on a cam talks daily listing

(Just because the page looks neat and tidy in the full browser view doesn't necessarily mean the HTML code is nicely structured!).
In particular, we can see that there is a repeating element at the start of each calendar entry (the <li> element) and that each listing item follows a similar pattern, or structure. So for example, if we look at:

<li>12:45 - <a href="" class="click link">Dependent types and program equivalence</a></li>

we see it has the structure:

<li>TIME - <a href="EVENT URI" class="click link">EVENT DETAILS</a></li>

This is how it appeared on the rendered web page in the browser:

talks@cam example event

The <li> defines a list element, which is rendered in the browser as a list item starting with a bullet; the <a> tag defines a link, with the defined URI and link text.

Now suppose that we would like an RSS feed where each item in the feed corresponded to a separate event. How might we go about that?

The first step is to bring the HTML page into the pipes environment using the Fetch Page block from the Pipes' Source menu. The URI appears to pull up the events for the current date (though a more exact URI pattern of the form such as also pulls up the page for a particular date) so that's the one we'll use:

Yahoo screen scraper

You'll see the pipe has brought the web page into the pipe context. As well as previewing the rendered web page, we can inspect the HTML by clicking on the source in the Previewer:

Pipe screen scraper - view html

You'll notice that the Fetch Page block allows us to declare where we want to cut content from and to, and also how we want to split it. That is, we can specify some HTML code in the page that tells the pipe where the 'useful' part of the page starts (that is, the part we want to "scrape" the content from), where the useful part of the page (as far as we're concerned) ends, and what piece of repeating HTML we want the pipe to use to separate out the different items contained in the page.

To start with that 'delimiter', we recall each item in the event list starts with <li>, so we shall use that as our delimiter.

But how do we know where the useful HTML starts, and where it ends? We have to find that by trial and error through inspecting the HTML. We need something that appears for the first time in the page close to the start of the useful content, and something that appears for the first time after that just after the end of the useful content. (If necessary, we might have to grab an excessively large piece of HTML from the page.)

Note that in the current case, whilst it might look like <h3>Monday 02 November 2009</h3> provides us with a unique place from which to start cutting content, if we look at the listing for a different date, there will be a different set of characters there...!

Using the </h3> tag as the start of the useful content, and </ul> as the end, we can tell the pipe to cut out the useful listings:

talks@cam scraping

Adding the <li> delimiter gives us a crude set of items, one for each event:

Screenscraping talks@cam
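To make the cut-and-split idea concrete, here is a minimal Python sketch of the same operation on a toy page. It is an approximation of what the Fetch Page block appears to do, not its actual implementation; the function name, the example.org links and the second talk are invented.

```python
def cut_and_split(html, cut_from, cut_to, delimiter):
    """Keep the text between the first cut_from and the next cut_to, then split it."""
    start = html.find(cut_from)
    if start == -1:
        return []
    start += len(cut_from)
    end = html.find(cut_to, start)
    if end == -1:
        end = len(html)
    chunk = html[start:end]
    # Discard empty pieces and surrounding whitespace.
    return [piece.strip() for piece in chunk.split(delimiter) if piece.strip()]

# Toy page; the links are placeholders, not real talks@cam URIs.
page = """<html><body>
<h3>Monday 02 November 2009</h3>
<ul>
<li>12:45 - <a href="http://example.org/talk/1" class="click link">Dependent types and program equivalence</a></li>
<li>14:00 - <a href="http://example.org/talk/2" class="click link">A second, invented talk</a></li>
</ul>
<p>Footer</p></body></html>"""

items = cut_and_split(page, "</h3>", "</ul>", "<li>")
for item in items:
    print(repr(item))
```

Note that the first piece is just the leftover <ul> tag that sat before the first delimiter. Crude items of this kind, with no event link in them, are why a tidying-up filter is needed later on.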

In order to turn these items into 'proper' RSS feed items, we need to define a title, and ideally also a link for the event. We can create and prepopulate these elements using the Rename block from the Operators menu:

Screenscraping Yahoo pipes

We now need to tidy up those elements with a Regex (Regular Expression) block. Regular Expressions are like voodoo magic - they let you manipulate a string of characters in order to transform that string in all sorts of ways. Written correctly, they can be very powerful and look very elegant. I tend to use them in a pidgin way, fumbling my way to a solution using simple rules of thumb and tricks I've used before!

So for example, I know that the pattern .*href="([^"]*)".* will strip out the URI from a single line of text that contains a link, and place it in the variable $1.

RegEx in Yahoo pipe

(The . stands for 'any character'; the * says 'zero or more of the preceding character' (or of any character, when it follows a .); the expression [^"]* says 'a run of characters, []*, none of which is (^) a "'; the () marks out the set of contiguous characters that will be passed to the variable $1; href=" and the final " are literal characters that must be matched exactly. The whole string of matched characters is then replaced by the contents of the $1 variable. The s option is ticked so that . also matches newline characters, which lets the pattern cope with items that run over more than one line.)
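The same pattern can be tried out in Python, where \1 plays the role of $1 and the re.DOTALL flag corresponds to ticking the s option. This is a sketch for experimenting with the expression rather than a copy of the pipe; the function name and the example.org link are placeholders.

```python
import re

# Same pattern as in the pipe; DOTALL lets . match newlines as well.
HREF_PATTERN = re.compile(r'.*href="([^"]*)".*', re.DOTALL)

def extract_link(item):
    """Replace the whole item with the contents of its href attribute."""
    return HREF_PATTERN.sub(r"\1", item)

item = ('12:45 - <a href="http://example.org/talk/1" class="click link">'
        'Dependent types and program equivalence</a></li>')

print(extract_link(item))
# An item with no href is left unchanged, because nothing matches.
print(extract_link("<ul>"))
```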

A second regular expression - ([^\s]*)[^>]*>([^<]*).* replaced by $2 ($1) - this time applied to the title element, extracts the name of the talk and the time.

Yahoo pipes regex
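Here is the second expression tried out in Python on one of the listing items. As before this is an illustrative sketch: \2 and \1 stand in for $2 and $1, the function name is invented, and the link is a placeholder.

```python
import re

# ([^\s]*) grabs the time, [^>]*> skips to the end of the <a ...> tag,
# ([^<]*) grabs the link text, and .* swallows the rest of the item.
TITLE_PATTERN = re.compile(r"([^\s]*)[^>]*>([^<]*).*", re.DOTALL)

def tidy_title(item):
    """Rewrite 'TIME - <a ...>EVENT</a></li>' as 'EVENT (TIME)'."""
    return TITLE_PATTERN.sub(r"\2 (\1)", item.strip(), count=1)

item = ('12:45 - <a href="http://example.org/talk/1" class="click link">'
        'Dependent types and program equivalence</a></li>')

print(tidy_title(item))
```

The expression relies on the time being the first run of non-whitespace characters in the item, which is why the sketch strips leading whitespace before matching.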

Finally, we tidy up the output of the feed to remove any items that don't also link to an event record on talks@cam (that is, we remove items where an event link was not extracted by the regular expression.)

Yahoo pipes - filter
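A rough Python equivalent of this tidying step is shown below. The exact rule used in the pipe's Filter block is not spelled out here, so the test for "looks like a URL" is an assumption, as are the sample items and the function name.

```python
def has_event_link(item):
    """True when the link field looks like a URL rather than leftover HTML."""
    return item.get("link", "").startswith(("http://", "https://"))

# Invented items: the first is the kind of debris left when no link was extracted.
items = [
    {"title": "(no title)", "link": "<ul>"},
    {"title": "Dependent types and program equivalence (12:45)",
     "link": "http://example.org/talk/1"},
]

events = [item for item in items if has_event_link(item)]
print(events)
```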

And to tidy up the presentation a little more, we click in the title tab to give the pipe an appropriate name:

Pipes title

If we now Run the pipe, we can grab a link to an RSS feed of today's talks@cam:

talks@cam scraper pipe

In a follow on to this post, I'll show how to bring time into the equation and add a timestamp for each event to each item in the feed.