It's coming up to that Mashed Libraries time of year again (Middlemash, this time), which means it's time for me to put together a new Yahoo Pipes demo, ideally one that builds on something I've used at a previous mashed Library event (in particular, the Mashlib Pipes Tutorial: 2D Journal Search). So here's a little something that does exactly that...
(If you're new to Yahoo Pipes, check out the introductory post "Getting Started With Yahoo Pipes: Merging RSS Feeds", which shows how Pipes work, and leads you through a getting-started activity to create a pipe that merges several RSS feeds to produce a single, aggregate feed.)
In particular, in this demonstration I'll show how it's possible to use a pipe that either you yourself, or someone else entirely, might have already built, within a new pipe of your own.
To set the scene, suppose you have a list... a list of resources... a reading list, for example. In and of itself, this might be a key resource in the delivery of a course, but might it also be more than that? Whenever I see a list of web resources, I ask myself: could this be the basis of a custom search engine that searches over just those resources, or just the websites those resources live on? So where might this thought take us in the context of a reading list?
One of the nice things about Mendeley public collections is that they expose an RSS feed of the collection, so might we be able to use Yahoo Pipes to:
- extract a set of journal titles from a list, and then
- create a watchlist over the current tables of contents of those listed journals
to keep us up to date with some of the recent literature in that topic area?
To start with, let's look at how we might grab a list of current contents feeds, filtered by keyword, for the journals listed in the Mendeley Reading list identified above. The first thing to do is import the reading list feed into the pipes environment:
The way this list has been constructed we can find a reference to the journal in the description element:
We also note that a "journal" category has been used, which we could filter the items against:
Extracting the journal title from the description requires the use of a heuristic (rule of thumb). Noticing that the description typically contains references of the form Journal title (year), volume, page reference etc., we can use a regular expression to strip out everything after and including the first bracket that opens on to a numeral 1 or 2:
The second step of the regular expression block simply groups the title as a search phrase within quotation marks. Using a Unique block removes duplicate entries in the feed so we don't search for the same journal title more than once.
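A rough JavaScript sketch of that heuristic might look like the following. The sample descriptions and the function names are made up for illustration, and the exact pattern used in the pipe may well differ:

```javascript
// Made-up description strings of the form "Journal title (year), volume, pages".
const descriptions = [
  "Journal of Example Studies (2008), 12(3), 45-67",
  "Annals of Made-Up Research (2009), 4(1), 5-10",
  "Journal of Example Studies (2007), 11(2), 100-112",
];

// Step 1: strip everything from the first bracket that opens on to a 1 or 2.
// Step 2: wrap what is left in quotation marks so it is treated as a phrase.
function toSearchPhrase(description) {
  const title = description.replace(/\s*\([12].*$/, "").trim();
  return '"' + title + '"';
}

// Rough equivalent of the Unique block: drop repeats, keeping first-seen order.
function uniqueSearchPhrases(items) {
  return [...new Set(items.map(toSearchPhrase))];
}

console.log(uniqueSearchPhrases(descriptions));
```

Like the original, this is only a rule of thumb: a journal title that itself contains a bracket opening on to a 1 or 2 would be cut short.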
To grab the filtered contents feed, we use the Loop block to search Scott Wilson's JOPML service (which provides a search interface over journal titles indexed by the TicTocs current journal contents service) using the 2D Journal Search block I put together after the last mashlib:
So what do we have now? The ability to pull out a list of unique journal titles (hopefully!) and then use these as publication search terms in a JOPML search for publications indexed by TicTocs; a set of keyword/topic search terms are then applied to the titles of articles in the current issue of the listed journals, in order to provide a watchlist of articles on a particular topic from a list of journals as identified in a reading list. (To display all the content, simply use a space character as a search term.)
By looking at the data returned by the 2D search pipe, we also get a few ideas for further possible refinements:
For example, we might sort the results by publication title, or use a regular expression to annotate the title with the name of the journal it came from.
Chatting to a fellow Visiting Fellow this morning about the best way of searching historical newspaper content (my vote, given the local context, was to use Factiva or LexisNexis (or is it called Lexis Library now?)), it struck me that it might be handy to be able to search across all the UK HEI Library websites for tutorials and generic training materials.
It's been some time since I created a Google Linked Custom Search Engine (that is, a custom search engine whose limits are set dynamically from a linked-to configuration, that is, annotation, file), but I had a vague memory that it was possible to create an annotation file from an RSS feed (such as the one I created above from the Sconul data), and indeed it is possible: Tools for Linked CSEs
However, wiring this feed into a new CSE didn't appear to work (maybe it takes some time for the feed to start powering the CSE?), so instead I ran a quick search and replace over the contents of a copy of the annotation file, and just pasted the literal URIs into the configuration page of a new CSE:
I'm not sure if there's a list of Library presences on YouTube (there is one for HEIs, which I used as the basis of a UK HEIs DeliTV channel) but if there is, it won't be too hard to create a custom search engine over those channels.
But what happens if there is no feed available from a website or a webpage? Is there any way we can bring that content into the Yahoo Pipes environment, perhaps so that we can combine it with a 'proper' RSS feed? Well it so happens that there are several other ways of bringing content into a Pipe, other than by subscribing to an RSS feed, and I'll describe one of them here: importing an HTML page directly, and turning a particular section of it into an RSS feed.
As an example, let's consider the following page on the talks@cam website:
Although many of the list types on the talks@cam website support a wide variety of output formats (including RSS), the daily full fat feed doesn't appear to. The only option is the HTML output (unless someone can tell me how to find a feed? That said, this post is all about making do with what we've got and consequently generating RSS from an HTML page...)
If we look at the HTML code that generates the page using the View Source option from the browser View menu, we can (eventually) see that the page is structured quite neatly.
(Just because the page looks neat and tidy in the full browser view doesn't necessarily mean the HTML code is nicely structured!) In particular, we can see that there is a repeating element at the start of each calendar entry, the <li> element, and that each listing item follows a similar pattern, or structure. So for example, if we look at:
<li>12:45 - <a href="http://talks.cam.ac.uk/talk/index/20701" class="click link">Dependent types and program equivalence</a></li>
This is how it appeared on the rendered web page in the browser:
The <li> defines a list element, which is rendered in the browser as a list item starting with a bullet; the <a> tag defines a link, with the defined URI and link text.
Now suppose that we would like an RSS feed where each item in the feed corresponded to a separate event. How might we go about that?
The first step is to bring the HTML page into the pipes environment using the Fetch Page block from the Pipes Sources menu. The URI http://talks.cam.ac.uk/dates appears to pull up the events for the current date (though a more exact URI pattern of the form http://talks.cam.ac.uk/dates/YEAR/MONTH/DAY such as http://talks.cam.ac.uk/dates/2009/11/3 also pulls up the page for a particular date) so that's the one we'll use:
You'll see the pipe has brought the web page into the pipe context. As well as previewing the rendered web page, we can inspect the HTML by clicking on the source in the Previewer:
You'll notice that the Fetch Page block allows us to declare where we want to Cut content from, and to, and also how we want to split it. That is, we can specify some HTML code in the page that tells the pipe where the 'useful' part of the page starts (that is, the part we want to "scrape" the content from), where the useful part of the page (as far as we're concerned) ends, and what piece of repeating HTML we want the pipe to use to separate out the different items contained in the page.
To start with that 'delimiter', we recall that each item in the event list starts with <li>, so we shall use that as our delimiter.
But how do we know where the useful HTML starts, and where it ends? We have to find that by trial and error through inspecting the HTML. We need something that appears for the first time in the page close to the start of the useful content, and something that appears for the first time after that, just after the end of the useful content. (If necessary, we might have to grab an excessively large piece of HTML from the page.)
Note that in the current case, whilst it might look like <h3>Monday 02 November 2009</h3> provides us with a unique place from which to start cutting content, if we look at the listing for a different date, there will be a different set of characters there...!
Using the </h3> tag as the start of the useful content, and </ul> as the end, we can tell the pipe to cut out the useful listings:
Adding the <li> delimiter gives us a crude set of items, one for each event:
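For anyone who finds it easier to read code than pipe screenshots, here's a rough JavaScript sketch of the cut-and-split idea. It is not how the Fetch Page block is actually implemented; the cutAndSplit function, the cut-down page and the second talk are all made up for illustration:

```javascript
// Cut the HTML between the first `from` marker and the next `to` marker,
// then split what is left on `delimiter`, dropping empty fragments.
function cutAndSplit(html, from, to, delimiter) {
  const start = html.indexOf(from);
  if (start === -1) return [];
  const contentStart = start + from.length;
  const end = html.indexOf(to, contentStart);
  const useful = end === -1 ? html.slice(contentStart) : html.slice(contentStart, end);
  return useful
    .split(delimiter)
    .map((fragment) => fragment.trim())
    .filter((fragment) => fragment.length > 0);
}

// Hypothetical, cut-down page in the style of the talks@cam listing.
const page = [
  "<html><body><h3>Monday 02 November 2009</h3>",
  "<ul>",
  '<li>12:45 - <a href="http://talks.cam.ac.uk/talk/index/20701" class="click link">Dependent types and program equivalence</a></li>',
  '<li>14:00 - <a href="http://talks.cam.ac.uk/talk/index/00000" class="click link">A made-up second talk</a></li>',
  "</ul></body></html>",
].join("\n");

const items = cutAndSplit(page, "</h3>", "</ul>", "<li>");
console.log(items);
```

In this sketch the first fragment is just the leftover <ul> tag, with no link in it, which is one reason the items are described as 'crude' at this stage.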
In order to turn these items into 'proper' RSS feed items, we need to define a title, and ideally also a link for the event. We can create, and prepopulate, these elements using the Rename block from the Operators menu:
We now need to tidy up those elements with a Regex (Regular Expression) block. Regular Expressions are like voodoo magic - they let you manipulate a string of characters in order to transform that string in all sorts of ways. Written correctly, they can be very powerful and look very elegant. I tend to use them in a pidgin way, fumbling my way to a solution using simple rules of thumb and tricks I've used before!
So for example, I know that the pattern .*href="([^"]*)".* will strip out the URI from a single line of text, that contains a link, and place it in the variable $1.
(The . stands for 'any character'; the * says 'zero or more of the preceding character' (or of any character, for a preceding .); the expression [^"]* says 'a run of characters * that isn't ^ a "'; the () marks out the set of contiguous characters that will be passed to the variable $1; href=" and the final " are literally matched characters. The whole string of matched characters is then replaced by the contents of the $1 variable. The s option is ticked so that . also matches newline characters, which lets the pipe cope with line breaks and other excess whitespace in the item.)
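The same pattern can be tried out in JavaScript. This is only a sketch (the extractLink name is made up, and JavaScript's regex engine is assumed to behave like the one in the pipe for this pattern); the s flag is used so that . also matches newlines:

```javascript
// The pattern from the text, with the s flag so that . also matches newlines.
const LINK_PATTERN = /.*href="([^"]*)".*/s;

// Replace the whole matched string with the captured URI ($1).
function extractLink(fragment) {
  return fragment.replace(LINK_PATTERN, "$1");
}

const fragment =
  '12:45 - <a href="http://talks.cam.ac.uk/talk/index/20701" class="click link">Dependent types and program equivalence</a></li>';

console.log(extractLink(fragment));
```

If there is no href in the text, nothing matches and the text is left unchanged. Because the leading .* is greedy, a fragment containing several links would yield the last one.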
A second regular expression - ([^\s]*)[^>]*>([^<]*).* replaced by $2 ($1) - this time applied to the title element, extracts the name of the talk and the time.
Finally, we tidy up the output of the feed to remove any items that don't also link to an event record on talks@cam (that is, we remove items where an event link was not extracted by the regular expression).
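Putting the title expression and the final filter together, a JavaScript sketch might look like this. The toFeedItem and hasEventLink names are made up, and the filter rule used here (the link must start with http://talks.cam.ac.uk/talk/) is my assumption about what counts as an event link, not necessarily the rule used in the pipe:

```javascript
// Title pattern from the text: $1 is the time, $2 is the link text.
const TITLE_PATTERN = /([^\s]*)[^>]*>([^<]*).*/s;
const HREF_PATTERN = /.*href="([^"]*)".*/s;

// Build a title and link from one crude fragment of the page.
function toFeedItem(fragment) {
  return {
    title: fragment.replace(TITLE_PATTERN, "$2 ($1)"),
    link: fragment.replace(HREF_PATTERN, "$1"),
  };
}

// Assumed filter rule: keep only items whose link points at a talk record.
function hasEventLink(item) {
  return item.link.startsWith("http://talks.cam.ac.uk/talk/");
}

const fragments = [
  "<ul>",
  '12:45 - <a href="http://talks.cam.ac.uk/talk/index/20701" class="click link">Dependent types and program equivalence</a></li>',
];

const feedItems = fragments.map(toFeedItem).filter(hasEventLink);
console.log(feedItems);
```

The stray <ul> fragment has no href, so its 'link' is never rewritten into a URI and the filter drops it.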
And to tidy up the presentation a little more, we click in the title tab to give the pipe an appropriate name:
If we now Run the pipe, we can grab a link to an RSS feed of today's talks@cam:
In a follow on to this post, I'll show how to bring time into the equation and add a timestamp for each event to each item in the feed.
One of the very many clever things that folk have worked out they can do with web pages, web browsers and such like is a way of supporting the autodiscovery of RSS feeds associated with a web page by declaring the location of the web feed within a <link> tag in the <head> of a web page (RSS autodiscovery: howto).
What this means is that if you have a list of links to library web pages, you can potentially automatically discover any RSS feeds associated with that library (if they have published autodiscoverable feed links, that is).
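To make the idea concrete, here's a crude JavaScript sketch of feed autodiscovery. It uses regular expressions rather than a proper HTML parser, so it is easily fooled by unusual markup; the function names, the sample page and the example.org URIs are all made up for illustration:

```javascript
// Pull a quoted attribute value out of a single tag, or null if it is absent.
function attr(tag, name) {
  const m = tag.match(new RegExp("\\b" + name + "\\s*=\\s*[\"']([^\"']*)[\"']", "i"));
  return m ? m[1] : null;
}

// Find <link> tags that advertise an RSS or Atom feed and return their hrefs.
function discoverFeeds(html) {
  const feeds = [];
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    const rel = (attr(tag, "rel") || "").toLowerCase().split(/\s+/);
    const type = attr(tag, "type") || "";
    const href = attr(tag, "href");
    if (rel.includes("alternate") && /^application\/(rss|atom)\+xml$/i.test(type) && href) {
      feeds.push(href);
    }
  }
  return feeds;
}

// Hypothetical library homepage with one autodiscoverable feed.
const homepage = [
  "<html><head>",
  "<title>Example Library</title>",
  '<link rel="stylesheet" type="text/css" href="/style.css">',
  '<link rel="alternate" type="application/rss+xml" title="Library news" href="http://library.example.org/news.rss">',
  "</head><body></body></html>",
].join("\n");

console.log(discoverFeeds(homepage));
```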
Some time ago, I put together a quick app that took a screenscraped list of UK HEI Library homepage URIs from somewhere (you wouldn't believe how hard it is to try to find a list of UK HEI library homepages;-) and tried to autodiscover any RSS feeds associated with them - Autodiscoverable RSS Feeds From HEI Library Websites. When I ran the detector just now, I got about a 36% success rate, which is far better than this time last year...
So anyway, I was wondering: how do the Cambridge University Libraries fare?
Looking through my list of handy cam.ac.uk links, here's one for an XML feed of the associated libraries, with links to their homepage: http://www.lib.cam.ac.uk/api/local/libraries_data.cgi
Notice that the URI for the web page of each library can be found down the XML path: libraries.library.web_address
So let's bring this in to a Yahoo Pipes environment, and try to autodetect any RSS feeds linked to from those pages. As well as importing RSS, Yahoo Pipes can also import JSON and XML feeds using the Fetch Data import block. However, I've noticed that the Fetch Data block sometimes chokes (I'm not quite sure why) so instead I use another Yahoo service - YQL - to act as an intermediary that will fetch the XML, maybe process it a little for me, and then pull the result into the pipe:
What the query statement does: select library.web_address from xml where url='http://www.lib.cam.ac.uk/api/local/libraries_data.cgi' is grab all the library.web_address elements (that point to the homepage for each library) from the XML page at http://www.lib.cam.ac.uk/api/local/libraries_data.cgi and pass them in to the pipe as XML.
NB it's trivial to create a simple 'helper' pipe block that acts like a minimal Fetch Data block but actually pulls in the XML file via YQL:
This block could then be included in a pipe in the same way that a Fetch Data block can be...
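At heart, a helper like that just builds a YQL statement from a URL and a path. As a sketch in JavaScript (the buildYqlQuery name is made up, and no attempt is made here to escape quote characters in the URL):

```javascript
// Build a YQL statement that selects the elements at `path` from the XML at `xmlUrl`.
function buildYqlQuery(xmlUrl, path) {
  return "select " + path + " from xml where url='" + xmlUrl + "'";
}

const query = buildYqlQuery(
  "http://www.lib.cam.ac.uk/api/local/libraries_data.cgi",
  "library.web_address"
);

console.log(query);
// If the statement is passed along inside a URI, it needs encoding first.
console.log(encodeURIComponent(query));
```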
So what next? Well, now we can use the Feed Autodiscovery block to see if there are any autodiscoverable RSS feeds listed on those web pages.
In order to do this, we need to pop the Feed Autodiscovery inside a Loop block - this allows the pipe to grab any autodiscovered feed URIs and produce a new feed of feed URIs by replacing the original elements that point to the Library homepages. The Emit all results instruction enforces this replacement policy.
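If you think in code rather than in pipes, the Loop block with Emit all results behaves rather like a flatMap: each input item is replaced by zero or more output items. A minimal JavaScript sketch, in which the lookup table stands in for the Feed Autodiscovery block and all the example.org URIs are made up:

```javascript
// Hypothetical stand-in for the Feed Autodiscovery block: maps a homepage URI
// to the feed URIs it advertises (a real version would fetch and parse the page).
const advertisedFeeds = {
  "http://library-a.example.org/": ["http://library-a.example.org/news.rss"],
  "http://library-b.example.org/": [],
  "http://library-c.example.org/": [
    "http://library-c.example.org/blog.atom",
    "http://library-c.example.org/events.rss",
  ],
};

function autodiscover(homepage) {
  return advertisedFeeds[homepage] || [];
}

// Loop + "emit all results": each item is replaced by everything the inner
// block returns for it, so pages with no feeds simply drop out of the output.
function loopEmitAll(items, block) {
  return items.flatMap((item) => block(item));
}

const feedUris = loopEmitAll(Object.keys(advertisedFeeds), autodiscover);
console.log(feedUris);
```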
So to recap - we grab a list of webpage URIs from the Cambridge Libraries XML feed:
Then we replace those feed items by any and all autodiscovered feed URIs:
(Who'd have thunk it - Penguin of the Day;-).
Note that the pipe also reports on any broken links it finds in the original homepage list:
Again, the pipe reports on any feed URLs that appear to be broken:
So there we have it - a pipe that contains the aggregated feed items from the autodiscoverable RSS feeds listed on Cambridge University Library homepages, all powered by a single XML file containing links to the Library homepages.
If you don't already know what RSS is, you may have noticed the following logo appear on different websites, and even within your browser, and never really been sure what it's actually for...
What it's for is wiring (or plumbing). What it's for is passing content from one web page or application to another. What it's for is never having to visit that web page again to keep up to date with new content that might appear on that web page or website. What it's for is letting you see content from that page or site in another application, such as a feed reader like Google Reader, or a 'web desktop/dashboard' like Netvibes, or Google personal pages. What it's for is turning websites into 'not email', that you can subscribe to from a single application and then view updates from in a single location.
It's also for much more than that, but that's what we'll start with...
But that's not what this post is about... What this post is about is how you can use an online application called Yahoo Pipes to do all sorts of plumbing with RSS feeds.
To get started, you'll need a Yahoo account, then you can create your first pipe...
It's like Lego, but with bits of web content pulled into the pipe from one or more RSS feeds, with the content packaged up into bundles where each bundle contains:
- a title;
- some content (like the body of a blog post or news story), referred to as the description;
- a link (which is often to the original web page that contains the description).
We can pull one or more of these feeds into the pipes environment by creating a new Yahoo pipe and then using the Fetch Feed block from the Sources area of the left hand side bar:
(Highlighting a block by clicking on it lets you preview the output of that block.)
We can combine the output of several feeds simply by adding the URL of each required feed to the Fetch Feed block:
The order of items in the combined feed will be all the items from the last feed in the Fetch Feed block, followed by the items in the feed before it, and so on.
To order the items in the combination feed by date order, use the 'Sort' block:
The field you need to sort on is chosen from the drop down menu - PubDate is the element we want to sort on:
(Wire the blocks together by clicking on the 'output circle' at the bottom of a block and dragging the 'wire' that is produced onto the 'input circle' at the top of the next block.)
Finally, we need to connect the output block to the pipe to complete it.
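In code terms, the pipe built so far is an 'aggregate and sort' over lists of items. A rough JavaScript sketch, with made-up items and function name, and assuming we want the newest items first (the Sort block lets you choose the direction):

```javascript
// Two made-up feeds, each a list of items with an RSS-style pubDate.
const blogFeed = [
  { title: "Blog post A", pubDate: "Mon, 02 Nov 2009 09:00:00 GMT" },
  { title: "Blog post B", pubDate: "Sun, 01 Nov 2009 18:30:00 GMT" },
];
const newsFeed = [
  { title: "News story C", pubDate: "Tue, 03 Nov 2009 08:15:00 GMT" },
];

// Merge every feed into one list, then order the items by date, newest first.
function aggregateAndSort(feeds) {
  return feeds
    .flat()
    .sort((a, b) => Date.parse(b.pubDate) - Date.parse(a.pubDate));
}

const combined = aggregateAndSort([blogFeed, newsFeed]);
console.log(combined.map((item) => item.title));
```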
If you run the Pipe, you will see its 'front page':
You can now subscribe to the output feed from this pipe (or use it in another pipe...), add it to your Yahoo or Google homepage, and so on.
There's a lot more you can do with Yahoo Pipes, but this is a good start: being able to aggregate (that is merge, or combine) content from several different sources into a single feed, and then order them according to time.
So how else might we use this simple 'aggregate and order' pattern?
How about combining table of contents feeds from different journals (you can find their URIs from TicTocs)?
In this way, you can create a single RSS feed that keeps you up to date with the contents of several different journals you are interested in, and maybe also pulls in content from a recent/new books feed from your Library?
In this post, I'll provide an example of a bookmarklet pattern that takes some highlighted (that is, selected) text within the current page and passes it to another web page.
In at least some versions of IE, we need to use the construction document.selection.createRange().text.
To see how this works, highlight some text on this page and then click here.
Here is an example that achieves that effect: Highlight some text and click here.
So how might we use this in practice? How about DOI resolution? (If you don't know about DOIs, they're Digital Object Identifiers - so go Google.. ;-)
DOIs typically look something like this: doi:10.1016/S0040-1625(03)00072-6. A long string of characters (in various formats depending on publisher), often prefixed by doi:
A DOI can point to one or more instances of a document. A DOI resolver will take a DOI and point you to an instance of it depending on various criteria. (In a library setting, this might depend on what online resources your library subscribes to.)
So for example, let's see what the DOI resolver at http://dx.doi.org/ can do with the DOI 10.1016/S0040-1625(03)00072-6...
You can call the resolver with the DOI in the following way: http://dx.doi.org/doi:THE-DOI_YOU/WANT:RESOLVING
Hopefully, you might now see an opportunity here for a bookmarklet that uses the 'getSelection' pattern? In particular, a bookmarklet that lets a user highlight a DOI and then click on the bookmarklet to resolve that DOI.
Grab the selected text (hopefully corresponding to a valid DOI!;-): var t=window.getSelection?window.getSelection().toString():document.selection.createRange().text;
Construct a URI that will pass this DOI to the DOI resolver: var uri="http://dx.doi.org/doi:"+t;
Go to this URI, and as a result get redirected to an instance of the actual resource: window.location=uri;
We can then simplify this as follows: var t=window.getSelection?window.getSelection().toString():document.selection.createRange().text; window.location="http://dx.doi.org/doi:"+t;
Try it - select this DOI (just the numbers... no leading doi:): 10.1016/S0040-1625(03)00072-6 and then click on this DOI Resolver bookmarklet.
Here's a tool for helping generate your own bookmarklets using the 'get selection' pattern:
PS If you leave the space for the URI blank, you can generate a bookmarklet that will let you highlight an unlinked URI in a webpage, like the following one: http://arcadiaproject.blogspot.com and 'click through' it (via the bookmarklet) to the corresponding webpage...
PPS In some situations, it might be sensible to 'go defensive' and encode the selected text so that it works nicely in a URI. Do this by adding the step: t=encodeURIComponent(t); before the window.location step.
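To see what the encoding step does to a DOI, the URI-building part of the bookmarklet can be pulled out into a plain function and tried outside the browser. This is only a sketch: the doiResolverUrl name and the encode switch are made up, and trimming stray whitespace from the selection is an extra defensive step of my own:

```javascript
// Build the resolver URI for some selected text, optionally encoding it first.
function doiResolverUrl(selectedText, encode) {
  const doi = selectedText.trim(); // extra step: drop whitespace around the selection
  return "http://dx.doi.org/doi:" + (encode ? encodeURIComponent(doi) : doi);
}

const selected = " 10.1016/S0040-1625(03)00072-6 ";

console.log(doiResolverUrl(selected, false));
console.log(doiResolverUrl(selected, true));
```

Note that encodeURIComponent turns the / in the DOI into %2F, so it's worth checking that the resolver you are calling copes with the encoded form.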
The Arcadia Project is a three-year project funded by a generous grant from the Arcadia Fund to explore the role of academic libraries in a digital age. A major part of the project is a Fellowship Program which will bring people to Cambridge to work on aspects of this very broad subject. We have a project website which serves as the hub for our more formal activities. This blog has been set up to complement the main Arcadia Project Blog and act as a home for technical posts that might put off a more general audience...