Monday, January 04, 2010

Scholarly Googles, foibles and FAILs

Martin Fenner's recent blog post about ORCID, a way of uniquely identifying research scientists (or, I suppose, in principle, just about anybody) in databases, got me thinking a little about how this might solve some of my own problems. Briefly, as I understand it, ORCID will allow easier identification of published papers in the scientific literature and attribution (correctly, one hopes) to individual authors. It promises to solve a whole host of problems, including differentiating between researchers with identical names (just try looking up papers by "A. Wong", or "J. Smith" - go on, I dare you), or the same person publishing under different names (like a married name, for example).

One example where this might be useful is when granting agencies want to measure the "impact" of the funding dollars they've put into a project. And, in science, the most frequent measure of impact is the publication. Scientist "X" has a grant from the Big Granting Agency, so let's find out how many papers Scientist X has published, count them up, and report that number. Easy, right?

Not necessarily.

Conventional search engines such as PubMed (for the biomedical sciences, which is the area I inhabit) are easy enough to search, but are keyed to a limited number of descriptive terms (keywords, author names, and the like). And PubMed doesn't handle the problems identified above (one name, many people; or one person, multiple names) at all, as far as I can tell. Searching PubMed using "R. Wintle" finds a bunch of publications that I didn't write; by contrast, using "R.F. Wintle" misses one that I did. For people with more common names and/or a lot more publications, sifting through the results for relevant ones becomes a real chore. PubMed, too, only deals with biomedical papers - so if I'd happened to publish some interesting algorithm in a math journal (oh, go on - it could happen), that would also be missed by both search strategies.

But it gets worse. The issues with PubMed (which is, after all, a curated set of publication data - in other words, it contains only "potentially relevant" information) absolutely pale in comparison with the monster that is Google Scholar. Scholar has a major advantage over PubMed, as it indexes each article's full text, just like Google does with web pages. So, looking for acknowledgments in the text ("thanks to Scientist X for helpful advice", or "experiments were performed in the facilities at Big Shiny Lab") becomes trivial. PubMed can't do this. Not at all.

But - and this is where it becomes tricky - Scholar is not smart enough to do date ranges smaller than a year. So if, for example, one wanted to find all publications acknowledging experiments performed at Big Shiny Lab in the first quarter of 2010 - well, you're out of luck. Or should I say, I'm out of luck. And this, unfortunately, is precisely the kind of data I need to gather. Four times a year, as it turns out, for one funding agency. For others, I'm occasionally obliged to do it based on the fiscal year (April through March), or various types of "government" years (July to June, October to September), which Scholar also can't do. PubMed can deal with monthly date ranges no problem, but not with full-text searches.
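To make those reporting periods concrete, here is a minimal Python sketch that turns "a period starting in month M and lasting N months" into exact first and last days. The function name `period` and the variable names are my own inventions for illustration. Nothing here talks to Scholar or PubMed; it only works out the date boundaries that a search would need to respect.

```python
from datetime import date, timedelta

def period(start_year, start_month, months):
    """Return (first_day, last_day) of a period lasting `months` months."""
    start = date(start_year, start_month, 1)
    # Counting months from year 0 makes the roll-over into the next year simple.
    end_index = start_year * 12 + (start_month - 1) + months
    next_start = date(end_index // 12, end_index % 12 + 1, 1)
    return start, next_start - timedelta(days=1)

# The reporting periods mentioned above (names are hypothetical).
first_quarter_2010 = period(2010, 1, 3)    # January through March
fiscal_year_2010 = period(2010, 4, 12)     # April through March
july_to_june = period(2009, 7, 12)         # one "government" year
october_to_september = period(2009, 10, 12)  # another "government" year
```

With boundaries like these in hand, any publication record that carries a full date (not just a year) can be tested against them with a simple comparison.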

See the problem? Presented with the question above, I can search Google Scholar for all of 2010, and then manually go through the resultant mess of hits to (a) find those in the first quarter, rather than the other nine months of the year, (b) eliminate the inevitable duplicates, and (c) trim out the remaining chaff caused by spurious keyword hits. This, as you might imagine, is both time-consuming and irritating. ORCID, truth be told, won't solve this particular problem. Nothing will, unless Google smartens up and puts "proper" date tags on its indexed publications and implements a more sophisticated date-limit on searches (which, by the way, I've asked them to do - go on, you can ask too!).
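For what it's worth, the three manual steps are easy enough to sketch in Python, if (and it's a big if) each hit already came with a proper publication date. Everything below is an assumption for illustration: the record layout (`title`, `published`, `text`), the function names, and the sample hits are all made up, and this is not how Scholar or PubMed actually deliver results.

```python
from datetime import date
import re

def normalise(title):
    """Lower-case a title and collapse punctuation, so near-identical copies match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def tidy_hits(hits, start, end, phrase):
    """Keep hits dated within [start, end] that contain the exact phrase, once each."""
    seen = set()
    kept = []
    for hit in sorted(hits, key=lambda h: h["published"]):
        if not (start <= hit["published"] <= end):
            continue  # (a) outside the reporting period
        if phrase.lower() not in hit["text"].lower():
            continue  # (c) spurious hit: the words appear, but not the phrase
        key = normalise(hit["title"])
        if key in seen:
            continue  # (b) duplicate of a hit already kept
        seen.add(key)
        kept.append(hit)
    return kept

# Made-up sample hits, purely for illustration.
hits = [
    {"title": "Gene X in mice", "published": date(2010, 2, 10),
     "text": "Experiments were performed in the facilities at Big Shiny Lab."},
    {"title": "Gene X in Mice.", "published": date(2010, 2, 10),
     "text": "Experiments were performed in the facilities at Big Shiny Lab."},
    {"title": "Floor polish compared", "published": date(2010, 3, 1),
     "text": "We tested polish in a big, shiny lab."},
    {"title": "Gene Y in fish", "published": date(2010, 8, 5),
     "text": "Thanks to Big Shiny Lab for helpful advice."},
]

first_quarter = tidy_hits(hits, date(2010, 1, 1), date(2010, 3, 31), "Big Shiny Lab")
```

The duplicate check here is deliberately crude (it matches on a cleaned-up title only), and the hard part remains getting trustworthy dates onto the records in the first place.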

All this leads me to the inevitable conclusion that there must be a better way. Data-mining from indexed publication records is not an easy task, and people much cleverer than I have spent a lot of time (and money) on it. What I'm looking for, of course, is a push-button solution: show me all the publications, in a certain date range, containing relevant references to the Big Shiny Lab, sorted nicely and with all the redundant hits eliminated. If we (and by "we", I mean "somebody") can search the whole web, harmonize identifiers using something like ORCID, index thousands of scientific journals, and dig through it all with sophisticated keyword strategies, surely a little request like that isn't too much to ask?

Friday, January 01, 2010


Glenora, Ontario - Winter 2009
A winter scene, on the way home.

Happy New Year, to anyone who might still be reading.

The annual trip down to the other end of Lake Ontario is over, the slew of Christmas presents packed into two vehicles and transported back home, and calm has descended on Chateau Ricardipus for the time being. Back to the Land of Wireless Internet™ again, which makes just about everything easier. Blog posting, obviously - but also looking up trivia ("where on Earth have I seen that movie actor before?" being a favourite question around here), uploading photos to Flickr of course, and even (*shudder*) doing work. From home. On a Statutory Holiday.

Our next, biggest challenge will be re-synchronizing ourselves with the school/work day schedule by Monday morning. No more sleeping late, lounging about in pajamas until mid-day, and snacking throughout. No more staying up to watch movies. Back to the daily routine, which, to be honest, is looking like a nice option after two solid weeks off.

And for the new year? Well, given recent events, I'm hoping for less travel, at least to the US, since border crossing has once again escalated into a complicated gauntlet of security checks, restrictions on what you can and can't be reading/working on/playing with on the plane, and other general Scroogeiness. And just when I'd perfected the "shoes and belt and pockets" security-check dance. No, in the near future I'm hoping for nothing more exotic than a trip to Montréal, for which I will once again take the train, an altogether much more civilized experience than negotiating airport security these days.

Or perhaps somewhere I can drive, since I rather enjoy transporting myself. Armed, as always, with a camera, and the potential for some diversions along the way. Like that last trip home, where I detoured along the Loyalist Parkway for the first time in many, many years, revisiting the Glenora Ferry, which I remember as being a highlight of family trips when I was young. I'm pleased to report that some things, in keeping with the Christmas theme of tradition, never change.

M.V. Glenora
M.V. Glenora, coming to take me home.

All the best for 2010, everyone.