Wednesday, December 15, 2010

More Cross-Posty Goodness

Here's a little lazy blogging cross-posting: I've just put up a new post at yet another of my haunts, the Life Science Tools of the Trade blog. Actually, "haunt" is a good term, because I took it over from various other bloggers, all of whom have since disappeared. I swear I don't know where the bodies are hidden, really.

Over there, I'm whingeing again [So, going with your strengths then, are you? - Ed.] about the U.S. National Center for Biotechnology Information's venerable PubMed literature search engine, and how maybe, just maybe, Google Scholar is a better bet. Dear old PubMed is a tool that many (most? all?) scientists, at least those of a biomedical bent, are very used to using to find published articles. Unlike Google, it sits on top of a curated, and keyword-indexed, collection of relevant literature. Also unlike Google, it is completely unable to search inside articles using free text queries, and appears to occasionally be somewhat less than adept at finding things.

The post is here: PubMedically failing.

Monday, January 04, 2010

Scholarly Googles, foibles and FAILs

Martin Fenner's recent blog post about ORCID, a way of uniquely identifying research scientists (or, I suppose, in principle, just about anybody) in databases, got me thinking a little about how this might solve some of my own problems. Briefly, as I understand it, ORCID will allow easier identification of published papers in the scientific literature and attribution (correctly, one hopes) to individual authors. It promises to solve a whole host of problems, including differentiating between researchers with identical names (just try looking up papers by "A. Wong", or "J. Smith" - go on, I dare you), and coping with the same person publishing under different names (a married name, for example).


One example where this might be useful is when granting agencies want to measure the "impact" of the funding dollars they've put into a project. And, in science, the most frequent measure of impact is the publication. Scientist "X" has a grant from the Big Granting Agency, so let's find out how many papers Scientist X has published, count them up, and report that number. Easy, right?


Not necessarily.


Conventional search engines such as PubMed (for the biomedical sciences, which is the area I inhabit) are easy enough to search, but are keyed to a limited number of descriptive terms (keywords, author names, and the like). And PubMed doesn't handle the problems identified above (one name, many people, or one person, multiple names) at all, as far as I can tell. Searching PubMed using "R. Wintle" finds a bunch of publications that I didn't write; by contrast, using "R.F. Wintle" misses one that I did. For people with more common names and/or a lot more publications, sifting through the results for relevant ones becomes a real chore. PubMed, too, only deals with biomedical papers - so if I'd happened to publish some interesting algorithm in a Math journal (oh, go on - it could happen), that would also be missed by both search strategies.
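(As an aside, you can watch this ambiguity happen programmatically. Below is a minimal sketch in Python that hits NCBI's public E-utilities "esearch" endpoint and compares hit counts for the two spellings from my example above; the author terms are just my illustration, not any officially sanctioned disambiguation strategy.)

```python
# Minimal sketch: compare PubMed hit counts for two author-name spellings,
# using NCBI's E-utilities "esearch" endpoint. The terms are illustrative only.
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term):
    """Return the number of PubMed records matching a query term."""
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": 0,          # we only want the count, not the record IDs
    })
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        result = json.load(resp)["esearchresult"]
    return int(result["count"])

# "Wintle R" sweeps in papers by other R. Wintles; "Wintle RF" misses some of mine.
for author in ("Wintle R[Author]", "Wintle RF[Author]"):
    print(author, "->", pubmed_count(author), "hits")
```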


But it gets worse. The issues with PubMed (which is, after all, a curated set of publication data - in other words, it contains only "potentially relevant" information) absolutely pale in comparison with the monster that is Google Scholar. Scholar has a major advantage over PubMed, as it indexes each article's full text, just like Google does with web pages. So, looking for acknowledgments in the text ("thanks to Scientist X for helpful advice", or "experiments were performed in the facilities at Big Shiny Lab") becomes trivial. PubMed can't do this. Not at all.


But - and this is where it becomes tricky - Scholar is not smart enough to do date ranges smaller than a year. So if, for example, one wanted to find all publications acknowledging experiments performed at Big Shiny Lab in the first quarter of 2010 - well, you're out of luck. Or should I say, I'm out of luck. And this, unfortunately, is precisely the kind of data I need to gather. Four times a year, as it turns out, for one funding agency. For others, I'm occasionally obliged to do it based on the fiscal year (April through March), or various types of "government" years (July to June, October to September), which Scholar also can't do. PubMed can deal with monthly date ranges no problem, but not with full-text searches.
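(PubMed's date handling really is that granular, for the record. Here is a sketch of the quarterly window I wish I could apply to full text, again via E-utilities; the "Big Shiny Lab" affiliation term is, of course, my made-up stand-in, and PubMed will only ever match it against its indexed fields, never the article body.)

```python
# Sketch: a first-quarter-of-2010 publication-date window in a PubMed query,
# via the same E-utilities esearch endpoint. The affiliation term is the
# post's hypothetical lab name; swap in a real institution to get hits.
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = urllib.parse.urlencode({
    "db": "pubmed",
    "term": '"Big Shiny Lab"[Affiliation]',  # indexed fields only, not full text
    "datetype": "pdat",        # restrict by publication date...
    "mindate": "2010/01/01",   # ...to the first quarter of 2010
    "maxdate": "2010/03/31",
    "retmode": "json",
    "retmax": 20,
})
with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
    result = json.load(resp)["esearchresult"]

print(result["count"], "hits; first PMIDs:", result["idlist"])
```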


See the problem? Presented with the question above, I can search Google Scholar for all of 2010, and then manually go through the resultant mess of hits to (a) find those in the first quarter, rather than the other nine months of the year, (b) eliminate the inevitable duplicates, and (c) trim out the remaining chaff caused by spurious keyword hits. This, as you might imagine, is both time-consuming and irritating. ORCID, truth be told, won't solve this particular problem. Nothing will, unless Google smartens up and puts "proper" date tags on its indexed publications and implements a more sophisticated date limit on searches (which, by the way, I've asked them to do - go on, you can ask too!).
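(For the morbidly curious, the mechanical part of that triage looks roughly like the sketch below, assuming the Scholar hits have been dumped into a CSV of titles and snippets - a hypothetical export on my part, since Scholar offers nothing of the sort. Note that step (a) stays stubbornly manual, because Scholar only exposes the publication year.)

```python
# Sketch of triage steps (b) and (c): dedupe near-identical Scholar hits by
# normalized title and drop obvious chaff. The CSV layout is hypothetical;
# step (a), filtering to the first quarter, can't be automated because
# Scholar only gives you the publication year.
import csv
import re

def norm(title):
    """Crude normalization so trivially different titles collapse together."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

CHAFF = ("conference program", "table of contents", "author index")

seen = set()
kept = []
with open("scholar_hits_2010.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):            # expects 'title' and 'snippet' columns
        key = norm(row["title"])
        if key in seen:                        # (b) duplicate hit
            continue
        if any(c in row["snippet"].lower() for c in CHAFF):
            continue                           # (c) spurious keyword chaff
        seen.add(key)
        kept.append(row)

print(f"{len(kept)} hits left to eyeball for first-quarter dates")
```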


All this leads me to the inevitable conclusion that there must be a better way. Data-mining from indexed publication records is not an easy task, and people much cleverer than I have spent a lot of time (and money) on it. What I'm looking for, of course, is a push-button solution: show me all the publications, in a certain date range, containing relevant references to the Big Shiny Lab, sorted nicely and with all the redundant hits eliminated. If we (and by "we", I mean "somebody") can search the whole web, harmonize identifiers using something like ORCID, index thousands of scientific journals, and dig through it all with sophisticated keyword strategies, surely a little request like that isn't too much to ask?