Monday, January 04, 2010

Scholarly Googles, foibles and FAILs

Martin Fenner's recent blog post about ORCID, a way of uniquely identifying research scientists (or, I suppose, in principle, just about anybody) in databases, got me thinking a little about how this might solve some of my own problems. Briefly, as I understand it, ORCID will allow easier identification of published papers in the scientific literature and attribution (correctly, one hopes) to individual authors. It promises to solve a whole host of problems, including differentiating between researchers with identical names (just try looking up papers by "A. Wong", or "J. Smith" - go on, I dare you), or the same person publishing under different names (like a married name, for example).


One example where this might be useful is when granting agencies want to measure the "impact" of the funding dollars they've put into a project. And, in science, the most frequent measure of impact is the publication. Scientist "X" has a grant from the Big Granting Agency, so let's find out how many papers Scientist X has published, count them up, and report that number. Easy, right?


Not necessarily.


Conventional search engines such as PubMed (for the biomedical sciences, which is the area I inhabit) are easy enough to search, but are keyed to a limited number of descriptive terms (keywords, author names, and the like). And PubMed doesn't handle the problems identified above (one name, many people, or one person, multiple names) at all, as far as I can tell. Searching PubMed using "R. Wintle" finds a bunch of publications that I didn't write; by contrast, using "R.F. Wintle" misses one that I did. For people with more common names and/or a lot more publications, sifting through the results for relevant ones becomes a real chore. PubMed, too, only deals with biomedical papers - so if I'd happened to publish some interesting algorithm in a Math journal (oh, go on - it could happen), that would also be missed by both search strategies.


But it gets worse. The issues with PubMed (which is, after all, a curated set of publication data - in other words, it contains only "potentially relevant" information) absolutely pale in comparison with the monster that is Google Scholar. Scholar has a major advantage over PubMed, as it indexes each article's full text, just like Google does with web pages. So, looking for acknowledgments in the text ("thanks to Scientist X for helpful advice", or "experiments were performed in the facilities at Big Shiny Lab") becomes trivial. PubMed can't do this. Not at all.


But - and this is where it becomes tricky - Scholar is not smart enough to do date ranges smaller than a year. So if, for example, one wanted to find all publications acknowledging experiments performed at Big Shiny Lab in the first quarter of 2010 - well, you're out of luck. Or should I say, I'm out of luck. And this, unfortunately, is precisely the kind of data I need to gather. Four times a year, as it turns out, for one funding agency. For others, I'm occasionally obliged to do it based on the fiscal year (April through March), or various types of "government" years (July to June, October to September), which Scholar also can't do. PubMed can deal with monthly date ranges no problem, but not with full-text searches.
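
For what it's worth, working out the reporting periods themselves is the easy bit. Here is a minimal Python sketch that turns a starting month and a length in months into first and last calendar days, which covers quarters, fiscal years and the various "government" years alike. The function name `reporting_period` and its parameters are my own invention for illustration, and nothing here talks to Scholar or PubMed.

```python
from datetime import date, timedelta


def reporting_period(start_year, start_month, months):
    """Return (first_day, last_day) for a period lasting `months` months.

    Hypothetical helper: start_month is 1-12, months is a positive integer.
    """
    first_day = date(start_year, start_month, 1)

    # Zero-based index of the month immediately after the period ends.
    end_index = (start_month - 1) + months
    end_year = start_year + end_index // 12
    end_month = end_index % 12 + 1

    # The last day of the period is the day before that month begins.
    last_day = date(end_year, end_month, 1) - timedelta(days=1)
    return first_day, last_day


# First quarter of 2010, and a fiscal year running April through March.
first_quarter = reporting_period(2010, 1, 3)
fiscal_year = reporting_period(2009, 4, 12)
```

The same call handles the July-to-June and October-to-September years by changing only the starting month. The hard part, as described above, is that Scholar gives you nothing finer than a year to compare these dates against.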


See the problem? Presented with the question above, I can search Google Scholar for all of 2010, and then manually go through the resultant mess of hits to (a) find those in the first quarter, rather than the other nine months of the year, (b) eliminate the inevitable duplicates, and (c) trim out the remaining chaff caused by spurious keyword hits. This, as you might imagine, is both time-consuming and irritating. ORCID, truth be told, won't solve this particular problem. Nothing will, unless Google smartens up and puts "proper" date tags on its indexed publications and implements a more sophisticated date-limit on searches (which, by the way, I've asked them to do - go on, you can ask too!).
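
If the hits did come with proper dates, steps (a) to (c) would be a few lines of code. The Python sketch below assumes a lot: each hit is a plain dictionary with `title`, `published` (a full date) and `text` keys, a record format I have made up, and the full publication date is exactly the thing Scholar doesn't supply. The function names are hypothetical too, and matching duplicates on a tidied-up title is a crude stand-in for real duplicate detection.

```python
import re
from datetime import date


def normalize_title(title):
    """Lower-case a title and collapse punctuation so near-identical copies match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def filter_hits(hits, start, end, phrase):
    """Keep hits dated within [start, end] that contain `phrase`, minus duplicates."""
    seen_titles = set()
    kept = []
    for hit in hits:
        # (a) keep only hits published inside the reporting period
        if not (start <= hit["published"] <= end):
            continue
        # (c) drop chaff: the exact phrase must appear in the text
        if phrase.lower() not in hit["text"].lower():
            continue
        # (b) skip anything whose normalized title has already been seen
        key = normalize_title(hit["title"])
        if key in seen_titles:
            continue
        seen_titles.add(key)
        kept.append(hit)
    # Sort the survivors by publication date.
    return sorted(kept, key=lambda hit: hit["published"])
```

Calling `filter_hits(hits, date(2010, 1, 1), date(2010, 3, 31), "Big Shiny Lab")` would give the first-quarter list. An exact-phrase test will still miss acknowledgments worded differently, so a human would need to check the results either way.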


All this leads me to the inevitable conclusion that there must be a better way. Data-mining from indexed publication records is not an easy task, and people much cleverer than I have spent a lot of time (and money) on it. What I'm looking for, of course, is a push-button solution: show me all the publications, in a certain date range, containing relevant references to the Big Shiny Lab, sorted nicely and with all the redundant hits eliminated. If we (and by "we", I mean "somebody") can search the whole web, harmonize identifiers using something like ORCID, index thousands of scientific journals, and dig through it all with sophisticated keyword strategies, surely a little request like that isn't too much to ask?

14 comments:

Martin Fenner said...

Richard, Scopus can already do most of the things you mention. Your Scopus Author ID is 6603384333, and Scopus knows that you have 66 coauthors and have been cited 629 times.

Ricardipus said...

Thanks, Martin. But can it tell me how many publications, by anybody at all, acknowledged my assistance, anywhere in the text, between April and June of 2009?

(the answer to this question is probably zero, but you get the idea)

WrathofDawn said...

A post! A post!

Ooo! Ooo! Ooo!

*goes to read it*

vw - quetau - cake from Québec

WrathofDawn said...

Dude. That is a pain re the quarterly accouting.

You can't be the only scientist who has this problem. I would suspect anyone seeking research grants has it. Would it be reasonable to hire someone like a postgrad student or some such to do this research for you? Someone with sufficient intelligence and knowledge of the field to do the job reliably, who still can't charge much and needs the cash?

Perhaps one of them would become sufficiently motivated to solve the problem...

WrathofDawn said...

"accouting?"

ACCOUTING?

*sigh*

Ricardipus said...

Yes, the quarterly accoutrements are annoying.

I could get someone else to shovel through all the raw hits from Google Scholar, instead of doing it myself, but I have doubts that it would be done right. Call me a control freak.

Hiring someone to develop software text-mining tools to filter the raw Scholar hits would, I think, be an expensive exercise in frustration. There is someone at Queen's University who's developed something along these lines, but I've not met her. Yet.

WrathofDawn said...

Ah. There is that. And I wouldn't want to stake my professional reputation (if I had one) on someone's potentially incorrect research.

Wasn't suggesting you pay someone to develop the requisite software (I'm daft but not totally doolally) but thought the torture of the manual search might motivate the budding scientist to seek a solution on the grounds of never wanting to have to go through all that palaver ever again and seeing the future all too clearly.

Which just shows my lack of comprehension of the situation. Hope you eventually find the search engine functions you need. It sounds quite tedious as it stands now.

#Debi said...

Silly me. I would have thought that PubMed is the bar at Club Med...

Alethea said...

Dawn - science is rife with tasks we would love to farm out to someone with sufficient intelligence and knowledge of the field to do the job reliably, who still can't charge much and needs the cash. It's also rife with people who meet that description. It's unfortunately short on the cash necessary to bring the two together.

Yes, I'm writing another grant application. Sigh.

WrathofDawn said...

@ Alethea - My sympathies. Also, good luck. Hope you get the grant you're seeking!

Aled Hughes said...

My solution to this problem was the result of my parents being romantic, and naming me after a valley/lake in Wales. With a name like Aled, how many people have ever heard of it outside of the UK?
The INS in the US once asked me if I'd spelled it wrongly, and I get comments such as 'what kind of name is that?'
'Mine', quoth I.

Bob said...

Urgh, looking through PubMed and the like makes my eyes water. But maybe that's because I have no patience with it...

wv: abichan - a Chinese herbal remedy for a sneezing fit

Ricardipus said...

Hey, there's Bob! How are you doing these days?

Bob said...

Hello! I'm alright, very busy, hence the giving up of blogging. Still dropping by every now and then though :) Doing an intercalated BSc this year, which is why PubMed is my mortal enemy...