Justin Kiggins

Product. Data. Science.

in the future, the scientific literature will follow you

16 August 2012

Of the 1+ million new scientific papers published each year, which ones should a scientist read?

It’s obvious that any single researcher can’t read them all. Nor do we want to… the overwhelming majority aren’t relevant to us or our work. We limit what we read and we employ a variety of methods to choose what new research we do read… a strategic foraging task utilizing a hodge-podge set of tools, including journal subscriptions, RSS feeds, and saved Pubmed searches.

But times are a-changin’ for academic publishing and some of the changes are very similar to those in journalism:

The publisher of a major international newspaper once told me that he delivers “the five or six things I absolutely have to know this morning.” But there was always a fundamental problem with that idea, which the Internet has made starkly obvious: There is far more that matters than any one of us can follow. In most cases, the limiting factor in journalism is not what was reported but the attention we can pay to it.

The goal of personalized news, news that is tailored specifically to me, is hot but unrealized. I recently came across an article by Jonathan Stray proposing three principles that ought to govern personalized news: interest, effects, and agency.

You should see a story if:

  1. You specifically go looking for it.
  2. It affects you or any of your communities.
  3. There is something you might be able to do about it.

This got me thinking about what these principles mean for “following the literature”. In particular, how would one develop a strategy for research literature discovery that embodies these principles, one where the literature follows the researcher?


Anyone who wants to know should be able to know. From a product point of view, this translates into good search and subscription features. Search is particularly important because it makes it easy to satisfy your curiosity, closing the gap between wondering and knowing.

This is perhaps the most obvious minimal requirement… to be able to find things that I’m looking for. This is a passive feature of a system for research literature discovery… I take the action. I decide when I want to know about something. I search for “Lastname et al., 2003” after seeing the reference at the bottom of a figure in a presentation. I go hunting for a recent paper that someone mentioned to me at the coffee cart.

There are already some very good tools out there for this. Pubmed & Google Scholar are my go-to sources. So, for scientists at least,* I’m going to consider this a problem basically solved and move on…


I should know about things that will affect me. Local news organizations always did this, by covering what was of interest to their particular geographic community. But each of us is a member of many different communities now, mostly defined by identity or interest and not geography. Each way of seeing communities gives us a different way of understanding who might be affected by something happening in the world.

Did I get scooped? Did someone cite my work? Is there a new paper that changes the way I interpret my not-yet-published results?

These are harder, more time-consuming questions to answer by relying exclusively on a search-based interface. Historically, these questions would have been answered by subscribing to specialized “society journals”… The Journal of Obscure Sub-Subfield. More common today are custom filters and saved searches. For example, here’s Bradley Voytek’s strategy:

Similarly, Drugmonkey polled his readers a while back on how they keep abreast of the literature. The responses (summarized here) include everything from tools like pubcrawler, to blogs and journal RSS feeds, to making graduate students do it.

It’s obvious that everyone is looking out for papers that will affect them… but can we do better? Taken together, these approaches (1) are kludgy, (2) require a lot of time and effort to set up and tailor, and (3) require the researcher to already know what will affect them. To a certain extent, the third isn’t a huge problem… it’s obvious that my collaborators’ and competitors’ work will affect me, and a huge part of our role as scientists is knowing what the state of the field is and how it will affect our work.

But what about the “unknown unknowns”?

What if a system could leverage existing sources about what will affect me to predict which new papers I should know about? What if it could use a researcher’s publications, personal library of papers, and network of friends & colleagues to assess newly published papers for “likelihood of effect”?

A big breakthrough in this direction was launched recently with Google Scholar’s “My Updates” feature (see Jonathan Eisen’s summary here), which analyzes a researcher’s past publications to predict the relevance of new papers to them. One shortcoming of this approach is that its usefulness will be more limited for graduate students (especially those in their early years), who will have fewer publications to their names than a tenured professor.

A similar approach (still in Beta) is Mendeley Suggest, which (apparently) leverages a user’s library, compares it with other users’ libraries, and makes suggestions for relevant papers. Mendeley Suggest certainly brought a few papers to my attention that affect my work, but they were all 5-10 years old. Not exactly a great tool for keeping tabs on new papers. Practically speaking, a better approach might be to take something like ScipleRSS, which ranks the results of a batch of journal RSS feeds according to a set of weighted keywords, and use the Mendeley API to set the keyword weightings based on a user’s actual library (if I ever have time, I’d like to actually do this). Having access to the user’s use-statistics for the papers in their library, as well as social aspects (what papers were recently added by Mendeley contacts? by other users in Mendeley Groups?) could make such a prediction system even more powerful. Regardless of whether this is the best approach, there’s room for improvement here and Mendeley seems like a good place to start.
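To make the keyword-weighting idea above a bit more concrete, here is a minimal sketch in Python. It is not how ScipleRSS or Mendeley actually work; it only assumes that the user’s library and the incoming feed items are already available as plain strings (titles or abstracts), and it skips fetching anything from the Mendeley API or parsing real RSS. The function names (`keyword_weights`, `score`, `rank_entries`) and the tiny stopword list are made up for illustration. Each keyword is weighted by the fraction of library papers that mention it, and each new paper is scored by summing the weights of the keywords it contains.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be longer).
STOPWORDS = {"and", "the", "for", "with", "from"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words longer than two letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def keyword_weights(library_texts):
    """Weight each keyword by the fraction of library papers that contain it."""
    doc_counts = Counter()
    for text in library_texts:
        doc_counts.update(set(tokenize(text)))
    total = len(library_texts)
    return {term: count / total for term, count in doc_counts.items()}


def score(entry_text, weights):
    """Sum the weights of the distinct keywords found in a feed entry."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(entry_text)))


def rank_entries(entries, weights):
    """Return feed entries sorted from most to least relevant."""
    return sorted(entries, key=lambda entry: score(entry, weights), reverse=True)


if __name__ == "__main__":
    # Hypothetical library and feed items, purely for illustration.
    library = [
        "Auditory cortex responses to birdsong in starlings",
        "Neural coding of birdsong syllables",
        "Song learning and auditory memory in songbirds",
    ]
    feed = [
        "Graphene transistors at room temperature",
        "Auditory coding of learned birdsong",
        "Memory consolidation in mice",
    ]
    weights = keyword_weights(library)
    for title in rank_entries(feed, weights):
        print(f"{score(title, weights):.2f}  {title}")
```

Use-statistics and social signals could plausibly be folded in by scaling each library paper’s contribution to the weights (say, by how often it was opened), but that is a design guess rather than something I’ve tried.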

Beyond these emerging projects, it’s hard to imagine exactly where the future lies… if a system had enough data about you, your research, and the broader context you work in (say, your publications, grants, manuscripts, library, conference attendance info, which posters you stood in front of, the funding climate in your field), would it be able to make broader predictions for which papers would “affect” you? Would it be able to know that some paper in a totally different field solved a problem in a way that informed your own work, even though the research programs might not share any keywords?

It’s not clear where the boundary is, but it is clear that this area is ready for some innovation, at least as it relates to improving the efficiency of alerting researchers to relevant new work.


Ultimately, I believe journalism must facilitate change. Otherwise, what’s the point? This translates to the idea of agency, the idea that someone can be empowered by knowing. But not every person can affect every thing, because people differ in position, capability, and authority. So my third principle is this: Anyone who might be able to act on a story should see it. This applies regardless of whether or not that person is directly affected, which makes it the most social and empathetic of these principles.

Science must also facilitate change… it must change the way that we view the world and thus, the way we respond to it. Our primary goal as scientists is to produce knowledge. But as John Archibald Wheeler described it, “We live on an island surrounded by a sea of ignorance. As our island of knowledge grows, so does the shore of our ignorance.” Every “answer” begets multiple new questions. (So really, we’re talking about a hyper-island in some multidimensional ocean.) One of the primary challenges for a scientist is knowing in which direction to extend the island of knowledge.

And yet, all scientists are constrained by resources, which prevents us from extending the entire island at once. Not just by money, but by the technical skills of trainees in the lab, the types of equipment that a lab has access to, and the experiments that are available through collaborations. In deciding the direction that a lab should take (which projects to put into a grant, which ones to pilot, which grad students to accept), a researcher has to take all of these resource constraints into account, while recognizing the unique resources and opportunities available, to determine the key areas where a lab can be productive.

Ultimately, having access to research we are interested in and being alerted to research that affects us both serve this primary goal of producing new knowledge.

But could a research literature discovery system support this endeavor more directly, by ensuring that a scientist knows about research that they might be able to act on?

There is a section in Michael Nielsen’s Reinventing Discovery where he imagines a future scenario in science:

 You’re a theoretical physicist working at the California Institute of Technology (Caltech), in Pasadena. Each morning, you begin your work by sitting down at your computer, which presents to you a list of ten requests for your assistance, a list that’s been distilled especially for you from millions of such requests filed overnight by scientists around the world. Out of all those requests, these are the problems where you are likely to have maximal comparative advantage. Today, one of the requests immediately catches your eye. A materials scientist in Budapest, Hungary, has been working on a project to develop a new type of crystal. During the project an unanticipated difficulty has come up involving a very specialized type of problem: figuring out the behavior of particles as they [diffuse] on a triangular latticework. Unfortunately for the materials scientist, diffusion is a subject they don’t know much about. You, in turn, don’t know much about crystals, but you are an expert on the mathematics of diffusion, and in fact, you’ve previously solved several research problems similar to the problem puzzling the materials scientist. After mulling over the  diffusion problem for a few minutes, you’re sure that the problem will fall easily to mathematical techniques you know well, but which the materials scientist probably doesn’t know at all. […]

Long story short, you collaborate & everyone wins. This vision will require some major changes in the way science is done before it will be fully realized. However, a key principle here is that the theoretical physicist is alerted to a problem not based on her interests or whether her research program is affected by the work of the materials scientist in Budapest, but rather because she has agency in the topic. And we can similarly imagine a framework where a scientist is alerted to new publications not just because they are interested in the topic or because it affects them, but rather because they are uniquely situated to do the next experiment. Maybe they are an evolutionary anthropologist with a dataset, who could be alerted to a recently published model of Neanderthal ancestry that they are uniquely equipped to validate or invalidate. Or a molecular biologist with the perfect combination of reagents, gizmos, and trainees to try to replicate claims of a bacterium being able to survive on arsenic.

Is this possible? I think so. Stack Exchange is already aiming to predict which users are best equipped to answer which questions (a goal with striking similarities to Nielsen’s imagined future for scientists). Imagine a system that has knowledge of a lab’s skills based on the CVs or LinkedIn profiles of the members and the “Methods” sections of publications, combined with a detailed knowledge of the lab’s inventory (say, based on some lab management software like Labguru or Quartzy) and potential collaborations (based on past collaborations from the publication record and knowledge of the lab’s social network, pulled from Mendeley or LinkedIn or ResearchGate). Then cross-reference all of this against, say, the methods sections of recent publications… or better yet, against semantic analysis of post-publication peer-review commentary about a publication (which is where most of the ideas about the “next experiment” are likely to be found) from sources such as F1000, Twitter, research blogs, hypothes.is, etc.
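As a toy illustration of the cross-referencing step described above, here is a rough Python sketch. It assumes the hard parts are already done: each lab’s skills, equipment, and collaborations have been boiled down to a set of capability labels, and the “next experiment” suggested by a paper has been boiled down to a set of needed capabilities. None of this comes from Labguru, Quartzy, Mendeley, or any other real service; the names `agency_score` and `labs_to_alert`, the capability labels, and the 0.75 threshold are all hypothetical.

```python
def agency_score(lab_capabilities, needed):
    """Fraction of the needed capabilities that the lab already has."""
    if not needed:
        return 0.0
    return len(lab_capabilities & needed) / len(needed)


def labs_to_alert(labs, needed, threshold=0.75):
    """Return lab names whose score meets the threshold, best match first.

    `labs` maps a lab name to its set of capability labels.
    """
    scored = [(agency_score(caps, needed), name) for name, caps in labs.items()]
    return [name for s, name in sorted(scored, reverse=True) if s >= threshold]


if __name__ == "__main__":
    # Hypothetical labs and a hypothetical follow-up experiment.
    labs = {
        "lab_a": {"bacterial culture", "mass spectrometry", "arsenic assays"},
        "lab_b": {"fmri", "eye tracking"},
    }
    needed = {"bacterial culture", "mass spectrometry"}
    print(labs_to_alert(labs, needed))
```

Simple set overlap is about the crudest matching one could do; the interesting (and unsolved) work is in extracting those capability sets from CVs, inventories, and commentary in the first place.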

The key here is that not only are scientists affected by what others publish, but we have the opportunity to produce research that affects the rest of the world. And maybe we can start building tools that help us know where to focus our energies and resources in order to do exactly that.

* It should be noted that the current model of scientific publishing puts up a barrier to the “Anyone who wants to know” part, as access is currently limited to institutional subscribers or those who are willing to pony up $30+ per article… which isn’t “anyone”. The key challenge here for scientific publishing is giving “anyone who wants to know” access to scientific publications.

“_MG_9934” CC-BY RoyZilla92