Introduction
This chapter is probably
the book's most speculative, in that it discusses broad-based computational
access to scholarly literatures — a collection of developments that are likely to
happen largely as a consequence of increasing open access. Traditional open access is, in my view,
a probable (but not certain) prerequisite for the emergence of fully developed
large-scale computational approaches to the scholarly literature. It may not be
a sufficient prerequisite, particularly if the legal and systems architecture
frameworks currently being developed and deployed to support traditional open
access are not quickly adjusted to accommodate the needs of open computational
access. Indeed, even if such
accommodations are made, and if appropriate open access provisions were to be
universally established for all scholarly works going forward, there is still
an enormous, long-lasting problem with the established historical base of
scholarly literature. While
scholars tend to focus largely on new contributions to the literature, computational
technologies value and demand scale and comprehensiveness in the literature
base that they address; constraints on the use of the historical literature
will continue to represent a massive barrier to such computational uses. A move to open access may not help much
with this retrospective material.
I am confident that the
other chapters of this volume have done a fine job of describing the various
access models and practices that are being characterized by the term "open
access" in different settings, and the virtues and benefits that they share in
terms of democratizing access to varying degrees and in varying
dimensions. Indeed, we are seeing
some of these benefits (for example, access by readers in developing
countries) today not just as a result of author and publisher choices about
open access, but sometimes even as a result of publisher practices that could
be termed any kind of real "open access" only by the most imaginative and
dedicated public relations functionary.
Similarly, we are seeing some developments in computational access to
literature, most prominently for indexing (think of Google and
similar search engines, and their explicit arrangements with publishers, or
their efforts to implicitly compromise with publishers within the framework of
copyright law's fair use provisions through the indexing of copyrighted text
but the presentation only of brief "snippets" of copyrighted material),
outside of the open access framework.
Some publishers are also making explicit provisions for experimental
text mining, or allowing rehosting under license agreements, which opens the
door to arbitrary computational exploitation or representation of their
material within closed organizational contexts.
The case for the benefits
of open computational access to the scholarly literature is also much more
complex than the arguments usually marshaled for traditional open access — in
part because these benefits are indirect, and in part because they are still
considered largely speculative and unproven. They are indirect in that they merely open the way for
various players with good ideas to advance the progress of research and
scholarship in perhaps new and perhaps more accelerated ways; presumably, in
the long run, such research progress is of value to everyone. (Note that,
paradoxically, computational access to a scholarly literature for the purposes
of indexing may also make that literature more economically valuable in the
non-open access case, in that it may increase demand: witness the interest of commercial journal publishers in
having their material indexed in search engines.)
The benefits are
speculative in the sense that we are just beginning to understand and
demonstrate what we can accomplish, computationally, with large scholarly literature
corpora. A number of inter-related
technologies such as text mining and analysis are very active, vibrant and
well-funded research areas, attracting extensive participation and investment
from government and industry as well as academia. And, more recently, we are seeing experiments not only in
computing on literatures to derive insights, but in the actual rehosting of literatures within new analysis, usage and
curation environments: here a
scholarly literature is actually imported into a new usage environment that
adds value through computation and perhaps also through social interaction —
leading examples of this might include the work of the US National Center for
Biotechnology Information at the National Library of Medicine for the molecular
biology literature, or the fascinating experiments carried out by Greg Crane
and his colleagues at Tufts University in the Perseus Project. But it is
important to recognize that while researchers focusing specifically on
computational manipulation of scholarly literatures are reporting great
advances in their work, I think that the broad community of working scholars
remains to be convinced of the critical future contributions of such
technologies.
This brief chapter begins
an exploration of both the technical and the legal issues involved in enabling
widespread application of computational techniques and technologies to the
research literature. There are
many more questions than answers at this stage.
Technological Opportunities
Let's perform a thought
experiment. Let us suppose, for
the moment, that the only copyright encumbrance on the scholarly literature was
that of attribution; articles could be freely replicated, and arbitrary
computations could be performed upon these articles. The results of these computations could be freely and widely
employed and shared. In such a
world, what do current technology trends suggest might be done with the
collection of articles that constitute the vast majority of the scholarly
literature in so many fields?
Clearly we would see the
widespread creation of copies of the scholarly literature, or very sizeable
subsets of this literature; these copies would reside in a great range of
personal, workgroup, and disciplinary settings for convenience of access and searching. Storage is getting very cheap, and
students and researchers cannot always count on the ubiquitous availability of
very inexpensive broadband connectivity.
We would see these copies of the published literature federated in
various ways with unpublished, preliminary, and proprietary materials, forming
knowledge bases that were unique to specific researchers, research groups,
corporations and other entities.
These federations would be facilitated by the ability to computationally
re-arrange and re-structure the literature.
We would also see an
explosion in services that provided access to this literature in new and
creative ways. Such services would
also incorporate specialized vocabulary databases, gazetteers, factual
databases, ontologies, and other auxiliary tools to enhance indexing and
retrieval. They would rapidly
transcend access to address navigation and analysis. One path here leads
towards more-customized rehosting of scholarly literatures and underlying
evidence into new usage and analysis environments attuned to the specific
scholarly practices of various disciplines.
We would also see a move
beyond federation and indexing to actual text mining and analysis, to the
extraction of hypotheses and correlations that would help to drive ongoing
scholarly inquiry. Indeed, the
literature would be embedded in a computational context that reorganized and
re-evaluated the existing body of knowledge as new literature became available. Initially, we would likely see a series
of leap-frog breakthroughs as these technologies rapidly advanced, but I think
it is likely that, over time, the state of the art in text mining and analysis
would stabilize or converge to a point where new computations over the common
literature base using the best state-of-the-art tools would only produce, at
best, modest incremental advances.
At this point the key leverage for wringing new discoveries from the
literature would pivot on two points of competitive advantage. The first would be early access to and
rapid integration of new contributions — including, most likely, preprints that had not, at
least yet, been peer reviewed, and perhaps segments of the historical
literature base newly entering the digital domain. The second would be the ability to quickly and successfully
integrate and exploit unreleased or non-public information — not
just unreleased preprints, but data, including negative data that had never
seen publication, in conjunction with the common shared public literature base
and ancillary public data and knowledge bases.
It's also near certain that
these innovations would not apply to all scholarly disciplines uniformly. Areas such as biomedicine or chemistry,
where much of the literature is relatively well-structured and where a base of
investment in the development of auxiliary knowledge structures such as factual
databases, ontologies, specialized vocabularies and vocabulary mappings and
similar tools has been extensive, would likely be fertile ground for early
advances. Indeed, in these fields
we are already seeing the beginning of a re-evaluation of authorial practices
that proposes the incorporation of markup to facilitate exactly such
computational processing of the literature; consider the work of
scholars such as Peter Murray-Rust in chemistry, or the various proposals for
specialized markup languages in areas as diverse as history and molecular
biology. (In other web settings,
these efforts are being characterized as "micro-formats".) Other "hard" sciences, and certainly
many branches of the social sciences, would yield results more slowly. Many of the humanities would remain
recondite. And, of course, changes
in disciplinary practices of scholarly authoring would have a great
influence: to the extent that new
articles in the public literature base are routinely structured to facilitate
computational verification, integration or correlation, these disciplines would
presumably see greater payoffs for the applications of textual mining and
analysis. One can even imagine, in
certain highly competitive and commercially significant fields, deliberate
release of what is in effect disinformation to divert the attention of research driven by text
mining and literature analysis in deliberately unproductive directions.
Finally, in an environment
largely unencumbered by intellectual property issues, it's likely that the
tension between distributed and centralized computation will be resolved
primarily according to the mandates of technical simplicity and universality
rather than being shaped by the contortions enforced by licensing agreements
and the services that individual publishers choose to make available. While in theory there's a performance
tradeoff between, on the one hand, moving an interoperable, transportable,
network-based representation of the computation to the servers where the data
resides and executing procedural computational code remotely against that
database (the concepts implicit in the seminal work of Kahn and Cerf in their
classic report "The World of Knowbots", for example) and, on the other, the
infinitely simpler model that just copies all relevant data to a local store upon which computation occurs, it
seems to me most probable that, in the absence of intellectual property concerns
and licensing constraints, the obvious and universally understood framework
of creating local copies will triumph.
The practical will dominate the theoretically optimal. The local replication model is so much
simpler and more reliable and predictable than the alternatives, where it seems
likely that every remote execution environment will have its local
idiosyncrasies and constraints, and where large-scale literature analysis will
have to adapt to the variety of interfaces offered by different
publishers. These interfaces will
inevitably incorporate a series of tradeoffs that publishers design to prevent
computational access from allowing actual copying of the literature base
(consider, for example, the as yet nebulous Open Text Mining Interface proposal
— see http://blogs.nature.com/wp/nascent/2006/04/open_text_mining_interface_1.html).
Local replication also avoids the very
real additional complexities of correlating and consolidating results from
multiple remote computations executing in a range of remote, most likely
publisher-based, literature silos.
So it seems that, absent proprietary content ownership constraints, the
dominant paradigm and the fastest path to the payoffs of textual mining and
analysis, of the application of new digital library technologies designed to
import and host literatures in ways that add value to that literature, will be
to accumulate a local representation of the relevant literature, and then to
perform ongoing computations on that literature locally.
Real-World Conundrums
Let's move on from our
idealized thought experiment.
We are very unclear today
about whether even the systems that claim to offer "open access" to collections
of scholarly literature are being (or should be) designed to permit simple,
large-scale replication of these
collections in order to facilitate the creation of local resources that can be
computed upon. This is both a
technical question (is it easy to make a copy of the full collection?) and a
legal one (concerning what uses are allowed under the implicit or explicit
licenses). So one set of questions
is about whether we will provide the enabling technical infrastructure and
legal permission that facilitate computational access to scholarly literatures
even in the context of the various definitions of open access.
For the proprietary
scholarly literature, today's license agreements generally preclude the
creation of large literature subsets external to the publisher's site, and,
indeed, user attempts to perform large-scale downloading have raised alarms and
led to difficult and awkward discussions involving publishers or aggregators,
licensing institutions (universities) and end users about the appropriateness
and legality of creating such local mirror databases. At least in theory, if the creation of local copies of
literature databases derived from large-scale downloads from various publishers
becomes a standard and accepted practice for faculty at licensing universities,
one might presume, or at least hope, that most publishers (though
there would undoubtedly be holdouts) would revise and adapt their license
agreements to recognize and permit such practice.
For open access materials,
the creation of large-scale collections of copies is often ambiguous in the
absence of specific permissions; we are moving towards a legal understanding
that suggests public-access content is available for reading, but the ability
to re-host long lived copies is less clear. Open access content offered under terms such as the Creative
Commons license agreements reduces the uncertainty here — but
not necessarily for downstream use,
as I will shortly discuss.
Clear legal rights to make
large-scale copies of the literature are just the beginning of the legal
conundrums that will create barriers to open literature computation. What is the legal status of the results
of computations upon such copies?
What is the legal status of a re-hosting of these materials within a new
computational context that facilitates linkages, re-presentation, exploration
and analysis of a literature corpus?
As far as I can determine these questions are largely unexplored and
unresolved in law — both case law and legislation. We have the well-established concept of
a derivative work (for example, a translation of a work); creating a
derivative work requires permission from the rights holder of the original
work. At least when the process of
creating the derivative incorporates substantial new human intellectual effort,
new rights are overlaid upon those of the original author in the ownership of
the derivative. It is completely
unclear whether an algorithmic computation produces a true derivative work or
whether it is just considered a re-presentation of the original, but in either
case, rights in the algorithmic product certainly seem to include claims from
the source work. In cases where
the computation process takes as input an entire literature base, consisting of
perhaps hundreds of thousands of individual works, the authors of each and
every one of these input works
might have a claim on the output.
It is not at all clear that we can make the case that only a small and
selected subset of the input works made a material contribution to the output
and thus have claims upon that output.
Is it the case, for example, that if we rerun the algorithm on a copy of
the literature base excluding a single article and get the same result as if we
had not excluded that article, we could argue that this proves the result is
independent of the source article in question?
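The rerun test just described can be sketched in a few lines of code. What follows is only an illustration under stated assumptions: the four-article corpus, the toy "mining" function top_pairs, and the helper leave_one_out are all hypothetical stand-ins invented for this sketch, not any real text-mining system or legal test.

```python
from collections import Counter
from itertools import combinations


def top_pairs(corpus, k=1):
    """Toy stand-in for a mining result: the k term pairs that
    co-occur in the largest number of documents (hypothetical)."""
    counts = Counter()
    for text in corpus.values():
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:k]]


def leave_one_out(corpus, compute):
    """Rerun `compute` once per article, each time excluding that article,
    and return the IDs of articles whose removal changes the result."""
    baseline = compute(corpus)
    changed = []
    for doc_id in corpus:
        reduced = {other_id: text for other_id, text in corpus.items()
                   if other_id != doc_id}
        if compute(reduced) != baseline:
            changed.append(doc_id)
    return changed


# Hypothetical miniature literature base: article ID -> text.
corpus = {
    "a1": "kinase inhibitor binding",
    "a2": "kinase inhibitor assay",
    "a3": "kinase inhibitor structure",
    "a4": "medieval charter",
}

baseline = top_pairs(corpus)
influential = leave_one_out(corpus, top_pairs)
print("result:", baseline)
print("articles whose removal changes it:", influential)
```

In this toy case no single exclusion changes the result, even though the result plainly rests on articles a1 through a3 taken together. That may be one reason to doubt that the single-article rerun test could, by itself, settle the question of which source works have contributed to an output.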
The sheer volume of rights
that need to be cleared may effectively preclude the application of
computational technologies to large literature bases. If the literature base is offered by a publisher operating
within a framework where authors transfer copyright to the publisher, then
presumably the publisher could grant the necessary rights to allow meaningful
text mining of the corpus, or the importation of the corpus into a new analysis
and presentation environment.
(Whether publishers will actually be willing to do so is another, and
doubtful, proposition.) In cases
where the corpus is produced through open access type arrangements, unless the
transfer of (most likely nonexclusive) permissions to the host of the corpus
is crafted with great care and specific focus on the computational
opportunities, text miners and those wanting to import materials into new use
environments will have to engage in completely impractical and unrealistic
author-by-author clearing of permissions.
The Creative Commons (CC)
license is a good case study here.
It is a very valuable tool in reducing ambiguity about the permitted
uses of scholarly works, but it also illustrates how little thought has been
given to computational applications.
The CC license offers authors options about whether to permit the
creation of derivative works, and also options about whether they can insist on
author attribution in downstream uses of their works. Permission to create derivative works seems to be a clear
prerequisite for computational use of articles; yet this is rather different
than the way that this choice is presented to authors creating a CC license to
their works today. Even the
attribution requirement may be a source of problems — will
we have to list author attributions for every work in a literature corpus as
part of the attribution for any computational result from this literature
corpus? And, if so, how will we
practically meet this mandate? Is
there a need for a new Creative Commons provision that specifically deals with
authorizing and enabling the potential to text-mine, re-host or otherwise
compute upon works offered under CC licenses?
Creative Commons is
beginning to examine some of these issues through its Neurocommons initiative
within the Science Commons program.
Preliminary Conclusions
As the scholarly literature
moves to digital form, what is actually needed to move beyond a system that
just replicates all of our assumptions that this literature is only read,
and read only by human beings, one article at a time? What is needed to permit the creation of digital libraries
hosting these materials that move beyond the "incunabular" view of the
literature, to use Greg Crane's very provocative recent characterization? What is needed to allow the application
of computational technologies to extract new knowledge, correlations and
hypotheses from collections of scholarly literature?
Part of the answer is
legal. Clearly we need freedom to
copy, rehost, repurpose and compute upon the components of this
literature. (Note that while I
have not explicitly discussed large-scale retrospective digitization projects
here, this is equally applicable to these efforts, not just to new
contributions to the scholarly literature.) We need license terms that minimize or render moot the
uncertainties surrounding the creation of derivative works and possibly even
the requirements of attribution for source materials that have contributed to
the production of these derivative works.
The Creative Commons licensing framework offers a particularly urgent
and compelling environment for exploring these requirements.
The other part of the
requirement is technical. We need
to see provisions in hosting systems for large-scale replication as well as
item-by-item downloads of occasional copies of parts of the scholarly literature. While in theory this need might be
mitigated by the availability of interfaces that allow us to export
computations to repositories, I suspect that these will not fully satisfy the
needs for literature analysis and for new content analysis and synthesis
environments that assume the ability to rehost materials.
The opportunities are truly
stunning. They point towards
entirely new ways to think about the scholarly literature (and the underlying
evidence that supports scholarship) as an active, computationally enabled representation
of knowledge that lives, grows and interacts with its contributors rather than
as a passive archive or record.
They suggest ways in which information technology can accelerate the
rate of scientific discovery and the growth of scholarship. It would be a disgrace if we allowed
the inertia of historic scholarly publishing practices and the intellectual
property arrangements that underlie these patterns to foreclose such
opportunities. Open access offers
an important simplification and reduction of the barriers if its development is
shaped in a way that is responsive to these opportunities, although it is
certainly not a panacea in its current form.
What is ultimately at stake
here is a fundamental reconceptualization of the roles and uses of scholarly
literatures and the evidence that supports scholarship. The traditional intellectual property
framework of scholarly publishing is not hospitable to this
reconceptualization. The implications
of resolving this incompatibility will ultimately have far more extensive
ramifications than what we might today characterize as the "traditional" open
access movement; but they will be crucial to the future of science and
scholarship.