When You Don’t Want the Web to Forget You

Jonathan Eaton, London Business School


The recent dramas involving WikiLeaks, the so-called Panama Papers, high-profile hacks of customer data (TalkTalk), piracy of copyrighted materials and continuing well-founded concerns about personal privacy have all kept the debate on global information access squarely focused on unwanted data disclosure. Whether it’s a national security agency, a wealthy individual wanting to minimise tax, or just someone wishing their youthful indiscretions hadn’t been posted by friends for posterity on social media, there’s now a chunk of internet content devoted to agony-aunt advice on managing your personal and corporate online reputation [http://www.wikihow.com/Ungoogle-Yourself].

But there’s also the opposite concern to consider. What if you discovered a sudden loss of access to unique online information on which you personally rely – your work, or major achievements, or perhaps a body of other, socially-contributed data? This issue reared up most recently in the UK when the BBC, under Government pressure during its Charter review to restructure its operations and cut costs, threatened to purge the archive of cookery recipes from its website [http://www.bbc.co.uk/news/uk-36308976]. Was this just a trivial scenario and an empty threat? Or was it instead a smart rejoinder by the broadcaster, simultaneously hurling a public relations barb at its political adversaries and delivering a stark reminder that such freely-accessible data, produced under its public service mission, are woven into the fabric of everyday lives?

So the issues around control – and persistence – of online data in the public domain can clearly pull us in several intellectually and emotionally different directions at once. The concern that accountability for, and ownership of, data may rest with global web organisations that are just as fallible as individuals (not to mention being commercial properties that can be traded, with their content assets either stripped or junked) has been thoroughly and vividly documented in a recent New Yorker article, ‘The Cobweb’ [http://www.newyorker.com/magazine/2015/01/26/cobweb].

If an online information provider either accidentally or intentionally deletes data important to individuals or a wider community, what practical recourse is there? That’s where counter-initiatives like the Internet Archive [http://archive.org/web/] can help, combining an archive robot (the Wayback Machine) with the globally-dispersed resources of librarians and other enthusiasts who identify and contribute selections to archive-it.org, rather in the way that Victorian-era amateur scientists, documenting their findings from their suburban villas and country parsonages, contributed significantly to the emerging print research literature of their day.

A recent workplace example from academia may illustrate this dilemma and the difficult feelings of powerlessness that can result for affected individuals when the research data ecosystem suffers a proportionately tiny yet personally consequential data disruption. A senior professor contacted me for help last year. One of his early career research papers, published in a highly prestigious academic journal in the mid-1970s, had garnered a respectable body of citations in Google’s Scholar service, building on those found in other, paid-for citation sources and so independently confirming his work’s reputation, importance and value. Imagine his consternation when one day he dipped into Google Scholar to check on this, only to discover that the paper, with all its unique, machine-curated linked citations, had suddenly and arbitrarily vanished from the search results.

Anyone who has tried to contact Google as a ‘customer’ of its free services might well be reminded of Franz Kafka’s novel ‘The Castle’, such is the experience of trying to communicate with the Information Titan. For how readily can one categorically prove the data “was there the last time I checked”? Luckily the professor had kept some screenshots, but his attempts to report the issue to Google met only with stock automated responses. Some systematic checks revealed that every other article in the same issue of the journal, on either side of his paper, continued to be indexed in Google Scholar. Meanwhile, his work continued to be omitted, with no guarantee of when the publisher’s content corpus might be re-crawled to correct the error and, more importantly, restore the fragile, unique body of linked citations to digital life.

Troubleshooting this problem is made more difficult by the way Google works and by the different parties (or ‘network nodes’ in computer science terms) involved. Where does Google Scholar get its data? As we know, that is never disclosed (in sharp contrast with a for-profit subscription research service like Web of Science, Scopus, Summon or Primo discovery, to name only a few possible alternatives). Close reading of Google’s published support documents and consultation with Internet content experts hinted at the most likely cause. Google will use the journal publisher’s website as a primary metadata indexing source. But if, during a periodic (re-)crawl, its robots encounter a temporary network hitch that makes an expected content link return a “404” or other not-found type error, then the record will be skipped and so risks omission from subsequently updated indexes and search results. In this case, a butterfly clearly beat its wings…
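To make that suspected failure mode concrete, here is a toy simulation in Python. It is only a sketch of the general idea (an index rebuilt from whatever links resolve at crawl time), not Google's actual pipeline, which is undisclosed. The function names, the link paths and the single simulated "404" are all hypothetical.

```python
from typing import Callable, Dict, List


def recrawl(expected_links: List[str],
            fetch_status: Callable[[str], int]) -> Dict[str, List[str]]:
    """Rebuild a toy index, keeping only links that resolve at crawl time.

    Assumption for illustration: any non-200 status means the record is
    silently skipped rather than retried or carried over from the old index.
    """
    indexed: List[str] = []
    skipped: List[str] = []
    for link in expected_links:
        if fetch_status(link) == 200:
            indexed.append(link)
        else:
            skipped.append(link)
    return {"indexed": indexed, "skipped": skipped}


# Hypothetical journal issue containing three articles.
ISSUE_LINKS = ["/issue/article-1", "/issue/article-2", "/issue/article-3"]


def flaky_publisher(link: str) -> int:
    """Simulated publisher site: one article hits a temporary 404."""
    return 404 if link == "/issue/article-2" else 200


result = recrawl(ISSUE_LINKS, flaky_publisher)
print("Indexed:", result["indexed"])
print("Skipped:", result["skipped"])
```

Under these assumptions the neighbouring articles survive the re-crawl while the one that happened to fail drops out, which matches the pattern seen in the professor's journal issue. Nothing in the model restores the skipped record until the next crawl succeeds.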

As a librarian, to be asked by an eminent academic if you can get Google to re-crawl the web to restore his publication information to its search results seems equally Kafkaesque. It is a task that is professionally highly flattering as a request, but at the same time overwhelmingly beyond the limits of one’s personal control. Now, months later and with still no correction in prospect, the only contingency looks to be to add the publication data for the professor’s paper to our institutional repository, which is optimised for search engine crawling according to the Google metadata format specification. A strategic professional lesson to draw from this experience is that we should not give up or delegate away our control of publications data at either personal or institutional levels. Beware the data authority that you can neither effectively collaborate with nor directly influence, and that can behave arbitrarily, forcing you into a time-consuming and possibly fruitless task of trying to ‘correct the record’.
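For readers curious what "optimised for search engine crawling" can mean in practice, the Python sketch below generates the kind of bibliographic meta tags (citation_title, citation_author and so on) that Google Scholar's published inclusion guidelines ask repository pages to carry. The helper function is hypothetical and every bibliographic value is a placeholder, not the professor's actual paper. A real repository platform would normally emit these tags itself.

```python
from html import escape
from typing import List


def scholar_meta_tags(title: str, authors: List[str], publication_date: str,
                      journal_title: str, pdf_url: str) -> str:
    """Build HTML meta tags for one article's landing page.

    Hypothetical helper for illustration; the tag names follow Google
    Scholar's inclusion guidelines, with one citation_author tag per author.
    """
    pairs = [("citation_title", title)]
    pairs += [("citation_author", author) for author in authors]
    pairs += [
        ("citation_publication_date", publication_date),
        ("citation_journal_title", journal_title),
        ("citation_pdf_url", pdf_url),
    ]
    # Escape values so quotes or ampersands cannot break the HTML attribute.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


# Placeholder record: none of these values come from the case described above.
example = scholar_meta_tags(
    title='An "Example" Article Title',
    authors=["Example, Alice", "Sample, Bob"],
    publication_date="2000",
    journal_title="Journal of Examples",
    pdf_url="https://repository.example.org/papers/1.pdf",
)
print(example)
```

Adding such tags does not force a re-crawl; it only makes the repository copy easier for a crawler to identify correctly when it next visits.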

T.S. Eliot’s famous lines in his 1934 pageant play collected as Choruses from ‘The Rock’ are often interpreted (with hindsight) to presage the modern Information Economy:

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? 

To which we might be tempted to add (prosaically): “Where is the control we have lost in web indexing?”

The author writes in a personal capacity.

Refer 32(2) Summer 2016



