Paul Johnson, Staffordshire University
There is a lot of current concern about the sheer number of web pages and digital documents being lost forever. In this definition, “lost” means destroyed. A 2011 report by the Chesapeake Digital Preservation Group found that approximately 30% of a control group of 2,700 online law-related materials disappeared within three years. Librarians have been highlighting this concern for many years and must take much of the credit for the introduction of so many successful web archiving initiatives – including the Non-Print Legal Deposit legislation enacted in the UK in 2013.
However, for this article I want to focus on a different kind of “lost” information, where “lost” means information that cannot be (easily) found. In December 2015 I gave a presentation at a Koha library systems event exploring the changing environment of discovery services within the academic market. I decided to devote a small part of the talk to how sensible it is to trust large, mainly corporate, organisations to dominate the retrieval of digital data and to present it without prejudice or bias. This required me to delve beyond what I like to believe is a fairly strong “google-fu” and take a snapshot of the huge ethical and technical issues involved in storing and delivering information in the digital age.
I started with my usual professional search strategy and “googled” it (quick question: are other vacuum cleaner manufacturers allowed to use the verb to ‘hoover’ yet? Hmmm, I’d better google it!) and quickly stumbled upon the blog of one of my favourite digital librarians, Aaron Tay – Musings about librarianship. One of Aaron’s posts from 2011, which drew on Lane Wilkinson’s Sense & Reference blog (2011), was titled 3 things to show at library sessions, and it made me realise that, apart from some of the public Google spats, I had not really thought about information that is lost in the sense that it cannot be found, as opposed to having been destroyed.
Lane posted a Google search example that can be used to show students that Google does not actually return all the results it claims to. The gist is that searching for ‘alcoholism’ (substitute any common term) initially informs the user that Google has found 33,200,000 results in 0.44 seconds (figures vary). Ask the user to click through the first 23 or so pages and they hit the following message:
In order to show you the most relevant results, we have omitted some entries very similar to the 229 already displayed.
If you like, you can repeat the search with the omitted results included.
Clicking on the “repeat the search…” link runs the search again with no apparent limit on the number of results returned. Except wait, there is a limit! The results suddenly cease at page 75 with absolutely no explanation of why; it just displays the normal message of:
Page 75 of about 33,200,000 results
A quick calculation of 75 pages with 10 results per page shows that only 750 results are actually reachable, meaning approximately 33,199,250 of the claimed results are never returned. I was amazed that this had passed me by, and while many of you may be aware of it given your interest in information, I’m guessing the average person is not aware of this restriction.
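The arithmetic can be made concrete with a short Python sketch, using the figures quoted above and assuming Google’s default of ten results per page:

```python
# Rough illustration of how small a fraction of Google's claimed results
# is actually reachable. Figures are taken from the 'alcoholism' example
# above; ten results per page is assumed.
claimed_results = 33_200_000
last_page = 75
results_per_page = 10

reachable = last_page * results_per_page     # 750 results in total
unreachable = claimed_results - reachable    # 33,199,250 never shown
fraction = reachable / claimed_results

print(f"Reachable:   {reachable:,}")
print(f"Unreachable: {unreachable:,}")
print(f"Fraction reachable: {fraction:.6%}")
```

Roughly 0.002% of the claimed results can actually be viewed.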
This revelation led to another quick trawl of the Internet for some general web statistical reports (well at least the ones that Google allowed me to see!). I focussed on a 2013 report by Chitika Insights titled The Value of Google Result Positioning, which provided evidence that over 91% of users did not go past the first page of Google results. So is Google really to blame for not bothering to display results after page 75? The hard reality of commercialism would suggest not.
Ok, I thought, don’t panic – as experienced librarians we can just inform users that there is a multitude of other ways to retrieve hidden information. Perhaps we can convince them to embark on a complex literature review by using Boolean logic and erm? Really? Well maybe if they are on a Postgrad or PhD course or are just really into research, otherwise my experience is that it is pretty much a lost cause and, even worse, we run the risk of boring them into never talking to a librarian again.
Of course there are plenty of ways of continuing to enthuse students, such as the old Google search trick of adding domain and file-type filters, which usually brings a smile to those still awake in my library sessions, e.g. search for:
site:.ac.uk filetype:pdf search google
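For anyone wanting to assemble such filtered searches programmatically, a small Python helper can build the query URL. The `site:` and `filetype:` operators are Google’s documented syntax; the function name, parameters and example search terms below are my own illustrative invention:

```python
from urllib.parse import urlencode

def filtered_search_url(terms, site=None, filetype=None):
    """Build a Google search URL combining the query terms with
    optional site: and filetype: operators (hypothetical helper)."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})

# e.g. restrict a search to PDF documents on UK academic sites
print(filtered_search_url("information literacy", site=".ac.uk", filetype="pdf"))
```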
But my feeling is that it is more or less inevitable that eventually information retrieval is going to be wholly shaped by the automated ‘relevance trimming’ algorithms created by the technology giants. So the question I believe we need to ask is: where does that leave the automatically ignored information? Is it lost to all but the most tenacious of researchers and then lost to posterity after a short time of not being found?
I think the answer might, unfortunately, be yes, and it goes to the heart of the question of how “sensible” it is to trust large corporate giants with the guardianship of digital information retrieval. Recent global disclosures of “hidden” information, such as the Snowden files and the Panama Papers, have called into question the public’s trust in ‘official’ organisations to keep data safe and not misuse it. At the same time there is an increasing trend towards using social media to communicate and disseminate even personal information, with little regard to how the data will be re-used. This would suggest that people are either comfortable with these tech giants handling their information or at least resigned to it.
As to the other part of the question – will the results be retrieved without ‘prejudice’ or ‘bias’? – I would suggest these two concepts have to be considered separately. Society is beginning to take steps to ensure results are returned without prejudice, policing it with a combination of legal action and public outcry, both of which will have far-reaching consequences for the future of information archiving.
One of the biggest shockwaves to information retrieval came in May 2014, when the European Court of Justice ruled that individuals had a ‘right to be forgotten’ on the internet, ordering Google to remove links that are “inadequate, irrelevant or no longer relevant”. This ruling does not force the removal of the actual information, only the ability to find it via Google. This and a number of global spats may explain why Google now publishes a Transparency Report listing the governmental requests it receives to remove content, possibly in an attempt to avoid being accused of censorship.
Bias, however, is a completely different proposition. It simply has to be part of any search engine relevance algorithm; the concern here is the accountability of those who decide and program how the bias is applied to retrieval results. Can we trust multi-billion-grossing corporations not to shape the digital landscape to their own advantage? Do we actually have any choice? Unfortunately, at a time of decreasing public spending by governments across the world, it is hard to imagine any publicly funded alternative.
The onus, rightly or wrongly, seems to be shifting to the creators/distributors of the information to make sure the information can be found on the popular/relevant web services. If they cannot ensure constant visibility then the information may well become irrelevant and ‘lost’ (cannot be found) followed by ‘lost’ (destroyed) in a very short time compared to print and physical examples.
Luckily there are some solutions for academia, nearly all revolving around Open Access initiatives. However, even with all the various Open Access solutions, a big part of their success so far is due to Google and Google Scholar indexing the articles and providing reliable access to them – in my experience, better access than the proprietary discovery solutions. As for the wealth of unmoderated social media posts, blogs and websites? Hopefully they can be saved for posterity by the various archiving initiatives mentioned before, but personally I will be disappointed if in ten years’ time I can still find links to the thousands of stock inspirational quotes shared in their millions on social media but cannot find any links to a “Pulitzer-finalist 34-part series of investigative journalism”.
Chesapeake Digital Preservation Group (2011) “Link Rot” and Legal Resources on the Web: A 2011 Analysis. http://worldcat.org/arcviewer/5/LEGAL/2011/06/15/H1308163631444/viewer/file2.php
Chitika (2013) The Value of Google Result Positioning. https://chitika.com/google-positioning-value
Google (2016) Google Transparency Report. https://www.google.com/transparencyreport/removals/government/
Lafrance, A. (2015) Raiders of the lost web. The Atlantic. 14 October 2015. http://www.theatlantic.com/technology/archive/2015/10/raiders-of-the-lost-web/409210/
McDonald-Gibson, C. (2014) Google must delete ‘irrelevant’ links at the request of ordinary individuals, rules top EU court. The Independent. 13 May 2014. http://www.independent.co.uk/news/world/eu-court-says-google-must-delete-irrelevant-data-at-the-request-of-ordinary-individuals-9360707.html
Tay, A. (2011) 3 things to show at library sessions. http://musingsaboutlibrarianship.blogspot.co.uk/2011/11/3-things-to-show-at-library-sessions.html
Wikipedia (2016) List of Web archiving initiatives. https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
Wikipedia (2016) Open Access. https://en.wikipedia.org/wiki/Open_access
Wilkinson, L. (2011) Google has everything! (but the library has more!). https://senseandreference.wordpress.com/2011/11/10/google-has-everything/
Refer 32 (2) Summer 2016