Refer Summer 2016


Table of Contents

Proposal for the CILIP Information Services Group (ISG) to take a key role in the Knowledge and Information Management Special Interest Group (KIM SIG)

David Smith, Chair ISG National Committee

Where Did All the Information Go? Well at Least the Important Stuff

Paul Johnson, Staffordshire University

When You Don’t Want the Web to Forget You

Jonathan Eaton, London Business School

Google Just Gets to the Tip of the Iceberg: How to Get to the Gems in the Deep Web

Abe Lederman, Deep Web Technologies

The End of Print Reference Books: True or False?

Peter Chapman, Former Editor Refer

How to Protect Official Information From Being Amended

Donna Ravenhill, Dandy Booksellers Ltd

Online British Official Publications from the University of Southampton

Joy Caisley, Julian Ball, Matthew Phillips. University of Southampton Library

Recent Changes to Parliamentary Publishing in the UK

Steven Hartshorne, Secretary of SCOOP

User Experience in Libraries: Can Ethnography Help?

Helen Edwards, Editor Refer

My Library School Experience – What a Journey: Part 1

Simone Charles

Refer, the journal of the Information Services Group of the Chartered Institute of Library and Information Professionals (CILIP), is published three times a year and distributed free to members of the Group.

Editor: Helen Edwards

Editorial team: Lynsey Blandford, Ruth Hayes

Cover Design: Jonas Herriot

Contact: Helen Edwards 07989 565739;

Copyright © The contributors and the ISG 2016

Online edition

The ISG Reference Awards 2016 will take place on 9 November 2016 at CILIP HQ

ISG Regional Groups hold a number of events. Details can be found at:



Proposal for the CILIP Information Services Group (ISG) to take a key role in the Knowledge and Information Management Special Interest Group (KIM SIG)

David Smith, Chair ISG National Committee

The Proposal

It is proposed that the Information Services Group forms a founding and integral part of the proposed Knowledge and Information Management Special Interest Group (KIM SIG), which is to be set up by the end of 2016 as a focal point for the professional IM community within CILIP. This paper sets out the reasons why this is being proposed and some ideas as to what should carry over into the new SIG. This document should be read in conjunction with another document produced by CILIP, “Call for expressions of Interest from CILIP Special Interest Groups in being part of (or otherwise involved in) a Knowledge and Information Management Specialist Interest Group”. A briefing on the CILIP proposal, written by Karen McFarlane, is featured in CILIP Update, June 2016, p11.

For the purposes of this consultation, a definition of Knowledge Management is “the process of capturing, distributing, and effectively using knowledge” (see …/What-is-KM-Knowledge-Management-Explained-82405.aspx). However, the new KIM SIG will ultimately make its own decision on this definition.

  1. The Context

We all know that libraries and information services are changing rapidly across all sectors. Some might say not for the best, but we need to be mindful of today’s culture of hunger for information, and the wealth of information available at the click of a mouse. A traditional public reference library is rapidly becoming a thing of the past, although the skills required to negotiate through the information overload encountered by most people are needed as much now as they have ever been. This is an issue common to all library sectors.

The ISG was formed within CILIP to provide a focal point for reference and information work. Whilst the need for training on information services provision still exists, it is no longer seen as the realm of the specialist Reference Librarian but more as something from which all library staff should benefit. Also, more and more dedicated information specialists within CILIP are recognising that a more KIM focussed approach is needed to support them in their current duties. A number of units, especially those in companies, already have a knowledge management role, and some have the title of knowledge centre rather than library or information service to demonstrate this. Therefore a new KIM SIG will be formed regardless of ISG’s support.

ISG has been successful over the years in promoting the field of Reference & Information Services across all library sectors, and it is important that we celebrate its success and current activity. There is a national Committee with committed officers. Activities include:

  • Two active regional sub groups (London & South East and East of England)
  • Refer Journal
  • Reference Awards (annual)
  • SCOOP (Standing Committee for Official Publications)
  • IFLA support for Reference & Information Services

From a financial perspective, our Treasurer has stated that ISG in its current format is being kept afloat mostly by its cash reserves, with a regular transfer from reserves to current account to enable it to keep going. Whilst this is sustainable in the short term, in the middle to longer term (2018-20) it is a recipe for failure unless other sources of income are developed. We now need to find a way of keeping the best parts of ISG going in a new form.

In its Business Plan for 2016, ISG agreed to take part in the KIM discussions and the National Committee agreed to be represented by its Chair. He attended an initial meeting on 26th April 2016 at CILIP HQ. With the background and experience of the ISG membership, he believes that ISG is in a strong position to make a key contribution to supporting and underpinning CILIP’s KIM activity. Also around the table were the UK Electronic Information Group (UKeIG), Government Information Group (GIG), Commercial, Legal and Scientific Information Group (CLSIG) and Health Libraries Group (HLG) who all said that their SIGs would like to be represented on the new KIM Committee.

  2. The Proposal – that ISG is integrated into the proposed new wider KIM Special Interest Group

It is proposed that ISG ceases to exist in its current format.

  • The existing sub groups and successful activities listed above will be integrated into the new wider KIM SIG and report to the new KIM Committee.
  • All existing ISG members will be transferred into membership of the KIM SIG automatically unless they specifically request otherwise.
  • Any unallocated ISG funds will be transferred to the new KIM SIG at the point when ISG has been integrated.
  • The Chair of ISG will serve on the KIM Committee and is happy to play a key role in its establishment, should this be agreed by ISG Members, CILIP staff and KIM SIG sponsors.
  3. The Reasons
  • ISG Members would be able to support the KIM SIG activity with their professional backgrounds and experience
  • Information Services and Resources Management play a key part in KIM.
  • Integration with KIM may well be a necessary funding lifeline. There are definitely areas of synergy and cross over; and the development of a healthy number of supporting members is essential to ensure a longer term future. Integration may also give ISG members the opportunity to develop into new subject areas within the profession.
  • KIM will need a journal or some form of platform to promote KIM activity.  This could easily incorporate our existing publication Refer which is still relevant to the field – further discussion would be required as to how that would be achieved (may need a name change to reflect this, eg “Information & Resources Matters”)
  • The current Reference Awards already promote existing printed and electronic sources of information, and a KIM award could perhaps be added as part of this
  • The two existing regional groups could support wider KIM activity at a more local level which will also hopefully help in recruiting new members to their Committees and provide a good focus for events
  • ISG is the parent body of SCOOP (The Standing Committee On Official Publications). The work of SCOOP has necessarily been turning increasingly to the digitisation of official information and publications. This in turn is more about managing and making users aware of digital content, which fits with the general KIM agenda.
  • The role and function of reference and information provision in all library sectors is both changing and developing in response to the online environment. Existing reference and information services still need professional support in changing circumstances. We are fortunate that London and the South East still have a good number of specialist reference libraries, along with the East of England and Manchester / North West. However, the role of information specialist has now been dispersed across front line service staff, which means that they are more in need of support and relevant training than ever. This proposal details a model that could ensure this still happens within a more stable financial setting.
  • ISG representation would also ensure that the public, customer facing roles and issues of those staff who are involved with information provision in its broadest sense are maintained. It would also champion the customer facing aspects that need to be considered by those working in a KIM environment on a corporate basis. The successful visits and events organised by our sub-groups can assist here.
  4. The Timetable

The proposed timetable is as follows:

  • June / July / August 2016: Formal period of consultation with the full ISG membership on the proposals contained within this paper
  • July 2016: Progress report on formation of the KIM SIG taken to the CILIP Board
  • August 2016: Responses to be assessed by the ISG Committee to determine the feelings of the wider ISG membership
  • End August 2016: CILIP staff and KIM SIG sponsors informed of the outcome of the ISG consultation
  • September 2016: CILIP Board to formally approve the introduction of the KIM SIG
  • July – November 2016: Setting up of the new KIM SIG, and transition committee to develop a KIM SIG programme for 2017
  • December 2016: Launch of the KIM SIG with the above elements of ISG integrated within it.

Please note that the KIM SIG sponsors among CILIP staff indicate that this is a draft outline timetable. Later elements of the timetable may need rescheduling in light of the results of the consultation on the Strategic Plan, the schedule of CILIP Board meetings (when decided), and the precise mechanics needed to set up a new KIM SIG.

  5. Next Steps and Queries

This proposal will help contribute to discussion on the setting up of a KIM SIG. This consultation will help refine the proposal and identify the challenges and problems that need to be addressed. Relevant CILIP staff will also provide appropriate assistance.

Any comments relating to this proposal, or suggestions for alternative options should be forwarded by 12/08/2016 to:

David Smith, Chair, ISG National Committee

Refer 32(2) Summer 2016

Where Did All the Information Go? Well at Least the Important Stuff

Paul Johnson, Staffordshire University


There is a lot of current concern about the sheer number of web pages and digital documents being lost forever. In this definition “lost” implies destroyed. A 2011 report by the Chesapeake Digital Preservation Group suggested that approximately 30% of a control group of 2,700 online law-related materials disappeared within three years. Librarians have been highlighting this concern for many years and must take a lot of the credit for the introduction of so many successful web archiving initiatives – including the Non-Print Legal Deposit legislation enacted in the UK in 2013.

However, for this article I want to focus on a different kind of “lost” information, where the definition of lost refers to information that cannot be (easily) found. In December 2015 I gave a presentation at a Koha library systems event, which explored the changing environment of discovery services within the academic market. I decided to focus a small part of the talk on how sensible it was to trust large, mainly corporate, organisations to dominate the retrieval of digital data and present it without prejudice or bias. This required me to delve past my normal scope of what I would like to believe is a fairly strong “google-fu” and look at a snapshot of the huge ethical and technical issue of storing and delivering information in the digital age.

I started with my usual professional search strategy and “googled” it (quick question: are other vacuum cleaner manufacturers allowed to use the verb to ‘hoover’ yet? Hmmm I’d better google it!) and quickly fell upon the blog of one of my favourite digital librarians Aaron Tay – Musings about librarianship. One of Aaron’s posts from 2011, which originally came from Lane Wilkinson’s Sense & Reference blog (2011), was titled 3 things to show at library sessions and made me realise that, apart from some of the public Google spats, I had not really thought about lost information in the sense that the information cannot be found as opposed to having been destroyed.

Lane posted a Google search example, which can be used to show students that Google does not actually bring back all the results it claims to. The gist of the example is that searching for ‘alcoholism’ (substitute any common term) initially informs the user that Google has brought back 33,200,000 results in 0.44 seconds (variable). Then ask the user to jump through the first 23 or so pages before they hit the following message:

In order to show you the most relevant results, we have omitted some entries very similar to the 229 already displayed.
If you like, you can repeat the search with the omitted results included.

Clicking on the “repeat the search…” link runs the search again with no apparent limit on the number of results returned. Except wait, there is a limit! The results suddenly cease at page 75 with absolutely no explanation of why; it just provides the normal message of:

Page 75 of about 33,200,000 results

A quick calculation of 75 pages with 10 results a page suggests approximately 33,199,250 results have not been returned. I was amazed that this had passed me by and while many of you may be aware of this due to your interest in information, I’m guessing the average person is not aware of this restriction.
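That back-of-the-envelope calculation can be written out explicitly. A few lines of Python, using the figures quoted above, show just how little of the claimed total a user can actually page through:

```python
# Rough arithmetic behind the "missing" results: Google claims
# ~33.2 million hits, but stops serving pages at page 75,
# with 10 results per page.
claimed_results = 33_200_000
pages_served = 75
results_per_page = 10

retrievable = pages_served * results_per_page   # the 750 results a user can reach
never_shown = claimed_results - retrievable     # everything beyond page 75

print(f"Retrievable: {retrievable:,}")
print(f"Never shown: {never_shown:,} "
      f"({never_shown / claimed_results:.4%} of the claimed total)")
```

In other words, well over 99.99% of the claimed result set is unreachable through the interface, whatever the headline count says.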

This revelation led to another quick trawl of the Internet for some general web statistical reports (well at least the ones that Google allowed me to see!). I focussed on a 2013 report by Chitika Insights titled The Value of Google Result Positioning, which provided evidence that over 91% of users did not go past the first page of Google results. So is Google really to blame for not bothering to display results after page 75? The hard reality of commercialism would suggest not.

Ok, I thought, don’t panic – as experienced librarians we can just inform users that there is a multitude of other ways to retrieve hidden information. Perhaps we can convince them to embark on a complex literature review by using Boolean logic and erm? Really? Well maybe if they are on a Postgrad or PhD course or are just really into research, otherwise my experience is that it is pretty much a lost cause and, even worse, we run the risk of boring them into never talking to a librarian again.

Of course there are plenty of ways of continuing to enthuse students, such as the old Google search trick of adding domain and file type filters with the site: and filetype: operators, which usually brings a smile to those still awake in my library sessions.
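A couple of illustrative queries show the idea (the search terms are my own invented examples; site: and filetype: are Google's documented operators):

```
filetype:pdf "information literacy"        (restrict results to PDF documents)
site:ac.uk filetype:pdf open access        (PDFs from UK academic domains only)
```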

But my feeling is that it is more or less inevitable that eventually information retrieval is going to be wholly shaped by the automated ‘relevance trimming’ algorithms created by the technology giants. So the question I believe we need to ask is: where does that leave the automatically ignored information? Is it lost to all but the most tenacious of researchers and then lost to posterity after a short time of not being found?

I think the answer might, unfortunately, be yes and it goes to the heart of the question of how “sensible” it is to trust large corporate giants with the guardianship of digital information retrieval. Recent global disclosures of “hidden” information such as the Snowden files and the Panama Papers have called into question the public’s trust in ‘official’ organisations to keep data safe and not misuse it. At the same time there is an increasing trend to use social media even to communicate and disseminate personal information with little regard to how the data will be re-used. This would suggest that people are either comfortable with these tech giants handling their information or at least resigned to it.

As to the other part of the question: will the results be retrieved without ‘prejudice’ or ‘bias’? I would suggest these two concepts have to be considered separately. Society is beginning to take steps to ensure results are returned without prejudice by starting to police it with a combination of legal action and public outcry, both of which will have far reaching consequences to the future of information archiving.

One of the biggest shockwaves to information retrieval happened in May 2014 when the European Court of Justice ruled that individuals had a ‘right to be forgotten’ on the internet by ordering Google to remove links that are “inadequate, irrelevant or no longer relevant”. This ruling does not force the removal of the actual information, simply the ability to find it via Google. This and a number of global spats may explain why Google now produce a Google Transparency Report listing all the governmental requests it receives to remove content, possibly in an attempt to avoid being accused of censorship.

Bias, however, is a completely different proposition. It simply has to be part of any search engine relevance algorithm; the concern here is the accountability of those who decide and program how the bias is to be applied to retrieval results. Can we trust multi-billion-grossing corporations not to influence the shape of the digital landscape to advantage themselves? Do we actually have any choice? Unfortunately, at a time of decreasing public spending by governments across the world, it is hard to imagine any publicly funded alternative.

The onus, rightly or wrongly, seems to be shifting to the creators/distributors of the information to make sure the information can be found on the popular/relevant web services. If they cannot ensure constant visibility then the information may well become irrelevant and ‘lost’ (cannot be found) followed by ‘lost’ (destroyed) in a very short time compared to print and physical examples.

Luckily there are some solutions for academia, nearly all revolving around the Open Access initiatives. However, even with all the various Open Access solutions, a big part of the success so far is due to Google and Google Scholar indexing the articles and providing reliable access to them, in my experience, better access than the proprietary discovery solutions. As for the wealth of unmoderated social media posts, blogs and websites? Hopefully they can be saved for posterity by the various archiving initiatives mentioned before but for me personally I will be disappointed if in ten years’ time I can still find links to the thousands of stock inspirational quotes that are shared in their millions on social media but cannot find any links to a “Pulitzer-finalist 34 part series of investigative journalism”.


Chesapeake Digital Preservation Group (2011) “Link Rot” and Legal Resources on the Web: A 2011 Analysis.

Chitika (2013) The Value of Google Result Positioning.

Google (2016) Google Transparency Report.

Lafrance, A. (2015) Raiders of the lost web. The Atlantic. 14 October 2015.

McDonald-Gibson, C. (2014) Google must delete ‘irrelevant’ links at the request of ordinary individuals, rules top EU court. The Independent. 13 May 2014.

Tay, A. (2011) 3 things to show at library sessions.

Wikipedia (2016) List of Web archiving initiatives.

Wikipedia (2016) Open Access.

Wilkinson, L. (2011) Google has everything! (but the library has more!).


When You Don’t Want the Web to Forget You

Jonathan Eaton, London Business School


The recent dramas involving WikiLeaks, the so-called Panama Papers, high-profile hacks of customer data (TalkTalk), piracy of copyrighted materials and continuing well-founded concerns about personal privacy have all kept the debate on global information access squarely focused on unwanted data disclosure. Whether it’s a national security agency, a wealthy individual wanting to minimise tax, or just someone now wishing their youthful indiscretions hadn’t been posted by friends for posterity on social media, there’s now a chunk of internet content devoted to agony advice on managing your personal and corporate online reputation.

But there’s also the opposite concern to consider. What if you discovered a sudden loss of access to unique online information on which you personally rely – your work, or major achievements, or perhaps a body of other, socially-contributed data? This reared up most recently in the UK after the BBC, responding to Government pressure during its Charter review to restructure its operations and cut costs, threatened to purge its website archive of cookery recipes. Was this just a trivial scenario and an empty threat? Or was it instead a smart rejoinder by the broadcaster: simultaneously hurling a public relations barb at its political adversaries and issuing a stark reminder that such freely-accessible data, produced under its public service mission, are woven into the fabric of everyday lives?

So the issues around control – and persistence – of online data in the public domain can clearly pull us in several intellectually and emotionally different directions at once. The concerns that accountability and ownership of data may rest with global web organisations that are just as fallible as individuals (not to mention commercial properties that can be traded, and their content assets either stripped or junked) have been thoroughly and vividly documented in a recent New Yorker article, ‘The Cobweb’.

If an online information provider either accidentally or intentionally deletes data important to individuals or a wider community, what practical recourse is there? That’s where counter-initiatives like the Internet Archive can help, by combining its archive robot (the Wayback Machine) with the globally-dispersed resources of librarians and other enthusiasts who identify and contribute selections to it, rather in the way that Victorian-era amateur scientists contributed significantly to the emerging print research literature by documenting their findings from their suburban villas and country parsonages.
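The Internet Archive also exposes a simple JSON “availability” API for checking whether a page has a saved snapshot. The sketch below builds the request URL and parses a reply; the canned response is an illustrative example in the API’s documented shape, not a live reply:

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Endpoint of the Internet Archive's Wayback "availability" API.
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the API request URL for the closest archived snapshot."""
    params = {"url": page_url}
    if timestamp:  # optional target date, YYYYMMDD
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def closest_snapshot(api_response: str) -> Optional[str]:
    """Extract the snapshot URL from the API's JSON reply, if one exists."""
    data = json.loads(api_response)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# A canned reply in the API's documented shape (illustrative only):
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20160601000000/http://example.com/",
            "timestamp": "20160601000000",
        }
    }
})

print(availability_query("example.com", "20160601"))
print(closest_snapshot(sample))
```

An empty `archived_snapshots` object in the reply means no capture exists – the practical definition of a page the web has forgotten.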

A recent workplace example from academia may illustrate this dilemma and the difficult feelings of powerlessness that can result for affected individuals when the research data ecosystem suffers a proportionately tiny yet personally consequential data disruption. A senior professor contacted me for help last year. One of his early career research papers, published in a highly prestigious academic journal in the mid-1970s, had garnered a respectable body of citations in Google’s Scholar service, building on those found in other, paid-for citation sources and so independently confirming his work’s reputation, importance and value. Imagine his consternation when one day he dipped into Google Scholar to check this, only to discover that the paper, with all its linked unique machine-curated citations, had suddenly and arbitrarily vanished from the search results.

Anyone who has tried to contact Google as a ‘customer’ of its for-free services might well be reminded of Franz Kafka’s novel ‘The Castle’, such is the experience of trying to communicate with the Information Titan. For how readily can one categorically prove the data “was there the last time I checked”? Luckily the professor had kept some screen shots, but his attempts to report the issue to Google met only with stock automated responses. Some systematic checks revealed that all the articles published in the same issue of the journal, either side of his paper, continued to be indexed in Google Scholar – apart from his. Meanwhile, his work continued to be omitted, with no guarantee of when the publisher content corpus might be re-crawled to correct the error and, more importantly, restore the fragile, unique body of linked citations to digital life.

Troubleshooting this problem is made more difficult by the way Google works and by the different parties (or ‘network nodes’ in computer science terms) involved. Where does Google Scholar get its data? As we know, that is always undisclosed (in sharp contrast with for-profit subscription research services such as Web of Science, Scopus, or the Summon and Primo discovery services, to name only a few alternatives). Close reading of Google’s published support documents and consultation with Internet content experts hinted at the most likely cause. Google will use the journal publisher’s website as a primary metadata indexing source. But if, during a periodic (re-)crawl, its robots encounter a temporary network hitch that makes an expected content link return a “404” or other not-found type error, then the record will be skipped and so risk omission from subsequently updated indexes and search results. In this case, a butterfly clearly beat its wings…
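The inferred failure mode can be sketched in miniature. The snippet below is purely illustrative (the fetch function, failure rate and URLs are all invented; this is not Google’s code): a crawler that drops a record after a single transient error loses it from the rebuilt index, whereas one that retries almost never does.

```python
import random

def fetch(url: str) -> int:
    """Stand-in for an HTTP request that occasionally hits a
    transient glitch and returns 404 instead of 200."""
    return 404 if random.random() < 0.05 else 200

def crawl_fragile(urls):
    """Skip any record whose single fetch fails -- that record then
    vanishes from the rebuilt index, as the professor discovered."""
    return [u for u in urls if fetch(u) == 200]

def crawl_with_retries(urls, attempts=3):
    """Retry transient errors before giving a record up for lost."""
    indexed = []
    for u in urls:
        if any(fetch(u) == 200 for _ in range(attempts)):
            indexed.append(u)
    return indexed

random.seed(42)
papers = [f"https://journal.example/paper-{i}" for i in range(1000)]
print(len(crawl_fragile(papers)), "of 1000 papers survive a single-try crawl")
print(len(crawl_with_retries(papers)), "of 1000 papers survive with retries")
```

Even a small per-fetch failure rate wipes a visible fraction of records from the fragile index on every crawl, which is consistent with one paper in a journal issue disappearing while its neighbours remain.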

As a librarian, to be asked by an eminent academic whether you can get Google to re-crawl the web to restore his publications information to its search results seems equally Kafka-esque: a task that is professionally highly flattering as a request, but at the same time overwhelmingly beyond the limits of one’s personal control. Now, months later, with still no correction in prospect, the only contingency looks to be to add the publication data for the professor’s paper to our institutional repository, which is optimised for search engine crawling to the Google metadata format specification. A strategic professional lesson to draw from this experience is that we should not give up, or delegate away, our control of publications data at either personal or institutional level. Beware the data authority with which you cannot effectively collaborate or which you cannot directly influence: it can behave arbitrarily, forcing you into a time-consuming and possibly fruitless task of trying to ‘correct the record’.

T.S. Eliot’s famous lines in his 1934 pageant play collected as Choruses from ‘The Rock’ are often interpreted (with hindsight) to presage the modern Information Economy:

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? 

To which we might be tempted to add (prosaically): “Where is the control we have lost in web indexing?”

The author writes in a personal capacity.




Google Just Gets to the Tip of the Iceberg: How to Get to the Gems in the Deep Web

Abe Lederman, Deep Web Technologies


The Web is divided into three layers: the Surface Web, the Deep Web and the Dark Web. The Surface Web consists of several billion web sites, different subsets of which are crawled by search engines such as Google, Yahoo and Bing. The next layer, the Deep Web, consists of millions of databases or information sources – public, subscription or internal to an organization. Deep Web content is usually behind paywalls, often requires a password to access or is dynamically generated when a user enters a query into a search box (e.g. Netflix), and thus is not accessible to the Surface Web search engines. The content of the Deep Web is valuable because, for the most part, it is of higher quality than the Surface Web. The bottom-most layer, called the Dark Web, gained a lot of notoriety in October 2013 when the FBI shut down the Silk Road website, an eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. The Dark Web guarantees anonymity and thus is also used to conduct political dissent without fear of repercussion. Accessing the gems that can be found in the Deep Web is the focus of this article.

Michael Bergman coined the term “Deep Web” in a seminal white paper published in August 2001, The Deep Web: Surfacing Hidden Value. The Deep Web is also known as the Hidden Web or the Invisible Web. According to a study conducted in 2000 by Bergman and colleagues, the Deep Web was then 400-550 times larger than the Surface Web, consisting of 200,000 websites, 550 billion documents and 7,500 terabytes of information. Every few years, while writing an article on the Deep Web, I search for current information on its size and am unable to find anything new and authoritative. Many of the articles I come across still refer, like this one, to Bergman’s 2001 white paper.

Many users may not be familiar with the concept of the Deep Web. However, if they have searched the U.S. National Library of Medicine’s PubMed database, searched subscription databases from EBSCO or Elsevier, searched the website of a newspaper such as the Financial Times, or gone online to purchase a train ticket, then they have been to the Deep Web.

If you are curious about what’s in the Deep Web and how to find some good stuff, here are some places you can go to do some deep web diving:

20 Ways to Search the Invisible Web: Although a bit commercial, this site has a good resources list, including #6, developed by Deep Web Technologies. Explore the site and follow links to some interesting web pages.
DMOZ: The largest, most comprehensive human-edited directory of the Web. DMOZ, which started out as the Open Directory Project and is now owned by AOL, Inc. (originally known as America Online), is a Wikipedia for web sites: some 90,000 volunteer editors have categorized 4,000,000 web sites into 1,000,000 categories. Fun to explore, although many of the links are to surface web sites rather than deep web resources.

Library of Congress E-Resources Online Catalog: 1,400 resources cataloged in 30+ subject areas, including hundreds that are freely accessible.
ResourceShelf: Although this site is no longer active (they stopped posting new reviews of resources in February 2016), it had a great 15-year run. Gary Price and staff published 26,773 items, which are still accessible via the site’s archives or via a site: search.
Virtual Private Library: Includes thousands of curated resources across 50+ subject areas, including Business Intelligence, Genealogy, Healthcare and the World Wide Web itself. The site was developed by Marcus Zillman and uses Subject Tracer Bots (a trademark of VPL) that continuously search and monitor the Web for resources to add to these subject area guides.

Good sources of deep web databases, although many are not publicly available, are the database lists offered through the libraries of most academic institutions, including Harvard University, MIT, Oxford University, Princeton University, Stanford University and the University of Cambridge.

One trick I’d like to share with you for finding interesting resources in the Deep Web is to leverage the 430,000 (and growing) subject guides created by librarians, and people who are librarians at heart, using the LibGuides service provided by Springshare (a ProQuest company). So, continuing Helen Edwards’s theme in Dogs Revisited: Information for Dog Owners, I went to Google and entered the following search:

site: dogs

Google (the U.S. version) quickly returned 3,160 subject guides ranked by popularity (i.e. showing first the subject guides on dogs that others found useful and linked to). Included in the top subject guides are guides on service dogs and working dogs. As you might expect, there are also a number of guides on dog laws, such as these from Illinois and Louisiana. I also found an interesting and comprehensive guide on Dog Ownership, Nutrition and Care, created as a test guide by the folks at Springshare.

Of course, it would be remiss of me not to mention what my company, which, after all, is named Deep Web Technologies, does and how it can help users find gems in the Deep Web.

Deep Web Technologies develops state-of-the-art solutions using its Explorit Everywhere!™ federated search technology (see definition below), which provides one-stop access to as many as hundreds of deep web sources at once.

Federated Search is an application or service that allows users to submit a real-time search in parallel to multiple, distributed information sources and retrieve aggregated, ranked and de-duplicated results.
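The pattern this definition describes can be sketched in a few lines. The following is an illustrative reduction to three steps — query the sources in parallel, merge and de-duplicate, then rank — and not Deep Web Technologies’ actual implementation; the source names, records and scores are hypothetical.

```python
# Illustrative sketch of federated search: parallel querying of several
# (here, simulated) sources, followed by aggregation, de-duplication and
# ranking of the combined results.
from concurrent.futures import ThreadPoolExecutor

def search_source(source, query):
    # Stand-in for a real-time connector call to one deep web source.
    return [r for r in source["records"] if query in r["title"].lower()]

def federated_search(sources, query):
    # 1. Submit the search to all sources in parallel.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = pool.map(lambda s: search_source(s, query), sources)
    # 2. Aggregate and de-duplicate (here, same title = same record).
    seen, merged = set(), []
    for results in result_lists:
        for record in results:
            if record["title"] not in seen:
                seen.add(record["title"])
                merged.append(record)
    # 3. Rank the merged set by a relevance score.
    return sorted(merged, key=lambda r: r["score"], reverse=True)

sources = [
    {"name": "Source A", "records": [{"title": "Service dogs", "score": 0.9},
                                     {"title": "Dog laws", "score": 0.7}]},
    {"name": "Source B", "records": [{"title": "Dog laws", "score": 0.7},
                                     {"title": "Working dogs", "score": 0.8}]},
]
results = federated_search(sources, "dog")
print([r["title"] for r in results])  # de-duplicated, highest score first
```

In practice the hard problems are elsewhere — writing connectors for hundreds of heterogeneous sources and choosing a de-duplication key more robust than an exact title match — but the three-step shape is the same.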

Deep Web Technologies works with academic libraries, corporate clients, government agencies and other research-intensive organizations to build solutions that provide a single point of access to all the information sources that are important to these organizations and their users. These information sources include public sources, subscription sources, as well as sources internal to the organization.

We have developed a number of major public portals, such as the U.S. National Library of Energy, that provide good public demonstrations of sites that access the Deep Web.

I would like to conclude this article with a brief discussion of our vision for a future where access to the gems in the Deep Web is as easy as doing a Google search today.

Our grand vision centres on building a comprehensive Catalogue of ALL (used loosely) the quality information sources in the world, independent of language. I envision thousands of contributors, similar to how Wikipedia and DMOZ operate, contributing to the identification, description and rating of sources for the Catalogue. We will also need thousands of contributors to create the connectors needed to search the information sources available via the Catalogue.

An early version of this vision is laid out in a presentation I gave in June 2009 at the Special Libraries Association Annual Conference, entitled ‘Science Research: Journey to 10,000 Sources’, accompanied by an article available here.

Refer 32(2) Summer 2016

The End of Print Reference Books: True or False?

Peter Chapman, Former Editor Refer


 At the end of 2005, ISG stalwarts Diana Dixon and Amanda Duffy (in conjunction with Richard Fuller and Francesca McGrath) published what proved to be the final edition of the popular Group guide Basic Reference Resources for the Public Library (ISBN 0-946347-42-5).

Eleven years on from the mid-2005 cut-off date for updates to the previous 1998 edition, I came across a copy and wondered what a 2016 edition would contain. Naturally, my hypothesis was that few of the resources listed would still be in print, leaving Wikipedia to reign supreme.

To test this hypothesis, I took a randomly generated 10% sample of the almost 500 entries in the work and tried to trace a more recent edition of each. The table below shows the results.

Fate of entry                                              % of 2005 entries
Out of Print – no direct web equivalent                                   14
Web site only – print equivalent discontinued                             14
Updated or existing edition in print                                      58
Updated web site address (2005 entry was web site only)                   10
Web site died (2005 entry was web site only)                               4
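A 10% sample of this kind can be drawn with any random number tool; a minimal sketch in Python (the entry titles here are placeholders, not the guide’s actual contents):

```python
import random

# Placeholder stand-in for the guide's almost 500 entries.
entries = [f"Entry {n}" for n in range(1, 501)]

# Draw a 10% sample without replacement.
random.seed(2016)  # fixed seed so the draw is repeatable
sample = random.sample(entries, k=len(entries) // 10)
print(len(sample))  # 50 entries to trace
```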

I was pleasantly surprised by the remaining print representation (though I’m sure that working reference librarians would not be). Hats off to publishers such as Oxford Reference, Cengage, and Dandy Booksellers who remain true to print whilst developing parallel online equivalents.

Illustrative entries from the sample:

  1. Out of Print – no direct web equivalent

Dictionaries of abbreviations were a dying print resource even when the 2005 Guide was being prepared. A case of the Web’s currency being more effective – but will the Web ‘remember’ past uses?

The publications of CBD Research were a staple of my working life as a newspaper librarian but the company seems not to have survived the transition to the web…

Parliamentary Monitoring Services (PMS Guide to Interest Groups) has been absorbed by the Dod’s Group but the Guide seems to have disappeared behind the online subscription wall.

  2. Web site only – print equivalent discontinued

Record Depositories in Great Britain continues to be available through the National Archives web site.

United Kingdom Economic Accounts is typical of the datasets now available freely on the web site of the Office for National Statistics, and available to commercial publishers if they want to print them. Palgrave Macmillan used to do this one.

Likewise the information formerly printed in The Big Guide: the official universities and colleges entrance guide is now detailed on the UCAS web site. In this case, commercial publishers have produced similar guides for many years, many of which remain in print (for example HEAP: University Degree Course Offers).

  3. Updated or existing edition in print

To my delight, many favourite annuals continue to be updated in print: Dod’s Parliamentary Companion; Civil Service Yearbook; Charities Digest; The RHS Plant Finder to name but a few.

A surprising number of standard works remain in print, often updated post-2005 but not in the recent past. Examples are: The Libraries Directory:… (50th ed. 2009); Directory of Museums, Galleries… (5th rev. ed. 2013); The Encyclopedia of Religion (2nd ed. 2005); The Cambridge Encyclopedia of Language (3rd ed. 2010). Will they ever be updated in print again?

  4. Updated web site address (2005 entry was web site only)

Due to changes in the name of the sponsoring organisation, an address may have moved: one 2005 entry, for example, now points to Education Scotland’s site.

  5. Web site died

The biggest casualty is KellySearch, marking the end of the historic series of directories which recorded the development of commerce up and down the UK.

In the introduction to the work, the authors state that it ‘is intended to recommend the minimum adult reference stock requirements for a public library serving a population in excess of 75,000’. Fortunately for my research there is a metropolitan borough library service near to where I live which boasts a separate Reference Library in its Central Library building. Would I find its shelves reflecting the continued availability of print reference books?

I’m pleased to report that they did. The Quick Reference section (pictured) and the accompanying Business & Education information sections displayed up-to-date stalwarts such as:

  • Crockford’s
  • Guinness World Records
  • Willings
  • Municipal Yearbook
  • Who’s Who
  • Magistrates’ Court Guide
  • Directory of Grant Making Trusts
  • Kompass
  • UK Primary Education Guide

and many more

Across from the Quick Reference section were up-to-date BT phone directories and Yellow Pages (pictured), whilst the extensive reference shelves had good collections of standard texts along with dictionaries, car guides, atlases, directories, and of course Wisden and the Sky Sports Football Yearbook.

So my expectation of the death of the printed reference resource has proved unfounded. Perhaps a subject to be revisited in a further five years?




Online British Official Publications from the University of Southampton


Joy Caisley, Julian Ball, Matthew Phillips, University of Southampton Library



The Library at the University of Southampton has a particularly strong collection of printed British Official Publications, known as the Ford Collection.  The collection is named after the late Professor Percy Ford and his wife Dr Grace Ford, who brought the collection to the University of Southampton from the Carlton Club in the 1950s and conducted research based on it.  Hoping to “increase both the appreciation and the use of…” official publications (Ford, 1951, p. vii), the Fords compiled ‘breviates’ or ‘select lists’, in seven volumes covering the years 1833–1983.  These were not catalogues of all British Official Publications.  Instead the Fords identified and summarised documents which “have been, or might have been, the subject of legislation or have dealt with ‘public policy’” (Ford, 1951, p. ix).

The work of the Fords was the impetus behind our later activities when, in 1995, we began using a database rather than a card file to catalogue and abstract new additions to the Ford Collection.  Initially this was on a stand-alone computer for our own internal use, but it was later offered as an internet service, BOPCAS (British Official Publications Current Awareness Service), which some of you will remember.  As technology moved on and external funding opportunities arose, we extended our indexing to cover older publications and moved into full-text digitisation, making many publications freely available to all.  Although funding was for specific tasks and periods, the Library continues to work unfunded with these valuable digital collections in 2016 to ensure that they are made fully accessible to readers worldwide.  The notes which follow are primarily about ‘what happened next’ with regard to BOPCRIS and EPPI, two of a cluster of official publications digitisation projects with which we were concurrently involved from 2001 to 2007.

Historic Background

BOPCRIS and EPPI were projects funded by the New Opportunities Fund (NOF) and the Arts and Humanities Research Council (AHRC), which aimed to make British Official Publications more ‘findable’ and available online, all free of charge to the user.

BOPCRIS: British Official Publications Collaborative Reader Information Service.  This project aimed to index and abstract a selection of 18th–20th century official publications, primarily Parliamentary but also some non-Parliamentary material.  The documents chosen for inclusion were mainly those selected by the Fords.  Information about the locations of printed copies was to be provided, and it was planned to provide digital copies of a selection.  The best description still available online, from our then technical partners at Bristol University’s ILRT division, can be seen here:

However, constraints of time, money and storage costs, together with the lack of an exit strategy, meant that the University of Southampton had to abandon some of these aims (with effect from 1st August 2009).  We moved the bibliographic records to our online catalogue, WebCat, which ensures that the records can be found via COPAC, but the abstracts could not be imported.  The digitised versions of the selected documents were stored, in the hope that they would in future find a home.  Sadly, not all of those scans survived, but some did.

EPPI: Enhanced Parliamentary Papers on Ireland, 1801–1922.  This project began with the explicit mission not only to catalogue, but also to provide digitised versions of, about 14,000 British Parliamentary papers on Ireland.  The EPPI website proved of great interest to family historians, as a full-text search could be conducted across the whole set.  Its demise led to many cries of anguish from that quarter!

As with BOPCRIS, in August 2009 we moved EPPI’s bibliographic records to our online catalogue, WebCat.  The scans were archived, but we also passed the digital files to ‘DIPPAM – Documenting Ireland: Parliament, People and Migration’, whilst retaining our right to publish them further given the right circumstances.

The scanned EPPI documents were later made freely available via links from the WebCat records.  The images were stored on our internal servers, and storage costs continued to be an issue.  Apart from the storage costs, linking to PDFs via WebCat was rather clumsy, as long papers had to be split into small files to achieve a reasonable download time.

18th Century Parliamentary Publications (1688–1834).  With JISC funding we digitised and presented over one million pages of printed parliamentary texts sourced from the University of Cambridge, the British Library and the University of Southampton.  Without any post-project funding, and with a requirement to curb our annual storage fees for 15TB of data, we entered into a time-limited exclusive agreement with Chadwyck-Healey/ProQuest, and they added the 18th century materials to HCPP (House of Commons Parliamentary Papers Online, now known as U.K. Parliamentary Papers).  This material remains free to the UK Higher Education community via that ProQuest service.


What’s happening now?

Partly through pragmatism (storage and maintenance costs), but also through some idealism (the greater good of the wider research community), we have started moving the existing digitised items to the Internet Archive.  Two sub-collections have been established under the ‘Library Digitisation Unit, University of Southampton’ in the Internet Archive (IA), called ‘British Parliamentary Publications’ and ‘British non-Parliamentary Publications’.  These sub-collections hold all the previously digitised EPPI papers, BOPCRIS papers and the non-Parliamentary publications currently being digitised by the in-house Hartley Library Digitisation Unit (LDU).  These and other resources from the University of Southampton can be found at:

In total there are 14,270 digitised papers in the ‘Parliamentary Publications’ collection on IA.  The non-Parliamentary set is still growing, but currently comprises about 1,850 documents selected for BOPCRIS, most dating from the 1950s through to the mid-1980s.  We are particularly proud of the non-HMSO/TSO documents, e.g. Trespass on residential premises, a 1983 Home Office consultation, 26 pages, typed, published by the Home Office itself.

We are also working on digitising some early 20th century publications relating to transport, in order to support a specific current strand of research at the University of Southampton.  We hope to continue steadily digitising from our own collections of non-Parliamentary publications, irrespective of subject matter.

Internet Archive was chosen for the delivery of the materials from the Hartley Library as:

  • It has a large corpus of materials that readers worldwide know about and can easily access
  • Materials added from our collection join similar subject resources made available by other institutions thus building towards an extensive collection
  • It provides a sustainable delivery mechanism for resources
  • Materials are also made available from other portals that download or link to data in Internet Archive
  • IA provides an archive of master imagery that can be retrieved by the depositing institution
  • The printed texts are processed with optical character recognition (OCR) by IA using ABBYY software
  • IA provides a free delivery service for institutions wishing to provide free open access to their digital collections however large or small



We provide the resources online under the Open Government Licence, and the metadata of each publication carries the text, “Contains public sector information licensed under the Open Government Licence v3.0.”  A Creative Commons licence (Attribution-Noncommercial-No Derivative Works 3.0) is also associated with the metadata for each publication.

Can you help?

If you see errors (e.g. missing pages or poor scans) please let us know.  If you spot that we have digitised most of a series but are missing bits you have, do get in contact!

We know that our efforts regarding the non-Parliamentary publications cannot be comprehensive.  Our original collection was formed by the Fords, who ‘rescued’ many items as well as collecting in their own areas of interest.  There is no definitive catalogue against which we can check departmental publishing.  We know that we are privileged to have this collection, even with its gaps, so our aim is simply to broaden access to the extent we can.

We will always welcome any approaches regarding collaborative bids to digitise even more content.

References and acknowledgement

The portrait of the Fords was painted by Juliet Pannett, © 1974, and is reproduced by the kind permission of her son, Denis Pannett.

Other images are from: Great Britain. Department of Scientific and Industrial Research (1959) What they read and why: the use of technical literature in the electrical and electronics industries. London: Her Majesty’s Stationery Office. (Problems of Progress in Industry -4) [Online]. Available at

Ford, P. and Ford, G. (1951) A breviate of Parliamentary papers 1917-1939. Oxford: Basil Blackwell.

All web pages accessed and addresses correct, May 16th 2016.
