Google Just Gets to the Tip of the Iceberg: How to Get to the Gems in the Deep Web

Abe Lederman, Deep Web Technologies


The Web is divided into three layers: the Surface Web, the Deep Web and the Dark Web. The Surface Web consists of several billion web sites, different subsets of which are crawled by search engines such as Google, Yahoo and Bing. The next layer, the Deep Web, consists of millions of databases or information sources – public, subscription-based or internal to an organization. Deep Web content usually sits behind a paywall, requires a password, or is generated dynamically when a user enters a query into a search box (e.g. Netflix), and so is not accessible to the Surface Web search engines. This content is valuable because, for the most part, it is of higher quality than what is found on the Surface Web. The bottom-most layer, the Dark Web, gained a lot of notoriety in October 2013 when the FBI shut down the Silk Road website, an eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. The Dark Web guarantees anonymity and is therefore also used to conduct political dissent without fear of repercussion. Accessing the gems that can be found in the Deep Web is the focus of this article.

Michael Bergman coined the term “Deep Web” in a seminal white paper published in August 2001, The Deep Web: Surfacing Hidden Value. The Deep Web is also known as the Hidden Web or the Invisible Web. According to a study conducted in 2000 by Bergman and colleagues, the Deep Web was 400-550 times larger than the Surface Web, consisting of 200,000 websites, 550 billion documents and 7,500 terabytes of information. Every few years, while writing an article on the Deep Web, I search for current information on its size, and I am unable to find anything new and authoritative. Many of the articles I come across still refer, as this one does, to Bergman’s 2001 white paper.

Many users may not be familiar with the concept of the Deep Web. However, if they have searched the U.S. National Library of Medicine PubMed database, searched subscription databases from EBSCO or Elsevier, searched the website of a newspaper such as the Financial Times, or purchased a train ticket online, then they have been to the Deep Web.

If you are curious about what’s in the Deep Web and how to find some good stuff, here are some places you can go to do some deep web diving:

20 Ways to Search the Invisible Web Although a bit commercial, this site by About.com has a good resources list, including #6 (Science.gov), developed by Deep Web Technologies. Explore the site and follow links to some interesting web pages.
DMOZ The largest, most comprehensive human-edited directory of the Web. DMOZ, which started out as the Open Directory Project and is now owned by AOL, Inc. (originally known as America Online), is a Wikipedia for web sites: 90,000 volunteer editors have categorized 4,000,000 web sites into 1,000,000 categories. Fun to explore, although many of the links point to ordinary web sites rather than deep web resources.

Library of Congress E-Resources Online Catalog 1,400 resources cataloged in 30+ subject areas, including hundreds that are freely accessible.
ResourceShelf Although this site is no longer active (it stopped posting new reviews of resources in February 2016), it had a great 15-year run. Gary Price and staff published 26,773 items, which are still accessible via the site’s archives or via a site:resourceshelf.com search.
Virtual Private Library Includes thousands of curated resources across 50+ subject areas, including Business Intelligence, Genealogy, Healthcare and the World Wide Web itself. The site was developed by Marcus Zillman and uses Subject Tracer Bots (a VPL trademark) that continuously search and monitor the Web for resources to add to these subject area guides.

Another good source of deep web databases, although many are not publicly accessible, is the set of database lists available through the libraries of most academic institutions, including those of Harvard University, MIT, Oxford University, Princeton University, Stanford University and the University of Cambridge.

One trick I’d like to share with you for finding interesting resources in the Deep Web is to leverage the 430,000 (and growing) subject guides created by librarians, and people who are librarians at heart, using the LibGuides service provided by Springshare (a ProQuest company). So, continuing Helen Edwards’s theme in Dogs Revisited: Information for Dog Owners, I went to Google and entered the following search:

site:libguides.com dogs

Google (the U.S. version) quickly returned 3,160 subject guides ranked by popularity (i.e. showing first the subject guides on dogs that others found useful and linked to). Included among the top subject guides are guides on service dogs and working dogs. As you might expect, there are also a number of guides on dog laws, such as these from Illinois and Louisiana. I also found an interesting and comprehensive guide on Dog Ownership, Nutrition and Care, created as a test guide by the folks at Springshare.
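For the programmatically inclined, the same trick can be scripted. Here is a minimal Python sketch, using only the standard library, that builds a site:libguides.com query URL and opens it in your default browser; the helper name is mine, not any official tool:

    import urllib.parse
    import webbrowser

    def libguides_search(topic):
        """Build a Google search URL restricted to LibGuides subject guides."""
        query = "site:libguides.com " + topic
        return "https://www.google.com/search?q=" + urllib.parse.quote_plus(query)

    # Reproduce the dogs search from above in the default browser.
    webbrowser.open(libguides_search("dogs"))

Swap "dogs" for any topic of interest to surface the subject guides librarians have already curated on it.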

Of course, I would be remiss if I didn’t talk about what my company, which, after all, is named Deep Web Technologies, does and how it can help users find gems in the Deep Web.

Deep Web Technologies develops state-of-the-art solutions using Explorit Everywhere!™ federated search technology (see definition below) that provides one-stop access to as many as hundreds of deep web sources at once.

Federated Search is an application or service that allows users to submit a real-time search, in parallel, to multiple, distributed information sources and retrieve aggregated, ranked and de-duplicated results.
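To make that definition concrete, here is a minimal sketch of the federated search pattern in Python. It is an illustration of the idea only, not how Explorit Everywhere! is implemented; the two stub connectors, their URLs and their result fields are all hypothetical:

    import concurrent.futures

    # Two stub connectors standing in for real, source-specific ones; a
    # real connector would translate the query into the source's native
    # search syntax and parse the results it returns.
    def search_source_a(query):
        return [{"url": "https://example.org/a1", "title": "A: " + query, "score": 0.9}]

    def search_source_b(query):
        return [{"url": "https://example.org/a1", "title": "B: " + query, "score": 0.7}]

    def federated_search(query, connectors):
        """Search all sources in parallel, then aggregate, de-duplicate and rank."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda connector: connector(query), connectors))
        best = {}
        for results in result_lists:
            for r in results:
                # De-duplicate on URL, keeping the higher-scored copy.
                if r["url"] not in best or r["score"] > best[r["url"]]["score"]:
                    best[r["url"]] = r
        # Rank the merged results by relevance score.
        return sorted(best.values(), key=lambda r: r["score"], reverse=True)

    print(federated_search("deep web", [search_source_a, search_source_b]))

The essential moves are all here: the query goes out to every source at once, the result lists come back and are merged, duplicates are collapsed, and a single ranked list is returned to the user.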

Deep Web Technologies works with academic libraries, corporate clients, government agencies and other research-intensive organizations to build solutions that provide a single point of access to all the information sources that are important to these organizations and their users. These include public sources, subscription sources, and sources internal to the organization.

We have developed a number of major public portals, such as Science.gov, WorldWideScience.org, the U.S. National Library of Energy and Biznar.com, that serve as good public demonstrations of searching the Deep Web.

I would like to conclude this article with a brief discussion of our vision for a future where access to the gems in the Deep Web is as easy as doing a Google search today.

Our grand vision centres on building a comprehensive Catalogue of ALL (used loosely) the quality information sources in the world, independent of language. I envision thousands of contributors, working much as Wikipedia’s and DMOZ’s do, identifying, describing and rating sources for the Catalogue. We will also need thousands of contributors to create the connectors needed to search the information sources available via the Catalogue.
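To give a feel for what a Catalogue record might carry, here is a hypothetical sketch in Python; no schema for the Catalogue exists yet, so every field name below is illustrative only:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class CatalogueEntry:
        """One hypothetical Catalogue record; all fields are illustrative."""
        name: str            # e.g. "PubMed"
        url: str             # the source's search entry point
        language: str        # primary language of the content
        subjects: List[str]  # subject areas covered
        rating: float        # community-contributed quality rating
        connector: Callable  # searches the source, as sketched earlier

The pairing of descriptive fields with a connector is the key design idea: contributors who cannot write code can still identify, describe and rate sources, while others supply the connectors that make those sources searchable.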

An early version of this vision is laid out in a presentation I gave in June 2009 at the Special Libraries Association Annual Conference, Science Research: Journey to 10,000 Sources, accompanied by an article available here.

Refer 32(2) Summer 2016
