Abe Lederman, Deep Web Technologies
The Web is divided into three layers: the Surface Web, the Deep Web and the Dark Web. The Surface Web consists of several billion web sites, different subsets of which are crawled by search engines such as Google, Yahoo and Bing. The next layer, the Deep Web, consists of millions of databases or information sources – public, subscription-based or internal to an organization. Deep Web content usually sits behind a paywall, requires a password, or is generated dynamically when a user enters a query into a search box (e.g. Netflix), and thus is not accessible to the Surface Web search engines. Deep Web content is valuable because, for the most part, it is of higher quality than the Surface Web. The bottom-most layer, called the Dark Web, gained a lot of notoriety in October 2013 when the FBI shut down the Silk Road website, an eBay-style marketplace for selling illegal drugs, stolen credit cards and other nefarious items. The Dark Web guarantees anonymity and thus is also used to conduct political dissent without fear of repercussion. Accessing the gems that can be found in the Deep Web is the focus of this article.
Michael Bergman coined the term “Deep Web” in a seminal white paper published in August 2001, The Deep Web: Surfacing Hidden Value. The Deep Web is also known as the Hidden Web or the Invisible Web. According to a study conducted in 2000 by Bergman and colleagues, the Deep Web was 400-550 times larger than the Surface Web, consisting of 200,000 websites, 550 billion documents and 7,500 terabytes of information. Every few years, while writing an article on the Deep Web, I search for current information on its size and am unable to find anything new and authoritative. Many articles that I come across still refer, as this article does, to Bergman’s 2001 white paper.
Many users may not be familiar with the concept of the Deep Web. However, if they have searched the U.S. National Library of Medicine PubMed database, searched subscription databases from EBSCO or Elsevier, searched the website of a newspaper such as the Financial Times, or purchased a train ticket online, then they have been to the Deep Web.
If you are curious about what’s in the Deep Web and how to find some good stuff, here are some places you can go to do some deep web diving:
20 Ways to Search the Invisible Web – Although a bit commercial, this site by About.com has a good resources list, including #6 (Science.gov), developed by Deep Web Technologies. Explore the site and follow links to some interesting web pages.

DMOZ – The largest, most comprehensive human-edited directory of the Web. DMOZ, which started out as the Open Directory Project and is now owned by AOL, Inc. (originally known as America Online), is a Wikipedia for web sites: 90,000 volunteer editors have categorized 4,000,000 web sites into 1,000,000 categories. Fun to explore, although many of the links are to surface web sites rather than deep web resources.

Library of Congress E-Resources Online Catalog – 1,400 resources cataloged in 30+ subject areas, including hundreds of resources that are freely accessible.

ResourceShelf – Although this site is no longer active (it stopped posting new reviews of resources in February 2016), it had a great 15-year run. Gary Price and staff published 26,773 items, which are still accessible via the site’s archives or via a site:resourceshelf.com search.

Virtual Private Library – Includes thousands of curated resources across 50+ subject areas, including Business Intelligence, Genealogy, Healthcare and the World Wide Web itself. The site was developed by Marcus Zillman and uses Subject Tracer Bots (trademarked by VPL) that continuously search and monitor the Web for resources to add to these subject area guides.
Another good source of deep web databases, although many are not publicly available, is the database lists available through the libraries of most academic institutions, including these sites: Harvard University, MIT, Oxford University, Princeton University, Stanford University and the University of Cambridge.
One trick I’d like to share with you for finding interesting resources in the Deep Web is to leverage the 430,000 (and growing) subject guides created by librarians, and people who are librarians at heart, using the LibGuides service provided by Springshare (a ProQuest company). So, continuing on Helen Edwards’s theme in Dogs Revisited: Information for Dog Owners, I went to Google and entered the following search:
Google (the U.S. version) quickly returned 3,160 subject guides ranked by popularity (i.e. showing first the subject guides on dogs that others found useful and linked to). Included in the top results are guides on service dogs and working dogs. As you might expect, there are also a number of guides on dog laws, such as these from Illinois and Louisiana. I also found an interesting and comprehensive guide on Dog Ownership, Nutrition and Care, created as a test guide by the folks at Springshare.
Of course, I would be remiss if I didn’t talk about what my company, which, after all, is named Deep Web Technologies, does and how it can help users find gems in the Deep Web.
Deep Web Technologies develops state-of-the-art solutions using Explorit Everywhere!™ federated search technology (see definition below), which provides one-stop access to as many as hundreds of deep web sources at once.
Federated Search is an application or service that allows users to submit a real-time search in parallel to multiple, distributed information sources and retrieve aggregated, ranked and de-duplicated results.
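To make the definition above concrete, here is a minimal sketch of the federated search pattern in Python: query several sources in parallel, then aggregate, de-duplicate and rank the combined results. The source names, mock results and scoring scheme are illustrative assumptions for this sketch, not a description of how Explorit Everywhere! itself works.

```python
# Sketch of federated search: parallel querying of multiple sources,
# followed by aggregation, de-duplication and ranking.
# All sources and scores here are mocked for illustration.
from concurrent.futures import ThreadPoolExecutor

# Mock "deep web" sources: each maps a query to (title, relevance) hits.
SOURCES = {
    "source_a": lambda q: [("Solar Energy Basics", 0.9), ("Wind Power", 0.6)],
    "source_b": lambda q: [("Solar energy basics", 0.8), ("Geothermal Heat", 0.7)],
}

def federated_search(query):
    # 1. Submit the search to all sources in parallel (real-time).
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, query) for fn in SOURCES.values()]
        hits = [hit for f in futures for hit in f.result()]
    # 2. De-duplicate on a normalized title, keeping the best-scoring copy.
    best = {}
    for title, score in hits:
        key = title.lower()
        if key not in best or score > best[key][1]:
            best[key] = (title, score)
    # 3. Rank the merged list by relevance, highest first.
    return sorted(best.values(), key=lambda r: r[1], reverse=True)

results = federated_search("renewable energy")
# "Solar Energy Basics" appears once even though two sources returned it.
```

A production system would replace the mock lambdas with connectors that translate the query into each source's native search syntax, but the aggregate–de-duplicate–rank pipeline is the essence of the definition above.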
Deep Web Technologies works with academic libraries, corporate clients, government agencies and other research-intensive organizations to build solutions that provide a single point of access to all the information sources that are important to these organizations and their users. These include public sources, subscription sources, and sources internal to the organization.
We have developed a number of major public portals such as Science.gov, WorldWideScience.org, the U.S. National Library of Energy and Biznar.com that provide good public demonstrations of sites that access the Deep Web.
I would like to conclude this article with a brief discussion of our vision for a future where access to the gems in the Deep Web is as easy as doing a Google search today.
Our grand vision centres on building a comprehensive Catalogue of ALL (used loosely) the quality information sources in the world, independent of language. I envision thousands of contributors, working much as Wikipedia and DMOZ contributors do, identifying, describing and rating sources for the Catalogue. We will also need thousands of contributors to create the connectors needed to search the information sources available via the Catalogue.
An early version of this vision is laid out in a presentation I gave in June 2009 at the Special Libraries Association Annual Conference, entitled Science Research: Journey to 10,000 Sources, and in an accompanying article.
Refer 32(2) Summer 2016