RSNA.org
Katarzyna J. Macura, M.D., Ph.D.
The Russell H. Morgan Department of Radiology and Radiological Science •  Johns Hopkins Medical Institutions

Part 6 — Searching the WWW
by Katarzyna J. Macura, M.D., Ph.D.

Finding Web documents can be easy or seem impossibly difficult. This is due, in part, to the sheer size of the World Wide Web (WWW), currently estimated to contain several billion documents. It is also because the WWW is not indexed in any standard vocabulary, unlike a library's catalog, which assigns accepted, standardized subject descriptors to its documents. To make WWW searches more efficient, special search tools have been created. When a search tool is used, the user does not search the Web directly; a direct search among billions of Web pages residing on computers all over the world would take a very long time. Instead, the user gets access to intermediate databases, which were created to organize the Web pages registered with the search tool. Those databases are part of the search tools and they provide users with URL links to the actual Web pages.

There are three types of search tools: search engines, meta-search engines, and subject directories/guides.

 

Search Engines

Search engines are programs written to query and retrieve information stored in a database containing selections of Web pages. Databases are compiled by spiders, also called crawlers or robot programs, which automatically gather information from all over the Web. Spiders work around the clock: they visit the Web, scan pages on the fly, use hypertext links on each page to discover and read a site's other pages, and download Web documents into the search engine server. This way spiders keep the search engine database up to date. They obtain new pages, update already catalogued pages and delete broken links. There is minimal human oversight. It can take six months for spiders to cover the Web, so the database can lag behind the live Web. The search engines are like online "librarians" tracking the locations of Web pages. When queried, they retrieve records from their own archives, so a listed page may no longer exist on the Web or may have moved.
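The way a spider follows hypertext links to build its database can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not how any particular search engine works: the three-page TOY_WEB "site", the crawl function and the fetch parameter are hypothetical, and pages are read from memory rather than over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: return {url: html} for every page reached."""
    database = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # broken link: nothing to store
            continue
        database[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


# Hypothetical three-page site held in memory instead of fetched over HTTP.
TOY_WEB = {
    "/home": '<a href="/about">About</a> <a href="/gone">Old</a>',
    "/about": '<a href="/home">Home</a> <a href="/staff">Staff</a>',
    "/staff": "No links here.",
}

pages = crawl("/home", TOY_WEB.get)
print(sorted(pages))  # ['/about', '/home', '/staff']
```

The dead link /gone is discovered but never stored, which mirrors how a spider drops broken links from the database.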

Examples of search engines include Infoseek, AltaVista, Google, Ask Jeeves, Lycos and HotBot.

 


Meta-Search Engines

Meta-search engines are very quick search tools that superficially search several individual search engines. In a meta-search engine, a user submits keywords that are applied simultaneously to several individual search engines and their databases of Web pages. Users get back results from all the search engines queried. Aggregate results are based on the "vote" of the individual sites. Meta-search engines do not own a database of Web pages. They send search terms to the databases maintained for other search engines. Because they catch only about 10 percent of search results in any of the search engines they visit, they are called "quick and dirty." Meta-search engines can be downloaded for free and can be customized to search selected search engines with complex search features.
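One plausible reading of the "vote" is that a page returned by more engines ranks higher. The Python sketch below merges result lists on that assumption; the engine names, the URLs and the tie-breaking rule (best position in any list) are hypothetical, and a real meta-search tool would query each engine over the network.

```python
from collections import defaultdict


def merge_by_vote(results_per_engine):
    """Rank URLs by how many engines returned them, then by best position."""
    votes = defaultdict(int)
    best_rank = {}
    for ranked_urls in results_per_engine.values():
        for position, url in enumerate(ranked_urls):
            votes[url] += 1
            best_rank[url] = min(position, best_rank.get(url, position))
    return sorted(votes, key=lambda url: (-votes[url], best_rank[url], url))


# Hypothetical result lists, one per engine, each already in ranked order.
results = {
    "engine_a": ["x.org", "y.org", "z.org"],
    "engine_b": ["y.org", "x.org"],
    "engine_c": ["y.org", "w.org"],
}
merged = merge_by_vote(results)
print(merged)  # ['y.org', 'x.org', 'w.org', 'z.org']
```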

Examples of meta-search engines include Copernic, Ixquick, ProFusion and Surfwax. A growing number of meta-search engines are becoming portals, such as Excite. Portals are sites that offer searching and links to thematic resources in addition to many other services such as stock quotes, airline tickets, shopping malls, news links, games, chat rooms, free e-mail and much more.

 

Directories/Guides

A Web directory (also called a catalog or guide) is a Web site and another tool for locating information on the WWW. Subject directories are collections of hand-selected sites organized into hierarchical subject categories and compiled by professional or volunteer editors, subject specialists, agencies, associations or hobbyists. Unlike search engines, Web directories contain organized lists of topics and subtopics leading to the sources of information. Directories provide categorized lists of Web sites with brief descriptions. The user can move from menu to menu, making one selection after another, until he/she gets to the level where sites of interest are listed. Directories are organized by general category, which suits a user who is looking for a general category of things.

Examples of Web directories include Yahoo!, Google's Directory and university libraries, which maintain their own subject directory.

The main difference between a search engine and a directory is that the search engine indexes all the information on all the Web pages it finds, whereas a directory categorizes Web sites and contains very little information about them (just the description). The search engine indexes are generated automatically, based on the words and phrases that are found on the Web pages. There is no human judgment filtering the information. The subject directories are created from human input; therefore directories contain far fewer sites. Also, since directories are updated manually, which is very time consuming, some old sites that are no longer valid (dead links) are listed long after their demise. A search engine takes the user to the exact Web page on which the words and phrases the user is looking for appear. A directory takes users to the home page of a Web site, and from there they can browse. The search engine should be used when the aim is to get to a particular piece of information quickly, when the user has limited time. A directory is very helpful to users who have only a vague idea of what they want and would like guidance. A directory functions like the "yellow pages": the user knows what he/she is looking for, but the exact name is unknown. The search engine becomes very helpful whenever the user knows the exact name.
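The contrast between the two structures can be made concrete with a small Python sketch. It is a simplified illustration only: the page URLs, page texts, categories and descriptions are all hypothetical. The search engine side builds a word-to-page index automatically, while the directory side is a hand-written hierarchy of categories holding whole sites and short descriptions.

```python
import re
from collections import defaultdict

# Hypothetical pages: URL -> page text.
PAGES = {
    "clinic.example/mri": "MRI safety guidelines for radiology staff",
    "clinic.example/ct": "CT dose guidelines",
    "society.example/home": "Society news and radiology careers",
}

# Search engine side: every word on every page is indexed automatically.
index = defaultdict(set)
for url, text in PAGES.items():
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(url)

# Directory side: an editor files whole sites under subject categories
# and stores only a short description of each.
DIRECTORY = {
    "Medicine": {
        "Radiology": [
            ("clinic.example", "Imaging clinic site"),
            ("society.example", "Professional society site"),
        ],
    },
}

# A keyword lookup leads to the exact pages containing the word.
print(sorted(index["guidelines"]))
# ['clinic.example/ct', 'clinic.example/mri']

# Browsing the directory leads to site home pages under a category.
print([site for site, _ in DIRECTORY["Medicine"]["Radiology"]])
# ['clinic.example', 'society.example']
```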

Many search services use both schemes in a hybrid combination. These services send out a spider to collect Web sites, alongside people cataloging sites submitted by developers. Examples include Infoseek, Excite and Google.

 

Search Engine Strategies

There are two search strategies that can be used to find relevant Web documents. The first is a simple keyword search, where the user enters one or more keywords separated by spaces in the search box. In this type of search, the user accepts the system's defaults and may be overwhelmed by too many off-target results, especially when searching large databases. But when searching small and specialized databases, this is the best strategy. The small size of the databases makes more complex searching unnecessary, and a complex query may even exclude many relevant documents. The second type of search strategy is the advanced search. The advanced search techniques include phrase searching, truncation, Boolean logic, grouping terms, sub-searching, and field searching.

Phrase searching requires that the terms entered in the search box appear in exactly the same order in the documents retrieved. To perform a phrase search, for example, for a proper name, the name of an organization or movement, or a distinct phrase, the phrase has to be enclosed in double quotation marks " ". Truncation can be used when the user is looking for terms with many possible endings. It permits retrieving all these variations with one search term (femini*, for feminism, feministic, feminine, etc.). Some systems search word-ending variants automatically. Boolean logic can be applied as a way to combine terms using "operators" such as AND, OR, AND NOT and sometimes NEAR (within 10 words). For example, AND forces all the terms to be present in all documents retrieved. OR retrieves records with either term, which helps if there are synonyms or spelling variations (women OR females). AND NOT excludes terms; it helps if the user anticipates a lot of search results with terms he/she does not want. For example, a search for biomedical engineering AND cancer will bring a long list of academic programs. If the user wants only their research reports, the documents containing the words department of or school of can be excluded with the query "biomedical engineering" AND cancer AND NOT "department of" AND NOT "school of". Grouping terms is possible with the use of parentheses ( ). As in algebra, what appears inside the parentheses is processed first during searches. Some search engines permit sub-searching, which is searching within the results, allowing for subsequent narrowing of the list of hits. Field searching can be used to search within specific parts of Web pages, designated as titles, authors, etc.

For advanced searches, the AltaVista search engine works very well. It has great coverage, claims hundreds of millions of sites and allows using advanced query features such as Boolean logic, sub-searching and truncation. It also allows limiting the number of documents by date and translates to and from foreign languages.

Advanced Search Techniques

PHRASE SEARCHING: Requires terms entered in search box to appear in exactly the same order in documents retrieved. Phrase has to be enclosed in double quotations “ ”.
TRUNCATION: Used when looking for terms with many possible endings. For example, typing in: femini* returns hits for feminism, feministic, feminine, etc.
BOOLEAN LOGIC: A way to combine terms using “operators” such as AND, OR, AND NOT and sometimes NEAR (within 10 words).
GROUPING TERMS: Search terms are grouped with the use of parentheses ( ). What appears inside the parentheses is processed first during searches.
SUB-SEARCHING: Searching within the results of a previous search, allowing for subsequent narrowing of the list of hits.
FIELD SEARCHING: Used to search within specific parts of Web pages, designated as titles, authors, etc.
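
Several of these techniques map naturally onto set operations, as the following Python sketch suggests. It is a toy model, not the query syntax of any real search engine: the four-document DOCS collection and the helper functions term, phrase and truncated are hypothetical. Python's & stands in for AND, | for OR, - for AND NOT, and ordinary parentheses group terms.

```python
import re

# Hypothetical mini-database: document id -> text.
DOCS = {
    1: "Biomedical engineering research on cancer imaging",
    2: "Department of Biomedical Engineering cancer program",
    3: "Feminism and feminist theory",
    4: "Women in engineering",
}


def words(text):
    """Lowercase word list for a piece of text."""
    return re.findall(r"[a-z]+", text.lower())


def term(word):
    """Documents containing the exact word."""
    return {d for d, text in DOCS.items() if word.lower() in words(text)}


def phrase(text):
    """Documents containing the words of `text` adjacent and in order."""
    wanted = words(text)
    n = len(wanted)
    hits = set()
    for d, doc_text in DOCS.items():
        w = words(doc_text)
        if any(w[i:i + n] == wanted for i in range(len(w) - n + 1)):
            hits.add(d)
    return hits


def truncated(stem):
    """Documents with any word starting with `stem` (femini* style)."""
    stem = stem.lower().rstrip("*")
    return {d for d, text in DOCS.items()
            if any(w.startswith(stem) for w in words(text))}


# "biomedical engineering" AND cancer AND NOT "department of"
result = (phrase("biomedical engineering") & term("cancer")) - phrase("department of")
print(sorted(result))                           # [1]
print(sorted(truncated("femini*")))             # [3]
print(sorted(term("women") | term("females")))  # [4]
```

Sub-searching corresponds to intersecting an earlier result set with a new condition, for example result & term("imaging").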
 

Image Databases

During the last few years we have witnessed an explosion in the use of image databases, including image archives available over the Internet. As image databases grow larger, the traditional methods of retrieving images of interest from such large collections are no longer sufficient. Keywords are still the most common technique used to provide information about the content of an image database. However, to describe the images to a satisfying degree of detail, sophisticated keyword systems are needed. One serious drawback of the keyword approach is the need for trained personnel, not only to attach keywords to each image, which is time consuming, but also to retrieve images by selecting the keywords that best describe the image content, which requires knowledge of the index used to catalog the images.

The newest image retrieval techniques are focusing on automated content-based image retrieval (CBIR). The basic principle underlying all current CBIR techniques is the use of image analysis algorithms to automatically extract a number of image attributes at the time of image archive implementation. These attributes may include numerical values, histograms, color, texture, shape, small index pictures, etc. Due to the complexity of the image information, the user cannot expect exact matches between the query and retrieved images from the database. Instead, the similarity between the user's query and the images in the archive is assessed with a similarity measure: the system finds the nearest neighbors of the query among the images in the database with respect to a suitable pre-defined metric. Image queries can be divided into three levels of abstraction: primitive features such as color, texture or shape; logical features such as the identity of objects shown; and abstract attributes such as the significance of the scenes depicted. While current CBIR systems operate effectively only at the lowest, primitive-feature level, most users demand higher levels of retrieval. After a decade of intensive research, CBIR technology is now beginning to move out of the laboratory into the marketplace, in the form of commercial products like IBM's QBIC or Virage's VIR Image Engine.
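The principle of extracting attributes in advance and then searching for nearest neighbors can be sketched in a few lines of Python. This is a minimal illustration at the primitive-feature level, not the method of QBIC or any other product: the choice of a coarse color histogram as the attribute, Euclidean distance as the metric, and the tiny IMAGES collection of pixel lists are all assumptions made for the example.

```python
import math


def color_histogram(pixels, bins=4):
    """Normalized histogram of (r, g, b) pixels, `bins` levels per channel."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [count / total for count in hist]


def distance(h1, h2):
    """Euclidean distance between two histograms (the assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


def nearest(query_pixels, archive, k=2):
    """Names of the k archive images whose histograms are closest to the query."""
    q = color_histogram(query_pixels)
    ranked = sorted(archive, key=lambda name: distance(q, archive[name]))
    return ranked[:k]


# Hypothetical ten-pixel "images"; real images would be decoded from files.
IMAGES = {
    "red_scan": [(250, 10, 10)] * 8 + [(10, 10, 10)] * 2,
    "dark_scan": [(10, 10, 10)] * 9 + [(250, 10, 10)] * 1,
    "blue_scan": [(10, 10, 250)] * 10,
}

# Attributes are extracted once, when the archive is built.
archive = {name: color_histogram(px) for name, px in IMAGES.items()}

query = [(240, 20, 20)] * 7 + [(5, 5, 5)] * 3
print(nearest(query, archive))  # ['red_scan', 'dark_scan']
```

No image matches the query exactly; the result is simply the list of closest images under the chosen metric.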

Demonstration versions of numerous experimental systems can be viewed on the Web, including QBIC (this system allows searches by visual image content such as color percentages, color layout and textures), MIT's Photobook (a fully automatic system for detection, recognition and model-based coding of faces for potential applications such as video telephony and automatic face recognition), Columbia University's WebSEEk (a content-based image and video search and catalog tool for the Web), and Viper (a system for visual information processing for enhanced retrieval) from the University of Geneva. Also, AltaVista, Yahoo! and Google search tools now have image retrieval facilities. The image search engines, such as Google Image and AltaVista Image, allow finding images on the Web based on keywords or phrases.

 


Video Images

Video sequences are an increasingly important form of image data for many users and pose special challenges to those responsible for their storage and retrieval, both because of their additional complexity and their sheer volume. Video images contain a wider range of primitive data types (the most obvious being motion vectors), occupy far more storage, and can take hours to review, while the comparable process for still images takes seconds at most. All but the shortest videos are made up of a number of distinct scenes, each of which can be further broken down into individual shots depicting a single view, conversation or action. A common way of organizing a video for retrieval is to prepare a storyboard of annotated still images (often known as key frames) representing each scene. Another is to prepare a series of short video clips, each capturing the essential details of a single sequence, a process sometimes described as video skimming. Carnegie Mellon University's Informedia project has pioneered new approaches for automated video and audio indexing, navigation, search and retrieval. The Informedia approach uses combined speech, language and image understanding technology to automatically transcribe, segment and index the video. The same tools are applied to accomplish intelligent search and selective video retrieval.
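One simple way to pick key frames automatically is to look for abrupt changes between consecutive frames and keep the first frame of each shot. The Python sketch below works on that assumption; it is not the Informedia method, and the key_frames function, the threshold value and the four-pixel grayscale "frames" are hypothetical.

```python
def key_frames(frames, threshold=50.0):
    """Return the indices of frames that start a new shot.

    `frames` is a list of equal-length lists of grayscale pixel values.
    A new shot is assumed whenever the mean absolute pixel difference
    from the previous frame exceeds `threshold`.
    """
    if not frames:
        return []
    keys = [0]  # the first frame always opens the first shot
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            keys.append(i)
    return keys


# Hypothetical five-frame video with two abrupt cuts.
video = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 205, 190],
    [198, 212, 204, 191],
    [20, 25, 22, 18],
]
print(key_frames(video))  # [0, 2, 4]
```

The selected frames could then be annotated to form the storyboard described above.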

No one service catalogs the whole Web. Each service logs parts of it and a certain overlap exists. It is estimated that each of the search engines provides about 40 percent unique content and there is an overlap of about 60 percent among search engines. Therefore, to obtain broad coverage it is recommended to try more than one search tool.

Services also differ in the way they rank hits. For instance, some advertisers pay for their sites to be listed on some services, so their sites get priority listing and appear in the results even if they have nothing to do with what the user is seeking. Some search engines, such as Google, use ranking algorithms that are based on how many other sites link to a particular site. Google adds weight to frequent citations. The popularity ranking is based on the assumption that other pages would create a link to the "best" pages. This type of ranking usually works very well, returning quality documents. Typical ranking algorithms used by search tools involve the location and frequency of keywords on a Web page, the location/frequency method. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the query. Search engines also check if the search keywords appear near the top of a Web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning. Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a Web page. Pages with a higher keyword frequency are often deemed more relevant than other Web pages.
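The location/frequency method can be sketched as a scoring function in Python. This is a rough illustration only: the relevance function, the weights 3.0 and 2.0, the 20-word definition of "near the top" and the three sample pages are assumptions, and real search engines do not publish their formulas.

```python
import re


def words(text):
    """Lowercase word list for a piece of text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(query, title, body, lead_words=20):
    """Score a page by keyword location (title, top of page) and frequency.

    The weights 3.0 and 2.0 are arbitrary values chosen for illustration.
    """
    terms = set(words(query))
    title_words = words(title)
    body_words = words(body)
    score = 0.0
    if terms & set(title_words):              # keyword in the title tag
        score += 3.0
    if terms & set(body_words[:lead_words]):  # keyword near the top
        score += 2.0
    if body_words:                            # keyword frequency in the body
        hits = sum(1 for w in body_words if w in terms)
        score += hits / len(body_words)
    return score


# Hypothetical pages: URL -> (title, body text).
PAGES = {
    "a.example": ("MRI safety", "MRI safety rules. MRI rooms need screening."),
    "b.example": ("Hospital news", "The cafeteria reopened. An MRI unit was installed."),
    "c.example": ("Parking", "Visitor parking is free on weekends."),
}
ranked = sorted(PAGES, key=lambda url: relevance("MRI", *PAGES[url]), reverse=True)
print(ranked)  # ['a.example', 'b.example', 'c.example']
```

A link-popularity ranking would add a further term based on how many other pages link to each URL.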


Conclusion

There are hundreds of search tools on the Web. Try a variety of search services with your favorite keywords and you will see how radically different the hits are with each search engine or directory. No one service is perfect, so use as many as you have time for. Using many search engines will help you get a feel for how the different kinds of services work. You will soon find a favorite search tool that will allow you to get to all the information you need quickly and painlessly.


www.aawr.org
Editor's Note: The original Mini-Tutorial on the Internet by Katarzyna J. Macura, M.D., Ph.D., was published in the AAWR Newsletter Focus. Dr. Macura updated her series for RSNA News.

Copyright © 2008 Radiological Society of North America, Inc., 820 Jorie Blvd, Oak Brook, IL 60523-2251