Background of the Study
Introduction
Lycos was one of the earliest search engines, first developed in 1994 by Dr. Michael L. Mauldin and a team of researchers at the Carnegie Mellon University Center for Machine Translation. The name Lycos comes from the Latin name for the wolf spider, a hunter that actively stalks its prey.
Lycos is one of the original and most widely known Internet brands in the world, having evolved from one of the first search engines on the web into a comprehensive digital media destination for consumers worldwide. Lycos has been a pioneer in intelligent spidering search technology, combining its proprietary technology with other best-in-class services to provide a simple yet powerful internet experience to its users and clients.
When Lycos was first launched on July 20, 1994, it had a catalog of 54,000 documents, but less than a month later its crawler had recorded more than 390,000 documents in its index. By the start of 1995, Lycos had indexed 1.5 million documents, and the index grew rapidly to over 60 million documents by the end of 1996, making it the largest search engine at the time.
In June 1995, Carnegie Mellon licensed the Lycos technology to a newly created company founded by Mauldin and jointly backed by Carnegie Mellon and CMGI, and the company went public the following year. In 1998 the company acquired Wired Digital, the owner of the HotBot search engine.
Significance of the Study
Search engine indexing is like creating a massive catalog or index of all the content available on the internet.
It involves systematically scanning, analyzing, and organizing web pages, documents, images, videos, and all other types of content so that search engines can quickly retrieve relevant information in response to user queries.
The process of search engine indexing involves the following stages:
- Crawling: Search engine crawlers, also known as spiders or bots, systematically navigate the web, visiting web pages and following links to discover new content.
- Indexing: This stage involves processing and analyzing textual content, as well as key tags and attributes such as <title> and the alt attributes of images, videos, and other media. The extracted information is then stored in a structured index database, which allows the search engine to quickly retrieve and serve relevant content in response to user queries.
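The indexing stage described above can be illustrated with a minimal sketch of an inverted index, which maps each term to the pages that contain it. The page data, the `tokenize` and `build_index` names, and the choice to index only a title and a body field are assumptions made for illustration; they do not describe how any particular search engine stores its index.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each term to the set of URLs whose title or body contains it."""
    index = defaultdict(set)
    for url, fields in pages.items():
        for field in ("title", "body"):
            for term in tokenize(fields.get(field, "")):
                index[term].add(url)
    return index

# Hypothetical crawled pages (illustrative data, not real records).
pages = {
    "http://example.com/a": {
        "title": "Wolf spiders",
        "body": "Wolf spiders hunt on the ground.",
    },
    "http://example.com/b": {
        "title": "Web spiders",
        "body": "Crawlers follow links to find pages.",
    },
}

index = build_index(pages)
print(sorted(index["spiders"]))
```

Looking up a term is then a single dictionary access, which is why an index makes retrieval fast compared with scanning every page at query time.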
Importance of Search Engine Indexing
Indexing is important for search engines because it allows them to process and retrieve information from the internet efficiently.
Without indexing, search engines would struggle to deliver accurate and timely results to users.
With an index, a search engine can quickly locate relevant information among billions of web pages.
Crawling
Crawling, in the context of search engines, refers to the process of systematically browsing the web to discover and retrieve web pages and other online content.
Lycos provides a small query box into which search keywords can be typed. The Search Options menu lets the user choose the AND and OR Boolean operators; the NOT operator can be applied by placing a dash (-) in front of a keyword. Various levels of matching can also be chosen: a loose match on the keyword computer would also retrieve the plural computers. The user can also determine the format of the results, so that Standard Results give a title and brief abstract of a document, whereas Detailed Results give information on indexing keywords and date of publication. Help is provided through a hypertext link to Search Language Help. Once a search has been constructed, it can be initiated by clicking the Search button.
Lycos sorts search hits using a feature known as relevancy and adjacency ranking. Documents that contain a high incidence of the chosen keywords, and in which those keywords appear close together, score higher than those that do not and so appear higher in the list of hits. Alta Vista, by contrast, performs relevancy ranking only if the user specifically asks for it by specifying keywords in the Results Ranking Criteria box.
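The idea behind relevancy and adjacency ranking can be sketched as a toy scoring function: one point per keyword occurrence, plus a bonus whenever two different keywords appear within a few words of each other. The weights, the `window` size, and the `score` function itself are assumptions for illustration only; the actual Lycos formula is not given here.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(document, keywords, window=3):
    """Toy score: keyword occurrences plus a bonus for nearby keyword pairs.

    The equal weighting and the `window` size are illustrative assumptions.
    """
    tokens = tokenize(document)
    wanted = {k.lower() for k in keywords}
    positions = [(i, t) for i, t in enumerate(tokens) if t in wanted]
    frequency = len(positions)
    adjacency = 0
    # Compare each keyword hit with the next one in the document.
    for (i, a), (j, b) in zip(positions, positions[1:]):
        if a != b and j - i <= window:
            adjacency += 1
    return frequency + adjacency

# Hypothetical documents used only to demonstrate the ordering.
docs = {
    "d1": "search engine design and search engine history",
    "d2": "a search for a lost steam engine",
    "d3": "gardening tips",
}
query = ["search", "engine"]
ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranked)
```

Both d1 and d2 contain the two keywords, but d1 ranks first because the keywords occur more often and sit next to each other.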
Lycos Retriever is a patent-pending information fusion engine. Unlike a search engine, which returns ranked document links in response to a query, Lycos Retriever categorizes and disambiguates topics, collects documents on the Web relevant to the disambiguated sense of each topic, extracts paragraphs and images from these documents, and arranges them into a coherent summary report or background briefing on the topic, at something like the level of the first draft of a Wikipedia article. These topical pages are then arranged into a browsable hierarchy that allows users to find related topics by browsing as well as searching.
Document Retrieval
After a topic was categorized and disambiguated, the disambiguated topic was used to identify up to 1000 documents from Lycos’ search provider. For ambiguous topics, various terms were added as optional "boost" terms, while terms from other senses of the ambiguous topic categories were prohibited. Other query optimization techniques were used to obtain the most focused document set, with non-English and obscene pages filtered out.
By the end of January 1996, Lycos had indexed over 95% of Web resources (ca. 19 million unique URLs, including FTP and Gopher), making it the largest Web search engine in its family. Nevertheless, it does not index the full text of a Web page. Rather, it extracts only the title and a portion of a document (e.g., the smaller of the first 20 lines or 20% of the document). This practice has been singled out by Lycos’ competitors as its most salient weakness. Around 50,000 documents are added, deleted, or updated in the Lycos index every day.
Lycos supports Boolean logic and incorporates it in such a way that users do not have to type the Boolean operators when conducting a search. For example, one only needs to select the search option “Match all terms (AND)” to use the AND operator. Lycos can also match query terms against Web documents at five different levels: Loose match, Fair match, Good match, Close match, and Strong match. However, no specific explanation is given as to how the different levels of match are determined. Truncation is performed automatically during a search, which may produce some unwanted results. Phrase search is not supported, so queries containing phrases cannot be executed appropriately.
On the other hand, Lycos implements a wide variety of display options. Users can choose to view 10, 20, 30, or 40 search results at a time. In addition, each search result can be displayed in the summary, standard, or detailed format. The detailed format corresponds to the long abstracts Lycos prepares, which include URL, title, outline, keys, abstract, description, date, and other related information. The summary format contains what Lycos’ short abstracts have: URL and descriptions. In terms of coverage, the standard format lies somewhere between the summary and detailed formats.
The online documentation available at Lycos’ Web site describes the composition of each output segment (e.g., outline and keys) in detail.
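The excerpt rule mentioned above (the smaller of the first 20 lines or 20% of the document) can be sketched as follows. Measuring the percentage in lines and rounding it up are assumptions, as are the `excerpt` function and its parameter names; the text does not say how Lycos measured document length.

```python
def excerpt(document, max_lines=20, percent=20):
    """Return the leading lines of a document.

    Keeps the smaller of `max_lines` lines or `percent` percent of the
    document's lines (rounded up, using integer arithmetic).
    """
    lines = document.splitlines()
    by_percent = (len(lines) * percent + 99) // 100  # integer ceiling
    keep = min(max_lines, by_percent)
    return lines[:keep]

# Hypothetical documents of different lengths.
short_doc = "\n".join(f"line {i}" for i in range(1, 51))    # 50 lines
long_doc = "\n".join(f"line {i}" for i in range(1, 201))    # 200 lines

print(len(excerpt(short_doc)))  # 20% of 50 lines is below the 20-line cap
print(len(excerpt(long_doc)))   # 20% of 200 lines exceeds the 20-line cap
```

For short documents the percentage limit applies, and for long documents the 20-line cap applies, so the stored portion never grows with document size beyond that cap.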
Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos’ main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents–more than any other Web search engine. In October 1994, Lycos ranked first on Netscape’s list of search engines by finding the most hits on the word “surf.”
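Prefix matching, mentioned above, can be sketched with a binary search over a sorted vocabulary: every term that begins with the query string sits in one contiguous run. The vocabulary and the `prefix_matches` function are hypothetical and only illustrate the general technique, not the Lycos implementation.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """Return every term in the sorted vocabulary that starts with `prefix`."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms sharing the prefix are contiguous in sorted order
        matches.append(term)
    return matches

# Hypothetical index vocabulary.
vocabulary = sorted(
    ["compute", "computer", "computers", "computing", "surf", "surfing"]
)
print(prefix_matches(vocabulary, "computer"))
```

A query for computer therefore also retrieves computers, which is convenient for plurals but can pull in unwanted terms when the prefix is short.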
Foraging
All Web spiders use essentially the same algorithm to locate documents on the Web:
1. Create a queue of pages to be explored, with at least one Web page in the queue.
2. Choose a page from the queue to explore.
3. Fetch the page chosen in Step 2 and extract all the links to other pages. Add any unexplored page links to the exploration queue.
4. Process the page fetched in Step 3 to extract information such as title, headers, keywords, or other information. Store this information in a database.
5. Return to Step 2 and repeat.
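The steps above can be sketched as a short loop. To keep the example runnable without network access, the "Web" here is a hypothetical in-memory dictionary, and `fetch` is a stand-in for a real HTTP request; the first-in, first-out choice of the next page is also an assumption, since the steps do not say how a page is chosen from the queue.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (title, outgoing links).
WEB = {
    "a": ("Page A", ["b", "c"]),
    "b": ("Page B", ["c"]),
    "c": ("Page C", ["a", "d"]),
    "d": ("Page D", []),
}

def fetch(url):
    """Stand-in for an HTTP fetch; returns (title, links) or None if missing."""
    return WEB.get(url)

def crawl(seed):
    """Explore pages reachable from `seed` and return {url: title}."""
    queue = deque([seed])       # Step 1: a queue holding at least one page
    seen = {seed}               # URLs already queued, to avoid revisits
    database = {}
    while queue:
        url = queue.popleft()   # Step 2: choose a page (oldest first here)
        page = fetch(url)       # Step 3: fetch the chosen page
        if page is None:
            continue
        title, links = page
        for link in links:      # Step 3: queue links not yet explored
            if link not in seen:
                seen.add(link)
                queue.append(link)
        database[url] = title   # Step 4: store the extracted information
    return database

print(list(crawl("a")))
```

Changing how the next page is taken from the queue (for example, by priority rather than arrival order) changes the exploration order without altering the rest of the loop.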
Lycos pioneered the use of automated abstracts. By using standard information-retrieval statistical methods, Lycos identifies the 100 most “weighty” terms (the 100 keywords most related to the document being indexed). Combining these weighty terms with the titles, header text, and an excerpt of the first 20 lines, or 10% of the document, Lycos creates an abstract that is about one-fourth the size of the original document. Lycos can display these abstracts along with the list of links during the retrieval process, allowing users to quickly determine which of the matched documents they wish to examine. Because the Lycos Catalog comprises abstracts and not copies of documents, it can also be licensed or sold to third parties, generating additional funding sources for the service.
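One common way to pick the "weighty" terms of a document is TF-IDF weighting, which favors words that are frequent in the document but rare across the collection. The sketch below assumes TF-IDF with a smoothed logarithm; the text says only that standard statistical methods were used, so the exact formula, the `weighty_terms` function, and the sample corpus are all illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def weighty_terms(document, corpus, k=100):
    """Return the k highest TF-IDF terms of `document` relative to `corpus`."""
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))  # count documents, not words
    n_docs = len(corpus)
    term_freq = Counter(tokenize(document))
    weights = {
        term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
        for term, count in term_freq.items()
    }
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]

# Hypothetical three-document collection.
corpus = [
    "the wolf spider hunts the cricket",
    "the web spider crawls the web",
    "the search engine ranks the pages",
]
print(weighty_terms(corpus[1], corpus, k=2))
```

The word "the" occurs in every document and receives zero weight, while "web" is both frequent in the second document and absent elsewhere, so it ranks first.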
In the original Lycos implementation (which was used with only minor revisions from May 1994 until February 1996), the spider was written in Perl and used an associative array to hold the queue. Perl allows such arrays to be kept on disk using the Unix DBM database manager, which affords very efficient access to very large queues. Lycos now uses a proprietary spider program written entirely in C.
Stockpiling
The information found by foraging must be stored on disk in some kind of database. Although the various services (including Lycos) closely guard the details of these databases, they share certain characteristics. They must
- allow efficient insertion of new documents (some are optimized for bulk insertion of many documents at once);
- allow efficient update of document records, when spiders revisit pages already described in the database (again, this step might be optimized for bulk update);
- allow for random read access to any particular document record (required during the retrieval phase); and
- be efficient in terms of disk space, because they must store tens of millions of documents.
One design decision is whether to store the exploration queue in the same data store as the stockpile of exploration results. The early Lycos system used a DBM database to store the exploration queue and a separate sequential file format to store the Lycos Catalog. This system used a sort/merge technique to update the catalog weekly with new Web page information and to replace updated records during the merge.
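A sort/merge update of a sequential catalog can be sketched as a single pass over two lists sorted by URL: the existing catalog and the batch of new or revisited pages. The record layout (URL, abstract) and the `merge_catalog` function are assumptions for illustration; the early Lycos file format is not described here.

```python
def merge_catalog(catalog, updates):
    """Merge two URL-sorted lists of (url, record) pairs.

    When the same URL appears in both lists, the record from `updates`
    replaces the one in `catalog`.
    """
    merged = []
    i = j = 0
    while i < len(catalog) and j < len(updates):
        old_url, new_url = catalog[i][0], updates[j][0]
        if old_url < new_url:
            merged.append(catalog[i])   # unchanged record
            i += 1
        elif old_url > new_url:
            merged.append(updates[j])   # newly discovered page
            j += 1
        else:
            merged.append(updates[j])   # updated record replaces the old one
            i += 1
            j += 1
    merged.extend(catalog[i:])
    merged.extend(updates[j:])
    return merged

# Hypothetical catalog and weekly batch, both sorted by URL.
catalog = [("a.com", "old A"), ("c.com", "old C")]
updates = sorted([("c.com", "new C"), ("b.com", "new B")])
print(merge_catalog(catalog, updates))
```

Because both inputs are read strictly in order, this approach suits sequential files: the whole catalog is rewritten once per batch, with no random writes.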
Retrieval
Once the spiders have collected their information and it has been safely stored in a database, the final step is to process queries from individual users and to return lists of links to matching documents.
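Query processing against a term index can be sketched with set operations. The example below assumes an index that maps each term to a set of URLs, an AND/OR mode chosen by the user, and a leading dash to exclude a term; the `search` function and the sample index are hypothetical and show only the matching step, not ranking.

```python
def search(index, query, mode="AND"):
    """Return the sorted URLs that match `query`.

    Terms prefixed with "-" are excluded; `mode` selects whether the
    remaining terms must all match ("AND") or any may match ("OR").
    """
    words = query.split()
    required = [w.lower() for w in words if not w.startswith("-")]
    excluded = [w[1:].lower() for w in words if w.startswith("-")]
    postings = [index.get(term, set()) for term in required]
    if not postings:
        return []
    if mode == "AND":
        hits = set.intersection(*postings)
    else:
        hits = set.union(*postings)
    for term in excluded:
        hits = hits - index.get(term, set())
    return sorted(hits)

# Hypothetical index: term -> set of document URLs.
index = {
    "wolf": {"u1"},
    "spider": {"u1", "u2"},
    "web": {"u2", "u3"},
}
print(search(index, "spider web"))
print(search(index, "spider web", mode="OR"))
print(search(index, "spider -wolf"))
```

A full retrieval step would then order these matches, for example by how often and how closely the query terms occur in each document.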