LibGuides: Search Technique Guide: How a Search Engine Works

The Basics

The basic functions of a search engine can be described as crawling, data mining, indexing, and query processing. Crawling is the act of sending out small programmed bots to collect information. Data mining is storing the information collected by the bots. Indexing is ordering that information systematically. Query processing is the mathematical process in which a person's query is compared to the index and the results are presented to that person.

Crawling and Data Mining

Crawling the Web is done by bots, also called spiders: small programs sent out from the central computer to collect data. A bot is programmed to start at one website and collect all of its information and links. The links are recorded, and that list becomes the order in which the bot continues its path of data collection. So a spider might start at smwc.edu, and the links to our sports conference and to the Higher Learning Commission become the next places it visits after processing everything under the smwc.edu domain. Once the spider is full, or after a set time, the bot returns and uploads the content of the webpages and all of the links to the central computer.
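The crawl order described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real spider is written: a small made-up dictionary stands in for the Web, and every address other than smwc.edu is hypothetical. It also simplifies by visiting links strictly in the order they were found, rather than finishing one domain first.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the links found on it.
# Only smwc.edu comes from the guide; the other addresses are made up.
TOY_WEB = {
    "smwc.edu": ["smwc.edu/library", "conference.example", "hlcommission.example"],
    "smwc.edu/library": ["smwc.edu"],
    "conference.example": ["smwc.edu"],
    "hlcommission.example": [],
}

def crawl(start, web, max_pages=100):
    """Visit pages in the order their links were discovered."""
    queue = deque([start])   # links waiting to be visited
    seen = {start}           # links already recorded, so the bot never loops
    visited = []             # pages in the order they were collected
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("smwc.edu", TOY_WEB))
```

The max_pages limit plays the role of the spider being "full": once it is reached, the bot stops and returns what it has.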

Data mining is the collection of all the data that the bot returned. Entire webpages, preserved in HTML, are stored on the servers of the search engine. The stored version is not the live webpage that you see when you enter the URL in your browser, but a historical copy called the cached version.

Bots can be told to return to a webpage often if its content changes often. A website like BBC News, for example, would request frequent return visits because its content changes so frequently.

Bots will not find everything on the web. If there are no links to a page, it is basically invisible to search engines. If a web page requires a password, or is generated as the result of a query, it will never be stored in a search engine. The webpages that will never be searched are called the deep web or the invisible web.

Indexing

Indexing is the process of recording EVERY word and character in a webpage, along with its location. The same concept is found in the back of a book, where major words are listed with the pages they occur on. The search engine version records where a word occurs within a page, for EVERY occurrence in EVERY website that has been crawled. Google's index, the largest known internet index, is kept in a storage system called Bigtable and is so large that it has to have indices to the indices; there are huge amounts of data present.
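A back-of-the-book index that also records word positions can be sketched as follows. This is a minimal illustration under simple assumptions: the two pages and their text are made up, and words are split on spaces only.

```python
from collections import defaultdict

# Hypothetical crawled pages; the text is made up for illustration.
PAGES = {
    "page1": "the library guide explains search engines",
    "page2": "search the library catalog",
}

def build_index(pages):
    """Map every word to the pages and word positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index

index = build_index(PAGES)
print(dict(index["library"]))  # {'page1': [1], 'page2': [2]}
```

Looking up a word is now a single step: the engine never rereads the pages, it only reads the entry for that word.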

The indexing process not only records locations but also converts everything into numbers. Computers function on 1s and 0s, not on the English alphabet or any other alphabet. Converting the words to numbers is important because searching is based not on words and letters but on math.
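One simple way to picture the conversion to numbers is to hand each distinct word the next unused integer. The sketch below assumes exactly that scheme; it is an illustration only, and real engines use their own encodings.

```python
def assign_term_ids(words):
    """Give each distinct word the next unused integer, in order of first appearance."""
    term_ids = {}
    for word in words:
        if word not in term_ids:
            term_ids[word] = len(term_ids)
    return term_ids

words = "search the library search the web".split()
term_ids = assign_term_ids(words)
encoded = [term_ids[w] for w in words]

print(term_ids)  # {'search': 0, 'the': 1, 'library': 2, 'web': 3}
print(encoded)   # [0, 1, 2, 0, 1, 3]
```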

Query Processing

The query, what you enter in the search box, has to be converted to numbers so that the engine can process your request. Before it converts the query to numbers, though, the search engine will get rid of several terms. Most search engines have a list of stop words, words that will not be searched. Most search engines will not search for the, and, it, be, will, etc. Those short words are just filler to the computer. If you absolutely need those words in the search, you must enclose them in quotation marks. (Google once also accepted a plus sign before the term, but it retired that operator in 2011.) Once the terms are converted to numbers, the engine calculates which indexed terms are mathematically closest to what you asked for. The algorithm is complex, but it returns items based on how close they are mathematically to your query. Those that are closer are listed higher on the results list. Some engines will even show a percent of relevance.
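To make "mathematically closest" a little more concrete, here is a small sketch that drops stop words and then ranks pages by cosine similarity between word counts. Cosine similarity is a classic textbook measure that I have swapped in for illustration; it is not the algorithm of any particular search engine. The stop-word list and the three pages are also made up.

```python
import math
from collections import Counter

# Hypothetical stop-word list; real engines use their own lists.
STOP_WORDS = {"the", "and", "it", "be", "will", "a", "of"}

def to_vector(text):
    """Drop stop words and count the remaining terms."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, pages):
    """Return (page_id, score) pairs, closest match first."""
    q = to_vector(query)
    scores = [(page_id, cosine(q, to_vector(text))) for page_id, text in pages.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

pages = {
    "page1": "history of the library",
    "page2": "the library search guide and search tips",
    "page3": "campus sports schedule",
}
for page_id, score in rank("the library search", pages):
    print(page_id, round(score, 2))
```

Here page2 comes out on top because it contains both remaining query terms, page1 follows with one of them, and page3 scores zero.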

Relevance scores are raised by several factors: whether the words are in the title as opposed to just being in the text, whether a word occurs in bold or italics on the page, how many times the word occurs on the page, the number and quality of links to that page, and whether the words occur in the header (the invisible set of tags created by the web programmer).
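The factors listed above can be pictured as a weighted sum. In the sketch below every weight, field name, and page is hypothetical; it only shows the idea that a word in the title counts for more than the same word in the body text.

```python
# Hypothetical weights; real engines keep theirs secret and use many more signals.
WEIGHTS = {"title": 5.0, "emphasis": 2.0, "body": 1.0, "inbound_link": 0.5}

def relevance(term, page):
    """Add up weighted signals for one term on one page."""
    score = 0.0
    if term in page["title"].lower().split():
        score += WEIGHTS["title"]          # word appears in the title
    if term in page["emphasized"]:
        score += WEIGHTS["emphasis"]       # word appears in bold or italics
    score += WEIGHTS["body"] * page["body"].lower().split().count(term)
    score += WEIGHTS["inbound_link"] * page["inbound_links"]
    return score

page_a = {"title": "Library Hours", "emphasized": {"library"},
          "body": "the library opens at eight", "inbound_links": 4}
page_b = {"title": "Campus News", "emphasized": set(),
          "body": "the library and the library annex", "inbound_links": 1}

print(relevance("library", page_a))  # 10.0
print(relevance("library", page_b))  # 2.5
```

Note that page_b mentions the word more often in its text, yet page_a scores higher because of its title, its emphasis, and its links.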

Something to keep in mind: you are not searching the entire internet when you search. You are only searching an index of the internet. Google has the largest index and will return billions of hits; Yahoo's is smaller and will return fewer. The difference is not just how many hits, but also that they are different hits. Each search engine sent its bots in different directions, so each has indexed different parts of the web. The results lists also differ because the engines work off different algorithms (many exist, and some are guarded secrets).

So, What Difference Does This Make?

Now that you know that you are searching an index, and that the index holds mathematical representations rather than the words themselves, constructing a search query should make more sense.

Keyword searching, then, is just a matter of matching numbers in the index. Not a problem.

Phrase searching is looking for exact matches of number strings. Not a problem for the search engine.
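Matching a phrase as a string of numbers can be sketched like this. The term numbers are made up for the example, and the function simply slides along the page looking for the phrase's numbers in the same order.

```python
def phrase_positions(tokens, phrase):
    """Return start positions where the phrase's tokens occur consecutively."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

# Hypothetical term numbers, made up for the example.
TERM_IDS = {"how": 1, "a": 2, "search": 3, "engine": 4, "works": 5}

def encode(text):
    """Convert a string of known words to its list of term numbers."""
    return [TERM_IDS[w] for w in text.lower().split()]

page = encode("how a search engine works")
print(phrase_positions(page, encode("search engine")))  # [2]
print(phrase_positions(page, encode("engine search")))  # []
```

The second search finds nothing because word order matters in a phrase, even though both words are on the page.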

Wildcards and truncation work because the token (the number representing a term) can be searched with wildcards placed in it. My example on the other page was savior vs. saviour. In the index they might be represented by something like (this is entirely made up for the example) 813612 vs. 8136132. A wildcard would tell the search engine to look in the index for any number in the extra space, while truncation would just cut both tokens down to a root of 8136.
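The same idea can be sketched with letters instead of made-up numbers, since the matching logic is the same either way. The vocabulary below is hypothetical, and the asterisk is assumed to stand for any run of characters, including none.

```python
from fnmatch import fnmatchcase

# Hypothetical slice of an index vocabulary.
VOCABULARY = ["save", "saving", "savior", "saviour", "saviors", "savory"]

def wildcard(pattern, vocabulary):
    """Expand a pattern in which * stands for any run of characters (including none)."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]

def truncate(root, vocabulary):
    """Expand a root to every term that begins with it."""
    return [term for term in vocabulary if term.startswith(root)]

print(wildcard("savio*r", VOCABULARY))  # ['savior', 'saviour']
print(truncate("savi", VOCABULARY))     # ['saving', 'savior', 'saviour', 'saviors']
```

Truncation casts the wider net: it picks up saving and saviors as well, which the wildcard pattern leaves out.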

Boolean operators force the search engine to use multiple entries in the index. OR basically asks for two searches and combines the results. AND searches for both terms but returns only the results they have in common, so it has to compare results. NOT is just the removal of common results: any page that also contains the excluded term is left out of the results list.
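The three operators map neatly onto set operations. In this sketch the page numbers are made up, and each term's index entry is assumed to be a simple set of the pages on which it occurs.

```python
# Hypothetical index entries: the set of page numbers on which each term occurs.
POSTINGS = {
    "cats": {1, 2, 3, 5},
    "dogs": {2, 3, 4},
}

cats, dogs = POSTINGS["cats"], POSTINGS["dogs"]

print(sorted(cats | dogs))  # cats OR dogs:  [1, 2, 3, 4, 5]
print(sorted(cats & dogs))  # cats AND dogs: [2, 3]
print(sorted(cats - dogs))  # cats NOT dogs: [1, 5]
```

OR gives the longest list and AND the shortest, which is worth remembering when a search returns too many or too few results.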

Knowing how the search engine works should help you think more carefully about how you formulate your query.