When the World Wide Web surged into existence during the 1990s, we were introduced to the problem of how to actually find something in this ever-ballooning construction zone that easily outpaced even the fastest post-WW2 urban sprawl. Although domain names provided a way to find servers using DNS rather than having to mash in IP addresses, you still somehow had to know the relevant URL.
A range of solutions were thought up over time, from printed Yellow Pages-type guides to online curated lists of resources, as well as things like web rings where one website would link to a relevant similar website. This was the time when word-of-mouth was also very relevant, with people proudly announcing their own website on Geocities or another hosting service.
Search engines already existed long before the WWW became the hot new thing during the 1990s, but it was the WWW that would really push them to their limits. As anyone who used search engines for the WWW can attest, they had many issues. Often you’d end up using multiple search engines to find something, and despite fierce competition between web search engines to become the start page in people’s browsers, actually finding things on the WWW remained a tough problem.
Since a web search engine ‘just’ has to index the WWW and match a search query against the results, why was this such a hard problem that persisted until Google apparently cracked the code?
Unplanned Sprawl

A nice thing about the WWW is that it was designed to be accessible to all, requiring only an Internet connection and thus opening up the possibility of setting up your own webserver. This unsurprisingly led to a very rapid growth of pages on the WWW, with content appearing, being modified and sometimes vanishing at an ever-increasing pace, making it extremely hard to keep up with.
This is however not how things started when the World Wide Web was created in 1989. Before its opening to the public in 1993 the pace of growth was slow enough that an index could be maintained by hand. This was kept up until late 1992, with the last version of said index still online on the W3 website.
Over the course of a short few years, the WWW would change the face of the world forever alongside a surge of IBM-compatible PCs, exploding multimedia content, all the dot-com hype and perhaps best of all endless ‘free’ hosting services as long as you didn’t mind an advertising banner plastered above your personal homepage’s content.
Even internet service providers (ISPs) would often offer their own hosting service, along with endless n00b-friendly tools to make something resembling a website for whatever hobby you fancied. In addition to proving that one can absolutely argue about style and the prevalence of colorblindness, this would also serve to balloon the number of websites at an exponential rate.
Whether or not the WWW killing off the Gopher-based internet was a bad thing remains the topic of debate, though it’s beyond question that Gopher integrated search functionality into its protocol, mirroring a file system.
Infinite Library Indexing
Without any provisions in the HTTP protocol of the WWW, the only realistic way for search engines to create an index of the ever-expanding and changing WWW is to perform so-called web crawling. This means going through every known document, following any links found in them, and making sure to revisit any documents in case their contents got changed since the last visit.
The first complication here is that since the search engine’s database is the only real index for the web, initial discovery is purely organic, starting from a certain number of URL seeds in what is called the crawl frontier. This forms an integral part of a web crawler.
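To make the crawl frontier a bit more concrete, here is a minimal sketch of a breadth-first crawler. To keep it runnable without a network connection, it walks a tiny in-memory stand-in for the web: `FAKE_WEB`, `fetch_links` and the `example` URLs are all made up for illustration, and a real crawler would also have to deal with politeness delays, robots.txt, failures and revisits.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # every URL that was ever put on the frontier
    visited = []              # URLs in the order they were crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:   # only queue URLs we have not met before
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["http://a.example/"]))
```

Note that a page which no known page links to is never discovered, which is exactly the 'purely organic' discovery problem described above.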
Development of the algorithms and architecture behind these crawlers formed a major part of the early WWW, with IBM researchers on the WebFountain project in 2001 estimating a grand total of about 500 million pages, with – as they put it – web crawlers caught between the comfortable cushion of Moore’s Law and the hard place of the web’s exponential growth. Today this number is probably closer to forty billion pages.
Although the Google Search web crawler was already pretty good back in 2001, WebFountain improved on it by using a distributed system, with ‘ants’ working through their own list of URLs to crawl, as described in the development paper by Jenny Edwards et al.
Beyond the basic recursive following of links in a document there are many confounding factors, such as when to recrawl a URL, which very much depends on how often the content on it is expected to be updated. Here one dives into the territory of statistics, as depending on the type of site we can make an educated guess on how often it is expected to be updated. For example, a government’s historical news pages are unlikely to see frequent updates, whereas the front page of a news site can see updates practically every few minutes.
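One simple way to turn that educated guess into code is to let each URL's recrawl interval adapt to what the crawler actually observes. The sketch below is only an illustrative heuristic, not what any particular search engine does: it halves the interval when the page content changed since the last visit and doubles it when it did not. The bounds `MIN_INTERVAL` and `MAX_INTERVAL` and the function names are assumptions chosen for the example.

```python
import hashlib

# Assumed bounds for the example, in seconds.
MIN_INTERVAL = 5 * 60            # never recrawl more often than every 5 minutes
MAX_INTERVAL = 30 * 24 * 3600    # never wait longer than 30 days

def content_hash(body: str) -> str:
    """Fingerprint of the page body, used to detect changes between visits."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def next_interval(current_interval, old_hash, new_body):
    """Return (new recrawl interval, new content hash) after a visit."""
    new_hash = content_hash(new_body)
    if new_hash != old_hash:
        interval = current_interval / 2   # page changed: come back sooner
    else:
        interval = current_interval * 2   # page unchanged: back off
    # Clamp the result to the configured bounds.
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return interval, new_hash
```

With this, a news front page would quickly settle near the minimum interval, while a static archive page would drift towards the maximum.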
Inverted Indexing
As complex as the topic of web crawling is, the fun part begins when you have pruned all duplicate documents and stripped all the irrelevant fluff that’s not text to be indexed. In order to make the resulting search index at all searchable before the heat death of the Universe you cannot simply do a full text search on every single document whenever someone enters a search query.
Instead an index is constructed whereby certain keywords are mapped to documents. This inverted index is generally implemented as a hash table or similar data structure that provides quick access to the full text documents, not unlike the keyword index in the back of a book, or the more elaborate concordance of yesteryear. The latter also provides a keyword index, but adds accompanying text to give immediate context and save further time.
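As a rough illustration of the idea, the following sketch builds an inverted index as a plain Python dictionary that maps each keyword to the set of documents containing it. The tokenizer, the document IDs and the AND-style `search` function are assumptions made for this example; real engines add ranking, stemming, stop words and much more.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each keyword to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())   # intersect the posting sets
    return result

docs = {
    1: "Gopher predates the web",
    2: "The web crawler follows links",
    3: "Crawler ants follow links on the web",
}
index = build_inverted_index(docs)
print(search(index, "web crawler"))
```

Answering a query is now a handful of dictionary lookups and set intersections rather than a scan over every document.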
Creating an inverted index is a fairly labor-intensive process, with a new document often first used to build a forward index that decomposes the text into its keywords prior to updating (or creating) the inverted index. As with all such text-processing tasks, and data structures in general, there are many ways to go about it, with some fun curveballs thrown into the mix such as parsing languages that do not separate words with spaces, like Japanese.
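The two-step process can be sketched as follows: a forward index maps each document to its keywords, after which it gets flipped into a keyword-to-documents mapping. All names here are made up for the example. The character bigram tokenizer is just one simple fallback for text without spaces and is an assumption on my part; proper word segmentation for Japanese usually relies on a dictionary-based morphological analyzer.

```python
from collections import Counter, defaultdict

def tokenize_words(text):
    """Split on whitespace, for languages that separate words with spaces."""
    return text.lower().split()

def tokenize_bigrams(text):
    """Overlapping two-character terms, for text without word separators."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def build_forward_index(documents, tokenizer):
    """Forward index: document ID -> how often each term occurs in it."""
    return {doc_id: Counter(tokenizer(text))
            for doc_id, text in documents.items()}

def invert(forward_index):
    """Inverted index: term -> {document ID: occurrence count}."""
    inverted = defaultdict(dict)
    for doc_id, counts in forward_index.items():
        for term, count in counts.items():
            inverted[term][doc_id] = count
    return inverted

forward = build_forward_index({1: "web web crawler", 2: "web index"},
                              tokenize_words)
inverted = invert(forward)
print(inverted["web"])
print(tokenize_bigrams("検索エンジン"))
```

Keeping the per-document counts around is handy later on, as they are the raw material for most ranking schemes.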
All of which is to say that implementing a search engine is easy, but making it performant, accurate and efficient at the same time is a minor nightmare. This is basically why search engines took so long to stop being so terrible, as the engineers behind them were trying to solve many rather complex problems, presumably with the C-suite and investors breathing down their necks during the dot-com days.
Search Battles
Over on the Wikipedia entry for ‘Search engine’ we find a pretty good timeline of web search engines, along with their current status. Perhaps unsurprisingly none of the 1993-era ones made it, but 1994’s WebCrawler somehow crawled into the modern age, along with Lycos. Much like 1990’s Archie search engine and similar ones for the Gopher web, many of these early search engines simply couldn’t compete in the rapidly changing years leading up to the new millennium.
This was also the era in which some figured that the WWW simply needed to become more ‘3D’ with virtual environments using VRML, bringing it closer to sci-fi like that portrayed in Snow Crash or Tron. Perhaps unfortunately the WWW remained the domain of mostly text and images, although most recently the flood of JavaScript frameworks appears to want to turn once simple HTML documents into full-blown desktop-like applications, all probably to the delight of web crawler engineers.
Meanwhile some search engines figured that they could lift along on the hard work of others, with so-called meta search engines collating the results from multiple search engines to save people the trouble of querying them individually. Here 1996’s Dogpile is still going strong.
Some search engines are missing from the list, such as Marginalia, which boasts the use of open source software for its indexing and crawling, while focusing on non-commercial content. There is also the ever excellent Frog Find that provides a bridge between modern search engines and systems that really cannot run the latest web browser.
Today’s Survivors
The search engine landscape remains a brutal one today, with us having to recently say farewell to Jeeves, of Ask Jeeves fame, most recently seen carrying the Ask.com name. Personally I didn’t really Ask Jeeves much back in the day, instead mostly using AltaVista (RIP) and probably Lycos and a few others that I do not recall off the top of my head.
Google Search bursting onto the scene by 2000 was definitely quite the event, and it was certainly around then that the web search game improved. Looking back, it was probably less that Google Search was simply better, and more that it pushed hard on just being a search engine, whereas the others were still very much stuck in that early WWW mindset of being a portal to the web.
To a certain extent this is understandable, as search engines aren’t a charity and running the associated hardware as well as the required bandwidth costs a lot of money. Despite this it would seem that we still have a rather thriving web search engine landscape, even if ChatGPT, Claude and kin are trying to become the very last ‘site’ you will ever need. This even as their little web crawlers are still doing the same crawling as has been done since the birth of the WWW.






