
How Does a Search Engine Work, Roughly?

Nostalgia

Back in the early 2000s, when I first got my hands on a proper internet connection at university — studying Electrical Engineering, no less, with a focus on Control Systems — I genuinely believed that typing something into a search engine meant it was out there, live, rummaging through the actual web in real time.

Embarrassing, really.

Like sending a little scout off into the wilderness and waiting for him to come back with news. — Quite, honourable user of search engine. Please give me a moment to fetch the exhilarating answer of your... sort of... question! — Skittering about with his tiny legs and coming back with information.

Indeed, I thought it was like that!

Well, I didn't discuss it with anyone back then. Just searched for materials from the internet, went back glaring at the books and screen, no —

Prithee, good fellows, when I didst query Yahoo!, was it truly scurrying about the great web, fetching results in earnest?

⬆️ No. Absolutely not.

We were just doing what the curriculum told us to do, no further questions. And in my department, no proper "Search Engine" topic. But the logic underneath — the gates and whatnots — and the bloody semiconductor magic? Oh, we were absolutely drenched in those. We were the ones studying the asphalt and the Informatics Engineering lads were the ones studying the roads. The Mechanical Engineering lads were, well, bit intense. I couldn't tell if they were fascinated by robotics or just very keen on MMA. Well, no "MMA" that time, it was simply — I smack. Sorted.

Here's the MOSFET threshold voltage for our amusement:

$V_{TH} = V_{FB} + 2\phi_F + \frac{\sqrt{2\epsilon_s q N_A (2\phi_F)}}{C_{ox}}$

Steve from Microelectronics:

I beg your pardon, that formula does not belong to your department!

Well. Fine. Here's the closed-loop transfer function:

$T(s) = \frac{G(s)}{1 \pm G(s)H(s)}$

End of amusement

Anyway, turns out — not quite, mate. I mean, the search engine bit, not Steve.


How It Actually Works

The main flow goes as such:

  1. Crawling

    Bots continuously roam the web, hopping from link to link, page to page, quietly gathering content.

    ⬆️ Hopping. ➡️ Hop + ing = hopping, English.

    Indeed, English, as if this were all in Portuguese.

    Not "hoping" (hope + ing). They didn't hope for the best.

    "Hoping from link to link." ⬅️ Very... spiritual.

    They were commanded to do the tasks. No hoping was included in that. —

    I hope that link is friendly! Last link, it took my trousers. I cannot believe that.

    — One sentient bot uttered.

  2. Indexing

    Everything gathered gets analysed, categorised, and filed into a massive internal database.

    This database is what a search engine actually is, underneath all the polish.

  3. Updating

    The bots keep revisiting.

    Pages change, new ones appear, old ones die.

    Some pages get deindexed — quietly removed from results due to legal requests, poor quality, or the site simply going dark.

  4. User searches

    When we type into that search box, we're querying that internal database, not the live web.

    ⬆️ Similar to when we asked an AI to look up something from the web. It didn't go to the live web. But of course, compare "searching the database" and "searching the web" ⬅️ which one is more... trustworthy? Precisely. Like asking a bloke to get a book from the KITCHEN, but then he says — Aye aye. Fetching from LIBRARY. — No. Still the kitchen. Simply repeat the order. The person giving the order does not need to know the details. Book is fetched. That's the goal.

    Moving on.

    Then the results get filtered and ranked through algorithms.

    The algorithms:

    • relevance scoring,
    • spam detection,
    • legal compliance,
    • and a fair bit of commercial logic too.

    And it doesn't stop there — every click we make on a result, and how long we actually stay on that page, feeds right back into the ranking algorithm. The users themselves are unknowingly voting on the quality of results with every single search.

    The Retrieval Pipelines

    That little skittering scout above, he now is updated to have "AI Mode" by default, instead of just — Here you go, honourable user of search engine. (Spreads scrolls on the ground.) Ah. Plenty of scrolls.

    Essentially, each search engine (all of them, not merely Google) is still rummaging through its own database, because it cannot just barge into Microsoft's or anyone else's like some sort of lost pirate. But now there is a sophisticated AI bouncer at the front of the mechanism, acting as an intermediary between the user's query and the ranking/retrieval system.

    This is the classic ranking pipeline.

    Raw query.
    ⬇️
    Keyword matching against index.
    ⬇️
    Ranking algorithm sorts results by relevance, authority, etc.
    ⬇️
    User gets a list of search result links.
    ⬇️
    User clicks (or doesn't).
    ⬇️
    That behaviour feeds back into the ranking algorithm (step 3 of this pipeline).

    And this is the latest trend we've seen on Google Search. The AI Mode (LLM-synthesised) retrieval pipeline.

    Raw query.
    ⬇️
    Gemini intercepts.
    ⬇️
    Breaks down the query into multiple sub-queries.
    ⬇️
    Fires them all at the same index simultaneously.
    ⬇️
    Retrieves results.
    ⬇️
    Gemini synthesises and summarises into a conversational answer.
    ⬇️
    User gets paragraphs — with some links in the paragraphs — and the "regular" search results below that. "Video" results usually at the top, followed by website link results.
    ⬇️
    Either there's a click or no click, and that feeds back into the ranking, as in the classic pipeline.

⬆️ Roughly, that is indeed how it goes.
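
For the curious, the crawl, index, and search steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not how any real search engine is built: `FAKE_WEB`, the `.example` addresses, the function names, and the 0.5 click bonus are all made up for illustration.

```python
from collections import defaultdict

# A pretend web: URL -> (page text, outgoing links). Entirely made up.
FAKE_WEB = {
    "a.example": ("orange and apple juice recipes", ["b.example", "c.example"]),
    "b.example": ("apple pie with cinnamon", ["a.example"]),
    "c.example": ("orange marmalade on toast", []),
}


def crawl(start):
    """Hop from link to link, collecting the text of every page reached."""
    seen, queue, pages = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        text, links = FAKE_WEB[url]
        pages[url] = text
        queue.extend(links)
    return pages


def build_index(pages):
    """Inverted index: word -> {url: how many times the word appears}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query, clicks=None):
    """Look up the index only (never the live web), then rank the matches."""
    clicks = clicks or {}
    words = query.lower().split()
    if not words:
        return []
    # Keep only the pages that contain every query word.
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))

    # Score = term frequency plus a small (assumed) bonus per recorded click.
    def score(url):
        return sum(index[w][url] for w in words) + 0.5 * clicks.get(url, 0)

    # Highest score first; ties broken alphabetically by URL.
    return sorted(matches, key=lambda u: (-score(u), u))


pages = crawl("a.example")
index = build_index(pages)
print(search(index, "orange"))
print(search(index, "orange", clicks={"c.example": 3}))
```

The second search is the "users are voting" bit: three recorded clicks are enough to lift c.example above a.example, and at no point does `search` touch `FAKE_WEB` itself.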


The Neighbourhood Analogy


Perhaps the simplest way to picture it:

Imagine we need to find an address in an unfamiliar neighbourhood.

We don't just wander about the streets hoping to stumble across it. We consult a map, or ask the local bloke at the corner shop who's walked every street and knows the area like the back of his hand.

The search engine is that bloke. He's already done all the wandering — so we don't have to.

And that, really, is all there is to it. No little scout dashing about the live web on our behalf. Just a very large, very well-organised filing cabinet — with a bloke at the front desk who's already done all the legwork for us.

Consult a map. ⬇️

Hi, map.

That will be ten quid for consultation.

I haven't asked anything, map.

Twenty-five for pleasantries.

Thirty-five? Just for an address?

Fifty for list of reasons.

A very distinguished map, that.


The Querying Nostalgia

Remember back then, we queried as such — well I suppose, until now:

orange and apple

Perhaps even —

orang eaple; (Hit enter. Fixed the query.) orange apple (Hit enter.)

Or copied from somewhere, pasted that into the search box —

Omni-Directional, Through Hole 6 mm

No context, no proper sentence such as:

what is the benefit of orange and apple?

Or —

where can I get an omni-directional condenser microphone component?

Because back then, if we put a proper question or sentence, the results would be bloody awkward! Yes? Remember?
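
One plausible reason for that awkwardness, sketched in Python under loud assumptions: if an engine insists on matching every typed word literally, the filler words in a proper sentence ("what", "is", "the") sink the whole query. `PAGES` and `strict_match` are hypothetical, and the real engines of the era were rather more forgiving than this.

```python
# Hypothetical page texts, invented for this sketch.
PAGES = {
    "fruit.example": "orange and apple health benefits",
    "parts.example": "omni-directional condenser microphone through hole 6 mm",
}


def strict_match(query):
    """Return the pages that contain every word of the query, literally."""
    words = query.lower().replace("?", "").split()
    return [
        url
        for url, text in PAGES.items()
        if all(word in text.split() for word in words)
    ]


print(strict_match("orange and apple"))
print(strict_match("what is the benefit of orange and apple?"))
```

The bare keywords find the fruit page. The polite, well-formed question finds nothing at all, because no page happens to contain "what" or the singular "benefit".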

Database query sermon lad:

Well, OF COURSE! ALL HAIL THE INDEX OF THE SQUEL! For its word is law, its rows are gospel, and thou shalt not query without it! Lest ye receive a full table scan and perish! HA! Hmppfft! (Adjusting robes.)

And then we NEEDED to learn the database "language"! For searching things! — Car. Cheap car. Cheap car with bikini. — And the "search operators" — Cheap car with bikini -thong site:somewebsite.gov intext:cheap — And then trial and error — SELECT * FROM internet WHERE humans_wrote_it = TRUE AND query = 'cheap car with bikini'; ⬅️ Returned 47,000,000 results. About SQL.

Imagine if that old-school querying style above is implemented to a local bloke — "implemented", well, yes:

13 maple street

What?

where 13 maple street

I don't know.

where is 13 maple street?

I STILL don't know.

where is 13 maple street? ANSWER ME

I DID!

🥊💥

😵💫 (Horizontal.)

A very odd and aggressive querist, that. Very implemented.


Spam

The "spam" bit — that was from Monty Python. ⬅️ You may need to... cover your ears a bit, or lower your speaker volume. They are about to shriek and screech.

Spam — not the dish, per se. Well yes, that spam, the pre-cooked canned meat product. But it was more of the repetitive malarkey in their sketch — it could be a sandal or anything else, really. But in it, they used spam.

And then people on the internet back then started to use "spam" to refer to... that repetitive nonsense.

Junk email = "spam". ⬅️ Back then. People who consumed spam were bloody confused by that term. Back then. Not now. Now, everyone is possibly well informed or simply — Oh yes, spam. The internet spam. Not the food spam. Back to swiping my arse.

Imagine if they used sandal instead of spam. —

Oi Geoff, my mailbox now is full of sandals! Such a rogue sandaler.

As you've noticed above, I use "back then" as a form of spam. And also, the "spam" itself.


No Small Investment

It is worth pausing for a moment to appreciate what is actually running underneath all this.

The crawling and indexing I described above run on vast warehouses packed floor to ceiling with servers, consuming electricity on a scale that would make your energy bill weep.

These data centres sprawl across multiple countries, engineered to stay online every hour of every day without flinching.

Then there are the engineering teams — armies of them — dedicated solely to keeping the crawlers crawling, the indexes sharp, and the whole enormously complex machine ticking along properly.

And yet, we type into that search box entirely for free.

Not a penny changes hands.

Which naturally raises the question: how on earth do they pay for all this?

The answer, of course, is us — or rather, our attention.

Advertising is the engine behind the engine. Every search we make, every result we click, every pattern in our behaviour feeds into a finely tuned commercial machine that sells targeted advertising to businesses worldwide.

The search engine is free because we are the product being sold to advertisers.

Mm.

So yes — not exactly a charity operation. Just a very well-oiled business hiding behind a very clean, simple search box.

Mwahahaha!

Though in fairness, it is a reasonable arrangement.

Businesses need customers, and the internet is precisely where those customers roam. Advertising on a search engine is simply that age-old transaction — connecting a seller to a buyer — at an almost incomprehensible scale.

Just commerce doing what commerce does.

Though not every advertiser showing up at that front door is entirely reputable, mind you.

Google Ads, for all its sophistication, has historically attracted the odd scoundrel peddling phishing schemes, dubious software installers, and the classic —

🎉 Congratulations! You have won! 🥳

— variety of nonsense.

No need for a dodgy bloke on a street corner anymore — just buy an ad, sit back, and let the internet do the legwork.

Google does monitor its advertising platform, of course, and publishers can flag rogue ads when they surface. Whether that net catches everything is a matter of ongoing debate.


The SEO Realm — Now

SEO = Search Engine OptimiZation. I need to honour the Zed in it. All goes to Noah Webster.

For decades, the SEO realm was a proper wilderness — gurus, acronyms, backlink farms, algorithm whisperers, and an entire cottage industry built on the sacred art of pleasing a database. Entire agencies rose and fell on a Google Tuesday. Careers were forged and destroyed by Panda, Penguin, and whatever other woodland creature Google fancied unleashing that quarter. And through it all, the SEO lads stood firm, invoicing confidently, armed with spreadsheets and an unshakeable belief that they understood the algorithm. Mm.

Then the LLM waltzed in, read the whole internet for breakfast, and started answering questions directly. All those years of meticulously crafted title tags, H1 headings, canonical URLs, and high-quality, evergreen content — synthesised, digested, and served up without so much as a footnote crediting the poor sod who wrote it.

The SEO industry, once a bustling marketplace of noise and nonsense, now stands rather quiet, like a fairground after closing time.

If Matt Cutts reads this — once the SEO world's high priest, once a deity — he'll either laugh heartily or stare solemnly into the middle distance.

The Ad Publishers — Now

The LLM waltzed in.

LLM = Large Language Model. An AI (Artificial Intelligence) model built specifically for understanding, processing, and generating human language.

Well, if the SEO lads lost their fairground, the ad publishers lost the entire bloody town.


The AI Diagram

Humans write content.
⬇️
Published on web.
⬇️
Crawled and indexed.
⬇️
AI trains on it.
⬇️
AI serves answers directly.
⬇️
No traffic to websites.
⬇️
No ad revenue.
⬇️
No incentive to write.
⬇️
Less fresh content.
⬇️
AI grows stale.
⬇️

⬆️ So now the very engine that devoured the web is slowly starving because the web has stopped being fed. In the long run, AI will grow stale. No fresh information to digest. AI will certainly hallucinate in this phase. Akin to when they were first released to the public.

Remember when Gemini, via Google's AI Overviews, recommended glue as a solution to cheese not sticking to pizza? Or suggested that someone eat a small rock per day for minerals, CONFIDENTLY? Indeed, that first release. Well, all things considered, all AIs had that moment, not just Gemini. I'm talking to you, MISTRAL and GPT and CLAUDE and such. Or, about you. Hm. 🤔

So here's the thing — Google's absolutely boxed in, aren't they?

Push the AI forward:

they're slowly sawing off the very branch they've been sitting on for twenty-odd years.

Pull back?

Headlines write themselves, stock drops, every tech journalist sharpens their pencil for the "Google blinked" piece worldwide!

They built the thing, announced the thing, deployed the thing — and now they simply have to keep smiling through it all like everything is absolutely fine.

Which it isn't.

It's all fine, folks. (Smiling. Whilst sawing the branch. Shhkk shhkk.) Trust us. Fine. Acceptable. In good health. Satisfactory. (Shhkk shhkk.) ⬅️ That sounds rather odd.

Somewhere at Google in ~2023, from "Bard" chatbot, their oracles simply forgot, or perhaps skipped the advertising flow, because of the excitement over their panic upon OpenAI-Microsoft's thing back then. — YES! That is brilliant. Users will love that! Very human! We have no further questions! Car. Cheap car. Cheap car with bikini! All were answered. Conversationally! — I believe there was no Merlin in that lot.

Very ouroboros. Mm.


The Cookie Banner — Obviously, Now

Back to the search-engine-advertising trope: the legacy of all that enthusiastic data harvesting, of course, is that we now cannot visit a single website without being ambushed by a cookie consent banner the size of a motorway billboard.

You are quite welcome, everyone.


See you next time! 👋
