Google, conceptually, uses an HTML DOM parser. This breaks any web page's HTML down into its basic structure, and each HTML tag is given an ID. This ID represents the order of the HTML tags from beginning to end. The parse also captures any dependency between HTML elements, such as an li tag depending on a ul tag; any parent-child relationship between HTML elements, such as nested li tags; and any content block relationship between HTML elements, such as a p tag following a header tag such as h1. This structure is traditionally represented using a language such as XML. Keep in mind that HTML-to-XML parsers have existed for a very long time.
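To make the idea concrete, here is a minimal sketch, using Python's built-in html.parser, of what assigning sequential IDs and recording parent-child relationships might look like. The class name and the data it records are illustrative assumptions, not Google's actual internals.

```python
from html.parser import HTMLParser

class DomSketch(HTMLParser):
    """Toy DOM walk: give each tag a sequential ID and record its parent.
    This only illustrates order plus parent-child relationships; it is not
    how Google's parser actually works."""

    def __init__(self):
        super().__init__()
        self.next_id = 0
        self.stack = []   # currently open tags, as (id, tag_name)
        self.nodes = []   # (id, tag_name, parent_id)

    def handle_starttag(self, tag, attrs):
        node_id = self.next_id
        self.next_id += 1
        parent_id = self.stack[-1][0] if self.stack else None
        self.nodes.append((node_id, tag, parent_id))
        self.stack.append((node_id, tag))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (forgiving of sloppy HTML).
        while self.stack and self.stack[-1][1] != tag:
            self.stack.pop()
        if self.stack:
            self.stack.pop()

parser = DomSketch()
parser.feed("<html><body><h1>Title</h1><p>Text</p><ul><li>One</li></ul></body></html>")
for node_id, tag, parent_id in parser.nodes:
    print(node_id, tag, "child of", parent_id)
```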
Once the elements are broken apart, any a tag can be broken down further into its elements. Any time a page is parsed, the first thing that is done is that all links are stored into the index within a link table. This link table is a relational table that has a relationship with a URL table. The URL table stores the URLs of pages, while the link table simply relates records in the URL table to the link text. If you are not familiar with relational databases, this may not fully make sense. To that end, each table is like a spreadsheet. One sheet has URLs. One sheet has link text and references to records within the URL sheet.
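As a rough illustration of the two-table idea, here is a minimal SQLite sketch. The table and column names are assumptions made for the example, not a known schema.

```python
import sqlite3

# Two tables standing in for the "spreadsheets" described above: one holds
# URLs, the other relates a source URL to a target URL plus the link text.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE urls (
    url_id INTEGER PRIMARY KEY,
    url    TEXT UNIQUE NOT NULL
);
CREATE TABLE links (
    link_id       INTEGER PRIMARY KEY,
    source_url_id INTEGER NOT NULL REFERENCES urls(url_id),
    target_url_id INTEGER NOT NULL REFERENCES urls(url_id),
    link_text     TEXT
);
""")

def url_id(url):
    """Return the url_id for a URL, inserting it into the URL table if new."""
    db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    return db.execute("SELECT url_id FROM urls WHERE url = ?", (url,)).fetchone()[0]

# Record a link found while parsing example.com/a that points to example.com/b.
db.execute(
    "INSERT INTO links (source_url_id, target_url_id, link_text) VALUES (?, ?, ?)",
    (url_id("https://example.com/a"), url_id("https://example.com/b"), "About us"),
)
```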
A link within the index has three basic elements: the source URL (reference), the target URL (reference), and the link text. If a link is stored in the index and only the page it was parsed from (the source) has a URL within the index, meaning that the target URL has not been fetched yet, it is a dangling link. The URL the link points to (the target) is then placed into the fetch queue to have the page fetched, indexed, etc. If the target page cannot be fetched, it is a broken link and remains within the index as a broken link for reference.
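Here is a toy sketch of that classification, assuming a simple in-memory queue plus sets for fetched and failed URLs; none of these names come from a real search engine.

```python
from collections import deque

fetch_queue = deque()   # URLs waiting to be fetched
fetched = set()         # URLs whose pages have been fetched and indexed
failed = set()          # URLs that could not be fetched

def classify_link(source_url, target_url, link_text):
    """Toy classification of a stored link, per the description above:
    dangling if the target has not been fetched yet (so it is queued),
    broken if fetching the target already failed."""
    if target_url in failed:
        return "broken"
    if target_url not in fetched:
        fetch_queue.append(target_url)   # schedule the target for fetching
        return "dangling"
    return "resolved"
```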
This is a recursive process, meaning that it begins and ends repeatedly: fetching pages, parsing pages, and indexing pages. For search engines, these processes are broken into individual, independent processes. Some search engine processes are queue based, meaning they take a record from a queue (list or database) and process it; others are trigger based, meaning that a trigger event starts the process; and others are batch based, meaning that the process runs against the entire database.
Pages are fetched from a queue of URLs. Once a page is fetched and stored, a trigger event is set to parse the page. Once the page is parsed, various other processes are triggered, including one that processes links. Each trigger-based process is considered real-time. Contrast this with the PageRank algorithm, which is batch based and runs periodically.
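A rough sketch of how such a pipeline could be wired together follows. The function names, the in-process queue, and the trigger wiring are all assumptions for illustration; a real search engine runs these as separate, distributed services.

```python
from collections import deque

fetch_queue = deque(["https://example.com/"])   # seed URL (illustrative)

def fetch(url):
    """Queue-based step: pull a URL and pretend to fetch and store the page."""
    html = "<html>...</html>"          # a real crawler would do an HTTP GET here
    on_page_stored(url, html)          # storing the page triggers parsing

def on_page_stored(url, html):
    """Trigger-based step: parse the stored page."""
    links = []                         # a real parser would extract a tags here
    on_page_parsed(url, links)         # parsing triggers link processing

def on_page_parsed(url, links):
    """Trigger-based step: store links and queue unseen targets."""
    for target, text in links:
        fetch_queue.append(target)

def recompute_pagerank():
    """Batch step: runs periodically over the whole link graph, not per page."""
    pass

while fetch_queue:                     # the crawl loop
    fetch(fetch_queue.popleft())
```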
This process is called crawling. It is like a spider that crawls the web. As each page is fetched and parsed, and link target URLs are added to the queue to be fetched, most pages are discovered very easily. For the remaining pages that do not have a link pointing to them, the sitemap comes into play. While it is not generally necessary for a site to have a sitemap, it can help the search engine know whether it is able to fetch all of the site's pages adequately. Sitemaps are primarily used to audit whether a site can properly be crawled. For any page listed within the sitemap that does not have a link targeting it, the URL is submitted, as read from the sitemap, to the fetch queue to ensure that the search engine has as many pages as can be fetched from any site.
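As an illustration, here is a minimal sketch that reads loc entries from a sitemap and queues only the URLs the crawler has not already discovered through links; the variable names and in-memory sets are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from collections import deque

known_urls = {"https://example.com/"}   # URLs already discovered via links
fetch_queue = deque()

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def queue_unlinked_sitemap_urls(sitemap_xml):
    """Read <loc> entries from a sitemap and queue any URL the crawler has
    not already discovered through links."""
    root = ET.fromstring(sitemap_xml)
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = loc.text.strip()
        if url not in known_urls:
            known_urls.add(url)
            fetch_queue.append(url)

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/orphan-page</loc></url>
</urlset>"""
queue_unlinked_sitemap_urls(sitemap)
print(list(fetch_queue))                # ['https://example.com/orphan-page']
```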
That is it. It is a simple process that has existed for a very
long time and works amazingly well.
Pages are periodically refetched. This is based upon the network concept of TTL, meaning Time To Live. This is simply a number representing seconds. For example, 5 minutes is 300 seconds and 24 hours is 86,400 seconds. While no one knows what the starting TTL for a web page is, this TTL is adjusted for each page, to either a longer or a shorter time period, depending upon whether the page changes or not. There is a process to determine whether the page content changes or templated content changes, with an algorithm to determine which changes are of value and which are not. This means that links in a sidebar may not make a page's TTL shorter, while a change within the content will.
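No one outside Google knows how that algorithm works, but the general idea can be sketched: fingerprint only the main content block so that template or sidebar changes do not count as meaningful. The function below is a toy illustration under that assumption.

```python
import hashlib

def content_fingerprint(main_content_text):
    """Hash only the main content block, ignoring sidebars, navigation and
    other templated areas, so template-only changes do not count."""
    return hashlib.sha256(main_content_text.encode("utf-8")).hexdigest()

# Toy check: only a change in the main content is treated as meaningful.
previous = content_fingerprint("Original article text.")
current  = content_fingerprint("Original article text.")  # sidebar changed, body did not
meaningful_change = previous != current                    # False: TTL need not shrink
```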
This is important to know because this is how a search engine determines, in part, a page's freshness. Of course, any new page is also fresh. If a page changes frequently, it is fetched more frequently, using the TTL as a trigger. The shorter the TTL, the more often the page is refetched, parsed, indexed, etc. Each time a page is refetched, the TTL is adjusted to determine how often the page should be fetched. It is this shortening and lengthening of the TTL that allows the page to be fetched appropriately according to how often it changes. There is a maximum TTL. For example, any page that does not change will be checked using the maximum TTL. This allows a search engine to process any page in a timely manner.
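As a sketch of how such a TTL adjustment could work, here is a small example that halves the TTL when a meaningful change is seen and doubles it otherwise, clamped to arbitrary bounds; none of these values or rules are known Google values.

```python
MIN_TTL = 300        # 5 minutes  (illustrative bound, not a known Google value)
MAX_TTL = 2_592_000  # 30 days    (illustrative bound, not a known Google value)

def adjust_ttl(current_ttl, page_changed):
    """Shorten the TTL when the page changed meaningfully since the last
    fetch, lengthen it when it did not, clamped to the min/max bounds."""
    new_ttl = current_ttl // 2 if page_changed else current_ttl * 2
    return max(MIN_TTL, min(MAX_TTL, new_ttl))

def next_fetch_time(last_fetch_timestamp, ttl_seconds):
    """The TTL acts as the trigger: the page is due again after ttl seconds."""
    return last_fetch_timestamp + ttl_seconds
```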
The freshness TTL exists for each page and will affect how quickly links are found on that page. Pages with shorter TTLs will have their links found more quickly than pages with longer TTLs.
The reason this is important to this answer is links. More often than not, the pages that are fresh have links to other pages that may also be fresh. Blogs are a prime example of this. Are you getting the picture? These links get submitted to the fetch queue just as before, making link discovery that much faster.