  • How Googlebot finds new pages on a website



    Google, conceptually, uses an HTML DOM parser. It breaks a page's HTML down into its basic structure and gives each HTML tag an ID. The ID and the structure around it record the order of the HTML tags from beginning to end; any dependency between HTML elements, such as an li tag depending upon a ul tag; any parent-child relationship between HTML elements, such as nested li tags; and any content block relationship between HTML elements, such as a p tag following a header tag such as h1. This structure is traditionally represented using a language such as XML.
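    As a rough illustration of the idea, and not a description of what Google actually runs, the sketch below uses Python's standard html.parser module to give each tag an ID in document order and to record its parent. The names StructureParser, parse_structure, and VOID_TAGS are made up for this example.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Records each tag with a document-order ID and its parent's ID."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per tag, in document order
        self.stack = []   # IDs of the tags that are currently open

    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes)
        parent_id = self.stack[-1] if self.stack else None
        self.nodes.append({"id": node_id, "tag": tag,
                           "parent": parent_id, "attrs": dict(attrs)})
        if tag not in VOID_TAGS:
            self.stack.append(node_id)

    def handle_endtag(self, tag):
        # Close the most recent open tag with this name, plus anything nested in it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break


def parse_structure(html):
    """Return the list of tag records for an HTML string."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


nodes = parse_structure("<h1>Title</h1><p>Text</p><ul><li>One</li><li>Two</li></ul>")
for node in nodes:
    print(node["id"], node["tag"], node["parent"])
```

    Here both li tags report the ul tag's ID as their parent, which is the kind of parent-child relationship described above.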
    Keep in mind that HTML to XML parsers have existed for a very long time.
    Once the elements are broken apart, any a tag can be broken down further into its parts. Any time a page is parsed, the first thing that is done is that all links are stored in the index within a link table. This link table is a relational table that has a relationship with a URL table. The URL table stores the URLs of pages, while the link table simply relates records in the URL table to the link text. If you are not familiar with relational databases, think of each table as a spreadsheet: one sheet has URLs, and another sheet has link text and references to records within the URL sheet.
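    To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and helper functions are assumptions chosen for illustration; the real schema is not public. For simplicity, every URL gets a row in the URL table whether or not it has been fetched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE urls (
        id  INTEGER PRIMARY KEY,
        url TEXT UNIQUE NOT NULL
    );
    CREATE TABLE links (
        source_id INTEGER NOT NULL REFERENCES urls(id),
        target_id INTEGER NOT NULL REFERENCES urls(id),
        link_text TEXT
    );
""")


def url_id(conn, url):
    """Return the row ID for a URL, inserting the URL first if it is new."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    return conn.execute("SELECT id FROM urls WHERE url = ?", (url,)).fetchone()[0]


def add_link(conn, source, target, text):
    """Store one link as two references into urls plus the link text."""
    conn.execute("INSERT INTO links VALUES (?, ?, ?)",
                 (url_id(conn, source), url_id(conn, target), text))


add_link(conn, "https://example.com/", "https://example.com/about", "About us")

# Join the link table back to the URL table to read the link in full.
rows = conn.execute("""
    SELECT s.url, t.url, l.link_text
    FROM links l
    JOIN urls s ON s.id = l.source_id
    JOIN urls t ON t.id = l.target_id
""").fetchall()
print(rows)
```

    The link row itself holds only two IDs and the link text; the URLs live once in the URL table, which is the spreadsheet-with-references arrangement described above.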

    A link within the index has three basic elements: the source URL (reference), the target URL (reference), and the link text. If a link is stored in the index when only the page it was parsed from (the source) has a URL within the index, meaning that the target URL has not been fetched yet, it is a dangling link. The URL the link points to (the target) is then placed in the fetch queue to have the page fetched, indexed, etc. If the target page cannot be fetched, the link is a broken link and remains within the index as a broken link for reference.
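    The life cycle of a link described above can be sketched in a few lines of Python. Everything here is hypothetical: the Link class, the status names, and the store_link and record_fetch helpers are invented for the example, and an in-memory set and deque stand in for the index and the fetch queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Link:
    source: str
    target: str
    text: str
    status: str = "resolved"   # "resolved", "dangling", or "broken"


fetched = {"https://example.com/"}   # URLs already fetched into the index
fetch_queue = deque()                # target URLs waiting to be fetched
links = []                           # every link stored so far


def store_link(source, target, text):
    """Store a link; if its target is not fetched yet, mark it dangling and queue the target."""
    link = Link(source, target, text)
    if target not in fetched:
        link.status = "dangling"
        if target not in fetch_queue:
            fetch_queue.append(target)
    links.append(link)
    return link


def record_fetch(url, ok):
    """Update every link pointing at url once a fetch attempt has finished."""
    if ok:
        fetched.add(url)
    for link in links:
        if link.target == url:
            link.status = "resolved" if ok else "broken"


link = store_link("https://example.com/", "https://example.com/new", "New page")
print(link.status, list(fetch_queue))            # the link is dangling and its target is queued
record_fetch(fetch_queue.popleft(), ok=False)    # pretend the fetch failed
print(link.status)                               # the link is kept, now marked broken
```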
    This is a recursive process, meaning that it repeats over and over, with each pass feeding the next: fetching pages, parsing pages, and indexing pages. For search engines, these steps are broken into individual, independent processes. Some search engine processes are queue based, meaning that they take a record from a queue (a list or database) and process it; some are trigger based, meaning that a trigger event starts the process; and some are batch based, meaning that the process runs against the entire database.
    Pages are fetched from a queue of URLs. Once the page is fetched and stored, a trigger event is set to parse the page. Once the page is parsed, various other processes are triggered, including one that processes links. Each trigger based process is considered real-time. Contrast this with the PageRank algorithm, which is batch based and runs periodically.
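    The fetch, parse, and link steps can be sketched as a single loop. This is a simplification under stated assumptions: a real search engine runs these as separate processes, while here one function calls them in sequence, and the fetch function is passed in so that a small in-memory dictionary (FAKE_WEB, invented for the example) can stand in for the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every a tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seed, fetch):
    """Fetch pages from a queue; each stored page triggers parsing and link processing."""
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)              # returns None when the page cannot be fetched
        if html is None:
            continue
        pages[url] = html              # store step
        parser = LinkExtractor()       # parse step, run once the page is stored
        parser.feed(html)
        for href in parser.hrefs:      # link step, run once the page is parsed
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# A tiny in-memory stand-in for the web, so the sketch runs without a network.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

pages = crawl("https://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

    Starting from the home page alone, the loop ends up with all three pages because each parsed page feeds new target URLs back into the queue.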
    This process is called crawling. It is like a spider that crawls the web. As each page is fetched and parsed, and its link target URLs are added to the queue to be fetched, most pages are discovered very easily. For the remaining pages that have no link pointing to them, the sitemap comes into play. While it is not generally necessary for a site to have a sitemap, it can help the search engine know that it is able to fetch all of the site's pages adequately. Sitemaps are primarily used to audit whether a site can be crawled properly. For any page listed within the sitemap that is not the target of any link, the URL is submitted, as read from the sitemap, to the fetch queue to ensure that the search engine has as many pages as can be fetched from the site.
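    The sitemap audit can be sketched as a simple set comparison: read the URLs from the sitemap, and queue any that no known link points to. The sketch assumes the standard sitemaps.org XML format and uses Python's xml.etree.ElementTree; the function names and the sample data are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org sitemap format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the loc value of every url entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def unlinked_urls(xml_text, link_targets):
    """Return the sitemap URLs that are not the target of any known link."""
    return [url for url in sitemap_urls(xml_text) if url not in link_targets]


SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/orphan</loc></url>
</urlset>"""

known_targets = {"https://example.com/"}          # targets already seen in the link table
to_fetch = unlinked_urls(SITEMAP, known_targets)  # these go to the fetch queue
print(to_fetch)
```

    Only the orphan page is returned, since the home page was already discovered through a link.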
    That is it. It is a simple process that has existed for a very long time and works amazingly well.
    Pages are periodically refetched. This is based upon a networking concept, TTL, meaning Time To Live. It is simply a number representing seconds. For example, 5 minutes is 300 seconds and 24 hours is 86400 seconds. While no one knows what the starting TTL for a web page is, the TTL is adjusted for each page to either a longer or a shorter time period depending upon whether the page changes or not. There is a process to determine whether the page content or only the templated content has changed, with an algorithm to determine which changes are of value. This means that a change to the links in a sidebar may not make a page's TTL shorter, while a change within the content will.
    This is important to know because this is how a search engine determines, in part, a page's freshness. Of course, any new page is also fresh. If a page changes frequently, it is fetched more frequently, using the TTL as a trigger. The shorter the TTL, the more often the page is refetched, parsed, indexed, etc. Each time a page is refetched, the TTL is adjusted to determine how often the page should be fetched. It is the shortening and lengthening of the TTL that allows the page to be fetched appropriately according to how often it changes. There is a maximum TTL. For example, any page that does not change will be checked using the maximum TTL. This allows a search engine to process every page in a timely manner.
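    A minimal sketch of the TTL adjustment might look like the following. The actual starting value, bounds, and adjustment rule are not known; halving on a change and doubling otherwise is purely an assumption for illustration, as are the MIN_TTL and MAX_TTL values and the function names. Hashing only the main content is one simple way to ignore sidebar and template changes.

```python
import hashlib

MIN_TTL = 300          # 5 minutes in seconds (illustrative floor, an assumption)
MAX_TTL = 30 * 86400   # 30 days in seconds (illustrative ceiling, an assumption)


def content_hash(main_content):
    """Fingerprint only the main content, so sidebar or template changes are ignored."""
    return hashlib.sha256(main_content.encode("utf-8")).hexdigest()


def next_ttl(ttl, old_hash, new_hash):
    """Halve the TTL when the content changed, double it otherwise, within the bounds."""
    if new_hash != old_hash:
        return max(MIN_TTL, ttl // 2)
    return min(MAX_TTL, ttl * 2)


ttl = 86400                                   # start at 24 hours
before = content_hash("Original article text")
after = content_hash("Edited article text")

print(next_ttl(ttl, before, after))           # content changed: refetch sooner
print(next_ttl(ttl, before, before))          # content unchanged: refetch later
```

    With a 24 hour starting TTL, a changed page drops to 43200 seconds and an unchanged page rises to 172800 seconds. A page that never changes keeps doubling until it reaches the maximum.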
     
    The freshness TTL exists for each page and will affect how links are found on that page. Pages with shorter TTLs will have their links found more quickly than pages with longer TTLs.
    The reason this is important to this answer is links. More often than not, the pages that are fresh have links to other pages that may also be fresh. Blogs are a prime example of this. Are you getting the picture? These links get submitted to the fetch queue just as before, making link discovery that much faster.
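    To see why a shorter TTL means faster link discovery, the sketch below simulates which pages come due for refetching over a period of time, using Python's heapq module as a simple scheduler. The function name, the URLs, and the TTL values are hypothetical, and each TTL is held fixed here to keep the example short.

```python
import heapq


def refetch_order(ttls, horizon):
    """Simulate which pages come due, in order, over `horizon` seconds."""
    # Each heap entry is (time the page is next due, url).
    heap = [(ttl, url) for url, ttl in ttls.items()]
    heapq.heapify(heap)
    order = []
    while heap and heap[0][0] <= horizon:
        due, url = heapq.heappop(heap)
        order.append((due, url))
        heapq.heappush(heap, (due + ttls[url], url))   # schedule the next refetch
    return order


ttls = {
    "https://example.com/blog": 3600,     # changes often: 1 hour TTL
    "https://example.com/about": 86400,   # rarely changes: 24 hour TTL
}
schedule = refetch_order(ttls, horizon=4 * 3600)
for due, url in schedule:
    print(due, url)
```

    Over four hours the blog page is refetched four times and the about page not at all, so any new link on the blog page reaches the fetch queue within the hour.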