It would be easy to assume that whatever the human eye can see on a website, search engines can pick up on too. However, this isn't the case.
We have been told that Googlebot is capable of filling out forms, accepting cookies, and crawling all types of links, but doing so would consume seemingly endless crawling and indexing resources. In practice, therefore, Googlebot obeys only certain commands, ignores forms and cookies, and crawls only links coded with a proper anchor tag and href.
We thought it would be useful to come up with a list of seven items that block Googlebot and other search engine bots from crawling (and indexing) all of your web pages.
1. Location-based Content
Sites with locale-adaptive pages detect a visitor’s IP address and display content based on that location. This is not infallible, however: a visitor’s IP address can resolve to the wrong location, resulting in content the user does not want to see. Googlebot’s default IP address, for example, places it in the San Jose, California area, so Googlebot would only ever see content related to that region. A way around this is to ensure that even if location-based content appears on first entry to a site, any subsequent content is based on links clicked rather than on an IP address. This is one of the hardest organic search barriers to weed out.
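The problem can be sketched in a few lines. This is a hypothetical illustration, not real serving code: the region keys and page strings are invented, but it shows why a crawler arriving from one fixed region only ever sees one variant of the page.

```python
# Hypothetical sketch of locale-adaptive serving keyed on an IP-derived
# region. Region codes and content strings are invented for illustration.
REGION_CONTENT = {
    "us-ca": "Deals for California visitors",
    "gb": "Deals for UK visitors",
}

def page_for_ip_region(region: str) -> str:
    """Return the regional page variant, falling back to a default."""
    return REGION_CONTENT.get(region, "Default deals page")

# A crawler whose IP always resolves to California sees one variant only:
print(page_for_ip_region("us-ca"))  # Deals for California visitors
print(page_for_ip_region("gb"))     # Deals for UK visitors
```

The other regional variants exist, but nothing about the URL exposes them, so a bot that never changes location never requests them.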
2. Cookie-based Content
In order to personalise a site visitor’s experience, cookies are placed on the web browser and preferences such as language are set. For example, if you visit a site and choose to view the content in Italian, a cookie is set and the rest of the pages are served in Italian. The URLs stay the same as when the site was in English, but the content is different. However, content that visitors reach solely via cookies, rather than by clicking a link, is not accessible to search engine bots: when the URL doesn’t change as the content changes, search engines are unable to crawl or rank the alternative versions.
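A minimal sketch of the mechanism, with invented content strings, makes the crawler's blind spot visible: the same URL returns different bodies depending on a cookie, and a cookieless bot can only ever see the default.

```python
# Hypothetical sketch: one URL, two bodies, selected by a language cookie.
# Content strings and the cookie name are invented for illustration.
CONTENT = {"en": "Welcome", "it": "Benvenuto"}

def render(url: str, cookies: dict) -> str:
    """Serve the same URL in whichever language the cookie selects."""
    lang = cookies.get("lang", "en")
    return CONTENT[lang]

# A returning visitor with the cookie sees Italian; a cookieless bot
# fetching the identical URL only ever sees the English default.
print(render("/home", {"lang": "it"}))  # Benvenuto
print(render("/home", {}))              # Welcome
```

Because the Italian version has no URL of its own, there is nothing for a search engine to crawl, index, or rank.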
3. JavaScript Links

Ecommerce sites often code their links using JavaScript onclick events (for example, in mouseover dropdown menus linking to other pages) instead of anchor tags. While that works for humans, Googlebot does not recognise them as crawlable links. For Google, a link is not a link unless it contains both an anchor tag and an href pointing to a specific URL. Anchor text is also desirable, as it establishes the relevance of the page being linked to. Pages linked only in this manner can have indexation problems.
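The distinction can be demonstrated by extracting links the way a crawler does, counting only `<a>` tags with an href. The sample markup below is invented for illustration, and the parser uses only Python's standard library.

```python
# Sketch: link extraction that, like a crawler, only counts <a href="...">.
# The sample markup is invented for illustration.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

html = """
<a href="/dresses">Dresses</a>
<span onclick="window.location='/shoes'">Shoes</span>
"""
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/dresses'] -- the onclick "link" is invisible
```

The onclick navigation works perfectly in a browser, but to a parser looking for anchor tags with hrefs, the Shoes page simply isn't linked.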
4. Hashtag URLs
If AJAX, a form of JavaScript, is used, content can be refreshed without the page reloading, and a hashtag (#) is inserted into the page’s URL. However, hashtag URLs do not always reproduce the intended content on subsequent visits, so the content might not always be what visitors are looking for. While most SEO consultants are aware that indexation issues are inherent in hashtag URLs, it can often be overlooked that this basic element of a site’s URL structure is causing organic search issues.
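The underlying reason is that the fragment after the # never reaches the server, so every hashtag "page" collapses to the same request. The URLs below are invented, but the parsing behaviour is standard:

```python
# Sketch: the fragment after "#" is client-side only, so hashtag "pages"
# collapse to a single crawlable URL. Example URLs are invented.
from urllib.parse import urlparse

a = urlparse("https://example.com/products#winter-coats")
b = urlparse("https://example.com/products#summer-hats")

print(a.path, a.fragment)  # /products winter-coats
print(a.path == b.path)    # True -- both resolve to the same request
```

From the server's (and a crawler's) point of view, both URLs are simply /products; whatever distinct content the fragments trigger exists only in the visitor's browser.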
5. Robots.txt Disallow

Bots are told via the disallow command which content not to crawl, in a plain text file at the root of a site called robots.txt.
Although the disallow command does not itself prevent indexation, it can prevent pages from ranking, because bots are unable to crawl them and so cannot determine each page’s relevance.
If a disallow command accidentally appears in the robots.txt file (when a site redesign is pushed live, for example), search bots can be blocked from crawling the entire site. If you notice a very sudden drop in organic search traffic, a stray disallow in the robots.txt file is one of the first things to check for.
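Python's standard library ships a robots.txt parser, which makes the effect of a stray site-wide disallow easy to see. The rules below are invented, but this is the exact directive that sometimes slips into production:

```python
# Sketch: what a stray site-wide disallow does, checked with the standard
# library's robots.txt parser. The rules and URL are invented examples.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /
"""
parser = RobotFileParser()
parser.parse(rules.splitlines())

# Every URL on the site is now off-limits to compliant crawlers:
print(parser.can_fetch("Googlebot", "https://example.com/any-page"))  # False
```

Running this kind of check against your live robots.txt after each deployment is a cheap way to catch the "redesign pushed live with staging rules" mistake early.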
6. Meta Robots Noindex
Just like disallow commands, noindex attributes can be accidentally pushed live, and they are among the most difficult blockers to discover. The noindex attribute of a URL’s meta robots tag instructs search engine bots not to index that page. It is more powerful than a disallow because it stops indexation outright, and it is applied on a page-by-page basis rather than in a single file that governs the entire site.
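Because the tag hides in each page's source, auditing for it is easiest with a small parser. This sketch, using only the standard library and an invented sample page, flags a meta robots noindex in the head:

```python
# Sketch: spotting a meta robots noindex in a page's source.
# The sample markup is invented for illustration.
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta"
                and attrs.get("name", "").lower() == "robots"
                and "noindex" in attrs.get("content", "").lower()):
            self.noindex = True

page = '<head><meta name="robots" content="noindex, follow"></head>'
checker = NoindexChecker()
checker.feed(page)
print(checker.noindex)  # True -- this page asks not to be indexed
```

Run across a crawl of your own site, a check like this surfaces accidentally noindexed pages far faster than waiting for rankings to disappear.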
7. Incorrect Canonical Tags
Canonical tags are tucked away in source code, so errors can be difficult to detect; if desired pages on your site aren’t being indexed, bad canonical tags may be the culprit. These tags identify which page to index out of multiple identical versions, making them important weapons against duplicate content. All non-canonical pages attribute their link authority (the value that pages linking to them convey) to the canonical URL, and non-canonical pages are not indexed.
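Since the tag is buried in the head of each page, pulling it out programmatically is the practical way to audit it. This sketch, with an invented sample page, extracts the canonical URL so it can be compared against the URL you actually want indexed:

```python
# Sketch: extracting the canonical URL from a page's source so mismatches
# can be audited. The sample markup is invented for illustration.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

page = '<link rel="canonical" href="https://example.com/red-dress">'
finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)  # https://example.com/red-dress
```

If the extracted canonical for a page you want ranked points somewhere else, you have likely found why that page is missing from the index.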