Using Log File Analysis To Improve Your SEO Performance
By Matthew Edgar · Last Updated: November 11, 2022
To improve your website’s technical SEO performance, you need to understand how Google’s robots are crawling through your website: are the bots finding every page, and what issues are they encountering along the way? One of the best ways to find this information is by reviewing your website’s log files.
The information in the log file can help you answer several key questions to better understand how search engine robots are using and understanding your website. This includes:
- How often do robots crawl the website?
- Which robots crawl the website?
- Are robots finding every file on the website?
- Are robots crawling unnecessary pages (and, if so, how much)?
- How often do robots recrawl old pages?
- What errors do robots encounter?
- Do robots respect instructions on the robots.txt file?
- Do robots use the website’s XML sitemap files?
And more! Once you begin exploring log files, you’ll find plenty more questions to answer.
In this article, we’ll review what log files are and how you can use log files to analyze how Google is crawling your website. The more you dig into your log files, the more opportunities you will find to improve your SEO performance.
- Information Contained in Log Files
- How to Retrieve Log Files
- Log File Analysis Tools
- Parsing Log File Data
Information Contained in Log Files
Access logs contain a record of every file accessed on a website. Within the log file, you can see what files were accessed, when those files were accessed, some information about the user that accessed those files (including the user agent), and where people found the website. Here is an example log file entry for Googlebot.
12.345.67.890 - - [01/Feb/2020:19:57:26 -0700] "GET /some-page HTTP/1.1" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Side note: Log file entries are typically called a “hit”. However, when a bot visits a website, that bot’s hit is usually called a “crawl” instead. Since this article is mainly about using log files to monitor bot activity, I’ll be using the term “crawl”.
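If you want to work with log data outside a dedicated tool, a short script can split an entry like the one above into named fields. Here is a minimal Python sketch that assumes the Apache/Nginx combined log format shown in the example; your host’s format may differ, so the pattern may need adjusting.

```python
import re

# Fields in the combined log format: IP, identity, user, timestamp,
# request line, status code, response size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('12.345.67.890 - - [01/Feb/2020:19:57:26 -0700] '
        '"GET /some-page HTTP/1.1" 301 - "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
if match:
    entry = match.groupdict()
    print(entry["request"], entry["status"], entry["user_agent"])
```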
Files Accessed
Along with the HTML pages on the website, the log file also records all resource files that were loaded, including JavaScript, CSS, and image files. On average, about 70 resource files are loaded on each page, so resource files will make up a substantial portion of any log file. For each file requested, the log files will report the response status code of the file, and that information can help surface any errors visitors or robots may be encountering.
In the example log entry above, the URL of the file accessed is “/some-page”. This URL returns a 301 response code, meaning this URL redirects elsewhere.
User Information / User Agent
The information about the user accessing the files is limited. It will not provide the person’s email address or phone number. Most log files will include an IP address, which can be considered personally identifiable information (or PII) and may be subject to privacy laws. If you are unfamiliar with the laws governing how to handle IP addresses for your organization, consult with a lawyer.
Along with an IP address, the log file can also report the visitor’s country and user agent. The country isn’t typically all that helpful (more useful geographic information can be found in analytics tools), but the user agent is beneficial. The user agent indicates what browser and operating system the visitor or bot was using. Importantly for SEO, it also tells you whether the visitor was a search engine robot.
The user agent in the example log file entry above is Googlebot’s user agent: “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”.
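As a quick illustration, here is a small Python check that flags bot entries by their user-agent string. The list of signatures is only an example, and because user agents can be spoofed, a strict audit would also verify Googlebot through a reverse DNS lookup.

```python
# Illustrative signatures only; user agents can be spoofed, so a strict
# audit would also confirm Googlebot with a reverse DNS lookup.
BOT_SIGNATURES = ("googlebot", "bingbot", "duckduckbot")

def is_search_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(signature in ua for signature in BOT_SIGNATURES)

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_search_bot(ua))  # True
```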
Referrer Information
The log file entry can also indicate where the visitor located the page being visited via the referrer information. However, Googlebot does not send a referrer, so log files will not show where Googlebot found the crawled files. Some browsers and plugins also block this information. For example, here is a record of a person visiting the “/test-page” on a website who was referred to the website from a Google search result.
12.345.67.890 - - [27/Feb/2020:13:41:58 -0700] "GET /test-page HTTP/1.1" 200 24594 "https://www.google.com/" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
How to Retrieve Log Files
For most hosting providers, you can access log files through the website’s control panel. If you can’t locate the log files there, email your hosting company’s support team to ask whether you can access the logs and, if so, how.
As an example, in cPanel, you can search for “access” and locate the “Raw Access” tool in the “Metrics” area. The logs download as a GZ file, so you will need a tool like 7-Zip File Manager to extract the files.
cPanel Access Log – Accessing log files from cPanel
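If you would rather skip a separate extraction tool, Python’s built-in gzip module can read the downloaded file directly. The filename below is hypothetical; cPanel usually names the archive after your domain.

```python
import gzip

# Hypothetical filename; adjust to match the archive your host provides.
with gzip.open("example.com-access.log.gz", "rt", encoding="utf-8", errors="replace") as log:
    first_entry = next(log, None)  # read one entry as a sanity check
    print(first_entry)
```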
Most other popular hosting companies provide similar ways to access log files, though each company will differ slightly in where the logs are located and how much data the access log provides.
Log File Analysis Tools
There are many different tools to help you parse the data contained in your website’s log files. My top three recommendations:
- At the enterprise level, the best option is Botify’s Log Analyzer. This also connects with Botify’s other tools, making for a powerful toolset to identify and fix any issues.
- For websites that have significant organic traffic but are not quite at the enterprise level, JetOctopus’s log file analyzer is a great choice. Their log analyzer provides detailed reporting that can help you identify lots of opportunities to improve how robots crawl the website.
- Finally, http Logs Viewer is a budget-friendly option. This tool is relatively inexpensive (€20, or about $21 US as of this writing) and provides a solid set of reports to help extract information from the log files. It runs on your local computer and can work quite well for websites with less traffic to analyze.
Parsing Log File Data
To see how you can use log files, let’s review three key questions you can answer with a log file.
- How often does Google crawl my website?
- What pages is Googlebot crawling?
- Can bots crawl images, JavaScript, or CSS files?
To keep this article useful for a wider audience, I’ll be using http Logs Viewer for my examples. Even though the examples are shown in http Logs Viewer, the general approach to answering these questions applies across different tools.
1. How often does Google crawl my website?
How To Check This
Step #1: Filter to Googlebot.
Filtering to Googlebot User Agent
Filter the log file view to show only the log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to this field, and then click “Apply Filter”.
Step #2: Report on Hits Each Day.
Report on Hits Each Day
On the “Reports” menu, select “Hits Each Day”.
Note: in this case, hits are essentially synonymous with crawls or visits.
Step #3: Review activity.
Hits Each Day Graph
This will load a graph showing how many hits Googlebot made to the website each day. Over time, you may start to notice patterns of when Googlebot crawls more or less often. You also want to make sure Googlebot is crawling more often after major site changes.
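If you want to reproduce this report outside a log viewer, a rough Python sketch like the one below tallies Googlebot entries by date. It reuses the combined-log-format pattern from the earlier parsing sketch and assumes a plain-text file named access.log; adjust both for your setup.

```python
import re
from collections import Counter
from datetime import datetime

# Same combined-log-format pattern as in the parsing sketch earlier.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

crawls_per_day = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:  # hypothetical filename
    for line in log:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("user_agent"):
            timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            crawls_per_day[timestamp.date()] += 1

# Print a day-by-day tally so rises and drops in crawl activity stand out.
for day, count in sorted(crawls_per_day.items()):
    print(day, count)
```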
Why Check This
Search engine robots don’t crawl websites at a consistent rate. Some websites naturally get crawled more often than others if Google’s bots learn that the website covers topics that are frequently updated. However, even within a website, Google’s bots can decide that certain pages deserve to be crawled more frequently than others based on the nature of the content.
Google’s bots do a good job of understanding which pages to crawl more frequently but the bots can make mistakes. As a result, you need to know how often Google’s bots are crawling each page on your website and make sure it is often enough relative to how often your content is updated. If your website isn’t crawled often enough, you need to make changes to get Google to crawl more often—such as updating your XML sitemap, adding internal links, adding dates to your content to make the update frequency more visible, or updating the text on the page.
2. What pages is Googlebot crawling?
How To Check This
Step #1: Filter to Googlebot.
Filtering to Googlebot User Agent
Filter the log file view to show only the log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to this field, and then click “Apply Filter”.
Step #2: Load page report.
Statistics Menu – Click on “Pages”
Next, go to the “Statistics” menu and select “Pages”. You can then choose a specific date range or use all of the data available in the log file.
Step #3: Review requests.
Report on What Pages Were Hit (Crawled) by Googlebot
Finally, you can review the number of times each page (each file, really, as this will include JavaScript, CSS, and images too) was crawled by Googlebot. Is there anything getting crawled more than it seems like it should? What is missing from the most crawled pages? To make this easier to review, you can export to CSV and open it in Excel.
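For a scripted version of this page report, the sketch below counts Googlebot requests per URL. As before, the log-format pattern and the access.log filename are assumptions you may need to adapt.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

crawls_per_url = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:  # hypothetical filename
    for line in log:
        match = LOG_PATTERN.match(line)
        if not match or "Googlebot" not in match.group("user_agent"):
            continue
        request = match.group("request").split()  # e.g. ["GET", "/some-page", "HTTP/1.1"]
        if len(request) >= 2:
            crawls_per_url[request[1]] += 1

# Most-crawled URLs first; compare this list against the pages you expect to matter.
for url, count in crawls_per_url.most_common(25):
    print(count, url)
```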
Why Check This
After determining the overall crawl volume, you need to know what pages Googlebot is crawling. For starters, you want to know if Google is crawling every page on your website or if they are only crawling a few key pages. Is Googlebot only crawling a few pages and missing out on large sections of your website? Or is Google crawling mostly everything but not crawling a few key pages?
If Google isn’t crawling a page, that page will not rank (or not rank well) in search results. Especially when you are looking at a month’s worth of data in your log file, you should see most of your pages getting crawled at least a few times. If Googlebot isn’t crawling all pages on the website, that could be a sign Googlebot doesn’t think those pages are very important and not worth its time to recrawl. In that case, adding more internal links to the pages that aren’t getting crawled (or external links if those external links are of decent quality) will help make those pages look more important and increase the chances that those pages will be crawled.
Along with asking this question overall, also check crawl volume for any pages that have recently been added to or updated on the website. If Google’s bots are not crawling recently updated or added pages, this can explain why those pages are not ranking in search results. Using the “Request Indexing” feature in Google Search Console may help, and so could resubmitting the XML sitemap with these updated URLs included. But requesting indexing and resubmitting XML sitemaps are hacks to prompt Google to crawl something. Instead of relying on hacks that might get Google to crawl the pages, you need to figure out why bots are ignoring recent updates. Are the bots ignoring the updates because of low-quality content or errors? Or are bots ignoring the content because of poor site architecture, meaning they simply can’t find links to the pages that have been added or updated?
In a similar way, you want to see if Googlebot is crawling older pages you’ve removed from your website. This can help you understand why older pages are still in Google’s index and appearing in search results even if they currently return a 404 not-found error message indicating that page has been removed. Likely, if an old, removed page is still in search results, that is because Google hasn’t crawled that old page and hasn’t seen that the page was removed. One trick that can help here is submitting an XML sitemap full of those removed pages in Google Search Console—this prompts Google to recrawl those old pages and can sometimes speed up the removal of those pages from the index and from search results.
3. Can bots crawl images, JavaScript, or CSS files?
How To Check This
Step #1: Filter to Googlebot.
Filtering to Googlebot User Agent
Filter the log file view to show only the log entries where Googlebot is the user agent. Type “Googlebot” into the “User Agent” field, select “Include” from the dropdown next to this field, and then click “Apply Filter”.
Step #2: Filter to JavaScript, CSS, and images.
Filtering to JavaScript, CSS, and Image Files
Next, we want to filter to review only JavaScript, CSS, and images. In the dropdown next to the “Request” filter, select “Include (Regex)”. Then, in the filter box add in a regular expression to look for these files, such as: js|css|jpeg|jpg|gif|png. Once this is input, click “Apply Filter”.
Step #3: View status codes.
View Status Codes for JavaScript, CSS, and Image Files
Once filtered, you can scan through all the requests to make sure all of the requested JavaScript, CSS, and image files return a 200 response status, indicating they loaded correctly for Googlebot. You can also go to the “Reports” menu, select “Status Codes”, and then select “Status Code (full)” to view a report showing all the status codes encountered on these file types.
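To run the same check in a script, the sketch below filters Googlebot requests to JavaScript, CSS, and image files (using a slightly tighter, extension-anchored version of the regex above) and tallies the status codes it finds. The filename and log format are, again, assumptions.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
# Anchored on the file extension so URLs like "/jsonfeed" don't slip through.
ASSET_PATTERN = re.compile(r'\.(js|css|jpe?g|gif|png)(\?|$)', re.IGNORECASE)

status_counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:  # hypothetical filename
    for line in log:
        match = LOG_PATTERN.match(line)
        if not match or "Googlebot" not in match.group("user_agent"):
            continue
        request = match.group("request").split()
        if len(request) >= 2 and ASSET_PATTERN.search(request[1]):
            status_counts[match.group("status")] += 1

# Anything beyond 200 (or the occasional 304) deserves a closer look.
for status, count in status_counts.most_common():
    print(status, count)
```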
Why Check This
Googlebot doesn’t just read the HTML code and the text of your website. Instead, Google’s bots use a headless browser to see your website the same way a human sees the website: with images, with design, and with functionality. That means Googlebot needs to be able to load your website’s JavaScript, CSS, and image files. If you are blocking them from doing so—for example with a disallow on the robots.txt file—then the bots won’t be able to fully understand the page. That can cause bots to ignore certain pages and keep those pages from appearing in search results.
On a more practical level, image search can be a key means of driving traffic to your website, especially if your content lends itself to people searching for visual information (such as informational sites sharing diagrams or ecommerce websites showing product pictures). If you block Googlebot from seeing images when it crawls your website, you remove your chances of appearing in image searches, which cuts off one potential source of traffic. Plus, images don’t only appear in image-specific searches: they can also be used in featured snippets alongside text, so blocking images removes the possibility of appearing in those features and capturing a would-be visitor’s attention.
How Often Should I Check My Log Files?
How often you need to review your log files depends on how actively you update your website. If you are adding and removing dozens of pages from your website every day, then reviewing log files weekly will help you make sure Googlebot is seeing these changes. If your website is less active, only making a few changes each week or a few changes each month, then a monthly or quarterly review of the log file will probably be enough. However, if you have recently redesigned or restructured your website, you do want to review log files more frequently (daily or weekly) to make sure the search robots are starting to crawl through the updated website to detect all the changes you’ve made.
Of course, how often you should check is also dependent upon how important organic traffic is to your website and to your business. If most of your traffic or revenue comes from organic search, then monitoring log files more often is critical—after all, any problem with a robot crawling your website could lead to significant negative impacts on your company’s bottom line. On the other hand, if your website’s traffic predominantly comes from email, ads, or word of mouth instead of organic search, then reviewing the log file to understand how robots are using your website isn’t as important (however, log file analysis could still help you understand your non-bot users).
Need Help?
If you have any questions about log files or need help analyzing log files to improve your SEO performance, please contact me.