Data Inaccuracies in Indexed Pages Lists
A client recently asked if there was a way to identify which pages on their site were not being indexed. Google, Bing, and Yahoo all have very different systems, and getting a definitive answer from any of them is next to impossible. Anyone who has researched indexed pages in the past 10 years knows that the three engines don't agree on how many pages to index, much less which ones. So we decided to run some tests to see whether we could discern which pages were being indexed by any engine without hand-checking every URL.
The first idea was to use the exportable list of internal links from Google Webmaster Tools to identify pages that had been indexed. The logic was that if internal links were identified from a page, then it was most likely indexed. The accuracy of Google Webmaster Tools has been somewhat inconsistent, so we wanted to test the approach on some smaller sites first.
Our clients have anywhere from 100,000 to 10 million pages, making it hard to pull a list of all indexed pages using a "site:" search, and that was integral to testing how many pages were "accurately indexed." "Accurately" is a loose term here, because "site:" results can be flaky depending on the datacenter, time of day, and other factors.
The Tiny Test Group
We used four sites with very few pages (<200 each) to perform this small test. This is in no way scientific, nor can the data be used for full analysis or correlation; it is just a quick look at how the internal links report might help identify indexed pages. The sites used were a financially based consumer site, a personal blog, a health services site, and a medical surgeon's site.
Please note that once we pulled the internal links data for the original client, we came across a major caveat: only 100,000 pages are included in the downloaded report. So for sites with more than 100,000 indexed pages, this idea won't work, no matter how accurate it is.
The test started by downloading all internal links from Google Webmaster Tools into Excel and de-duplicating them. Then we ran a "site:" search for each domain and downloaded all of the resulting pages into a CSV. The two lists were compared, and the results were dreadful. The only time the two came close was when the site had fewer than 20 pages.
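The comparison step above can also be scripted instead of done by hand in Excel. The sketch below is a minimal illustration, assuming each export has been reduced to a plain list of URLs; the function names and sample URLs are hypothetical, and real exports would need the same normalization (lowercasing, trailing-slash stripping) before set comparison.

```python
def normalize(url):
    """Normalize a URL so trivial variants compare as equal.
    Lowercases and strips whitespace and any trailing slash."""
    return url.strip().lower().rstrip("/")

def possibly_unindexed(internal_links, site_results):
    """Return URLs that appear in the internal-links export but not in
    the "site:" search results -- candidates for unindexed pages."""
    linked = {normalize(u) for u in internal_links}
    indexed = {normalize(u) for u in site_results}
    return sorted(linked - indexed)

# Hypothetical sample data standing in for the two downloaded lists.
internal = [
    "http://example.com/",
    "http://example.com/about/",
    "http://example.com/blog",
]
site = [
    "http://example.com",
    "http://example.com/blog/",
]

print(possibly_unindexed(internal, site))  # → ['http://example.com/about']
```

Set difference makes the direction of the check explicit: a URL with internal links but no "site:" result is the interesting case, while the reverse (indexed but not in the links report) mostly reflects the report's 100,000-row cap and sampling.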
From this test, and the fact that we could not use the tactic for the client anyway, the idea of using listed internal links has been thrown out. It was our best idea for automating such a list. But it did lead us to research the many ways a client or webmaster might look at the number of indexed pages, and with some clear oddities in the data, this test spurred a deeper look into the numbers we were pulling.