Web Crawlers
getCITED is "crawled" on a regular basis by the "robots" of some of the
Internet's best-known search engines, including Google and
AltaVista. Thus, entering publication details into
getCITED is an excellent way to disseminate the results of our research. However,
simply entering the citation details of a publication is not enough for it to appear
in searches at the major search engines. To understand why, consider that the "robots" that
gather information for search engines do not actually enter queries when they arrive at a site
like getCITED. Instead, they simply follow whatever links they can find, which, at
getCITED, start with the Reports page. A record that cannot be reached by following links
from that page is never seen, so additional actions are required.
Given the need for a trail of links, the best strategy for getting the content in getCITED
into the search engines is to link it in as many ways as possible. For example, by linking an
identity record to a department and institution with which the robots are familiar, the identity
is made visible to the 'bots. And by linking the identity record to publications to which the
corresponding individual contributed, the publication record is rendered visible. That said, the
best way to make publication records visible is to link them to the publications in their
bibliographies (i.e., create References). A single link to a publication record already
known to the 'bots is enough to make a publication visible. The implication? Take the time
to link the publications you add to at least a few of their references. In addition to
making publications visible, you will also be contributing to the web of links that makes
getCITED so powerful.
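The link-following behaviour described above can be sketched as a breadth-first walk over a link graph. This is only an illustration of the idea, not how any particular search engine's robot is implemented. The page names (`reports`, `identity_1`, `publication_3`, and so on) and the `reachable` helper are hypothetical, and the sketch assumes each link is followed in one direction only.

```python
from collections import deque


def reachable(links, start):
    """Return every page a link-following robot can reach from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "reports": ["institution_a"],
    "institution_a": ["department_x"],
    "department_x": ["identity_1"],
    "identity_1": ["publication_1"],
    "publication_1": ["publication_2"],  # a Reference link
    "publication_3": [],                 # entered, but nothing links to it
}

visible = reachable(links, "reports")
invisible = set(links) - visible
```

In this toy graph `publication_2` is visible only because `publication_1` references it, while `publication_3` stays invisible until some reachable record links to it.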
Why publications do NOT appear
There are a number of reasons why publications and personal information entered into getCITED
do not always appear in searches carried out at the major search engines. Please familiarise
yourself with these before contacting us. The number one reason, as noted above, is a lack of
links. If a record has no links, it is invisible and will never be found by a web crawler. That said,
even if a record does have links, it might still not show up for one or more of the following reasons:
-
More links may be needed
When you create links, keep in mind that the records to which you are linking might themselves be
invisible to the 'bots. You can usually test this by entering a few keywords from the linked
record, along with "getcited", into the search engine in which you want to appear. If the record
doesn't show up, then it's likely that the record you've linked to is currently invisible
(but see below for other explanations). If so, more links may be necessary, in particular
links to heavily cited publications.
-
Every search engine is unique
Search engines are in competition with one another and, thus, are constantly devising new ways
of gathering content in their attempts to outdo one another. For this reason, we can't keep
tabs on how they all work. The one we pay most attention to is
Google, and most of what we have to say here pertains to it.
That said, if you want information to appear in other search engines, you could do the getCITED
community a big favour by simply submitting getCITED's URL to that search engine.
-
Not every page is crawled
Believe it or not, search engines do not crawl every page they find, not even Google.
There are simply too many pages. Thus, they need an algorithm to decide which pages to crawl and, in
Google's case, it comes down to how "important" the site is. If a lot of other pages link
to a site, it's deemed important. Presently, getCITED is deemed reasonably important and, as a
result, quite a few of its pages get crawled. Nevertheless, there are many pages that don't get crawled
simply because getCITED's quota of pages has been met. As far as we know, there's no way of
influencing which pages get left out. However, what members of getCITED can do to improve
this situation is put links to getCITED on their homepages (and anywhere else, for that matter).
This will raise the site's "importance" quotient and, with it, its quota of pages.
-
Pages are crawled intermittently
Web crawlers sometimes take a break and, even when they're working, they cycle through a lengthy list of links
that only brings them back to each page periodically. Thus, it may take weeks and, in some cases, months before
a 'bot crawls a particular page of information. That said, with some search engines this waiting period
can be shortened by creating more links.
-
New pages take time to appear
After a page has been crawled, the information gathered has to be integrated into the "index" that the
search engine uses to generate its results. This is done periodically and, thus, it can take as long as a month
for crawled pages to start appearing in searches. What makes this even more interesting is that some search
engines, such as Google, have so many computers running that a week can pass
between the time the first computer is updated and the time the last one is. Thus, a page can show up in one search
but not in another if your queries happen to arrive at different computers.
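The quota idea from "Not every page is crawled" can be illustrated with a toy model. Search engines do not publish their actual rules, so everything here is an assumption made for illustration: the linear relationship between inbound links and quota, the `pages_per_link` parameter, and the function names are all hypothetical.

```python
def crawl_quota(inbound_links, pages_per_link=10):
    """Toy rule (an assumption): quota grows with the number of
    outside pages that link to the site."""
    return inbound_links * pages_per_link


def crawl(site_pages, inbound_links, pages_per_link=10):
    """Split a site's pages into those crawled and those skipped
    once the quota has been met."""
    quota = crawl_quota(inbound_links, pages_per_link)
    return site_pages[:quota], site_pages[quota:]


# A hypothetical site with 100 pages.
pages = [f"page_{i}" for i in range(100)]

# With few inbound links, most pages are left out.
crawled, skipped = crawl(pages, inbound_links=3)

# More inbound links raise the quota, so fewer pages are skipped.
crawled_more, skipped_more = crawl(pages, inbound_links=8)
```

Whatever the real formula is, the point of the sketch is the same as in the text: members cannot choose which pages are skipped, but adding links to the site from elsewhere raises the quota for everyone.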