getCITED   

Web Crawlers

getCITED is "crawled" on a regular basis by the "robots" of some of the Internet's best-known search engines, including Google and AltaVista. Entering publication details into getCITED is therefore an excellent way to disseminate the results of our research. However, simply entering the citation details of a publication is not enough for it to appear in searches at the major search engines. To understand why, consider that the "robots" that gather information for search engines do not actually enter queries when they arrive at a site like getCITED. Instead, they simply follow whatever links they can find, which at getCITED start with the Reports page. A record therefore has to be reachable through a trail of links before a robot can find it.

Given the need for a trail of links, the best strategy for getting the content in getCITED into the search engines is to link it in as many ways as possible. For example, by linking an identity record to a department and institution with which the robots are familiar, the identity is made visible to the 'bots. And by linking the identity record to publications to which the corresponding individual contributed, the publication record is rendered visible. That said, the best way to make publication records visible is to link them to the publications in their bibliographies (i.e., create References). All it takes is one link to a publication record already known to the 'bots in order to make a publication visible. The implication? Take the time to link the publications you add to at least a few of their references. In addition to making publications visible, you will also be contributing to the web of links that makes getCITED so powerful.
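The idea of a "trail of links" can be illustrated with a small sketch. It assumes, purely for illustration, that a robot does nothing more than follow links outward from the Reports page; the record names and the `reachable` helper below are hypothetical and are not part of getCITED or of any search engine.

```python
from collections import deque

def reachable(links, start):
    """Return every page a link-following robot could reach from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical records: each key lists the records it links to.
links = {
    "reports": ["institution"],
    "institution": ["department"],
    "department": ["identity"],
    "identity": ["publication_a"],
    "publication_a": ["known_reference"],
    "publication_b": [],  # entered, but nothing links to it
}

visible = reachable(links, "reports")
print("publication_a" in visible)  # linked via the identity record
print("publication_b" in visible)  # no trail of links leads here
```

In this sketch `publication_a` is found because a chain of links leads to it from the Reports page, while `publication_b` stays invisible until at least one visible record links to it.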

Why publications do NOT appear

There are a number of reasons why publications and personal information entered into getCITED do not always appear in searches carried out at the major search engines. Please familiarise yourself with these before contacting us. The number one reason, as noted above, is a lack of links. If a record has no links, it is invisible and will never be found by a web crawler. That said, a record that does have links might still not show up for one or more of the following reasons:

  • More links may be needed
    When you create links, keep in mind that the records to which you are linking might themselves be invisible to the 'bots. You can usually test this by entering a few keywords from the linked record, along with "getcited", at the search engine in which you want to appear. If the record doesn't show up, it is likely to be invisible at present (but see below for other explanations). In that case more links may be necessary, in particular links to heavily cited publications.
  • Every search engine is unique
    Search engines are in competition with one another and, thus, are constantly devising new ways of gathering content in their attempts to outdo one another. For this reason, we can't keep tabs on how they all work. The one we pay most attention to is Google and most of what we have to say herein pertains to it. That said, if you want information to appear in other search engines, you could do the getCITED community a big favour by simply submitting getCITED's URL to that search engine.
  • Not every page is crawled
    Believe it or not, search engines do not crawl every page they find, not even Google. There are simply too many pages. Thus, they need an algorithm to decide which pages to crawl and, in Google's case, it comes down to how "important" the site is. If many other pages link to a site, it is deemed important. Presently, getCITED is deemed reasonably important and, as a result, quite a few of its pages get crawled. Nevertheless, many pages don't get crawled simply because getCITED's quota of pages has been met. As far as we know, there's no way of influencing which pages get left out. However, members of getCITED can help by putting links to getCITED on their homepages (and anywhere else, for that matter). This will raise the site's "importance" quotient and, with it, its quota of pages.
  • Pages are crawled intermittently
    Web crawlers sometimes take a break and, even when they're working, they cycle through a lengthy list of links that only brings them to each page periodically. Thus, it may take weeks and, in some cases, months before a 'bot crawls a particular page of information. That said, with some search engines this time period can be decreased by creating more links.
  • New pages take time to appear
    After a page has been crawled, the information gathered has to be integrated into the "index" that the search engine uses to generate its results. This is done periodically and, thus, it can take as long as a month for crawled pages to start appearing in searches. What makes this even more interesting is that some search engines, such as Google, have so many computers running that a week can pass between the time the first computer and the last computer get updated. Thus, a page can show up in one search but not in another if your queries happen to arrive at different computers.
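The effect of a page quota can be sketched in the same spirit. The code below assumes a robot that visits pages in the order it discovers them and stops once its quota is used up; real search engines choose pages in ways we do not know, so the visiting order, the `crawl_with_quota` helper, and the page names are all hypothetical.

```python
from collections import deque

def crawl_with_quota(links, start, quota):
    """Follow links from `start`, but stop after `quota` pages are crawled."""
    crawled = []
    seen = {start}
    queue = deque([start])
    while queue and len(crawled) < quota:
        page = queue.popleft()
        crawled.append(page)
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return crawled

# Hypothetical site: every page is linked, yet not every page fits the quota.
site = {
    "reports": ["pub_1", "pub_2", "pub_3"],
    "pub_1": ["pub_4"],
}

small = crawl_with_quota(site, "reports", quota=3)
large = crawl_with_quota(site, "reports", quota=5)
print(small)
print(large)
```

With the smaller quota some linked pages are left out even though a trail of links leads to them; raising the quota, which is what a higher "importance" is described as doing above, lets the same crawl reach all of them.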


Comments? Suggestions? Send them to feedback@getCITED.org.

Copyright © 2000-2006 getCITED Inc. All Rights Reserved.