| Inktomi Web Search FAQ
Moved to Yahoo! Search Help
Web Search Information
A: Inktomi provides search results to the search sites listed on our portal partner page. The different search portals may also use results from other information sources, so not all of their results come from the Inktomi search database. The different search portals also may apply different selection or ranking constraints to their search requests, so Inktomi results at different portals may not be the same. Check the service descriptions and information at each search service to select the one that works best for you.
A: Visit a portal site using Inktomi web search and do a web search using the full title and some quoted text from the page of interest. Select unique text from the page, so the number of matching pages will be easy to scan. Our search portal partners may present web content results from other sources as well as Inktomi results. Check the result labels and descriptions provided in the portal results lists. The site search boxes at the www.inktomi.com web site do not search the Inktomi database; they only provide a search of content at www.inktomi.com. Do not try a search at www.inktomi.com to search for content in the World Wide Web. If your search topic or web pages include "adult" content, it may not be included in the results from all web portals. Some of our web portal partners always filter out any material that is even faintly suspicious of being adult content. Check the service description and information provided at each search portal to find the one that best suits your needs. Q: How do I get an offensive site/URL removed from your database? A: Inktomi does not automatically remove documents from our database because they offend someone. The database is a reflection of the content of the Internet and is built from billions of documents on millions of web sites around the world. Some of those documents will be personally, politically, or morally objectionable to some people, but others may find those same documents interesting and useful. We want the search database to represent the content of the web in as accurate and usable a form as possible. Some of our web portal partners provide filtering so that their search results will not include potentially offensive material. Inktomi will remove references to content that has been published online illegally. One example of illegal web content is material published in violation of copyright protection. To report such content and get it removed from the search database, see our copyright infringement report page. 
Inktomi does have guidelines for index inclusion or exclusion based on protecting the accuracy and correctness of our search results. Web sites or pages that violate our content guidelines may be removed from the index. Objectionable content that is misrepresented by the web site may violate the content guidelines. Reports of sites that do not follow our guidelines, but still appear in search results, are welcome at reportspam@inktomi.com Q: Whenever I search for my name, I get a list of sites with content which I find offensive. How can I get it removed from Inktomi's index? A: The Inktomi web search database is a reflection of the content of the Internet and is built from billions of documents on millions of web sites around the world. We want the search database to represent the content of the web in as accurate and usable a form as possible. We depend on automated collection of pages from public web sites, and avoid manual intervention or overrides, as such a large database cannot be maintained by manual updates. The search database only reports what has been published elsewhere. If a web site has published your personal information, or incorrect information about you, you should contact the site owner or site content provider to get that information removed or corrected. Personal names are not guaranteed unique, and multiple people may have the same name. Pages published by or about a different person with the same name are not subject to suppression because of coincidental name similarity. After the published content is removed or corrected, the information in the search database will be corrected by our normal content refresh processes. When providing personal data to a web site, be sure to check the site's privacy policy. In some cases, your submission to a site grants them permission to publish that information. Content Information for WebMasters
A: Inktomi Search/Submit is the fast and guaranteed way to get your site content discovered through Inktomi's extensive network of Search/Web partners, including MSN and HotBot. Visit our Web Portal Customers page to see the web portals we work with. Search/Submit includes your most valuable content in the Inktomi index and keeps the content fresh with 48-hour updates for a one-year subscription period. Inktomi Search/Submit is available to content providers that wish to include up to 1,000 URLs on a flat-fee basis. For sites with more than 1,000 URLs we recommend Inktomi Index Connect, which offers a pay-for-performance pricing model. Inktomi guaranteed inclusion programs are available only through our program partners, not direct from Inktomi. The paid inclusion programs provide quick inclusion and guaranteed service intervals, but are not required for search database inclusion. The Inktomi web crawler, "Slurp", will also locate new sites by following links from existing known web pages. So to get a new site added to the search database, just be sure that other web sites publish links to your new site. The crawler will find the links when refreshing the known pages, and add your site to be crawled on future updates. Adding your site to one of the major directory services such as Yahoo! or dmoz is an excellent way to be sure there are published links to your site.
A: Visit a portal site using Inktomi web search. Do a web search using an identifiable phrase from the title or description of your web page, and see whether your URL is in the results. For example, if your web page title is "The Best Frimframs in the World!", use that quoted phrase for your search. If there are too many search results to look through for your URL, search using a longer, more unique phrase from the text on your page. Avoid using a search by URL to check for database content. URL syntax can allow multiple forms of the same URL, and only a single "canonical" form is recognized in the database. The site search boxes at the www.inktomi.com web site do not search the Inktomi database; they only provide a search of content at www.inktomi.com. Do not try a search at www.inktomi.com to search for content in the World Wide Web. If your search topic or web pages include "adult" content, it may not be included in the results from all web portals. Some of our web portal partners always filter out any material that is even faintly suspicious of being adult content. Check the service description and information provided at each search portal to find the one that best suits your needs.
A: Sites that violate the Inktomi content guidelines may be removed from the index. Check the content guidelines and the content policy FAQ to see whether your site and content meet the guidelines. Pages with no unique text or no text at all may drop out of the index or may never be indexed. If you want a page to appear in web search results, be sure that page includes some unique text content to be indexed. Check the short description below for document ranking to look for ways to optimize your search ranking. Q: How do I notify you that I have new pages or updated pages? A: If a page changes, the change will be properly reflected in our database when the page is next crawled and indexed. Slurp revisits known web pages periodically to check that the page is still there and to update the database information about the page content. When you add new pages to an existing site, be sure to build href links from your existing web pages and navigation tools to the new pages. When Slurp refreshes content from the known pages, it will find the links to the new pages and add them to the web database. Slurp cannot always follow dynamic links, so to assure content discovery be sure to include a static link to new content. If your site navigation is normally done only with dynamic links, you can create a site map page with a static link map of your site to be sure robots can discover all of your content. Q: How do I change the abstract shown in the search results for my page? A: When constructing an abstract for the pages it finds, the crawler looks first to the "description" meta tag for the summary. If no description tag exists, the first few hundred characters of the visible text is used to formulate the abstract. To change the abstract, please update the "description" meta tag in your HTML code. Q: How are web documents ranked? 
A: Inktomi search results are ranked based on a combination of how well the page contents match the search query and on how "important" the page is, based on its appearance as a reference in other web pages. The quality of match to the query terms is not just a simple text string match, but a text analysis that examines the relationships and context of the words in the document. The query match considers the full text content of the page and the content of the pages that link to it when determining how well the page matches a query. Q: How do I make my page rank higher in the search results? A: Here are a few tips that can make sure your page can be found by a focused search on the Internet:
Q: I have moved my web pages to a new site. How will you update my URLs to match the new address? A: Our web database is automatically updated without manual intervention and reflects the content of the Web. To add your new pages to our database you should have incoming links to them from your old site, updated references from sites that link to your old site, and new directory listings. Such links will ensure that we discover your new addresses on our next database update cycle. If you require immediate inclusion of the new site, consider our Search Submit program. As long as the old web site continues to serve content, it will still be included in the search database and may rank higher in results than your new site. If you want to ensure that web users are redirected to your new URL, but do not want the old URL to be included in our database, then please place a redirect from the old URL to the new one and block crawler access to the old URL via a RES robots.txt file. On our next database update cycle, our crawler will discover that your old URL is no longer crawlable and it will be removed from the database. Q: Do you index ASP pages and .shtml pages? A: Slurp does index dynamic pages. But for page discovery Slurp mostly follows static links, and we recommend the avoidance of dynamically generated href links except in directories disallowed by a /robots.txt exclusion rule. Beginning July 2003 Slurp will more often follow dynamic links to find web pages, so some areas of servers that were not previously indexed may now be found automatically. Q: Do you index pages that use frames? A: Slurp does index pages that use frames. If a document used as a subframe includes a robots noindex META tag, the index exclusion will be applied to the entire frame document. 
To prevent subframes from being indexed but allow the source page to be indexed, please move those subframes to a directory where you can apply a robots.txt disallow rule to them, and add a rule to your /robots.txt.
A: Yes, Slurp will follow redirects, such as meta refreshes and HTTP 302 "page moved" responses. When following a redirect, Slurp will index the content from the destination URL but display the original URL in results.
Inktomi Indexing Information for WebMasters
Q: What is "Slurp", and why is it accessing my web server?
A: Slurp is Inktomi Corporation's web-indexing robot. It collects documents from the web to build a searchable index for a number of search services on the web. In your server's log files you are seeing our robot as it visited your site. For more information on Slurp and details on how to instruct it not to index your pages, please see the following FAQ items and the Slurp information page at http://help.yahoo.com/l/us/yahoo/search/webcrawler/.
Q: What is "YahooSeeker", and why is it accessing my web server?
A: YahooSeeker is Yahoo! Inc's site-indexing robot for the Yahoo! Shopping service. It collects documents from identified commercial web sites to build a searchable index for the online shopping service at http://shopping.yahoo.com. In your server's log files you are seeing our robot as it visited your site. For more information on YahooSeeker and details on how to instruct it not to index your pages, please see the following FAQ items and the YahooSeeker information page at http://help.yahoo.com/l/us/yahoo/shopping/merchant/.
Q: How do I stop your robot/spider (Slurp) from crawling my site?
A: To stop our robot from indexing your site, or to restrict it to indexing only certain areas of your website, use a robots.txt file according to the RES (Robots Exclusion Standard). The /robots.txt lines to exclude all robots from crawling your site are:
User-Agent: *
Disallow: /
Excluding robots from your entire site means your site information will not be present in web search results and other information services. It is usually best to allow the crawlers to access desired parts of your site and exclude only portions you do not want indexed. Note: The following are example directories. You will need to change them to the correct path/directory on your site.
User-Agent: *
Disallow: /onlinepurchasingpath/
Disallow: /bulletinboardpath/
Disallow: /onlineforumpath/
Disallow: /clientinformationdirectory/
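Before publishing exclusion rules like the ones above, it can help to check how a standards-following parser reads them. The following is a minimal sketch using Python's standard urllib.robotparser module. It is not Slurp's own parser, and the host www.example.com, the ROBOTS_TXT sample, and the helper name is_allowed are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, modeled on the example directories above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /onlinepurchasingpath/
Disallow: /bulletinboardpath/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def report(paths):
    """Print the verdict for each path on the hypothetical example host."""
    for path in paths:
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Slurp", url))
```

Calling report(["/index.html", "/onlinepurchasingpath/cart.html"]) shows which of the two paths the rules leave open to a crawler. A real crawler may interpret edge cases differently from this parser, so treat the result as a sanity check only.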
Slurp also observes robots meta instructions, as described at http://www.robotstxt.org/wc/exclusion.html#meta. Notice that for the robots meta tags to be effective, the page with the tags must be allowed by /robots.txt. If page access is not allowed by /robots.txt, the crawler cannot get the page to read its meta tags. Please see the following document for more information: http://www.inktomi.com/slurp.html
Q. Dozens of your crawlers are beating on my web server. Stop it!
A. Since we crawl billions of pages from the entire web, we use a large number of systems for web crawling and your web server may see contacts from a large number of different Inktomi IP addresses. The different Inktomi crawler systems are coordinated to avoid putting too much load on any single web server. There may be many different Inktomi IP addresses listed in your access logs, but Slurp allows no more than one active connection to a single web server, so only one Inktomi crawler at a time can connect to your server. Slurp sends no more than 60 requests per minute to a single web server, in aggregate across all the crawler systems. Slurp determines a single "web server" by IP address, so if your host is serving multiple IPs it may see higher levels of activity. If Slurp is crawling portions of your site that you do not want indexed, use robots.txt exclusion rules to block that part of your site from the crawler. If the crawler is collecting correct content but asking for pages faster than you like, see the crawler rate question below.
Q. Dozens of your crawlers are logged in reading my forum or bulletin board, and they stay there for hours.
A. Some bulletin board or web forum software packages report a client as "logged in" or "viewing" for some time after the client has read a page from the server. Slurp does not maintain any sort of continuing connection or session to a web server. It only reads individual URLs from a server; each URL read typically takes less than 1/2 second. The application report of a continuing presence by the crawler is based on an assumption that the forum or bulletin board access was made by a person using a browser, who will spend time reading the material from the forum. If possible, check the web server software access logs for a more accurate picture of the actual activity from Slurp. If you would prefer that the forum or bulletin board content not be crawled for inclusion in web search results, use a disallow rule in /robots.txt to exclude the forum or bulletin board directory from crawling. Note: The following are example directories. You will need to change them to the correct path/directory on your site.
User-agent: *
Disallow: /forumpath/
Disallow: /bulletinboardpath/
For full information on /robots.txt see the Robots Exclusion Standard.
Q. Your crawler is hitting my server too hard. Slow down! A. Slurp sends an average of up to 60 requests per minute to a single web server, and uses no more than one active connection to a single web server. We determine a single "web server" by IP address, so if your host is serving multiple IPs it may see higher levels of activity. There is an Inktomi-specific extension to robots.txt which allows you to set a lower limit on our crawler request rate. You can add a "Crawl-delay: xx" instruction, where "xx" is the minimum delay in seconds between successive crawler accesses. Our default crawl-delay value is 1 second. If the crawler rate is a problem for your server, you can set the delay up to 60 or 300 or whatever value is comfortable for your server. Setting a crawl-delay of 20 seconds for Slurp would look something like:
User-agent: Slurp
Disallow: /path-to-be-excluded-from-crawl
Crawl-delay: 20
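To see what a given Crawl-delay value means in practice, the rule can be read back and converted to a request rate. This is a minimal sketch using Python's standard urllib.robotparser module (Python 3.6 or later for crawl_delay). The helper names crawl_delay_for and max_requests_per_minute are assumptions made for illustration, and the arithmetic simply assumes one request per delay interval.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, unchanged.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /path-to-be-excluded-from-crawl
Crawl-delay: 20
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_requests_per_minute(delay_seconds: float) -> float:
    """Upper bound on requests per minute, assuming one request per delay."""
    return 60 / delay_seconds
```

Under that assumption, the default delay of 1 second corresponds to the 60 requests per minute mentioned above, and a delay of 20 seconds reduces the ceiling to 3 requests per minute.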
Q. Why are you ignoring my robots.txt file and accessing my root / document even though I have listed it in robots.txt ? A: Slurp must retrieve the root document from a site for internal use. If you have disallowed "/" in robots.txt then the root document will not be indexed, nor will it be added to our search database, nor links from it followed. In addition, for performance reasons and to reduce the load on your web server, Slurp caches robots.txt files internally. So if you have modified your exclusion rules in robots.txt Slurp might not recognize the change immediately. Q. Why are you ignoring my robots.txt file for a particular directory? A: Check that your /robots.txt file is readable by web clients from the URL "http://mywebsite.com/robots.txt". Verify that the robots.txt syntax is correct per the Robots Exclusion Standard. For performance reasons, and to reduce the load on your web server, Slurp caches robots.txt files internally. So if you have modified your exclusion rules in robots.txt Slurp might not recognize the change immediately. Disallow rules in /robots.txt apply to absolute paths, so the disallow values must begin with a "/" to be effective. Instructions for specific user-agent values apply instead of general user-agent instructions. So if /robots.txt includes instruction lines for "User-agent: Slurp", only those instructions will apply to Slurp. Any instructions for "User-agent: *" will be ignored if a more specific user-agent match exists. Inktomi servers also host the YahooSeeker crawler as well as the Slurp crawler. Exclusion rules for "User-agent: Slurp" will not apply to YahooSeeker. To control YahooSeeker, either add exclusion rules for "User-agent: YahooSeeker" or see "How do I stop Yahoo! from crawling my web site for Yahoo! Product Search?" for exclusion instructions. Correct path information for user home directories can sometimes cause confusion. 
Many servers automatically translate www.yourserver.com/~user to www.yourserver.com/home/user. But these directory identifications are not classed as the same under RES rules, and href links could use either form. So to reliably exclude a user directory you would need a robots.txt file entry for both /~user and /home/user. Symbolic links between directories can also create effective aliases for directories on a site.
Q: How do I have my web site or web pages removed from the Search Engine?
A: Inktomi does not remove sites or pages from the index upon request, as we do not have the means to readily verify the validity and authority of such requests. The search index is built based on billions of documents from millions of web sites, and we cannot depend on manual operations to maintain the index. There are three ways to prevent our crawler from indexing your site or portions of your site:
For more information see: "How do I stop your robot / spider (Slurp) from hitting my site?". Pages can stay listed in our databases for some time after the site content or control documents are changed. Such changes will not take effect in the search database until our information about the site is updated by Slurp. However please be assured that we will pick up your changes during our next refresh cycle. Q. You're still crawling URLs that I removed ages ago. What's wrong with your crawler? A. If your web server has been configured to return a friendly or clever page for a HTTP 404 Page Not Found error, Slurp may not recognize that it has received an error and may continue to request the incorrect URL on a normal refresh cycle. Please use an HTTP 404 error code when returning the custom URL-not-found page rather than a 200 OK status. At least include a robots noindex meta tag in the response page HTML so the error page content is not indexed. Q. Your crawler is asking for strange URLs that have never existed on my site, like /piopio/darkness-halo-bottom-camera.htm. Are you looking on the wrong host? A. Some web servers send a friendly or clever result page in a HTTP 200 OK response instead of a HTTP 404 Not Found result for page-not-found conditions. To check on web server handling of page-not-found conditions, Slurp will occasionally send deliberately odd URLs built from random words to sites from which no 404 results have been seen. These URLs are built intentionally to not match any actual content at the site. We save information on the web server response to requests for non-existent pages so we can correctly recognize and remove obsolete URLs in our search database. A Slurp check for 404 results from a web server consists of requests for up to 10 such URLs. The check for 404 behavior is not a normal part of Slurp site refresh, so such requests will be rare. Q: We've changed our web server's IP address. Will you update the IP address? 
A: Providing your DNS update has propagated around the Internet, we will access the new site correctly on our next crawl cycle. You may see accesses to the old IP address for a short time, as we cache DNS information for improved performance. Q: Why are you "hacking" my site? There are repeated access requests to my site, including port scans. A: Slurp is Inktomi Corporation's web-indexing robot. What you are seeing in your web server logs are not hacking attempts nor a web user repeatedly accessing parts of your site, but normal regular visits by the Slurp robot to your web site. If you wish to restrict the parts of your site that Slurp visits, then we advise use of a Robots Exclusion Standard "robots.txt" file. For more information on RES files, please see: http://www.robotstxt.org/wc/exclusion.html If the machine that we are accessing is no longer running a web server, but you are seeing access attempts on your firewall logs, then it is possible that we have followed a link to it from another web page. These attempts should stop shortly afterwards, when we discover that we cannot access the server. Slurp accesses web sites for which it has followed target links from within other web sites. If during our crawl and index of the web we found URLs which referenced your machine—either on port 80 or any other port that was referenced in a URL—then our crawling system will periodically attempt to connect to your machine on those ports in order to validate that a web server still does run on them. Slurp does not use any sort of probing, network scanning, or port-scanning intrusion techniques to locate web servers. It only makes HTTP "GET" requests to web servers as identified in a published URL. Q: I wish to filter out all Inktomi crawler accesses from my web server logs, what are the IP addresses which your crawlers use. A: Inktomi has a variety of crawling subsystems which span a number of different IP subnets and IP address ranges. 
The common link between all of our crawling systems is that they contain the string "Slurp" in the HTTP User-Agent. We would therefore recommend that you filter based on User-Agent rather than IP address as this ensures that you will filter all current and forthcoming crawlers. Q: Why do you keep asking for documents which don't exist or have been moved? A: Slurp is attempting to re-index them for our database. Now that it knows they no longer exist they will be dropped from our database, and if they have a new location then they will be discovered on our next crawl cycle. Slurp also follows links as published from other web pages, so if those links are incorrect, misspelled or out-of-date Slurp will still request that URL from your server at least once. If your web server has been configured to return a friendly or clever page with no error code for error conditions, Slurp may not recognize that it has received an error for the page request and may continue to request the incorrect URL on a normal refresh cycle. When serving modified error pages, it is best to return the correct HTTP error code, not a "200 OK" status. Q: Can I simulate crawler requests to see what content Slurp reads from my web site? A: The following instructions will allow you to send an HTTP request that mimics the action of Slurp (our web crawler). This simulation of Slurp access requires that you manually send an HTTP request for your web content in a form similar to that used by the crawler when it accesses the content. Normal page access from a web browser does not always show the same results. The manual HTTP request can be sent from a command session using the "telnet" utility provided with your operating system. Or see the note below for other command options. From a UNIX shell or a DOS/Windows command line, enter the command telnet <host> <port>
telnet www.inktomi.com 80
When the telnet session shows that you have established a connection with the HTTP server, type in the following command set to emulate Slurp (copy and paste the lines between the dashed lines):
-----------------------------------------------------------
GET /<path/filename> HTTP/1.0
Host: <domain>
Accept: text/*
User-Agent: Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)
From: slurp@inktomi.com
-----------------------------------------------------------
Type an extra carriage return (CR) to send the request. This will retrieve the document at /<path/filename> of <domain>. For example, to retrieve the main page of the web search division from our home page, the first two lines of the command set would be:
GET /products/web_search/ HTTP/1.0
Host: www.inktomi.com
. . .
The GET command will return the entire document. If you need to retrieve just the server headers and not the whole document, substitute 'GET' with 'HEAD' in the lines above. This substitution will display the status, the server type, the date, the content type of the response, and various other server information.
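The manual telnet session can also be scripted. The sketch below is one possible approach in Python, using only the standard socket module: it builds the same request headers shown above and reads the raw response. The function names build_request, fetch, and status_code are assumptions made for illustration, and the script only imitates the request format; it is not the crawler itself.

```python
import socket

# User-Agent string copied from the command set above.
SLURP_USER_AGENT = (
    "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; "
    "http://www.inktomi.com/slurp.html)"
)


def build_request(host: str, path: str, method: str = "GET") -> bytes:
    """Build an HTTP/1.0 request with the headers shown above."""
    lines = [
        f"{method} {path} HTTP/1.0",
        f"Host: {host}",
        "Accept: text/*",
        f"User-Agent: {SLURP_USER_AGENT}",
        "From: slurp@inktomi.com",
        "",  # the blank line plays the role of the extra carriage return
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str, port: int = 80,
          method: str = "HEAD", timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path, method))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # HTTP/1.0 servers close the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


def status_code(response: bytes) -> int:
    """Extract the numeric status from the first line of the response."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    return int(status_line.split()[1])
```

For example, status_code(fetch("www.inktomi.com", "/products/web_search/")) would report the status the server returns for that page. Requesting a path that does not exist on your own server is a quick way to confirm that a custom error page is sent with a 404 status rather than 200 OK.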
Note: Network session utilities
Q: Is Slurp affected by the Apache/mod_ssl worm (CERT Advisory CA-2002-27)? A: The Inktomi web crawler systems use neither Apache nor mod_ssl and are not subject to the vulnerability described in CERT Advisory CA-2002-27. Access to your web server from Inktomi web crawler systems does not represent an attack on your system. Normal crawler activity can match two of the symptoms mentioned in the advisory:
The crawlers access port 80 a lot, since that is the normal HTTP service port for web pages.
The crawlers also follow URL links and redirects published in public web pages. These links can lead the crawler to attempt access to HTTPS service ports, typically port 443. The crawler uses only normal HTTP, and an HTTP request attempt on a port configured for HTTPS service will cause a server log message like
GET /mod_ssl:error:HTTP-request HTTP/1.0
This access by the crawler is not an attempt to hack the web server nor to access any secured pages.
Q: My Web Search question was not answered here. Who can I contact? A: Please check the Yahoo! Search FAQ. Inktomi technical support is provided under and in accordance with the terms of an appropriate written support agreement. This document is for informational purposes only. |
Copyright © 1996-2004 Inktomi Corporation. All Rights Reserved. Legal Notice | Privacy Policy | Copyright Policy