Google Search Appliance (GSA) - The difference between Crawl and Index and Feeds


Google Search Appliance (GSA) - Getting URLs into your index. Why might you use the "Crawl and Index" vs the "Feed" function?

If you are having some fun setting up a Google Search Appliance (GSA), you might be having some difficulty finding a proper description of the various ways to get documents into its index.

Basically there are two methods to get documents into a GSA:
  1. Crawling and Indexing
    • Start Crawling from the Following URLs
    • Follow and Crawl Only URLs with the Following Patterns
    • Do Not Crawl URLs with the Following Patterns
  2. Feeds
    • Web Feed
    • Content Feed

The main problem I found with Google's help on this topic is that there are a ton of different places to learn how to configure the device to crawl, and they mostly focus on Option 1 (Crawling and Indexing) and not on Feeds.

If you work in a corporate environment, you'll most likely want tight control over what gets indexed. Where I work, we quickly found this to be true.

The scenario I ran into was interesting. My colleagues and I set up the device to crawl and index a lot of our corporate intranet by telling the GSA to start at a set of specific starting URLs, around 70 to begin with. Then we decided to index another intranet site that had two top-level URLs. The problem was that pages under the original 70 URLs contained links pointing to this second site, so we got confused about where the URLs in the index were coming from.


How to Set Up Your Google Search Appliance to Index Only What You Feed into It

Here are the details of what I found. If you set up a feed, there are two ways to define it:

  1. Web Feed
  2. Content Feed


Web Feeds

To define a Web Feed, set the feed's <datasource> tag to "web". The GSA then treats the URLs in the feed as if they had been added through the "Crawl and Index" function: feeds that are named "web" are considered to be Web feeds, and the URLs in them are injected into the regular crawling queue.
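
As a rough illustration, here is a minimal Python sketch that builds a web feed file for a list of URLs. The function name build_web_feed and the example.com URLs are made up for illustration, and the "incremental" feed type and "text/html" mimetype are assumptions; check the Feeds guide for your firmware version before relying on them.

```python
import xml.etree.ElementTree as ET

# DOCTYPE line copied from the content feed example in this post.
DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_web_feed(urls):
    """Return a web feed XML string listing the given URLs (hypothetical helper)."""
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    # Naming the data source "web" is what marks this as a web feed.
    ET.SubElement(header, "datasource").text = "web"
    # Assumption: web feeds are submitted with the "incremental" feed type.
    ET.SubElement(header, "feedtype").text = "incremental"
    group = ET.SubElement(gsafeed, "group")
    for url in urls:
        # Assumption: a mimetype attribute is expected on every record.
        ET.SubElement(group, "record", {"url": url, "mimetype": "text/html"})
    body = ET.tostring(gsafeed, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + DOCTYPE + "\n" + body


if __name__ == "__main__":
    # Hypothetical intranet URLs, for illustration only.
    print(build_web_feed(["http://intranet.example.com/hr/",
                          "http://intranet.example.com/it/"]))
```

Because the records carry only URLs, the appliance still has to crawl each page to get its content, which is why these URLs end up in the regular crawl queue.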

Content Feeds

To define a Content Feed, give the <datasource> a name that does not include the word "web". The feed is then treated separately from the "Crawl and Index" function.


What you can do with a feed

A document that is fed into the device in a content feed can be marked not to be crawled. It can also be tagged with various other attributes, such as meta tags and creation dates.

Example Content Feed XML

Feeds guide: http://code.google.com/intl/fr/apis/searchappliance/documentation/614/feedsguide.html

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <!-- Note that the <datasource> tag below doesn't contain the word "web". This defines it as a Content feed. -->
    <datasource>Human Resources - NewLetters</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://site1.net/CA/newsletters/2011-01-01/execannounceHR.html/" action="add" last-modified="Tue, 18 Oct 2011 10:53:20" mimetype="text/html">
      <metadata>
        <meta name="SiteSectionName" content="HR News Letters"/>
        <meta name="Type" content="Newletter"/>
        <meta name="Original Creation Date" content="Mon, 10 Oct 2001 09:53:20"/>
        <meta name="Last Update Date" content="Tue, 18 Oct 2011 10:53:20"/>
        <meta name="Document Language" content="English"/>
        <meta name="Wealthnet Site Area" content="International"/>
        <meta name="Group Name" content="HR"/>
        <!-- The ampersand must be escaped as &amp; for the XML to be well-formed. -->
        <meta name="Division Name" content="IT&amp;S"/>
        <meta name="Business Line" content="Domestic"/>
        <meta name="Country" content="Canada"/>
      </metadata>
    </record>
  </group>
</gsafeed>
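
A feed file does nothing until it is pushed to the appliance. The sketch below shows one way that push might look in Python. It assumes the appliance accepts feeds as a multipart/form-data POST to port 19900 at the path /xmlfeed, with form fields named feedtype, datasource, and data; those details, the helper name build_feed_request, and the host name are assumptions to verify against the Feeds guide for your firmware. The function only builds the request so you can inspect it before sending.

```python
import urllib.request
import uuid


def build_feed_request(host, datasource, feedtype, feed_xml):
    """Build (but do not send) a multipart POST carrying a feed (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    # Assumption: the appliance expects these three form fields.
    fields = [("feedtype", feedtype), ("datasource", datasource), ("data", feed_xml)]
    parts = []
    for name, value in fields:
        parts.append("--" + boundary)
        parts.append('Content-Disposition: form-data; name="%s"' % name)
        parts.append("")
        parts.append(value)
    parts.append("--" + boundary + "--")
    parts.append("")
    body = "\r\n".join(parts).encode("utf-8")
    # Assumption: feeds are received on port 19900 at /xmlfeed.
    url = "http://%s:19900/xmlfeed" % host
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and data source names, for illustration only.
    req = build_feed_request("gsa.example.com", "hr_news", "incremental", "<gsafeed/>")
    print(req.full_url, len(req.data), "bytes")
    # To actually send it: urllib.request.urlopen(req)
```

Keeping the datasource and feedtype fields in step with the <datasource> and <feedtype> values inside the XML avoids confusion later about which feed owns a URL.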


Removing URLs from the GSA's Index

There are several ways of removing content that was put into your index using a feed. The method used to delete content depends on the kind of feed that was used to input the URLs into the device in the first place.

For content feeds, remove content by performing one of these actions:

  • Resubmit the XML feed file with no record section. The feed is then considered blank, and the device replaces all the URLs from the original feed with nothing, which in essence deletes them.
  • Push the URL as part of an incremental feed, using the "delete" action to remove the content. This is the fastest way to remove content. URLs will be deleted within about 30 minutes.
  • Remove the URL from the feed and perform a full feed. Because a full feed overwrites the earlier feed contents, any URLs that are omitted from the new full feed will be removed from the index. The content is deleted within about 30 minutes.
  • Remove the data source and all of its contents. To remove a data source, log into the Admin Console and open the Crawl and Index > Feeds page. Choose the data source that you want to remove and click Delete. The contents will be deleted within about 30 minutes. The Delete option removes the fed documents from the search appliance index. The feed is then marked Delete in the Admin Console.
  • After deleting a feed, you can remove the feed from the Admin Console Feed Status page by clicking Destroy.
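
The "delete" action from the list above can be sketched as a small Python helper that builds an incremental feed whose records are all marked for deletion. The function name build_delete_feed, the data source name, and the example URL are hypothetical, and the "text/html" mimetype is an assumption. The data source name you pass in should be the one that originally fed the URLs.

```python
import xml.etree.ElementTree as ET

# DOCTYPE line copied from the content feed example in this post.
DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_delete_feed(datasource, urls):
    """Return an incremental feed that marks each URL for deletion (hypothetical helper)."""
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    # An incremental feed only touches the records it lists.
    ET.SubElement(header, "feedtype").text = "incremental"
    group = ET.SubElement(gsafeed, "group")
    for url in urls:
        ET.SubElement(
            group,
            "record",
            # Assumption: a mimetype attribute is expected even on delete records.
            {"url": url, "action": "delete", "mimetype": "text/html"},
        )
    body = ET.tostring(gsafeed, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + DOCTYPE + "\n" + body


if __name__ == "__main__":
    # Hypothetical data source name and URL, for illustration only.
    print(build_delete_feed("hr_news", ["http://intranet.example.com/old-page.html"]))
```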

For web and metadata-and-URL feeds, remove content by performing one of these actions. Note that some of these methods also apply to crawled and indexed documents, because any feed that is defined as a web feed is basically crawled just like any URL that is input into the device via the Crawl and Index function.

  • Resubmit the XML feed file with no record section. The feed is then considered blank, and the device replaces all the URLs from the original feed with nothing, which in essence deletes them.
  • In the feed, at the XML record level for the document, set the action to delete, then re-feed the feed document to the GSA. The action="delete" feature works for content, web, and metadata-and-URL feeds.
  • Remove the URL from the web server. The next time that the URL is crawled, the system will encounter a 404 status code and remove the content from the index.
  • Specify a pattern that removes the URL from the index. For example, add the URL to the Do Not Crawl URLs with the Following Patterns list. The URL is removed the next time that the feeder delete process runs.
Note: If a URL is referenced by more than one feed, you will have to remove it from the feed that owns it. See the Troubleshooting entry Fed Documents Aren't Updated or Removed as Specified in the Feed XML for more information.

Hope this helps!

Luke Morrison

Comments

  1. Really great tips about crawling and indexing.
    I enjoy your post.
    Thanks for sharing.
    1. You are welcome Ketul. It's been a while since I wrote that. What version of Firmware are you running now? I'm sure Google has released a lot of updates between then and now eh?
