Using SWISH-E To Index Your Site
SWISH-Enhanced (SWISH-E) is a fast, powerful, flexible, free, and easy-to-use system for indexing collections of Web pages or other text files. Once your pages are indexed, you can run quick searches against the index file. It is currently installed on CGI101, so if you're a customer, you don't need to install it. If you're not a customer, check with your ISP to see if SWISH-E is already installed; if not, you can download it from SunSITE.
There are three parts to making your web site searchable with SWISH-E. First, you create a configuration file that SWISH-E reads to index your site. Then you index the site. Lastly, you need a CGI that performs the search and returns the results.
Step 1. Create the Config File
To create an index of your pages, SWISH-E reads a configuration file to determine which pages should (or should not) be indexed. You should download (or copy) the following file to your own account:
/home/www/help/swish.conf
Then you'll need to edit it. There are three things that have to be changed:
IndexDir /home/yourusername/public_html
IndexFile /home/yourusername/public_html/swish.index
IndexName "Site Index"
The paths to your web directory need to be corrected: replace
/home/yourusername/public_html
with the actual (full) path to your web files.
Nothing else should need changing unless you want to fine-tune your search engine (for example, by omitting files with certain names). If you read through the config file you'll see the different options, plus help for each one. Any line that starts with a "#" is a comment, and many options are commented out by default.
The sample config file is also set up so that it only indexes .html files. If you want to index other files, for example .txt or .shtml files, you'll need to change the following line near the bottom of the config file:
IndexOnly .html
And add the suffixes you want, for example:
IndexOnly .html .txt .shtml
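The path edits from this step can also be scripted rather than made by hand. The following is only a sketch, assuming the sample config contains the literal placeholder /home/yourusername/public_html as shown above; the function name make_swish_conf and its arguments are hypothetical and not part of SWISH-E.

```shell
#!/bin/sh
# make_swish_conf SAMPLE WEBDIR OUTPUT
# Hypothetical helper: copies SAMPLE to OUTPUT, replacing the placeholder
# web directory with WEBDIR. Assumes WEBDIR contains no '#' or '&'
# characters, since both are special in this sed expression.
make_swish_conf() {
    sample=$1
    webdir=$2
    output=$3
    sed "s#/home/yourusername/public_html#${webdir}#g" "$sample" > "$output"
}

# Example call (all paths are placeholders):
# make_swish_conf /home/www/help/swish.conf "$HOME/public_html" "$HOME/public_html/swish.conf"
```

IndexName and IndexOnly are left alone here, so you would still edit those by hand if you want something other than the defaults.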
Step 2. Index The Site
Once your config file is saved, you'll have to run swish-e to create the index file. This can be done from the Unix command line like so:
/usr/local/bin/swish-e -c /home/yourusername/public_html/swish.conf
If all goes well, the index file will be created at the location specified by the IndexFile directive in the config file. The first time you run this, you'll also want to chmod 644 swish.index to make the index readable by your CGIs.
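If you re-index often, the two commands can be wrapped in a small shell function. This is a minimal sketch, not part of SWISH-E: the name reindex and the SWISH_BIN variable are hypothetical, and it assumes swish-e exits with a non-zero status when indexing fails. It sets permissions on the index file, since that is the file the search CGI in step 3 reads.

```shell
#!/bin/sh
# reindex CONFIG INDEXFILE
# Hypothetical wrapper: runs the indexer with CONFIG and, only if that
# succeeds, makes INDEXFILE world-readable. SWISH_BIN defaults to the
# path used in this article and can be overridden.
reindex() {
    config=$1
    index=$2
    "${SWISH_BIN:-/usr/local/bin/swish-e}" -c "$config" || {
        echo "indexing failed" >&2
        return 1
    }
    chmod 644 "$index"
}

# Example call (paths are placeholders):
# reindex /home/yourusername/public_html/swish.conf /home/yourusername/public_html/swish.index
```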
You'll need to re-index your pages whenever you make changes to them. You can either do this manually every few weeks or so (depending on the frequency of the changes), or you may want to create a cron job to re-index your site nightly. (I recommend this, because it lets you change your pages without worrying about the index.) To set it up in cron, type
crontab -e
to edit your cron file. You'll be put into an editor (whichever is your default, possibly pico or vi). You'll then add the following line:
0 0 * * * /usr/local/bin/swish-e -c /home/yourusername/public_html/swish.conf
Then save the file. This will schedule the indexer to run at midnight every night.
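To pick a time other than midnight, it helps to see how the cron line is put together. The sketch below builds the same kind of entry from a minute, an hour, and a config path; cron_line is a hypothetical helper name, and the swish-e path is the one used in this article.

```shell
#!/bin/sh
# cron_line MINUTE HOUR CONFIG
# Hypothetical helper: prints a crontab entry that runs the indexer once a
# day at HOUR:MINUTE. The five schedule fields are minute, hour, day of
# month, month, and day of week; '*' means "every".
cron_line() {
    printf '%s %s * * * /usr/local/bin/swish-e -c %s\n' "$1" "$2" "$3"
}

# Reproduces the midnight entry shown above
cron_line 0 0 /home/yourusername/public_html/swish.conf
```

For example, cron_line 30 3 followed by your config path would give an entry that runs at 3:30 every morning, which you can paste into the editor that crontab -e opens.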
Step 3. A CGI To Perform Searches
Once you've indexed your site, you can make it searchable by adding the following form to any of your pages:
<form action="search.cgi" method="POST">
Search For Keywords: <input type="text" name="keywords" size=30>
<input type="submit" VALUE="Search">
</form>
And here's the CGI to handle the actual search:
#!/usr/bin/perl
#
# Kira's simple SWISH-E search CGI
#
use CGI::Carp qw(fatalsToBrowser);

# You'll need to change this to the document root of your webspace.
# For personal accounts, change it to /home/yourusername/public_html.
# If you're running your own server, it should be the path to your
# document root for the server, e.g. /home/htdocs
$docroot = '/home/yourusername/public_html';

# This also must be changed; if you've used your personal account path
# above, you should change this to /~youruserid so the webserver can
# properly translate the URL for your files. For non-personal pages,
# just set this to blank.
$prefix = '/~yourusername';

print "Content-type:text/html\n\n";

# customize this section as appropriate for your site
print <<EndHTML;
<html><head><title>Search Results</title></head>
<body>
<h2 align="CENTER">Search Results</h2>
EndHTML

# read and decode the POSTed form data
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $value =~ s/~!/ ~!/g;
    $FORM{$name} = $value;
}

# set the result count first, because the footer prints it even on errors
$ct = 0;

$keystring = $FORM{'keywords'};

# only letters, digits, underscores, hyphens, periods and spaces are
# accepted, which keeps shell metacharacters out of the command below
if ($keystring =~ /^([\w\-\. ]+)$/ ) {
    $topic = $1;
} else {
    # $topic is not set on this branch, so report the rejected input
    # itself, with HTML special characters escaped
    ($bad = $keystring) =~ s/&/&amp;/g;
    $bad =~ s/</&lt;/g;
    $bad =~ s/>/&gt;/g;
    &dienice("Bad keyword: `$bad'. Please don't use commas or non-alphanumeric characters.");
}

@results = `/usr/local/bin/swish-e -w "$topic" -f /home/yourusername/public_html/swish.index`;

foreach $i (@results) {
    # results are returned in the form:
    #   relevance path title filesize
    # separated by spaces.
    # comments start with #, and the last line starts with a ., so we're
    # ignoring those:
    if ($i =~ /^#/ or $i =~ /^\./) {
    # errors start with 'err', so we'll pass those on to dienice:
    } elsif ($i =~ /^err/) {
        $i =~ s/^err/Error/;
        &dienice($i);
    } else {
        ($start, $title, $size) = split(/\"/,$i);
        ($perc, $url) = split(/ /,$start);
        # the relevance score is out of 1000; convert it to a percentage
        $perc = $perc / 1000 * 100;
        $percstr = sprintf("%3.1f%%",$perc);
        # since the "url" returned is really the full unix path to the file,
        # you need to translate this to a proper web url. Change the docroot
        # and prefix variables as described above.
        $url =~ s/^\Q$docroot\E/$prefix/;
        print "<a href=\"$url\">$title</a> - $percstr <br>\n";
        $ct = $ct + 1;
    }
}

if ($ct == 0) {
    print "No results found.<p>\n";
}

&do_footer;

sub do_footer {
    # customize this section as appropriate for your site
    print <<EndFoot;
<p>
$ct results found.<p>
</body>
</html>
EndFoot
}

sub dienice {
    my($msg) = @_;
    print "<h2>Error</h2>\n";
    print $msg;
    &do_footer;
    exit;
}

# the end.
Source code: http://www.cgi101.com/help/search.txt
You may also want to check out SunSITE's collection of other scripts to search-enable your pages.
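The path-to-URL translation is the part of the CGI most likely to need adjusting, so it can help to try it on its own first. The sketch below mirrors the $docroot and $prefix substitution in shell; path_to_url is a hypothetical helper name, and docs/faq.html is a made-up file used only for illustration.

```shell
#!/bin/sh
# path_to_url DOCROOT PREFIX PATH
# If PATH starts with DOCROOT, replaces that leading part with PREFIX;
# otherwise prints PATH unchanged, as the substitution in the CGI would.
path_to_url() {
    docroot=$1
    prefix=$2
    path=$3
    case $path in
        "$docroot"*) printf '%s%s\n' "$prefix" "${path#"$docroot"}" ;;
        *)           printf '%s\n' "$path" ;;
    esac
}

path_to_url /home/yourusername/public_html '/~yourusername' \
    /home/yourusername/public_html/docs/faq.html
# prints /~yourusername/docs/faq.html
```

If the printed URL doesn't load in your browser, the docroot or prefix setting is the first thing to check.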
Spidering
The above example uses SWISH-E's filesystem method of indexing. It can be configured to spider a site instead (using HTTP calls); this is useful if you want to index a remote site. Visit http://sunsite.berkeley.edu/SWISH-E/Manual/spidering.html for instructions on how to do this.
I don't recommend this for indexing your own site, especially if you have bandwidth limitations on your account, because the spider traffic will use up some (or, depending on the size of your site, a lot) of your web traffic quota.