
Indexing and Searching a private Homepage

Introduction

The website www.forums9.ch consists of about 5000 HTML pages and pictures, written in German. It describes local cultural events, geography, some artists, etc. for a region with about 40,000 inhabitants, 15 kilometers southwest of Zurich. We intended to publish the contents of this website (about 400 MBytes of data) on a CD-ROM, using an almost 1:1 copy of the online data with minor changes.

One of the requirements was to provide a search function for users of the CD-ROM. The usual approach is to compute an index over the HTML data, which is then searched when the user enters keywords. For small volumes of data, a direct search of the online data might work as well.

Neither of these methods works for a CD-ROM edition of a website. When the user inserts the CD-ROM, the autostart option launches the browser and the user can navigate the HTML pages. However, it is not possible to execute any CGI functions, because these are provided only by a web server. Our solution was to provide a keyword index in the form of static HTML pages. This index looks similar to those found in technical books: an alphabetical list of keywords, each followed by the pages on which the keyword appears.

The basic idea was taken from [1], which provides an example program that extracts keywords from HTML pages using regular expressions and sorts them into a so-called inverted index. Our indexing program still parses the HTML pages using regular expressions and builds an inverted index as an intermediate result, but it does more.
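
To make the term "inverted index" concrete, here is a minimal sketch of how such an index might be built with regular expressions. It is not the code from [1] or from our indexing program: the sample pages, the function name build_inverted_index and the crude tag stripping are assumptions for illustration, and the ASCII-only pattern ignores umlauts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample pages (file name => HTML content).
# A real indexer would read the pages from disk instead.
my %pages = (
    'kultur.htm'    => '<html><body><h1>Konzert</h1><p>Das Konzert im Park.</p></body></html>',
    'geografie.htm' => '<html><body><p>Der Park liegt am See.</p></body></html>',
);

# Build an inverted index: keyword => { file name => number of occurrences }
sub build_inverted_index {
    my ($pages) = @_;
    my %index;
    foreach my $file (keys %$pages) {
        my $text = $pages->{$file};
        $text =~ s/<[^>]*>/ /g;    # crude tag stripping, assumes well-formed tags
        # ASCII letters only in this sketch; umlauts need locale-aware matching
        foreach my $word ($text =~ /([A-Za-z]+)/g) {
            $index{$word}{$file}++;
        }
    }
    return \%index;
}

my $index = build_inverted_index(\%pages);
foreach my $word (sort keys %$index) {
    print "$word: ", join(", ", sort keys %{ $index->{$word} }), "\n";
}
```

The index maps each word to the pages that contain it, which is the reverse of the natural page-to-words direction and the reason for the name.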

The resulting index is split into 26 files, one for each letter of the alphabet, and each file contains the keywords starting with that letter. See StichwortQ.htm for an example.
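
The split into per-letter pages could be sketched as follows. This is an illustration under assumptions, not our actual generator: the function name render_letter_pages, the HTML layout and the sample index are hypothetical, the file name pattern is extrapolated from StichwortQ.htm, and keywords that do not start with an ASCII letter are simply skipped here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical index: keyword => list of pages containing it
my %index = (
    'Quelle'   => ['geografie.htm'],
    'Quartett' => ['kultur.htm', 'musik.htm'],
    'Park'     => ['geografie.htm'],
);

# Returns a hash reference: file name => HTML text,
# with one entry per initial letter that actually occurs.
sub render_letter_pages {
    my ($index) = @_;
    my %by_letter;
    foreach my $word (sort keys %$index) {
        next unless $word =~ /^([A-Za-z])/;    # sketch: ASCII initials only
        push @{ $by_letter{ uc($1) } }, $word;
    }
    my %files;
    foreach my $letter (sort keys %by_letter) {
        my $html = "<html><body><h1>$letter</h1>\n<dl>\n";
        foreach my $word (@{ $by_letter{$letter} }) {
            my @links;
            foreach my $page (@{ $index->{$word} }) {
                push @links, "<a href=\"$page\">$page</a>";
            }
            $html .= "<dt>$word</dt><dd>" . join(", ", @links) . "</dd>\n";
        }
        $html .= "</dl>\n</body></html>\n";
        $files{"Stichwort$letter.htm"} = $html;
    }
    return \%files;
}

my $files = render_letter_pages(\%index);
print "$_\n" foreach sort keys %$files;
```

Because every page is plain HTML with relative links, the browser can follow them directly from the CD-ROM without a web server.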

Using the ISO-8859-1 Character set in Perl

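Perl can be told to use the ISO-8859-1 character set, although this is not its default behaviour. Using ISO-8859-1 instead of plain ASCII affects Perl programs in two areas, corresponding to the two locale categories set in the code snippet: character classification and case conversion (LC_CTYPE), which decides whether letters such as ä, ö and ü count as word characters, and sort order (LC_COLLATE), which decides where such letters fall when keywords are sorted.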

In fact, you can declare both a specific language and a character set. For practical purposes, however, declaring the ISO-8859-1 character set together with, for example, Swiss German lets you process all Western European languages, including English texts, correctly. See the code snippet at the end of this page for the actual code.

Creating and using a Stopword List

The indexing program loads a stopword list from a configuration file. Keywords that appear on the stopword list are discarded while the inverted index is created. Our stopword list contains about 200 words, mostly German articles, all inflected forms of "sein", "haben" and so on, plus typical internet navigation labels such as "Zurück" (back), "Hauptseite" (home), "Top" and the like.
As a first step, a Perl hash was used to collect keyword frequency statistics. The most frequent words from these statistics were placed on the stopword list.
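
A minimal sketch of this two-step process, frequency counting followed by filtering, might look like this. The function names count_words and most_frequent, the sample words and the cut-off of two candidates are assumptions for illustration. In practice the candidates were reviewed by hand before going onto the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each word occurs in a list of words
sub count_words {
    my (@words) = @_;
    my %freq;
    $freq{$_}++ foreach @words;
    return \%freq;
}

# Return the $n most frequent words, most frequent first (ties alphabetical)
sub most_frequent {
    my ($freq, $n) = @_;
    my @sorted = sort { $freq->{$b} <=> $freq->{$a} or $a cmp $b } keys %$freq;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Hypothetical word stream extracted from some pages
my @words = qw(Der Park Der See Die Der Die Konzert);

my $freq       = count_words(@words);
my @candidates = most_frequent($freq, 2);
print "Stopword candidates: @candidates\n";

# Discard stopwords, keep everything else in original order
my %stop     = map { $_ => 1 } @candidates;
my @keywords = grep { !$stop{$_} } @words;
print "Remaining keywords: @keywords\n";
```

Keeping the stopwords in a hash makes the lookup during indexing a constant-time operation, which matters little for 200 words but keeps the filter simple.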

Keyword filtering - simple heuristic method

While analyzing the HTML pages to populate the stopword list, we found a simple rule for recognizing nouns in German texts. All German nouns, as well as personal and geographical names, start with an uppercase letter. However, every sentence also starts with an uppercase letter, so the first word of each sentence would be treated as a noun as well.
To our surprise, almost every German sentence starts with a word that is already on the stopword list. It was therefore possible to filter nouns and names out of our text without using dictionaries or semantic knowledge.
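
The heuristic can be sketched in a few lines. This is an illustration rather than our production code: the function name extract_nouns and the tiny stopword list are hypothetical, stopwords are assumed to be stored in capitalized form so that sentence-initial words match, and the ASCII-only pattern ignores umlauts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, tiny stopword list; the real list holds about 200 words.
# Entries are capitalized so that they match sentence-initial words.
my %STOPWORDS = map { $_ => 1 } qw(Der Die Das Am Im Ist Und);

# Return all capitalized words that are not stopwords
sub extract_nouns {
    my ($text) = @_;
    my @nouns;
    foreach my $word ($text =~ /([A-Za-z]+)/g) {
        next unless $word =~ /^[A-Z]/;    # nouns and names are capitalized
        next if $STOPWORDS{$word};        # drops capitalized sentence starters
        push @nouns, $word;
    }
    return @nouns;
}

my @nouns = extract_nouns("Der Park liegt am See. Im Sommer ist das Konzert im Park.");
print "@nouns\n";
```

The lowercase words are rejected by the capitalization test, and the sentence starters "Der" and "Im" are rejected by the stopword list, so only nouns remain.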

References

[1] Scott Guelich, Shishir Gundavaram, CGI Programming with Perl


Perl Code Snippet: Using the ISO-8859-1 character set, Swiss German
#!/usr/bin/perl -w
use strict;
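# Character set and language-specific sorting for Swiss German.
# ASCII/English text is processed correctly regardless of these settings.
# Locale names are platform-dependent; setlocale returns undef on failure.
use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "de_CH.ISO8859-1")
    or warn "Locale de_CH.ISO8859-1 not available for LC_CTYPE\n";
setlocale(LC_COLLATE, "de_CH.ISO8859-1")
    or warn "Locale de_CH.ISO8859-1 not available for LC_COLLATE\n";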


Perl Code Snippet: Print keyword frequency statistics for development

# Example: output the keyword frequency list from a Perl hash in alphabetical order.
# This sort uses the specified LC_CTYPE and LC_COLLATE settings (requires "use locale").
# Keywords in this list are already normalized to start with an uppercase letter.

my %GLOBAL_WORDLIST = ();
# Note: code to fill the word list is not shown
sub print_global_wordlist
{
    # One output line per keyword, in the form "keyword;frequency;"
    foreach my $word ( sort keys %GLOBAL_WORDLIST )
    {
        my $freq = $GLOBAL_WORDLIST{$word};
        print "$word;$freq;\n";
    }
}