The website www.forums9.ch consists of about 5000 HTML pages and pictures, written in German. The site describes local cultural events, geography, some artists and so on for a region of about 40'000 inhabitants, 15 kilometres south-west of Zurich. We intended to publish the contents of this website (about 400 MB of data) on a CD-ROM, as an almost 1:1 copy of the online data with minor changes.
One of the requirements was to provide a search function for the users of this CD-ROM. The usual approach is to compute an index over the HTML data, which is then searched when the user enters keywords. For small volumes of data, a direct search of the online data might work as well.
Neither of these methods works for a CD-ROM edition of a website. When the user inserts the CD-ROM into the computer, the autostart option launches the browser and the user can navigate the HTML pages. However, it is not possible to execute any CGI functions, because these are provided only by a web server. Our solution was to provide a search-word index in the form of static HTML pages. This index looks similar to the index of a technical book: an alphabetical list of keywords, each followed by the set of pages on which the keyword is found.
The basic idea was taken from [1], which provides an example program that extracts keywords from HTML pages and sorts them into a so-called inverted index using regular expressions. Our indexing program still rips the HTML pages apart with regular expressions and computes an inverted index as an intermediate result - but it does more.
The resulting index is broken into 26 files, one for each letter of the alphabet; each file contains the keywords starting with that letter. See StichwortQ.htm for an example.
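The splitting step can be sketched as follows. The sample data, output file names and HTML skeleton are illustrative only, not the original program:

```perl
use strict;
use warnings;

# Toy inverted index: keyword => list of pages containing it.
my %inverted_index = (
    'Albis' => ['seite3.htm'],
    'Zug'   => ['seite1.htm', 'seite7.htm'],
    'Zunft' => ['seite2.htm'],
);

# Group the keywords by their initial letter.
my %by_letter;
for my $keyword (sort keys %inverted_index) {
    my $letter = uc substr($keyword, 0, 1);
    push @{ $by_letter{$letter} }, $keyword;
}

# Write one static HTML page per letter, e.g. StichwortA.htm.
for my $letter (sort keys %by_letter) {
    open my $fh, '>', "Stichwort$letter.htm"
        or die "Stichwort$letter.htm: $!";
    print {$fh} "<html><body><h1>$letter</h1>\n";
    for my $keyword (@{ $by_letter{$letter} }) {
        my $links = join ', ',
            map { qq{<a href="$_">$_</a>} } @{ $inverted_index{$keyword} };
        print {$fh} "<p><b>$keyword</b>: $links</p>\n";
    }
    print {$fh} "</body></html>\n";
    close $fh or die "close: $!";
}
```

With the toy data above, this produces StichwortA.htm (containing "Albis") and StichwortZ.htm (containing "Zug" and "Zunft").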
Perl can be told to use the ISO-8859-1 character set, although this is not its default behaviour. Using ISO-8859-1 instead of plain ASCII affects Perl programs in two interesting areas: which characters a regular expression treats as word characters, and how upper- and lowercase conversion handles accented letters.
In fact you can declare both a specific language and a character set. For practical use, declaring the ISO-8859-1 character set together with, for example, Swiss German means you can process all Western European languages, including English texts, correctly.
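A minimal sketch of what this means in practice, assuming the input files are encoded in ISO-8859-1: instead of relying on a plain-ASCII \w, which stops at accented letters, the word pattern can name the Latin-1 letter bytes explicitly. The byte ranges below cover the ISO-8859-1 letters while skipping the two non-letter codes × (0xD7) and ÷ (0xF7); the sample text is illustrative.

```perl
use strict;
use warnings;

# Word pattern for ISO-8859-1 text: ASCII letters plus the Latin-1
# accented letters as byte ranges (0xD7 '×' and 0xF7 '÷' excluded).
my $word_re = qr/[A-Za-z\xC0-\xD6\xD8-\xF6\xF8-\xFF]+/;

# "Zürich liegt südlich", with the umlauts written as Latin-1 bytes.
my $text = "Z\xFCrich liegt s\xFCdlich";

my @words = $text =~ /($word_re)/g;
print scalar(@words), " words\n";    # 3 words
```

A pattern like /(\w+)/ without any locale or character-set declaration would instead split "Zürich" into "Z" and "rich".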
The indexing program loads a stopword list from a configuration file. Keywords found in the stopword list are discarded during creation of the inverted index. Our stopword list contains about 200 words: mostly German articles, all inflected forms of "sein", "haben" and so on, plus typical internet navigation labels such as "Zurück" (back), "Hauptseite" (home page), "Top" and the like.
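Loading and applying the stopword list can be sketched like this; the __DATA__ section stands in for the configuration file, and both word lists are toy examples:

```perl
use strict;
use warnings;

# Read the stopword list (one word per line) into a hash for fast lookup.
my %stopwords;
while (my $line = <DATA>) {
    chomp $line;
    $stopwords{lc $line} = 1 if length $line;
}

# Discard stopwords before they reach the inverted index.
my @candidates = qw(Hauptseite Dorf und Kirche Top);
my @keywords   = grep { !$stopwords{lc $_} } @candidates;
print "@keywords\n";    # Dorf Kirche

__DATA__
und
hauptseite
top
```

Comparing in lowercase means one stopword entry also covers the capitalized form at a sentence start.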
As a first step, a Perl hash was used to collect keyword frequency statistics. The most frequent words from these statistics were placed on the stopword list.
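The frequency-counting step can be sketched as follows; the tag-stripping regular expression is deliberately crude, and the sample HTML is made up:

```perl
use strict;
use warnings;

my %freq;

# Sample input; the real program reads the site's HTML files.
my $html = '<p>Das Dorf liegt am Albis. Das Dorf ist klein.</p>';

# Crude tag removal, then count every word in a hash.
(my $text = $html) =~ s/<[^>]*>/ /g;
$freq{$_}++ for $text =~ /(\w+)/g;

# Print the words, most frequent first.
for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
    print "$word: $freq{$word}\n";
}
```

The words at the top of such a listing are the stopword candidates.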
While analyzing the HTML pages to populate the stopword list, we found a simple rule to recognize nouns in German texts: all German nouns, as well as personal and geographical names, start with an uppercase letter. However, every sentence also starts with an uppercase letter, so the first word of each sentence would be considered a noun as well.
To our amazement, almost every German sentence starts with a word that is already in the stopword list! It was therefore possible to filter nouns and names out of our text without using dictionaries or semantic knowledge.
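The resulting heuristic can be sketched as follows (toy stopword list and sample sentence; for real ISO-8859-1 text the uppercase test would also have to include the accented capitals):

```perl
use strict;
use warnings;

# Toy subset of the stopword list.
my %stopwords = map { $_ => 1 } qw(das der die ist und am);

my $text = 'Das Dorf liegt am Albis und die Kirche ist alt';

# A word counts as a noun or name if it starts with an uppercase letter
# and is not on the stopword list, which catches sentence starts like "Das".
my @nouns;
for my $word (split ' ', $text) {
    next unless $word =~ /^[A-Z]/;
    next if $stopwords{ lc $word };
    push @nouns, $word;
}
print "@nouns\n";    # Dorf Albis Kirche
```

"Das" is capitalized only because it opens the sentence, and the stopword test removes it; the three real nouns survive.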
[1] Scott Guelich, Shishir Gundavaram, Gunther Birznieks: CGI Programming with Perl. O'Reilly.
use strict;
use warnings;

my %GLOBAL_WORDLIST = ();
# Note: the code that fills the word list is not shown.

# Print every keyword and its frequency, one "word;frequency;" line
# each, in alphabetical order.
sub print_global_wordlist
{
    foreach my $word ( sort keys %GLOBAL_WORDLIST )
    {
        my $freq = $GLOBAL_WORDLIST{$word};
        print "$word;$freq;\n";
    }
}