Critical Section

Archive: February 8, 2003


Saturday, 02/08/03 07:15 PM

Today features an interesting ploy by the French. They have proposed that a U.N. peacekeeping force be stationed in Iraq, as a way to prevent Iraqi aggression while averting war. It certainly complicates things, since now there is a credible alternative for U.N. Security Council members to consider (besides doing nothing or attacking Iraq). U.S. Defense Secretary Donald Rumsfeld was not amused, possibly because he found out about the proposal in the media, instead of being contacted directly.

Interesting Tale of Two Stories by Doc Searls on the Linux Journal website. He suggests that Linux's success is an Innovator's Dilemma story - that Microsoft is being "attacked from below" in classic fashion. A lot of interesting ideas - I can't do it all justice - read it!

The Scientist has a great article: Breast Cancer: The Big Picture Emerges [free registration required]. {My little company Aperio makes microscope slide scanners which, among other things, are useful for automated detection of breast tumors.}

Clear Channel, the company which owns over 1,200 radio stations and 37 TV stations, is planning to sell CDs of a live concert recording - at the concert!

Garry Kasparov and Deep Junior drew their match. In six games each won once, and there were four draws. Most observers felt game 5 was the best, in which Deep Junior, playing black, offered a stunning bishop sacrifice and chased Kasparov's king all over the board. Only a clever perpetual check enabled Kasparov to salvage a draw. Isn't it amazing how much depth there is in such a simple game?

Simple Search

Saturday, 02/08/03 08:12 PM

If you found this article interesting, you might also like: Tracking Visitors, and Frames.
This site has a homemade simple search facility. Want to learn about it?

Background

Websites are basically collections of pages with textual information. There may be pictures, animations, graphic elements, etc., but these are all frou-frou; the meat is the text. Websites have evolved two distinct ways of directing traffic among their pages of text information:

Navigation and site maps. The header, sidebar, footer, or whatever else appears on [or next to] each page is a high-level table of contents. A site map, if present, is a page giving a lower-level table of contents. Both organize the content of the site by subject.

Search. A way to find specific words or phrases, wherever they may appear on pages within the site. Essentially an index to the site's content.

Implementing navigation and site maps is easy - just add links! This is the essence and beauty of "the web". But how do you implement search? Do you have to duplicate all the functionality of Google? Uh, no. For most websites a simple search is all that is required. And guess what? I have a simple search to share with you!

{
To see the simple search in action, it is over there on the right. Please try it!
}

Overview

Simple search is a single script written in Korn Shell - something which will run on any Windows, Unix, or Linux webserver. If you're not a script aficionado, hang in there - the concept is directly transferable to any other scripting language like Perl or ASP.

First, how should search work? We need a spec! Here's the desired functionality:

A visitor enters one or more words of text, and clicks "search".

The search script displays a list of links to each page which contains the entered text.

The text for each link to a page is the page title.

The link to each page is accompanied by the date and time of the content.

The link to each page is followed by a brief excerpt from the page's contents. Just enough to give the visitor some flavor of what is on that page.

The links are displayed sorted by relevance within time. As a crude measure of relevance we'll use the number of times the search text appears on the page. (Remember, this is simple search, not Google.)

If no text is entered, the search script responds with a simple form prompting for search text.
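Here is a rough sketch of that last item, just to show the shape of the thing. The action name "search.cgi" and the field name "search" are made up for illustration, and the "searching for" line is only a placeholder for the real search described below.

```shell
# Print a minimal search form. The action "search.cgi" and the field
# name "search" are hypothetical; use whatever your site calls them.
search_form() {
    cat <<'EOF'
<form method="get" action="search.cgi">
Search for: <input type="text" name="search">
<input type="submit" value="search">
</form>
EOF
}

# Show the form when no search text was supplied; otherwise search.
respond() {
    if [ -z "$1" ]; then
        search_form
    else
        echo "searching for: $1"   # placeholder for the real search
    fi
}
```

Calling respond "" prints the form, and calling it with any text falls through to the search.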

Okay, let's have a talk about tweetle beetles. Just kidding. Let's talk about the contents of the website, the stuff we're going to search. Each site is going to have one or more folders [directories] which contain the pages. You need to know which folders you want to search. For example, on this site there are two folders containing all the pages, one named "posts" and one named "articles". Your mileage will vary.

{
Pages of HTML are simply text files. This is one of the really cool things about HTML. So we can deal with them purely as text - even if the text will be interpreted to cause pictures to load, animations to play, graphics to display, etc.
}

There are two things we need to get from each page:

The title of the page. This is usually simply the stuff between <title> and </title>, so it is pretty easy to find.

The date/time of the page. There are two ways to get this. First, it may be that you have a consistent way of putting the date/time inside each page. Second, you can use the modification date of the page's file. The first way is preferable because the posting date/time may be different from the last modification, but the second way works - the main thing is you need a consistent way to get the date/time.

The basic idea is to look through the text of each page trying to match the search words, and if found, we're going to display a link to the page along with its title, its publication date/time, and some text from the page. That's it, really simple.

Details

Great - we know what we want the search results to look like, we know where to search, and we know how to get the title and date/time from each page. We're ready. So here it is, the meat of simple search:

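integer found=0
# Count matching lines in every target file; the trailing "-" also reads
# the echoed search text, which adds a "(standard input)" sentinel line.
echo "$search" | grep -ci "\b$search\b" $target - | grep -v ":0$" |
    sed "s/:/ /" | sort -k2,2nr | while read file hits; do
    # The sentinel sorts last, so reaching it means the list is done.
    if [ "$file" = "(standard" ]; then
        if [ $found -eq 0 ]; then
            echo "No results found"
        fi
        continue
    fi
    found=1
    # Page title: the text between <title> and </title>.
    title="$(grep "<title" "$file" | head -n 1)"
    title="${title#*title>}"
    title="${title%</title*}"
    # Timestamp: taken from the line holding the "Permalink" anchor.
    # The prefix pattern is site-specific; it strips the markup before the date.
    tstamp="$(grep "Permalink" "$file")"
    tstamp="${tstamp#*}"
    tstamp="${tstamp%%<*}"
    echo "<a href=\"$file\">$title</a> ($hits) -"
    echo "$tstamp"
    echo "<blockquote style=\"margin-right: 0px\">"
    # Excerpt: first few paragraphs, tags stripped, cut back to a whole word.
    grep "^<[pP]" "$file" 2>/dev/null |
        head -n $hitlines |
        sed "s/<[^>]*>//g;s/&nbsp;/ /g" |
        cut -c1-$hitchars |
        sed "s/ [^ ]*$//;s/.$/&... /"
    echo "</blockquote>"
done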

Pretty simple, huh? (Well, I think it's simple, you might have run screaming from the room. If so, sorry.) Let's walk through this together, shall we?

The first line (integer found=0) simply initializes a flag so we can report "not found" if we didn't find anything. The second line does the real work. It begins with an echo which pipes the search string into a grep. The grep is really the key to everything, it does the actual searching. There are a few tricks here:

The -i option tells grep to ignore case. This is a simple search, so that's what we want.

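The -c option tells grep to output the name of each file searched together with a count of the lines in that file which contain the search text. That is not quite the number of occurrences (two matches on one line count once), but it is close enough for a crude relevance score. Just what we want!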

The \b passes through the shell's double quotes unchanged, and tells grep to match a word boundary (the edge between a word character and anything else, including the beginning or end of a line). By bracketing the search text ($search) with these we're asking grep to do a word match. This is a simple search, so that's what we want (we don't want "pain" to match "spain").

The target ($target) is a list of all the files you want to search. On my site this is "articles/* posts/*" which means everything in the "articles" and "posts" folders. You can search anywhere you want...

The output from the first grep is piped into a second one, grep -v ":0$". The -c causes grep to append a colon and the #matches to the end of each line, so ":0" will drop any files which don't have any matches.

The sed edits colons to spaces. So each line will have filename, a space, and the #matches.

The sort sorts the lines into descending order by #matches.

The while read sets up a loop, assigning filename and #matches to variables "file" and "hits".
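If you want to play with the counting and sorting steps on their own, here is a small sketch. It is not the script above: the function name rank_pages is made up, it uses /dev/null instead of the echo trick to force the "file:count" form, and it assumes a grep which understands \b (GNU grep does).

```shell
# Print "file hits" for every file with at least one matching line,
# most matching lines first. Usage: rank_pages WORD FILE...
rank_pages() {
    word=$1
    shift
    # Adding /dev/null guarantees grep prints "file:count" even for one file.
    grep -ci "\b$word\b" "$@" /dev/null |
        grep -v ":0$" |                  # drop files with no matching lines
        sed "s/:\([0-9]*\)$/ \1/" |      # turn the final colon into a space
        sort -k2,2nr                     # highest count first
}
```

For example, rank_pages pain articles/* posts/* lists each matching page with its count, best match first.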

If you understand this, you understand everything. If you don't, sorry, but maybe try once more. I promise we'll go faster from here...

Okay, line three is in the body of the loop, performed for each file which has a nonzero number of matches. The if tests for the last line fed to the loop, the "(standard input)" entry created by piping the search string into the grep with echo. That entry always sorts last, so if the "found" variable is still zero when we reach it, we didn't find anything; output "not found".

The three "title" lines pull the value of the page's title out of the file (in this case by isolating the text between <title> and </title>). The three "tstamp" lines pull the date/time of the page out of the file (every page on this website has an anchor named "permalink"; as mentioned above, you might have to do something different to get the date/time, including maybe using the file's date/time of last modification). Then there are two "echo" lines which generate the HTML for the output - the link to the file and the date/time.
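The title trick is handy by itself, so here it is as a stand-alone sketch. The function name page_title is made up, and it assumes the whole title element sits on one line, as it does on my pages.

```shell
# Print the text between <title> and </title> in an HTML file.
# Assumes the opening and closing tags are on the same line.
page_title() {
    line=$(grep "<title" "$1" | head -n 1)
    line=${line#*title>}      # drop everything through "<title>"
    line=${line%</title*}     # drop "</title>" and anything after it
    printf '%s\n' "$line"
}
```

The # expansion removes the shortest matching prefix and the % expansion removes the shortest matching suffix, which leaves just the title text.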

Hey, we're almost done. And we've reached the most interesting part. Remember our spec?

"The link to each page is followed by a brief excerpt from the page's contents. Just enough to give the visitor some flavor of what is on that page."

How the heck do we do this? Well, that's what the next grep is all about. In English, we're taking the first few paragraphs of the page, truncating each line, and stringing them together. Here are the details:

The grep looks for each line of the file which starts with "<p". This finds all the paragraphs.

The head takes just the first few paragraphs (based on $hitlines). I currently have this set to 5 - you can experiment.

The next sed does two things - it drops all HTML tags from the text, and it converts each &nbsp; into a space ("&nbsp;" is a code which means "non-breaking space" in HTML; my pages are full of them).

The cut truncates each line (based on $hitchars). I currently have this set to 45, again, you can experiment.

Finally, the second sed backs up to the last space in the line, and truncates the line there (so we don't end up with partial words after the cut), and then appends "..." to the end.

The result of all this is a relatively brief chunk of text which is somewhat representative of the contents of the page. The preceding and following "echo" lines bracket this text with <blockquote>, which causes it to be indented.
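Here is the excerpt pipeline wrapped up as a function you can try on a single page. The name excerpt and the optional arguments are made up; the defaults of 5 lines and 45 characters are just the values mentioned above.

```shell
# Print a short plain-text excerpt from an HTML page.
# Usage: excerpt FILE [MAXLINES] [MAXCHARS]
excerpt() {
    grep "^<[pP]" "$1" 2>/dev/null |         # lines which start a paragraph
        head -n "${2:-5}" |                  # keep the first few
        sed "s/<[^>]*>//g;s/&nbsp;/ /g" |    # strip tags, fix non-breaking spaces
        cut -c1-"${3:-45}" |                 # truncate each line
        sed "s/ [^ ]*$//;s/.$/&... /"        # back up to a space, append "..."
}
```

One quirk to know about: the last sed always drops the final word of a line, even when the cut didn't chop it in half. For a teaser that seems fine to me.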

That's it! A simple search. We're not competing with Google here, but it works.

P.S. I'm happy to share this code - shoot me email if you're interested.
