Tracking Visitors

I built a little system for keeping track of visitors to this site. Here's how it works - by way of disclosure, and in case you care... and also in case you have suggestions for improvement!

Background

Many if not most sites track visitors. Why?

To make sure the site is working.
Various technical reasons (browser compatibility, network connectivity, etc.).
For accounting purposes (doesn't apply to a personal 'blog, of course).
To watch where they are coming from. Interesting and [perhaps] useful.
Vanity. ("Yippee, three hits!")

The general information available from a visitor comes from three sources:

TCP/IP socket connection.
- Source IP address (may lead to source domain).
HTTP header
- Browser type (aka "user agent"). Often implies platform and OS, too.
- Referring URL, if any.
- Cookies.
HTTP request
- Session information passed through from previous request.
- Form information entered by user.
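
To make the three sources concrete, here is a minimal sketch in Python of how a CGI-style script might collect them. The function name describe_visitor is made up for illustration; the environment variable names are the standard CGI ones. This is not the code running on this site, just one way the list above could look in practice.

```python
from http.cookies import SimpleCookie


def describe_visitor(environ):
    """Collect the per-request visitor details a CGI script can see.

    `environ` is a mapping of CGI environment variables (os.environ in a
    real CGI script). describe_visitor is a hypothetical helper name.
    """
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    return {
        # From the TCP/IP socket connection.
        "ip": environ.get("REMOTE_ADDR"),
        # From the HTTP header.
        "user_agent": environ.get("HTTP_USER_AGENT"),
        "referrer": environ.get("HTTP_REFERER"),  # header name really is spelled this way
        "cookies": {name: morsel.value for name, morsel in cookies.items()},
        # From the HTTP request itself (GET form data arrives here).
        "query": environ.get("QUERY_STRING", ""),
    }
```

Any of these values may be missing (no referrer, no cookies yet), which is why the sketch uses .get() throughout.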

Keep in mind that HTTP is stateless: a webserver cannot remember anything from one request to the next. The only way to maintain state across requests is by saving information via the user's browser.

For many website purposes it is important to maintain information within a session. (Session is not a technical term, but it means roughly "same visit by same user from same computer".) Temporary cookies or form variables can be used for this purpose.

For other website purposes it is important or desirable to maintain information across sessions. Essentially this means "different visit by same user from same computer". Persistent cookies are the only way to accomplish this.

When passing information from one request to the next via form variables or cookies, there are two basic techniques. One is to pass the information itself through the browser. This has the advantage that the server need not store anything, but it has several disadvantages:

The information is "visible" on the network. This can be addressed by encrypting the information, at some cost of additional overhead.
The information may potentially be modified on the network or by the user. Modifications can be detected via checksums, etc., but they may still impair the information.
The information may be lost entirely if the user clears their cookies, reloads their software, uses another computer, etc.
The information may be large, in which case the fact that it is passed back and forth on each request slows the interactions.
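
As an illustration of detecting modifications, here is a rough Python sketch. Note that it swaps in a keyed hash (HMAC) rather than a plain checksum, on the assumption that a plain checksum could simply be recomputed by whoever changed the value. SECRET_KEY, sign and verify are hypothetical names, and the "|" separator is an arbitrary choice.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; known only to the server


def _mac(value):
    """Keyed hash of the value, as a hex string."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign(value):
    """Return the value with its keyed hash appended, ready to send to the browser."""
    return f"{value}|{_mac(value)}"


def verify(signed):
    """Return the original value, or None if it was altered or is unsigned."""
    value, sep, mac = signed.rpartition("|")
    if not sep:
        return None  # no separator at all, so nothing was signed
    if hmac.compare_digest(mac.encode("utf-8"), _mac(value).encode("utf-8")):
        return value
    return None
```

This only detects tampering. As noted above, the altered information is still unusable, and the value remains readable on the network unless it is also encrypted.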

The other technique is to store the information on the server, and pass a pointer to the information through the browser. This is preferred.
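
A minimal sketch of the pointer approach, in Python. The class name ServerSideStore is invented for illustration, and the in-memory dictionary stands in for whatever file or database a real site would use. Only the opaque token would travel through the browser.

```python
import secrets


class ServerSideStore:
    """Keeps visitor data on the server; the browser only ever holds a token."""

    def __init__(self):
        self._records = {}  # token -> data (a real site would persist this)

    def create(self, data):
        """Store a copy of the data and return the token to hand to the browser."""
        token = secrets.token_hex(16)  # 32 hex characters, hard to guess
        self._records[token] = dict(data)
        return token

    def load(self, token):
        """Return the stored data, or None for an unknown or forged token."""
        return self._records.get(token)
```

The token here is random rather than sequential so that one visitor cannot easily guess another's pointer. That is a design choice of this sketch, not a requirement of the technique.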

Details

I decided to assign each visitor a unique number, and store it in a permanent cookie. (The cookie is named w-uh, if you'd like to check...) This number is used as an index to a small database. In the database I keep track of each new visit, with "a visit" defined as "each time I see the cookie after at least three hours have elapsed since the last visit". I'm currently storing date, time, IP address, and domain, along with a count of visits. I also have another little database where I store referring URLs and their corresponding targets.
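
The visit-counting rule can be sketched roughly as follows, in Python. This is an illustration under assumptions, not the site's actual CGI code: the dictionary stands in for the small database, the function names are made up, the ten-year lifetime is an arbitrary reading of "permanent", and the three hours are measured from the start of the last counted visit.

```python
from datetime import datetime, timedelta
from http.cookies import SimpleCookie

COOKIE_NAME = "w-uh"
VISIT_GAP = timedelta(hours=3)

visitors = {}  # stand-in for the small database: visitor number -> record


def record_request(cookie_header, ip, now):
    """Return (visitor_id, record), counting a new visit when the gap has elapsed."""
    morsel = SimpleCookie(cookie_header or "").get(COOKIE_NAME)
    visitor_id = int(morsel.value) if morsel and morsel.value.isdecimal() else None
    if visitor_id not in visitors:
        # No cookie, or a number we have never issued: treat as a new visitor.
        visitor_id = len(visitors) + 1
        visitors[visitor_id] = {"visits": 1, "last_visit": now, "ip": ip}
    else:
        record = visitors[visitor_id]
        if now - record["last_visit"] >= VISIT_GAP:
            record["visits"] += 1
            record["last_visit"] = now
            record["ip"] = ip
    return visitor_id, visitors[visitor_id]


def cookie_header(visitor_id):
    """Build the Set-Cookie header that stores the visitor number long-term."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = str(visitor_id)
    cookie[COOKIE_NAME]["max-age"] = 10 * 365 * 24 * 3600  # roughly ten years
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie.output()
```

A request one hour after the first leaves the count alone, while one four hours later bumps it to two.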

Because I have the date and time of a user's last visit, I can highlight things which are new since they last visited. I'm currently thinking about the most useful way to do this. Possibilities:

Put a visible token next to each "new" post or article.
Shade "new" posts or articles on the home page.
Build a "what's new" page for returning visitors.
Some combination of the above...
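
Whichever presentation wins, the underlying test is the same comparison of dates. A tiny Python sketch, where mark_new and the (title, published) tuple layout are assumptions made for illustration:

```python
from datetime import datetime


def mark_new(posts, last_visit):
    """Return (title, is_new) pairs for a list of (title, published) tuples.

    A first-time visitor (last_visit is None) gets nothing marked as new.
    """
    return [
        (title, last_visit is not None and published > last_visit)
        for title, published in posts
    ]
```

One wrinkle: the comparison needs the time of the previous visit, so that value has to be read before the current visit overwrites it in the database.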

Stay tuned - I'll let you know what I decide...

Along with mere traffic information, I'd like to have a visitor's email address. That way I can communicate with them to tell them about site updates, ask their opinion, etc. The only way to get a visitor's email address is to ask them for it, and naturally you don't want to pester them. So I decided on the following logic:

If I already have a visitor's email address, I do nothing. It doesn't seem worth verifying.
If I don't have a visitor's email address, after three visits I pop up a window asking them for it. (I assume anyone who hasn't visited at least three times is probably not interested in getting email about the site.) If they give it to me, great, otherwise I assume they don't want to give it to me and store a "not given" status.
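
The logic above boils down to a small decision function. Here is a sketch in Python, assuming each visitor record is a dictionary with hypothetical "email", "email_status" and "visits" fields:

```python
def should_ask_for_email(record):
    """Decide whether to pop up the email request for this visitor."""
    if record.get("email"):
        return False  # already have it; nothing to do
    if record.get("email_status") == "not given":
        return False  # asked once and declined; don't pester
    return record.get("visits", 0) >= 3  # only ask from the third visit on
```

For example, a visitor on their third visit with no stored address would be asked, and one who has already declined would not be asked again.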

The implementation was simple because my site has two entry point CGIs, index.cgi and noframes.cgi, as described in the frames article. These CGIs simply call a common subroutine to manage the cookie stuff.