Recently I systematically optimized this little site. By way of documentation and in case it is of public interest, here's what I did...
- Conform to standards. Make more people and robots able to "view" the site.
- Reduce file sizes. Make pages load faster.
- Serve a special home page to "robots". Help them find everything easily.
Conform to Standards
HTML is a "loose" language. Just about anything goes. The popular browsers like Internet Explorer and Mozilla will "do the right thing" with all kinds of weird errors. But for maximum compatibility it is best to have pages which are "correct".
The easiest way to make sure your pages are correct is to use an HTML validator. I like Doctor HTML, but there are a bunch out there. You point Doctor HTML at a page, and it tells you what (if anything) is wrong with it. This is a great way to pick up unclosed tags, invalid syntax, etc. - it also verifies links and even checks spelling.
Most browsers and programs don't care about content-encoding, but some do. (The ones that don't pretty much assume U.S. ASCII is in use.) The easiest way to take care of this is simply to specify the encoding in a META tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=US-ASCII">
If you have templates for your pages, put this in the template and you're done.
Finally, if you're a heavy user of CSS, be sure to test the CSS you're using on all the browsers with which you want to be compatible. I test with Internet Explorer, Mozilla, and Opera on Windows; Internet Explorer, Mozilla, and Safari on the Mac; and Mozilla on Linux. Even though your CSS may be "valid", it may not be interpreted the way you want by all browsers. This is one reason I've stuck to frames and tables: they've been around so long that pretty much all browsers treat them the same way.
Reduce File Sizes
Everyone's browsing experience will improve if you reduce file sizes, especially people with slower connections to the Internet. It also lets your site serve more people concurrently with the same amount of bandwidth. There is nothing you can do that is better for your visitors (except give them interesting content!).
Reducing file sizes breaks down into two activities: reducing image sizes and reducing page sizes.
Reducing Image Sizes
Image sizes are a function of three things - the pixel dimensions of the image, the type of image, and the compression ratio. You should never make images any bigger than they have to be. If you have a really big image that genuinely must be big, put a thumbnail in the page's content and link it to the full-size image in a new window. Any image bigger than 200 x 200 pixels is a candidate for shrinkage or thumbnailing.
There are two kinds of images in wide use on the web: GIFs and JPEGs. GIFs are best for images with a small number of colors and well-defined borders - cartoons, diagrams, flow charts, etc. JPEGs are best for images with gradients of colors and smooth transitions - mainly photographs. The coolest tool for shrinking images is Adobe Photoshop's "Save for the Web" feature. This allows you to take any image and try "what if" scenarios with file format and compression ratio. In addition, when Photoshop saves for the web it optimizes image headers, storing only the minimum information required, and enables progressive rendering, allowing larger images to be displayed incrementally as the browser receives data. There are other tools which have similar capabilities, but Photoshop is the leader.
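If you don't have Photoshop, the same idea is easy to script. Here's a minimal sketch using the Pillow library for Python (my choice for illustration, not something the site depends on); the filenames and the 75% quality setting are just examples:

# Illustrative sketch using Pillow (pip install Pillow); filenames are hypothetical.
from PIL import Image

photo = Image.open("vacation.jpg")        # the big original
thumb = photo.copy()
thumb.thumbnail((200, 200))               # shrink in place, preserving aspect ratio
thumb.save("vacation_thumb.jpg", "JPEG",
           quality=75,                    # the compression-ratio tradeoff
           optimize=True,                 # store only the minimum header info
           progressive=True)              # render incrementally as data arrives

The thumbnail then goes in the page, linked to the full-size original, just as described above.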
Reducing Page Sizes
HTML pages are plain text, so making them smaller is pretty tough. Of course it is always better to use fewer words if you can, "brevity is the soul of wit" and all that, but that won't really make your pages smaller.
The best thing you can do to reduce HTML page sizes is to implement GZIP compression. Each page is compressed before it goes out over the network and decompressed by the browser. Typically this reduces file sizes by about 50%. All modern browsers say they support compression and do, but many robots do not; if the client does not support compression, the server automatically sends an uncompressed page. There is really no downside to implementing this - do it!
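If you want to see the savings for yourself before touching the server, a tiny sketch like this (Python, purely illustrative - point it at any saved copy of one of your pages) shows the ratio gzip achieves on typical HTML:

# Gzip a saved page and compare sizes - roughly what server and browser do per request.
import gzip

with open("index.html", "rb") as f:        # any saved HTML page
    page = f.read()

compressed = gzip.compress(page)
print(f"original:   {len(page)} bytes")
print(f"compressed: {len(compressed)} bytes "
      f"({100 * len(compressed) // len(page)}% of original)")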
If you're using Apache, the way to implement compression is via mod_gzip. There are many parameters for mod_gzip; I found this page to be very helpful. I use the following directives in my HTTPD.CONF file:
# in LoadModule section, should be last
LoadModule gzip_module modules/mod_gzip.so

# in AddModule section, should be last
AddModule mod_gzip.c

<IfModule mod_gzip.c>
  # enable mod_gzip
  mod_gzip_on Yes
  # status URL
  mod_gzip_command_version '/mod_gzip_status'
  # minimum file size to compress
  mod_gzip_minimum_file_size 500
  # maximum file size to compress
  mod_gzip_maximum_file_size 500000
  # maximum file size to compress in memory
  mod_gzip_maximum_inmem_size 100000
  # require HTTP/1.0 for compression
  mod_gzip_min_http 1000
  # use compression for GET or POST
  mod_gzip_handle_methods GET POST
  # compress HTML files
  mod_gzip_item_include file .html$
  # compress CGI output
  mod_gzip_item_include file .cgi$
  # don't compress nph CGI output
  mod_gzip_item_exclude file nph-.*.cgi$
  # don't compress CSS files
  mod_gzip_item_exclude file .css$
  # compress any text types
  mod_gzip_item_include mime ^text/
  # don't compress any image types
  mod_gzip_item_exclude mime ^image/
  # include header size in statistics
  mod_gzip_add_header_count Yes
  # correctly handle chunked output
  mod_gzip_dechunk Yes
  # send a Vary header so caches keep compressed and uncompressed copies straight
  mod_gzip_send_vary Yes
</IfModule>
If you're using IIS, the way to implement compression is via the Web Service property sheet. Microsoft has a good description of how to do this on their website. They are cautious about recommending page compression for CPU utilization reasons, but in my experience it is always beneficial; most of the time your webserver runs out of bandwidth long before it runs out of CPU cycles. This page also has good information about configuring IIS for compression.
After you get compression configured, you can test it using this site. Very handy.
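Or test it yourself with any HTTP client that sends an Accept-Encoding header; here's a quick sketch (the URL is a placeholder):

# Ask for gzip and see whether the server obliges; URL is a placeholder.
import urllib.request

req = urllib.request.Request("http://www.example.com/",
                             headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    encoding = resp.headers.get("Content-Encoding", "(none)")
    print("Content-Encoding:", encoding)   # "gzip" means compression is working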
Serve a Special Home Page to "Robots"
I don't know about you, but I've found that "robots" make up a good deal of the traffic to my site. These robots can be search engine spiders, various indexing tools like Technorati, or analysis tools. There are also tons of RSS aggregators out there, and although they load your site's RSS feed first, many of them come back and get page data, too.
So - I have my website set up to look at the HTTP_USER_AGENT header, and if the client is a robot I serve a different home page. This serves several purposes:
- Robots are not interested in visual presentation, so you can eliminate images, tables, styles, etc. (And if you're using them, you can eliminate frames, too!) This makes the page smaller and also avoids confusing the robot.
- Robots are interested in your links. My "normal" home page has links as part of the articles posted there, but all the navigation links are on a separate page served as the navigation bar. And this doesn't have all the links, either, because I have an "extended blogroll" of sites I like. So for robots I serve a page which has the home page content, all the navigation bar links, and the extended blogroll. This gives them all the links in one place.
How do you tell if you're dealing with a robot? Well, if the agent string doesn't start with "Mozilla" or "Opera", it's a robot. (For historical reasons all versions of Netscape and Internet Explorer have always used "Mozilla" in their user agent strings.) If it starts with "Mozilla" it might still be a robot pretending to be a browser; I check for two common cases, "Slurp" (Inktomi's spider) and "Teoma" (Ask Jeeves / Teoma's spider). There are others, but this will get you 99% of the robots.
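Here's roughly what that check looks like, sketched in Python against the CGI environment (the helper name and page filenames are mine, purely for illustration):

import os

def is_robot(user_agent):
    # No agent string at all is almost certainly a robot
    if not user_agent:
        return True
    # Real browsers identify as "Mozilla" or "Opera" for historical reasons
    if not (user_agent.startswith("Mozilla") or user_agent.startswith("Opera")):
        return True
    # Some spiders pretend to be browsers; catch the two common ones
    return "Slurp" in user_agent or "Teoma" in user_agent

# In a CGI script the agent string arrives in the environment
agent = os.environ.get("HTTP_USER_AGENT", "")
home_page = "home-robots.html" if is_robot(agent) else "home.html"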
Some handheld browsers report a non-Mozilla user agent, like Handspring's Blazer and AvantGo. This is a good thing; the robot version of the home page is perfect for a handheld (no graphics, straightforward layout, all links present, etc.). For this reason it is better to put links at the bottom than the top; nobody wants to see your blogroll before your content.
It was a little more work, but it's nice to keep the robots happy :)
[Update 1/1/23: no more special page for Robots. This was a lot of work, but ultimately not needed anymore, if it ever was.]