Under The Hood

I’ve recently been doing some ‘under the hood’ tweaking to prwdot.org, to make things run smoother, easier, faster, etc. If you are interested in technical stuff, and the mere mention of ‘code samples’ makes your ears perk up, then you might enjoy reading on to see what I’ve done and to offer comments, suggestions, or questions…

FeedBurner
***FeedBurner|http://www.feedburner.com/*** is a service that ‘enhances’ news feeds. You basically plug in the URL of your own feeds, and FeedBurner provides a URL for other people to use to access your feeds. FeedBurner sits in between your readers and your feeds, and records information on who is reading, what they’re reading with, etc. FeedBurner can also ‘beautify’ your feeds so that they are readable by the average joe if they click on them in a browser, and can massage/alter the feed data in numerous other ways. At this time, however, I’m only using the stats tracking feature.

I have plugged all of our site’s feeds into FeedBurner so that I can see who is using which feeds. So far, the most popular are our summary feed (14 regular subscribers), our comments-only feed (7 regular subscribers), and our fulltext+comments feed (5 regular subscribers). If you haven’t updated your feed URL, don’t worry. You’ll automatically be sent through to FeedBurner whether you want to or not. 🙂 More details on that in the next section…

mod_rewrite
***mod_rewrite|http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html*** is an Apache module that provides a powerful way to take incoming requests and intelligently redirect them to other locations. The ***Gallery|http://gallery.menalto.com/*** software has used this for a while now to allow shorter URLs. For example, ‘gallery.prwdot.org/events_gatherings’ actually refers to ‘gallery.prwdot.org/view_album.php?set_albumName=events_gatherings’ — but you never see that second URL. Movable Type uses it as well, if you’re using dynamic publishing. ***Will|http://pulchersentio.prwdot.org/***’s blog is a great example of that. If you click through to his individual entry archives, you’re actually being sent to a script that pulls the entries out of the database – even though it appears that you’re requesting a .html file. I’ve used mod_rewrite in a number of other places on our site. Here are some examples:

1) Previously, I had a PHP script called feed.php that all of our feeds were processed through – it would take the name of an xml file as an argument, parse and clean the xml, and log information on the feed request to a database. Eventually, I decided that I just wasn’t getting any useful information, and I didn’t like maintaining the system myself. So I ditched the script. Unfortunately, there are still some people who are accessing the old script. A simple redirect doesn’t work, because the type of feed they are requesting doesn’t come in the URI; it comes in the query string. mod_rewrite to the rescue!

ccc|RewriteCond %{QUERY_STRING} blog\.xml
RewriteRule ^feed\.php$ \
    http://feeds.feedburner.com/prwdotsummaries? [L,R=301]|ccc

The first line says ‘look for blog.xml in the query string.’ The second line says ‘if the previous condition was met, then redirect this request to the proper URL’ – using a ‘permanent’ (301) redirect, and stopping any further rule matching here.

2) After ditching the script, I was serving up simple xml files for our feeds. When I discovered FeedBurner, I wanted everyone to switch over to the new feedburner feeds. But I didn’t want to have to announce it to everyone. So I set up mod_rewrite to redirect requests for our xml files to the FeedBurner feeds, which in turn do the actual fetching of the xml files. “But wait!” you may be thinking. “Wouldn’t that cause an infinite loop?” Yes, it might… since after being redirected to FeedBurner, FeedBurner would come back and would itself request the xml file… only to be redirected to… FeedBurner? Yes, this would be a problem. Fortunately, mod_rewrite solves that too. The mod_rewrite rules are set up so that everyone except FeedBurner gets redirected to FeedBurner. The only people who can get at the actual xml files are FeedBurner, and anyone who goes through the trouble to fake a user-agent string from FeedBurner. Frankly, if you’re going through all that trouble, I apologize. I’m not sure the actual xml files are worth your time. 🙂 Anyway, here’s how it looks:

ccc|RewriteCond %{HTTP_USER_AGENT} !feedburner [NC]
RewriteRule ^blog_fulltext_comments\.xml$ \
    http://feeds.feedburner.com/prwdotfulltextcomments? [L,R=301]|ccc

The first line says ‘only match this request if it does not come from the FeedBurner user agent’ (case-insensitive). The second line says ‘if that condition is met, then redirect the request to the proper location’ – again with a permanent redirect, and no further matching.
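If you’re curious whether the user-agent condition is actually doing its job, a quick check is to request the feed with two different user agents and compare the responses. Here’s a small, hypothetical PHP snippet that does just that – the feed URL is one of ours, but the ‘FeedBurner’ user-agent string is only a stand-in, since the real bot’s exact string may differ:

ccc|<?php
// Quick sanity check (hypothetical, not part of the site): request the feed
// with a normal user agent and with one that contains 'feedburner', and
// compare the status lines. The rules above should only redirect the former.

function feed_status($url, $agent) {
    ini_set('user_agent', $agent);   // user agent used by the http:// stream wrapper
    $headers = get_headers($url);    // element 0 is the first status line returned
    return $headers[0];
}

$url = 'http://prwdot.org/blog_fulltext_comments.xml';

echo feed_status($url, 'Mozilla/5.0'), "\n";    // expect: HTTP/1.1 301 Moved Permanently
echo feed_status($url, 'FeedBurner/1.0'), "\n"; // expect: HTTP/1.1 200 OK
?>
|ccc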

3) Our site is controlled by a PHP wrapper script that takes the content of small html files, wraps them in a navigation structure, and then prints them to the browser. It used to be that all of the links on our site pointed directly to this script, for example, prwdot.org/?p=archives/cat_techie. Recently, I wrote a mod_rewrite rule that allows our links to appear to be regular .html documents, but behind the scenes they are still handled by the PHP wrapper:

ccc|RewriteRule ^(.+)\.html$ /?p=$1|ccc

This rule takes any document that ends in .html and sends the document name (minus the .html) to the wrapper script. The user only sees the .html link, which is nice, and it also makes each specific page show up in our web log analysis tools. Previously, only the wrapper script would show up, since the CGI arguments don’t show up in the Apache logs by default. This way, only the original request for the .html file is logged – mod_rewrite doesn’t log the actual underlying script that is being requested.
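For the curious, here’s a rough sketch of how a wrapper script like ours works. This isn’t our actual code – the file names, directory layout, and template includes are made up purely for illustration:

ccc|<?php
// index.php - simplified sketch of a page wrapper (not our actual script).
// The rewrite rule above turns a request for /archives/cat_techie.html into
// /?p=archives/cat_techie, so $_GET['p'] names a small html fragment to wrap.

$page = isset($_GET['p']) ? $_GET['p'] : 'index';

// Only allow simple page names, so nobody can walk out of the content directory.
if (!preg_match('/^[A-Za-z0-9_\/-]+$/', $page)) {
    header('HTTP/1.0 404 Not Found');
    exit('Not found.');
}

$file = './content/' . $page . '.html';

include './templates/header.html';   // banner, navigation, etc.
if (is_file($file)) {
    readfile($file);                 // the small html fragment for this page
} else {
    echo '<p>Sorry, that page could not be found.</p>';
}
include './templates/footer.html';
?>
|ccc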

If anyone has suggestions on how to make these rules better, please let me know. I’ve only just scratched the surface of the mod_rewrite manual, so I’m sure there is more out there.

Web Design Mistakes
Thanks to ***Nikki|http://everytomorrow.org/***, I read a great article called ***The Biggest Web Design Mistakes in 2004|http://www.webpagesthatsuck.com/biggest-web-design-mistakes-in-2004.html***. This article covers many misconceptions held by web designers in an irreverent but hard-hitting manner. Fortunately, I’m not in flagrant violation of many of the article’s points. However, there were a few minor places where I figured I could make some revisions.

qqq|

Remember, nobody gets excited about the tools used to build a house (“Please tell me what brand of hammers you used!”). People get excited about how the house looks and performs.

|qqq

This is from the section “3. Mystical belief in the power of Web Standards, Usability, and tableless CSS.” Now I’m not quite as nutso about all of this stuff as some people are. But it’s true, for a long time, I have made a point to put links on the bottom of every page that link to HTML/CSS/RSS validators, web standards documents, the name of our web hosting and DNS hosting companies, etc. The truth is, nobody cares about this stuff, and nobody clicks on this stuff. Trust me, I had hoped to generate some affiliate revenue from the links to my DNS and web hosting companies, but it just didn’t materialize. There were a few clicks, but no conversions. In any case, I did put that stuff all the way down at the bottom of the page, where most people wouldn’t see it, since I knew that most people didn’t care… but frankly, there’s just not any point to having it there. At all. Even for techies. If you’re a techie, and you want to validate my HTML… you know how to do it. You don’t need a link there encouraging you to do it. If you care what our web hosting company is, you’ll ask. So as of a few days ago, all of those superfluous links are gone. I left the links to our RSS feeds, since people do actually want to see those and they are useful for reading our site, and I also left the link to Movable Type, since I’m obligated to do so because of licensing. But all that other crud (if anyone actually ever noticed it) is gone.

There are a few other minor points where our site goes awry. The photo banner at the top of the page is a link to our homepage, but it doesn’t say so, and there’s no way for the reader to know that, so it technically falls under the category ‘mystery meat navigation’. It’s not the worst violation, because you don’t need to click on it to get to the home page… but it’s there. Also, we have a ‘navigational failure’ due to the fact that there is a link to the home page on the home page… so I’ll try to fix that soon. We could also use a little work on telling our visitors exactly where on our site they currently are (documentation, links, archives, blog, etc…)…

Website Spamming
Website Spamming is the next big thing for spammers. There are numerous ways to spam a website that are barely-legal: sending spam comments to blogging comment scripts, sending spam trackback pings to blog trackback scripts, and, recently, sending spam comments to the ***Gallery|http://gallery.menalto.com/*** photo gallery software. This is a huge problem, and one that has triggered a plethora of solutions. I’ve been through a number of them, but I wanted to share the one that has worked the best for me.

mod_security
***mod_security|http://www.modsecurity.org/*** is an Apache module that provides a layer of security in front of all of your web applications. Before a request gets to Movable Type, or WordPress, or Gallery, or any other page for that matter, it must be handled by mod_security. This module looks at every part of the request, and based on rules that are set up in your Apache configuration, decides what to do with it. Here is what my mod_security configuration looks like:

ccc|

# Only inspect dynamic requests
# (YOU MUST TEST TO MAKE SURE IT WORKS AS EXPECTED)
SecFilterEngine DynamicOnly
SecFilterScanPOST On

# Default action to take when rejecting requests
SecFilterDefaultAction "deny,log,status:403"

# Some sane defaults
SecFilterCheckURLEncoding On
SecFilterCheckCookieFormat Off
SecFilterCheckUnicodeEncoding Off

# Accept almost all byte values
SecFilterForceByteRange 1 255

SecUploadDir /tmp
SecUploadKeepFiles Off

# Only record the interesting stuff
SecAuditEngine RelevantOnly
SecAuditLog logs/audit_log

# You normally won’t need debug logging
SecFilterDebugLevel 0
SecFilterDebugLog logs/modsec_debug_log

# Require Content-Length to be provided with
# every POST request
SecFilterSelective REQUEST_METHOD "^POST$" chain
SecFilterSelective HTTP_Content-Length "^$"

# Don’t accept transfer encodings we know we don’t handle
# (and you don’t need it anyway)
SecFilterSelective HTTP_Transfer-Encoding "!^$"

## COMMENT SPAM PREVENTION ##
# Include blacklist file pulled from Jay Allen’s list
Include conf/blacklist_rules.txt
# Include our own private blacklist file
Include conf/blacklist_private_rules.txt

|ccc

I’m not going to go through every line of this, but I’ll highlight some of the important ones.

‘SecFilterEngine DynamicOnly’ – tells mod_security to scan only dynamic requests. These are the requests that could most likely be exploited by hackers or spammers. Static requests, for things like text files or graphics, aren’t a threat.

‘SecFilterScanPOST On’ – make sure to scan POST requests. This is what happens when a spammer submits a form with bogus information. mod_security will scan every parameter submitted in the form.

‘Include conf/blacklist_rules.txt’

This line tells mod_security to read in an external file containing a list of rules. In my case, this particular file is generated by a script that I wrote which pulls in a master blacklist of sorts (I’ll talk about this next). It could also be a file with your own private list of blacklist keywords. Rules look something like this:

ccc|SecFilterSelective HTTP_Referer|ARGS "01-melodias.com"
SecFilterSelective HTTP_Referer|ARGS "01-ringetone.com"|ccc

The first part of the rule, ‘SecFilterSelective’, is a mod_security directive that looks at a specific part of the request. The second part, ‘HTTP_Referer|ARGS’, tells it to match on content in either the request’s referer, or the request’s arguments. Often, a spammer will set up a bogus referer. Many web sites automatically link to sites that refer to their site, so spammers exploit this to promote their own sites. In addition to the referer, spammers usually include one or more urls in the arguments to their post, so the ARGS variable will catch that. The third part of the rule is the actual string to search for. This can include regular expressions, or simply plain text.

The great thing about mod_security is that it scans requests coming in to all of your scripts, and it doesn’t need to be aware of what specific software you’re running. So whether it’s Movable Type, WordPress, Gallery, Drupal, or whatever – you’ll be protected just the same.

blacklist_to_modsec
I wrote ***blacklist_to_modsec|http://prwdot.org/docs/blacklist_to_modsec.html*** to perform a specific task. It pulls ***Jay Allen|http://www.jayallen.org/***’s ***Master Blacklist|http://www.jayallen.org/comment_spam/blacklist.txt*** file from the web and converts it into rules for mod_security. On our site, I have the script set up to run automatically a few times per day, and pull in updates from Jay’s list. This keeps us fairly well protected from spammers, and anything that does get through can be put in our private blacklist file. Hopefully in the future I’ll have a chance to make it smarter/faster/better… for now, you can visit the ***blacklist_to_modsec|http://prwdot.org/docs/blacklist_to_modsec.html*** docs for more information.
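The conversion itself is conceptually pretty simple. Here’s a rough PHP sketch of the general idea – it is not the actual blacklist_to_modsec code, and the output path is just a placeholder, but it shows how a blacklist pattern becomes a mod_security rule (remember that Apache only re-reads Included files when it is restarted or reloaded):

ccc|<?php
// Rough sketch of the idea behind blacklist_to_modsec (not the real script).
// Fetch the master blacklist and turn each pattern into a SecFilterSelective
// rule, then write the result to the file that Apache Includes.

$lines = file('http://www.jayallen.org/comment_spam/blacklist.txt');
$rules = '';

foreach ($lines as $line) {
    $pattern = trim($line);
    if ($pattern == '' || $pattern[0] == '#') {
        continue;                                // skip blank lines and comments
    }
    $pattern = str_replace('"', '\"', $pattern); // keep the directive quoting intact
    $rules .= 'SecFilterSelective HTTP_Referer|ARGS "' . $pattern . '"' . "\n";
}

// Placeholder path - point this at the file your Apache config Includes.
file_put_contents('/path/to/apache/conf/blacklist_rules.txt', $rules);
?>
|ccc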

I hope you’ve enjoyed reading this under-the-hood look at our site, and please feel free to ask any questions, offer suggestions, or leave comments.
