Chopping logs

PC Pro logo Posted: 4th April 2000 | Filed under: Press Articles, Technical
Author: Paul Ockenden
First Appeared in PC Pro 2000

We received an interesting email from Arturo of Bristol, asking if we could help explain the data in his NT4 IIS server log files. Arturo explained that he currently uses proprietary log file analysis software, but sometimes he doesn't believe the results it gives, and so would like to dive into the log files to do some manual checking. The only problem is that the log files appear to be composed entirely of gobbledygook!

Well Arturo, the first thing to note is that IIS can produce log files in various different standards. We suggest that you configure your server to create log files in W3C Extended Log File Format. This is a standard which is supposed to harmonise the various Web servers' log files, and allow easier analysis, but do bear in mind that this is a typical Microsoft take on a standard!

It is probably easiest to try and explain the various log file fields by quoting a real example. Here is a short extract from one of the log files from Paul's servers, Because of the limited column width we have here, and thus the obvious line wrapping, we've started each new line with a >>>

>>> #Software: Microsoft Internet Information Server 4.0
>>> #Version: 1.0
>>> #Date: 2002-02-12 02:50:15
>>> #Fields: date time c-ip cs-username s-sitename s-computername s-ip
cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes
cs-bytes time-taken s-port cs-version cs(User-Agent) cs(Cookie) cs(Referer)
>>> 2002-02-12 02:50:40 - W3SVC33 DELLBOY
GET /index.asp - 200 0 7486 181 0 80 HTTP/1.0
 - -
>>> 2002-02-12 02:57:15 - W3SVC33 DELLBOY
GET /index.asp - 200 0 7527 427 31 80 HTTP/1.1
>>> 2002-02-12 02:57:16 - W3SVC33 DELLBOY
GET /main2.css - 200 0 1564 340 438 80 HTTP/1.1
>>> 2002-02-12 02:57:17 - W3SVC33 DELLBOY
GET /images/ballantines_com_sm.gif - 200 64 0 360 1047 80 HTTP/1.1

The first four lines just describe the log file. What version Web server we're running, what version of the W3C Extended Log File Format we're using, the date and time that the log file was created, and a list of the various fields. This final line is the really useful one. It gives the name of all of the data fields, and shows the order in which they will appear. The following four lines contain real 'hit' data, the first from a search engine spider, and the following three from a real live user. It's important to note that the delimiter between the fields is the space character. Any spaces in any of the data will be replaced with a plus sign.

Ploughing the fields

Let's take a look at what the various fields mean:

date: Obvious!
time: Obvious. Well obvious-ish. Just bear in mind that this is the local time of the server, rather than the time the user thought it was. If a bloke called Bill in Seattle complains that your site wasn't working at 4:30am, you'll need to do some timezone maths in order to locate the correct log file entries.
c-ip: Client IP address. Once upon a time this was the IP address of the end user's computer, but these days it is more likely to be a proxy server shared by many hundreds or thousands of your visitors (for example, all AOL users will probably show up from maybe one or two c-ip addresses, and to make matters worse, some ISPs have two or more different proxies, and a given user may flip between then during a session!) This can make tracking a user through the logs really difficult, and might explain some of the reasons why your stats analysis software is giving strange results (three thousand visitors might be seen as one, or maybe one visitor might be seen as three!)

cs-username: The username used to log in to the site. For most public Web sites there won't be any login, so it'll simply show a - character. Note that on an NT or Windows 2000 server this will only record 'proper' logins (using the browser's pop-up username password box) which tie into NT security. They wouldn't show the users on sites where the developers have written their own forms based security and login systems.

s-sitename: An internal identifier used by the Web server. This will be the same for all of the records from a given site. For IIS sites the format is W3SVCx where x is the number of the Web site on the server, based on the order in which they were set up. So the first ever (default) site will be W3SVC1, and in our case we've set up 32 other sites including this one. s-computername: This is the name of the server. You'll spot that in our example the server is named Dellboy. At the time of our first ever server purchase from Dell it seemed quite apt. However, now that it sits there humming away with a load of siblings the name is looking a little silly. Luckily we resisted the urge to call our next server Rodney.
s-ip: The Site's IP address. In our case Note that with HTTP 1.1 it is possible (and even encouraged) to have several Web sites sharing the same IP address.
cs-method: The 'HTTP Method'. Normally, for general Web browsing this will be GET. It can be either GET or POST if the user has submitted a form, and you might spot additional methods such as HEAD, PUT, OPTIONS, TRACE or DELETE. Hopefully you won't find too many of the latter!
cs-uri-stem: The page that the user has requested - just the page, not the domain name. In other words, just the bit at the end of the URL. If you look at the example data above, the first two requests are for /index.asp, the third is for a style sheet and the last one is for a GIF image file.
cs-uri-query: A Querystring. In a URL such as the cs-uri-query will be the 'fred=1&mary=2' part. Doing analysis on this data can be useful when studying the log files of dynamically generated sites.
sc-status: Every page request from a Web server results in a status return code being generated. You'll no doubt be familiar with code 404 for Page Not Found. Code 200, which you'll see in our example above, means 'success'. You'll find a comprehensive list of codes here:
sc-win32-status: (Huh? Win32? Remember, we said that this was Microsoft's take on a standard..) As well as the Web server status, the operating system might return a code for file protection violation, etc. In most cases this should be zero.
sc-bytes: The number of bytes sent from the server to the browser. You'll have probably noticed by now that the server to client data items are all sc-something and client to server are cs-style.
cs-bytes: The number of bytes received by the server FROM the browser. This will generally be quite small unless the user is submitting data via a form, or uploading a file. time-taken: How long it took to transfer the request to the user, in milliseconds. This can be slow for a complex scripted page or if the user has a slow internet connection.
s-port: The TCP port that the Web service is running on. HTTP web servers usually run on port 80, but occasionally you might find them on port 81, or 8080. cs-version: The HTTP version supported by the client. Usually 1.1, but older browsers may only support 1.0. The main use of HTTP1.1 is sharing several sites on a single IP address.
cs(User-Agent): A string which usually enables us to determine the operating system, browser and browser version of the user. In the examples above, the first hit is from "Mozilla/3.0+(Slurp-TOCC/si;;+". The Mozilla/3.0 at the front means that this user is claiming to be using Netscape 3, but the rest of the string reveals that the the 'user' is actually a Web robot called 'Slurp'. Checking on the c-ip field with the 'whois' server at shows that this is in fact a Japanese robot, but aren't they all... The second, third and last hits show "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+MSN+6.1;+MSNbMSFT;+MSNmes-es;+MSNcOTH)". This one is a bit more obvious. Firstly it is claiming to be Netscape 4 compatible. Then it reveals itself as being Internet Explorer 6.0. The platform is shown as Windows NT 5.1 - of course there is no such thing - NT 5.1 is the internal identifier for Windows XP. The final bits all show that the user also has the MSN client software installed on their machine. We said that the User-Agenct 'usually' gives the operating system and browser, but not always - We've seen from our first hit that Web robots can often lie about what or who they are, but that's just the start of our problems. Proxy servers can get in the way, hiding the users from the server, plus the users themselves can cause problems. For example, people using browsers like 'Opera' often set them up to report being Internet Explorer 5, the browser they probably despise! This is yet another reason why your stats analysis software might be giving strange results.
cs(Cookie): This shows any cookies we are using on this site - On the server in question we're using non-persistent cookies to try to maintain session state. You can see this really neatly in our small example. It should be obvious by now that the second, third and last lines are from the same user: The Client IP address is the same, the times are very close, and the User Agent is identical. You'll see that on the first access (index.asp) there is no cookie shown. That's because we are using non-persistrent session cookies. They only last for a single browse session of the site - no data will be retained on the user's computer. So in the first request there is no cookie data - the user is just asking for a file. Our server will send that file, along with a session cookie. This cookie is then reported in all subsequent requests, enabling us to track the user. If, however, the user closes their browser and then comes back again, we'll start the whole thing over again with a different session cookie.
cs(Referer): Some browsers, in fact these days it's most browsers, will report the page which sent us to the current one. You can see that our non-robot visitor was went to the site by the google search engine, and that the user had searched for the word 'ballantines'.

Hopefully, with this insight into how the log file gobbledygook is constructed, Arturo should be able to work out whether his analysis software really is giving him sensible answers.