A Web of Thoughts: Web 2.0 and Everything Else

An expression of my interests in Web 2.0 and everything beyond and other than it.

Sunday, April 20, 2008

Web Analytics and Me....

If you own a website - be it an amateur personal endeavour or a professional money-making machine - you must have often wondered who has visited your site, where they came from and how long they stayed. Legitimate questions. I often asked myself the same when I built my first site, with no knowledge whatsoever of how to look at server logs and decipher some meaning from them.

Then, of course, realization dawned upon me. I have matured over the years - from being an ignorant fool to an enlightened soul - as the art of web analytics has evolved.

To many, a website is a subtle marketing tool, and to a few, website stats are a mere gizmo which they proudly display at the footer of their personal pages to let the world know how many - and in most cases, how few - people have visited.

To me it is more than that...it is an art form as well as my breadwinner. Having been associated with the web analytics industry for a while now, every day I am exposed to fascinating data and fascinating new technology. Here I dutifully wish to share some of that with you - the community, the latest buzzword of cyberdom!

Web Analytics and Web Metrics...a bewildering maze of connotations

Let's start with the basics: the terms we use and confuse...

  • www.webopedia.com defines Web analytics as "A generic term meaning the study of the impact of a Web site on its users," and searchcrm.techtarget.com defines the same as "...the process of analyzing the behavior of visitors to a Web site". Both agree that the data thus collected is used "...to streamline a site to create a better user experience". Web analytics, then, is the in-depth study of data collected by tracking website activity. Tracking itself is a combination of various quantitative and qualitative methods, employed through means such as a 'web bug' or server log files.

  • Web metrics, on the contrary, are standardized ways of measuring something that relates to the Web itself, and have a considerably broader scope. "Web metrics focuses on the number and types of people online, the number of broadband versus dial-up connections, advertisers, advertisements (shapes, sizes, level of annoyance) and all things related to the Internet as a whole." (Excerpt from Web metrics versus Web analytics by Jim Sterne)



Whereas web analytics concerns itself with the microcosm of the website, web metrics concerns itself with the macrocosm of the WWW. The two are intricately related enough to instigate mortals like me to indulge in the crime of confusion...but I'm trying to improve :)

Stating the Stateless

webopedia.com refers to HTTP - the stuff that makes the web work - as "a stateless protocol because each command is executed independently, without any knowledge of the commands that came before it". For comprehensive tracking of visitor activity on a given website, webmasters and the like must use the resources offered by web scripting tools to keep track of every individual visitor. When you visit a website and look at, say, a homepage with 5 different images in it, your web browser makes a separate connection for each resource. So, though there may be a single page and therefore a single page-view, altogether there are 6 requests made by the same browser. Most web servers maintain log files to track each request, and along with this they also log referral information and the name of the browser. Very useful...but in the long run, if you want answers to questions like "how long was this guy from Spain on my site?", you will hit a roadblock. Since, for the server, every request is a new one and the "state" was not captured, every single guy from Spain is a new guy. Interpreting "state" out of the stateless is one of the difficulties of successful web analytics. There are many more challenges, actually, and for a rather comprehensive documentation of these, How the web works by Stephen Turner is a great read. By the way, Mr Turner is the guy behind Analog, our saviour, the god who shows us the path when it comes to making sense of our server logs. Go on now. Read it, while I go and make some more cookies ;)
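To make the point concrete, here is a minimal sketch (in Python, and not any particular vendor's tool) of what reading a server log looks like. The log file name and the Common Log Format assumed here are illustrative assumptions.

```python
import re
from collections import Counter

# A minimal sketch: parse Common Log Format lines and count requests per
# client IP. To the server, every line is an independent, stateless event.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)

def requests_per_ip(log_path):
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            match = LOG_LINE.match(line)
            if match:
                counts[match.group("ip")] += 1
    return counts

# One visit to a homepage with 5 images shows up here as 6 requests from the
# same IP -- the server itself has no notion of a "visit" or a "visitor".
if __name__ == "__main__":
    for ip, hits in requests_per_ip("access.log").most_common(5):
        print(ip, hits)
```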

Need some Cookies...?

The challenge is how to record the fleeting dialog between a web server and the web browser. Yes, I like to refer to the transactions between servers and browsers as a dialog. The browser sends a request for a resource and the server replies with an appropriate response. You could discern a rather interesting and lively conversation if you dare to listen!

In 1994, Lou Montulli of Netscape Communications invented the "cookie" we know of today. Soon there were the initial cookie specifications, and browser support followed - first Netscape in 1994, and then Internet Explorer in 1995. Web developers found many uses for the cookie, including tracking a browser session. Since then cookies have become an important analytics tool.

To understand how cookies can help in web analytics, it's important to understand how cookies themselves work:

Wikipedia.org defines a cookie as "a packet of information sent by a server to a World Wide Web browser and then sent back by the browser each time it accesses that server." Cookies are stored in the user's computer, ready for future access. Each cookie has a unique name, and the data it stores can be read by the server which sets it. Web analytics systems use this technology rather effectively to track each individual user. The idea is that each cookie is unique to each user. The cookie data may store a unique ID assigned to that computer, and whenever there is a request from the browser for a resource on the server which had originally set that cookie, the server can read the cookie data and assess whether it's the same machine that requested another resource only seconds earlier. This somewhat emulates a situation in which the "state" can be captured. With the help of a cookie this time, in the Spanish guy example from earlier, the server can distinguish between two Spanish guys a little more effectively.
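Just to illustrate the mechanics - a toy sketch, not how any real analytics product is built - here is a tiny Python server that mints a unique visitor ID the first time a browser knocks and recognizes the returning browser by the cookie afterwards. The cookie name "visitor_id" and the port are assumptions made for the example.

```python
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer
from http.cookies import SimpleCookie

class TrackingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        if "visitor_id" in cookies:
            visitor_id = cookies["visitor_id"].value   # returning visitor
            new_visitor = False
        else:
            visitor_id = uuid.uuid4().hex              # first visit: mint an ID
            new_visitor = True

        self.send_response(200)
        if new_visitor:
            # The browser will send this cookie back with every future request,
            # which is what lets the server tell its Spanish guys apart.
            self.send_header("Set-Cookie", f"visitor_id={visitor_id}; Path=/")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        greeting = "Welcome" if new_visitor else "Welcome back"
        self.wfile.write(f"{greeting}, {visitor_id}\n".encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), TrackingHandler).serve_forever()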

So you see how kind and benevolent our servers can be at times...always keen to give away some warm homemade cookies to whoever comes knocking!

Collection Agents

Before any kind of analysis, there is invariably a large pool of data. You need to delve deeeeeep into it, throw away the junk, and apply more than just your brain cells to get the disturbing questions answered and bring your analytical process to a conclusion.

So first things first...collect the data we need for analysis...

The data required for web analysis can be collected and accumulated in three ways: webserver logs, a server/network monitor, and page tagging with an invisible image.
     
  • Webserver log: Almost all webservers have data logging capabilities based on the server administrator's chosen settings. The server logs each request by the browser for a given resource as well as the response it sends back. In some sense it's like the chat logs that get written and stored on your machine when you use IM clients like Yahoo! Messenger. The information collected in a log file can include the date, time, number of hits, visitors, visit duration, visitor origin subdomain, referral link, visitor IP address, browser type and version, platform, and cookies.

  • Server/network monitor: Used as a plugin to the webserver, a server monitor records events from the server itself. A network monitor, also known as a "packet sniffer", records every packet of data as it is transmitted or received - browser requests, server responses, form-posted data, cookies and individual files. Thus this tool is able to develop a more comprehensive picture of web traffic.

  • Page tagging: This system needs a web page to be tagged with a small piece of code and occasionally an invisible image. The image is of a special kind which is able to capture all the data that the previous two methods capture, and using the code present in the page, the captured information is transmitted to a (usually) remotely located application, which in turn collects the data and stores it in a data container. (A minimal sketch of the collecting end of such a tag follows this list.)
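Here is that sketch, written in Python. Everything in it - the collector URL, the query parameter, the log file name - is an assumption for illustration; real tagging services are far more elaborate.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Assumed usage: the tracked page embeds something like
#   <img src="http://collector.example.com/pixel.gif?page=/home" width="1" height="1">
# The collector answers with a 1x1 transparent GIF and writes what it learned
# from the request to its own log -- far away from the tracked site's server.
TRANSPARENT_GIF = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00"
    b",\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;"
)

class CollectorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        record = {
            "time": time.strftime("%Y-%m-%d %H:%M:%S"),
            "page": params.get("page", ["?"])[0],
            "ip": self.client_address[0],
            "referrer": self.headers.get("Referer", "-"),
            "user_agent": self.headers.get("User-Agent", "-"),
        }
        with open("collector.log", "a") as log:
            log.write(str(record) + "\n")

        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        self.end_headers()
        self.wfile.write(TRANSPARENT_GIF)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), CollectorHandler).serve_forever()
```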


As far as methodology is concerned, there is one very important distinction: the first two techniques need to be implemented on the same side as the web server itself, so data collection by these two methods must happen locally. In the third method, the logging can happen remotely. Since the code in the web page is able to transmit the raw data away from the website one intends to track, a server in some other location can be configured to receive and record that data. This feature makes the third method a rather independent and reliable alternative to server log analysis.

Wondering already which is the better or preferred method?

Paradigm Shift

There have been some developments lately, and it seems like some people are on the verge of a breakthrough when it comes to web analytics. It is called the "Magdalena and Thomas methodology", and Brandt Dainow talks about it in his article entitled "Web Analytics Breakthrough!". An interesting article.

Accidents and crossroads

Literally, Web analytics (using the page tagging system) is an accident-prone business and one has to be very careful. We are under a perpetual threat of accidentally getting our calculations wrong. When it comes to calculating user activity on the web, one has to take into account the quirks of human behavior.

It seems easier to ascertain the number of visitors who visited a particular page on a website than to calculate the number of those visitors who also deleted their tracking cookies. The latter remains in the realm of probability and estimation. And the cookie is what nourishes web analytics. You deny me mine and I am bound to behave erratically.

According to a recent report released by JupiterResearch, about 15% of users reject all cookies, while 28% selectively block third-party cookies. Also, when even a section of the analysis is done on the basis of tracked IPs, there is a problem. The same user may be assigned different IPs at different points in time and may access a site from different locations. A strange case would be a user accessing the same website from his laptop from two different locations, going through two different proxies. Cases like these - hypothetical or real - obstruct accurate calculation and obfuscate the collected data.

There is thus a need to understand and design an intuitive algorithm which looks through the possibilities of cookie deletion or blocking on one level, and the possibilities of the same user going through several IPs over any period of time on another, in order to talk effectively about the uniqueness of visits and visitors.
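To make the idea tangible, here is a rough Python illustration of such a heuristic - my own simplification, not the Magdalena and Thomas methodology or any published algorithm: trust the cookie ID where it survives, and fall back to a crude IP plus user-agent fingerprint where it does not. All the field names and data are made up for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hit:
    cookie_id: Optional[str]
    ip: str
    user_agent: str

def visitor_key(hit: Hit) -> str:
    if hit.cookie_id:                                    # the reliable case
        return "cookie:" + hit.cookie_id
    return "fallback:" + hit.ip + "|" + hit.user_agent   # the guesswork case

def count_unique_visitors(hits: List[Hit]) -> int:
    return len({visitor_key(h) for h in hits})

hits = [
    Hit("abc123", "83.52.10.4", "Firefox/2.0"),
    Hit("abc123", "83.52.10.4", "Firefox/2.0"),   # same cookie: correctly one visitor
    Hit(None, "83.52.10.4", "Firefox/2.0"),       # cookie deleted: guessed from IP + UA
    Hit(None, "193.145.1.9", "Firefox/2.0"),      # same person via another proxy: overcounted
]
print(count_unique_visitors(hits))  # prints 3, though it may well be one person
```

The overcount in the last line is exactly the crossroads described below: no single signal is enough on its own.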

We thus seem to have come to a crossroads, where new possibilities and needs emerge. Maybe we need to shift and shuffle a little and take the right road. Thanks to people like Magdalena Urbańska and Tomasz Fryderyk Urbański, who are thinking up new innovations to perfect the art and science of Web Analytics.