Identifying the problem server in a web farm can be challenging. Often the first attempt is to modify your local name resolution (on Windows, the HOSTS file) to point the site URL at a single machine. Depending on how your web farm is set up, you may not be able to do this, because the individual machines may not be visible to you; only the farm's pool address is. Furthermore, some problems do not manifest themselves when the site runs on a single machine (otherwise we'd have caught them in development, right?). To complicate matters further, often the only chance you have to identify which machine had the problem is the moment it occurs, as in, when you are staring at the application crash page. Simply attempting to replicate the problem after you set up your tracking may not be enough.
A simple solution I have implemented on our staging and production web farms involves nothing more than the built-in HTTP response header support in IIS. First, add an HTTP header to each machine in the farm that contains the name of the machine, or any other unique value that you can map back to the machine.
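The post sets this header through the IIS configuration. For readers who want to see the idea in code, here is a minimal sketch of the same technique as a Python WSGI middleware; it is not the IIS setup itself. The header name `X-Served-By` and the class name `ServerNameHeader` are assumptions made for illustration, and any unique header name would do.

```python
import socket

# Hypothetical header name; the post does not say which name was used.
HEADER_NAME = "X-Served-By"


class ServerNameHeader:
    """WSGI middleware that tags every response with the serving machine's name."""

    def __init__(self, app, header_name=HEADER_NAME, server_name=None):
        self.app = app
        self.header_name = header_name
        # Default to the local host name; pass server_name to use another
        # unique value that you can map back to the machine.
        self.server_name = server_name or socket.gethostname()

    def __call__(self, environ, start_response):
        def tagged_start_response(status, headers, exc_info=None):
            # Append the identifying header to whatever the app already set.
            tagged = list(headers) + [(self.header_name, self.server_name)]
            return start_response(status, tagged, exc_info)

        return self.app(environ, tagged_start_response)
```

Wrapping an application as `ServerNameHeader(app)` on each machine means that every response, including error responses the application produces itself, carries that machine's name.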
Then, when you browse to the site or are looking at an error message, you can open a tool like Fiddler or Firebug to view the HTTP headers of the response.
Particularly with a tool such as Firebug or another in-browser inspector, you can get the information immediately, without having to start any kind of tracking tool or relaunch the site.
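If you would rather check from a script than from a browser tool, the same header can be read programmatically. The following is a minimal sketch using only the Python standard library. It assumes the header is named `X-Served-By` (a hypothetical name), and the function names are made up for this example. It requests the farm URL several times and counts which machine answered with which status code.

```python
from collections import Counter
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical header name; use whatever name you configured on the servers.
HEADER_NAME = "X-Served-By"


def server_from_headers(headers, header_name=HEADER_NAME):
    """Return the machine identifier from a dict of response headers, or None."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == header_name.lower():
            return value.strip()
    return None


def fetch_status_and_headers(url):
    """Fetch url and return (status code, headers), including for error pages."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status, dict(response.headers.items())
    except HTTPError as err:
        # Error responses (for example a 500 crash page) still carry headers.
        return err.code, dict(err.headers.items())


def tally_servers(url, attempts=10, header_name=HEADER_NAME,
                  fetch=fetch_status_and_headers):
    """Count (machine, status code) pairs seen over repeated requests."""
    counts = Counter()
    for _ in range(attempts):
        status, headers = fetch(url)
        counts[(server_from_headers(headers, header_name), status)] += 1
    return counts
```

Calling `tally_servers` with your farm's URL returns a `Counter` keyed by `(machine, status)`, so a machine that only ever appears next to error statuses stands out. If the load balancer uses session or IP affinity, repeated requests from one client may all land on the same machine, so treat the tally as a hint rather than a full survey of the farm.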
2 comments:
Interesting and simple trick!
I am curious: do you have any real-world examples where this feature helped?
Great question, I certainly do...
We have an application that uses a 60-gigabyte content repository. Instead of duplicating it on every server in the web farm, all the content lives on a fiber-connected SAN disk system. One of the servers was experiencing a security problem with the SAN system. The application would intermittently crash or throw "content unavailable" errors whenever it was being served by that one culprit machine. Being able to identify which machine was having the problem allowed us to isolate it, troubleshoot it, and fix it.