Tuesday, November 13, 2007

SOS "Son of Strike" by Mark Smith

I'm a casual user of WinDbg, and I usually appreciate the challenge of using it to diagnose issues.  Crazy, I know.  There are some really great resources out there on how to get started with it, but I had a pleasant surprise in this month's (December 2007) MSDN Magazine.  Check out the DevelopMentor Winter 2008 Course Schedule; there you will find this article by Mark Smith.  It's worth a read.

HTTPERR Connections_Refused

Symptom

An IIS 6.0 website stops responding to requests.  Attempting to connect to the website's port with telnet fails; no connection is made.

Regardless of whether you start at the "top" and work your way down the application stack, or start at the "bottom" and work your way up, you will eventually find the HTTPERR log files for http.sys, and they will lead you to the answer: Google for "Connections_Refused" and you will hit the number one ranked response, by David Wang: HOWTO: Diagnose IIS6 Failing to Accept Connections Due to CONNECTIONS_REFUSED.  The hint to start at the "bottom" rather than the "top" is that there is no application-level response at all: no HTTP 500, or anything else for that matter.

If this is the problem you're experiencing, stop now, read the article, and see if that's the end of your journey.  What follows is just my account of confirming the information found in the article.

Since I could never make a connection to the web service, I looked in the HTTPERR logs.  Sure enough, there were the telltale "Connections_Refused" entries, which led me directly to David's article.

Sample HTTPERR logs

[..snip..]

2007-11-13 20:01:44 - - - - - - - - - 3_Connections_Refused -
2007-11-13 20:01:49 - - - - - - - - - 4_Connections_Refused -
2007-11-13 20:01:54 - - - - - - - - - 3_Connections_Refused -
2007-11-13 20:01:59 - - - - - - - - - 4_Connections_Refused -
2007-11-13 20:02:04 - - - - - - - - - 3_Connections_Refused -
2007-11-13 20:02:09 - - - - - - - - - 4_Connections_Refused -

[..snip..]
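
To get a quick feel for how often this is happening, the entries can be tallied with a few lines of Python.  This is only a rough sketch: the function name is my own invention, and it assumes the numeric prefix on each reason (the 3 in "3_Connections_Refused") is the number of connections refused since the previous entry.

```python
import re

# Matches the reason field of the entries shown above,
# e.g. "3_Connections_Refused", and captures the numeric prefix.
REFUSED = re.compile(r"\b(\d+)_Connections_Refused\b")

def count_refused(lines):
    """Return (matching entries, sum of numeric prefixes).

    Hypothetical helper; the sum is only meaningful if the prefix
    really is a per-interval count of refused connections.
    """
    entries = 0
    total = 0
    for line in lines:
        match = REFUSED.search(line)
        if match:
            entries += 1
            total += int(match.group(1))
    return entries, total

# The six sample lines from the log excerpt above.
sample = [
    "2007-11-13 20:01:44 - - - - - - - - - 3_Connections_Refused -",
    "2007-11-13 20:01:49 - - - - - - - - - 4_Connections_Refused -",
    "2007-11-13 20:01:54 - - - - - - - - - 3_Connections_Refused -",
    "2007-11-13 20:01:59 - - - - - - - - - 4_Connections_Refused -",
    "2007-11-13 20:02:04 - - - - - - - - - 3_Connections_Refused -",
    "2007-11-13 20:02:09 - - - - - - - - - 4_Connections_Refused -",
]

print(count_refused(sample))
```

An entry every five seconds, as in this sample, suggests the condition is persistent rather than a brief spike.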

 

Since I had never actually run into a shortage of non-paged pool memory before, I continued to follow the steps in the article to validate the diagnosis (I'm curious that way).  The first thing I noticed was that the server's non-paged pool usage was a little more than 109MB, as seen below.

 

 

Normally an x86 Windows 2003 Server has 256MB of non-paged pool memory available; however, we were running with the /3GB switch enabled in boot.ini, which cuts that in half to 128MB.  There was no requirement for the server to run with the /3GB switch, so we'll remove it at the earliest opportunity, but why the high memory utilization?

Running poolmon -b showed the kernel memory allocations (paged and non-paged pool) sorted by bytes allocated, largest first.  It turned out that a single driver had allocated over 69MB of non-paged pool memory all to itself, as shown below.

 

Http.sys stops accepting new connections when available non-paged pool memory falls below 20MB.  Here, 128MB - 109MB = 19MB of available non-paged pool memory, which, if I read that correctly, translates to "Connections_Refused".
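
The arithmetic above can be written out as a small Python sketch.  The function name and its parameters are made up for illustration; the 20MB threshold and the 128MB, 256MB and 109MB figures are the ones from this post, and the check is just my reading of the rule, not how http.sys actually computes it.

```python
def http_sys_headroom(pool_limit_mb, pool_used_mb, threshold_mb=20):
    """Return (available MB, True if below the refusal threshold).

    Hypothetical helper: models the rule as
    available = limit - used, refuse when available < threshold.
    """
    available = pool_limit_mb - pool_used_mb
    return available, available < threshold_mb

# With /3GB enabled: 128MB limit, about 109MB in use.
print(http_sys_headroom(128, 109))

# Same usage without /3GB: 256MB limit.
print(http_sys_headroom(256, 109))
```

With the 128MB limit the server sits at 19MB available, just under the threshold; with the default 256MB limit the same usage would leave 147MB, which is why removing the /3GB switch helps even before the driver is fixed.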

This driver belonged to our virus scanning software, so I punted to our infrastructure group with a "What's up with this?" email containing basically the contents of this post.  They passed it to the vendor, who confirmed that yes, it is a problem, and yes, it's already been fixed.  As a matter of fact, the fix had been delivered to us the day before; we were just waiting for a regularly scheduled change window to apply it.

References