Skip to content

basic/digest authentication causes connection pool leaks #785

Closed
@chrisrhut

Description

@chrisrhut

Discovered this while load testing a service (built on the Play 2 framework) which makes async HTTP calls to a backend database: repeated concurrent requests eventually cause the OS to run out of file handles and fail due to too many open sockets. netstat revealed several thousand open HTTP connections.

I have traced the issue to a problem in how AsyncHttpClient handles 401 responses.

Reproducing / Test Case

A simple app that demonstrates the bug is available here: https://gist.github.com/chrisrhut/74988852e5313c3d613b (specify the VM argument -Dorg.slf4j.simpleLogger.log.com.ning.http=debug to enable log output).

In a terminal window, have this running in the background to see the issue at hand: while true; do netstat -an | grep "50.16.189.35"; echo ""; sleep 3; done IMPORTANT: because of the way it's hosted, the IP address of httpbin.org changes with some frequency, so make sure to set it accordingly.

Example terminal output after 20 loops:

tcp4       0      0  192.168.1.139.51857    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51856    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51853    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51852    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51849    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51846    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51845    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51841    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51838    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51837    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51834    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51832    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51829    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51822    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51821    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51817    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51814    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51813    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51810    50.16.189.35.80        ESTABLISHED
tcp4       0      0  192.168.1.139.51807    50.16.189.35.80        ESTABLISHED

A key observation is that even with all these established connections, the log output for the code displays: [Hashed wheel timer #1] DEBUG com.ning.http.client.providers.netty.channel.pool.DefaultChannelPool - Entry count for : http://httpbin.org:80 : 1

Investigation / Root Cause

Further investigation of the log output yields more clues. Each GET request (after the very first one) displays:

[main] DEBUG com.ning.http.client.providers.netty.request.NettyRequestSender - Using cached Channel [id: 0xcfd11ef5, /192.168.1.139:51930 => httpbin.org/50.16.189.35:80]
 for request 
DefaultHttpRequest(chunked: false)
GET /basic-auth/user/passwd HTTP/1.1
...
Response DefaultHttpResponse(chunked: false)
HTTP/1.1 401 UNAUTHORIZED
Www-Authenticate: Basic realm="Fake Realm"

So we (correctly) took a connection from the connection pool ("Using cached Channel") to make the initial request, but received a 401 and authentication is required. At this point a single thread should be able to just re-use the existing connection to make the authenticated request, but the log output continues:

[New I/O worker #2] DEBUG com.ning.http.client.providers.netty.handler.HttpProtocol - Sending authentication to http://httpbin.org/basic-auth/user/passwd
[New I/O worker #3] DEBUG com.ning.http.client.providers.netty.request.NettyConnectListener - Request using non cached Channel '[id: 0x872c7990, /192.168.1.139:51932 => httpbin.org/50.16.189.35:80]':
DefaultHttpRequest(chunked: false)
GET /basic-auth/user/passwd HTTP/1.1
...
Response DefaultHttpResponse(chunked: false)
HTTP/1.1 200 OK
...
[New I/O worker #3] DEBUG com.ning.http.client.providers.netty.channel.ChannelManager - Adding key: http://httpbin.org:80 for channel [id: 0x872c7990, /192.168.1.139:51932 => httpbin.org/50.16.189.35:80]

So, the authenticated request uses a "non cached" channel and after the 200 OK this new channel is added to the connection pool.

Tracing into the code I discovered that sure enough, for each run of execute(), ChannelPool.poll() is called twice but ChannelPool.offer() is only called once. So a channel is orphaned, requiring the process to internally spawn a new one for every request.

Possible solution

I believe the issue is rooted in HttpProtocol.exitAfterHandling401(). The code first checks for NTLM followed by SPEGNO/Kerberos; in those two cases future.setReuseChannel() is set to true. In the else case (including Basic/Digest) this flag is not set. Futhermore, in the callback:

Callback callback = new Callback(future) {
    public void call() throws IOException {
        channelManager.drainChannel(channel, future);
        requestSender.sendNextRequest(nextRequest, future);
    }
};

we don't close the channel in the case that it's not going to be reused. I noticed that exitAfterHandlingRedirect() in Protocol.java does not exhibit this bug; it's avoided by calling ChannelManager.tryToOfferChannelToPool() in its callback. Perhaps a similar step should be carried out here.

So I believe the solution is two-fold (I will submit a pull request after filing this issue):

  • enable "reuse channel" for Basic/Digest authentication (someone with a more intricate knowledge of the subtleties of HTTP auth, please correct me if I'm wrong)!
  • in the case where we will NOT reuse the channel, make sure it's returned to the connection pool. This will avoid the orphaned connection issue if I'm wrong about the first bullet and/or in other cases not specifically covered by NTLM/Kerberos.

Thanks!

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions