Description
Discovered this while load testing a service (built on the Play 2 framework) that makes async HTTP calls to a backend database: repeated concurrent requests eventually exhaust the OS's file handles, and the service fails with too many open sockets. netstat revealed several thousand open HTTP connections.
I have traced the issue to a problem in how AsyncHttpClient handles 401 responses.
Reproducing / Test Case
A simple app that demonstrates the bug is available here: https://gist.github.com/chrisrhut/74988852e5313c3d613b (specify the VM argument -Dorg.slf4j.simpleLogger.log.com.ning.http=debug to enable log output).
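For reference, the core of the repro looks roughly like the sketch below (written against the com.ning 1.9.x API; the gist is the authoritative version):

import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.Realm;
import com.ning.http.client.Realm.AuthScheme;

public class Auth401Leak {
    public static void main(String[] args) throws Exception {
        AsyncHttpClient client = new AsyncHttpClient();
        // credentials matching httpbin's /basic-auth/user/passwd endpoint
        Realm realm = new Realm.RealmBuilder()
                .setPrincipal("user")
                .setPassword("passwd")
                .setScheme(AuthScheme.BASIC)
                .build();
        try {
            // each iteration triggers a 401 challenge followed by an authenticated retry
            for (int i = 0; i < 20; i++) {
                int status = client.prepareGet("http://httpbin.org/basic-auth/user/passwd")
                        .setRealm(realm)
                        .execute()
                        .get()
                        .getStatusCode();
                System.out.println("request " + i + " -> " + status);
            }
        } finally {
            client.close();
        }
    }
}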
In a terminal window, have this running in the background to see the issue at hand: while true; do netstat -an | grep "50.16.189.35"; echo ""; sleep 3; done
IMPORTANT: because of the way it's hosted, the IP address of httpbin.org changes with some frequency, so make sure to update the IP in the grep command accordingly.
Example terminal output after 20 loops:
tcp4 0 0 192.168.1.139.51857 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51856 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51853 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51852 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51849 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51846 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51845 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51841 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51838 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51837 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51834 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51832 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51829 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51822 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51821 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51817 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51814 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51813 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51810 50.16.189.35.80 ESTABLISHED
tcp4 0 0 192.168.1.139.51807 50.16.189.35.80 ESTABLISHED
A key observation: even with all these established connections, the pool reports only a single entry in the log output:
[Hashed wheel timer #1] DEBUG com.ning.http.client.providers.netty.channel.pool.DefaultChannelPool - Entry count for : http://httpbin.org:80 : 1
Investigation / Root Cause
Further investigation of the log output yields more clues. Each GET request (after the very first one) displays:
[main] DEBUG com.ning.http.client.providers.netty.request.NettyRequestSender - Using cached Channel [id: 0xcfd11ef5, /192.168.1.139:51930 => httpbin.org/50.16.189.35:80]
for request
DefaultHttpRequest(chunked: false)
GET /basic-auth/user/passwd HTTP/1.1
...
Response DefaultHttpResponse(chunked: false)
HTTP/1.1 401 UNAUTHORIZED
Www-Authenticate: Basic realm="Fake Realm"
So we (correctly) took a connection from the connection pool ("Using cached Channel") to make the initial request, but received a 401 indicating that authentication is required. At this point the client should be able to simply re-use the existing connection to make the authenticated request, but the log output continues:
[New I/O worker #2] DEBUG com.ning.http.client.providers.netty.handler.HttpProtocol - Sending authentication to http://httpbin.org/basic-auth/user/passwd
[New I/O worker #3] DEBUG com.ning.http.client.providers.netty.request.NettyConnectListener - Request using non cached Channel '[id: 0x872c7990, /192.168.1.139:51932 => httpbin.org/50.16.189.35:80]':
DefaultHttpRequest(chunked: false)
GET /basic-auth/user/passwd HTTP/1.1
...
Response DefaultHttpResponse(chunked: false)
HTTP/1.1 200 OK
...
[New I/O worker #3] DEBUG com.ning.http.client.providers.netty.channel.ChannelManager - Adding key: http://httpbin.org:80 for channel [id: 0x872c7990, /192.168.1.139:51932 => httpbin.org/50.16.189.35:80]
So, the authenticated request uses a "non cached" channel and after the 200 OK this new channel is added to the connection pool.
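For contrast, Basic auth should need only one socket here: the 401 response completes normally on a keep-alive connection, so the authenticated retry can be written to the very same channel. Roughly (the Authorization value is just the base64 of user:passwd):

GET /basic-auth/user/passwd HTTP/1.1            -> 401, Www-Authenticate: Basic realm="Fake Realm"
GET /basic-auth/user/passwd HTTP/1.1
Authorization: Basic dXNlcjpwYXNzd2Q=           -> 200 OK (same TCP connection)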
Tracing into the code confirmed it: for each run of execute(), ChannelPool.poll() is called twice but ChannelPool.offer() is called only once. One channel is therefore orphaned on every request, forcing the client to open a fresh connection each time.
Possible Solution
I believe the issue is rooted in HttpProtocol.exitAfterHandling401(). The code first checks for NTLM, then for SPNEGO/Kerberos; in those two cases future.setReuseChannel(true) is called. In the else case (which includes Basic and Digest) the flag is never set. Furthermore, in the callback:
Callback callback = new Callback(future) {
    public void call() throws IOException {
        // the channel is drained but never closed or offered back to the pool,
        // so when it is not reused it leaks
        channelManager.drainChannel(channel, future);
        requestSender.sendNextRequest(nextRequest, future);
    }
};
the channel is neither closed nor returned to the pool when it is not going to be reused. Note that exitAfterHandlingRedirect() in Protocol.java does not exhibit this bug: it avoids the leak by calling ChannelManager.tryToOfferChannelToPool() in its callback. A similar step should probably be carried out here.
So I believe the solution is two-fold (I will submit a pull request after filing this issue); a rough sketch of both changes follows the list below:
- enable "reuse channel" for Basic/Digest authentication (someone with more intimate knowledge of the subtleties of HTTP auth, please correct me if I'm wrong!)
- in the case where we will NOT reuse the channel, make sure it is returned to the connection pool. This avoids the orphaned-connection issue if I'm wrong about the first bullet, and covers any other cases not handled by the NTLM/Kerberos checks.
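For illustration only, the two changes might look something like this (paraphrasing the callback quoted above; future.isReuseChannel() and poolKey are assumed names here, and the real patch may well differ):

// inside HttpProtocol.exitAfterHandling401(), when the scheme is Basic or Digest:
future.setReuseChannel(true);

// and in the callback, never strand the channel:
Callback callback = new Callback(future) {
    public void call() throws IOException {
        if (future.isReuseChannel()) {
            // the authenticated retry goes out on the same connection
            channelManager.drainChannel(channel, future);
        } else {
            // not reusing: hand the channel back to the pool (as
            // exitAfterHandlingRedirect() does) instead of orphaning it
            channelManager.tryToOfferChannelToPool(channel, true, poolKey); // hypothetical args
        }
        requestSender.sendNextRequest(nextRequest, future);
    }
};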
Thanks!