
Apache Traffic Server

HTTP Proxy Server on the Edge

Leif Hedstrom
Apache Traffic Server Development Team
Yahoo! Cloud Computing

[email protected]
[email protected]

Abstract — Apache Traffic Server[1] is a fast, scalable and feature-rich HTTP proxy and caching server. Traffic Server was originally a commercial product from Inktomi Corporation, and has been actively used inside Yahoo! for many years, as well as by many other large web sites. As of 2009, Traffic Server is an Open Source project under the Apache umbrella, and is rapidly being developed and improved upon by an active community.

This talk will explain the details behind the Traffic Server technology: What is it? What makes it fast? Why is it scalable? And how is it different compared to other HTTP proxy servers? We will also delve into details about how a large web site can utilize this power to create services with exceptional end-user experience.

INTRODUCTION

Apache Traffic Server is an Open Source project, originally developed as a commercial product by Inktomi, and later donated to the Apache Software Foundation (ASF) by Yahoo! Inc. Apache Traffic Server was accepted as a Top-Level Project in April of 2010, after 6 months of incubation. Graduating as a TLP is not only a milestone for the community; it also demonstrates the commitment to Traffic Server from the ASF as well as from all the contributors.

Yahoo! has actively used the original Traffic Server software for many years, serving HTTP requests for many types of applications:

- As a Content Delivery Network (CDN), serving static content for all of Yahoo!'s web sites
- For connection management across long distances, providing low-latency connectivity to the users
- As an alternative to Hardware Server Load Balancers (SLBs)

As such, TS already is (and has been for several years) a critical component of Yahoo!'s network. By releasing Traffic Server to the Open Source community, a new tool is now readily available for anyone to use.

1.1. Why Apache Software Foundation

This presentation does not focus on Yahoo!'s decision to open-source Traffic Server, and the choices that were made during the process. However, it's useful to understand why Yahoo! chose ASF, and what benefits we derive from being an ASF Top-Level Project.

Being part of an already established and well-functioning Open Source community brings immediate benefits to the project:

- We benefit from the many years of experience of ASF leadership in Open Source technology.
- We immediately gained new contributors to the project.
- There is plenty of existing source code, skill and experience in the ASF community, into which we can tap.
- We are part of a reputable and well-maintained Open Source community.

HTTP PROXY AND CACHING

HTTP proxy servers, with or without caching, are implementations of an HTTP server with support to act as an intermediary between a client (User-Agent) and another HTTP server (typically referred to as an Origin Server). It's quite possible, and in many cases desirable, to have multiple intermediaries in a hierarchy, and many ISPs will proxy all HTTP requests through a mandatory intermediary.

There are three primary configurations for a proxy server:

- Forward Proxy – This is the traditional proxy setup, typically used in corporate firewalls or by ISPs. It requires the User-Agents (e.g. browsers) to be configured and aware of the proxy server.

- Reverse Proxy – In a reverse proxy setup, the intermediary acts as any normal HTTP server would, but will proxy requests based on (typically) a specific mapping rule.

- Intercepting Proxy – This is similar to a Forward Proxy, except the intermediary intercepts the HTTP requests from the User-Agent. This is also typically done by ISPs or corporate firewalls, but has the advantage that it is transparent to the user. It is usually also referred to as a Transparent Proxy.

Any HTTP intermediary must of course function as a basic HTTP web server. There is definite overlap in functionality between a proxy server and a regular HTTP server. Both typically provide support for access control (ACLs), SSL termination and IPv6. In addition, many HTTP intermediaries also provide features such as:

- Based on the incoming request, finding the most appropriate Origin Server (or another intermediary) from which to fetch the document;
- Providing infrastructure to build redundant and resilient HTTP services;
- Caching documents locally, for faster access and less load on Origin Servers;
- Server Load Balancing (SLB), by providing features such as sticky sessions, URL-based routing, etc.;
- Implementing various Edge services, such as Edge Side Includes (ESI);
- Acting as a firewall for access to HTTP content: providing content filtering, anti-spam filtering, audit logs, etc.

Traffic Server can perform many of these tasks, but obviously not all of them. Some tasks would require changes to the internals of the code, and some would require development of plugins. Fortunately, Traffic Server, similar to Apache HTTPD, has a feature-rich plugin API for developing extensions. Efforts are being made not only to release a number of useful plugins to the Open Source community, but also to improve and extend the plugin APIs to allow for even more complex development. We are also starting to see the community contribute new Traffic Server plugins.
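To give a feel for what the plugin API looks like, here is a rough sketch of a minimal plugin that registers a hook on incoming client requests. It is an illustration only, using the current TS-prefixed API naming (TSPluginInit, TSHttpHookAdd); plugin registration, error handling and build details are omitted, and the debug tag "example" is just a placeholder.

#include <ts/ts.h>

/* Called for each transaction when the READ_REQUEST_HDR hook fires. */
static int
handle_request(TSCont contp, TSEvent event, void *edata)
{
  TSHttpTxn txnp = (TSHttpTxn)edata;

  /* Inspect or modify the request here; never block in this callback. */
  TSDebug("example", "saw a client request");

  /* Hand the transaction back to the Traffic Server state machine. */
  TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
  return 0;
}

void
TSPluginInit(int argc, const char *argv[])
{
  /* Create a continuation and attach it to a global hook. */
  TSCont contp = TSContCreate(handle_request, NULL);
  TSHttpHookAdd(TS_HTTP_READ_REQUEST_HDR_HOOK, contp);
}

A real plugin would also register itself with the server and use the transaction APIs to read or rewrite headers, but the hook-and-reenable pattern above is the basic shape.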
TRAFFIC SERVER UNDER THE HOOD

Apache Traffic Server differs from most existing Open Source proxy servers. It combines two technologies commonly used for writing applications that deal with high concurrency:

1) Asynchronous event processing
2) Multi-threading

By combining these two technologies, TS can draw the benefits from each. However, it also makes the technology and code complex, and sometimes difficult to understand. This is a serious drawback, but we feel the positives outweigh the negatives. Before we discuss the pros and the cons of this decision, we'll give a brief introduction to these two concepts.

1.2. Asynchronous Event Processing

This is actually a combination of two concepts:

1) An event loop
2) Asynchronous I/O

Together, this gives us what we call Asynchronous Event Processing. The event loop will schedule event handlers to be executed as the events trigger. The asynchronous requirement means that such handlers are not allowed to block execution waiting for I/O (or block for any other reason). Instead of blocking, the event handler must yield execution, and inform the event loop that it should continue execution when the task would not block. Events are also automatically generated, and dispatched appropriately, as sockets and other file descriptors change state and become ready for reading or writing (or possibly both).

It is important to understand that an event loop model does not necessarily require all I/O to be asynchronous. However, in the Traffic Server case, this is a fundamental design requirement, and it impacts not only how the core code is written, but also how you implement plugins. A plugin cannot block on any I/O calls, as doing so would prevent the asynchronous event processor (scheduler) from functioning properly.
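To make the event loop idea concrete, here is a small, self-contained sketch (generic C, not Traffic Server code) of a poll()-based loop: handlers are invoked only when their file descriptor is ready, and they must return immediately rather than block.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* A handler is invoked only when its descriptor is ready; it must never block. */
typedef void (*event_handler)(int fd);

struct event {
  int fd;
  event_handler on_readable;
};

static void
run_event_loop(struct event *events, int nevents)
{
  struct pollfd pfds[64];

  for (;;) {
    for (int i = 0; i < nevents; i++) {
      pfds[i].fd = events[i].fd;
      pfds[i].events = POLLIN;
      pfds[i].revents = 0;
    }

    /* Sleep until at least one registered descriptor becomes readable. */
    if (poll(pfds, (nfds_t)nevents, -1) < 0) {
      perror("poll");
      return;
    }

    /* Dispatch the handler for every ready descriptor, then loop again. */
    for (int i = 0; i < nevents; i++) {
      if (pfds[i].revents & POLLIN)
        events[i].on_readable(events[i].fd);
    }
  }
}

/* Example handler: echo whatever arrived on the descriptor (already readable). */
static void
echo_line(int fd)
{
  char buf[256];
  ssize_t n = read(fd, buf, sizeof(buf));
  if (n > 0)
    (void)write(STDOUT_FILENO, buf, (size_t)n);
}

int
main(void)
{
  struct event events[1] = { { STDIN_FILENO, echo_line } };
  run_event_loop(events, 1);
  return 0;
}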
1.3. Multi-Threading

Different Operating Systems implement multi-threading in different ways, but it is generally a mechanism to allow a process to split itself into two or more concurrently running tasks. These tasks (threads) all exist within the context of a single process. A fundamental difference between creating a thread and creating a new process is that threads are allowed to share resources not (commonly) shared between separate processes. As a side note, it is typically much less expensive for an OS to switch execution between threads than between processes.

Threading is a simpler abstraction of concurrency than asynchronous event processing, but every OS has limitations on how many threads it can handle. Even though switching threads is lightweight, it still has overhead and consumes CPU. Threads also consume some additional memory, of course, although typically not as much as individual processes will.
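As a minimal illustration of threads sharing a single process (again, not Traffic Server code), the POSIX threads sketch below starts two workers that update the same variable in the shared address space, with a mutex protecting the shared state.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                        /* shared by all threads in the process */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
  (void)arg;
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock(&lock);                  /* avoid a race condition on counter */
    counter++;
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int
main(void)
{
  pthread_t t1, t2;

  pthread_create(&t1, NULL, worker, NULL);      /* both threads see the same counter */
  pthread_create(&t2, NULL, worker, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);

  printf("counter = %ld\n", counter);           /* 200000: memory is shared, unlike separate processes */
  return 0;
}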



1.4. Why make it twice as complicated?

Now that we have a basic understanding of what these concurrency mechanisms provide, let's discuss why Traffic Server decided to use both. This is an important discussion, because it will help you decide which HTTP intermediary solution you should choose.

Multi-threading is a popular paradigm for solving concurrency issues because it is a well-understood and proven technology. It is also well-supported on most modern Operating Systems. It solves the concurrency problem well, but it does have a few problems and concerns, such as:

- Writing multi-threaded applications is difficult, particularly if the application is to take advantage of shared memory. Lock contention, deadlocks, priority inversion and race conditions are some of the difficulties developers will have to confront.

- Even though threads are lightweight, they still incur context switches in the Operating System. Each thread also requires its own "private" data, particularly on the stack. As such, the more threads you have, the more context switches you will see, and memory consumption will increase linearly as the number of threads increases.

It generally is easier to program for asynchronous event loops, and there are many abstractions and libraries available that provide good APIs. Some examples include libevent[2] and libev[3] for C and C++ developers. (There are also bindings for many higher-level languages for both these libraries, and others.) Of course, there are a few limitations with event loops:

- The event loop (and handlers) typically only supports running on a single CPU.

- If the event loop needs to deal with a large number of events, increased latency can occur before an event is processed (by the nature of the events being queued).

- To avoid blocking the event loop, all I/O needs to be asynchronous. This makes it slightly more difficult for programmers, particularly when integrating existing libraries (which may be synchronous by nature).

Traffic Server decided to combine both of these techniques, thus eliminating many of the issues and limitations associated with each of them. In Traffic Server, there are a small number of "worker threads"; each such worker thread runs its own asynchronous event processor. In a typical setup, this means Traffic Server will run with only around 20-40 threads. This is configurable, but increasing the number of threads above the default (which is 3 threads per CPU core) will yield worse performance due to the overhead caused by the additional threads.

Figure 1. Traffic Server Thread Model

Our solution does not solve all the problems related to concurrent processing, but it makes things a lot better, and certainly very scalable. Care has been taken to provide flexible APIs so that plugin developers can write thread-safe and non-blocking code.
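A rough sketch of this hybrid model follows (simplified, and not taken from the Traffic Server sources; the worker count and pipe-based event source are invented for illustration). A small, fixed pool of worker threads is created, and each worker runs its own independent event loop, so events never cross threads while the process still uses all CPU cores.

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* Traffic Server sizes its pool from the CPU core count */

struct worker {
  pthread_t thread;
  int       id;
  int       pipe_rd;    /* this worker's private event source */
  int       pipe_wr;
};

/* Each worker runs its own event loop; events are never shared across threads. */
static void *
worker_loop(void *arg)
{
  struct worker *w = (struct worker *)arg;
  struct pollfd pfd;

  pfd.fd = w->pipe_rd;
  pfd.events = POLLIN;

  for (;;) {
    pfd.revents = 0;
    if (poll(&pfd, 1, -1) < 0)
      break;
    if (pfd.revents & POLLIN) {
      char ev;
      if (read(w->pipe_rd, &ev, 1) <= 0 || ev == 'q')
        break;
      /* A real handler would do non-blocking I/O here, never a blocking call. */
      printf("worker %d handled event '%c'\n", w->id, ev);
    }
  }
  return NULL;
}

int
main(void)
{
  struct worker workers[NUM_WORKERS];

  for (int i = 0; i < NUM_WORKERS; i++) {
    int fds[2];
    if (pipe(fds) != 0)
      return 1;
    workers[i].id = i;
    workers[i].pipe_rd = fds[0];
    workers[i].pipe_wr = fds[1];
    pthread_create(&workers[i].thread, NULL, worker_loop, &workers[i]);
  }

  /* Dispatch a few dummy events round-robin across the worker threads. */
  for (int i = 0; i < 8; i++)
    (void)write(workers[i % NUM_WORKERS].pipe_wr, "e", 1);

  for (int i = 0; i < NUM_WORKERS; i++)
    (void)write(workers[i].pipe_wr, "q", 1);   /* ask each worker to exit */
  for (int i = 0; i < NUM_WORKERS; i++)
    pthread_join(workers[i].thread, NULL);

  return 0;
}

In the real server the event sources are sockets and disk I/O rather than pipes, but the one-event-loop-per-thread structure is the same idea described above.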
OTHER HTTP PROXY SOLUTIONS

Traffic Server is obviously not a new invention of any kind, as there are plenty of similar solutions both in the Open Source community and as commercial products. This paper will not detail all of the available solutions; instead, we will focus on Free and Open Source solutions.

All of these existing intermediaries provide the basic features necessary for proxying HTTP requests. Each piece of software has its own pros and cons: some are optimized for a smaller set of applications, while others are more generic. Performance differs wildly between the different implementations, but in all honesty, performance is usually the least important piece in the decision-making process. The next sections will discuss several of the more common intermediary solutions that are currently available.

1.5. Squid

Squid[4] is probably the most well-known, and oldest, of the popular HTTP proxy servers currently in use. It originated from the Harvest project, and has since gone through many updates and even large rewrites. The code base is very mature, and it is used in a large number of mission-critical applications.

Squid typically runs as a single-process, single-threaded, asynchronous event processor. This means that it is somewhat limited in scalability on modern multi-core systems; however, there is work being done to try to alleviate this problem.



When it comes to features and support for all the various extensions to HTTP and HTTP intermediaries, Squid shines. There is really no other Open Source server that is as feature-rich as Squid right now, and it should definitely be considered when evaluating servers.

1.6. Varnish

Varnish[5] is an HTTP intermediary which takes advantage of modern kernel features in Linux, FreeBSD and Solaris in order to simplify the code, while at the same time achieving very high performance. A fundamental design decision is that all caching is done using the virtual memory provided by the Operating System, and each active connection uses up a thread. The latter means that Varnish can (and probably will) run with a large number of threads.

The core code in Varnish is fairly small. Instead, the system comes with its own configuration language, VCL, which is very flexible. The downside is that almost any configuration or setup with Varnish will require some VCL coding or tweaking. There are a large number of contributed VCL scripts, which solve many common problems and configuration requirements.

But Varnish wasn't built to be a general-purpose intermediary. As an example, Varnish will buffer the entire response before sending it to the client, which might not work for all types of HTTP services.

1.7. nginx

nginx[6] is an HTTP web server that can also function as a proxy and cache, which puts it in the same category as Apache HTTPD. In fact, nginx is quickly becoming a contender in the HTTP arena, already having grabbed a significant portion of the market share. This jack-of-all-trades design also means that nginx is not a general-purpose intermediary either.

nginx uses a concurrency model similar to Apache Traffic Server, except that it uses multiple processes instead of threads. In addition to HTTP, it can proxy several other TCP protocols, and it also has a flexible plugin interface for extensions and additions.

1.8. HAProxy

HAProxy[7] implements a proxy server that is primarily tailored for HTTP (and possibly other TCP protocols) load balancing and request routing. It is an event-driven, single-process application, with a rich feature set for making interesting Layer 7 routing decisions. With only a single process, it does not scale particularly well on modern multi-core CPUs. It has a limited feature set as a generic HTTP intermediary, but is very robust and reliable as a proxy. The HAProxy official website points out that the server has never crashed in a production environment, which is quite a feat if true.

1.9. Feature Comparison

The following table summarizes and compares common features implemented by a few popular HTTP intermediaries:

                   ATS       HAProxy   nginx     Squid     Varnish
  Worker Threads   Yes       No        No        No        Yes
  Multi-process    No        Yes (1)   Yes       Yes (2)   Yes
  Event-driven     Yes       Yes       Yes       Yes       No
  Plugin APIs      Yes       No        Yes (3)   Yes (4)   Yes (5)
  Forward proxy    Yes       No        No        Yes       No
  Reverse proxy    Yes       Yes       Yes       Yes       Yes
  Transp. proxy    No (6)    Yes       Yes       Yes       No
  Load Balancer    Yes (7)   Yes       Yes       Yes       Yes (8)
  Cache            Yes       No        Yes       Yes       Yes
  ESI              Yes       No        No        Yes       Yes
  ICP              Yes (9)   No        No        Yes       No
  Keep-Alive       Yes       No        Yes       Yes       Yes
  SSL              Yes       No        Yes       Yes       No
  Pipeline         Yes (10)  No        No        Yes       Yes

  TABLE I. COMPARING HTTP INTERMEDIARIES

  (1) Not recommended by the maintainer
  (2) Only with completely separate process instances
  (3) Requires a recompile of the entire application
  (4) Squid v3 has plugin capabilities via eCAP
  (5) Using the VCL scripting language and compiler
  (6) This is actively being worked on
  (7) Round-Robin only at this point
  (8) Round-Robin and a random director
  (9) Partially broken at the time of writing this paper
  (10) Only between TS and the client

CONTENT DELIVERY NETWORKS

A Content Delivery Network, or CDN, is a service or infrastructure used to deliver certain types of HTTP content. This content is usually static by nature, so Edge caches can effectively store the objects locally for some time. Examples of CDN-type content are JavaScript, CSS, and all types of images and other static media content. Serving such content out of a caching HTTP intermediary makes deployment and management significantly easier, since the content distribution is automatic.
A CDN automates content distribution to many colocations, simplifying the operational tasks and costs. To improve end-user experience, a CDN is commonly deployed on the Edge networks, assuring that the content is as close as possible to the users.

There are several reasons this is beneficial:

- Cost reductions, and more effective utilization of resources
- Faster page load times
- Redundancy and resilience to network outages

The biggest question you face when deciding on a CDN is whether to build it yourself or to buy it as a service from one of the many commercial CDN vendors. In most cases, you are probably better off buying CDN services initially. There are initial costs associated with setting up your own private CDN on the Edge, and this should be considered when doing these evaluations.

Notwithstanding the above limitations, I am a strong proponent of building your own CDN, particularly if your traffic is large enough that the costs of buying the services from a CDN vendor are considerable. Further, to be blunt, building a CDN is not rocket science. Any organization with a good infrastructure and operations team can easily do it. All you need is to configure and deploy a (small) number of servers running as reverse proxy servers for HTTP (and sometimes HTTPS).

1.10. Building a CDN with Apache TS

Apache Traffic Server is an excellent choice for building your own CDN. Why? First of all, it scales incredibly well on a large number of CPUs, and well beyond Gigabit network cards. Additionally, the technology behind Traffic Server is well-geared toward a CDN:

- The Traffic Server cache is fast and scales very well. It is also very resilient to corruption and crashes. In over 4 years of use in the Yahoo! CDN, there has not been a single (known) data corruption in the cache.

- The server is easy to deploy and manage as a reverse proxy server. The most common configuration tasks and changes can be done on live systems, and never require server restarts.

- It scales well for a large number of concurrent connections, and supports all necessary HTTP/1.1 protocol features (such as SSL and Keep-Alive).

As a proven technology, Traffic Server delivers over 350,000 requests/second, and over 30 Gbps, in the Yahoo! CDN alone. This is an unusually large private CDN, with over 100 servers deployed worldwide; most setups will be much smaller.

Of course, many of the other existing HTTP caches can be used to build a CDN. We believe Traffic Server is a serious contender in this area, but there is healthy competition.

1.11. Configuration

We are not going to go into great detail about how to configure Apache Traffic Server for building your CDN. There are primarily two configuration files relevant for setting up Traffic Server as a caching intermediary:

- records.config – This file holds a number of key-value pairs, and in most situations the defaults are good enough (but we will tweak this for a CDN).

- remap.config – This configuration file, which is empty by default, holds the mapping rules so that TS can function as a reverse proxy.

Out of the box, the Traffic Server configuration is very restricted; in order to build a basic CDN server we will need to modify both of these files. Let's start with records.config:

CONFIG proxy.config.http.server_port INT 80
CONFIG proxy.config.cache.ram_cache.size LLONG 512MB

And then remap.config (these are just examples for a "dummy" CDN):

map http://cdn.example.com/js http://js.example.com
map http://cdn.example.com/css http://css.example.com
map http://cdn.example.com/img http://img.example.com

Some example URLs that will work with the above configuration:

http://cdn.example.com/js/cool-stuff.js
http://cdn.example.com/img/thumbnail/ogre.png

Of course, there can be much more complex configurations, particularly in the remap configuration, but the examples demonstrate that a functional CDN can be built with very little configuration using Apache Traffic Server.

CONNECTION MANAGEMENT WITH ATS

Connection management is very similar to a CDN; in fact, many CDN vendors provide such services as well. The purpose of such a service is primarily to reduce latency for the end-user. Living on the Edge, the connection management service can effectively fight two enemies of web performance:

- TCP 3-way handshake. Being on the Edge, the latency introduced by the handshake is reduced. Allowing for long-lived Keep-Alive connections can eliminate such latency entirely.

- TCP congestion control (e.g. "Slow Start"). The farther away a user is from the server, the more visible the congestion control mechanisms become. Being on the Edge, users will always connect to an HTTP server (an Origin Server or another intermediary) that is close.

The following picture shows how users in various areas of the world connect to different servers. Some users might connect directly to the HTTP web server (the "service"), while others might connect to an intermediary server that is close to the user.

Figure 2. Connection management

Connections between the intermediaries (the connection managers) and Origin Servers (the "web site") are long-lived, thanks to HTTP Keep-Alive. Reducing the distance between a user and the server, as well as eliminating many new TCP connections, will reduce page-load times significantly. In some cases, we've measured a reduction of 1 second or more in first page-load time, only by introducing the connection manager intermediaries.

HTTP SERVER LOAD BALANCER

Load Balancing is a basic technique for routing traffic, such as HTTP requests, to a server in a way that achieves optimal performance, high availability, or easier service implementation. A Hardware Server Load Balancer can handle any (or most) TCP and UDP protocols, while an HTTP-specific SLB would obviously only handle HTTP and perhaps HTTPS. With an HTTP Server Load Balancer, you can:

- Assure that a particular user always hits the same backend (real) server
- Assure that a particular URL is served by the same backend (real) server
- Assure that there is always at least one real server available to serve any type of request

Getting users, or requests, associated with a smaller number of servers can significantly improve the performance of your applications. You can see better cache affinity, smaller active data sets, and easier (and faster) code to evaluate.
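As a toy illustration of the "same URL always goes to the same backend" idea (a hand-written sketch, not an Apache Traffic Server feature or API; the backend host names are made up), the C snippet below hashes the request URL and maps it onto a fixed list of backends, so repeated requests for one URL always land on the same machine.

#include <stdio.h>

/* Hypothetical backend pool; a real deployment would load this from configuration. */
static const char *backends[] = {
  "backend1.example.com",
  "backend2.example.com",
  "backend3.example.com",
};
static const size_t num_backends = sizeof(backends) / sizeof(backends[0]);

/* Simple FNV-1a string hash; deterministic, so a given URL always maps the same way. */
static unsigned long
hash_url(const char *url)
{
  unsigned long h = 2166136261UL;
  for (const char *p = url; *p != '\0'; p++) {
    h ^= (unsigned char)*p;
    h *= 16777619UL;
  }
  return h;
}

static const char *
pick_backend(const char *url)
{
  return backends[hash_url(url) % num_backends];
}

int
main(void)
{
  const char *urls[] = {
    "http://cdn.example.com/js/cool-stuff.js",
    "http://cdn.example.com/img/thumbnail/ogre.png",
    "http://cdn.example.com/js/cool-stuff.js",   /* same URL -> same backend */
  };

  for (size_t i = 0; i < sizeof(urls) / sizeof(urls[0]); i++)
    printf("%s -> %s\n", urls[i], pick_backend(urls[i]));
  return 0;
}

A production load balancer would typically use consistent hashing instead of a plain modulo, so that adding or removing a backend only remaps a small fraction of URLs.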
The following picture depicts a typical HTTP Server Load Balancer setup:

Figure 3. Server Load Balancer

1.12. Using Apache TS as an SLB

Unfortunately, this is an area where Traffic Server is currently behind the curve, and we openly admit it. Basic load balancing can be done with the configurations available, but anything advanced will require coding a custom plugin. There is some hope that an additional piece of technology can be open-sourced, but there is no ETA at this time.

So why are we talking about this at all? Well, it is an important feature that is somewhat lacking in Apache Traffic Server, and our hope is that discussing it openly will attract attention and interest from other developers who would like to work on these features. Also, in all fairness, some of the other HTTP intermediaries do a great job here; we have much to learn from them.

CONCLUSIONS



Apache Traffic Server is one of several free HTTP intermediaries which, when properly used, can improve scalability, availability and performance for many mission-critical services. We do not expect TS to be the optimal solution for every possible application, but we think it is a viable option for many common use cases.

We believe that Traffic Server provides a solution that is flexible, easy to set up, protocol conformant, and provides very high performance. In particular, Traffic Server:

- Is (or will be) at least as fast as the existing alternatives
- Scales well on modern SMP systems
- Has a feature set that is on par with the best intermediaries

As the community grows and matures, the hope is that Traffic Server will catch up where it is missing functionality or features. Performance-wise, we believe it already delivers an outstanding result.

The Apache Traffic Server mailing lists[1] and IRC chat rooms are a great place to get started. Please come to ask questions or to provide input. We offer quick feedback on your ideas and projects, and would love to hear what we can work on to make Apache Traffic Server even better.

REFERENCES

[1] http://trafficserver.apache.org/
[2] http://www.monkey.org/~provos/libevent/
[3] http://software.schmorp.de/pkg/libev.html
[4] http://www.squid-cache.org/
[5] http://varnish-cache.org/
[6] http://www.nginx.org/
[7] http://haproxy.1wt.eu/

