2008 Recap – Infrastructure Redundancy – a Good Thing ™
Redundancy? No, I’m not talking about the effects of the global economic crisis.
I’m talking about the fact that Murphy wore one hairy moustache.
Building infrastructure for the Movember campaign requires a specific mindset from its systems engineers: anything can go wrong. The campaign is designed to keep people visiting the site, registering and donating to the cause, non-stop, for at least the month of November. For this reason, one of my three aims as the systems administrator is to avoid downtime at all costs. (The other two aims are scalability and security; these will be the subject of future articles.)
Taking a plethora of lessons learnt from 2007, we set ourselves the task of making 2008 a success with regard to the murky depths of server-side architecture that most people never get to see. The aim was to ensure that any worst-case incident affecting a production server, whatever its purpose (content serving, file serving, database serving, proxy serving and so on), could be isolated from the rest of the infrastructure, allowing the server to be brought down or removed with no downtime to the site. This is called high availability. Our attitude was that there’s no reason why any part of the application could not be made redundant, regardless of its role. This article is the first in a series explaining how we did it.
How did we do it?
Reverse Proxies and Application Servers
Our squid reverse proxy servers, the heroes of page response times and load reduction, were distributed around the world to achieve four aims:
- Ensure lookup requests of the Movember domain(s) resolved to a user’s closest proxy server geographically, to result in faster response times (using powerdns and geobackend, subject of a future article);
- Cache static content such as images to improve those response times and page loads even further;
- Use multiple cache_peer parameters to distribute requests that could not be cached (e.g. PHP files and other dynamic content) evenly across multiple application servers;
- Respond to HTTPS requests and ensure these uncacheable requests remained encrypted on the way back to the application servers through a VPN.
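To make the caching and round-robin aims above a little more concrete, here is a minimal Python sketch of the decision a proxy makes for each request. It is only an illustration, not how squid is implemented: the `TinyProxy` class and `fetch` helper are hypothetical, and treating `.php` paths as dynamic is an assumption made for the example (in squid, cacheability is decided by ACLs such as a `cgi-bin`/`?` pattern and by response headers).

```python
import re
from itertools import cycle

# Application server addresses, as they appear as cache_peer parents
# in the squid configuration further down.
APP_SERVERS = ["10.0.1.66", "10.0.1.67"]

# Roughly the idea behind "acl QUERY urlpath_regex cgi-bin \?":
# anything matching is treated as dynamic and never cached.
DYNAMIC = re.compile(r"cgi-bin|\?")


def fetch(parent, path):
    """Hypothetical stand-in for forwarding a request to an app server."""
    return f"{path} from {parent}"


class TinyProxy:
    def __init__(self, parents):
        self._next_parent = cycle(parents)  # round-robin over the parents
        self._cache = {}                    # path -> cached body

    def handle(self, path):
        """Return (source, body), where source is 'cache' or a parent address."""
        # Assumption for this sketch: .php paths count as dynamic content.
        if DYNAMIC.search(path) or path.endswith(".php"):
            parent = next(self._next_parent)
            return parent, fetch(parent, path)
        if path in self._cache:
            return "cache", self._cache[path]
        # Cache miss: fetch once from the next parent, then remember it.
        parent = next(self._next_parent)
        body = fetch(parent, path)
        self._cache[path] = body
        return parent, body


if __name__ == "__main__":
    proxy = TinyProxy(APP_SERVERS)
    for path in ["/index.php", "/index.php", "/logo.png", "/logo.png"]:
        print(path, "->", proxy.handle(path)[0])
```

Dynamic requests alternate between the application servers, while a static object is fetched once and then served from the proxy's own cache.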
The proxies were paired in the UK, US and Australia, and each pair acted as its own load balancer using keepalived, which assigned a floating IP that either proxy could assume. If one proxy in the UK went down, the ‘sibling’ proxy could take the floating IP and continue to serve requests. The proxies were also able to communicate with each other to retrieve cached content from their sibling via ICP if necessary.
This is an example squid configuration file, taken from one of the US proxies. We used squid3, which was available in the Ubuntu repositories, but we compiled it ourselves to include SSL support, as the Ubuntu-shipped squid3 binaries didn’t support this.
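The floating IP takeover can be pictured with a small Python model. keepalived implements this with VRRP, where the live node with the highest priority holds the virtual IP; the sketch below only mimics that election rule. The host names `proxyuk01` and `proxyuk02` and the priority values are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    name: str
    priority: int       # higher priority wins the election
    alive: bool = True  # False once the proxy stops responding


def floating_ip_holder(pair):
    """Return the name of the proxy that should hold the floating IP.

    Mimics the VRRP rule: among the proxies still alive, the one with
    the highest priority takes the address. Returns None if none are alive.
    """
    candidates = [p for p in pair if p.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.priority).name


if __name__ == "__main__":
    # Hypothetical UK pair with illustrative priorities.
    primary = Proxy("proxyuk01", priority=150)
    sibling = Proxy("proxyuk02", priority=100)
    pair = [primary, sibling]

    print(floating_ip_holder(pair))  # primary holds the IP
    primary.alive = False            # simulate the primary going down
    print(floating_ip_holder(pair))  # sibling takes over
```

Because clients only ever see the floating IP, the takeover does not require any DNS change.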
# Ports we will listen on
http_port 80 vhost
https_port 443 vhost cert=/etc/squid3/new2008-movember.com.ssl key=/etc/squid3/new2008-movember.com.key cafile=/etc/squid3/new.intermediate.cer defaultsite=www.movember.com cipher=DEFAULT:!EXPORT:!LOW options=NO_SSLv2
icp_port 3130
# PCI Verizon scan results
reply_header_access X-Cache-Lookup deny all
reply_header_access X-Cache deny all
reply_header_access All allow all
via off
httpd_suppress_version_string on
# set a cache_peer for sibling
cache_peer 67.192.120.171 sibling 80 3130 no-digest
# set a cache_peer per application servers
cache_peer 10.0.1.66 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
cache_peer 10.0.1.67 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.69 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.70 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.71 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.72 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# if there’s no response of any kind within 20 seconds
# let’s force a purdy splash page
forward_timeout 20 seconds
# Redirect to a pretty page on error
error_directory /usr/share/squid3/errors/Movember
# Use 80% of /srv/squid == 90G * .8 = 72G
cache_dir aufs /srv/squid 72000 16 256
# Some ACLs
acl manager proto cache_object
acl SSL_ports port 443
acl localhost src 127.0.0.0/8
acl Safe_ports port 80 # http
acl Safe_ports port 443 # https
acl CONNECT method CONNECT
acl movember_sites dstdomain .movember.com .movember.com.au .movember.co.nz
acl proxyus01 src 67.192.120.170
acl proxyus02 src 67.192.120.171
acl monitoring src 202.44.98.11
acl inodes src 124.168.137.162
http_access allow manager localhost
http_access allow manager inodes
http_access deny manager
http_access deny !Safe_ports
http_access allow movember_sites
htcp_access deny all
icp_access allow proxyus02
icp_access deny all
hierarchy_stoplist cgi-bin ?
access_log /var/log/squid3/access.log squid !proxyus01 !proxyus02 !monitoring
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY
# Refresh rules
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern . 0 20% 4320
coredump_dir /var/spool/squid3
# Tune memory usage
# Let’s use some of the 4G we have
cache_mem 2048 MB
# Make sure we can cache all the biggest js etc in RAM
maximum_object_size_in_memory 100 KB
# We want to cache bigger objects on disk
maximum_object_size 32 MB
A side effect of this arrangement was that the ‘application servers’, meaning the Zend Apache servers serving the content, were probably the easiest to make redundant: all they had to do was serve identical content and let the proxies distribute the non-cacheable (i.e. dynamic) inbound requests. We kept the application servers serving identical content by checking out our Subversion-controlled code on the first app server, which then rsynced the incremental changes to all other app servers where necessary.
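As a rough sketch of that fan-out step, the following Python builds one rsync command per secondary app server. It is not the script we ran: the function name, the `/var/www/site` path and the choice of flags (`-a` archive mode, `-z` compression, `--delete` to remove files that no longer exist at the source) are all assumptions for illustration, and the commands default to `--dry-run` so that nothing is changed.

```python
import shlex


def build_rsync_commands(source_dir, hosts, dest_dir, dry_run=True):
    """Build one rsync argument list per destination host.

    source_dir is the checked-out code on the first app server; a trailing
    slash is enforced so rsync copies the directory's contents rather than
    the directory itself.
    """
    src = source_dir.rstrip("/") + "/"
    commands = []
    for host in hosts:
        cmd = ["rsync", "-az", "--delete"]  # assumed flags, see lead-in
        if dry_run:
            cmd.append("--dry-run")         # report changes without applying them
        cmd += [src, f"{host}:{dest_dir}"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical path; the host is one of the app servers from the config above.
    others = ["10.0.1.67"]
    for cmd in build_rsync_commands("/var/www/site", others, "/var/www/site/"):
        # Print the command instead of running it; pass the list to
        # subprocess.run(cmd, check=True) to execute it for real.
        print(shlex.join(cmd))
```

Since rsync only transfers differences, repeating this after every checkout keeps the servers identical at little cost.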
We used six application servers during peak periods, scaling from two and back down to one post-campaign, and eventually doing away with the proxies altogether.
















