2008 Recap – Infrastructure Redundancy – a Good Thing ™

Redundancy? No, I’m not talking about the effects of the global economic crisis.

I’m talking about the fact that Murphy wore one hairy moustache.

Building infrastructure for the Movember campaign requires a specific mindset from its systems engineers: anything can go wrong. The campaign is designed to keep people visiting the site, registering and donating to the cause, non-stop, for at least the month of November. For this reason, one of my three aims as the systems administrator is to avoid downtime at all costs. (The other two aims are scalability and security – these will be the subject of future articles.)

Taking a plethora of lessons learnt from 2007, we set ourselves the task of making 2008 a success with regard to the murky depths of server-side architecture that most people never get to see. The aim was to ensure that any worst-case incident affecting a server running in production, no matter its purpose (content serving, file serving, database serving, proxy serving and so on), could be isolated from the rest of the infrastructure, allowing the server to be brought down or removed with no downtime to the site. This is called high availability. Our attitude was that there’s no reason why any part of the application could not be made redundant, regardless of its role. This article is the first in a series explaining how we did it.

How did we do it?

Reverse Proxies and Application Servers

Our squid reverse proxy servers, the heroes of page response times and load reduction, were distributed around the world to achieve four aims:

  • Ensure lookup requests of the Movember domain(s) resolved to a user’s closest proxy server geographically, to result in faster response times (using powerdns and geobackend, subject of a future article);
  • Cache static content such as images to improve those response times and page loads even further;
  • Use multiple cache_peer parameters to distribute requests that could not be cached (i.e. PHP files, dynamic content) evenly to multiple application servers;
  • Respond to HTTPS requests and ensure these uncacheable requests remained encrypted leading back to the application servers through VPN.

The proxies were paired in the UK, US and Australia, and each pair load-balanced itself using keepalived, which assigned a floating IP that either proxy could assume. If one proxy in the UK went down, its ‘sibling’ proxy could take over the floating IP and continue to serve requests. The proxies were also able to retrieve cached content from their sibling via ICP if necessary.
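A quick way to see which proxy in a pair currently holds the floating IP is to look for it in the output of `ip addr`. The helper below is only a minimal sketch of that check, not something taken from our servers: the function name `holds_ip` is made up for the example, and 192.0.2.10 is a placeholder address, since the real floating IPs aren’t listed in this article.

```shell
#!/bin/sh
# Sketch only: report whether this host currently holds a given floating IP.
# FLOATING_IP is a placeholder from the documentation range, not a real address.
FLOATING_IP="${FLOATING_IP:-192.0.2.10}"

# holds_ip ADDRESS
# Reads `ip -o -4 addr show` style output on stdin and succeeds if ADDRESS
# is assigned to any interface (field 4 looks like "192.0.2.10/24").
holds_ip() {
    awk -v ip="$1" '
        { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

if ip -o -4 addr show 2>/dev/null | holds_ip "$FLOATING_IP"; then
    echo "this node holds $FLOATING_IP"
else
    echo "this node does not hold $FLOATING_IP"
fi
```

Run on both proxies of a pair, exactly one of them should report that it holds the address.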

This is an example squid configuration file, taken from one of the US proxies. We used squid3, which was available in the Ubuntu repositories, but we compiled it ourselves to include SSL support, as the squid3 binaries Ubuntu shipped didn’t support it.

# Ports we will listen on
http_port 80 vhost
https_port 443 vhost cert=/etc/squid3/new2008-movember.com.ssl key=/etc/squid3/new2008-movember.com.key cafile=/etc/squid3/new.intermediate.cer defaultsite=www.movember.com cipher=DEFAULT:!EXPORT:!LOW options=NO_SSLv2
icp_port 3130
# PCI Verizon scan results
reply_header_access X-Cache-Lookup deny all
reply_header_access X-Cache deny all
reply_header_access All allow all
via off
httpd_suppress_version_string on
# set a cache_peer for sibling
cache_peer 67.192.120.171 sibling 80 3130 no-digest
# set a cache_peer per application servers
cache_peer 10.0.1.66 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
cache_peer 10.0.1.67 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.69 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.70 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.71 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# cache_peer 10.0.1.72 parent 80 0 no-query originserver no-digest no-netdb-exchange login=PASS front-end-https=auto round-robin
# if there’s no response of any kind within 20 seconds
# let’s force a purdy splash page
forward_timeout 20 seconds
# Redirect to a pretty page on error
error_directory /usr/share/squid3/errors/Movember
# Use 80% of /srv/squid == 90G * .8 = 72G
cache_dir aufs /srv/squid 72000 16 256
# Some ACLs
acl manager proto cache_object
acl SSL_ports port 443
acl localhost src 127.0.0.0/8
acl Safe_ports port 80 # http
acl Safe_ports port 443 # https
acl CONNECT method CONNECT
acl movember_sites dstdomain .movember.com .movember.com.au .movember.co.nz
acl proxyus01 src 67.192.120.170
acl proxyus02 src 67.192.120.171
acl monitoring src 202.44.98.11
acl inodes src 124.168.137.162
http_access allow manager localhost
http_access allow manager inodes
http_access deny manager
http_access deny !Safe_ports
http_access allow movember_sites
htcp_access deny all
icp_access allow proxyus02
icp_access deny all
hierarchy_stoplist cgi-bin ?
access_log /var/log/squid3/access.log squid !proxyus01 !proxyus02 !monitoring
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY
# Refresh rules
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern . 0 20% 4320
coredump_dir /var/spool/squid3
# Tune memory usage
# Let’s use some of the 4G we have
cache_mem 2048 MB
# Make sure we can cache all the biggest js etc in RAM
maximum_object_size_in_memory 100 KB
# We want to cache bigger objects on disk
maximum_object_size 32 MB

A side effect of this arrangement was that the ‘application servers’, meaning the Zend Apache servers serving the content, were probably the easiest to make redundant, because all they had to do was serve identical content and let the proxies distribute the non-cacheable (i.e. dynamic) requests to them. We kept the application servers serving identical content by checking out our Subversion-controlled code on the first app server, which then rsynced the incremental changes to all other app servers where necessary.
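The update-then-sync step can be sketched roughly as follows. This is an illustration rather than our actual deploy script: the document root `/var/www/movember`, the `deploy` and `run` functions, the `DRY_RUN` switch and the rsync options are all assumptions, and the single peer shown is simply the second application server address from the squid config above.

```shell
#!/bin/sh
# Sketch only: update the working copy on the first app server, then push
# the changes to the other app servers with rsync.
# DOCROOT, DRY_RUN and the rsync options are assumptions for this example.
DOCROOT="${DOCROOT:-/var/www/movember}"
PEERS="${PEERS:-10.0.1.67}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

deploy() {
    run svn update "$DOCROOT" || return 1
    for peer in $PEERS; do
        # -a keeps permissions and timestamps, -z compresses in transit,
        # --delete removes files that have gone from the working copy,
        # and the .svn metadata directories are left out of the copy.
        run rsync -az --delete --exclude=.svn "$DOCROOT/" "$peer:$DOCROOT/" || return 1
    done
}

deploy
```

With the default `DRY_RUN=1` the script only prints what it would do; setting `DRY_RUN=0` and listing more addresses in `PEERS` (space-separated) would run the real commands against each peer in turn.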

We used six application servers during peak periods, scaling up from two and then back down to one post-campaign, and eventually doing away with the proxies altogether.

2008 Recap – Infrastructure Redundancy – Brain-splitting fun with DRBD/Heartbeat

File serving

In that innocent age of 2007, when it rained more, the economy was great and there was little time to think about high availability of your website, we used an NFS server to store assets or static content, and mounted the share on the application servers as NFS clients. This worked reasonably well, but it was not redundant. What would happen if we lost the NFS server? The client mounts would become stale and no images would load on the website.

We didn’t solve this issue in 2007; rather, we got lucky. In 2008 we tried a new approach: making our NFS server redundant. But how do you keep two file servers storing identical changes while a lot of uploaded data is being written, and furthermore, how do you let the application servers know not to panic if one file server drops?

In 2008 we used DRBD (Distributed Replicated Block Device) which provides a method of replicating the blocks of a device (such as a partition) between two servers over TCP/IP. Essentially, for the nerds out there, it’s RAID 1 over IP. DRBD was used in this way to replicate the shared content between two NFS servers, where the ‘export’ path was the shared DRBD partition.

Our DRBD setup was active-passive, which meant that only one of the two NFS servers would mount the share at any given time, though replication would continue over IP at all times. This was because NFS, being a fairly low-level kernel service, plays a lot nicer when it thinks it is the only ‘master’. But we needed a way to shut down node 1, stop NFS there and start it on the second node, and at the same time make this invisible to the application. To make it invisible to the app, we’d assign a floating IP to the NFS servers and use this floating IP to mount the share on the apps.

Our DRBD config:

global {
usage-count no;
}
common {
syncer {
rate 100M;
}
}
resource nfs {
protocol C;
startup {
degr-wfc-timeout 120; # 2 minutes.
}
disk {
# on-io-error detach;
}
net {
cram-hmac-alg sha1;
shared-secret "removed";
}
on movember-dvmh-nfs-01 {
device /dev/drbd1;
disk /dev/mapper/ubuntu-SrvNfs;
address 10.0.1.91:7788;
meta-disk /dev/mapper/ubuntu-drbd[0];
}
on movember-dvmh-nfs-02 {
device /dev/drbd1;
disk /dev/mapper/ubuntu-SrvNfs;
address 10.0.1.92:7788;
meta-disk /dev/mapper/ubuntu-drbd[0];
}
}

Sure, we had used keepalived with the proxies as mentioned above, and it is blazingly fast, but it didn’t have to think about services; it just had to let the nodes assume the floating IP. NFS required something a little heavier in terms of floating IPs: something that could also stop and start services like NFS, to make sure nothing nasty happened with the filesystem, locks and such.

The solution was to use Heartbeat, another high-availability solution that features plugins and hooks for other services such as DRBD. What this meant was that we could give Heartbeat the responsibility of handing out the floating IP and monitoring each node’s health.

Our Heartbeat ha.cf:

# debug output
debugfile /var/log/ha-debug.log
# all other logs
logfile /var/log/ha-log.log
logfacility local0
keepalive 1
deadtime 10
warntime 3
initdead 20
bcast eth0
auto_failback off
# STONITH
#stonith_host * external/ssh 10.0.1.90 root nottherootpassword
node movember-dvmh-nfs-01
node movember-dvmh-nfs-02

Our Heartbeat haresources config:

movember-dvmh-nfs-01 movembernfs IPaddr::10.0.1.90/27/eth0 drbddisk::nfs Filesystem::/dev/drbd1::/srv/nfs::xfs

See the documentation for more information on the format of haresources.

On each server we removed the init scripts for NFS, as we were handing over the responsibility of starting and stopping NFS to Heartbeat with the configurations above.

update-rc.d -f nfs-kernel-server remove

update-rc.d -f nfs-common remove

NFS then just had to use /srv/nfs as its export, and we’d mount the share on the clients using the floating IP 10.0.1.90.
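As a rough illustration of what that looks like on each side: only `/srv/nfs` and `10.0.1.90` below come from the configuration above. The export options, the `10.0.1.64/27` client range (the network that `10.0.1.90/27` sits in), the mount point `/mnt/assets` and the `DRY_RUN` wrapper are assumptions made for the sketch.

```shell
#!/bin/sh
# Sketch only. On both NFS servers, /etc/exports might contain a line like:
#   /srv/nfs 10.0.1.64/27(rw,sync,no_subtree_check)
# (the options and the client range are assumptions, not our real settings)

NFS_VIP="10.0.1.90"                         # floating IP handed out by Heartbeat
EXPORT_PATH="/srv/nfs"                      # DRBD-backed export
MOUNT_POINT="${MOUNT_POINT:-/mnt/assets}"   # hypothetical client mount point
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# On each application server: mount the share via the floating IP, so the
# client never needs to know which NFS node is currently active.
mount_assets() {
    run mkdir -p "$MOUNT_POINT" &&
    run mount -t nfs "$NFS_VIP:$EXPORT_PATH" "$MOUNT_POINT"
}

mount_assets
```

The point of the design is that the clients only ever refer to 10.0.1.90, never to 10.0.1.91 or 10.0.1.92 directly.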

When the time came to move the resources to node B because node A had died, Heartbeat was able to shut down NFS on node A (if it could), mount the DRBD device on node B and start NFS there to export the share. As well as this, we could tune Heartbeat to do this with fairly short delays, to ensure that requests for files would not hang for too long while the app tried to sort out where its files had gone.

Couple this with the fact that the massive volume of requests to the site meant that the majority of static files typically served by NFS were already cached at the reverse proxy level and served straight back to the user out of RAM, and shutting down an NFS server was a practically invisible process to all concerned.

Gracefully, anyway.

What we discovered was that Heartbeat was generally not the problem, as it was quite efficient at detecting a dead node and switching the resources. The issue was DRBD: if network connectivity was disrupted between the two DRBD nodes, it was very easy to end up in a split-brain situation, where DRBD has no way to tell which node should be primary (rather, both nodes think they have the right to be primary).

The reason this is a problem is that if a secondary node, which has out-of-date data, starts getting written to because it is mistakenly treated as primary, you have lost your data integrity, as both nodes are now in a completely inconsistent state.

In physical environments, this is often worked around by providing a secondary method of communication between the peer nodes, such as a crossover cable. We used VMware virtualisation and there was no additional interface between these two machines, so although we had a high level of redundancy, we still ran the risk of split brain. In certain situations, such as VMotioning an NFS server, latency resulted in an inconsistency between the nodes, which had to be manually recovered by discarding the data from the secondary node and resyncing. Fortunately in such cases, it was clear which node was the ‘survivor’ and which was the ‘victim’.
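For reference, the manual recovery described above looks roughly like the sketch below. It assumes the `drbdadm` syntax from the DRBD 8.0 to 8.3 series and the resource name `nfs` from our config; the function names and the `DRY_RUN` wrapper are only there for illustration, and the victim is assumed to have already had the filesystem unmounted.

```shell
#!/bin/sh
# Sketch only: manual split-brain recovery for the "nfs" DRBD resource,
# using DRBD 8.0-8.3 style drbdadm syntax (an assumption about the version).
RESOURCE="${RESOURCE:-nfs}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Run on the 'victim': demote it and reconnect, throwing away its own
# changes so that it resyncs from the survivor.
recover_victim() {
    run drbdadm secondary "$RESOURCE" &&
    run drbdadm -- --discard-my-data connect "$RESOURCE"
}

# Run on the 'survivor': reconnect if it dropped to StandAlone.
recover_survivor() {
    run drbdadm connect "$RESOURCE"
}

case "${1:-}" in
    victim)   recover_victim ;;
    survivor) recover_survivor ;;
    *)        echo "usage: $0 victim|survivor" >&2 ;;
esac
```

Everything hinges on picking the victim correctly, because whatever was written there since the split is lost once the resync starts.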

The ultimate lesson was that DRBD was a risky solution in a virtualised environment. The other lesson was that a DRBD/Heartbeat solution doesn’t scale. In 2009, we have been researching entirely new technologies such as the use of GlusterFS, CouchDB or MogileFS in an effort to distribute the filesystem more sanely.

Stay tuned for part 3 of this recap where we discuss our experience using MySQL Replication in an active/passive relationship.