2008 Recap – Infrastructure Redundancy – Brain-splitting fun with DRBD/Heartbeat
File serving
In that innocent age of 2007, when it rained more, the economy was great and there was little time to think about high availability of your website, we used an NFS server to store assets or static content, and mounted the share on the application servers as NFS clients. This worked reasonably well, but it was not redundant. What would happen if we lost the NFS server? The client mounts would become stale and no images would load on the website.
We didn’t solve this issue in 2007; rather, we got lucky. In 2008 we tried a new approach: making our NFS server redundant. But how do you keep two file servers identical while a large volume of uploaded data is being written, and how do you make the application servers understand that they should not panic if one file server drops?
In 2008 we used DRBD (Distributed Replicated Block Device), which replicates the blocks of a device (such as a partition) between two servers over TCP/IP. Essentially, for the nerds out there, it’s RAID 1 over IP. We used DRBD to replicate the shared content between two NFS servers, where the ‘export’ path was the shared DRBD partition.
Our DRBD setup was active-passive, which meant that only one of the two NFS servers would mount the share at any given time, though replication would continue over IP at all times. This was because NFS, being a fairly low-level kernel service, plays a lot nicer when it thinks it is the only ‘master’. But we needed a way to shut down node 1, stop NFS there, start NFS on the second node, and at the same time make all of this invisible to the application. To make it invisible to the app, we’d assign a floating IP to the NFS servers and use this floating IP to mount the share on the apps.
Our DRBD config:
global {
    usage-count no;
}
common {
    syncer {
        rate 100M;
    }
}
resource nfs {
    protocol C;
    startup {
        degr-wfc-timeout 120; # 2 minutes.
    }
    disk {
        # on-io-error detach;
    }
    net {
        cram-hmac-alg sha1;
        shared-secret "removed";
    }
    on movember-dvmh-nfs-01 {
        device /dev/drbd1;
        disk /dev/mapper/ubuntu-SrvNfs;
        address 10.0.1.91:7788;
        meta-disk /dev/mapper/ubuntu-drbd[0];
    }
    on movember-dvmh-nfs-02 {
        device /dev/drbd1;
        disk /dev/mapper/ubuntu-SrvNfs;
        address 10.0.1.92:7788;
        meta-disk /dev/mapper/ubuntu-drbd[0];
    }
}
Sure, we had used keepalived with the proxies as mentioned above, which is blazingly fast, but it didn’t have to think about services, it just had to let the nodes assume the floating IP. The use of NFS required something a little more heavy in terms of floating IPs, something that could stop and start services like NFS to make sure nothing nasty happened with the filesystem and locks and such.
The solution was to use Heartbeat, another high-availability solution that features plugins and hooks for other services such as DRBD. What this meant was that we could assign Heartbeat the responsibility of handing out the floating IP and monitoring each node’s health.
Our Heartbeat ha.cf:
# debug output
debugfile /var/log/ha-debug.log
# all other logs
logfile /var/log/ha-log.log
logfacility local0
keepalive 1
deadtime 10
warntime 3
initdead 20
bcast eth0
auto_failback off
# STONITH
#stonith_host * external/ssh 10.0.1.90 root nottherootpassword
node movember-dvmh-nfs-01
node movember-dvmh-nfs-02
Our Heartbeat haresources config:
movember-dvmh-nfs-01 movembernfs IPaddr::10.0.1.90/27/eth0 drbddisk::nfs Filesystem::/dev/drbd1::/srv/nfs::xfs
See the documentation for more information on the format of haresources.
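As a rough illustration of that format: the first field is the preferred node, and every field after it is a resource, with `::` separating the resource script name from its arguments. Heartbeat starts the resources left to right and stops them in reverse order. The helper below is only a sketch for reading such a line; `parse_haresources` is a made-up name and not part of Heartbeat.

```shell
# Hypothetical helper (not part of Heartbeat): break one haresources
# line into its preferred node and its resources with their arguments.
parse_haresources() {
    # Split the line on whitespace into positional parameters.
    set -- $1
    echo "node=$1"
    shift
    for res in "$@"; do
        name="${res%%::*}"              # text before the first "::"
        case "$res" in
            *::*) args="${res#*::}" ;;  # everything after the first "::"
            *)    args="" ;;            # resource script with no arguments
        esac
        echo "resource=$name args=$(echo "$args" | sed 's/::/ /g')"
    done
}

parse_haresources 'movember-dvmh-nfs-01 movembernfs IPaddr::10.0.1.90/27/eth0 drbddisk::nfs Filesystem::/dev/drbd1::/srv/nfs::xfs'
```

Read this way, the line above gives the floating IP to `IPaddr`, promotes the DRBD resource `nfs` through `drbddisk`, and mounts `/dev/drbd1` on `/srv/nfs` as XFS through `Filesystem`.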
On each server we removed the init scripts for NFS, as we were handing over the responsibility of starting and stopping NFS to Heartbeat with the configurations above.
update-rc.d -f nfs-kernel-server remove
update-rc.d -f nfs-common remove
NFS then just had to use /srv/nfs as its export, and we’d mount the share on the clients using the floating IP 10.0.1.90.
When the time came to re-allocate the resources to node B because node A had died, Heartbeat was able to shut down NFS on node A (if it could), mount the DRBD device on node B and start NFS there to export that share. We could also tune Heartbeat to do this with fairly short delays, so that requests for files would not hang for too long while the app tried to sort out where its files had gone.
Couple this with the fact that the massive volume of requests to the site meant that the majority of static files typically served by NFS were already cached at the reverse proxy level and served straight back to the user out of RAM, and shutting down an NFS server was a practically invisible process to all concerned.
Gracefully, anyway.
What we discovered was that Heartbeat itself was generally not the problem, as it was quite efficient at detecting a dead node and switching the resources. The issue was DRBD: if network connectivity was disrupted between the two DRBD nodes, it was very easy to end up in a split-brain situation, where DRBD had no way to distinguish which node was primary (rather, both nodes think they have the right to be primary).
The reason this is a problem is that if a secondary node, which has out-of-date data, starts getting written to because it has mistakenly been promoted to primary, you have lost your data integrity: the two nodes now hold diverging copies of the data.
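One way to spot this state is the connection state DRBD reports in /proc/drbd. The sketch below assumes the DRBD 8.x layout, where each device line starts with its minor number and carries a `cs:` field; after a detected split brain the nodes typically drop the connection and show `StandAlone` (or `WFConnection`) instead of `Connected`. `drbd_cstate` is a hypothetical helper name, and the sample line in the usage note is illustrative rather than captured from our servers.

```shell
# Hypothetical helper: print the connection state (cs: field) of one
# DRBD minor number, reading /proc/drbd-style text from stdin.
drbd_cstate() {
    awk -v m="$1:" '
        $1 == m {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) {
                    sub(/^cs:/, "", $i)   # strip the "cs:" prefix
                    print $i
                }
            }
        }'
}
```

On a real node this would be called as `drbd_cstate 1 < /proc/drbd` for /dev/drbd1. Fed an illustrative line such as ` 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---`, it prints `StandAlone`.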
In physical environments, this is often worked around by providing a secondary method of communication between the peer nodes, such as a crossover cable. We use VMware virtualisation and there was no additional interface between these two machines, so even though we had a high level of redundancy elsewhere, we still ran a risk of split brain. In certain situations, such as VMotioning an NFS server, latency resulted in an inconsistency between the nodes, which had to be recovered manually by discarding the data from the secondary node and resyncing. Fortunately in such cases, it was clear which node was the ‘survivor’ and which was the ‘victim’.
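For reference, that manual recovery looks roughly like the sketch below. It assumes the `drbdadm` syntax of the DRBD 8.0 to 8.3 series (newer releases accept `--discard-my-data` as a regular option to `connect`) and our resource name `nfs`. The function names and the `DRBD_RUN` variable are made up for this example. Because the victim’s changes are thrown away, the sketch only prints the commands unless `DRBD_RUN` is set to an empty string.

```shell
# Sketch only: manual split-brain recovery for the DRBD resource "nfs".
RESOURCE="nfs"
# Defaults to "echo" so the commands are printed, not executed.
# Export DRBD_RUN="" before loading this to run them for real.
DRBD_RUN="${DRBD_RUN-echo}"

# Run on the victim: the node whose changes will be discarded.
# Its filesystem must already be unmounted so it can be demoted.
recover_victim() {
    $DRBD_RUN drbdadm secondary "$RESOURCE"
    $DRBD_RUN drbdadm -- --discard-my-data connect "$RESOURCE"
}

# Run on the survivor: only needed if it has dropped to StandAlone.
recover_survivor() {
    $DRBD_RUN drbdadm connect "$RESOURCE"
}
```

Once both sides reconnect, DRBD resyncs the victim from the survivor.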
The ultimate lesson was that DRBD was a risky solution in a virtualised environment. The other lesson was that a DRBD/Heartbeat solution doesn’t scale. In 2009, we have been researching entirely new technologies such as the use of GlusterFS, CouchDB or MogileFS in an effort to distribute the filesystem more sanely.
Stay tuned for part 3 of this recap where we discuss our experience using MySQL Replication in an active/passive relationship.














