How should an integrator cope with fail-over?

Delivery Manager is distributed across multiple servers and multiple sites. From time to time, connectivity to one of those servers or sites will fail. When that happens, you need to fail over to the servers that are still reachable.

DNS Fail-over

For most people, this happens automatically. We have a number of DNS load balancers (supplied by Zeus), so visitors will be given the IP address of an active entry point.

The main advantage is that there is nothing for integrators to specifically cater for. The client end needs access to DNS, and the local DNS server (if there is one) needs to respect the TTL and refresh intervals for the metapack.com zone. Most DNS server products do that out of the box, so this only becomes an issue when local administrators have customised the caching behaviour.
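On the client side, the usual way to benefit from DNS fail-over is to look the hostname up again for each new connection rather than storing an IP address for the life of the process. The following is a minimal sketch of that idea, not a MetaPack-supplied client; the function names are made up for illustration, and the hostname and port you pass in are whatever your integration already uses.

```python
import socket


def resolve_addresses(host, port):
    """Ask the resolver afresh and return unique (ip, port) pairs in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    pairs = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        pair = (sockaddr[0], sockaddr[1])
        if pair not in pairs:
            pairs.append(pair)
    return pairs


def connect_fresh(host, port, timeout=5.0):
    """Resolve the name now, then try each returned address until one connects."""
    last_error = None
    for ip, resolved_port in resolve_addresses(host, port):
        try:
            return socket.create_connection((ip, resolved_port), timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next address
    raise ConnectionError(f"no reachable address for {host}:{port}") from last_error
```

Calling connect_fresh() for every new connection means the process itself never holds on to an address; how quickly it sees a changed record still depends on the TTL being honoured by the resolvers in between.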

Global Server Load Balancing (GSLB) does have drawbacks, mostly around behaviour at the moment of fail-over: it may take a minute or two for stale DNS records to expire from caches and for the new records to reach browser clients, for example. Even so, that is quicker than having an engineer investigate, let alone reconfigure a router, in the middle of the night.

DNS fail-over is generally sufficient for critical services in the warehouse and for third-party integrations.

When DNS fail-over cannot be used...

There are times when the link between the warehouse and our network fails at a point outside MetaPack's networks. In those circumstances there is nothing we can do at our end, so you will have to implement a solution at yours.

It's not just a matter of implementing a second ISP as contingency (although that is a good idea). There are two main points to cover:

  • Detection of the link failure, and
  • The response (probably automated) to that link failure.

Detection can be achieved through constant monitoring, and the response might be to update a router with a new route to the alternative host. This could be done manually, or by pushing a script to the router over TFTP. It's all a bit fiddly, and is itself prone to failure.
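The detection half can be sketched as a small monitor that probes the link at intervals and only changes state after several consecutive results agree, so that a single dropped probe does not trigger a fail-over. This is an illustration under assumptions: the class name, the thresholds and the TCP-connect probe are all hypothetical choices, and the response itself (updating the route) is left to whatever mechanism you use.

```python
import socket


def link_is_up(host, port, timeout=3.0):
    """Probe the link by opening, then closing, a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class LinkMonitor:
    """Tracks whether we should currently be failed over to the alternative host."""

    def __init__(self, probe, fail_threshold=3, recover_threshold=2):
        self.probe = probe                      # callable returning True when the link is up
        self.fail_threshold = fail_threshold    # consecutive failures before failing over
        self.recover_threshold = recover_threshold  # consecutive successes before failing back
        self.failed_over = False
        self._streak = 0  # consecutive probe results that contradict the current state

    def check(self):
        """Run one probe and return True while fail-over should be in effect."""
        ok = self.probe()
        contradicts = ok if self.failed_over else not ok
        self._streak = self._streak + 1 if contradicts else 0
        limit = self.recover_threshold if self.failed_over else self.fail_threshold
        if self._streak >= limit:
            self.failed_over = not self.failed_over
            self._streak = 0
        return self.failed_over
```

In use, something like LinkMonitor(lambda: link_is_up(host, port)) would be called from a scheduler or a loop with a sleep, and the route change made whenever the value returned by check() flips.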

A better solution is to implement High Availability using standard tools. I recommend HAProxy if you're UNIX/Linux based. There are dozens of web proxy solutions for Windows, and no two are alike. Although most people only use HTTP, you may also need to support HTTPS if your security policy demands it.

You can find the IP addresses of our servers in the reference data section.
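Whichever tool you choose, the core behaviour is the same: keep an ordered list of server addresses and send traffic to the first one that answers. The sketch below shows only that selection logic in Python and is not HAProxy configuration. The addresses are placeholders from the reserved documentation ranges, not MetaPack servers; substitute the real ones from the reference data.

```python
import socket

# Placeholder addresses (reserved documentation ranges), in order of preference.
BACKENDS = [("192.0.2.10", 80), ("198.51.100.10", 80)]


def is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to (ip, port) can be opened."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend(backends, probe=is_reachable):
    """Return the first backend that passes the probe, or None if none do."""
    for address in backends:
        if probe(address):
            return address
    return None
```

The probe is passed in as a parameter so that the selection can be exercised without touching the network, and so that an HTTP or HTTPS check can be swapped in if a plain TCP connect is not a strong enough signal.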