I recently ran into a case on AWS where I needed an SFTP service that was accessible from the internet but could survive up to two Availability Zone outages. Sadly, Amazon only allows certain ports to be load balanced with an Elastic Load Balancer, and since we could not use any port other than 22, we were forced to look at other solutions such as HAProxy and some commercial offerings, the latter being very expensive from both a licensing and a time perspective.
In order to survive two Availability Zone outages, any infrastructure that the SFTP process relies on also needs to remain available. Below is a list of the systems required for the whole setup to run:
- Shared Authentication system
- Shared Filesystem
- SFTP Server
- Some way to automatically route traffic to healthy hosts.
I will touch upon each of these in separate blog posts, but for this one I want to discuss the overall architecture of the system.
For centralized authentication you have a few options. One would be some sort of master/slave LDAP system that spans Availability Zones. Another might be using SQLAuthTypes with an RDS instance.
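If you take the SQLAuthTypes route, that directive comes from ProFTPD's mod_sql module, which can be pointed at a Multi-AZ RDS instance. A minimal sketch, assuming MySQL on RDS — the endpoint, credentials, and table layout below are all hypothetical:

```
# /etc/proftpd/proftpd.conf (excerpt) -- all values hypothetical
SQLBackend      mysql
SQLConnectInfo  sftp@sftp-auth.example.us-east-1.rds.amazonaws.com sftp_user sftp_pass
SQLAuthTypes    Crypt
SQLAuthenticate users groups
SQLUserInfo     users userid passwd uid gid homedir shell
```

Because every SFTP server reads from the same database, a user's credentials and home directory look identical no matter which node they land on.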
One important thing is to ensure that your SFTP servers share a common filesystem. This can be accomplished in a myriad of ways, but if you want to go the cheaper route then I recommend GlusterFS. I won't dive into the setup too much, as that could be a whole article in itself, but on AWS you would want three GlusterFS servers, one in each Availability Zone. You would configure a replica volume (data is replicated across all three GlusterFS servers), and this volume would be mounted on each SFTP server. In the event of a GlusterFS server failure, the client immediately starts reading from and writing to one of the remaining servers.
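As a rough sketch, creating and mounting such a volume might look like the following — the hostnames, brick paths, and mount point are hypothetical:

```
# From gluster-1: add the other two peers (one per AZ) and build
# a three-way replica volume across all three bricks.
gluster peer probe gluster-2.example.com
gluster peer probe gluster-3.example.com
gluster volume create sftp-data replica 3 \
    gluster-1.example.com:/bricks/sftp \
    gluster-2.example.com:/bricks/sftp \
    gluster-3.example.com:/bricks/sftp
gluster volume start sftp-data

# On each SFTP server: mount the volume as the SFTP data root.
mount -t glusterfs gluster-1.example.com:/sftp-data /srv/sftp
```

The native client only uses the named server to fetch the volume layout and then talks to all of the replicas directly, which is why losing a single GlusterFS server does not take the mount down.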
Another important thing to remember is that you want things to be transparent to the end user. You want to sync the SSH host keys in /etc/ssh across the servers. This way, if a user connects via SFTP and gets routed to a different SFTP server, they won't get any warnings from their client.
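A quick way to do the sync, assuming root SSH access between the nodes (the hostnames are placeholders):

```
# Push sftp-1's host keys to the other nodes and restart sshd so
# every server presents the same fingerprint to connecting clients.
for host in sftp-2.example.com sftp-3.example.com; do
    rsync -a /etc/ssh/ssh_host_* "root@${host}:/etc/ssh/"
    ssh "root@${host}" 'service sshd restart'   # "ssh" on Debian-based systems
done
```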
What ties all of this together is Route 53. Amazon recently introduced the ability to create health checks within its DNS service. The documentation for configuring health checks is certainly worth reading over.
First, health checks must be made against externally reachable addresses, so let's assume you have three servers, each with an Elastic IP.
We first want to configure three TCP health checks, one per Elastic IP, each checking port 22.
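With the AWS CLI installed, this could be scripted along these lines (the Elastic IPs are placeholders from the documentation range; the same thing can be done in the console):

```
for ip in 203.0.113.10 203.0.113.20 203.0.113.30; do
    aws route53 create-health-check \
        --caller-reference "sftp-${ip}" \
        --health-check-config "IPAddress=${ip},Port=22,Type=TCP,RequestInterval=30,FailureThreshold=3"
done
```

A TCP check only verifies that something is accepting connections on port 22, which is enough here since sshd answering is exactly the condition we care about.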
We then want to add weighted record sets inside of the zone file we have, one per server, each tied to its health check. Make sure that you have the TTL set to either 30 or 60 seconds; in the event an instance goes down, you want that record yanked out of rotation as quickly as possible. Depending on how you want to configure things (active/active or active/passive) you may want to adjust the Weight value. If each record has the same weight, Route 53 returns them at random in roughly equal proportion; weights are proportional, so a record weighted 20 receives about twice the traffic of one weighted 10. For active/passive, give sftp-1 a non-zero weight and set sftp-2/3 to 0: sftp-1 then receives all of the traffic until its health check fails, at which point the zero-weighted records are served instead.
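One record in the change batch might look roughly like this (the zone would get one such entry per server, each with a distinct SetIdentifier, Weight, and HealthCheckId; the IP and health check ID here are placeholders):

```
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "sftp.example.com",
        "Type": "A",
        "SetIdentifier": "sftp-1",
        "Weight": 10,
        "TTL": 30,
        "HealthCheckId": "<health-check-id-for-sftp-1>",
        "ResourceRecords": [ { "Value": "203.0.113.10" } ]
      }
    }
  ]
}
```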
As you can see, the final configuration shows three servers that will respond for sftp.example.com.
Now that we have the above configured, running dig will show the low TTL value, and if you stop the SFTP daemon on one of the servers, its record will no longer be returned once the health check fails.
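To make the failover behaviour concrete, here is a toy Python simulation (not AWS code) of how Route 53 picks among weighted, health-checked records; the names and weights are made up:

```python
import random

def pick_record(records):
    """Simulate Route 53 weighted routing: unhealthy records are
    excluded, then one record is chosen with probability proportional
    to its weight (equal split if all remaining weights are 0)."""
    healthy = [r for r in records if r["healthy"]]
    if not healthy:
        return None
    total = sum(r["weight"] for r in healthy)
    if total == 0:
        return random.choice(healthy)["name"]
    roll = random.uniform(0, total)
    for r in healthy:
        roll -= r["weight"]
        if roll <= 0:
            return r["name"]
    return healthy[-1]["name"]

records = [
    {"name": "sftp-1", "weight": 1, "healthy": True},  # active
    {"name": "sftp-2", "weight": 0, "healthy": True},  # standby
    {"name": "sftp-3", "weight": 0, "healthy": True},  # standby
]

# While sftp-1 passes its health check, it receives all the traffic.
assert all(pick_record(records) == "sftp-1" for _ in range(100))

# Once sftp-1 fails its check, traffic splits across the standbys.
records[0]["healthy"] = False
assert {pick_record(records) for _ in range(200)} == {"sftp-2", "sftp-3"}
```

The same logic explains the TTL choice above: the sooner clients re-resolve, the sooner they see the post-failure answer.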
In my next post in this series, I will discuss GlusterFS.