Building a highly available (HA) architecture is not easy. The builder needs both a macro view of how the whole system is set up and a micro view of how each server should run. Having deployed systems that reached 300,000 DAU (daily active users), I would like to share some tips on building them. The examples are written generically so they apply across cloud platforms:
1. Use an autoscaler
This is a no-brainer, isn’t it? :) You need the autoscaler to keep your systems up. Once you have imaged a server for production, set up the autoscaler to add instances whenever average CPU goes above 70%. Make sure the system is fast to spawn instances when scaling up and slow to remove them when scaling down.
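To make the "fast up, slow down" idea concrete, here is a minimal sketch of the decision an autoscaler makes on each evaluation. Only the 70% CPU threshold comes from the text above. The lower threshold, the cooldowns, the instance limits and the names `ScalePolicy` and `desired_instances` are assumptions for illustration, not any cloud provider's API.

```python
from dataclasses import dataclass


@dataclass
class ScalePolicy:
    """Hypothetical policy values; only the 70% threshold comes from the article."""
    scale_up_cpu: float = 70.0    # add capacity above this average CPU %
    scale_down_cpu: float = 40.0  # assumed: remove capacity only below this
    up_cooldown_s: int = 60       # assumed: short wait, so scaling up is fast
    down_cooldown_s: int = 600    # assumed: long wait, so scaling down is slow
    min_instances: int = 2        # assumed floor
    max_instances: int = 20       # assumed ceiling


def desired_instances(current: int, avg_cpu: float,
                      seconds_since_last_change: int,
                      policy: ScalePolicy = ScalePolicy()) -> int:
    """Return the instance count the group should move to on this evaluation."""
    if (avg_cpu > policy.scale_up_cpu
            and seconds_since_last_change >= policy.up_cooldown_s):
        return min(current + 1, policy.max_instances)
    if (avg_cpu < policy.scale_down_cpu
            and seconds_since_last_change >= policy.down_cooldown_s):
        return max(current - 1, policy.min_instances)
    return current
```

The gap between the two thresholds and the longer scale-down cooldown are what stop the group from flapping when load hovers around 70%.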
2. Use health checks
Another no-brainer. Every cloud load balancer has its own health checks. Usually a check probes a single port and automatically takes an instance out of rotation when that port is inaccessible. Make sure it is activated, which leads me to the next point.
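For a feel of what such a check does, here is a rough sketch of a single-port TCP probe. It is not how any particular load balancer implements its checks; the function name `port_is_healthy` and the 2-second timeout are assumptions.

```python
import socket


def port_is_healthy(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout or unreachable host.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Note that a probe like this only tells you about the one port it connects to, which is why the next point matters.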
3. Do not run multiple apps/services on the same server
I know that a lot of system administrators like to squeeze as many applications as possible onto the same server. For example, a webapp on port 80 and a reporting tool on port 8080, both on the same server. That is wrong, and it is a security hazard. It also weakens the health check: the load balancer watches only one port, so if the reporting tool dies the instance still looks healthy. Split applications that use different ports onto separate load balancers, each with its own health check on its own port. One server should have only one responsibility, similar to the Single Responsibility Principle in OO design. If you happen to use Nginx and php-fpm, you can split the web layer and the application layer onto two separate servers, with Nginx on port 80 and php-fpm on port 9000.
4. Use sharded memcached / redis to take the load off your database, or to store your temp files or login details / tokens
If you have the $$, this can be outsourced to third-party vendors like RedisLabs, who will take care of uptime for you.
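As an illustration of "taking the load off your database", here is a minimal cache-aside sketch: read from the cache first and only hit the database on a miss. `TTLCache` is an in-memory stand-in for memcached / redis so the example runs on its own. It, `get_with_cache`, `load_from_db` and the 300-second TTL are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for memcached / redis, for illustration only."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # expired, so treat it as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_s: float) -> None:
        self._items[key] = (time.monotonic() + ttl_s, value)


def get_with_cache(key: str, cache: TTLCache,
                   load_from_db: Callable[[str], str],
                   ttl_s: float = 300.0) -> str:
    """Return the cached value, querying the database only on a cache miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)      # the expensive call we want to avoid
        cache.set(key, value, ttl_s)   # later reads are served from the cache
    return value
```

With a real cache server the shape stays the same; only the `get` and `set` calls change to the client library you use.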
5. Use a CDN for your static content
A CDN removes a huge load from your servers, so use one if possible. If your servers accept uploaded content, save it to Google Cloud Storage or AWS S3.
6. Use uptime monitoring software
Software like upstart, supervisord and monit are great tools for keeping critical software up. They watch a process, or in monit's case a port like 80 on the server, and if it stops responding they restart the relevant software after a set period of time. Go check them out and use them if possible.
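The idea behind these tools can be sketched in a few lines: check, count consecutive failures, restart once a limit is reached. This is not how upstart, supervisord or monit are implemented. The `Watchdog` class, its `is_up` and `restart` callables and the limit of 3 failures are assumptions for illustration.

```python
from typing import Callable


class Watchdog:
    """Restart a service after a number of consecutive failed checks."""

    def __init__(self, is_up: Callable[[], bool],
                 restart: Callable[[], None],
                 max_failures: int = 3) -> None:
        self.is_up = is_up                # e.g. a port probe on port 80
        self.restart = restart            # e.g. a call to the service manager
        self.max_failures = max_failures  # assumed tolerance before restarting
        self.failures = 0

    def tick(self) -> bool:
        """Run one check; return True if a restart was triggered."""
        if self.is_up():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False
```

In practice you would call `tick()` on a timer. Waiting for a few consecutive failures avoids restarting on a single slow response.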
7. Do not use a managed database service if you expect traffic to be very heavy
AWS has RDS and Google Cloud has Cloud SQL. My take is: don’t use them if you are expecting heavy usage of 100,000 – 200,000 users per day. They are useful for light to medium usage. For heavy lifting, use PerconaDB or dbshards. It is important that nothing hinders you from scaling up IOPS or disk space in your database architecture.
8. Use frameworks as much as possible for your app
I find that frameworks help you scale. Frameworks like Ruby on Rails and Laravel let you define the database from the app layer (through migrations and an ORM), which keeps things uniform, and uniformity helps a lot in HA. In my experience, frameworks also do away with the need for composite keys and foreign keys, which makes your app more scalable.
9. Test, test, test, test your app
I can’t emphasise this enough. Have plenty of unit tests, stress tests and beta user tests before you launch. Use tools like loader.io or Apache JMeter to simulate at least 10% of your expected traffic.
10. Know when to use high-CPU and high-memory instances
Some apps / software are very CPU or memory intensive. For example, in MySQL NDB Cluster the data nodes are memory intensive because they hold the data in memory, whereas the SQL nodes lean more on CPU. Know your app / software well enough to pick the appropriate instance type without jacking costs up to sky-high levels.
11. Monitor, monitor, monitor your architecture
Use tools like Google Stackdriver or AWS CloudWatch to monitor the system you have set up. Take note of components that crash too often. Maybe there is not enough memory or CPU? Change the instance type. Is more than one piece of critical software running on the same server? Split them up. Keep monitoring for changes that are needed.
12. Strictly no sharding on the app layer
I know of people who try to overcome the limitations of RDS / Cloud SQL by sharding on the app layer. That means if the app wants user data it queries one database, and if it wants character data it queries another. DO NOT do that. It will cause sub-optimal performance. Do not shard cache data this way either. Sharding should be done strictly on the data layer.
13. Use deployment tools for uniformity and lean instances
Tools like Docker and Vagrant help ensure uniformity in your deployments. They also ensure that your VM instances carry only the bare minimum software you need, which helps the instances survive under heavy load. You may ask why we don’t just use the imaging capabilities of AWS or GCE. The problem is that when you image a server from development to staging, or from staging to production, there is a chance that some redundant software you installed for experiments comes along and turns around to bite you in production. So it is better to install from the ground up.
14. Have a means to regulate incoming users
We all know that no matter how fast instances spawn, they can never spawn faster than users log into the system. It is just like Singapore's HDB, which cannot build houses fast enough to meet a sudden surge in demand. So what do they do? They implement cooling measures like ABSD and TDSR to regulate demand. The same goes for HA architecture. You need a means to slow down users entering the system to the point where the autoscaler can cope. What I do is run a cron job that queries the number of login tokens in redis. If more than 100,000 users have logged in within an hour, I activate a server-side algorithm that turns away one out of every 4 login attempts, and those users see “Server is busy, try again later”. That reduces the number of users coming into the system without losing them.
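Here is a minimal sketch of that gate. The 100,000-per-hour threshold, the one-in-four rejection and the message come from the description above. The `LoginGate` class, its method names and the simple counter used to pick which attempt to reject are assumptions for illustration. In production the hourly count would come from the cron job that reads the login tokens in redis.

```python
BUSY_MESSAGE = "Server is busy, try again later"


class LoginGate:
    """Turn away one in every N login attempts while logins exceed a limit."""

    def __init__(self, limit_per_hour: int = 100_000, reject_every: int = 4) -> None:
        self.limit_per_hour = limit_per_hour
        self.reject_every = reject_every
        self.throttled = False
        self._seen = 0  # login attempts seen while throttled

    def update(self, logins_last_hour: int) -> None:
        """Called periodically (e.g. by a cron job) with the latest hourly count."""
        self.throttled = logins_last_hour > self.limit_per_hour

    def admit(self) -> bool:
        """Return True if this login may proceed, False if it should be turned away."""
        if not self.throttled:
            return True
        self._seen += 1
        # Every Nth attempt is rejected; the caller shows BUSY_MESSAGE to that user.
        return self._seen % self.reject_every != 0
```

A counter is used here instead of a random draw so the one-in-four ratio is exact and easy to test. A random rejection with probability 0.25 would give roughly the same effect.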
In conclusion, HA architecture takes a lot of knowledge, and I do not think I have covered all the ground yet. Feel free to drop a comment if you think I missed anything. In short, the more you prep the system, the less likely it is to crash and burn.