This is the first post in a series on architecting for the cloud.
In this post we’ll design a scalable, resilient and highly available architecture for hosting a dynamic website on top of AWS. We’ll go over some of the basics and see how easy it is to leverage AWS’s services and infrastructure to improve our website’s reliability and resiliency.
The Basics – Regions and Availability Zones
One of the most important decisions you need to make when using AWS is which region to use for hosting your services and applications. AWS regions are groups of relatively independent data centers. Remember that traffic between different regions travels a longer network path and is billed as inter-region data transfer, so you’ll want to make sure that all the AWS resources your system uses are in the same region.
Every region consists of two or more Availability Zones (AZs), each made up of one or more independent, isolated data centers. This means that for an entire region to go out of service, all its independent AZs would have to go down at the same time. Traffic between AZs within a region goes through AWS’s own high-speed, low-latency links, although data transferred between EC2 instances in different AZs is usually billed at a small per-gigabyte rate.
As a rule of thumb, choose the region that is closest to your customers. Within that region, one of the easiest ways of creating highly resilient and fault-tolerant architectures in AWS is by distributing the EC2 instances and services across multiple AZs. This way, in the unlikely event that one AZ goes down, the system will still be functional. Note that managed services like S3 or DynamoDB are highly available by default.
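To make the multi-AZ idea concrete, here is a minimal sketch in plain Python (no AWS calls) that spreads instances across zones round-robin and counts how many survive the loss of one zone. The zone names and instance count are assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import cycle


def spread_across_azs(instance_count, zones):
    """Assign instances to zones round-robin so the load is spread evenly."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_instances(placement, failed_zone):
    """Count the instances still running if one whole zone becomes unavailable."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names, used only for illustration.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = spread_across_azs(6, zones)

print(Counter(placement))
print(surviving_instances(placement, "us-east-1a"))
```

With six instances over three zones, losing any single zone still leaves four instances serving traffic, which is the property the design relies on.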
Like most websites, ours will have three main components: server-side layer, persistence layer and static assets. Let’s see how we deal with each component separately.
There are three main ways to run our custom code on AWS:
- Directly on EC2 instances
- Running it on an ECS cluster
- Combining API Gateway + Lambda
Using API Gateway + Lambda is a good choice for implementing REST API endpoints, but definitely not for running a custom website. Containerizing code to run on an ECS cluster is good for when we have to manage and run several services. In our case, since we only have one website to run, we’ll stick to the traditional way of running our code directly on EC2 instances.
Every AWS account gets a default VPC in each available region. Each default VPC contains a default subnet in every Availability Zone. When launching an EC2 instance you need to choose the VPC and the subnet where the instance will reside. Note that while a VPC can span multiple AZs within a region, a subnet in a VPC can only be in one AZ.
We’ll create an Auto Scaling group across three availability zones. The EC2 instances will be placed in the default subnet of each AZ. We’ll also create an Elastic Load Balancer (ELB), which will distribute traffic evenly among our web servers. Both the ELB and the Auto Scaling group will be configured to monitor our instances. If an instance becomes unhealthy, the ELB will stop sending traffic to it and the Auto Scaling group will replace it with a new one.
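As a rough sketch of what this configuration could look like, the function below builds the keyword arguments for boto3’s `create_auto_scaling_group` call without sending anything to AWS. It assumes an Application Load Balancer target group and an existing launch template; the names, subnet IDs, ARN and sizes are all hypothetical placeholders.

```python
def auto_scaling_group_request(name, subnet_ids, target_group_arn,
                               min_size=3, max_size=9):
    """Build keyword arguments for autoscaling.create_auto_scaling_group."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return {
        "AutoScalingGroupName": name,
        # Assumes a launch template with this (hypothetical) name already exists.
        "LaunchTemplate": {
            "LaunchTemplateName": f"{name}-template",
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # One subnet per AZ; the API expects a comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Replace instances that fail the load balancer's health check,
        # not only those that fail EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# Hypothetical identifiers, for illustration only.
request = auto_scaling_group_request(
    "website",
    ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
    "arn:aws:elasticloadbalancing:example:targetgroup/website",
)
print(request["VPCZoneIdentifier"])
```

Passing the result as `boto3.client("autoscaling").create_auto_scaling_group(**request)` would require valid credentials and real resource identifiers.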
We would like to use MySQL for our persistence layer. We can either install and configure MySQL on our own EC2 instance, or use Amazon RDS, which is AWS’s managed relational database service.
One of the advantages of using RDS is that it supports multi-AZ deployments with synchronous replication out of the box. That means that we can have a primary server in one AZ and a standby replica in another AZ. If the primary server were to become unavailable (say the AZ went down or there were problems with the underlying EC2 instance), RDS would automatically fail over to the standby. Because the database endpoint stays the same, our application only needs to reconnect, and since data has been replicated synchronously, there would be no data loss. Other features of RDS are automatic backups and software patching. Bear in mind that if you use RDS you won’t have access to the underlying EC2 servers.
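A multi-AZ MySQL instance could be requested roughly as follows. This sketch only builds the keyword arguments for boto3’s `create_db_instance` call; the instance class, storage size, retention period, user name and identifiers are assumptions picked for illustration, not recommendations.

```python
def multi_az_mysql_request(identifier, subnet_group, security_group_id):
    """Build keyword arguments for rds.create_db_instance with a standby in another AZ."""
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": "mysql",
        "DBInstanceClass": "db.t3.medium",  # assumed size
        "AllocatedStorage": 50,             # GiB, assumed
        "MasterUsername": "admin",          # assumed user name
        # Let RDS generate the password and keep it in Secrets Manager.
        "ManageMasterUserPassword": True,
        # Provision a synchronously replicated standby in a different AZ.
        "MultiAZ": True,
        # Keep automated backups for a week (assumed value).
        "BackupRetentionPeriod": 7,
        "DBSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": [security_group_id],
        "PubliclyAccessible": False,
    }


# Hypothetical identifiers, for illustration only.
db_request = multi_az_mysql_request("website-db", "private-db-subnets", "sg-db")
print(db_request["MultiAZ"])
```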
Note that the RDS instances have been placed in their own separate subnet. This is for security reasons. The web server instances have to be placed in public subnets, which are reachable from the internet. Since we don’t need our RDS instances to be publicly accessible, we can put them in their own private subnets. Only traffic from the web servers on the MySQL port (3306 by default) will be allowed.
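The "only the web servers, only the MySQL port" rule can be expressed as a security group ingress rule. The sketch below builds the arguments for boto3’s `authorize_security_group_ingress` call; both security group IDs are hypothetical, and it assumes MySQL listens on its default port.

```python
MYSQL_PORT = 3306  # MySQL's default port; assumed unchanged


def mysql_ingress_rule(db_security_group_id, web_security_group_id):
    """Build arguments for ec2.authorize_security_group_ingress.

    The rule admits TCP traffic on the MySQL port only when it originates
    from instances that belong to the web servers' security group.
    """
    return {
        "GroupId": db_security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": MYSQL_PORT,
                "ToPort": MYSQL_PORT,
                "UserIdGroupPairs": [{"GroupId": web_security_group_id}],
            }
        ],
    }


# Hypothetical security group IDs, for illustration only.
rule = mysql_ingress_rule("sg-db", "sg-web")
print(rule["IpPermissions"][0]["FromPort"])
```

Referencing the web servers’ security group rather than IP ranges means the rule keeps working as the Auto Scaling group adds and removes instances.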
Static assets are pretty straightforward. We’ll put our assets in S3 and deliver them via CloudFront. This will enable our end-users to receive our static assets with the lowest latency possible. CloudFront will serve as a proxy between our end-users and S3. User requests will go to CloudFront, which will fetch the requested files from S3 and cache them for subsequent requests.
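The fetch-once-then-cache behaviour can be illustrated with a toy model in plain Python. This is not how CloudFront is implemented: the `EdgeCache` class and the dictionary standing in for an S3 bucket are hypothetical, and real edge caches also expire entries based on TTLs and cache headers, which this sketch leaves out.

```python
class EdgeCache:
    """Toy caching proxy: goes to the origin only on a cache miss."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin
        self._cache = {}
        self.origin_requests = 0

    def get(self, path):
        if path not in self._cache:
            # Cache miss: fetch from the origin and remember the response.
            self.origin_requests += 1
            self._cache[path] = self._fetch_from_origin(path)
        return self._cache[path]


# A dictionary stands in for the S3 bucket (hypothetical content).
bucket = {"/css/site.css": b"body { margin: 0; }"}
edge = EdgeCache(bucket.__getitem__)

first = edge.get("/css/site.css")
second = edge.get("/css/site.css")
print(edge.origin_requests)
```

Both requests return the same bytes, but only the first one reaches the origin.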
What’s great about our design is that it doesn’t have any single points of failure. If any of our EC2 instances goes down, or even if an entire AZ goes down, our system will still be online. We’re also using managed services, like ELB and S3, which have been built specifically for resiliency and high availability.
Still, there are a few things that we can do to improve our original design in terms of availability. The first is to use CloudFront for delivering our website’s dynamic content as well. This will reduce the load on our servers and reduce load times for our end-users.
We can also leverage Route53 and its smart routing policies to improve our users’ experience in case of outages or downtime. Route53, which is Amazon’s DNS service, can be used to route traffic at the DNS level. We can configure Route53 to use a fail-over routing policy when resolving requests to our domain name. Route53 will perform health checks on our main website. As long as our website is working as expected, Route53 will route traffic to our website. If our website becomes unavailable, Route53 will start routing traffic to a secondary resource. This secondary resource can be a different copy of our website running in another AWS region, or static content from S3.
We’ll set up a static version of our website on S3, which will act as a fail-over in case our main website becomes unavailable. This way, in the event of downtime, users will get a friendly, less functional version of our website rather than a timeout.
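The fail-over decision can be sketched as a small state machine in plain Python. This is a simplified model rather than Route53’s actual algorithm: the `FailoverResolver` class, the endpoint names and the threshold of three consecutive failed checks are assumptions, and here a single passing check is enough to switch back to the primary.

```python
class FailoverResolver:
    """Toy fail-over policy: answer with the secondary once the primary looks down."""

    def __init__(self, primary, secondary, failure_threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_health_check(self, passed):
        # A passing check resets the counter; a failing one increments it.
        self._consecutive_failures = 0 if passed else self._consecutive_failures + 1

    def resolve(self):
        if self._consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary


# Hypothetical endpoints, for illustration only.
resolver = FailoverResolver("elb.example.com", "static-site.s3.example.com")
for _ in range(3):
    resolver.record_health_check(passed=False)
print(resolver.resolve())
```

Requiring several consecutive failures before switching avoids flapping between the two targets on a single slow response.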
Handling large read volumes
Now, let’s assume our website has been extremely successful and that due to the increase in traffic, the volume of reads to the database is becoming a problem. In this case, there are three approaches we can use:
- Scaling up: this means running our db servers on larger, more powerful EC2 instances. This approach requires no changes to our application. However, there is a limit to how big instances can be and how much network throughput each instance can handle. We can use this approach as a quick fix while we come up with a more long-term solution.
- Scaling out: this means adding read replicas to our databases. Even though RDS synchronously replicates our database to secondary instances in different AZs, our application cannot connect directly to any of these replicas. If we want to offload the read operations of the primary db, we need to tell RDS to create specialized read replicas.
- Caching db results: we can use an in-memory data store for caching db queries. Amazon ElastiCache supports Memcached and Redis, and it’s well suited to this use case.
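The caching approach is commonly implemented as a cache-aside pattern: look in the cache first, and only query the database on a miss. The sketch below uses an in-process dictionary where a real deployment would use Memcached or Redis; the `QueryCache` class, the fake database function, the query text and the 60-second TTL are all hypothetical.

```python
import time


class QueryCache:
    """Cache-aside wrapper: serve cached rows until they expire, then re-query."""

    def __init__(self, run_query, ttl_seconds=60, clock=time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # sql -> (expiry time, rows)

    def get(self, sql):
        now = self._clock()
        entry = self._entries.get(sql)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, the database is not touched
        rows = self._run_query(sql)  # cache miss or expired entry
        self._entries[sql] = (now + self._ttl, rows)
        return rows


db_calls = []


def fake_database(sql):
    """Stand-in for a real MySQL query; records every call it receives."""
    db_calls.append(sql)
    return [("widget", 3)]


cache = QueryCache(fake_database, ttl_seconds=60)
cache.get("SELECT name, stock FROM products")
cache.get("SELECT name, stock FROM products")
print(len(db_calls))
```

The second lookup is served from memory, so the database sees a single query. The TTL bounds how stale a cached result can get, which is the main trade-off to tune.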
Let’s go over the features of our design that contribute to the overall resiliency and availability of our website:
- Web servers: we have multiple EC2 instances running in multiple AZs. They are also part of an Auto Scaling group, which ensures that a preset minimum number of servers is always running.
- Database: the RDS database instances also run in multiple AZs. If the primary server goes down, RDS will automatically fail over to a standby replica.
- Caching: the website is served via CloudFront, which reduces the load on our web and database servers.
- Failover DNS routing: Route53 monitors our main website. If it goes down, users are routed to an alternative static version of the site.
- Caching db queries: this not only reduces request times for our users, but also protects our database against sudden spikes in traffic.
If you want to learn more about designing resilient system architectures on AWS, I recommend reading the AWS Well-Architected Framework whitepaper. If you have ideas on how to improve what we’ve built in this post, please let me know.