An Introduction to the AWS Well-Architected Framework
AWS publishes the Well-Architected Framework, which contains guidelines and signposts for putting
together efficient, cost-effective, robust infrastructure for greenfield
implementations, as well as for evaluating existing cloud environments (via the Well-Architected Tool). Within this framework, AWS also provides domain-specific guidance,
such as industry lenses and whitepapers, for building well-architected workloads in areas like
gaming, SAP, and streaming media. These focus on the nuances that need to be kept
in mind while building infrastructure for those specific use cases.
AWS's position is that applications built on infrastructure adhering to
the principles defined in the Well-Architected Framework will stand up to scrutiny against
multiple industry-standard benchmarks.
The AWS Well-Architected Framework is built on six pillars identified as crucial: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.
Let's look in detail at what each of these means.
Operational Excellence
One should
always aim to improve the overall operational health of the system. But in
order to improve, one must be able to gauge where the system currently stands; in other
words, one must be able to measure. It is not possible to measure without
defining KPIs, and KPIs need to be defined based on business outcomes and customer
outcomes, which in turn are defined by business priorities. So, working backwards:
to improve overall operational health, it is very important to first understand
business priorities, and then go from there.
These
priorities should be the driver for the environment setup. Obviously, business
priorities change, and the environment should be flexible enough to adapt.
Identify the touch points within the environment that should be amenable
to change as business priorities change.
Let us see
an example. A B2C website may have an extremely fast backend RDS system linked
to it, since the business priority during normal working hours is speed of
response to user clicks. But during off-peak hours, when the computational
load of the day's operations on the RDS instance becomes high, the business priority may
no longer be the same. The hosting cloud environment must be flexible enough
to adapt when business priorities change like this.
With the
cloud, it is relatively easy to collect statistics on how architectural decisions
affect workload behaviour. Unlike in a traditional data center, this
data makes it possible to change the environment to improve workload
performance. In other words, it is important to have levers built into
the system that can be tweaked to improve operational efficiency.
It is
important to anticipate failures. Think through what-if scenarios and get a good
understanding of what the impact on the business would be in each of them.
Also come up with response strategies and test them to ensure they will be
effective in real-life situations.
Let us try
to understand this with an example. It is normal to see workload volumes
increase during certain predetermined hours or days. After configuring your
system to scale up when the workload increases, simulate a failure scenario in which
auto scaling does not happen in response to the increased workload. What would
the response of the admin personnel be in this situation? There needs to be a clearly
chalked-out SOP for when a situation like this is encountered, and it should be
tested to confirm that it works.
To the
extent possible, always automate the response to an event. AWS provides
multiple ways to do this. CloudWatch is a service that lends itself admirably
to these use cases; CloudWatch Events rules (now EventBridge) and CloudWatch alarms are examples
of what can be leveraged here.
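As a minimal sketch of this idea (one possible approach, not the only one), the following Python/boto3 snippet creates an EventBridge (CloudWatch Events) rule that reacts to EC2 instances entering the stopped state and routes the event to a remediation Lambda function. The rule name, Lambda ARN, and account details are hypothetical placeholders; in practice EventBridge would also need permission to invoke the function.

```python
import boto3

events = boto3.client("events")

# Rule that matches EC2 "stopped" state-change events.
events.put_rule(
    Name="ec2-stopped-handler",  # hypothetical rule name
    EventPattern=(
        '{"source": ["aws.ec2"],'
        ' "detail-type": ["EC2 Instance State-change Notification"],'
        ' "detail": {"state": ["stopped"]}}'
    ),
    State="ENABLED",
)

# Route matching events to a (hypothetical) remediation Lambda function.
events.put_targets(
    Rule="ec2-stopped-handler",
    Targets=[{
        "Id": "invoke-remediation-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:handle-instance-stop",
    }],
)
```

The same pattern works for alarms: a CloudWatch alarm can notify an SNS topic or trigger an automated action instead of waiting for a human to notice the event.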
Security
Although
this term sounds self-explanatory, this is probably one of the most
challenging pillars to implement, especially in industries where data is
sensitive.
Often, security issues happen due to a lack of understanding and implementation of the most basic best practices. Before attempting to implement complex and fancy security procedures, ensure that the more fundamental, common-sense security best practices are taken care of. This alone can potentially address most concerns.
One of the top things to keep in mind while implementing security for data, systems and assets is that security needs to be implemented at multiple levels and layers. For your use case(s), figure out which levels and layers these are.
Both data at rest and data in motion need to be protected. Data in motion can be flowing between AWS touchpoints, or from on-prem to the cloud and vice versa. Encrypting data in motion can be a challenging task; AWS provides multiple options for it, and the relevant best practices are covered under "Protecting data in transit" below.
Principals need to be given permissions based on the principle of least privilege, and separation of duties needs to be enforced (a minimal policy sketch follows these points).
Reduce or eliminate reliance on long-term static credentials.
The security environment must support "time travel", i.e. retain a complete audit trail, so that it is possible to pinpoint which principal executed a particular command.
It is very important to have an incident management system in place. In spite of good security mechanisms, there will be incidents. Teams must be able to isolate systems under attack at very short notice.
Automate the incident response
Run simulations of security breaches and have the appropriate team detect breaches in the minimum time possible.
The source of threats to the environment may vary depending on the industry the customer operates in. So, it is important to be extremely conversant with the potential threats unique to the industry and to tailor a security strategy that addresses them.
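To make the least-privilege point above concrete, here is a minimal sketch using Python/boto3 that creates a customer-managed IAM policy allowing read-only access to a single S3 bucket and nothing else. The bucket and policy names are hypothetical placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Allow reading from one specific bucket only -- nothing more.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports-bucket",   # hypothetical bucket
                "arn:aws:s3:::example-reports-bucket/*",
            ],
        }
    ],
}

iam.create_policy(
    PolicyName="reports-read-only",                       # hypothetical policy name
    PolicyDocument=json.dumps(read_only_policy),
)
```

The policy can then be attached to a role that is assumed only by the people or services whose duties require it, which also helps enforce separation of duties.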
Some of the
important services that can be leveraged to implement security across different
areas are as follows:
Areas                     | Key Services
Data protection           | EBS, S3, RDS, KMS, CloudHSM
Privilege management      | IAM, MFA tokens, permissions, roles
Infrastructure protection | VPC, WAF, Shield, CloudFront, Route 53
Detective controls        | CloudTrail, Config, CloudWatch
Protecting data in transit:
· Use utilities like AWS PrivateLink
to create a secure and private network connection between AWS VPCs or on-prem
installations and AWS-based services. With PrivateLink, traffic stays on the
Amazon backbone and therefore does not traverse the public internet.
· Use tools like GuardDuty to
automatically detect attempts to move data outside of defined boundaries
· Encryption in transit can be
enforced in AWS. AWS services provide HTTPS endpoints using TLS,
providing encryption in transit when communicating with the
AWS APIs. It is also possible to use VPN connectivity into a VPC from an external
network so that data is encrypted in transit (a minimal sketch of enforcing TLS for S3 follows this list).
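As a minimal sketch of enforcing encryption in transit (one of several possible approaches), the following Python/boto3 snippet attaches a bucket policy to a hypothetical S3 bucket that denies any request not made over TLS, forcing clients to use HTTPS. The bucket name is a placeholder.

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny every S3 request to this bucket that does not arrive over TLS.
deny_insecure_transport = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-data-bucket",       # hypothetical bucket
                "arn:aws:s3:::example-data-bucket/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(
    Bucket="example-data-bucket",
    Policy=json.dumps(deny_insecure_transport),
)
```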
Reliability
Reliability is the ability of a workload to perform its
intended function correctly and consistently when it’s expected to, throughout
its lifecycle
· No matter how robust an environment
is, it can buckle under unexpected load. A reliable system must be able to
recover from such failures automatically. Taking this to the next level, the system
should have the intelligence to anticipate failure and automate the appropriate
response.
· The required availability of an
application is dictated by its functionality. Applications with critical
functionality need to be made highly available across multiple AZs, whereas
other applications may be made highly available within a single AZ. This applies
equally to front-end applications and to backend components such as
databases.
· Be very clear about RPO and RTO
requirements and use this information to build in reliability.
· In addition to verifying that the workload
works in best-case scenarios, conduct negative testing to simulate
scenarios that would cause the workload to fail. This gives you the opportunity to test
recovery procedures.
· Where possible, replace single large
resources with multiple smaller resources. More importantly, ensure they don’t
share a single point of failure
· Simulate failures and define SOPs so
that applications can be brought back online quickly when failures occur.
· Having a good monitoring system is
essential for a reliable system
· Use logs and metrics wisely. Very often,
logs and metrics tell a story about how your environment is being utilized and
under what load, so carry out periodic analysis on them and take appropriate
action.
· Monitoring + Alerting + Automation = Self-Healing.
For example, CloudWatch and Auto Scaling can be used together to
recover from failed EC2 instances (a minimal sketch of automatic instance recovery follows this list).
· It is relatively easy to automate
the system so that it reacts to certain trigger events. Leverage
this to get the environment to correct itself when faced with an imminent failure.
· Backups. Depending on the
criticality of the data, the RPO and RTO requirements, and how the data is used
by its applications, come up with an appropriate backup
strategy, including the frequency at which the data needs to be backed up. Not
only does data need to be backed up, it is equally important to ensure it can be
restored, to keep the time restoration takes to a minimum, and to verify that
the restored data is in a usable state.
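As a minimal sketch of the self-healing idea mentioned above (one possible approach among several), the following Python/boto3 snippet creates a CloudWatch alarm on a hypothetical EC2 instance that triggers the built-in EC2 recover action when the system status check fails, so the instance is automatically restarted on healthy hardware. Instance ID, region and account details are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="auto-recover-web-01",                       # hypothetical alarm name
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",                 # underlying host health check
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,                                   # two consecutive failed checks
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    # Built-in recover action: AWS moves the instance to healthy hardware.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],
)
```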
Performance Efficiency
The Performance Efficiency pillar includes the ability to
use computing resources efficiently to meet system requirements, and to
maintain that efficiency as demand changes and technologies evolve. As with the other pillars, there are a few
things to keep in mind here:
· XaaS. These days many technologies
can be consumed as a service, which means we do not have to install and administer them
ourselves. The service provider will be an expert in ensuring the service runs
optimally; leverage that to the extent possible.
· Go serverless. As a corollary to the
above principle, it makes sense to leverage serverless compute and serverless
storage to avoid provisioning them ourselves. However, there may be a
trade-off on cost, and performance efficiency needs to be balanced against
it.
· Having made these choices, it is
important to review them periodically and check whether there are variances from
expected performance. If there are, they need to be addressed.
· Usually, there will be multiple
services for similar use cases. Understand which service(s) best fits your
specific use case. Consult AWS if needed
· When trying to improve performance
efficiency, work against a specific target and benchmark existing efficiency
for a given set of parameters. This lets you measure improvements
objectively (a minimal sketch of checking a metric against a target follows this list).
· Work against permissible cost for
your set of requirements
· Analyze metrics and access patterns, and
choose storage and compute options based on them.
· Network parameters usually have
a big impact on performance and efficiency. Study them closely and make
appropriate configuration changes.
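As a minimal sketch of benchmarking against a specific target (the load balancer name and the 500 ms target are assumptions for illustration), the following Python/boto3 snippet pulls the p95 response time of a hypothetical Application Load Balancer from CloudWatch and flags the hours that missed the agreed benchmark.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

TARGET_P95_SECONDS = 0.5   # assumed benchmark for this workload
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    # Placeholder load balancer dimension value.
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/example-alb/0123456789abcdef"}],
    StartTime=now - datetime.timedelta(days=1),
    EndTime=now,
    Period=3600,                       # one data point per hour
    ExtendedStatistics=["p95"],
)

# Compare every hourly p95 value against the agreed target.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    p95 = point["ExtendedStatistics"]["p95"]
    status = "OK" if p95 <= TARGET_P95_SECONDS else "ABOVE TARGET"
    print(f'{point["Timestamp"]:%Y-%m-%d %H:%M}  p95={p95:.3f}s  {status}')
```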
Cost Optimization
Cost optimization
is the process of minimizing costs through continual refinement and
improvement, without compromising on business outcomes.
· Use tools like AWS Budgets wisely
and extensively to stay within limits. However, monitor costs proactively and
don't depend on notifications alone (a minimal Budgets sketch follows this list).
· Keep Finance and Technology teams in
the loop on all decisions in the cloud journey
· Aim to innovate without overspending
· Create groups and roles that
control who can commission and decommission instances and resources. This is a
good way to keep rising costs in check.
· Have policies in place which preempt
unnecessary resource use.
· Know exactly where costs are being
incurred and focus on those areas when implementing cost controls. 80% of the
costs may be incurred by 20% of the services; concentrate on those services first.
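As a minimal sketch of the AWS Budgets point above (account ID, amount, and e-mail address are placeholders), the following Python/boto3 snippet creates a monthly cost budget that e-mails the finance team once actual spend crosses 80% of the limit. Treat it as a safety net, not a substitute for proactive cost reviews.

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",                              # placeholder account ID
    Budget={
        "BudgetName": "monthly-cost-budget",               # hypothetical budget name
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},  # assumed monthly limit
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,                         # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finance-team@example.com"}
            ],
        }
    ],
)
```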
Sustainability
This pillar was added recently,
and it focuses on the long-term environmental, economic and societal
impact of your business activities on the AWS cloud.
It is important to understand
that all workloads leave a carbon footprint. Since all unnecessary storage and
compute leads to wasted energy, one of the core guiding principles this
pillar advocates is to eliminate redundant storage and compute, and to do
everything that supports this cause.
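One small, concrete way to act on this (a sketch under assumed names; the bucket and retention periods are placeholders) is an S3 lifecycle rule that moves ageing log objects to a colder storage class and deletes them once they are no longer needed, so redundant data does not keep consuming storage.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",                           # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": ""},                  # apply to every object
                "Status": "Enabled",
                # Move objects to Glacier after 30 days, delete after a year.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```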
Let us look
at a couple of instances where applying Well-Architected Framework
principles led to benefits for customers:
· Consulting solutions firm Burns
& McDonnell saved 30% of its overall AWS bill in the first week after
taking action based on the Well-Architected Framework guiding principles.
· BMC Software used the principles of the
Well-Architected Framework and saw the following benefits:
  · Was able to start delivering immediate value to customers
  · Expanded offerings to new companies and departments
  · Received positive customer feedback within its first 4 months
  · Exceeded internal business objectives
The Well-Architected Framework is a
good place to start for teams looking to optimize their AWS-based solutions
along different dimensions. I encourage delivery teams to apply this framework
and aim to build robust solutions.