Tuesday, July 26, 2022

An Introduction to the AWS Well Architected Framework

    AWS publishes the Well Architected Framework, a set of guidelines and signposts for putting together efficient, cost-effective, robust infrastructure for greenfield implementations, as well as for evaluating existing cloud environments (via the Well Architected Tool). Within this framework, AWS also provides domain-specific manuals and whitepapers for building well architected workloads in industries like gaming, SAP, streaming media etc. These focus on nuances that need to be kept in mind while building infrastructure specific to these use cases.

    AWS's stated position is that applications built on infrastructure adhering to the principles defined in the Well Architected Framework will stand up to scrutiny against multiple industry-standard benchmarks.

 

The AWS Well Architected Framework is built on six pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization
  • Sustainability

Let's look in detail at what each of these means.

    Operational Excellence

    Operational excellence can be thought of as the ability to create and operate an environment hosting applications and workloads with enough levers built in that can be tweaked for optimization. To limit human error, AWS recommends using code for setting up, running and automating the environment. The best practice is to keep refining operating procedures frequently and arrive at the best possible combination of the provisioned, tweakable parameters for a given set of requirements. It is also advisable to make small, incremental changes to these parameters which can be rolled back if the desired results are not achieved, keeping any adverse impact on customers to a bare minimum.
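
    To make "operations as code" concrete, infrastructure can be declared in a template and provisioned programmatically rather than by hand. The sketch below is a minimal, hypothetical example using boto3 and CloudFormation; the stack name and bucket are placeholders, not something prescribed by AWS or the original article.

        import json
        import boto3

        # Hypothetical example: declare a versioned S3 bucket as code and provision it
        # through CloudFormation, so the environment can be recreated or rolled back.
        template = {
            "AWSTemplateFormatVersion": "2010-09-09",
            "Resources": {
                "ReportsBucket": {
                    "Type": "AWS::S3::Bucket",
                    "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
                }
            },
        }

        cloudformation = boto3.client("cloudformation")
        cloudformation.create_stack(
            StackName="reports-infra",          # placeholder stack name
            TemplateBody=json.dumps(template),
        )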

     

    One should always aim to improve the overall operational health of the system. But in order to improve, one should be able to gauge where the system currently stands; in other words, one should be able to measure. It is not possible to measure without defining KPIs. KPIs need to be defined based on business outcomes and customer outcomes, and those outcomes are in turn defined based on business priorities. So, working backwards, in order to improve overall operational health it is very important to understand business priorities first, and then go from there.

    These priorities should be the driver for the environment setup. Obviously, business priorities change, and the environment should be flexible enough to adapt. Identify touch points within the environment ecosystem which should be amenable to change as business priorities change.

    Let us see an example. A B2C website may be backed by an extremely fast RDS database because the business priority during normal working hours is speed of response to user clicks. But during off-peak hours, when the computational load of the day's batch operations on the RDS instance becomes high, the business priority may no longer be the same. The hosting cloud environment must be flexible enough to adapt itself when such business priorities change.

    With the cloud, unlike with traditional data centers, it is relatively easy to collect statistics on how architectural decisions affect workload behaviour, and with this data it is possible to change the environment to improve workload performance. In other words, it is important to have levers built into the system which can be tweaked to improve operational efficiency.

    It is important to anticipate failures. Think of what-if scenarios and get a good understanding of what the impact on the business will be in each of these scenarios. Also come up with response strategies and test them to ensure they will be effective in real-life situations.

    Let us try to understand this with an example. It is normal to see workload volumes increase during certain predetermined hours or days. After configuring your system to scale up when workload increases, simulate a failure scenario in which auto scaling does not happen in response to the increased workload. What would the response of admin personnel be in this situation? There needs to be a clearly chalked-out SOP for when a situation like this is encountered, and it should be tested to confirm that it works.
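
    As one hedged illustration of such a game-day check, the team could verify programmatically whether the Auto Scaling group actually scaled out, and fall back to the SOP if it did not. The group name and SNS topic below are hypothetical placeholders; this is a sketch of the idea, not a prescribed procedure.

        import boto3

        ASG_NAME = "web-asg"                                            # hypothetical Auto Scaling group
        ALERT_TOPIC = "arn:aws:sns:us-east-1:111122223333:ops-alerts"   # placeholder topic

        autoscaling = boto3.client("autoscaling")
        sns = boto3.client("sns")

        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]

        in_service = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]

        # If the group has not reached its desired capacity, fall back to the SOP:
        # page the on-call engineer so a human can intervene.
        if len(in_service) < group["DesiredCapacity"]:
            sns.publish(
                TopicArn=ALERT_TOPIC,
                Subject=f"{ASG_NAME} is below desired capacity",
                Message=(
                    f"Desired {group['DesiredCapacity']} instances, "
                    f"but only {len(in_service)} are InService. Follow the scaling SOP."
                ),
            )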

     

    To the extent possible, always automate the response to an event. AWS provides multiple ways to do this; CloudWatch is a service which lends itself admirably to these use cases. CloudWatch Events rules and CloudWatch alarms are two examples that can be leveraged here.
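
    For instance, a CloudWatch alarm can notify an SNS topic when CPU stays high, and a CloudWatch Events rule can invoke a Lambda function when an instance stops. The sketch below assumes a hypothetical instance ID, SNS topic and Lambda ARN; it only illustrates the pattern.

        import json
        import boto3

        INSTANCE_ID = "i-0123456789abcdef0"                                # placeholder
        ALERT_TOPIC = "arn:aws:sns:us-east-1:111122223333:ops-alerts"      # placeholder
        REMEDIATION_LAMBDA = "arn:aws:lambda:us-east-1:111122223333:function:restart-app"  # placeholder

        cloudwatch = boto3.client("cloudwatch")
        events = boto3.client("events")

        # Alarm: notify operations if average CPU stays above 80% for 10 minutes.
        cloudwatch.put_metric_alarm(
            AlarmName="high-cpu",
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=2,
            Threshold=80.0,
            ComparisonOperator="GreaterThanThreshold",
            AlarmActions=[ALERT_TOPIC],
        )

        # Event rule: when an instance stops unexpectedly, invoke a remediation Lambda.
        # (The Lambda also needs a resource policy allowing events.amazonaws.com to invoke it.)
        events.put_rule(
            Name="instance-stopped",
            EventPattern=json.dumps({
                "source": ["aws.ec2"],
                "detail-type": ["EC2 Instance State-change Notification"],
                "detail": {"state": ["stopped"]},
            }),
        )
        events.put_targets(
            Rule="instance-stopped",
            Targets=[{"Id": "remediate", "Arn": REMEDIATION_LAMBDA}],
        )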

     

    Security

    Although the term sounds self-explanatory, this is probably one of the most challenging pillars to implement, especially in industries where data is sensitive.

    Often, security issues arise from a lack of understanding and implementation of the most basic best practices. Before attempting to implement complex and fancy security procedures, ensure that the more fundamental, common-sense security best practices are taken care of. This alone can potentially address most concerns.

    One of the top things to keep in mind while implementing security for data, systems and assets is that security needs to be implemented at multiple levels and layers. For your use case(s), figure out what these levels and layers are.

    Both data at rest and data in motion need to be protected. Data in motion can either be flowing between AWS touchpoints or travelling from on-prem to the cloud or vice versa. Encrypting data in motion can be a challenging task; AWS provides multiple options for it, and the corresponding best practices are covered under "Protecting data in transit" below.

    Principals need to be given permissions based on the least-privilege principle, and separation of duties needs to be enforced.
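
    To make least privilege concrete, a policy should grant only the specific actions a principal needs on specific resources. The bucket and policy names below are hypothetical; this is a sketch of the pattern, not a recommended production policy.

        import json
        import boto3

        iam = boto3.client("iam")

        # Hypothetical least-privilege policy: read-only access to a single bucket prefix.
        policy_document = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": ["s3:GetObject"],
                    "Resource": "arn:aws:s3:::example-reports-bucket/reports/*",
                }
            ],
        }

        iam.create_policy(
            PolicyName="ReportsReadOnly",               # placeholder name
            PolicyDocument=json.dumps(policy_document),
        )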

    Reduce or eliminate reliance on long-term static credentials; prefer short-lived credentials obtained by assuming roles.
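
    As an illustration, short-lived credentials can be obtained at runtime with AWS STS instead of embedding long-term access keys. The role ARN below is a placeholder; the sketch only shows the mechanism.

        import boto3

        sts = boto3.client("sts")

        # Assume a role to obtain temporary credentials (valid here for 15 minutes)
        # instead of relying on long-term static access keys.
        creds = sts.assume_role(
            RoleArn="arn:aws:iam::111122223333:role/ReportsReadOnlyRole",  # placeholder
            RoleSessionName="report-job",
            DurationSeconds=900,
        )["Credentials"]

        session = boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
        s3 = session.client("s3")   # this client's credentials expire automatically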

    The security environment must support looking back in time, so that the principal who executed a particular command can be pinpointed; CloudTrail (listed under detective controls below) is the natural fit for this.
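
    For example, CloudTrail event history can be queried to find who issued a particular API call. The event name and time window below are hypothetical; the sketch only shows the lookup pattern.

        from datetime import datetime, timedelta
        import boto3

        cloudtrail = boto3.client("cloudtrail")

        # Who deleted an RDS instance in the last 24 hours? (hypothetical query)
        response = cloudtrail.lookup_events(
            LookupAttributes=[
                {"AttributeKey": "EventName", "AttributeValue": "DeleteDBInstance"}
            ],
            StartTime=datetime.utcnow() - timedelta(days=1),
            EndTime=datetime.utcnow(),
            MaxResults=50,
        )

        for event in response["Events"]:
            print(event["EventTime"], event.get("Username"), event["EventName"])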

    It is very important to have an incident management system in place. In spite of good security mechanisms, there will be incidents. Teams must be able to isolate systems under attack at very short notice.

    Automate the incident response

    Run simulations of security breaches and have the appropriate team detect breaches in the minimum time possible. 
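
    One common automated response is to isolate a compromised instance by moving it into a restrictive "quarantine" security group while evidence is preserved for forensics. The instance and security-group IDs below are hypothetical placeholders, and the sketch is only an outline of the idea.

        import boto3

        ec2 = boto3.client("ec2")

        INSTANCE_ID = "i-0123456789abcdef0"     # placeholder: instance under attack
        QUARANTINE_SG = "sg-0aa11bb22cc33dd44"  # placeholder: SG with no inbound/outbound rules

        # Cut the instance off from the network by replacing its security groups.
        ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, Groups=[QUARANTINE_SG])

        # Snapshot its volumes so forensic analysis can happen offline.
        volumes = ec2.describe_volumes(
            Filters=[{"Name": "attachment.instance-id", "Values": [INSTANCE_ID]}]
        )["Volumes"]
        for volume in volumes:
            ec2.create_snapshot(
                VolumeId=volume["VolumeId"],
                Description=f"Forensic snapshot of {INSTANCE_ID}",
            )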

    The sources of threats to the environment vary depending on the industry the customer operates in. So, it is important to be extremely conversant with the threats unique to that industry and tailor a security strategy that addresses them.


    Some of the important services that can be leveraged to implement security across different areas are as follows:

    Areas                        Key Services
    Data protection              EBS, S3, RDS, KMS, CloudHSM
    Privilege management         IAM, MFA tokens, permissions, roles
    Infrastructure protection    VPC, WAF, Shield, CloudFront, Route 53
    Detective controls           CloudTrail, Config, CloudWatch

    Protecting data in transit:

    ·       Use services like AWS PrivateLink to create a secure, private network connection from your VPCs or on-prem installations to AWS-based services. With PrivateLink, traffic stays on the Amazon backbone and never traverses the public internet

    ·       Use tools like GuardDuty to automatically detect attempts to move data outside of defined boundaries

    ·       Encryption in transit can be enforced in AWS. AWS services provide HTTPS endpoints using TLS for communication, providing encryption in transit when communicating with the AWS APIs. It is also possible to use VPN connectivity into a VPC from an external network to encrypt data in transit (see the sketch below)
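
    One way to enforce encryption in transit for S3, for example, is a bucket policy that denies any request not made over TLS. The bucket name below is a placeholder; this is a sketch of the common aws:SecureTransport pattern, not a complete policy.

        import json
        import boto3

        BUCKET = "example-reports-bucket"   # placeholder bucket name

        policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "DenyInsecureTransport",
                    "Effect": "Deny",
                    "Principal": "*",
                    "Action": "s3:*",
                    "Resource": [
                        f"arn:aws:s3:::{BUCKET}",
                        f"arn:aws:s3:::{BUCKET}/*",
                    ],
                    # Deny any request that does not arrive over HTTPS/TLS.
                    "Condition": {"Bool": {"aws:SecureTransport": "false"}},
                }
            ],
        }

        boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))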

     

    Reliability

     

    Reliability is the ability of a workload to perform its intended function correctly and consistently when it’s expected to, throughout its lifecycle

    ·       No matter how robust an environment is, it can buckle under unexpected load. When this happens, a reliable system should be able to recover automatically. Taking this to the next level, the system should have the intelligence to anticipate failure and automate the appropriate response

    ·       The availability an application needs is dictated by its functionality. Applications with critical functionality may need to be made highly available across multiple AZs, while others may only need high availability within a single AZ. This applies equally to front-end applications and to backend components such as databases

    ·       Be very clear about your RPO (recovery point objective) and RTO (recovery time objective) needs and use this information to build in reliability

    ·       In addition to verifying that the workload works in best-case scenarios, conduct negative testing to simulate scenarios that would cause the workload to fail. This gives you the opportunity to test recovery procedures

    ·       Where possible, replace single large resources with multiple smaller resources. More importantly, ensure they don’t share a single point of failure

    ·       Simulate failures and define SOPs in order for applications to be brought back live in case of failures

    ·       Having a good monitoring system is essential for a reliable system

    ·       Use logs and metrics wisely. Very often, logs and metrics tell a story on how your environment is being utilized and under what load. So, carry out periodic analysis on them and take appropriate action

    ·       Monitoring + Alerting + Automation = Self-healing. For example, CloudWatch and Auto Scaling can be used together to recover from failed EC2 instances

    ·       It is relatively easy to automate the system in such a fashion that it reacts to certain trigger events. Leverage this to get the environment to auto correct itself when faced with an imminent failure

    ·       Backups. Depending on the criticality of the data, RPO and RTO requirements, and how the data is used by applications, come up with an appropriate backup strategy, including the frequency at which data needs to be backed up. It is not enough to back data up; it is equally important to ensure it can be restored, to keep the time a restore takes down to a minimum, and to confirm that, once restored, the data is in a usable state (see the sketch after this list)
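
    As a hedged illustration, AWS Backup can encode such a strategy as a backup plan with a schedule and retention lifecycle, plus a selection that assigns resources to it. The plan name, schedule, role ARN and tag below are hypothetical placeholders; restore testing still has to be done separately.

        import boto3

        backup = boto3.client("backup")

        # Hypothetical plan: daily backups at 03:00 UTC, retained for 35 days.
        plan = backup.create_backup_plan(
            BackupPlan={
                "BackupPlanName": "daily-critical-data",
                "Rules": [
                    {
                        "RuleName": "daily",
                        "TargetBackupVaultName": "Default",
                        "ScheduleExpression": "cron(0 3 * * ? *)",
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        )

        # Assign resources by tag so anything marked backup=daily is included in the plan.
        backup.create_backup_selection(
            BackupPlanId=plan["BackupPlanId"],
            BackupSelection={
                "SelectionName": "tagged-resources",
                "IamRoleArn": "arn:aws:iam::111122223333:role/BackupServiceRole",  # placeholder
                "ListOfTags": [
                    {"ConditionType": "STRINGEQUALS", "ConditionKey": "backup", "ConditionValue": "daily"}
                ],
            },
        )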

    Performance Efficiency

                 

    The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. As with the other pillars, there are a few things to keep in mind here

    ·       XaaS. These days many technologies can be consumed as a service, which means we do not have to install and administer them ourselves. The service provider is the expert in ensuring optimal efficiency of the service; leverage that to the extent possible

    ·       Go serverless. As a corollary to the above principle, it makes sense to leverage serverless compute and serverless storage in order to avoid provisioning them ourselves. However, there may be a tradeoff on cost and one would need to balance performance efficiency against it

    ·       Having made these choices, it is important to review them periodically and check whether there are variances from expected performance. If there are, they need to be addressed

    ·       Usually, there will be multiple services for similar use cases. Understand which service(s) best fits your specific use case. Consult AWS if needed

    ·       When trying to improve performance efficiency, work against a specific target and benchmark existing efficiency for a given set of parameters. This will let you measure improvements objectively

    ·       Work against permissible cost for your set of requirements

    ·       Analyze metrics and access patterns, and choose storage and compute options based on them (see the sketch after this list)

    ·       Network parameters usually will have a big impact on performance and efficiency. Study this closely and make appropriate configuration changes
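
    As one sketch of metric-driven right-sizing, average CPU utilization over the past two weeks can be pulled from CloudWatch, and instances that sit mostly idle can be flagged as candidates for a smaller instance type. The instance ID and the 20% threshold below are hypothetical assumptions, not AWS guidance.

        from datetime import datetime, timedelta
        import boto3

        INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance to evaluate

        cloudwatch = boto3.client("cloudwatch")
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
            StartTime=datetime.utcnow() - timedelta(days=14),
            EndTime=datetime.utcnow(),
            Period=3600,                      # hourly datapoints
            Statistics=["Average"],
        )

        datapoints = stats["Datapoints"]
        if datapoints:
            avg_cpu = sum(p["Average"] for p in datapoints) / len(datapoints)
            if avg_cpu < 20:                  # hypothetical right-sizing threshold
                print(f"{INSTANCE_ID}: average CPU {avg_cpu:.1f}% - consider a smaller instance type")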

     

    Cost Optimization

    Cost optimization is the continual process of refinement and improvement aimed at minimizing costs without compromising on business outcomes

     

    ·       Use tools like AWS Budgets wisely and extensively to stay within limits. However, monitor costs proactively and don’t depend just on notifications

    ·       Keep Finance and Technology teams in the loop on all decisions in the cloud journey

    ·       Aim to innovate without overspending

    ·       Create groups and roles which control who can commission and decommission instances and resources. This is a good way to keep rising costs in check

    ·       Have policies in place which preempt unnecessary resource use.

    ·       Know exactly where costs are being incurred and focus on those areas when implementing cost controls. Often 80% of the cost is incurred by 20% of the services; focus on those services first (see the sketch below)
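
    For example, Cost Explorer can break monthly spend down by service so the handful of services driving most of the bill stand out. The time period below is a placeholder; this is only a sketch of the query.

        import boto3

        ce = boto3.client("ce")   # Cost Explorer

        response = ce.get_cost_and_usage(
            TimePeriod={"Start": "2022-06-01", "End": "2022-07-01"},   # placeholder month
            Granularity="MONTHLY",
            Metrics=["UnblendedCost"],
            GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
        )

        groups = response["ResultsByTime"][0]["Groups"]
        # Sort services by spend, highest first, to find the few driving most of the cost.
        groups.sort(key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True)
        for group in groups[:10]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            print(f"{service}: ${amount:,.2f}")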

     

    Sustainability

     

    This pillar was recently added, and it focuses on the long-term environmental, economic and societal impact of your business activities on AWS cloud

     

    It is important to understand that all workloads leave a carbon footprint. Since unnecessary storage and compute waste energy, one of the core guiding principles this pillar advocates is to eliminate redundant storage and compute, and to do everything that supports this goal (see the sketch below)
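
    As a small illustration of eliminating redundant storage, an S3 lifecycle rule can transition aging objects to cold storage and expire them when they are no longer needed. The bucket name, prefix and retention periods below are hypothetical.

        import boto3

        BUCKET = "example-log-archive"   # placeholder bucket

        boto3.client("s3").put_bucket_lifecycle_configuration(
            Bucket=BUCKET,
            LifecycleConfiguration={
                "Rules": [
                    {
                        "ID": "tier-then-expire-logs",
                        "Status": "Enabled",
                        "Filter": {"Prefix": "logs/"},
                        # Move logs to Glacier after 30 days, delete them after 365 days.
                        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                        "Expiration": {"Days": 365},
                    }
                ]
            },
        )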

     

    Let us look at a couple of instances where application of Well Architected Framework Principles led to benefits for customers

    ·       Consulting solutions firm Burns & McDonnell saved 30% of its overall AWS bill in the first week after taking action based on the Well Architected Framework guiding principles.

    ·       BMC Software applied the principles of the Well Architected Framework and saw the following benefits:

        o   Started delivering immediate value to customers

        o   Expanded offerings to new companies and departments

        o   Received positive customer feedback within the first 4 months

        o   Exceeded internal business objectives

     

     

    The Well Architected Framework is a good place to start for teams looking to optimize their AWS-based solutions along different dimensions. I encourage delivery teams to apply this framework and aim to build robust solutions.