Tuesday, July 26, 2022

An Introduction to the AWS Well Architected Framework


 

The AWS Well Architected Framework contains guidelines and signposts for putting together an efficient, cost-effective, robust infrastructure for greenfield implementations, as well as for evaluating existing cloud environments (using the Well Architected Tool). Within this framework, AWS also provides domain-specific guidance and whitepapers for building well architected workloads in areas like gaming, SAP, streaming media etc. These focus on nuances that need to be kept in mind while building infrastructure specific to these use cases.

AWS's position is that applications built on infrastructure adhering to the principles defined in the Well Architected Framework are well placed to stand up to scrutiny against multiple industry-standard benchmarks.

 

The AWS Well Architected Framework is built on 6 pillars identified as crucial

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization
  • Sustainability

Let’s look in detail at what each of these means.

    Operational Excellence

     Operational excellence can be thought of as the ability to create and operate an environment that hosts applications and workloads, with enough levers built in that can be tweaked for optimization. To limit human error, AWS recommends using code to set up, run and automate the environment. The best practice is to refine operating procedures frequently and arrive at the best possible combination of the provisioned, tweakable parameters for a given set of requirements. It is also advisable to change parameters in small increments or decrements which can be rolled back if the desired results are not achieved, thus keeping adverse impact on customers to a bare minimum.

     

    One should always aim to improve the overall operational health of the system. But in order to improve, one should be able to gauge where it is currently. In other words, one should be able to measure. It is not possible to measure without defining KPIs. KPIs need to be defined based on business outcomes and customer outcomes. Customer outcomes and business outcomes are defined based on business priorities. So, working our way backwards, in order to improve the overall operational health, it is very important to understand business priorities, and then go from there

    These priorities should be the driver for the environment setup. Obviously, business priorities change, and the environment should be flexible enough to adapt. Identify the touch points within the environment that must be amenable to change when business priorities change.

    Let us see an example. A B2C website may have an extremely fast backend RDS database linked to it, since the business priority during normal working hours is speed of response to user clicks. But during off-peak hours, when the computational load of processing the day’s operations on the RDS database becomes high, the business priority may no longer be the same. The hosting cloud environment must be flexible enough to adapt itself when business priorities change in this way.

    With the cloud, it is relatively easy to collect stats on how architectural decisions affect workload behaviour. Unlike in traditional data centers, this data can readily be used to change the environment so that workloads perform better. In other words, it is important to have levers built into the system which can be tweaked to improve operational efficiency.

    It is important to anticipate failures. Think of what-if scenarios and get a good understanding of what the impact on business will be in each of these scenarios. Also come up with response strategies and test them out to ensure they will be effective in real-life situations

    Let us try to understand this with an example. It is normal to see workload volumes increasing during certain predetermined hours or days. After configuring your system to scale up when workload increases, simulate a failure scenario wherein auto scaling doesn’t happen in response to increased workload. What would be the response of admin personnel in this situation? There needs to be a clearly chalked out SOP for a situation like this, and it needs to be tested to confirm that it works.

     

    To the extent possible, always automate the response to an event. AWS provides multiple ways to do this. CloudWatch is a service which lends itself admirably to these use cases. CloudWatch event rules (now part of Amazon EventBridge) and CloudWatch alarms are examples which can be leveraged here.
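As a rough illustration of automating a response with a CloudWatch alarm, the sketch below only builds the alarm definition as a plain Python dictionary; it does not call AWS. The function name `build_cpu_alarm`, the instance ID, the SNS topic ARN and the 80% threshold are all hypothetical values chosen for the example. The keys follow the parameter names of the CloudWatch `PutMetricAlarm` API.

```python
def build_cpu_alarm(instance_id, topic_arn, threshold=80.0):
    """Return alarm settings for high average CPU on one EC2 instance.

    All argument values are supplied by the caller; nothing is sent to AWS.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # evaluate 5-minute averages
        "EvaluationPeriods": 2,    # two consecutive breaches trigger the alarm
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # notified (or acted upon) when in ALARM
    }


# Hypothetical identifiers, for illustration only.
params = build_cpu_alarm(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmName"])
```

Assuming boto3 is installed and credentials are configured, a dictionary like this could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The alarm action could equally point at an Auto Scaling policy instead of a notification topic.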

     

    Security

    Although this term sounds as if it is self explanatory, this is probably one of the most challenging pillars to implement, especially in industries where data is sensitive

    Often, security issues happen due to lack of understanding and implementation of the most basic best practices. Before attempts are made to implement complex and fancy security procedures, ensure that some of the more fundamental, common sense security best practices are taken care of. This can potentially address most of the concerns

    One of the top things to keep in mind while implementing security for data, systems and assets is that security needs to be implemented at multiple levels and layers. For your use case(s), figure out what these levels and layers are.

    Both data at rest and data in motion need to be protected. Data in motion can either be flowing between AWS touchpoints, or it could be moving from on-prem to cloud or vice versa. Encrypting data in motion can be a challenging task. AWS provides multiple options for it, which are covered under "Protecting data in transit" later in this section. The following are general security best practices.

    Principals need to be given permissions based on the least privilege principle, and separation of duties needs to be enforced. 

    Reduce/eliminate reliance on long term static credentials. 

    The security environment must support traceability, a kind of "time travel" through logged activity, so that the principal who executed a particular command can be pinpointed. 

    It is very important to have an incident management system in place. In spite of good security mechanisms, there will be incidents. Teams must be able to isolate systems under attack at very short notice.

    Automate the incident response

    Run simulations of security breaches and have the appropriate team detect breaches in the minimum time possible. 

    The sources of threat to the environment may vary depending on the industry the customer operates in. So, it is important to be extremely conversant with the potential threats unique to the industry and tailor a security strategy that addresses these threats.
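To make the least privilege practice listed above concrete, here is a minimal sketch of an IAM identity policy that grants read-only access to a single S3 bucket and nothing else. The function name `read_only_bucket_policy` and the bucket name are hypothetical. The policy is only built and printed as JSON; attaching it to a role or user is left out.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document allowing only list/read on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# Hypothetical bucket name, for illustration only.
policy = read_only_bucket_policy("example-reports-bucket")
print(json.dumps(policy, indent=2))
```

The point of the sketch is what is absent: no wildcard actions, no write or delete permissions, and no resources beyond the one bucket.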


    Some of the important services that can be leveraged to implement security across different areas are as follows

    Areas and Key Services

    ·       Data Protection: EBS, S3, RDS, KMS, CloudHSM

    ·       Privilege management: IAM, MFA Token, Permissions, Roles

    ·       Infrastructure protection: VPC, WAF, Shield, CloudFront, Route 53

    ·       Detective controls: CloudTrail, Config, CloudWatch

     

    Protecting data in transit:

    ·       Use services like AWS PrivateLink to create a secure and private network connection from an AWS VPC or an on-prem installation to AWS based services. With PrivateLink, traffic stays on the Amazon backbone and therefore doesn’t traverse the public internet, which reduces its exposure

    ·       Use tools like GuardDuty to automatically detect attempts to move data outside of defined boundaries

    ·       Encryption in transit can be enforced in AWS. AWS services provide HTTPS endpoints using TLS for communication, thus providing encryption in transit when communicating with the AWS APIs. It is also possible to use VPN connectivity into VPC from an external network for data encryption
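One common way to enforce encryption in transit for S3 is a bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The sketch below only builds that policy document; the function name `deny_insecure_transport_policy` and the bucket name are hypothetical, and applying the policy to a real bucket is not shown.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that rejects all non-TLS (plain HTTP) requests."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
tls_policy = deny_insecure_transport_policy("example-reports-bucket")
print(json.dumps(tls_policy, indent=2))
```

Because the statement is an explicit deny, it takes precedence over any allow granted elsewhere, so plain HTTP access is blocked regardless of other permissions.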

     

    Reliability

     

    Reliability is the ability of a workload to perform its intended function correctly and consistently when it’s expected to, throughout its lifecycle

    ·       No matter how robust an environment is, it can buckle under unexpected load. A reliable system must be able to recover from such a failure automatically. Taking this to the next level, the system should have the intelligence to anticipate failure and automate the appropriate response

    ·       The percentage availability required of an application is dictated by its functionality. Applications with critical functionality need to be made highly available across multiple AZs, whereas for other applications high availability within a single AZ may suffice. This applies equally to front end applications and to backend apps like databases

    ·       Be very clear about needs for RPO and RTO and use this info to build in reliability

    ·       In addition to verifying that the workload works in best case scenarios, conduct negative testing to simulate scenarios that would cause the workload to fail. This gives the opportunity to test recovery procedures

    ·       Where possible, replace single large resources with multiple smaller resources. More importantly, ensure they don’t share a single point of failure

    ·       Simulate failures and define SOPs in order for applications to be brought back live in case of failures

    ·       Having a good monitoring system is essential for a reliable system

    ·       Use logs and metrics wisely. Very often, logs and metrics tell a story on how your environment is being utilized and under what load. So, carry out periodic analysis on them and take appropriate action

    ·       Monitoring + Alerting + Automation = Self Healing. For example, CloudWatch and Auto Scaling can be used together to recover from failed EC2 instances

    ·       It is relatively easy to automate the system in such a fashion that it reacts to certain trigger events. Leverage this to get the environment to auto correct itself when faced with an imminent failure

    ·       Backups. Depending on the criticality of the data, the RPO and RTO requirements, and how the data is used by applications, one needs to come up with an appropriate backup strategy, including the frequency at which the data is backed up. Backing data up is not enough; it is equally important to ensure that it can be restored and to keep the time taken to restore it down to a minimum. Obviously, once restored, the data should be in a usable state.
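The points above on multi AZ deployment, avoiding single points of failure, and RPO/RTO can be put into rough numbers. The sketch below assumes that replicas fail independently and that one healthy replica is enough to serve traffic, which is a simplification. It also assumes that worst-case data loss is about one backup interval. All function names and figures are illustrative.

```python
def redundant_availability(single_availability, copies):
    """Availability of `copies` independent replicas when any one suffices."""
    return 1 - (1 - single_availability) ** copies


def meets_rpo(backup_interval_minutes, rpo_minutes):
    """Worst-case data loss is roughly one backup interval."""
    return backup_interval_minutes <= rpo_minutes


def meets_rto(restore_minutes, rto_minutes):
    """A measured (tested) restore time must fit within the RTO."""
    return restore_minutes <= rto_minutes


# Illustrative figures only: a component that is 99% available on its own.
one_az = redundant_availability(0.99, 1)
two_az = redundant_availability(0.99, 2)
print(f"single AZ: {one_az:.4%}, two AZs: {two_az:.4%}")

# Hourly backups cannot satisfy a 15-minute RPO; 5-minute backups can.
print(meets_rpo(60, 15), meets_rpo(5, 15))
```

Under these assumptions, the second replica turns roughly 1% expected downtime into roughly 0.01%, which is the arithmetic behind preferring several smaller resources over a single large one.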

    Performance Efficiency

                 

    The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.  As with the other pillars there are a few things to keep in mind here

    ·       XaaS. These days many technologies can be consumed as a service and do not entail installing and administering them ourselves. The service provider would be an expert in ensuring optimal efficiency of the service; leverage that to the extent possible

    ·       Go serverless. As a corollary to the above principle, it makes sense to leverage serverless compute and serverless storage in order to avoid provisioning them ourselves. However, there may be a tradeoff on cost and one would need to balance performance efficiency against it

    ·       Having made these choices, it is important to review them periodically and check for variances from expected performance. If there are any, they need to be addressed

    ·       Usually, there will be multiple services for similar use cases. Understand which service(s) best fit your specific use case. Consult AWS if needed

    ·       When trying to improve performance efficiency, work against a specific target and benchmark existing efficiency for a given set of parameters. This will let you measure improvements objectively

    ·       Work against permissible cost for your set of requirements

    ·       Analyze metrics and access patterns, and choose storage and compute options based on these

    ·       Network parameters usually will have a big impact on performance and efficiency. Study this closely and make appropriate configuration changes
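Working against a specific target, as suggested above, needs a benchmark that can be recomputed after every change. A minimal sketch, assuming response latency is the metric and a 95th percentile target is what the business has agreed on; the sample values, the 200 ms target and the function names are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples, pct, target_ms):
    """True when the chosen percentile is at or below the target."""
    return percentile(samples, pct) <= target_ms


# Made-up latency samples in milliseconds.
latencies_ms = [120, 95, 110, 300, 105, 98, 101, 99, 250, 102]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95, meets_target(latencies_ms, 95, 200))
```

Capturing the same percentile before and after a configuration change gives an objective measure of whether the change helped.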

     

    Cost Optimization

    Cost optimization is the process of minimizing costs through continual refinement and improvement, without compromising on business outcomes

     

    ·       Use tools like AWS Budgets wisely and extensively to stay within limits. However, monitor costs proactively and don’t depend just on notifications

    ·       Keep Finance and Technology teams in the loop on all decisions in the cloud journey

    ·       Aim to innovate without overspending

    ·       Create groups and roles which control who can commission/decommission instances and resources. This is a good way to keep rising costs in check

    ·       Have policies in place which preempt unnecessary resource use.

    ·       Know exactly where costs are being incurred and focus more on those areas to implement cost controls. 80% of the costs may be incurred by 20% of the services. Focus more on these services
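The 80/20 observation above can be checked against an actual bill. The sketch below takes monthly cost per service and returns the smallest set of services, largest first, that accounts for a given share of the total. The function name `top_cost_drivers` and the cost figures are invented for illustration and are not real billing data.

```python
def top_cost_drivers(cost_by_service, share=0.8):
    """Return the services, largest first, that together reach `share` of spend."""
    total = sum(cost_by_service.values())
    drivers = []
    running = 0.0
    ranked = sorted(cost_by_service.items(), key=lambda kv: kv[1], reverse=True)
    for service, cost in ranked:
        if running >= share * total:
            break  # the services collected so far already cover the share
        drivers.append(service)
        running += cost
    return drivers


# Invented monthly figures (in dollars), for illustration only.
monthly_costs = {
    "EC2": 5200.0,
    "RDS": 2900.0,
    "S3": 700.0,
    "CloudWatch": 400.0,
    "Lambda": 300.0,
    "Route 53": 50.0,
}
drivers = top_cost_drivers(monthly_costs)
print(drivers)
```

With these sample figures, two of the six services cover more than 80% of the spend, so cost controls would be focused there first.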

     

    Sustainability

     

    This pillar was added in late 2021, and it focuses on the long-term environmental, economic and societal impact of your business activities on the AWS cloud

     

    It is important to understand that all workloads leave a carbon footprint. Since unnecessary storage and compute waste energy, one of the core guiding principles of this pillar is to eliminate redundant storage and compute and to do everything that supports this cause

     

    Let us look at a couple of instances where application of Well Architected Framework Principles led to benefits for customers

    ·       Consulting solutions firm Burns & McDonnell saved 30% of its overall AWS bill in the first week after taking action based on the Well Architected Framework guiding principles.

    ·       BMC Software used the principles of the Well Architected Framework and saw the following benefits: it was able to start delivering immediate value to customers, expanded its offerings to new companies and departments, received positive customer feedback within its first 4 months, and exceeded internal business objectives.

     

     

    The Well Architected Framework is a good place to start for teams looking to optimize their AWS based solutions along different dimensions. I encourage delivery teams to apply this framework and aim to build robust solutions

     

    Monday, May 30, 2022

    The rise of Annamalai - TN state BJP president


     

    In the dark cesspool of politics that TN finds itself in today, a ray of hope is emerging. His name is Annamalai. In a masterstroke, the BJP appointed Annamalai as its state president in the 2nd half of 2021. Since then, Annamalai has been giving the ruling DMK nightmares on a daily basis.


    An ex IPS officer, Annamalai is honest, upright, articulate and fluent in English - everything that DMK politicians are not. His communication in Tamil, his mother tongue, is superb. Apparently, he is working to pick up Hindi as well.


    Annamalai regularly holds press meets in which he picks up bad policy and administrative decisions of the DMK and tears into the administration. He does his homework thoroughly (another trait which is new to TN politics) and comes fully prepared to field questions from the media, the majority of whom are DMK stooges. A hallmark of Annamalai's interactions with the media is how he rattles off stats and figures backing his claims.


    On multiple occasions in the past few months, the DMK government has had to back down on their decisions owing to immense pressure from him. It seems to me as if for the first time in TN, the people of the state are waking up to the realization that there is an alternative to the so-called Dravidian ideology that both DMK and AIADMK have been following and propagating in the state for close to 55-60 years (after the Congress ceased to be a power in the state).

     

    We need to keep in mind that the AIADMK is the principal opposition party in TN. So in a way, Annamalai's offensive is not only a cause of concern as far as the DMK is concerned, it is also a slap in the face of the AIADMK. He has taken the wind out of the sails of the principal opposition party. Of course, there is not much difference ideologically between the DMK and the AIADMK, which would explain why the AIADMK has no locus standi to oppose the DMK on its politically motivated administrative decisions

     

    Annamalai is very closely aligned with Modi's long-term plan for Tamil Nadu. We have been seeing how the PM has been relentlessly focussing on TN. Right from his decision to host China's Xi in Mahabalipuram, to his infrastructure investments in TN, his strategic quotes of Tamil poems both at home and abroad, and the decisions of the Railways to modernize railway stations in key TN cities, Modi has been TN centric both subtly and openly.

    So, it is clear that the central BJP leadership is going all out to win the confidence of the Tamil people. And it has begun to show results as can be seen from the local body elections in Chennai and in the 2021 Assembly elections where BJP got a significant 8% vote share.

     

    Annamalai is confident of winning 15 Lok Sabha seats in the 2024 general elections and capturing power in the 2026 Assembly elections in Tamil Nadu. I think that is definitely on.

    Thursday, January 13, 2022

    MahaPeriyava Shankarapuram: Shri Vakil Anna - man on a mission

     


    Vakil Shri VenkataSubramanyam is a man on a mission. This devotee of MahaPeriyava is trying to bring the Brahmin population spread across the world, onto a common platform. The advantages of this, as one can readily imagine, are several and will work on multiple levels

    As part of his other ongoing programmes (which are too many to go into in detail here), he has indirectly been doing this for the past several years and has already onboarded 28000+ Brahmin families onto his platform. Technology has been an enabler here and he has been leveraging WhatsApp extensively.

    Vakil Anna has a team of dedicated volunteers working for him and through them, operates dozens of WhatsApp groups and corresponds with them via voice messages on a daily basis, often more than once a day.

    Many of you may already be familiar with Vakil Anna and his work. His Shankarapuram project (https://www.srisankarapuram.com/) is now world famous and the scale and speed at which activities are going on in this village for the greater common good, is phenomenal.

     

    The Brahmin of today has forsaken the three fundamental things required of him - Agni SamrakshaNam (Yagnas, Homams etc), Go SamrakshaNam (rearing cow/calf in his house), Veda SamrakshaNam (chanting of the Vedas). Sandhyavandanam, which is the core aspect of the Brahmin, has also been sacrificed; money making has become the prime motive. The situation has been exacerbated in the last 75-80 years. This has led to the delicate equilibrium of the world going out of sync; it has also upset the symbiotic relationship between humans and the Devas. The disastrous results are there for all to see around us. The rot that has set in in society has a direct connection with the Brahmin giving up his prescribed duties.

     

    Due to the almost superhuman effort by Vakil Anna over the past few years, a transformation can be seen at least in a few Brahmin families fortunate enough to be in Vakil Anna's groups. They are realizing that maybe they have gone astray and that corrective measures need to be put in place at least now, to slow down the degeneration of the community. They are realizing that if they don't act now, their children will have no religious moorings at all and will drift further and further away from our traditions, culture and values.

    But this revolution has to catch on. We all need to do our bit. Putting aside our petty differences, we need to show our support to Vakil Anna.

     

    How can you do that?

    ·       Fill out the Google form at https://tinyurl.com/SSMAGGNewmemberForm (this will take less than 2 minutes)
    ·       Please share this blog with your Brahmin friends and relatives and encourage them to join the family

     

    Vakil Anna already has an infrastructure in place to take this information forward and onboard you into his groups.