Design Principles

  • Perform operations as code
  • Annotate documentation: automate the creation of documentation after every build, or automatically annotate hand-crafted documentation. Annotated documentation can be used by people and systems. Use annotations as an input to your operations code.
  • Make frequent, small, reversible changes: You want to design workloads to allow components to be updated regularly.
  • Refine standard operations frequently
  • Anticipate failure: run pre-mortem exercises to identify potential sources of failure before they occur
  • Learn from all operational failures

Core Services

  • AWS CloudFormation is key for operational excellence because it helps you ensure reliability. The service lets you treat infrastructure as code, and it provides templates you can use to replicate your environment.
  • To prepare, AWS Config and AWS Config Rules can be used to create standards for workloads, and to determine whether environments are compliant with those standards before they are put into production.
  • CloudWatch allows you to monitor the operational health of a workload.
  • Amazon Elasticsearch Service allows you to analyze your log data to gain actionable insights quickly and securely.
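
As a rough illustration of treating infrastructure as code, the sketch below builds a minimal CloudFormation template in Python and serializes it to JSON. The helper name build_template, the logical ID LogBucket, and the Environment tag are assumptions made for this example and do not come from these notes.

```python
import json


def build_template(environment: str) -> dict:
    """Return a minimal CloudFormation template as a plain dict.

    The logical ID "LogBucket" and the "Environment" tag are hypothetical.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Log bucket for the {environment} environment",
        "Resources": {
            "LogBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                    "Tags": [{"Key": "Environment", "Value": environment}],
                },
            }
        },
    }


# Calling the function again with the same argument yields the same
# definition, which is what makes the environment reproducible.
template_json = json.dumps(build_template("staging"), indent=2, sort_keys=True)
```

Because the template is ordinary data, it can be reviewed, versioned, and diffed like any other source file before it is deployed.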

Anti-Patterns

Focus only on technology metrics

If you watch only technology metrics such as latency, you cannot tell whether you are delivering value to the customer.
Also track business metrics, which AWS tooling can capture.
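
One possible way to keep both views together is to derive a technology metric and a business metric from the same request records. The sketch below is only an illustration: the Request type, its purchased field, and the choice of conversion rate as the business metric are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # input to the technology metric
    purchased: bool    # input to the business metric (hypothetical field)


def summarize(requests: list[Request]) -> dict:
    """Report one technology metric and one business metric side by side."""
    if not requests:
        raise ValueError("no requests to summarize")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), 1-based.
    rank = max(1, -(-95 * len(latencies) // 100))
    p95 = latencies[rank - 1]
    conversion = sum(r.purchased for r in requests) / len(requests)
    return {"p95_latency_ms": p95, "conversion_rate": conversion}
```

A healthy latency figure next to a falling conversion rate is exactly the situation that technology metrics alone would hide.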

Batch Changes

Instead, make small, reversible changes

Manual Changes

It is hard to reproduce errors caused by manual changes

Stale (outdated) Documentation

Having outdated documentation or no documentation can create problems. Put a process in place to ensure all documentation is up-to-date.

Prepare

Monitor the application, platform, and infrastructure components

You can use CloudWatch alarms, and send information from CloudWatch Logs to a dashboard, to see the health of your infrastructure at any time. You can also use this information to understand customer experience and behavior.
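
To make the idea of an alarm concrete, here is a simplified model of threshold evaluation over recent datapoints. It is a sketch, not CloudWatch's actual implementation: the function name alarm_state is hypothetical, breaching means strictly greater than the threshold, and missing data is not treated specially. The three state names match the alarm states CloudWatch reports.

```python
def alarm_state(datapoints: list[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for the most recent periods.

    Simplified model: a datapoint breaches when it is strictly greater than
    the threshold, and missing data is not handled specially.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Only the most recent evaluation_periods datapoints are considered.
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

Requiring several breaching datapoints within the window, rather than a single one, is a common way to avoid alarming on a brief spike.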

Validate workloads before moving into production

Ask yourself: are the workloads supported by operations?

Perform Cloud Operations

  • Use checklists for standard and required procedures
  • Check that required procedures are adequately captured in runbooks and playbooks
  • Validate that personnel are trained and enabled to support the workload
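
A checklist can also be kept as data so that readiness is evaluated the same way every time. The following is a minimal sketch; the ChecklistItem type and the readiness function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    description: str
    required: bool
    done: bool = False


def readiness(items: list[ChecklistItem]) -> tuple[bool, list[str]]:
    """Return (ready, descriptions of required items that are still open)."""
    outstanding = [i.description for i in items if i.required and not i.done]
    return (not outstanding, outstanding)
```

Optional items never block readiness in this sketch; only required items that are still open do.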

Test responses to operational events and failures

Make sure you test responses to operational events and failures so that you can quickly recover from them.

Key services: AWS Config and AWS Config Rules, which you use to define the standards that workloads must meet.
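
The idea of checking environments against standards before production can be sketched in plain Python. This is not the AWS Config API; the rule names, the resource fields (id, encrypted, tags), and both function names are assumptions chosen for the example.

```python
from typing import Callable

# Each rule maps a resource description to True (compliant) or False.
# Rule names and resource fields are hypothetical.
Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "encryption-enabled": lambda res: res.get("encrypted") is True,
    "has-owner-tag": lambda res: "owner" in res.get("tags", {}),
}


def evaluate(resources: list[dict]) -> dict[str, list[str]]:
    """Return the names of failed rules for every non-compliant resource."""
    findings = {}
    for res in resources:
        failed = [name for name, rule in RULES.items() if not rule(res)]
        if failed:
            findings[res["id"]] = failed
    return findings


def ready_for_production(resources: list[dict]) -> bool:
    """A workload passes the gate only when no resource fails any rule."""
    return not evaluate(resources)
```

Keeping the rules in one table means a proposed change to a standard is a small, reviewable change to that table.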

Operate

  • Achieve business and customer outcomes through the successful operation of a workload.
  • Manage operational events with efficiency and effectiveness. You can do this by:
    • establishing baselines that you use to identify the improvement or degradation of operations,
    • collecting and analyzing your metrics
    • then validating your understanding of how you define operational success and how it changes over time.
  • Communicate the operational status of workloads.
  • Consider that operational health includes both the health of the workload, and the health and success of the operations that act upon the workload—for example, deployment and incident response.
  • Use dashboards and notifications so that information can be accessed automatically. The more people have access to information about the health of your infrastructure, the healthier it will be.
  • Take the time to determine the root cause of a workload outage.
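
The baseline idea in the list above can be sketched as a comparison of a current value against historical values. This is one simple approach among many; the function name classify and the three-standard-deviation tolerance are arbitrary illustrative choices.

```python
from statistics import mean, stdev


def classify(baseline: list[float], current: float, tolerance: float = 3.0) -> str:
    """Compare a current value with a baseline of historical values.

    Returns "within_baseline", "above_baseline", or "below_baseline".
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    if current > center + tolerance * spread:
        return "above_baseline"
    if current < center - tolerance * spread:
        return "below_baseline"
    return "within_baseline"
```

Whether a value above the baseline counts as improvement or degradation depends on the metric: higher is worse for latency but better for throughput.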

Evolve

  • Dedicate work cycles to making continuous incremental improvements.
  • Regularly evaluate and prioritize opportunities for improving procedures for both workloads and operations, such as feature requests, issue remediation, and compliance requirements.
  • Identify areas for improvement, and include feedback loops within your procedures.
  • Share lessons learned across teams so that every team benefits from them.
    • Analyze trends within the lessons learned
    • Perform cross-team retrospective analysis of operations metrics
    • Identify opportunities and methods for improvement.
    • Implement changes and evaluate the results
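
Analyzing trends within lessons learned can start with something as simple as counting recurring root causes. The sketch below assumes each incident record is a dict with a root_cause key; that record shape and the function name are hypothetical.

```python
from collections import Counter


def recurring_causes(incidents: list[dict], min_count: int = 2) -> list[tuple[str, int]]:
    """Return root-cause categories seen at least min_count times, most frequent first."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]
```

Causes that recur across teams are natural candidates for the improvement work described above.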

Evolve with AWS Services - Developer Tools

AWS CodeStar

Each AWS CodeStar project comes with a project management dashboard, including an integrated issue tracking capability powered by Atlassian JIRA Software.
AWS CodeStar provides a unified user interface, enabling you to easily manage your software development activities in one place.
There is no additional charge for using AWS CodeStar.

AWS CodeCommit

AWS CodeCommit is a fully-managed source control service that hosts secure Git-based repositories.

AWS CodeBuild

AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages.

AWS CodeDeploy

AWS CodeDeploy is a fully managed deployment service that automates software deployments to a variety of compute services such as Amazon EC2, AWS Fargate, AWS Lambda, and your on-premises servers.

AWS CodePipeline

AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
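
Conceptually, a release pipeline runs a fixed sequence of stages and stops at the first failure. The following sketch models only that idea and does not use the CodePipeline API; the stage names and the run_pipeline function are assumptions.

```python
from typing import Callable

# A stage is a name plus an action that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that were executed).
    """
    executed = []
    for name, action in stages:
        executed.append(name)
        if not action():
            return False, executed
    return True, executed
```

Stopping early means a failed build never reaches the deploy stage, which keeps each release small and easy to reason about.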

AWS X-Ray

AWS X-Ray (https://aws.amazon.com/xray/) helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture.
With X-Ray, you can:

  • understand how your application and its underlying services are performing
  • identify and troubleshoot the root cause of performance issues and errors
  • have an end-to-end view of requests as they travel through your application
  • see a map of your application’s underlying components.

You can use X-Ray to analyze applications both in development and in production, from simple three-tier applications to complex microservices applications consisting of thousands of services.
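
To illustrate how an end-to-end trace helps locate a performance problem, the sketch below models a request as a tree of timed segments and finds the one with the most time of its own. It does not use the X-Ray SDK; the Segment type is hypothetical, and it assumes child segments run one after another inside their parent.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    name: str
    duration_ms: float  # total time in this segment, children included
    children: list["Segment"] = field(default_factory=list)


def self_time(segment: Segment) -> float:
    """Time spent in the segment itself, excluding its children."""
    return segment.duration_ms - sum(c.duration_ms for c in segment.children)


def slowest(segment: Segment) -> Segment:
    """Return the segment in the tree with the largest self time."""
    best = segment
    for child in segment.children:
        candidate = slowest(child)
        if self_time(candidate) > self_time(best):
            best = candidate
    return best
```

Looking at self time rather than total time keeps a parent segment from being blamed for latency that actually comes from a downstream call.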

Evolve with AWS Services - Amazon Elasticsearch Service

Amazon Elasticsearch Service provides support for open source Elasticsearch APIs, managed Kibana, integration with Logstash and other AWS services, and built-in alerting and SQL querying.
It allows you to analyze your log data to gain actionable insights quickly and securely.
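
As an example of the kind of log analysis this enables, the sketch below builds an Elasticsearch query DSL body that counts recent error log lines per service. The field names level, @timestamp, and service are assumptions about the log index mapping, and the function name is hypothetical; the body would be sent to the index's _search endpoint.

```python
import json


def error_count_query(minutes: int = 15, top_n: int = 10) -> dict:
    """Build a search body that counts recent ERROR log lines per service.

    The field names "level", "@timestamp", and "service" are assumptions
    about how the log index is mapped.
    """
    return {
        "size": 0,  # return only aggregation buckets, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "aggs": {
            "errors_by_service": {"terms": {"field": "service", "size": top_n}}
        },
    }


query_json = json.dumps(error_count_query(), indent=2)
```

Filter clauses are used instead of scoring clauses because the question is a yes-or-no match on each log line, not a relevance ranking.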

What factors drive your operational priorities?

  • Business Needs
  • Compliance Requirements
  • Risk Management

Determining whether you are ready to support a workload

Best practices:

  • Continuously improving your culture. This best practice governs the way you operate. You must recognize that change is constant, and that you need to continue to experiment and evolve by acting on opportunities to improve.
  • Having a shared understanding of the value to the business. Make sure that you have cross-team consensus on the value of the workload to the business, and that you have procedures that you can use to engage additional teams for support.
  • Ensuring that you have an appropriate number of trained personnel to support the needs of your workload.
    • Perform regular reviews of workload demands
    • Train existing personnel or adjust personnel capacity as needed.
  • Making sure that governance and guidance are documented and accessible.
    • Ensure that standards are accessible, readily understood, and measurable for compliance. Make sure that you have a way to propose changes to standards, and request exceptions.
  • Using checklists to evaluate whether you are ready to operate workloads. These checklists include operational readiness checklists and security checklists.
  • Having runbooks for events and procedures that you understand well.
  • Having a playbook for failure scenarios.
  • Practicing recovery so that you can identify potential failure scenarios and test your responses—for example, game days, and failure injection.
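
Failure injection, as mentioned in the last item above, can be practiced on a small scale with a test double that fails on purpose. The sketch below is illustrative only: FlakyDependency, fetch_with_recovery, and the retry-then-fallback response are assumptions, not procedures taken from these notes.

```python
class FlakyDependency:
    """Test double that fails its first `failures` calls, then recovers."""

    def __init__(self, failures: int):
        self.remaining_failures = failures
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected failure")
        return "fresh"


def fetch_with_recovery(dep: FlakyDependency, max_attempts: int = 3,
                        fallback: str = "cached") -> str:
    """Response under test: retry up to max_attempts, then serve a fallback."""
    for _ in range(max_attempts):
        try:
            return dep.fetch()
        except ConnectionError:
            continue
    return fallback
```

Running the response against both a short and a long injected outage shows whether the documented recovery path behaves as the playbook says it should.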

What factors drive your understanding of operational health?

Best practices:

  • Defining expected business and customer outcomes. Make sure that you have a documented definition of what success looks like for the workload, from business and customer perspectives.
  • Identifying success metrics. Define metrics that can be used to measure the behavior of the workload against the expectations of the business and of customers.
  • Identifying workload metrics. Define metrics that can be used to measure the status and success of the workload and its components. (Technology metrics)
  • Identifying operations metrics. Define metrics that can be used to measure the execution of operations activities, such as runbooks and playbooks.
  • Establishing baselines for metrics so that they provide expected values as the basis for comparison.
  • Collecting and analyzing your metrics. Perform regular, proactive reviews to identify trends and determine responses.
  • Validating insights. Review the results of your analysis and responses with cross-functional teams and business owners. Adjust the responses as appropriate.
  • Taking a business-level view of your operations. Determine whether you are satisfying customer needs, and identify areas that need improvement so that you can reach your business goals.
  • Determining the priority of operational events based on their impact on the business. When multiple events require intervention, priority is based on the business impact.
  • Putting processes in place to handle event, incident, and problem management.
  • Processing each alert. Any event for which you raise an alert should have a well-defined response, such as a runbook or playbook. The event should also have a specifically identified owner, such as an individual, a team, or a role.
  • Defining escalation paths. Runbooks and playbooks should define what triggers an escalation, describe the process for escalation, and specifically identify the owners for each action. Escalations might include third parties, such as vendors and AWS Support.
  • Identifying decision makers.
  • Communicating operating status through dashboards.
  • Pushing notifications, such as email or SMS, to tell your users when the services they consume are impacted and when those services return to normal operating conditions.
  • Establishing a root cause analysis process that identifies and documents the root cause of an event.
  • Communicating the root cause of an issue or event. Also make sure that you tailor your communications to the target audiences.
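
Two of the practices above, giving every alert an owner and a defined response and prioritizing events by business impact, can be sketched as simple checks over alert records. The Alert type, its numeric business_impact scale, and both function names are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    name: str
    business_impact: int           # higher means more impact; scale is hypothetical
    owner: Optional[str] = None    # individual, team, or role
    runbook: Optional[str] = None  # runbook or playbook reference


def unactionable(alerts: list[Alert]) -> list[str]:
    """Names of alerts that lack an owner or a defined response."""
    return [a.name for a in alerts if not a.owner or not a.runbook]


def triage(alerts: list[Alert]) -> list[Alert]:
    """Order alerts so that the highest business impact is handled first."""
    return sorted(alerts, key=lambda a: a.business_impact, reverse=True)
```

An alert that shows up in the unactionable list is a sign that either the alert should not exist or its response has not been written yet.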