Ready for production?

By Jorge Rodrigues

The perspective of Site Reliability Engineering (SRE) teams

In modern technical endeavours, success requires close collaboration between different perspectives and specialities. For a discipline to advance and mature, it needs to reflect a diverse set of perspectives and synthesise the nuances that make a product more stable and reliable.

When product, development, and SRE teams work together from the beginning of a project, it becomes possible to achieve stability, reliability, availability, and security. The business receives not just value, but also sustainability.

Until recently, many software projects considered non-functional requirements only at the end of the project, when it was too late to correct the problems detected. This contributed to long lead times and questionable quality, negatively impacting one of the company's most precious assets: its customers. Non-functional requirements should not be addressed only at the end of a project; they should be part of every stage of daily work in the software development cycle, resulting in better quality, security, and compliance.

Every organisation has two goals to achieve:

  • Responding to competitive and volatile scenarios
  • Providing stable and reliable services to the customer

Initiatives run in separate silos prevent the organisation from achieving its global goals. For this reason, a governance process is of the utmost importance in today's companies. Governance is the process by which we make sure that the right things are done at the right time and that potential problems are identified at an early stage, before going into production.

One of the simplest and most effective ways of supporting the governance process is a document listing the steps that must be validated throughout the different stages of the software development life cycle. A production readiness checklist keeps everyone on track and helps gauge whether a service/feature is ready for production and whether the team has carried out adequate production planning.

Reasons to exist

In the book "Site Reliability Engineering" there is a chapter dedicated to the Production Readiness Review (PRR), which shows how important this review is. The PRR is a simple checklist that allows us to ascertain whether a service/feature meets acceptable standards to be deployed to production and whether there is an adequate plan for doing so. Besides that, it creates a bridge of understanding between SRE teams and Development teams.

The PRR targets several aspects of the road to production, but its main objective is to help improve service reliability in production by minimising the volume and impact of potential incidents caused by unmet non-functional requirements.

What do you get?

Although it might be considered extra work for Development teams, this step can safeguard the platform from potential future problems and help ensure the successful launch of a new service/feature. The process includes:

  • Preparing the necessary dependencies in advance

  • Motivating Development teams to implement the tasks needed for a stable and reliable system in production

  • Identifying and prioritising key reliability requirements

  • Planning knowledge transfer sessions

  • Defining SLOs and error budget policies

  • Checking whether updates will be too disruptive

  • Ensuring the new service/feature is sufficiently instrumented

  • Ensuring security aspects are considered and implemented

  • Making sure all parties are aware of the upcoming change
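
To make the SLO and error budget item in the list above more concrete, here is a minimal Python sketch of the underlying arithmetic. The 99.9% target, the 30-day window, and the function names are illustrative assumptions, not values taken from any particular policy.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` for an availability SLO.

    `slo_target` is a fraction, e.g. 0.999 for 99.9% (an assumed example).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        # A 100% target leaves no budget at all.
        return 0.0 if downtime == timedelta(0) else float("-inf")
    return 1.0 - downtime / budget


# Hypothetical service: 99.9% availability over a 30-day window.
window = timedelta(days=30)
print("allowed downtime:", error_budget(0.999, window))
print("budget left after 20 min down:",
      round(budget_remaining(0.999, window, timedelta(minutes=20)), 2))
```

With these assumed numbers, a 99.9% target over 30 days allows roughly 43 minutes of downtime. An error budget policy would then state what happens once that budget is spent, for example pausing feature launches until reliability recovers.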

How is the objective achieved?

The PRR identifies the reliability requirements that the SRE team, based on its knowledge, considers fundamental before taking over the operation of the service/feature in production.

In her book "Production-Ready Microservices: Building Standardized Systems Across an Engineering Organization", Susan J. Fowler mentions some areas to take into account in the PRR:

  • Service is Stable and Reliable

  • Service is Scalable and Performant

  • Service is Fault Tolerant and Prepared for Any Catastrophe

  • Service is Properly Monitored

  • Service is Documented and Understood


The PRR checklist should not be too extensive and should correspond to the maturity of the processes and good practices established by the organisation.
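
As a rough illustration of how such a checklist could be tracked, the sketch below models PRR items grouped by the five areas above and reports what still blocks a launch. The class names, the example service, and the item descriptions are all hypothetical; only the area names come from the list above.

```python
from dataclasses import dataclass, field

# Area names follow the list above; everything else is illustrative.
AREAS = (
    "Stable and Reliable",
    "Scalable and Performant",
    "Fault Tolerant and Prepared for Any Catastrophe",
    "Properly Monitored",
    "Documented and Understood",
)


@dataclass
class CheckItem:
    area: str
    description: str
    required: bool = True   # optional items never block a launch
    done: bool = False

    def __post_init__(self) -> None:
        if self.area not in AREAS:
            raise ValueError(f"unknown area: {self.area!r}")


@dataclass
class ReadinessReview:
    service: str
    items: list[CheckItem] = field(default_factory=list)

    def blocking(self) -> list[CheckItem]:
        """Required items that have not been completed yet."""
        return [i for i in self.items if i.required and not i.done]

    def is_ready(self) -> bool:
        return not self.blocking()


review = ReadinessReview(
    service="checkout-api",  # hypothetical service name
    items=[
        CheckItem("Properly Monitored", "Dashboards and alerts exist", done=True),
        CheckItem("Documented and Understood", "Runbook reviewed by SRE"),
        CheckItem("Scalable and Performant", "Load test report attached",
                  required=False),
    ],
)

for item in review.blocking():
    print(f"[{item.area}] {item.description}")
print("ready:", review.is_ready())
```

Keeping a `required` flag per item is one way to keep the checklist short: an organisation early in its reliability journey can mark fewer items as blocking and tighten them as its practices mature.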

 

When a Development team asks the SRE team to take responsibility for a service/feature, the SRE team assesses its criticality and validates whether it has the capacity to support the new service/feature in production. To help with this analysis, a production readiness review is started together with the Development team. Only after the PRR has been validated does the SRE team accept responsibility for operating the service/feature in production.

The PRR process is not set in stone and must be revised periodically, so that it can improve as new levels of reliability maturity are reached. After the launch of a new service/feature, we should check whether we missed any points that could cause a problem. Likewise, after an incident affecting the new service/feature, we should check whether any of the contributing factors could have been covered by the PRR. Ultimately, these findings should be reviewed and added to the PRR if relevant.

As Pavlos Ratis wrote in the article "The Production Readiness Spectrum": "What makes a service ‘production ready’ is a moving target. No, it's something we can prescribe. It's a spectrum." What makes a product reliable today may no longer be enough tomorrow. Technological advances are constant, and the PRR process has to keep pace with them: at the beginning with more effort from all teams involved, and later with smaller improvements.


Our organisation's technology and requirements change over time, so it is impossible to guarantee a fixed state of production readiness in our infrastructure ad aeternum.

Final words

We should be focusing on better and more reliable software, not on better incident response. If we have more reliable software, we will have fewer incidents and can be more focused on tasks that add value.


