How to empower teams to better support software systems?

Creating new systems from scratch is so much fun. I love it when I can dream up a project. I have a candy shop full of technologies to choose from. It is fun drawing all those shapes and connecting the lines when laying out the architecture of the system. The highlight for me is when development starts. Setting up the CI/CD pipelines is less fun but necessary, and then comes that magical moment when you promote the application to production! I have the best job in the world!

Well, not for all engineers. What about the engineers or teams that have to maintain the system? These engineers don’t have the in-depth understanding of the system that I had, because I was there from day 1. During the planning phases, did we think of them? Probably not.

I want to put the focus back on two often neglected functions, so that support and maintainability are taken into account during the initial stages instead of being handled reactively later. This will make supporting any system a more pleasant and productive experience for the next engineer or team.

Handover to another team for support

Once a new system moves into the production environment, that is when the real “fun” starts. It is seldom the case that Team A develops the system or feature and maintains it until its end-of-life. Even if we assume for a moment that it does, at some point people leave, and the team with the same name now looks different, while the system has not changed apart from enhancements and new features.

Team A, the original developer of the system, hands it over to Team B, the team that will support the new system or feature. Team A, a high-performing, fast-executing team with a specific skill set, moves on to a new project. Energy flows where the attention goes. Team A neglected to create sufficient documentation on the troubles the system experienced during the development and initial production phases.

When Team B takes over, it more often than not requires a handover meeting with Team A. Team A now needs to spend energy getting all the documentation up to date and adding documentation for the gaps Team B identified. The timing sucks because Team A has to context switch and create more documentation in a hurry while their focus has already shifted to other priorities. The quality of both the documentation and the communication during the handover suffers.

Look at the following scenarios:

1. Team A: Creates complete documentation.
2. Team B: Reads all the documentation. Is self-sufficient and productive.
Table A

1. Team A: Creates incomplete documentation.
2. Team B: Reads the incomplete documentation. Identifies the gaps.
3. Team A: Meets with Team B.
4. Team B: Meets with Team A.
5. Team A: Updates the outstanding documentation.
Table B

I’ll admit, Table A looks like a pipe dream. Nonetheless, let’s marvel at the beauty of it. Extremely efficient!

During the development phase, many issues are identified and fixed, which provides a good opportunity to document them for future reference. Problems that were resolved by a code fix don’t all have to be documented, but some process and pipeline problems will recur.

Production is the best place to identify the one-percenters. Systems behave differently in different environments. The production database is much larger than those in the non-prod environments, which have obfuscated data and much less of it. During the initial production phase, there will be enough new problems to document because of the sheer volume and combinations of data to be served.

It is impossible to account for all these scenarios and exceptions during the development and testing phases. ETL processes will fail. Pipelines will break during a deployment to production, and there will be other data issues. It is a perfect opportunity for Team A to document all these issues.

In addition to troubleshooting documentation, there must be installation and configuration guides as well. Think of EVERYTHING that will make life easy for newcomers to the system.

Documentation should be written as if you are explaining to a non-technical person. Often we try to help another person while assuming they already have a certain context of the system. That is where we create problems for ourselves and waste our own time: people will come back again and again because we neglected to explain the entire context before providing the solution.

Administering and supporting the system

I fail to recall any time in my software career when, during the design and development phase, the architects and engineers envisioned how the system might function and what the potential problems or one-percenter scenarios would be. It may have been part of the process initially, but it eventually fell behind due to time and money pressures. It is a habit to think only of the happy path.

We use the best coding techniques and practices. We use all the patterns that make the system robust. We have done everything to make the system perfect.

Then we run into problems…

Most large systems depend on a wide variety of data from different sources. In my experience, that data is most often fed in by large ETL batch processes, by a high-volume transactional system, or both! It becomes complicated to apply large-scale fixes for failures, data removal, or data integrity problems. Flexibility is gold!

These are the standard questions to ask to ascertain, from a technical point of view, whether the system can be remedied at its simplest.

  • Is there a job/process that can be run to fix data that was processed or inserted incorrectly?
  • Can these processes be run during business hours?
  • Does everyone on the team have sufficient permissions or access to the servers?
  • Do we have the ability to update records in batches?
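
To make the last question concrete, here is a minimal sketch of a batched fix in Python. It uses the standard library sqlite3 module with an in-memory database as a stand-in for a real server. The customers table, the country column, and the "country codes must be uppercase" rule are all made up for illustration. Each batch runs in its own short transaction, so the fix never holds on to the whole table at once.

```python
import sqlite3


def fix_in_batches(conn, batch_size=500):
    """Correct bad rows one small batch at a time; return how many were fixed."""
    total = 0
    while True:
        # Find the next batch of rows that still break the (hypothetical) rule.
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM customers WHERE country != UPPER(country) LIMIT ?",
            (batch_size,),
        )]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        # One short transaction per batch: commits on success, rolls back on error.
        with conn:
            conn.execute(
                f"UPDATE customers SET country = UPPER(country) "
                f"WHERE id IN ({placeholders})",
                ids,
            )
        total += len(ids)


# Demo with made-up data: three of the five rows need fixing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO customers (country) VALUES (?)",
    [("za",), ("US",), ("nl",), ("de",), ("ZA",)],
)
conn.commit()
print(fix_in_batches(conn, batch_size=2))  # prints 3
```

Because the job only selects rows that are still wrong, it can be stopped and re-run safely, which matters when someone less familiar with the system has to run it.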

If the answer is yes to all of the above, it is still not the ideal situation. Some engineers lack confidence, and if these services are restarted and fail, or the work is done at the wrong time, it can affect clients and SLAs. Not everyone has elevated permissions and access to database servers, for example. We often lack the flexibility, and by that I mean the tooling, to remedy data processing problems at a large scale. Let’s take a look at two areas that can optimize support here.

Automation

In our line of work, we have the luxury of having the skills to automate critical business processes and the mundane tasks we have to execute daily. We are good at automating releases and testing. Those are fundamentals of our systems and are necessary! We have to!

I don’t think we are any good at automation for self-healing or self-correction. Netflix’s infrastructure is too large for humans to monitor, and out of necessity they implemented intelligent systems to monitor it and apply corrections. Does this kind of intelligence always have to be born out of necessity? Why can’t it simply be baked into our good-practices philosophy? Are we afraid of losing control and losing our jobs? Or are we just mentally lazy?

The fact is that this type of automation will set you apart from the rest. Remember: energy flows where the attention goes. Having these processes in place frees engineers up to focus on innovation and revenue opportunities, or to pay back tech debt. You don’t have to be a large-scale company to achieve this.

Let third-party tools do the night shift instead of you checking your email or Teams/Slack messages every 2 hours. If they are properly configured, they can do a lot more than you can and, best of all, they never get tired! We have Lambda functions, Azure Functions, Python scripts, PowerShell, Apache NiFi, and other robust task automation tools. There are multiple options and no excuses.
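
Self-correction does not have to start big. As a rough sketch, assuming a job that sometimes fails for a reason a script can fix, the Python below retries the job, runs a remediation step between attempts, and only escalates to a human once it runs out of attempts. The names run_with_self_healing, flaky_etl_job, and the remediation callback are hypothetical, and a real setup would restart a service or clear a stale lock where this demo simply records the error.

```python
import time


def run_with_self_healing(task, remediate, max_attempts=3, base_delay=1.0,
                          sleep=time.sleep):
    """Run task(); on failure call remediate(error), back off, and try again.

    Returns the task's result, or re-raises the last error once
    max_attempts is used up so that a person still gets alerted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of attempts: escalate to a human
            remediate(error)  # e.g. restart a service, clear a stale lock
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Demo: a made-up job that fails twice before it succeeds.
calls = {"count": 0}


def flaky_etl_job():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source database unavailable")
    return "loaded"


applied_fixes = []  # stands in for a real remediation step
result = run_with_self_healing(
    flaky_etl_job,
    applied_fixes.append,
    sleep=lambda seconds: None,  # skip the real waiting in this demo
)
print(result, len(applied_fixes))  # prints: loaded 2
```

The same function could be triggered from a scheduled job or a serverless function, so nobody has to watch the inbox at 2 a.m. for a failure that fixes itself on the third try.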

Administration interfaces

What if we need to invalidate client records and remove them from the system on request? What if we need to spot-fix a couple of records with incorrect information and it is too expensive to re-run the processes or to fix them at the source? Typically, a team would run a script directly against the database. Is there a review process? Is the script optimized enough if it needs to delete thousands of records from the database?

Will this script contend for resources on the database server during business hours or after hours? Is this process secure? With some upfront planning and thought about these scenarios, engineering teams can create APIs and administration UIs to handle these one-percenters. APIs are effective when UIs aren’t available, and with proper authentication and authorization, non-technical people can use them too.

These problems are the one-percenters, but they tend to take so much time and effort to mitigate and remedy. It would be prudent to build the tooling for these difficult requests during the development phase. Every microservice or system must have some administration component for the data it produces.

These administration functions, whether UI or API endpoints, can be used by non-engineers and will typically be safe to use, saving engineers the expensive time and effort of executing these processes or scripts themselves. It creates flexibility and confidence in the system.
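
As an illustration only, here is a small Python sketch of what sits behind such an administration endpoint: a permission check, the change itself, and an audit entry. The ClientAdmin class, the "support-admin" role, and the in-memory dictionary of records are assumptions made for this example; a real version would sit behind your API framework and talk to your actual data store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class NotAuthorized(Exception):
    """Raised when the caller lacks the (hypothetical) admin role."""


@dataclass
class ClientAdmin:
    records: dict  # client_id -> record dict; stand-in for a real data store
    audit_log: list = field(default_factory=list)

    def invalidate_client(self, client_id, requested_by, roles, reason):
        """Mark a client record inactive and record who did it and why."""
        if "support-admin" not in roles:
            raise NotAuthorized(f"{requested_by} may not invalidate clients")
        record = self.records[client_id]  # raises KeyError for unknown clients
        record["active"] = False  # soft delete instead of an ad hoc DELETE script
        self.audit_log.append({
            "action": "invalidate_client",
            "client_id": client_id,
            "requested_by": requested_by,
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return record


# Demo with made-up data.
admin = ClientAdmin(records={42: {"name": "Acme", "active": True}})
admin.invalidate_client(42, "thandi", {"support-admin"}, "client request")
print(admin.records[42]["active"], len(admin.audit_log))  # prints: False 1
```

The point of the design is that the review, the permission check, and the audit trail are built once, so the person handling the request does not need database access or a hand-written script.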

Conclusion

It is important to include documentation, administration, and automation during the design and development stages. Create awareness and be diligent about it from the start. The price you pay upfront will be small compared to the huge price you will pay later.

It is not only about making life easier for you or your team. Be different and make it easier for everyone supporting the system down the line. A good system can easily last 5-10 years, and many engineers and other people will be responsible for it over that period.

I would love to hear your thoughts…
