Where are we going with automation? What can we achieve in the near future with processes such as Continuous Integration and Continuous Delivery, Process Automation, Infrastructure as Code, Network as Code and AI for IT Operations? What do we need to make automation possible and to what extent are the necessary technologies already sufficiently developed? In this blog, I answer these questions using a futuristic, yet very concrete example: IT operations for railways.
Imagine a software component that transmits data from railway infrastructure and, for example, moving equipment to other components. These components convert the raw data into information, such as "train number 123 is on track section 456". That information is mission-critical, because without it train controllers cannot operate. If the software malfunctions, the trains are sent back to the stations and traffic is stopped until the problem is fixed. It can take hours before the trains are running again. For this example, let's call the moment when the component fails and stops transmitting data "time T". An automated railroad IT system would then trigger two events at time T: (1) start a recovery procedure and (2) create a ticket in the IT Service Management system. What happens next is this:
- Within two minutes of time T, the backup component is up & running and the information is again available to train control, with the warning that the system is in a recovery state.
- Within those same two minutes, the DevOps team is activated and working on a solution.
- Within two hours of time T, the DevOps team has fixed the problem and the repaired system can be tested. After elementary regression testing, the fix is confirmed within 15 minutes. The repaired system takes over the work of the backup component. After another two hours, everything has been thoroughly tested and the repaired system remains in production.
7 foundations for the future of railway IT
The above scenario takes place in the future. In that future, trains are stopped for at most two minutes during a major incident, which is more than acceptable: today, in 2023, such events would shut down traffic for hours. Most railway software control systems are not yet at the level of automation I describe in the future scenario. Why is that? Looking for an answer, I arrived at seven foundations on which the advanced system from my future scenario rests.
- 1. Test Automation
Not surprisingly, this is one of the fundamentals. The ability to automatically test software, from unit tests to integration and acceptance tests, is a must. For those who haven't gotten there yet: make this priority No. 1!
- 2. Continuous Deployment
My future scenario is only possible if changes to the software are rolled out fully automatically and go 'live' immediately. In other words, once a change in the code is approved, all steps, from pull request through testing to deployment, are automated.
- 3. Test strategies
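Reducing lead time requires different strategies. For example, there is a lightweight testing strategy and a heavy-duty one. The lightweight test suite confirms that the problem is resolved and performs basic regression testing to ensure core functions remain available, so the fix can go live quickly. The heavy-duty suite then tests everything thoroughly while the repaired system is already back in production.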
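To make the two strategies a bit more tangible, here is a minimal sketch in Python. It assumes each check is tagged with a tier; the `Check` class, the tier labels and the check names are hypothetical and only serve the illustration. The lightweight strategy picks the quick checks, while the heavy-duty strategy runs everything.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One automated check. All names here are illustrative assumptions."""
    name: str
    run: Callable[[], bool]  # returns True when the check passes
    tier: str                # "light" or "heavy" (hypothetical labels)


def select_suite(checks: List[Check], strategy: str) -> List[Check]:
    """Pick the checks that belong to the requested strategy."""
    if strategy == "light":
        # Only the quick checks: confirm the fix and basic regression.
        return [c for c in checks if c.tier == "light"]
    if strategy == "heavy":
        # The heavy-duty strategy runs every check, light ones included.
        return list(checks)
    raise ValueError(f"unknown strategy: {strategy}")


def run_suite(checks: List[Check]) -> bool:
    """A suite passes only if every check in it passes."""
    return all(c.run() for c in checks)


# Placeholder checks; real ones would call the actual test tooling.
CHECKS = [
    Check("fix_resolves_incident", lambda: True, "light"),
    Check("core_regression", lambda: True, "light"),
    Check("full_integration", lambda: True, "heavy"),
    Check("acceptance", lambda: True, "heavy"),
]
```

In the future scenario, something like `run_suite(select_suite(CHECKS, "light"))` would gate the switch back from the backup component, and the heavy-duty run would follow afterwards.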
- 4. Smart monitoring
In the example, the software component is up & running; the CPU and memory profiles are as usual. Yet the component, while still running, no longer does its job: it has stopped transmitting messages. To proceed with automation, monitoring must be set up to catch such functional errors, not just resource or availability problems.
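A functional check of this kind could look roughly like the sketch below. It assumes the monitoring system knows when the component last transmitted a message; the `ComponentStatus` fields and the 30-second silence threshold are assumptions made up for this example, not values from a real system.

```python
from dataclasses import dataclass


@dataclass
class ComponentStatus:
    """Snapshot of a component as seen by monitoring (hypothetical fields)."""
    process_alive: bool     # classic liveness: the process is running
    last_message_at: float  # timestamp in seconds of the last transmitted message


def is_functionally_healthy(status: ComponentStatus,
                            now: float,
                            max_silence_s: float = 30.0) -> bool:
    """Healthy means: running AND still transmitting messages.

    max_silence_s is an assumed threshold; a real value depends on how
    often the component is expected to send data.
    """
    if not status.process_alive:
        return False
    # The process is up, but has it been silent for too long?
    return (now - status.last_message_at) <= max_silence_s
```

A component that is alive but has been silent for longer than the threshold is reported as unhealthy, which is exactly the situation at time T in the example.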
- 5. Hyper automation: Coupling systems together
To take the next steps in automation, it is essential to link systems such as monitoring, infrastructure and IT Service Management (ITSM). This linking of systems poses a challenge to security policies, for example, policies to isolate critical network domains. Usually, ITSM and critical infrastructure domains are strictly separated. This separation can hinder automation. Organizations must adapt their architecture and policies to enable automation while ensuring the security of critical infrastructures.
In addition, "event-driven" architectures are needed. In the future scenario, the event "component fails" triggers the events "run recovery procedure" and "create urgent bug report". These events must also cross network boundaries, for example to connect the non-critical IT services domain to the critical infrastructure domain.
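The event chain from the scenario can be sketched with a very small in-process publish/subscribe mechanism. This is only an illustration under simplifying assumptions: the `EventBus` class, the event name `component_failed` and both handlers are hypothetical, and a real setup would use a message broker or gateway that is allowed to cross the network boundaries mentioned above.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> List[str]:
        # Call every handler registered for this event, in subscription order.
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


def run_recovery_procedure(payload: dict) -> str:
    # Placeholder: would start the backup component in the critical domain.
    return f"recovery started for {payload['component']}"


def create_urgent_bug_report(payload: dict) -> str:
    # Placeholder: would create a ticket in the ITSM system.
    return f"urgent ticket created for {payload['component']}"


def build_bus() -> EventBus:
    """Wire the single failure event to both follow-up actions."""
    bus = EventBus()
    bus.subscribe("component_failed", run_recovery_procedure)
    bus.subscribe("component_failed", create_urgent_bug_report)
    return bus
```

Publishing `component_failed` once triggers both follow-up actions, without the monitoring side needing to know who is listening.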
- 6. Recovery procedures
The system must be designed with recovery in mind. What are good backup components? These can be earlier versions, a bare minimum version, or a gold version known to be correct but with inferior performance. Are these "recovery procedures" good enough to take over for a while?
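One way to think about this is as an ordered list of backup candidates, where the recovery procedure takes the first one that is ready. The sketch below assumes exactly that; the `BackupCandidate` class, the readiness checks and the preference order are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class BackupCandidate:
    """A possible stand-in for the failed component (illustrative)."""
    name: str
    is_ready: Callable[[], bool]  # e.g. a quick start-up or smoke check


def choose_backup(candidates: Sequence[BackupCandidate]) -> Optional[BackupCandidate]:
    """Return the first ready candidate in preference order, or None."""
    for candidate in candidates:
        if candidate.is_ready():
            return candidate
    return None


# Assumed preference order: earlier version, then a bare minimum version,
# then a gold version that is known to be correct but slower.
CANDIDATES = [
    BackupCandidate("previous_version", lambda: False),
    BackupCandidate("bare_minimum", lambda: True),
    BackupCandidate("gold_version", lambda: True),
]
```

With the placeholder readiness checks above, `choose_backup(CANDIDATES)` skips the earlier version and selects the bare minimum version.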
- 7. Mission-critical systems in recovery mode
When the backup component works, it does not mean that the entire IT system runs as it normally does. While the required information is provided to the traffic manager, it is provided with a lower level of confidence. After all, the mission-critical IT system is running in a "recovery mode" for some time. An interesting question is what is acceptable here. The system must be safe enough to be used while minimizing the consequences of the incident.
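A simple way to carry that lower level of confidence along is to attach the system mode to every piece of information that reaches the controller. The sketch below assumes a hypothetical `PositionReport` type and warning text; it is not how any existing train control system formats its data.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    RECOVERY = "recovery"


@dataclass(frozen=True)
class PositionReport:
    """Information derived from raw data, tagged with the system mode."""
    train_number: int
    track_section: int
    mode: Mode


def render_for_controller(report: PositionReport) -> str:
    """Format a report and add a warning while the system is in recovery."""
    text = (f"train number {report.train_number} "
            f"is on track section {report.track_section}")
    if report.mode is Mode.RECOVERY:
        # The information is still delivered, but explicitly marked.
        return text + " [WARNING: system in recovery state]"
    return text
```

What the controller is allowed to do with information marked this way is the policy question raised above; the code only makes the state visible.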
A possible strategy
In my view, the above seven fundamentals are necessary to enable the advanced automation system from my future scenario. Most of these are not "rocket science". The necessary technologies are already available today. What we still need, however, are organizations and people who can apply these processes. We also need to address several key design challenges, such as the testing strategies, recovery procedures and linking automation systems mentioned above. To make big leaps, we need to address these challenges in parallel. One possible strategy in this regard could be to start the following two programs:
- Automation of test and delivery: this program focuses on test automation, continuous deployment and test strategies. It sheds light on the key steps to automate software testing and delivery.
- Linking Systems and Monitoring: in this program, you work on linking monitoring, ITSM and your favorite automation platform. It's best to start with a simple use case, because the linking will test your current architecture and security policies quite a bit. To build on automation, you will first need to create initial structures and environments. And just laying that foundation is the hardest part. But rest assured: once you have opened the door to automation, more will come quickly!
Share your views on the future, foundations, and challenges!
Writing this blog was a nice thought experiment for me. I sketched out a futuristic, yet concrete, example and took stock of what I think it would take to make the vision of the future a reality. That led to seven foundations and a strategy for taking the next automation steps. I hereby challenge you to also come up with a future scenario in which automation processes such as Continuous Integration and Continuous Delivery, Process Automation, Infrastructure as Code, Network as Code and AI for IT Operations play the leading role. Perhaps we can brainstorm together on the possible fundamentals and challenges.