For many, security isn’t the first thing they think of when they hear about chaos engineering. Probably even fewer people would consider it a fundamental security practice akin to things like network firewall configuration, identity management, and intrusion detection.
However, modern software security layers keep growing more complex, and architectures keep becoming more modular and distributed. The resulting risk of failure has reached a point where chaos engineering is a legitimate security tool. As such, chaos engineering is likely to become not just a routine part of security management, but an essential one.
Let’s look at why chaos engineering is gaining popularity as a security management method, how it’s applied in security scenarios, and which best practices to follow when implementing it, including some common pitfalls to avoid.
The role of chaos engineering in security
Chaos engineering is a broad term for testing complex systems by deliberately injecting errors before an application encounters them in normal operations, monitoring the results, and documenting the correct course of action. The concept is often applied to operational hardware, including networks and server pools, as well as to software development and product testing.
Chaos may not sound like something a security specialist or compliance team would want to cultivate within their software systems. However, the goal of chaos engineering is to prevent chaos by identifying hidden problems and potential failures before they occur in production. And as the practice matures, chaos engineering is getting more attention in the field of application security.
By performing chaos engineering directly on the security layers, security specialists have the opportunity to expand the number of situations and attack vectors they can simulate. In addition, they can test how the relationships between each of the multiple layers and functions affect the impact of a particular failure. Ultimately, this will expose areas where layers of security do not provide an effective barrier against attacks and intrusions.
Applying chaos to security: injections and monitoring
Applying chaos engineering to security is a matter of balancing two layers. One layer injects errors; the other handles monitoring and resolution. For example, the injection layer might send test data that simulates unauthorized access attempts. The monitoring layer then looks for signals of a security breach, allowing security teams to pinpoint gaps in access control.
If those injections cause a failure or reveal a hole in existing security barriers, the monitoring process must identify the exact point and time at which the problem or breach occurred. Application infrastructure logs and monitoring data, along with the log of injected security-based errors, will also help correlate any infrastructure issues that could pose a security risk.
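The two layers described above can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: `Attempt`, `is_allowed`, `inject`, and `find_gaps` are hypothetical names, and `is_allowed` is a stand-in for the access control under test, written with a deliberate gap so the monitoring step has something to find.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    """One simulated unauthorized access attempt."""
    user: str
    resource: str
    token: Optional[str]


def is_allowed(attempt):
    """Hypothetical access check under test, with a deliberate gap:
    anything under /health skips the token check."""
    if attempt.resource.startswith("/health"):
        return True
    return attempt.token == "valid-token"


def inject(attempts, check):
    """Injection layer: replay each attempt and log the time and outcome."""
    log = []
    for attempt in attempts:
        log.append({
            "time": datetime.now(timezone.utc),
            "attempt": attempt,
            "allowed": check(attempt),
        })
    return log


def find_gaps(injection_log):
    """Monitoring layer: every injected attempt is unauthorized by
    construction, so any attempt that was allowed is a gap."""
    return [entry for entry in injection_log if entry["allowed"]]


attempts = [
    Attempt("mallory", "/admin", None),
    Attempt("mallory", "/reports", "expired-token"),
    Attempt("mallory", "/health/../admin", "forged-token"),
]
gaps = find_gaps(inject(attempts, is_allowed))
for gap in gaps:
    print(gap["time"].isoformat(), gap["attempt"].resource)
```

Because each log entry carries a timestamp and the exact injected attempt, the output points to both the time and the place where the barrier failed.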
While it is possible to apply chaos engineering to security and infrastructure separately, doing so would likely be a mistake. Security breaches can result not only from unexpected events directly related to security or threat prevention tools, but also from events elsewhere in the IT infrastructure. For example, infrastructure failures often cause systems to run in a “failure mode” that keeps functional elements working but may open a potentially unwanted bypass around certain security elements so that fixes can be applied.
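To make the failure-mode bypass concrete, here is a small sketch that simulates an outage in an authentication dependency and checks whether the gate in front of it fails open or fails closed. All names here (`AuthServiceDown`, `make_gate`, `allows_bypass`) are assumptions made for illustration and do not refer to any real library.

```python
class AuthServiceDown(Exception):
    """Raised by the simulated verifier while the outage is injected."""


def healthy_verify(token):
    return token == "valid-token"


def broken_verify(token):
    # Injected infrastructure fault: the auth backend is unreachable.
    raise AuthServiceDown("simulated outage")


def make_gate(verify, fail_open):
    """Build an access gate. `fail_open` decides what happens when the
    verifier itself fails: True lets requests through, False blocks them."""
    def gate(token):
        try:
            return verify(token)
        except AuthServiceDown:
            return fail_open
    return gate


def allows_bypass(gate):
    """Probe the gate with a token that should never be accepted."""
    return gate("not-a-valid-token")


fail_open_gate = make_gate(broken_verify, fail_open=True)
fail_closed_gate = make_gate(broken_verify, fail_open=False)

print("fail-open gate bypassed:", allows_bypass(fail_open_gate))
print("fail-closed gate bypassed:", allows_bypass(fail_closed_gate))
```

A test that only exercises the healthy verifier would never see the difference between the two gates. The gap appears only when the infrastructure fault and the security probe are injected together.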
4 tips for a chaos engineering security plan
When conducting chaos-style testing, it’s important not to fall into the trap of focusing only on common, predictable problems. Instead, include issues that are unlikely but still possible.
In fact, chaos engineering naturally requires testing for errors that would be introduced due to both human error and system failure. Since the goal is to create “chaos,” limiting it to predictable behavior contradicts the goal. As such, test injections that introduce high levels of random errors are typically the most effective.
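One way to generate random injections is sketched below. The fault and target names are made up for illustration. Seeding the random generator is an added assumption rather than something the approach requires: it keeps the plan unpredictable to the people designing the defenses while still allowing the exact same plan to be replayed later.

```python
import random

# Hypothetical fault types and targets; a real plan would list whatever
# the environment under test actually supports.
FAULTS = ["drop_packet", "expire_token", "corrupt_header", "kill_process", "skew_clock"]
TARGETS = ["gateway", "auth-service", "database", "log-shipper"]


def build_plan(seed, steps, max_faults_per_step=3):
    """Return a random fault plan; the same seed always yields the same plan."""
    rng = random.Random(seed)
    plan = []
    for step in range(steps):
        count = rng.randint(1, max_faults_per_step)
        faults = [(rng.choice(TARGETS), rng.choice(FAULTS)) for _ in range(count)]
        plan.append({"step": step, "faults": faults})
    return plan


plan = build_plan(seed=42, steps=5)
for entry in plan:
    print(entry["step"], entry["faults"])
```

Recording the seed alongside the test results means a surprising outcome can be re-run with an identical sequence of faults.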
Monitoring is the other important element of chaos engineering, especially when it comes to security validation. The sheer amount of test data, together with the number of possible combinations and interactions of events, makes it very unlikely that most errors can be reproduced on demand. This highlights the critical importance of data: if the information needed to identify and solve a problem is not collected during the test run itself, the whole process will waste a lot of time and money.
Logs and telemetry from both infrastructure and applications are an important part of meeting this requirement, as is accurate information about the injected events. Precise and synchronized timestamps are particularly important because without them there is no way to reliably document the relationships between particular causes and effects. It is the connection between chaotic events and poor results that makes chaos engineering worthwhile, and that connection is easy to lose when time records are imprecise or out of sync.
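The cause-and-effect matching described above can be sketched as a simple time-window join between the injection log and the alert log. The event names and the five-second window are assumptions chosen for illustration, and the sketch assumes both logs record timezone-aware UTC timestamps from synchronized clocks.

```python
from datetime import datetime, timedelta, timezone


def correlate(injections, alerts, window=timedelta(seconds=5)):
    """Pair each alert with every injection that happened at most
    `window` before it. Returns (injection, alert, delay_seconds) tuples."""
    pairs = []
    for alert in alerts:
        for injection in injections:
            delay = alert["time"] - injection["time"]
            if timedelta(0) <= delay <= window:
                pairs.append((injection["name"], alert["name"], delay.total_seconds()))
    return pairs


t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
injections = [
    {"name": "expire_token", "time": t0},
    {"name": "kill_process", "time": t0 + timedelta(seconds=30)},
]
alerts = [
    {"name": "unauthenticated_request_served", "time": t0 + timedelta(seconds=2)},
    {"name": "disk_full", "time": t0 + timedelta(minutes=10)},
]

matches = correlate(injections, alerts)
print(matches)
```

If one system's clock drifted by more than the window, the first alert would no longer match its cause, which is exactly the loss of connection that imprecise time records produce.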
The last key element of chaos engineering revolves around the people who are responsible for it. Security personnel cannot effectively conduct chaos engineering assessments in isolation. On their own, they are unlikely to accurately recreate the underlying system errors that trigger unexpected failure modes and open a bypass around certain security measures.
To be effective, chaos engineering requires collaboration between operations personnel and security teams. It is important to establish this collaborative model from the outset when implementing such a program, and equally important to drive it through the resulting test design, execution, and evaluation processes.