Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
Recognize the dangers and consequences: Chaos engineering lets you design an experiment and quantify how a failure affects your business, so you can understand the influence of turbulent conditions on critical applications. When companies understand what is at risk, they can make informed judgments and react proactively to prevent losses.
Incident response: Because distributed systems are so complex, there are countless ways for things to go wrong. Disaster recovery and business continuity are critical for companies in highly regulated contexts, such as the financial industry, where even a moment of outage can be costly. By conducting chaos experiments, these industries can rehearse, prepare, and put mechanisms in place for real-life situations. When an incident does occur, chaos engineering ensures teams have the right level of awareness, plans, and visibility.
Application Security & Observability: Chaos experiments help you figure out where your systems' monitoring and observability capabilities are lacking, as well as gaps in your team's ability to respond to incidents. Chaos engineering helps you identify areas for improvement and motivates you to make your systems more observable, resulting in better telemetry data.
System Reliability: Chaos engineering enables organizations to build dependable, fault-tolerant software systems while increasing the team's trust in them. The more reliable your systems are, the more confident you can be in their ability to perform as expected.
Define the steady-state hypothesis: Begin by imagining what could go wrong. Start with a failure injection and forecast what will happen when it goes live.
Confirm the steady state and perform several realistic simulations: Test your system using real-world scenarios to observe how it reacts to different stressors and events.
Collect data and monitor dashboards: Assess the system's dependability and availability, ideally using key performance indicators tied to customer success or usage. To compare the failure against the hypothesis, look at metrics such as latency and requests per second (a sketch of such a steady-state check follows this list).
Address changes and issues: After conducting an experiment, you should have a good idea of what is working and what needs to change. You can now predict what will cause an outage and precisely how the system will fail.
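To make the first and third steps concrete, a steady-state hypothesis can be written down as an alerting rule that fires if those indicators drift while the fault is live. Below is a minimal sketch in Prometheus rule syntax; the metric names and thresholds are assumptions for illustration, not part of any particular setup.

```yaml
# Sketch of a steady-state hypothesis as a Prometheus alerting rule.
# Metric names and thresholds are illustrative assumptions.
groups:
  - name: steady-state-hypothesis
    rules:
      - alert: SteadyStateViolated
        # Hypothesis: p99 latency stays under 300 ms and the service
        # keeps serving at least 10 requests per second during chaos.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[1m])) by (le)) > 0.3
          or sum(rate(http_requests_total[1m])) < 10
        for: 1m
        annotations:
          summary: "Steady state violated during chaos experiment"
```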
Understand your system's normal state: Define your system's steady state. Any chaos experiment uses the system's regular behavior as a reference point. Understanding the system when it is healthy gives you a clearer picture of the impact of faults and failures.
Use realistic bugs and failures: All experiments should be based on plausible, realistic scenarios. When a real-life failure is injected, it becomes clear which processes and technologies need to be improved.
Production-level testing: Only by running the test in a production setting can you see how disruptions influence the system. If your team has little or no experience with chaos testing, let them experiment in a development environment first; once they are ready for the production environment, test there.
Control the blast radius: A chaos test's blast radius should always be kept as small as possible. Because these tests run in a live setting, there is a chance they will affect end users.
Automate chaos: Chaos experiments can be automated to the same degree as your CI/CD pipeline. Continuous chaos allows your team to continuously improve current and future systems (a pipeline sketch follows this list).
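As a sketch of that last principle, a chaos run can be wired into a pipeline as a recurring job. The example below assumes GitHub Actions, a runner whose kubectl is already configured against the cluster, and a hypothetical manifest at chaos/fault-injection.yaml; none of these specifics are prescribed by chaos engineering itself.

```yaml
# Hypothetical scheduled CI job that injects a fault and then checks
# a health endpoint. Manifest path and URL are illustrative.
name: nightly-chaos
on:
  schedule:
    - cron: "0 2 * * *"   # every night at 02:00 UTC
jobs:
  inject-fault:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply the fault-injection manifest
        run: kubectl apply -f chaos/fault-injection.yaml   # assumed path
      - name: Verify the steady state still holds
        run: |
          sleep 120
          curl --fail --max-time 5 https://example.com/healthz
```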
CRDs for Chaos Management: For orchestrating chaos on Kubernetes, the framework should have explicitly defined CRDs. These CRDs provide standard APIs for provisioning and managing chaos in large-scale production systems, and they are the elements that make up a chaos workflow orchestration system.
Open Source: To enable broader community engagement and scrutiny, the framework must be fully open source under the Apache License 2.0. A vast number of applications are migrating to the Kubernetes platform, and only an open chaos model can thrive and gain the required adoption at that scale.
Extensible and Pluggable: The framework should integrate with the vast number of existing cloud-native applications. Essentially, it should be built as a component that can easily be plugged into an application for chaos engineering, and just as easily plugged out.
Broad Community Adoption: Chaos will be carried out against well-known infrastructure such as Kubernetes, applications such as databases, and infrastructure components such as storage and networking. These chaos experiments should be reusable, so that a large community can help identify and contribute more high-value scenarios. A chaos engineering system should therefore have a central hub or forge where open-source chaos experiments can be shared and code-based collaboration is possible.
Chaos Experiment: Chaos Experiments are the building blocks of the Litmus architecture. Users can develop the desired chaos workflow by choosing from freely available chaos experiments or by creating new ones.
Chaos Workflow: A chaos workflow is much more than a chaos experiment. It helps the user define the intended result, observe the outcome, analyze overall system behavior, and decide whether the system needs to change to improve resilience. LitmusChaos provides the infrastructure a typical development or operations team needs to create, use, and manage chaos workflows. Litmus' teaming and GitOps features considerably aid the collaborative management of chaos workflows within teams and software organizations.
Litmus WebUI: The Litmus UI provides a web user interface where users can construct and observe chaos workflows with ease. It also acts as a cross-cloud chaos control plane.
Litmus Server: The Litmus Server acts as middleware that handles API requests from the user interface and stores configurations and results in the DB. It also acts as the interface between incoming requests and the scheduling of workflows to the agent.
Litmus DB: The Litmus DB acts as a configuration store for chaos workflows and their results.
Chaos Operator: The Chaos Operator watches for the ChaosEngine CR and executes the chaos experiments mentioned in the CR. The Chaos Operator is namespace-scoped and, by default, runs in the `litmus` namespace.
CRDs: During installation, the following three CRDs are installed on the Kubernetes cluster: `chaosexperiments.litmuschaos.io`, `chaosengines.litmuschaos.io`, and `chaosresults.litmuschaos.io`.
Chaos Experiment: A Chaos Experiment is a CR, available as YAML files on the Chaos Hub.
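For instance, here is an abridged sketch of what the `pod-delete` experiment CR from the Chaos Hub looks like; the image tag and default values are illustrative, and the real manifest carries additional permissions and settings.

```yaml
# Abridged ChaosExperiment CR for pod-delete (illustrative values).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    image: litmuschaos/go-runner:2.0.0   # assumed tag
    command: ["/bin/bash"]
    args: ["-c", "./experiments -name pod-delete"]
    env:
      - name: TOTAL_CHAOS_DURATION   # how long chaos runs, in seconds
        value: "15"
      - name: CHAOS_INTERVAL         # gap between successive pod deletions
        value: "5"
      - name: FORCE                  # honor the grace period if "false"
        value: "false"
```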
Chaos Engine: The ChaosEngine CR connects experiments to applications. The user constructs the ChaosEngine YAML by specifying the application label and the experiments to run, then applies the CR.
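A minimal sketch of such a ChaosEngine, assuming the target is the hello-world deployment used later in this post, running in the default namespace, and that a service account named pod-delete-sa with the experiment's required permissions already exists:

```yaml
# Minimal ChaosEngine sketch binding the pod-delete experiment
# to the hello-world deployment. Names are assumptions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-world-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=hello-world   # label of the target deployment
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
```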
Chaos Results: The results of a ChaosExperiment are stored in the namespace-scoped ChaosResult resource. The experiment itself creates or updates it at runtime. It holds critical information such as the ChaosEngine reference, the experiment state, the experiment verdict (on completion), and important application/result attributes. It can also be used as a source of metrics.
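For illustration, an abridged sketch of what a ChaosResult could look like after a successful pod-delete run; the name follows the engine-name plus experiment-name convention, and all field values here are assumed:

```yaml
# Illustrative ChaosResult; created by the experiment at runtime.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: hello-world-chaos-pod-delete   # <engine-name>-<experiment-name>
  namespace: default
spec:
  engine: hello-world-chaos
  experiment: pod-delete
status:
  experimentStatus:
    phase: Completed
    verdict: Pass   # set on completion
    failStep: N/A
```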
Chaos Probes: Litmus probes are pluggable checks that can be defined within the ChaosEngine for any chaos experiment. The experiment pods execute these checks based on the mode in which they are defined, and their success is factored into the experiment's verdict (along with the standard built-in checks).
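For example, here is a hedged sketch of an httpProbe fragment declared under an experiment's spec in the ChaosEngine. The URL and timing values are assumptions, and note that probe timing units have varied across Litmus versions.

```yaml
# Illustrative httpProbe: the check passes while the target URL keeps
# answering 200 during chaos. URL and timings are assumptions.
probe:
  - name: check-hello-world-http
    type: httpProbe
    mode: Continuous            # evaluated repeatedly during chaos
    httpProbe/inputs:
      url: http://hello-world.default.svc.cluster.local:80
      method:
        get:
          criteria: ==          # compare the response code
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```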
Chaos Exporter: Metrics can optionally be exported to a Prometheus database. The Chaos Exporter implements the Prometheus metrics endpoint.
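If you do wire this up, the scrape configuration is ordinary Prometheus YAML; the service name and port below are assumptions based on a default in-cluster install, not values given in this post.

```yaml
# Sketch of a Prometheus scrape job for the chaos exporter.
# The target address is an assumption for a default install.
scrape_configs:
  - job_name: chaos-exporter
    metrics_path: /metrics
    static_configs:
      - targets: ["chaos-exporter.litmus.svc.cluster.local:8080"]
```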
Subscriber: The Subscriber is an agent-side component that communicates with the Litmus Server to obtain chaos workflow data and return the results.
In this demo, we'll run the `pod-delete` experiment. Essentially, we'd like to see whether our Kubernetes deployment from the last blog is resilient against accidental pod deletion. First, confirm the deployment is up using the `kubectl get deployments` command, and inspect its pods with `kubectl get pods`.
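For reference, here is a hedged sketch of what that hello-world deployment could look like; the image and replica count are assumptions, but the app=hello-world label is the one we will rely on later.

```yaml
# Illustrative hello-world deployment; image and replicas are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
  labels:
    app: hello-world
spec:
  replicas: 3                  # multiple replicas so one pod can die safely
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world       # the label the chaos experiment targets
    spec:
      containers:
        - name: hello-world
          image: nginxdemos/hello   # assumed container image
          ports:
            - containerPort: 80
```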
To install Litmus, start by creating a dedicated namespace with `kubectl create namespace litmus`. Then fetch the Helm chart using `git clone https://github.com/litmuschaos/litmus-helm && cd litmus-helm`, and install it with `helm install litmuschaos --namespace litmus ./charts/litmus-2-0-0-beta/`.
Post-installation, we can verify the components in the `litmus` namespace using the `kubectl get all -n litmus` command. Among the pods, you'll find `litmus-frontend`, `litmus-backend`, and `mongo`. These comprise the Litmus WebUI, Litmus Server, and Litmus DB respectively, as discussed earlier.

You'll also find the services `litmusportal-frontend-service`, `litmusportal-backend-service`, and `mongo-service`. These services maintain the endpoints for the pods we saw earlier.

Next come the deployments `litmusportal-frontend` and `litmusportal-backend`, which are also responsible for specifying the replicasets for the same. Finally, there is a statefulset for `mongo`, which persists the data contained by the DB even if the `mongo` pod dies and restarts.

To reach the portal, we use the `litmusportal-frontend-service`. The nodePort assigned to the `litmusportal-frontend-service` on my machine has a mapping of `9091:30628`, where `9091` is the specified `targetPort` while `30628` is the assigned `nodePort`.

One option is to port-forward the `litmusportal-frontend-service` in order to access it at a port of your choosing. For example, if I wish to access the `litmusportal-frontend-service` at port `3000`, I'd use the command `kubectl port-forward svc/litmusportal-frontend-service 3000:9091 -n litmus`. Once done, simply access the Litmus portal at `http://127.0.0.1:3000`.

Alternatively, you can directly use the `nodePort`, given you have a firewall rule allowing ingress at that port. For example, I have a `nodePort` of `30628`, hence I can directly access the Litmus portal at `http://127.0.0.1:30628`.

The default username is `admin` and the default password is `litmus`.
Once you log in, you'll be prompted to enter a new password. Once that's done, you'll find yourself on the dashboard.

From the dashboard, we can schedule a chaos workflow; note that it can run in the `litmus` namespace only. Add a description of your own choice. Next, we need to point the experiment at the `appLabel` of our hello-world application by overriding the default `app=nginx` value. To check the label of our deployment, we can use the `kubectl get deployments --show-labels` command. Our label is `app=hello-world`, so we'd simply replace `app=nginx` with `app=hello-world`.
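In the underlying engine spec, that override lands in the experiment's target definition, mirroring the ChaosEngine sketch shown earlier. A fragment, assuming hello-world runs in the default namespace:

```yaml
# Fragment of the resulting ChaosEngine spec after the override.
spec:
  appinfo:
    appns: default              # assumed namespace of hello-world
    applabel: app=hello-world   # replaces the default app=nginx
    appkind: deployment
```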