Mean Time To Diagnose (MTTD) is the real bottleneck 

Mean Time To Recover (MTTR) is important, but undetected, undiagnosed errors can cause severe damage for long periods of time.

Can you fix a configuration error before you even detect it? Actually, yes. It’s called configuration auto-remediation. Think of the lifetime of a configuration error. It typically starts when someone needs to make a change in their environment. In complex environments, IT teams are constantly updating or changing software, databases, middleware and operating systems. Making these changes usually requires a corresponding configuration change.

But did that change accidentally cripple a configuration?

How would you even know?

And how soon would you know?

Minutes? Or Months?

If the configuration error caused an outage, you might know right away. But if the config error “only” damaged application performance or introduced a security vulnerability, you may be unaware for quite a while, even as the problem grows.

Many organizations measure themselves on their mean time to recover (MTTR), and that is quite important. But as the saying goes, the first step is recognizing you have a problem, and that makes Mean Time to Diagnose (MTTD) the key metric.

Two scenarios for discovering and recovering from configuration errors

Yesterday:
Common practice in 2016

MTTD

  • “I think something is wrong”
    Time until you became aware of an error (often based on app performance or even an outage)
    * Time elapsed: Days???
  • “I think it is a configuration problem”
    Time spent determining the error is configuration-based
    * Time elapsed: Hours?
  • “I think I have found the configuration error”
    Time spent detecting latent configuration error
    * Time elapsed: Hours? Days?

MTTR

  • “I think I have a fix for the configuration error”
    Time spent developing a fix
    * Time elapsed: Hours?
  • “Now I will deploy the fix”
    Time spent deploying a configuration fix
    * Time elapsed: Hours?
  • “Now I will test the fix”
    Time spent confirming the fix
    * Time elapsed: Hours?

Audit / Post-Mortem / Process Fix

  • “How could this have happened?”
    Time spent searching for and seeking to prevent the recurrence of a configuration problem’s root cause
    * Time elapsed: Days (add all person-hours including engineering and management meetings)

Today:
The configuration auto-remediation era in 2017

MTTD and MTTR

  • “Auto-remediation just automatically fixed an out-of-bounds configuration change”
    Time from the out-of-bounds change until it is detected and automatically corrected
    * Time elapsed: Seconds

Audit

  • Q: “Oh really, what change were they trying to make? And who was trying to make it?”
    Time spent searching for and seeking to prevent the recurrence of a configuration problem’s root cause
    A: “It’s all right here in the audit report”
    * Time elapsed: 60 seconds

With auto-remediation,
  • the configuration error is automatically diagnosed,
  • it is automatically corrected, and
  • an audit trail is automatically created.
Automatically Detect Configuration Errors
Automatically Correct Configuration Errors


2016: Using scripting to find and fix configuration errors

In the typical 2016 scenario, both the cause of and the fix for a configuration error are often scripts. Think of a script as having only hands, but no eyes. Once the configuration change script is deployed, it diligently makes the changes its author wrote. But without the ability to ‘see’ the effects of those changes, the script can blindly damage application performance, cause security vulnerabilities, or create a full outage.
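To make that concrete, here is a minimal sketch of a “hands, but no eyes” change script in Python. The file path, section, and setting names are invented for illustration; the point is that the script overwrites a value and exits without any bounds check, verification of the result, or audit trail.

```python
# Minimal sketch of a blind configuration-change script (illustrative only;
# the config path and setting names are hypothetical).
import configparser

CONFIG_PATH = "/etc/myapp/app.ini"  # hypothetical application config file

def apply_change(section: str, key: str, value: str) -> None:
    parser = configparser.ConfigParser()
    parser.read(CONFIG_PATH)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, value)      # blindly overwrite the setting
    with open(CONFIG_PATH, "w") as handle:
        parser.write(handle)             # no bounds check, no verification, no audit trail

if __name__ == "__main__":
    # A well-meaning tuning change that may silently cripple the application
    apply_change("pool", "max_connections", "10000")
```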

2017: Configuration auto-remediation

In the 2017 auto-remediation scenario, however, application, database, middleware and OS configurations are already centrally controlled. Compliant values and ranges for each configuration are set, managed and made visible by the configuration management software. So any attempt (even a well-meaning attempt!) to create and deploy a non-compliant or out-of-bounds change will be caught and corrected right away.

Even changes made outside of the configuration management software (e.g. by direct server access) will be detected and corrected automatically.
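As a rough illustration of the idea, and not Orca’s actual product or API, the sketch below shows one remediation pass in Python: compliant values and ranges are defined in a central policy, each managed setting is checked against its rule, any out-of-bounds value is restored to a compliant default, and an audit record is written. The policy format, file paths, and names are assumptions made for this example.

```python
# Minimal sketch of configuration auto-remediation (illustrative only; the
# policy format, file paths, and names are assumptions, not a vendor API).
import configparser
import json
import time

CONFIG_PATH = "/etc/myapp/app.ini"               # hypothetical live config
AUDIT_LOG = "/var/log/config-remediation.jsonl"  # hypothetical audit trail

# Centrally managed policy: allowed range or value set, plus a compliant default.
POLICY = {
    ("pool", "max_connections"): {"min": 10, "max": 500, "default": 200},
    ("tls", "min_version"): {"allowed": {"1.2", "1.3"}, "default": "1.2"},
}

def is_compliant(value, rule) -> bool:
    """Return True if the current value satisfies its policy rule."""
    if value is None:
        return False
    if "allowed" in rule:
        return value in rule["allowed"]
    try:
        return rule["min"] <= int(value) <= rule["max"]
    except ValueError:
        return False

def remediate_once() -> None:
    """Check every managed setting, correct drift, and record what was done."""
    parser = configparser.ConfigParser()
    parser.read(CONFIG_PATH)
    changed = False
    for (section, key), rule in POLICY.items():
        current = parser.get(section, key, fallback=None)
        if is_compliant(current, rule):
            continue
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, str(rule["default"]))   # automatic correction
        changed = True
        with open(AUDIT_LOG, "a") as log:                # automatic audit trail
            log.write(json.dumps({
                "timestamp": time.time(),
                "section": section,
                "key": key,
                "found": current,
                "restored": rule["default"],
            }) + "\n")
    if changed:
        with open(CONFIG_PATH, "w") as handle:
            parser.write(handle)
```

Run on a short schedule or triggered by a file-change watch, a pass like this is what collapses detection and recovery from hours or days into seconds.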

Learn more about how Orca’s Drift Detector will reduce configuration MTTD in your environment.