> How can I automatically remediate the issue where corruption in LMDB causes the agent to crash?

In some cases, LMDB corruption causes cf-agent to crash. The typical fix is to remove the corrupt LMDB files causing the problem, and in some cases this can be remediated from policy. This policy (run early in the bundlesequence) uses lmdump to probe every LMDB file under the state directory. If the probe fails, the offending file and its associated lock files are deleted.

EDIT: Now supports retaining n versions of corrupt databases

  bundle agent remediate_lmdb_corruption
  #@ brief Avoid crashing agents due to corruption in embedded databases by purging files detected to have corruption
  #@ description Sometimes corruption in embedded databases can cause the agent to
  #@ crash. This policy tries to detect corruption in embedded databases by probing
  #@ them. If the probe fails, the embedded database is deleted so that the agent
  #@ will continue to function as expected. If you would like to retain backups of
  #@ the corrupt databases, define def.remediate_lmdb_corruption[keep] with the number
  #@ you would like to retain.
  #@
  #@ **Example augment:**
  #@ ```json
  #@ {
  #@   "vars": {
  #@     "remediate_lmdb_corruption": {
  #@       "keep": "3"
  #@     }
  #@   }
  #@ }
  #@ ```
  {
    vars:

        "keep_remediations"
          string => ifelse( isvariable( "def.remediate_lmdb_corruption[keep]"), $(def.remediate_lmdb_corruption[keep]),
                            "0");

        "lmdbs" slist => findfiles( "$(sys.statedir)/**.lmdb" );

        "lmdb[$(lmdbs)]"
          string => "lmdump_failed",
          if => and(fileexists( $(lmdbs) ), 
                    not( returnszero ("$(sys.workdir)/bin/lmdump -a $(lmdbs)", noshell) ));

        "corrupt" slist => getindices( lmdb );
        "permutations" slist => { "", "-lock", ".lock" };

    classes:

      "remediation_behaviour_keep" expression => isgreaterthan( $(keep_remediations), 0 );
      "remediation_behaviour_purge" expression => not( isgreaterthan( $(keep_remediations), 0 ));

    files:

      remediation_behaviour_purge::

        "$(corrupt)$(permutations)"
          delete => tidy,
          comment => "We detected a corrupt lmdb and deleted it to preserve agent functionality.",
          classes => results( "bundle", "lmdb_corruption_purge_remediation_$(corrupt)");

      remediation_behaviour_keep::

        "$(corrupt)$(permutations)"
          rename => rotate( $(keep_remediations) ),
          comment => "We detected a corrupt lmdb and rotated it out of the way to preserve agent functionality.",
          classes => results( "bundle", "lmdb_corruption_keep_remediation_$(corrupt)");

    reports:
      "DEBUG|DEBUG_$(this.bundle)"::

        "$(corrupt) failed lmdump probe. Remediation retaining $(keep_remediations) backups";
  }
Policy to remediate corruption in lmdbs that can cause agent crashes
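
The bundle only helps if it runs before the promises that would otherwise crash the agent, and it depends on the `tidy`, `rotate` and `results` bodies from the standard library. As a minimal sketch of one way to wire it in, assuming the bundle is saved as `remediate_lmdb_corruption.cf` in the inputs directory (a hypothetical file name) and that a bundle named `main` stands in for the rest of your policy, a standalone policy entry could look like this:

```cf3
body common control
{
      # stdlib.cf supplies the tidy, rotate and results bodies the bundle uses.
      # The second path is an assumed location for the bundle shown above.
      inputs => { "$(sys.libdir)/stdlib.cf",
                  "$(sys.inputdir)/remediate_lmdb_corruption.cf" };

      # Listing the remediation bundle first makes it run before everything else.
      bundlesequence => { "remediate_lmdb_corruption", "main" };
}

bundle agent main
# Placeholder for the rest of the policy (illustrative only).
{
  reports:
      "Remaining policy runs after the lmdb probe.";
}
```

If you use the Masterfiles Policy Framework, the bundlesequence is assembled by the framework, so the equivalent change would be made there instead of in a separate `body common control`.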

Here is example output from a run where performance.lmdb was corrupt and the DEBUG class was set.

  R: /var/cfengine/state/performance.lmdb failed lmdump probe. Remediation retaining 3 backups
Output from policy run