Safely reset RBD locks on failed VMs

This should correct issues both on cold start and when a VM crashes
uncleanly, either of which can leave stale RBD locks behind that
prevent the VM from starting.

This implementation has four parts:
  1. Update how IP addresses are handled, specifically by replacing all
  previous instances of "vni_ipaddr" with "vni_floatingipaddr", and then
  re-adding "vni_ipaddr" with the real data for this node's IPs. Also
  include the storage IPs, which were not included before, so each
  this_node holds its own local IPs as well as the floating IPs. This
  enables the next two steps.
  2. Modify flush_locks to take this_node as an argument, and update the
  run_command function to operate only against that node, rather than
  against the primary coordinator.
  3. Have flush_locks check each lock against the current node, to
  verify that the lock is actually held by the current node; this is the
  only way to release locks safely (see the sketch after this list).
  During fencing, we bypass this check by not passing a this_node.
  4. Have the VM start path check for VM failure/startup and execute a
  flush_locks before actually starting the VM.
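
Parts 1 through 3 combine into a node-aware lock flush. The following is
a minimal sketch of that flow, not the actual PVC implementation: the
helpers get_vm_volumes and get_rbd_locks, the this_node.local_ipaddrs
attribute, and the JSON shape of the `rbd lock list` output are all
assumptions made for illustration.

import json
import subprocess

def get_rbd_locks(pool, volume):
    # Assumed helper: list the locks on one RBD volume. Assumes a Ceph
    # version where `rbd lock list --format json` emits a list of
    # objects with "id", "locker", and "address" fields.
    ret = subprocess.run(
        ['rbd', 'lock', 'list', '--format', 'json', '{}/{}'.format(pool, volume)],
        capture_output=True, check=True
    )
    return json.loads(ret.stdout)

def flush_locks(zk_conn, logger, dom_uuid, this_node=None):
    # Hypothetical sketch of parts 2 and 3: flush stale locks from all
    # of a VM's volumes, but only those this node actually holds.
    for pool, volume in get_vm_volumes(zk_conn, dom_uuid):  # assumed helper
        for lock in get_rbd_locks(pool, volume):
            lock_ip = lock['address'].split(':')[0]
            # Part 3: verify the lock holder is one of this node's real
            # or floating IPs (available thanks to part 1). During
            # fencing, this_node is None and the check is bypassed.
            if this_node is not None and lock_ip not in this_node.local_ipaddrs:
                logger.out(
                    'Lock {} on {}/{} is not held by this node; refusing '
                    'to flush it'.format(lock['id'], pool, volume),
                    state='e'
                )
                continue
            subprocess.run(
                ['rbd', 'lock', 'remove',
                 '{}/{}'.format(pool, volume), lock['id'], lock['locker']],
                check=True
            )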
2020-12-14 14:39:51 -05:00
parent 68d87c0b99
commit 7c99a7bda7
4 changed files with 66 additions and 27 deletions


@@ -1356,9 +1356,11 @@ def collect_vm_stats(queue):
        if instance.getdom() is not None:
            try:
                if instance.getdom().state()[0] != libvirt.VIR_DOMAIN_RUNNING:
                    logger.out("VM {} has failed".format(instance.domname), state='w', prefix='vm-thread')
                    raise
            except Exception:
                # Toggle a state "change"
                logger.out("Resetting state to {} for VM {}".format(instance.getstate(), instance.domname), state='i', prefix='vm-thread')
                zkhandler.writedata(zk_conn, {'/domains/{}/state'.format(domain): instance.getstate()})
        elif instance.getnode() == this_node.name:
            memprov += instance.getmemory()
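
For part 4, a hedged sketch of how the start path could consume the
state reset shown above: when the stored state indicates anything other
than a clean running start, flush this node's stale locks before
starting the domain. The function shape, the instance attributes, and
the state value checked here are illustrative assumptions, not the
project's actual code; zkhandler is the same module used in the diff
above, with readdata assumed as the read counterpart to writedata.

def start_vm(zk_conn, logger, this_node, instance):
    # Hypothetical sketch of part 4. The state toggled by the stats
    # thread above marks a failed VM; clear any stale RBD locks held by
    # this node before asking libvirt to start the domain again.
    domain = instance.domuuid  # assumed attribute
    last_state = zkhandler.readdata(zk_conn, '/domains/{}/state'.format(domain))
    if last_state != 'start':  # assumed indicator of failure or cold start
        logger.out('Flushing stale RBD locks for VM {}'.format(instance.domname),
                   state='i', prefix='vm-thread')
        flush_locks(zk_conn, logger, domain, this_node=this_node)
    zkhandler.writedata(zk_conn, {'/domains/{}/state'.format(domain): 'start'})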