Compare commits

62 Commits

| SHA1 |
|------|
| d6ef722997 |
| 518d699c15 |
| ac3ef3d792 |
| 37c3b4ef80 |
| 3705daff43 |
| 7c99a7bda7 |
| 68d87c0b99 |
| 1af7c545b2 |
| 9a1b86bbbf |
| 5b0066da3f |
| 89c7e225a0 |
| b36ec43a2d |
| 2ac31e0a14 |
| 938d67f96b |
| f58e95e4c1 |
| 2338aa64f4 |
| e8c6df49e6 |
| c208898b34 |
| 1d5b9c33b5 |
| 0820cb3c5b |
| 0f8e5c6536 |
| 593810e53e |
| 185615e6e8 |
| 3a5955b41c |
| f06e0ea750 |
| 8ecd2c5e80 |
| 256c537159 |
| a5d495cfaf |
| ce5ee11841 |
| 8f705c9cc2 |
| 3f2c7293d1 |
| d4a28d7a58 |
| e8914eabb7 |
| e69eb93cb3 |
| 70dfcd434f |
| 0383f31086 |
| a4e5323e81 |
| 7c520ec00c |
| 9a36fedcab |
| aa075759c2 |
| 568209c9af |
| d47a2c29d4 |
| 5b92b822f1 |
| ac47fb5b58 |
| 6e9081f8c3 |
| 1125382b8d |
| 06c97eed63 |
| f6b4ce909e |
| 776a6982ff |
| 9cec6a97d1 |
| d34a996cf2 |
| 59bf375d13 |
| 57bd6babcb |
| f199875e1a |
| a1f72370d7 |
| 25fb415a2a |
| f15253210f |
| 1a0aedf01c |
| f729a54a2c |
| a38e65be47 |
| 9053edacd8 |
| beb62c9f3d |
README.md
@@ -5,7 +5,6 @@
     <br/><br/>
     <a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
     <a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
-    <a href="https://git.bonifacelabs.ca/parallelvirtualcluster/pvc/pipelines"><img alt="Pipeline Status" src="https://git.bonifacelabs.ca/parallelvirtualcluster/pvc/badges/master/pipeline.svg"/></a>
     <a href="https://parallelvirtualcluster.readthedocs.io/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
 </p>

@@ -17,10 +16,50 @@ The major goal of PVC is to be administrator friendly, providing the power of En

 ## Getting Started

-To get started with PVC, read the [Cluster Architecture document](https://parallelvirtualcluster.readthedocs.io/en/latest/architecture/cluster/), then see [Installing](https://parallelvirtualcluster.readthedocs.io/en/latest/installing) for details on setting up a set of PVC nodes, using the [PVC Ansible](https://parallelvirtualcluster.readthedocs.io/en/latest/manuals/ansible) framework to configure and bootstrap a cluster, and managing it with the [`pvc` CLI tool](https://parallelvirtualcluster.readthedocs.io/en/latest/manuals/cli) or [RESTful HTTP API](https://parallelvirtualcluster.readthedocs.io/en/latest/manuals/api). For details on the project, its motivation, and architectural details, see [the About page](https://parallelvirtualcluster.readthedocs.io/en/latest/about).
+To get started with PVC, please see the [About](https://parallelvirtualcluster.readthedocs.io/en/latest/about/) page for general information about the project, and the [Getting Started](https://parallelvirtualcluster.readthedocs.io/en/latest/getting-started/) page for details on configuring your cluster.

 ## Changelog

+#### v0.9.10
+
+* Moves OSD stats uploading to primary, eliminating reporting failures while hosts are down
+* Documentation updates
+* Significantly improves RBD locking behaviour in several situations, eliminating cold-cluster start issues and failed VM boot-ups after crashes
+* Fixes some timeout delays with fencing
+* Fixes bug in validating YAML provisioner userdata
+
 #### v0.9.9

 * Adds documentation updates
 * Removes single-element list stripping and fixes surrounding bugs
 * Adds additional fields to some API endpoints for ease of parsing by clients
 * Fixes bugs with network configuration

 #### v0.9.8

 * Adds support for cluster backup/restore
 * Moves location of `init` command in CLI to make room for the above
 * Cleans up some invalid help messages from the API

 #### v0.9.7

 * Fixes bug with provisioner system template modifications

 #### v0.9.6

 * Fixes bug with migrations

 #### v0.9.5

 * Fixes bug with line count in log follow
 * Fixes bug with disk stat output being None
 * Adds short pretty health output
 * Documentation updates

 #### v0.9.4

 * Fixes major bug in OVA parser

 #### v0.9.3

 * Fixes bugs with image & OVA upload parsing
@@ -333,14 +333,23 @@ api.add_resource(API_Logout, '/logout')

 # /initialize
 class API_Initialize(Resource):
+    @RequestParser([
+        {'name': 'yes-i-really-mean-it', 'required': True, 'helptext': "Initialization is destructive; please confirm with the argument 'yes-i-really-mean-it'."}
+    ])
     @Authenticator
-    def post(self):
+    def post(self, reqargs):
         """
         Initialize a new PVC cluster
         Note: Normally used only once during cluster bootstrap; checks for the existence of the "/primary_node" key before proceeding and returns 400 if found
         ---
         tags:
           - root
+        parameters:
+          - in: query
+            name: yes-i-really-mean-it
+            type: string
+            required: true
+            description: A confirmation string to ensure that the API consumer really means it
         responses:
           200:
             description: OK

@@ -363,6 +372,82 @@ class API_Initialize(Resource):

 api.add_resource(API_Initialize, '/initialize')


+# /backup
+class API_Backup(Resource):
+    @Authenticator
+    def get(self):
+        """
+        Back up the Zookeeper data of a cluster in JSON format
+        ---
+        tags:
+          - root
+        responses:
+          200:
+            description: OK
+            schema:
+              type: object
+              id: Cluster Data
+          400:
+            description: Bad request
+        """
+        return api_helper.backup_cluster()
+
+
+api.add_resource(API_Backup, '/backup')
+
+
+# /restore
+class API_Restore(Resource):
+    @RequestParser([
+        {'name': 'yes-i-really-mean-it', 'required': True, 'helptext': "Restore is destructive; please confirm with the argument 'yes-i-really-mean-it'."},
+        {'name': 'cluster_data', 'required': True, 'helptext': "A cluster JSON backup must be provided."}
+    ])
+    @Authenticator
+    def post(self, reqargs):
+        """
+        Restore a backup over the cluster; destroys the existing data
+        ---
+        tags:
+          - root
+        parameters:
+          - in: query
+            name: yes-i-really-mean-it
+            type: string
+            required: true
+            description: A confirmation string to ensure that the API consumer really means it
+          - in: query
+            name: cluster_data
+            type: string
+            required: true
+            description: The raw JSON cluster backup data
+        responses:
+          200:
+            description: OK
+            schema:
+              type: object
+              id: Message
+          400:
+            description: Bad request
+            schema:
+              type: object
+              id: Message
+          500:
+            description: Restore error or code failure
+            schema:
+              type: object
+              id: Message
+        """
+        try:
+            cluster_data = reqargs.get('cluster_data')
+        except Exception as e:
+            return {"message": "Failed to load JSON backup: {}.".format(e)}, 400
+
+        return api_helper.restore_cluster(cluster_data)
+
+
+api.add_resource(API_Restore, '/restore')
+
+
 # /status
 class API_Status(Resource):
     @Authenticator
@@ -443,7 +528,7 @@ class API_Status(Resource):
         return api_helper.cluster_status()

     @RequestParser([
-        {'name': 'state', 'choices': ('true', 'false'), 'required': True, 'helpmsg': "A valid state must be specified."}
+        {'name': 'state', 'choices': ('true', 'false'), 'required': True, 'helptext': "A valid state must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):
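Most of the hunks below are the same mechanical rename of the `helpmsg` key to `helptext` inside `@RequestParser` argument specifications. As a hedged illustration of why the key name matters, here is a minimal sketch of a decorator with this shape; it is not PVC's actual implementation, only an assumption about the pattern. It validates the declared arguments against the Flask request and passes them to the handler as `reqargs`, with `helptext` being the message a client receives on a failed validation:

```python
# Hypothetical sketch of a RequestParser-style decorator; not PVC's real code.
from functools import wraps

import flask


def RequestParser(parser_arguments):
    def decorator(function):
        @wraps(function)
        def wrapped_function(*args, **kwargs):
            parsed_args = {}
            for argument in parser_arguments:
                value = flask.request.values.get(argument['name'])
                # A missing required argument is rejected with its helptext
                if argument.get('required', False) and value is None:
                    return {'message': argument.get('helptext', 'Bad request.')}, 400
                # A value outside the declared choices is rejected the same way
                if value is not None and argument.get('choices') and value not in argument['choices']:
                    return {'message': argument.get('helptext', 'Bad request.')}, 400
                parsed_args[argument['name']] = value
            kwargs['reqargs'] = parsed_args
            return function(*args, **kwargs)
        return wrapped_function
    return decorator
```

Under that reading, a spec still carrying the old `helpmsg` key would validate identically but return a generic message rather than its intended help string, which matches the v0.9.8 changelog item about cleaning up invalid help messages.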
@@ -961,6 +1046,9 @@ class API_VM_Root(Resource):
             source:
               type: string
               description: The parent network bridge on the node
+            vni:
+              type: integer
+              description: The VNI (PVC network) of the network bridge
             model:
               type: string
               description: The virtual network device model

@@ -1886,7 +1974,7 @@ class API_Network_Root(Resource):
               id: Message
         """
         if reqargs.get('name_servers', None):
-            name_servers = reqargs.get('name_servers', None).split(',')
+            name_servers = ','.join(reqargs.get('name_servers', None))
         else:
             name_servers = ''
         return api_helper.net_add(

@@ -2013,7 +2101,7 @@ class API_Network_Element(Resource):
               id: Message
         """
         if reqargs.get('name_servers', None):
-            name_servers = reqargs.get('name_servers', None).split(',')
+            name_servers = ','.join(reqargs.get('name_servers', None))
         else:
             name_servers = ''
         return api_helper.net_add(

@@ -2110,7 +2198,7 @@ class API_Network_Element(Resource):
               id: Message
         """
         if reqargs.get('name_servers', None):
-            name_servers = reqargs.get('name_servers', None).split(',')
+            name_servers = ','.join(reqargs.get('name_servers', None))
         else:
             name_servers = ''
         return api_helper.net_modify(

@@ -2412,7 +2500,7 @@ api.add_resource(API_Network_Lease_Element, '/network/<vni>/lease/<mac>')
 class API_Network_ACL_Root(Resource):
     @RequestParser([
         {'name': 'limit'},
-        {'name': 'direction', 'choices': ('in', 'out'), 'helpmsg': "A valid direction must be specified."}
+        {'name': 'direction', 'choices': ('in', 'out'), 'helptext': "A valid direction must be specified."}
     ])
     @Authenticator
     def get(self, vni, reqargs):

@@ -2474,9 +2562,9 @@ class API_Network_ACL_Root(Resource):
         )

     @RequestParser([
-        {'name': 'description', 'required': True, 'helpmsg': "A whitespace-free description must be specified."},
-        {'name': 'rule', 'required': True, 'helpmsg': "A rule must be specified."},
-        {'name': 'direction', 'choices': ('in', 'out'), 'helpmsg': "A valid direction must be specified."},
+        {'name': 'description', 'required': True, 'helptext': "A whitespace-free description must be specified."},
+        {'name': 'rule', 'required': True, 'helptext': "A rule must be specified."},
+        {'name': 'direction', 'choices': ('in', 'out'), 'helptext': "A valid direction must be specified."},
         {'name': 'order'}
     ])
     @Authenticator

@@ -2566,8 +2654,8 @@ class API_Network_ACL_Element(Resource):
         )

     @RequestParser([
-        {'name': 'rule', 'required': True, 'helpmsg': "A rule must be specified."},
-        {'name': 'direction', 'choices': ('in', 'out'), 'helpmsg': "A valid direction must be specified."},
+        {'name': 'rule', 'required': True, 'helptext': "A rule must be specified."},
+        {'name': 'direction', 'choices': ('in', 'out'), 'helptext': "A valid direction must be specified."},
         {'name': 'order'}
     ])
     @Authenticator

@@ -2858,7 +2946,7 @@ class API_Storage_Ceph_Benchmark(Resource):
         return api_benchmark.list_benchmarks(reqargs.get('job', None))

     @RequestParser([
-        {'name': 'pool', 'required': True, 'helpmsg': "A valid pool must be specified."},
+        {'name': 'pool', 'required': True, 'helptext': "A valid pool must be specified."},
     ])
     @Authenticator
     def post(self, reqargs):

@@ -2897,8 +2985,8 @@ api.add_resource(API_Storage_Ceph_Benchmark, '/storage/ceph/benchmark')

 # /storage/ceph/option
 class API_Storage_Ceph_Option(Resource):
     @RequestParser([
-        {'name': 'option', 'required': True, 'helpmsg': "A valid option must be specified."},
-        {'name': 'action', 'required': True, 'choices': ('set', 'unset'), 'helpmsg': "A valid action must be specified."},
+        {'name': 'option', 'required': True, 'helptext': "A valid option must be specified."},
+        {'name': 'action', 'required': True, 'choices': ('set', 'unset'), 'helptext': "A valid action must be specified."},
     ])
     @Authenticator
     def post(self, reqargs):
@@ -3039,9 +3127,9 @@ class API_Storage_Ceph_OSD_Root(Resource):
         )

     @RequestParser([
-        {'name': 'node', 'required': True, 'helpmsg': "A valid node must be specified."},
-        {'name': 'device', 'required': True, 'helpmsg': "A valid device must be specified."},
-        {'name': 'weight', 'required': True, 'helpmsg': "An OSD weight must be specified."},
+        {'name': 'node', 'required': True, 'helptext': "A valid node must be specified."},
+        {'name': 'device', 'required': True, 'helptext': "A valid device must be specified."},
+        {'name': 'weight', 'required': True, 'helptext': "An OSD weight must be specified."},
     ])
     @Authenticator
     def post(self, reqargs):

@@ -3109,7 +3197,7 @@ class API_Storage_Ceph_OSD_Element(Resource):
         )

     @RequestParser([
-        {'name': 'yes-i-really-mean-it', 'required': True, 'helpmsg': "Please confirm that 'yes-i-really-mean-it'."}
+        {'name': 'yes-i-really-mean-it', 'required': True, 'helptext': "Please confirm that 'yes-i-really-mean-it'."}
     ])
     @Authenticator
     def delete(self, osdid, reqargs):

@@ -3175,7 +3263,7 @@ class API_Storage_Ceph_OSD_State(Resource):
         )

     @RequestParser([
-        {'name': 'state', 'choices': ('in', 'out'), 'required': True, 'helpmsg': "A valid state must be specified."},
+        {'name': 'state', 'choices': ('in', 'out'), 'required': True, 'helptext': "A valid state must be specified."},
     ])
     @Authenticator
     def post(self, osdid, reqargs):

@@ -3231,6 +3319,9 @@ class API_Storage_Ceph_Pool_Root(Resource):
             name:
               type: string
               description: The name of the pool
+            volume_count:
+              type: integer
+              description: The number of volumes in the pool
             stats:
               type: object
               properties:

@@ -3295,9 +3386,9 @@ class API_Storage_Ceph_Pool_Root(Resource):
         )

     @RequestParser([
-        {'name': 'pool', 'required': True, 'helpmsg': "A pool name must be specified."},
-        {'name': 'pgs', 'required': True, 'helpmsg': "A placement group count must be specified."},
-        {'name': 'replcfg', 'required': True, 'helpmsg': "A valid replication configuration must be specified."}
+        {'name': 'pool', 'required': True, 'helptext': "A pool name must be specified."},
+        {'name': 'pgs', 'required': True, 'helptext': "A placement group count must be specified."},
+        {'name': 'replcfg', 'required': True, 'helptext': "A valid replication configuration must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -3370,8 +3461,8 @@ class API_Storage_Ceph_Pool_Element(Resource):
         )

     @RequestParser([
-        {'name': 'pgs', 'required': True, 'helpmsg': "A placement group count must be specified."},
-        {'name': 'replcfg', 'required': True, 'helpmsg': "A valid replication configuration must be specified."}
+        {'name': 'pgs', 'required': True, 'helptext': "A placement group count must be specified."},
+        {'name': 'replcfg', 'required': True, 'helptext': "A valid replication configuration must be specified."}
     ])
     @Authenticator
     def post(self, pool, reqargs):

@@ -3415,7 +3506,7 @@ class API_Storage_Ceph_Pool_Element(Resource):
         )

     @RequestParser([
-        {'name': 'yes-i-really-mean-it', 'required': True, 'helpmsg': "Please confirm that 'yes-i-really-mean-it'."}
+        {'name': 'yes-i-really-mean-it', 'required': True, 'helptext': "Please confirm that 'yes-i-really-mean-it'."}
     ])
     @Authenticator
     def delete(self, pool, reqargs):

@@ -3559,9 +3650,9 @@ class API_Storage_Ceph_Volume_Root(Resource):
         )

     @RequestParser([
-        {'name': 'volume', 'required': True, 'helpmsg': "A volume name must be specified."},
-        {'name': 'pool', 'required': True, 'helpmsg': "A valid pool name must be specified."},
-        {'name': 'size', 'required': True, 'helpmsg': "A volume size in bytes (or with k/M/G/T suffix) must be specified."}
+        {'name': 'volume', 'required': True, 'helptext': "A volume name must be specified."},
+        {'name': 'pool', 'required': True, 'helptext': "A valid pool name must be specified."},
+        {'name': 'size', 'required': True, 'helptext': "A volume size in bytes (or with k/M/G/T suffix) must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -3635,7 +3726,7 @@ class API_Storage_Ceph_Volume_Element(Resource):
         )

     @RequestParser([
-        {'name': 'size', 'required': True, 'helpmsg': "A volume size in bytes (or with k/M/G/T suffix) must be specified."}
+        {'name': 'size', 'required': True, 'helptext': "A volume size in bytes (or with k/M/G/T suffix) must be specified."}
     ])
     @Authenticator
     def post(self, pool, volume, reqargs):

@@ -3761,7 +3852,7 @@ api.add_resource(API_Storage_Ceph_Volume_Element, '/storage/ceph/volume/<pool>/<

 # /storage/ceph/volume/<pool>/<volume>/clone
 class API_Storage_Ceph_Volume_Element_Clone(Resource):
     @RequestParser([
-        {'name': 'new_volume', 'required': True, 'helpmsg': "A new volume name must be specified."}
+        {'name': 'new_volume', 'required': True, 'helptext': "A new volume name must be specified."}
     ])
     @Authenticator
     def post(self, pool, volume, reqargs):
@@ -3806,7 +3897,7 @@ api.add_resource(API_Storage_Ceph_Volume_Element_Clone, '/storage/ceph/volume/<p

 # /storage/ceph/volume/<pool>/<volume>/upload
 class API_Storage_Ceph_Volume_Element_Upload(Resource):
     @RequestParser([
-        {'name': 'image_format', 'required': True, 'location': ['args'], 'helpmsg': "A source image format must be specified."}
+        {'name': 'image_format', 'required': True, 'location': ['args'], 'helptext': "A source image format must be specified."}
     ])
     @Authenticator
     def post(self, pool, volume, reqargs):

@@ -3916,9 +4007,9 @@ class API_Storage_Ceph_Snapshot_Root(Resource):
         )

     @RequestParser([
-        {'name': 'snapshot', 'required': True, 'helpmsg': "A snapshot name must be specified."},
-        {'name': 'volume', 'required': True, 'helpmsg': "A volume name must be specified."},
-        {'name': 'pool', 'required': True, 'helpmsg': "A pool name must be specified."}
+        {'name': 'snapshot', 'required': True, 'helptext': "A snapshot name must be specified."},
+        {'name': 'volume', 'required': True, 'helptext': "A volume name must be specified."},
+        {'name': 'pool', 'required': True, 'helptext': "A pool name must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -4039,7 +4130,7 @@ class API_Storage_Ceph_Snapshot_Element(Resource):
         )

     @RequestParser([
-        {'name': 'new_name', 'required': True, 'helpmsg': "A new name must be specified."}
+        {'name': 'new_name', 'required': True, 'helptext': "A new name must be specified."}
     ])
     @Authenticator
     def put(self, pool, volume, snapshot, reqargs):

@@ -4243,11 +4334,11 @@ class API_Provisioner_Template_System_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A name must be specified."},
-        {'name': 'vcpus', 'required': True, 'helpmsg': "A vcpus value must be specified."},
-        {'name': 'vram', 'required': True, 'helpmsg': "A vram value in MB must be specified."},
-        {'name': 'serial', 'required': True, 'helpmsg': "A serial value must be specified."},
-        {'name': 'vnc', 'required': True, 'helpmsg': "A vnc value must be specified."},
+        {'name': 'name', 'required': True, 'helptext': "A name must be specified."},
+        {'name': 'vcpus', 'required': True, 'helptext': "A vcpus value must be specified."},
+        {'name': 'vram', 'required': True, 'helptext': "A vram value in MB must be specified."},
+        {'name': 'serial', 'required': True, 'helptext': "A serial value must be specified."},
+        {'name': 'vnc', 'required': True, 'helptext': "A vnc value must be specified."},
         {'name': 'vnc_bind'},
         {'name': 'node_limit'},
         {'name': 'node_selector'},

@@ -4392,10 +4483,10 @@ class API_Provisioner_Template_System_Element(Resource):
         )

     @RequestParser([
-        {'name': 'vcpus', 'required': True, 'helpmsg': "A vcpus value must be specified."},
-        {'name': 'vram', 'required': True, 'helpmsg': "A vram value in MB must be specified."},
-        {'name': 'serial', 'required': True, 'helpmsg': "A serial value must be specified."},
-        {'name': 'vnc', 'required': True, 'helpmsg': "A vnc value must be specified."},
+        {'name': 'vcpus', 'required': True, 'helptext': "A vcpus value must be specified."},
+        {'name': 'vram', 'required': True, 'helptext': "A vram value in MB must be specified."},
+        {'name': 'serial', 'required': True, 'helptext': "A serial value must be specified."},
+        {'name': 'vnc', 'required': True, 'helptext': "A vnc value must be specified."},
         {'name': 'vnc_bind'},
         {'name': 'node_limit'},
         {'name': 'node_selector'},

@@ -4674,7 +4765,7 @@ class API_Provisioner_Template_Network_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A template name must be specified."},
+        {'name': 'name', 'required': True, 'helptext': "A template name must be specified."},
         {'name': 'mac_template'}
     ])
     @Authenticator

@@ -4833,7 +4924,7 @@ class API_Provisioner_Template_Network_Net_Root(Resource):
         return {'message': 'Template not found.'}, 404

     @RequestParser([
-        {'name': 'vni', 'required': True, 'helpmsg': "A valid VNI must be specified."}
+        {'name': 'vni', 'required': True, 'helptext': "A valid VNI must be specified."}
     ])
     @Authenticator
     def post(self, template, reqargs):

@@ -5026,7 +5117,7 @@ class API_Provisioner_Template_Storage_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A template name must be specified."}
+        {'name': 'name', 'required': True, 'helptext': "A template name must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -5171,8 +5262,8 @@ class API_Provisioner_Template_Storage_Disk_Root(Resource):
         return {'message': 'Template not found.'}, 404

     @RequestParser([
-        {'name': 'disk_id', 'required': True, 'helpmsg': "A disk identifier in sdX or vdX format must be specified."},
-        {'name': 'pool', 'required': True, 'helpmsg': "A storage pool must be specified."},
+        {'name': 'disk_id', 'required': True, 'helptext': "A disk identifier in sdX or vdX format must be specified."},
+        {'name': 'pool', 'required': True, 'helptext': "A storage pool must be specified."},
         {'name': 'source_volume'},
         {'name': 'disk_size'},
         {'name': 'filesystem'},
@@ -5279,7 +5370,7 @@ class API_Provisioner_Template_Storage_Disk_Element(Resource):
         abort(404)

     @RequestParser([
-        {'name': 'pool', 'required': True, 'helpmsg': "A storage pool must be specified."},
+        {'name': 'pool', 'required': True, 'helptext': "A storage pool must be specified."},
         {'name': 'source_volume'},
         {'name': 'disk_size'},
         {'name': 'filesystem'},

@@ -5421,8 +5512,8 @@ class API_Provisioner_Userdata_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A name must be specified."},
-        {'name': 'data', 'required': True, 'helpmsg': "A userdata document must be specified."}
+        {'name': 'name', 'required': True, 'helptext': "A name must be specified."},
+        {'name': 'data', 'required': True, 'helptext': "A userdata document must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -5489,7 +5580,7 @@ class API_Provisioner_Userdata_Element(Resource):
         )

     @RequestParser([
-        {'name': 'data', 'required': True, 'helpmsg': "A userdata document must be specified."}
+        {'name': 'data', 'required': True, 'helptext': "A userdata document must be specified."}
     ])
     @Authenticator
     def post(self, userdata, reqargs):

@@ -5522,7 +5613,7 @@ class API_Provisioner_Userdata_Element(Resource):
         )

     @RequestParser([
-        {'name': 'data', 'required': True, 'helpmsg': "A userdata document must be specified."}
+        {'name': 'data', 'required': True, 'helptext': "A userdata document must be specified."}
     ])
     @Authenticator
     def put(self, userdata, reqargs):

@@ -5626,8 +5717,8 @@ class API_Provisioner_Script_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A script name must be specified."},
-        {'name': 'data', 'required': True, 'helpmsg': "A script document must be specified."}
+        {'name': 'name', 'required': True, 'helptext': "A script name must be specified."},
+        {'name': 'data', 'required': True, 'helptext': "A script document must be specified."}
     ])
     @Authenticator
     def post(self, reqargs):

@@ -5694,7 +5785,7 @@ class API_Provisioner_Script_Element(Resource):
         )

     @RequestParser([
-        {'name': 'data', 'required': True, 'helpmsg': "A script document must be specified."}
+        {'name': 'data', 'required': True, 'helptext': "A script document must be specified."}
     ])
     @Authenticator
     def post(self, script, reqargs):

@@ -5727,7 +5818,7 @@ class API_Provisioner_Script_Element(Resource):
         )

     @RequestParser([
-        {'name': 'data', 'required': True, 'helpmsg': "A script document must be specified."}
+        {'name': 'data', 'required': True, 'helptext': "A script document must be specified."}
     ])
     @Authenticator
     def put(self, script, reqargs):

@@ -5849,9 +5940,9 @@ class API_Provisioner_OVA_Root(Resource):
         )

     @RequestParser([
-        {'name': 'pool', 'required': True, 'location': ['args'], 'helpmsg': "A storage pool must be specified."},
-        {'name': 'name', 'required': True, 'location': ['args'], 'helpmsg': "A VM name must be specified."},
-        {'name': 'ova_size', 'required': True, 'location': ['args'], 'helpmsg': "An OVA size must be specified."},
+        {'name': 'pool', 'required': True, 'location': ['args'], 'helptext': "A storage pool must be specified."},
+        {'name': 'name', 'required': True, 'location': ['args'], 'helptext': "A VM name must be specified."},
+        {'name': 'ova_size', 'required': True, 'location': ['args'], 'helptext': "An OVA size must be specified."},
     ])
     @Authenticator
     def post(self, reqargs):

@@ -5926,8 +6017,8 @@ class API_Provisioner_OVA_Element(Resource):
         )

     @RequestParser([
-        {'name': 'pool', 'required': True, 'location': ['args'], 'helpmsg': "A storage pool must be specified."},
-        {'name': 'ova_size', 'required': True, 'location': ['args'], 'helpmsg': "An OVA size must be specified."},
+        {'name': 'pool', 'required': True, 'location': ['args'], 'helptext': "A storage pool must be specified."},
+        {'name': 'ova_size', 'required': True, 'location': ['args'], 'helptext': "An OVA size must be specified."},
     ])
     @Authenticator
     def post(self, ova, reqargs):

@@ -6056,8 +6147,8 @@ class API_Provisioner_Profile_Root(Resource):
         )

     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A profile name must be specified."},
-        {'name': 'profile_type', 'required': True, 'helpmsg': "A profile type must be specified."},
+        {'name': 'name', 'required': True, 'helptext': "A profile name must be specified."},
+        {'name': 'profile_type', 'required': True, 'helptext': "A profile type must be specified."},
         {'name': 'system_template'},
         {'name': 'network_template'},
         {'name': 'storage_template'},

@@ -6175,7 +6266,7 @@ class API_Provisioner_Profile_Element(Resource):
         )

     @RequestParser([
-        {'name': 'profile_type', 'required': True, 'helpmsg': "A profile type must be specified."},
+        {'name': 'profile_type', 'required': True, 'helptext': "A profile type must be specified."},
         {'name': 'system_template'},
         {'name': 'network_template'},
         {'name': 'storage_template'},

@@ -6357,8 +6448,8 @@ api.add_resource(API_Provisioner_Profile_Element, '/provisioner/profile/<profile

 # /provisioner/create
 class API_Provisioner_Create_Root(Resource):
     @RequestParser([
-        {'name': 'name', 'required': True, 'helpmsg': "A VM name must be specified."},
-        {'name': 'profile', 'required': True, 'helpmsg': "A profile name must be specified."},
+        {'name': 'name', 'required': True, 'helptext': "A VM name must be specified."},
+        {'name': 'profile', 'required': True, 'helptext': "A profile name must be specified."},
         {'name': 'define_vm'},
         {'name': 'start_vm'},
         {'name': 'arg', 'action': 'append'}
@@ -21,6 +21,7 @@
 ###############################################################################

 import flask
+import json
 import lxml.etree as etree

 from distutils.util import strtobool as dustrtobool

@@ -49,7 +50,7 @@ def strtobool(stringv):


 #
-# Initialization function
+# Cluster base functions
 #
 def initialize_cluster():
     # Open a Zookeeper connection

@@ -86,6 +87,66 @@ def initialize_cluster():
     return True


+def backup_cluster():
+    # Open a zookeeper connection
+    zk_conn = pvc_common.startZKConnection(config['coordinators'])
+
+    # Dictionary of values to come
+    cluster_data = dict()
+
+    def get_data(path):
+        data_raw = zk_conn.get(path)
+        if data_raw:
+            data = data_raw[0].decode('utf8')
+            children = zk_conn.get_children(path)
+
+            cluster_data[path] = data
+
+            if children:
+                if path == '/':
+                    child_prefix = '/'
+                else:
+                    child_prefix = path + '/'
+
+                for child in children:
+                    if child_prefix + child == '/zookeeper':
+                        # We must skip the built-in /zookeeper tree
+                        continue
+                    get_data(child_prefix + child)
+
+    get_data('/')
+
+    return cluster_data, 200
+
+
+def restore_cluster(cluster_data_raw):
+    # Open a zookeeper connection
+    zk_conn = pvc_common.startZKConnection(config['coordinators'])
+
+    # Open a single transaction (restore is atomic)
+    zk_transaction = zk_conn.transaction()
+
+    try:
+        cluster_data = json.loads(cluster_data_raw)
+    except Exception as e:
+        return {"message": "Failed to parse JSON data: {}.".format(e)}, 400
+
+    for key in cluster_data:
+        data = cluster_data[key]
+
+        if zk_conn.exists(key):
+            zk_transaction.set_data(key, str(data).encode('utf8'))
+        else:
+            zk_transaction.create(key, str(data).encode('utf8'))
+
+    try:
+        zk_transaction.commit()
+        return {'message': 'Restore completed successfully.'}, 200
+    except Exception as e:
+        raise
+        return {'message': 'Restore failed: {}.'.format(e)}, 500
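These two helpers back the new `/backup` and `/restore` endpoints added above: `backup_cluster()` walks the entire Zookeeper tree (skipping the built-in `/zookeeper` subtree) into a flat `{path: value}` dictionary, and `restore_cluster()` replays such a dictionary into Zookeeper inside a single transaction. As a hedged illustration only, a round trip against the endpoints might look like the following; the host, port, and lack of authentication headers here are assumptions, not values taken from this diff:

```python
# Hypothetical backup/restore round trip; adjust the base URL and add any
# required auth headers for a real deployment.
import json

import requests

API = 'http://pvcapi.local:7370/api/v1'  # assumed address, not from the diff

# GET /backup returns the whole Zookeeper tree as a flat {path: value} object
backup_data = requests.get('{}/backup'.format(API)).json()
with open('pvc-backup.json', 'w') as fh:
    json.dump(backup_data, fh)

# POST /restore overwrites the cluster from that object; both arguments are
# required, and the confirmation string guards against accidental wipes
with open('pvc-backup.json', 'r') as fh:
    cluster_data = fh.read()
result = requests.post(
    '{}/restore'.format(API),
    params={'yes-i-really-mean-it': 'yes'},
    data={'cluster_data': cluster_data},
)
print(result.status_code, result.json().get('message', ''))
```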

 #
 # Cluster functions
 #

@@ -144,10 +205,6 @@ def node_list(limit=None, daemon_state=None, coordinator_state=None, domain_stat
             'message': retdata
         }

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     return retdata, retcode
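This removal, repeated across all the list functions below, is the "Removes single-element list stripping" item from the v0.9.9 changelog. The problem it fixes is a shape inconsistency, which the following illustrative snippet (not code from the diff) makes concrete:

```python
# With the stripping in place, a caller got a dict back for exactly one match
# but a list for zero or many matches, so every consumer needed shape-sniffing.
retdata = [{'name': 'hv1', 'daemon_state': 'run'}]
if isinstance(retdata, list) and len(retdata) == 1:
    retdata = retdata[0]  # now a dict, not a list

# Client code must then re-wrap before it can iterate uniformly:
nodes = retdata if isinstance(retdata, list) else [retdata]
for node in nodes:
    print(node['name'])
```

With the stripping gone, the API always returns a list and clients can iterate without the re-wrapping branch.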
@@ -333,10 +390,6 @@ def vm_state(vm):
     retflag, retdata = pvc_vm.get_list(zk_conn, None, None, vm, is_fuzzy=False)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -366,10 +419,6 @@ def vm_node(vm):
     retflag, retdata = pvc_vm.get_list(zk_conn, None, None, vm, is_fuzzy=False)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -429,10 +478,6 @@ def vm_list(node=None, state=None, limit=None, is_fuzzy=True):
     retflag, retdata = pvc_vm.get_list(zk_conn, node, state, limit, is_fuzzy)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -484,10 +529,6 @@ def get_vm_meta(vm):
     retflag, retdata = pvc_vm.get_list(zk_conn, None, None, vm, is_fuzzy=False)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -759,11 +800,7 @@ def vm_flush_locks(vm):
     retflag, retdata = pvc_vm.get_list(zk_conn, None, None, vm, is_fuzzy=False)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
-    if retdata['state'] not in ['stop', 'disable']:
+    if retdata[0].get('state') not in ['stop', 'disable']:
         return {"message": "VM must be stopped to flush locks"}, 400

     zk_conn = pvc_common.startZKConnection(config['coordinators'])

@@ -792,10 +829,6 @@ def net_list(limit=None, is_fuzzy=True):
     retflag, retdata = pvc_network.get_list(zk_conn, limit, is_fuzzy)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -968,10 +1001,6 @@ def net_acl_list(network, limit=None, direction=None, is_fuzzy=True):
             'message': retdata
         }

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     return retdata, retcode

@@ -1220,10 +1249,6 @@ def ceph_pool_list(limit=None, is_fuzzy=True):
     retflag, retdata = pvc_ceph.get_list_pool(zk_conn, limit, is_fuzzy)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -1287,10 +1312,6 @@ def ceph_volume_list(pool=None, limit=None, is_fuzzy=True):
     retflag, retdata = pvc_ceph.get_list_volume(zk_conn, pool, limit, is_fuzzy)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200

@@ -1562,10 +1583,6 @@ def ceph_volume_snapshot_list(pool=None, volume=None, limit=None, is_fuzzy=True)
     retflag, retdata = pvc_ceph.get_list_snapshot(zk_conn, pool, volume, limit, is_fuzzy)
     pvc_common.stopZKConnection(zk_conn)

-    # If this is a single element, strip it out of the list
-    if isinstance(retdata, list) and len(retdata) == 1:
-        retdata = retdata[0]
-
     if retflag:
         if retdata:
             retcode = 200
@@ -316,6 +316,8 @@ def upload_ova(pool, name, ova_size):
     # Open the temporary blockdev and seek to byte 0
     blk_file = open(temp_blockdev, 'wb')
+    blk_file.seek(0)
     # Write the contents of vmdk_file into blk_file
     blk_file.write(vmdk_file.read())
     # Close blk_file (and flush the buffers)
     blk_file.close()
     # Close vmdk_file
@@ -164,7 +164,16 @@ def ceph_osd_info(config, osd):
     response = call_api(config, 'get', '/storage/ceph/osd/{osd}'.format(osd=osd))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "OSD not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')
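The same ten-line block recurs in every `*_info` function across these client modules, adapted only in its not-found message: since list endpoints now always return lists, an exact lookup must yield exactly one element, and anything else counts as not found. A hypothetical helper (not present in the codebase) capturing the pattern:

```python
# Sketch of the repeated lookup pattern factored out; illustrative only,
# assuming a requests-style response object with a .json() method.
def single_result(response, not_found_message):
    body = response.json()
    if isinstance(body, list):
        # An exact lookup should yield exactly one element; anything else
        # means the query failed to pin down a single object
        if len(body) != 1:
            return False, not_found_message
        return True, body[0]
    # This shouldn't happen, but is here just in case
    return True, body
```

Each 200 branch below would then collapse to `return single_result(response, "Pool not found.")` and so on; the diff instead repeats the block verbatim, which keeps each client module self-contained.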
@@ -300,9 +309,6 @@ def format_list_osd(osd_list):
     # Handle empty list
     if not osd_list:
         osd_list = list()
-    # Handle single-item list
-    if not isinstance(osd_list, list):
-        osd_list = [osd_list]

     osd_list_output = []

@@ -555,7 +561,16 @@ def ceph_pool_info(config, pool):
     response = call_api(config, 'get', '/storage/ceph/pool/{pool}'.format(pool=pool))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Pool not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -628,9 +643,6 @@ def format_list_pool(pool_list):
     # Handle empty list
     if not pool_list:
         pool_list = list()
-    # Handle single-entry list
-    if not isinstance(pool_list, list):
-        pool_list = [pool_list]

     pool_list_output = []

@@ -835,7 +847,16 @@ def ceph_volume_info(config, pool, volume):
     response = call_api(config, 'get', '/storage/ceph/volume/{pool}/{volume}'.format(volume=volume, pool=pool))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Volume not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -989,9 +1010,6 @@ def format_list_volume(volume_list):
     # Handle empty list
     if not volume_list:
         volume_list = list()
-    # Handle single-entry list
-    if not isinstance(volume_list, list):
-        volume_list = [volume_list]

     volume_list_output = []

@@ -1112,7 +1130,16 @@ def ceph_snapshot_info(config, pool, volume, snapshot):
     response = call_api(config, 'get', '/storage/ceph/snapshot/{pool}/{volume}/{snapshot}'.format(snapshot=snapshot, volume=volume, pool=pool))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Snapshot not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -1209,9 +1236,6 @@ def format_list_snapshot(snapshot_list):
     # Handle empty list
     if not snapshot_list:
         snapshot_list = list()
-    # Handle single-entry list
-    if not isinstance(snapshot_list, list):
-        snapshot_list = [snapshot_list]

     snapshot_list_output = []
@@ -31,10 +31,55 @@ def initialize(config):
     Initialize the PVC cluster

     API endpoint: GET /api/v1/initialize
     API arguments: yes-i-really-mean-it
     API schema: {json_data_object}
     """
+    params = {
+        'yes-i-really-mean-it': 'yes'
+    }
-    response = call_api(config, 'post', '/initialize')
+    response = call_api(config, 'post', '/initialize', params=params)

     if response.status_code == 200:
         retstatus = True
     else:
         retstatus = False

     return retstatus, response.json().get('message', '')


+def backup(config):
+    """
+    Get a JSON backup of the cluster
+
+    API endpoint: GET /api/v1/backup
+    API arguments:
+    API schema: {json_data_object}
+    """
+    response = call_api(config, 'get', '/backup')
+
+    if response.status_code == 200:
+        return True, response.json()
+    else:
+        return False, response.json().get('message', '')
+
+
+def restore(config, cluster_data):
+    """
+    Restore a JSON backup to the cluster
+
+    API endpoint: POST /api/v1/restore
+    API arguments: yes-i-really-mean-it
+    API schema: {json_data_object}
+    """
+    cluster_data_json = json.dumps(cluster_data)
+
+    params = {
+        'yes-i-really-mean-it': 'yes'
+    }
+    data = {
+        'cluster_data': cluster_data_json
+    }
+    response = call_api(config, 'post', '/restore', params=params, data=data)
+
+    if response.status_code == 200:
+        retstatus = True

@@ -104,6 +149,20 @@ def format_info(cluster_information, oformat):
         storage_health_colour = ansiprint.yellow()

     ainformation = []

+    if oformat == 'short':
+        ainformation.append('{}PVC cluster status:{}'.format(ansiprint.bold(), ansiprint.end()))
+        ainformation.append('{}Cluster health:{} {}{}{}'.format(ansiprint.purple(), ansiprint.end(), health_colour, cluster_information['health'], ansiprint.end()))
+        if cluster_information['health_msg']:
+            for line in cluster_information['health_msg']:
+                ainformation.append(' > {}'.format(line))
+        ainformation.append('{}Storage health:{} {}{}{}'.format(ansiprint.purple(), ansiprint.end(), storage_health_colour, cluster_information['storage_health'], ansiprint.end()))
+        if cluster_information['storage_health_msg']:
+            for line in cluster_information['storage_health_msg']:
+                ainformation.append(' > {}'.format(line))
+
+        return '\n'.join(ainformation)
+
     ainformation.append('{}PVC cluster status:{}'.format(ansiprint.bold(), ansiprint.end()))
     ainformation.append('')
     ainformation.append('{}Cluster health:{} {}{}{}'.format(ansiprint.purple(), ansiprint.end(), health_colour, cluster_information['health'], ansiprint.end()))

@@ -114,6 +173,7 @@ def format_info(cluster_information, oformat):
     if cluster_information['storage_health_msg']:
         for line in cluster_information['storage_health_msg']:
             ainformation.append(' > {}'.format(line))

+    ainformation.append('')
     ainformation.append('{}Primary node:{} {}'.format(ansiprint.purple(), ansiprint.end(), cluster_information['primary_node']))
     ainformation.append('{}Cluster upstream IP:{} {}'.format(ansiprint.purple(), ansiprint.end(), cluster_information['upstream_ip']))
@@ -67,7 +67,16 @@ def net_info(config, net):
     response = call_api(config, 'get', '/network/{net}'.format(net=net))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Network not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -148,7 +157,7 @@ def net_modify(config, net, description, domain, name_servers, ip4_network, ip4_
     if ip6_gateway is not None:
         params['ip6_gateway'] = ip6_gateway
     if dhcp4_flag is not None:
-        params['dhcp4_flag'] = dhcp4_flag
+        params['dhcp4'] = dhcp4_flag
     if dhcp4_start is not None:
         params['dhcp4_start'] = dhcp4_start
     if dhcp4_end is not None:

@@ -196,7 +205,16 @@ def net_dhcp_info(config, net, mac):
     response = call_api(config, 'get', '/network/{net}/lease/{mac}'.format(net=net, mac=mac))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Lease not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -281,7 +299,16 @@ def net_acl_info(config, net, description):
     response = call_api(config, 'get', '/network/{net}/acl/{description}'.format(net=net, description=description))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "ACL not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -440,10 +467,6 @@ def format_list(config, network_list):
     if not network_list:
         return "No network found"

-    # Handle single-element lists
-    if not isinstance(network_list, list):
-        network_list = [network_list]
-
     network_list_output = []

     # Determine optimal column widths

@@ -617,9 +640,6 @@ def format_list_acl(acl_list):
     # Handle when we get an empty entry
     if not acl_list:
         acl_list = list()
-    # Handle when we get a single entry
-    if isinstance(acl_list, dict):
-        acl_list = [acl_list]

     acl_list_output = []
@@ -81,7 +81,16 @@ def node_info(config, node):
     response = call_api(config, 'get', '/node/{node}'.format(node=node))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match, return not found
+            return False, "Node not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -186,10 +195,6 @@ def format_info(node_information, long_output):


 def format_list(node_list, raw):
-    # Handle single-element lists
-    if not isinstance(node_list, list):
-        node_list = [node_list]
-
     if raw:
         ainformation = list()
         for node in sorted(item['name'] for item in node_list):
@@ -42,7 +42,16 @@ def template_info(config, template, template_type):
     response = call_api(config, 'get', '/provisioner/template/{template_type}/{template}'.format(template_type=template_type, template=template))

     if response.status_code == 200:
-        return True, response.json()
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Template not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -171,7 +180,16 @@ def userdata_info(config, userdata):
     response = call_api(config, 'get', '/provisioner/userdata/{userdata}'.format(userdata=userdata))

     if response.status_code == 200:
-        return True, response.json()[0]
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Userdata not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -294,7 +312,16 @@ def script_info(config, script):
     response = call_api(config, 'get', '/provisioner/script/{script}'.format(script=script))

     if response.status_code == 200:
-        return True, response.json()[0]
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Script not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -417,7 +444,16 @@ def ova_info(config, name):
     response = call_api(config, 'get', '/provisioner/ova/{name}'.format(name=name))

     if response.status_code == 200:
-        return True, response.json()[0]
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "OVA not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -504,7 +540,16 @@ def profile_info(config, profile):
     response = call_api(config, 'get', '/provisioner/profile/{profile}'.format(profile=profile))

     if response.status_code == 200:
-        return True, response.json()[0]
+        if isinstance(response.json(), list) and len(response.json()) != 1:
+            # No exact match; return not found
+            return False, "Profile not found."
+        else:
+            # Return a single instance if the response is a list
+            if isinstance(response.json(), list):
+                return True, response.json()[0]
+            # This shouldn't happen, but is here just in case
+            else:
+                return True, response.json()
     else:
         return False, response.json().get('message', '')
@@ -41,16 +41,16 @@ def vm_info(config, vm):
     response = call_api(config, 'get', '/vm/{vm}'.format(vm=vm))

     if response.status_code == 200:
-        if isinstance(response.json(), list) and len(response.json()) > 1:
+        if isinstance(response.json(), list) and len(response.json()) != 1:
             # No exact match; return not found
             return False, "VM not found."
         else:
             # Return a single instance if the response is a list
             if isinstance(response.json(), list):
-                response = response.json()[0]
+                return True, response.json()[0]
             # This shouldn't happen, but is here just in case
             else:
-                response = response.json()
-            return True, response
+                return True, response.json()
     else:
         return False, response.json().get('message', '')

@@ -539,7 +539,10 @@ def vm_networks_add(config, vm, network, macaddr, model, restart):
     device_xml = fromstring(device_string)

     last_interface = None
-    for interface in parsed_xml.devices.find('interface'):
+    all_interfaces = parsed_xml.devices.find('interface')
+    if all_interfaces is None:
+        all_interfaces = []
+    for interface in all_interfaces:
         last_interface = re.match(r'[vm]*br([0-9a-z]+)', interface.source.attrib.get('bridge')).group(1)
         if last_interface == network:
             return False, 'Network {} is already configured for VM {}.'.format(network, vm)

@@ -547,6 +550,8 @@ def vm_networks_add(config, vm, network, macaddr, model, restart):
     for interface in parsed_xml.devices.find('interface'):
         if last_interface == re.match(r'[vm]*br([0-9a-z]+)', interface.source.attrib.get('bridge')).group(1):
             interface.addnext(device_xml)
+    else:
+        parsed_xml.devices.find('emulator').addprevious(device_xml)
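The guards added in these two hunks work around an lxml behaviour: `find()` returns `None` when no matching child element exists, and iterating `None` raises a `TypeError`, which is what broke adding the first interface or disk to a device-less VM. A short demonstration of the failure mode (illustrative, not code from the diff):

```python
# find() on a devices block with no <interface> children returns None;
# the added None check substitutes an empty list so the loop simply no-ops.
from lxml import objectify

parsed_xml = objectify.fromstring(
    '<domain><devices><emulator>/usr/bin/kvm</emulator></devices></domain>'
)

all_interfaces = parsed_xml.devices.find('interface')
print(all_interfaces)  # None; the domain has no <interface> children yet

for interface in (all_interfaces if all_interfaces is not None else []):
    pass  # never entered; without the guard, iterating None raises TypeError
```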

     try:
         new_xml = tostring(parsed_xml, pretty_print=True)

@@ -732,7 +737,10 @@ def vm_volumes_add(config, vm, volume, disk_id, bus, disk_type, restart):

     last_disk = None
     id_list = list()
-    for disk in parsed_xml.devices.find('disk'):
+    all_disks = parsed_xml.devices.find('disk')
+    if all_disks is None:
+        all_disks = []
+    for disk in all_disks:
         id_list.append(disk.target.attrib.get('dev'))
         if disk.source.attrib.get('protocol') == disk_type:
             if disk_type == 'rbd':

@@ -782,9 +790,14 @@ def vm_volumes_add(config, vm, volume, disk_id, bus, disk_type, restart):
     elif disk_type == 'file':
         new_disk_details.source.set('file', volume)

-    for disk in parsed_xml.devices.find('disk'):
+    all_disks = parsed_xml.devices.find('disk')
+    if all_disks is None:
+        all_disks = []
+    for disk in all_disks:
         last_disk = disk
         last_disk.addnext(new_disk_details)

+    if last_disk is None:
+        parsed_xml.devices.find('emulator').addprevious(new_disk_details)
+
     try:
         new_xml = tostring(parsed_xml, pretty_print=True)

@@ -1007,8 +1020,11 @@ def follow_console_log(config, vm, lines=10):
     print(loglines, end='')

     while True:
-        # Grab the next line set
+        # Grab the next line set (500 is a reasonable number of lines per second; any more are skipped)
         try:
+            params = {
+                'lines': 500
+            }
             response = call_api(config, 'get', '/vm/{vm}/console'.format(vm=vm), params=params)
             new_console_log = response.json()['data']
         except Exception:

@@ -1066,20 +1082,20 @@ def format_info(config, domain_information, long_output):
         ainformation.append('')
         ainformation.append('{0}Memory stats:{1} {2}Swap In Swap Out Faults (maj/min) Available Usable Unused RSS{3}'.format(ansiprint.purple(), ansiprint.end(), ansiprint.bold(), ansiprint.end()))
         ainformation.append(' {0: <7} {1: <8} {2: <16} {3: <10} {4: <7} {5: <7} {6: <10}'.format(
-            format_metric(domain_information['memory_stats'].get('swap_in')),
-            format_metric(domain_information['memory_stats'].get('swap_out')),
-            '/'.join([format_metric(domain_information['memory_stats'].get('major_fault')), format_metric(domain_information['memory_stats'].get('minor_fault'))]),
-            format_bytes(domain_information['memory_stats'].get('available') * 1024),
-            format_bytes(domain_information['memory_stats'].get('usable') * 1024),
-            format_bytes(domain_information['memory_stats'].get('unused') * 1024),
-            format_bytes(domain_information['memory_stats'].get('rss') * 1024)
+            format_metric(domain_information['memory_stats'].get('swap_in', 0)),
+            format_metric(domain_information['memory_stats'].get('swap_out', 0)),
+            '/'.join([format_metric(domain_information['memory_stats'].get('major_fault', 0)), format_metric(domain_information['memory_stats'].get('minor_fault', 0))]),
+            format_bytes(domain_information['memory_stats'].get('available', 0) * 1024),
+            format_bytes(domain_information['memory_stats'].get('usable', 0) * 1024),
+            format_bytes(domain_information['memory_stats'].get('unused', 0) * 1024),
+            format_bytes(domain_information['memory_stats'].get('rss', 0) * 1024)
         ))
         ainformation.append('')
         ainformation.append('{0}vCPU stats:{1} {2}CPU time (ns) User time (ns) System time (ns){3}'.format(ansiprint.purple(), ansiprint.end(), ansiprint.bold(), ansiprint.end()))
         ainformation.append(' {0: <16} {1: <16} {2: <15}'.format(
-            str(domain_information['vcpu_stats'].get('cpu_time')),
-            str(domain_information['vcpu_stats'].get('user_time')),
-            str(domain_information['vcpu_stats'].get('system_time'))
+            str(domain_information['vcpu_stats'].get('cpu_time', 0)),
+            str(domain_information['vcpu_stats'].get('user_time', 0)),
+            str(domain_information['vcpu_stats'].get('system_time', 0))
         ))
|
||||
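The switch to `.get(key, 0)` here is the changelog's "disk stat output being None" fix: a stat missing from the hypervisor report comes back as `None`, and arithmetic on `None` raises. A minimal repro of the failure mode (hypothetical empty stats dict):

```python
memory_stats = {}                       # stat not reported for this domain

value = memory_stats.get('available')   # returns None
try:
    value * 1024                        # TypeError: None * int
except TypeError:
    pass                                # the crash the defaults above avoid

value = memory_stats.get('available', 0) * 1024   # 0, formats cleanly
```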
    # PVC cluster information
@@ -1122,7 +1138,7 @@ def format_info(config, domain_information, long_output):
        formatted_node_autostart = domain_information['node_autostart']

    if not domain_information.get('migration_method'):
        formatted_migration_method = "none"
        formatted_migration_method = "any"
    else:
        formatted_migration_method = domain_information['migration_method']

@@ -1166,8 +1182,8 @@ def format_info(config, domain_information, long_output):
            disk['name'],
            disk['dev'],
            disk['bus'],
            '/'.join([str(format_metric(disk['rd_req'])), str(format_metric(disk['wr_req']))]),
            '/'.join([str(format_bytes(disk['rd_bytes'])), str(format_bytes(disk['wr_bytes']))]),
            '/'.join([str(format_metric(disk.get('rd_req', 0))), str(format_metric(disk.get('wr_req', 0)))]),
            '/'.join([str(format_bytes(disk.get('rd_bytes', 0))), str(format_bytes(disk.get('wr_bytes', 0)))]),
            width=name_length
        ))
        ainformation.append('')
@@ -1179,9 +1195,9 @@ def format_info(config, domain_information, long_output):
            net['source'],
            net['model'],
            net['mac'],
            '/'.join([str(format_bytes(net['rd_bytes'])), str(format_bytes(net['wr_bytes']))]),
            '/'.join([str(format_metric(net['rd_packets'])), str(format_metric(net['wr_packets']))]),
            '/'.join([str(format_metric(net['rd_errors'])), str(format_metric(net['wr_errors']))]),
            '/'.join([str(format_bytes(net.get('rd_bytes', 0))), str(format_bytes(net.get('wr_bytes', 0)))]),
            '/'.join([str(format_metric(net.get('rd_packets', 0))), str(format_metric(net.get('wr_packets', 0)))]),
            '/'.join([str(format_metric(net.get('rd_errors', 0))), str(format_metric(net.get('wr_errors', 0)))]),
        ))
        # Controller list
        ainformation.append('')
@@ -1195,10 +1211,6 @@ def format_info(config, domain_information, long_output):


def format_list(config, vm_list, raw):
    # Handle single-element lists
    if not isinstance(vm_list, list):
        vm_list = [vm_list]

    # Function to strip the "br" off of nets and return a nicer list
    def getNiceNetID(domain_information):
        # Network list
@@ -1601,38 +1601,38 @@ def net_add(vni, description, nettype, domain, ip_network, ip_gateway, ip6_netwo
@click.option(
    '-i', '--ipnet', 'ip4_network',
    default=None,
    help='CIDR-format IPv4 network address for subnet.'
    help='CIDR-format IPv4 network address for subnet; disable with "".'
)
@click.option(
    '-i6', '--ipnet6', 'ip6_network',
    default=None,
    help='CIDR-format IPv6 network address for subnet.'
    help='CIDR-format IPv6 network address for subnet; disable with "".'
)
@click.option(
    '-g', '--gateway', 'ip4_gateway',
    default=None,
    help='Default IPv4 gateway address for subnet.'
    help='Default IPv4 gateway address for subnet; disable with "".'
)
@click.option(
    '-g6', '--gateway6', 'ip6_gateway',
    default=None,
    help='Default IPv6 gateway address for subnet.'
    help='Default IPv6 gateway address for subnet; disable with "".'
)
@click.option(
    '--dhcp/--no-dhcp', 'dhcp_flag',
    is_flag=True,
    default=None,
    help='Enable/disable DHCP for clients on subnet.'
    help='Enable/disable DHCPv4 for clients on subnet (DHCPv6 is always enabled if DHCPv6 network is set).'
)
@click.option(
    '--dhcp-start', 'dhcp_start',
    default=None,
    help='DHCP range start address.'
    help='DHCPv4 range start address.'
)
@click.option(
    '--dhcp-end', 'dhcp_end',
    default=None,
    help='DHCP range end address.'
    help='DHCPv4 range end address.'
)
@click.argument(
    'vni'
@@ -2721,14 +2721,14 @@ def provisioner_template_system_list(limit):
    help='The amount of vRAM (in MB).'
)
@click.option(
    '-s', '--serial', 'serial',
    '-s/-S', '--serial/--no-serial', 'serial',
    is_flag=True, default=False,
    help='Enable the virtual serial console.'
)
@click.option(
    '-n', '--vnc', 'vnc',
    '-n/-N', '--vnc/--no-vnc', 'vnc',
    is_flag=True, default=False,
    help='Enable the VNC console.'
    help='Enable/disable the VNC console.'
)
@click.option(
    '-b', '--vnc-bind', 'vnc_bind',
@@ -2801,14 +2801,14 @@ def provisioner_template_system_add(name, vcpus, vram, serial, vnc, vnc_bind, no
    help='The amount of vRAM (in MB).'
)
@click.option(
    '-s', '--serial', 'serial',
    '-s/-S', '--serial/--no-serial', 'serial',
    is_flag=True, default=None,
    help='Enable the virtual serial console.'
)
@click.option(
    '-n', '--vnc', 'vnc',
    '-n/-N', '--vnc/--no-vnc', 'vnc',
    is_flag=True, default=None,
    help='Enable the VNC console.'
    help='Enable/disable the VNC console.'
)
@click.option(
    '-b', '--vnc-bind', 'vnc_bind',
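The `-s/-S`, `--serial/--no-serial` form above is Click's paired on/off flag syntax. A minimal standalone sketch (a hypothetical command, not part of the PVC CLI) showing why `default=None` is useful with it:

```python
import click

@click.command()
@click.option('-s/-S', '--serial/--no-serial', 'serial',
              is_flag=True, default=None,
              help='Enable/disable the virtual serial console.')
def demo(serial):
    # serial is True for -s/--serial, False for -S/--no-serial, and None
    # when unset, so "explicitly disabled" and "not specified" differ.
    click.echo(repr(serial))

if __name__ == '__main__':
    demo()
```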
@@ -3320,7 +3320,7 @@ def provisioner_userdata_add(name, filename):
    userdata = filename.read()
    filename.close()
    try:
        yaml.load(userdata, Loader=yaml.FullLoader)
        yaml.load(userdata, Loader=yaml.SafeLoader)
    except Exception as e:
        click.echo("Error: Userdata document is malformed")
        cleanup(False, e)
@@ -3397,7 +3397,7 @@ def provisioner_userdata_modify(name, filename, editor):
    filename.close()

    try:
        yaml.load(userdata, Loader=yaml.FullLoader)
        yaml.load(userdata, Loader=yaml.SafeLoader)
    except Exception as e:
        click.echo("Error: Userdata document is malformed")
        cleanup(False, e)
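The swap from `FullLoader` to `SafeLoader` addresses the userdata-validation bug noted in the v0.9.10 changelog, and SafeLoader additionally refuses to construct arbitrary Python objects from user-supplied YAML. A minimal sketch of the validation pattern (the sample userdata is made up):

```python
import yaml

userdata = "#cloud-config\npackages:\n  - htop\n"
try:
    # SafeLoader only builds plain YAML types (dicts, lists, scalars).
    yaml.load(userdata, Loader=yaml.SafeLoader)
except Exception:
    print("Error: Userdata document is malformed")
```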
@@ -4057,13 +4057,19 @@ def maintenance_off():
@click.command(name='status', short_help='Show current cluster status.')
@click.option(
    '-f', '--format', 'oformat', default='plain', show_default=True,
    type=click.Choice(['plain', 'json', 'json-pretty']),
    type=click.Choice(['plain', 'short', 'json', 'json-pretty']),
    help='Output format of cluster status information.'
)
@cluster_req
def status_cluster(oformat):
    """
    Show basic information and health for the active PVC cluster.

    Output formats:
    plain: Full text, full colour output for human-readability.
    short: Health-only, full colour output for human-readability.
    json: Compact JSON representation for machine parsing.
    json-pretty: Pretty-printed JSON representation for machine parsing or human-readability.
    """

    retcode, retdata = pvc_cluster.get_info(config)
@@ -4073,16 +4079,81 @@ def status_cluster(oformat):


###############################################################################
# pvc init
# pvc task
###############################################################################
@click.group(name='task', short_help='Perform PVC cluster tasks.', context_settings=CONTEXT_SETTINGS)
def cli_task():
    """
    Perform administrative tasks against the PVC cluster.
    """
    pass


###############################################################################
# pvc task backup
###############################################################################
@click.command(name='backup', short_help='Create JSON backup of cluster.')
@click.option(
    '-f', '--file', 'filename',
    default=None, type=click.File(),
    help='Write backup data to this file.'
)
@cluster_req
def task_backup(filename):
    """
    Create a JSON-format backup of the cluster Zookeeper database.
    """

    retcode, retdata = pvc_cluster.backup(config)
    if filename:
        with open(filename, 'wb') as fh:
            fh.write(retdata)
        retdata = 'Data written to {}'.format(filename)
    cleanup(retcode, retdata)


###############################################################################
# pvc task restore
###############################################################################
@click.command(name='restore', short_help='Restore JSON backup to cluster.')
@click.option(
    '-f', '--file', 'filename',
    required=True, default=None, type=click.File(),
    help='Read backup data from this file.'
)
@click.option(
    '-y', '--yes', 'confirm_flag',
    is_flag=True, default=False,
    help='Confirm the restore'
)
@cluster_req
def task_restore(filename, confirm_flag):
    """
    Restore the JSON backup data from a file to the cluster.
    """

    if not confirm_flag:
        try:
            click.confirm('Replace all existing cluster data from coordinators with backup file "{}"'.format(filename.name), prompt_suffix='? ', abort=True)
        except Exception:
            exit(0)

    cluster_data = json.loads(filename.read())
    retcode, retmsg = pvc_cluster.restore(config, cluster_data)
    cleanup(retcode, retmsg)


###############################################################################
# pvc task init
###############################################################################
@click.command(name='init', short_help='Initialize a new cluster.')
@click.option(
    '-y', '--yes', 'confirm_flag',
    is_flag=True, default=False,
    help='Confirm the removal'
    help='Confirm the initialization'
)
@cluster_req
def init_cluster(confirm_flag):
def task_init(confirm_flag):
    """
    Perform initialization of a new PVC cluster.
    """
@@ -4317,6 +4388,10 @@ cli_provisioner.add_command(provisioner_status)
cli_maintenance.add_command(maintenance_on)
cli_maintenance.add_command(maintenance_off)

cli_task.add_command(task_backup)
cli_task.add_command(task_restore)
cli_task.add_command(task_init)

cli.add_command(cli_cluster)
cli.add_command(cli_node)
cli.add_command(cli_vm)
@@ -4324,8 +4399,8 @@ cli.add_command(cli_network)
cli.add_command(cli_storage)
cli.add_command(cli_provisioner)
cli.add_command(cli_maintenance)
cli.add_command(cli_task)
cli.add_command(status_cluster)
cli.add_command(init_cluster)


#
@@ -347,9 +347,11 @@ def getPoolInformation(zk_conn, pool):
    # Parse the stats data
    pool_stats_raw = zkhandler.readdata(zk_conn, '/ceph/pools/{}/stats'.format(pool))
    pool_stats = dict(json.loads(pool_stats_raw))
    volume_count = len(getCephVolumes(zk_conn, pool))

    pool_information = {
        'name': pool,
        'volume_count': volume_count,
        'stats': pool_stats
    }
    return pool_information
@@ -27,6 +27,7 @@ import shlex
import subprocess
import kazoo.client
from json import loads
from re import match as re_match

from distutils.util import strtobool
@@ -359,6 +360,7 @@ def getDomainNetworks(parsed_xml, stats_data):
        net_wr_drops = net_stats.get('wr_drops', 0)
        net_obj = {
            'type': net_type,
            'vni': re_match(r'[vm]*br([0-9a-z]+)', net_bridge).group(1),
            'mac': net_mac,
            'source': net_bridge,
            'model': net_model,
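The `[vm]*br([0-9a-z]+)` pattern, used both here and in the CLI changes above, strips the `vmbr`/`mbr`/`br` prefix from a bridge name to recover the network ID. A quick illustration with made-up bridge names:

```python
from re import match as re_match

for bridge in ('vmbr100', 'br200', 'mbr0a'):
    vni = re_match(r'[vm]*br([0-9a-z]+)', bridge).group(1)
    print(bridge, '->', vni)   # vmbr100 -> 100, br200 -> 200, mbr0a -> 0a
```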
@@ -292,23 +292,23 @@ def modify_network(zk_conn, vni, description=None, domain=None, name_servers=Non
                   dhcp4_flag=None, dhcp4_start=None, dhcp4_end=None):
    # Add the modified parameters to Zookeeper
    zk_data = dict()
    if description:
    if description is not None:
        zk_data.update({'/networks/{}'.format(vni): description})
    if domain:
    if domain is not None:
        zk_data.update({'/networks/{}/domain'.format(vni): domain})
    if name_servers:
    if name_servers is not None:
        zk_data.update({'/networks/{}/name_servers'.format(vni): name_servers})
    if ip4_network:
    if ip4_network is not None:
        zk_data.update({'/networks/{}/ip4_network'.format(vni): ip4_network})
    if ip4_gateway:
    if ip4_gateway is not None:
        zk_data.update({'/networks/{}/ip4_gateway'.format(vni): ip4_gateway})
    if ip6_network:
    if ip6_network is not None:
        zk_data.update({'/networks/{}/ip6_network'.format(vni): ip6_network})
    if ip6_network is not None:
        if ip6_network:
            zk_data.update({'/networks/{}/dhcp6_flag'.format(vni): 'True'})
        else:
            zk_data.update({'/networks/{}/dhcp6_flag'.format(vni): 'False'})
    if ip6_gateway:
    if ip6_gateway is not None:
        zk_data.update({'/networks/{}/ip6_gateway'.format(vni): ip6_gateway})
    else:
        # If we're changing the network, but don't also specify the gateway,
@@ -317,11 +317,11 @@ def modify_network(zk_conn, vni, description=None, domain=None, name_servers=Non
        ip6_netpart, ip6_maskpart = ip6_network.split('/')
        ip6_gateway = '{}1'.format(ip6_netpart)
        zk_data.update({'/networks/{}/ip6_gateway'.format(vni): ip6_gateway})
    if dhcp4_flag:
    if dhcp4_flag is not None:
        zk_data.update({'/networks/{}/dhcp4_flag'.format(vni): dhcp4_flag})
    if dhcp4_start:
    if dhcp4_start is not None:
        zk_data.update({'/networks/{}/dhcp4_start'.format(vni): dhcp4_start})
    if dhcp4_end:
    if dhcp4_end is not None:
        zk_data.update({'/networks/{}/dhcp4_end'.format(vni): dhcp4_end})

    zkhandler.writedata(zk_conn, zk_data)
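The systematic change from `if value:` to `if value is not None:` is what makes the new 'disable with ""' help text above work: an empty string is a legitimate value meaning "unset this", but it is falsey, so a truthiness check would silently drop it. A minimal illustration:

```python
value = ""                   # caller explicitly disabling a setting

if value:                    # False: the update would never be written
    print("truthy check writes")
if value is not None:        # True: the explicit disable is recorded
    print("None check writes")
```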
54 debian/changelog vendored
@@ -1,3 +1,57 @@
pvc (0.9.10-0) unstable; urgency=high

  * Moves OSD stats uploading to primary, eliminating reporting failures while hosts are down
  * Documentation updates
  * Significantly improves RBD locking behaviour in several situations, eliminating cold-cluster start issues and failed VM boot-ups after crashes
  * Fixes some timeout delays with fencing
  * Fixes bug in validating YAML provisioner userdata

 -- Joshua M. Boniface <joshua@boniface.me>  Tue, 15 Dec 2020 10:45:15 -0500

pvc (0.9.9-0) unstable; urgency=high

  * Adds documentation updates
  * Removes single-element list stripping and fixes surrounding bugs
  * Adds additional fields to some API endpoints for ease of parsing by clients
  * Fixes bugs with network configuration

 -- Joshua M. Boniface <joshua@boniface.me>  Wed, 09 Dec 2020 02:20:20 -0500

pvc (0.9.8-0) unstable; urgency=high

  * Adds support for cluster backup/restore
  * Moves location of `init` command in CLI to make room for the above
  * Cleans up some invalid help messages from the API

 -- Joshua M. Boniface <joshua@boniface.me>  Tue, 24 Nov 2020 12:26:57 -0500

pvc (0.9.7-0) unstable; urgency=high

  * Fixes bug with provisioner system template modifications

 -- Joshua M. Boniface <joshua@boniface.me>  Thu, 19 Nov 2020 10:48:28 -0500

pvc (0.9.6-0) unstable; urgency=high

  * Fixes bug with migrations

 -- Joshua M. Boniface <joshua@boniface.me>  Tue, 17 Nov 2020 13:01:54 -0500

pvc (0.9.5-0) unstable; urgency=high

  * Fixes bug with line count in log follow
  * Fixes bug with disk stat output being None
  * Adds short pretty health output
  * Documentation updates

 -- Joshua M. Boniface <joshua@boniface.me>  Tue, 17 Nov 2020 12:34:04 -0500

pvc (0.9.4-0) unstable; urgency=high

  * Fixes major bug in OVA parser

 -- Joshua M. Boniface <joshua@boniface.me>  Tue, 10 Nov 2020 15:33:50 -0500

pvc (0.9.3-0) unstable; urgency=high

  * Fixes bugs with image & OVA upload parsing
156 docs/about.md
@@ -1,65 +1,157 @@
# About the Parallel Virtual Cluster suite
# About the Parallel Virtual Cluster system

## Project Goals and Philosophy
- [About the Parallel Virtual Cluster system](#about-the-parallel-virtual-cluster-system)
  * [Project Motivation](#project-motivation)
  * [Building Blocks](#building-blocks)
  * [Cluster Architecture](#cluster-architecture)
  * [Clients](#clients)
    + [API Client](#api-client)
    + [Direct Bindings](#direct-bindings)
    + [CLI Client](#cli-client)
  * [Deployment](#deployment)
  * [Frequently Asked Questions](#frequently-asked-questions)
    + [General Questions](#general-questions)
    + [Feature Questions](#feature-questions)
    + [Storage Questions](#storage-questions)
  * [About The Author](#about-the-author)

This document contains information about the project itself, the software stack, its motivations, and a number of frequently-asked questions.

## Project Motivation

Server management and system administration have changed significantly in the last decade. Computing as a resource is here, and software-defined is the norm. Gone are the days of pet servers, of tweaking configuration files by hand, and of painstakingly installing from ISO images in 52x CD-ROM drives. This is a brave new world.

As part of this trend, the rise of IaaS (Infrastructure as a Service) has created an entirely new way for administrators and, increasingly, developers, to interact with servers. They need to be able to provision virtual machines easily and quickly, to ensure those virtual machines are reliable and consistent, and to avoid downtime wherever possible.
As part of this trend, the rise of IaaS (Infrastructure as a Service) has created an entirely new way for administrators and, increasingly, developers, to interact with servers. They need to be able to provision virtual machines easily and quickly, to ensure those virtual machines are reliable and consistent, and to avoid downtime wherever possible. Even in a world of containers, VMs are still important, and are not going away, so some virtual management solution is a must.

However, the state of the Free Software, virtual management ecosystem at the start of 2020 is quite disappointing. On the one hand are the giant, IaaS products like OpenStack and CloudStack. These are massive pieces of software, featuring dozens of interlocking parts, designed for massive clusters and public cloud deployments. They're great for a "hyperscale" provider, a large-scale SaaS/IaaS provider, or an enterprise. But they're not designed for small teams or small clusters. On the other hand, tools like Proxmox, oVirt, and even good old fashioned shell scripts are barely scalable, are showing their age, and have become increasingly unwieldy for advanced use-cases - great for one server, not so great for 9 in a highly-available cluster. Not to mention the constant attempts to monetize by throwing features behind Enterprise subscriptions. In short, there is a massive gap between the old-style, pet-based virtualization and the modern, large-scale, IaaS-type virtualization. This is not to mention the well-entrenched, proprietary solutions like VMWare and Nutanix which provide many of the features a small cluster administrator requires, but can be prohibitively expensive for small organizations.
However, the current state of this ecosystem is lacking. At present there are 3 primary categories: the large "Stack" open-source projects, the smaller traditional "VM management" open-source projects, and the entrenched proprietary solutions.

PVC aims to bridge these gaps. As a Python 3-based, fully-Free Software, scalable, and redundant private "cloud" that isn't afraid to say it's for small clusters, PVC is able to provide the simple, easy-to-use, small cluster you need today, with minimal administrator work, while being able to scale as your system grows, supporting hundreds or thousands of VMs across dozens of nodes. High availability is baked right into the core software at every layer, giving you peace of mind about your cluster, and ensuring that your systems keep running no matter what happens. And the interface couldn't be easier - a straightforward Click-based CLI and a Flask-based HTTP API provide access to the cluster for you to manage, either directly or though scripts or WebUIs. And since everything is Free Software, you can always inspect it, customize it to your use-case, add features, and contribute back to the community if you so choose.
At the high end of the open-source ecosystem are the "Stacks": OpenStack, CloudStack, and their numerous "vendorware" derivatives. These are large, unwieldy projects with dozens or hundreds of pieces of software to deploy in production, and can often require a large team just to understand and manage them. They're great if you're a large enterprise, building a public cloud, or have a team to get you going. But if you just want to run a small- to medium-sized virtual cluster for your SMB or ISP, they're definitely overkill and will cause you more headaches than they will solve long-term.

PVC provides all the features you'd expect of a "cloud" system - easy management of VMs, including live migration between nodes for maximum uptime; virtual networking support using either vLANs or EVPN-based VXLAN; shared, redundant, object-based storage using Ceph, and a Python function library and convenient API interface for building your own interfaces. It is able to do this without being excessively complex, and without making sacrifices for legacy ideas.
At the low end of the open source ecosystem are what I call the "traditional tools". The biggest name in this space is ProxMox, though other, mostly defunct projects like Ganeti, tangential projects like Corosync/Pacemaker, and even traditional "I just use scripts" methods fit as well. These projects are great if you want to run a small server or homelab, but they quickly get unwieldy, though for the opposite reason from the Stacks: they're too simplistic, designed around single-host models, and when they provide redundancy at all it is often haphazard and nowhere near production-grade.

If you need to run virtual machines, and don't have the time to learn the Stacks, the patience to deal with the old-style FOSS tools, or the money to spend on proprietary solutions, PVC might be just what you're looking for.
Finally, the proprietary solutions like VMWare and Nutanix have entrenched themselves in the industry. They're excellent pieces of software providing just about anything you would need, but this comes at a significant cost, both in terms of money and also in software freedom and vendor lock-in. The licensing costs of Nutanix for instance can often make even enterprise-grade customers' accountants' heads spin.

PVC seeks to bridge the gaps between these 3 categories. It is fully Free Software like the first two categories, and even more so - PVC is committed to never be "open-core" software and to never hide a single feature behind a paywall; it is able to scale from very small (1 or 3 node) clusters up to a dozen or more nodes, bridging the first two categories as effortlessly as the third does; it makes use of a hyperconverged architecture like ProxMox or Nutanix to avoid wasting hardware resources on dedicated controller, hypervisor, and storage nodes; it is redundant at every layer from the ground-up, something that is not designed into any other free solution, and is able to tolerate the loss of any single disk or entire node with barely a blip, all without administrator intervention; and finally, it is designed to be as simple to use as possible, with an Ansible-based node management framework, a RESTful API client interface, and a consistent, self-documenting CLI administration tool, allowing an administrator to create and manage their cluster quickly and simply, and then get on with more interesting things.

In short, it is a Free Software, scalable, redundant, self-healing, and self-managing private cloud solution designed with administrator simplicity in mind.

## Building Blocks

PVC is built from a number of other open source components. The main system itself is a series of software daemons (services) written in Python 3, with the CLI interface also written in Python 3.

Virtual machines themselves are run with the Linux KVM subsystem via the Libvirt virtual machine management library. This provides the maximum flexibility and compatibility for running various guest operating systems in multiple modes (fully-virtualized, para-virtualized, virtio-enabled, etc.).

To manage cluster state, PVC uses Zookeeper. This is an Apache project designed to provide a highly-available and always-consistent key-value database. The various daemons all connect to the distributed Zookeeper database to both obtain details about cluster state, and to manage that state. For instance the node daemon watches Zookeeper for information on what VMs to run, networks to create, etc., while the API writes information to Zookeeper in response to requests.
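As a rough illustration of this watch-and-react pattern, here is a minimal sketch using kazoo, the Zookeeper client library the PVC daemons import (the host addresses and key path below are invented; PVC's real key layout may differ):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181')
zk.start()

# React whenever the cluster changes this key, as the node daemon does
# for VM state, network, and storage keys.
@zk.DataWatch('/domains/test-vm/state')   # hypothetical key
def watch_state(data, stat):
    if data is not None:
        print('VM state is now:', data.decode())

# ... the daemon's event loop would run here ...
zk.stop()
```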
Additional relational database functionality, specifically for the DNS aggregation subsystem and the VM provisioner, is provided by the PostgreSQL database and the Patroni management tool, which provides automatic clustering and failover for PostgreSQL database instances.

Node network routing for managed networks providing EBGP VXLAN and route-learning is provided by FRRouting, a descendant project of Quagga and GNU Zebra.

The storage subsystem is provided by Ceph, a distributed object-based storage subsystem with extensive scalability, self-managing, and self-healing functionality. The Ceph RBD (Rados Block Device) subsystem is used to provide VM block devices similar to traditional LVM or ZFS zvols, but in a distributed, shared-storage manner.

All the components are designed to be run on top of Debian GNU/Linux, specifically Debian 10.X "Buster", with the SystemD system service manager. This OS provides a stable base to run the various other subsystems while remaining truly Free Software, while SystemD provides functionality such as automatic daemon restarting and complex startup/shutdown ordering.

## Cluster Architecture

A PVC cluster is based around "nodes", which are physical servers on which the various daemons, storage, networks, and virtual machines run. Each node is self-contained; it is able to perform any and all cluster functions if needed, and there is no segmentation of function between different types of physical hosts.
A PVC cluster is based around "nodes", which are physical servers on which the various daemons, storage, networks, and virtual machines run. Each node is self-contained and is able to perform any and all cluster functions if needed; there is no segmentation of function between different types of physical hosts. Ideally, all nodes in a cluster will be identical in specifications, but in some situations mismatched nodes are acceptable, with limitations.

A limited number of nodes, called "coordinators", are statically configured to provide additional services for the cluster. All databases for instance run on the coordinators, but not other nodes. This prevents any issues with scaling database clusters across dozens of hosts, while still retaining maximum redundancy. In a standard configuration, 3 or 5 nodes are designated as coordinators, and additional nodes connect to the coordinators for database access where required. For quorum purposes, there should always be an odd number of coordinators, and exceeding 5 is likely not required even for large clusters. PVC also supports a single node cluster format for extremely small clusters, homelabs, or testing where redundancy is not required.
A subset of the nodes, called "coordinators", are statically configured to provide additional services for the cluster. For instance, all databases, FRRouting instances, and Ceph management daemons run only on the set of cluster coordinators. At cluster bootstrap, 1 (testing-only), 3 (small clusters), or 5 (large clusters) nodes may be chosen as the coordinators. Other nodes can then be added as "hypervisor" nodes, which then provide only block device (storage) and VM (compute) functionality by connecting to the set of coordinators. This limits the scaling problem of the databases while ensuring there is still maximum redundancy and resiliency for the core cluster services.

The primary database for PVC is Zookeeper, a highly-available key-value store designed with consistency in mind. Each node connects to the Zookeeper cluster running on the coordinators to send and receive data from the rest of the cluster. The API client (and Python function library) interface with this Zookeeper cluster directly to configure and obtain state about the various objects in the cluster. This database is the central authority for all nodes.
Additional nodes can be added to the cluster either as coordinators, or as hypervisors, by adding them to the Ansible configuration and running it against the full set of nodes. Note that the number of coordinators must always be odd, and more than 5 coordinators are normally unnecessary and can cause issues with the database; it is thus normally advisable to add any nodes beyond the initial set as hypervisors instead of coordinators. Nodes can be removed from service, but this is a manual process and should not be attempted unless absolutely required; the Ceph subsystem in particular is sensitive to changes in the coordinator nodes.

Nodes are networked together via at least 3 different networks, set during bootstrap. The first is the "upstream" network, which provides upstream access for the nodes, for instance Internet connectivity, sending routes to client networks to upstream routers, etc. This should usually be a private/firewalled network to prevent unauthorized access to the cluster. The second is the "cluster" network, which is a private RFC1918 network that is unrouted and that nodes use to communicate between one another for Zookeeper access, Libvirt migrations, EVPN VXLAN tunnels, etc. The third is the "storage" network, which is used by the Ceph storage cluster for inter-OSD communication, allowing it to be separate from the main cluster network for maximum performance flexibility.
During runtime, one coordinator is elected the "primary" for the cluster. This designation can shift dynamically in response to cluster events, or be manually migrated by an administrator. The coordinator takes on a number of roles for which only one host may be active at once, for instance to provide DHCP services to managed client networks or to interface with the API.

Further information about the general cluster architecture can be found at the [cluster architecture page](/architecture/cluster).
Nodes are networked together via a set of statically-configured networks. At a minimum, 2 discrete networks are required, with an optional 3rd.

## Node Architecture
* The "upstream" network is the primary network for the nodes, and provides functions such as upstream Internet access, routing to and from the cluster nodes, and management via the API; it may be either a firewalled public or NAT'd RFC1918 network, but should never be exposed directly to the Internet.
* The "cluster" network is an unrouted RFC1918 network which provides inter-node communication for managed client network traffic (VXLANs), cross-node routing, VM migration and failover, and database replication and access.
* The "storage" network is another unrouted RFC1918 network which provides a dedicated logical and/or physical link between the nodes for storage traffic, including VM block device storage traffic, inter-OSD replication traffic, and Ceph heartbeat traffic, thus allowing it to be completely isolated from the other networks for maximum performance. This network can be optionally colocated with the "cluster" network, by specifying the same device for both, and can be further combined by specifying the same IP for both to completely collapse the "cluster" and "storage" networks. This may be ideal to simplify management of small clusters.

Within each network is a single "floating" IP address which follows the primary coordinator, providing a single interface to the cluster. Once configured, the cluster is then able to create additional networks of two kinds, "bridged" traditional vLANs and "managed" routed VXLANs, to provide network access to VMs.

Within each node, the PVC daemon is a single Python 3 program which handles all node functionality, including networking, starting cluster services, managing creation/removal of VMs, networks, and storage, and providing utilization statistics and information to the cluster.
Further information about the general cluster architecture, including important considerations for node specifications/sizing and network configuration, [can be found at the cluster architecture page](/cluster-architecture). It is imperative that potential PVC administrators read this document thoroughly to understand the specific requirements of PVC and avoid potential missteps in obtaining and deploying their cluster.

The daemon uses an object-oriented approach, with most cluster objects being represented by class objects of a specific type. Each node has a full view of all cluster objects and can interact with them based on events from the cluster as needed.
## Clients

Further information about the node daemon manual can be found at the [daemon manual page](/manuals/daemon).
### API Client

## Client Architecture
The API client is a Flask-based RESTful API and is the core interface to PVC. By default the API will run on the primary coordinator, listening on TCP port 7370 on the "upstream" network floating IP address. All other clients communicate with this API to perform actions against the cluster. The API features basic authentication using UUID-based API keys to prevent unauthorized access, and can optionally be configured with full TLS encryption to provide integrity and confidentiality across public networks.

### API client
The API generally accepts all requests as HTTP form requests following standard RESTful guidelines, supporting arguments in the URI string or, with limited exceptions, in the message body. The API returns JSON response bodies to all requests consisting either of the information requested, or a `{ "message": "text" }` construct to pass informational status messages back to the client.

The API client is the core interface to PVC. It is a Flask RESTful API interface capable of performing all functions, and by default runs on the primary coordinator listening on port 7370 at the upstream floating IP address. Other clients, such as the CLI client, connect to the API to perform actions against the cluster. The API features a basic key-based authentication mechanism to prevent unauthorized access to the cluster if desired, and can also provide TLS-encrypted access for maximum security over public networks.
The API client manual can be found at the [API manual page](/manuals/api), and the full API documentation can be found at the [API reference page](/manuals/api-reference.html).

The API accepts all requests as HTTP form requests, supporting arguments both in the URI string as well as in the POST/PUT body. The API returns JSON response bodies to all requests.
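As a concrete sketch of this request/response pattern, the following uses Python's `requests` against a hypothetical cluster; the address, endpoint path, and `X-Api-Key` header value here are assumptions for illustration, so consult the API reference for the real routes:

```python
import requests

API_URL = 'http://10.0.0.10:7370/api/v1'          # assumed address and base path
HEADERS = {'X-Api-Key': '00000000-0000-0000-0000-000000000000'}

response = requests.get('{}/vm'.format(API_URL), headers=HEADERS)
if response.ok:
    for vm in response.json():
        print(vm.get('name'), vm.get('state'))
else:
    # Informational errors come back as {"message": "text"}.
    print(response.json().get('message', ''))
```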
### Direct Bindings

The API client manual can be found at the [API manual page](/manuals/api), and the [API documentation page](/manuals/api-reference.html).
The API client uses a dedicated set of Python libraries, packaged as the `pvc-daemon-common` Debian package, to communicate with the cluster. It is thus possible to build custom Python clients that directly interface with the PVC cluster, without having to get "into the weeds" of the Zookeeper or PostgreSQL databases.

### Direct bindings
### CLI Client

The API client uses a dedicated, independent set of functions to perform the actual communication with the cluster, which is packaged separately as the `pvc-client-common` package. These functions can be used directly by 3rd-party Python interfaces for PVC if desired.
The CLI client is a Python Click application, which provides a convenient CLI interface to the API client. It supports connecting to multiple clusters from a single instance, with or without authentication and over both HTTP and HTTPS, including a special "local" cluster if the client determines that an API configuration exists on the local host. Information about the configured clusters is stored in a local JSON document, and a default cluster can be set with an environment variable.

### CLI client
The CLI client is self-documenting using the `-h`/`--help` arguments throughout, easing the administrator learning curve and providing easy access to command details. A short manual can also be found at the [CLI manual page](/manuals/cli).

The CLI client interface is a Click application, which provides a convenient CLI interface to the API client. It supports connecting to multiple clusters, over both HTTP and HTTPS and with authentication, including a special "local" cluster if the client determines that an `/etc/pvc/pvcapid.yaml` configuration exists on the host.
## Deployment

The CLI client is self-documenting using the `-h`/`--help` arguments, though a short manual can be found at the [CLI manual page](/manuals/cli).

## Deployment architecture

The overall management, deployment, bootstrapping, and configuring of nodes is accomplished via a set of Ansible roles, found in the [`pvc-ansible` repository](https://github.com/parallelvirtualcluster/pvc-ansible), and nodes are installed via a custom installer ISO generated by the [`pvc-installer` repository](https://github.com/parallelvirtualcluster/pvc-installer). Once the cluster is set up, nodes can be added, replaced, or updated using this Ansible framework.
The overall management, deployment, bootstrapping, and configuring of nodes is accomplished via a set of Ansible roles and playbooks, found in the [`pvc-ansible` repository](https://github.com/parallelvirtualcluster/pvc-ansible), and nodes are installed via a custom installer ISO generated by the [`pvc-installer` repository](https://github.com/parallelvirtualcluster/pvc-installer). Once the cluster is set up, nodes can be added, replaced, updated, or reconfigured using this Ansible framework.

The Ansible configuration and architecture manual can be found at the [Ansible manual page](/manuals/ansible).

## About the author
## Frequently Asked Questions

### General Questions

#### What is it?

PVC is a virtual machine management suite designed around high-availability and ease-of-use. It can be considered an alternative to OpenStack, ProxMox, Nutanix, and other similar solutions that manage not just the VMs, but the surrounding infrastructure as well.

#### Why would you make this?

After becoming frustrated by numerous other management tools, I discovered that what I wanted didn't exist as FLOSS software, so I built it myself. Since then, I have also been able to leverage PVC both for my own purposes as well as for my employer, a win-win for the project.

#### Is PVC right for me?

PVC might be right for you if:

1. You need KVM-based VMs.
2. You want management of storage and networking (a.k.a. "batteries-included") in the same tool.
3. You want hypervisor-level redundancy, able to tolerate hypervisor downtime seamlessly, for all elements of the stack.

I built PVC for my homelab first, found a perfect use-case with my employer, and think it might be useful to you too.

#### Is 3 hypervisors really the minimum?

For a redundant cluster, yes. PVC requires a majority quorum for proper operation at various levels, and the smallest possible majority quorum is 2-of-3; thus 3 nodes is the safe minimum. That said, you can run PVC on a single node for testing/lab purposes without host-level redundancy, should you wish to do so, and it might also be possible to run 2 "main" systems with a 3rd "quorum observer" hosting only the management tools but no VMs, however this is not officially supported.
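The quorum arithmetic behind this answer is simple majority: a cluster of n coordinators needs floor(n/2)+1 members up, so it tolerates n minus that many failures. A quick check:

```python
for n in (1, 3, 5):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print('{} coordinators: quorum={}, failures tolerated={}'.format(n, quorum, tolerated))
# 1 -> quorum 1, tolerates 0; 3 -> quorum 2, tolerates 1; 5 -> quorum 3, tolerates 2
```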
### Feature Questions

#### Does PVC support containers (Docker/Kubernetes/LXC/etc.)?

No, not directly. PVC supports only KVM VMs. To run containers, you would need to run a VM which then runs your containers. For instance PVC makes an excellent underlying layer for a virtual Kubernetes cluster, instead of bare hardware.

#### Does PVC have a WebUI?

Not yet. Right now, PVC management is done exclusively with the CLI interface to the API. A WebUI can and likely will be built in the future, but I'm not a frontend developer and I do not consider this a personal priority. As of late 2020 the API is generally stable, so I would welcome 3rd party assistance here.

### Storage Questions

#### Can I use RAID-5/RAID-6 with PVC?

The short answer is no. The long answer is: Ceph, the storage backend used by PVC, does support "erasure coded" pools which implement a RAID-5-like (striped with distributed parity) functionality, but PVC does not support this for several reasons, mostly related to ease of management and performance. If you use PVC, you must accept at the very least a 2x storage penalty, and for true multi-node safety and resiliency, a 3x storage penalty for VM storage. This is a trade-off of the architecture and should be taken into account when sizing storage in nodes.

#### Can I use spinning HDDs with PVC?

You can, but you won't like the results. SSDs, and specifically datacentre-grade SSDs for resiliency, are required to obtain any sort of reasonable performance when running multiple VMs. The higher-performance the drives, the faster the storage.

#### What network speed does PVC require?

For optimal performance, nodes should use at least 10-Gigabit Ethernet network interfaces wherever possible, and on large clusters a dedicated 10-Gigabit "storage" network, separate from the "upstream"/"cluster" networks, is strongly recommended. The storage system performance, especially for writes, is more heavily bottlenecked by the network speed than the actual storage device speed when speaking of high-performance disks. 1-Gigabit Ethernet will be sufficient for some use-cases and is sufficient for the non-storage networks (VM traffic notwithstanding), but storage performance will become severely limited as the cluster grows. Even slower network speeds (e.g. 100-Megabit) are not sufficient for PVC to operate properly except in very limited testing scenarios.

#### What Ceph version does PVC use?

PVC requires Ceph 14.x (Nautilus). The official PVC repository at https://repo.bonifacelabs.ca includes Ceph 14.2.x (updated regularly), since Debian Buster by default includes only 12.x (Luminous).

## About The Author

PVC is written by [Joshua](https://www.boniface.me) [M.](https://bonifacelabs.ca) [Boniface](https://github.com/joshuaboniface). A Linux system administrator by trade, Joshua is always looking for the best solutions to his users' problems, be they developers or end users. PVC grew out of his frustration with the various FOSS virtualization tools, as well as and specifically, the constant failures of Pacemaker/Corosync to gracefully manage a virtualization cluster. He started work on PVC at the end of May 2018 as a simple alternative to a Corosync/Pacemaker-managed virtualization cluster, and has been growing the feature set and stability of the system ever since.

PVC is written by [Joshua](https://www.boniface.me) [M.](https://bonifacelabs.ca) [Boniface](https://github.com/joshuaboniface). A Linux system administrator by trade, Joshua is always looking for the best solutions to his users' problems, be they developers or end users. PVC grew out of his frustration with the various FOSS virtualization tools, as well as and specifically, the constant failures of Pacemaker/Corosync to gracefully manage a virtualization cluster. He started work on PVC at the end of May 2018 as a simple alternative to a Corosync/Pacemaker-managed virtualization cluster, and has been growing the feature set in starts and stops ever since.
@ -64,35 +64,37 @@ For memory provisioning of VMs, PVC will warn the administrator, via a Degraded
|
||||
|
||||
### Operating System and Architecture
|
||||
|
||||
As an underlying OS, only Debian 10 "Buster" is supported by PVC. This is the operating system installed by the PVC [node installer](https://github.com/parallelvirtualcluster/pvc-installer) and expected by the PVC [Ansible configuration system](https://github.com/parallelvirtualcluster/pvc-ansible). Ubuntu or other Debian-derived distributions may work, but are not officially supported. PVC also makes use of a custom repository to provide the PVC software and an updated version of Ceph beyond what is available in the base operating system, and this is only compatible officially with Debian 10 "Buster".
|
||||
As an underlying OS, only Debian GNU/Linux 10.x "Buster" is supported by PVC. This is the operating system installed by the PVC [node installer](https://github.com/parallelvirtualcluster/pvc-installer) and expected by the PVC [Ansible configuration system](https://github.com/parallelvirtualcluster/pvc-ansible). Ubuntu or other Debian-derived distributions may work, but are not officially supported. PVC also makes use of a custom repository to provide the PVC software and an updated version of Ceph beyond what is available in the base operating system, and this is only compatible officially with Debian 10 "Buster". PVC will, in the future, upgrade to future versions of Debian based on their release schedule and testing; releases may be skipped for official support if required. As a general rule, using the current versions of the official node installer and Ansible repository is the preferred and only supported method for deploying PVC.
|
||||
|
||||
Currently, only the `amd64` (Intel 64 or AMD64) architecture is officially supported by PVC. Given the cross-platform nature of Python and the various software components in Debian, it may work on `armhf` or `arm64` systems as well, however this has not been tested by the author.
|
||||
Currently, only the `amd64` (Intel 64 or AMD64) architecture is officially supported by PVC. Given the cross-platform nature of Python and the various software components in Debian, it may work on `armhf` or `arm64` systems as well, however this has not been tested by the author and is not officially supported at this time.
|
||||
|
||||
## Storage Layout: Ceph and OSDs
|
||||
|
||||
The Ceph subsystem of PVC, if enabled, creates a "hyperconverged" cluster whereby storage and VM hypervisor functions are collocated onto the same physical servers. The performance of the storage must be taken into account when sizing the nodes as mentioned above.
|
||||
PVC makes use of Ceph, a distributed, replicated, self-healing, and self-managing storage system to provide shared VM storage. While a PVC administrator is not required to understand Ceph for day-to-day administraton, and PVC provides interfaces to most of the common storage functions required to operate a cluster, at least some knowledge of Ceph is advisable.
|
||||
|
||||
The Ceph system is laid out similar to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the storage network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, including non-coordinator hypervisors, and communicate with clients and each other over the storage network.
|
||||
The Ceph subsystem of PVC creates a "hyperconverged" cluster whereby storage and VM hypervisor functions are collocated onto the same physical servers; PVC does not differentiate between "storage" and "compute" nodes, and while storage support can be disabled and an external Ceph cluster used, this is not recommended. The performance of the storage must be taken into account when sizing the nodes as mentioned above.
|
||||
|
||||
Disks must be balanced across all nodes. Therefore, adding 1 disk to 1 node is not sufficient; 1 disk must be added to all nodes at the same time for the available space to increase. Ideally, disk sizes should also be identical across all storage disks, though the weight of each disk can be configured when added to the cluster. Generally speaking, fewer larger disks are preferable to many smaller disks to minimize storage resource utilization, however slightly more storage performance can be gained from using many small disks; the administrator should therefore always aim to choose the biggest disks they can and grow by adding more identical disks as space or performance needs grow.
|
||||
Ceph on PVC is laid out similar to the other daemons. The Ceph Monitor and Manager functions are delegated to the Coordinators over the storage network, with all nodes connecting to these hosts to obtain the CRUSH maps and select OSD disks. OSDs are then distributed on all hosts, potentially including non-coordinator hypervisors if desired, and communicate with clients and each other over the storage network.
|
||||
|
||||
PVC Ceph pools make use of the replication mechanism of Ceph to store multiple copies of each object, thus ensuring that data is always available even when a host is unavailable. Only "replica"-based Ceph redundancy is supported by PVC; erasure coded pools are not supported due to major performance impacts related to rewrites and random I/O.
|
||||
Disks must be balanced across all storage-containing nodes. For instance, adding 1 disk to 1 node is not sufficient to increase storage space; 1 disk must be added to all storage-containing nodes, based on the configured replication scheme of the various pools (see below), at the same time for the available space to increase. Ideally, disk sizes should also be identical across all storage disks, though the weight of each disk can be configured when added to the cluster. Generally speaking, fewer larger disks are preferable to many smaller disks, as this minimizes storage resource utilization; however, slightly more storage performance can be gained from using many small disks, if the other cluster hardware, and specifically the CPUs, is performant enough. The administrator should therefore always aim to choose the largest disks they can, and grow by adding more identical disks as space or performance needs grow.

PVC Ceph pools make use of the replication mechanism of Ceph to store multiple copies of each object, thus ensuring that data is always available even when a host is unavailable. Only "replica"-based Ceph redundancy is supported by PVC; erasure coded pools are not supported due to major performance impacts related to rewrites and random I/O, as well as management overhead.

The default replication level for a new pool is `copies=3, mincopies=2`. This will store 3 copies of each object, with a host-level failure domain, and will allow I/O as long as 2 copies are available. Thus, in a cluster of any size, all data is fully available even if a single host becomes unavailable. It will however use 3x the space for each piece of data stored, which must be considered when sizing the disk space for the cluster: a pool in this configuration, running on 3 nodes each with a single 400GB disk, will effectively have 400GB of total space available for use. As mentioned above, new disks must also be added in groups across nodes equal to the total number of `copies` to ensure new space is usable; for instance, in a `copies=3` scheme, at least 3 disks must be added to different hosts at the same time for the available space to grow.

Non-default values can also be set at pool creation time. For instance, one could create a `copies=3, mincopies=1` pool, which would allow I/O with two hosts down, but leaves the cluster susceptible to a write hole should a disk fail in this state; this configuration is not recommended in most situations. Alternatively, for additional resilience, one could create a `copies=4, mincopies=2` pool, which would also allow 2 hosts to fail without a write hole, but would consume 4x the space for each piece of data stored and require new disks to be added in groups of 4 instead. Practically any combination of values is possible, however these 3 are the most relevant for most use-cases, and for most, especially small, clusters, the default is sufficient to provide solid redundancy and guard against host failures until the administrator can respond.
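For illustration, a pool with one of these non-default configurations might be created as follows; this is a sketch assuming the `--replcfg` option of recent PVC versions, so confirm the exact option name with `pvc storage pool add -h` on your cluster:

```
# Create a pool with 4 copies and 2 minimum copies (flag name assumed)
$ pvc storage pool add mypool 256 --replcfg "copies=4,mincopies=2"

# Verify the pool and its replication settings
$ pvc storage pool list
```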

Replication levels cannot be changed within PVC once a pool is created, however they can be changed via manual Ceph commands on a coordinator should the administrator require this, though discussion of this process is outside the scope of this documentation. The administrator should carefully consider sizing, failure domains, and performance when first selecting storage devices and creating pools, to ensure the right level of resiliency versus data usage for their use-case and planned cluster size.

## Physical network considerations

At a minimum, a production PVC cluster should use at least two 1Gbps Ethernet interfaces, connected in an LACP or active-backup bond on one or more switches. On top of this bond, the various cluster networks should be configured as 802.1q vLANs. PVC is able to support configurations without bonding or 802.1q vLAN support, using multiple physical interfaces and no bridged client networks, but this is strongly discouraged due to the added complexity it introduces; the switches chosen for the cluster should include these requirements as a minimum.

More advanced physical network layouts are also possible. For instance, one could have two isolated networks. On the first network, each node has two 10Gbps Ethernet interfaces, which are combined in a bond across two redundant switch fabrics and handle the upstream and cluster networks. On the second network, each node has an additional two 10Gbps interfaces, which are also combined in a bond across the redundant switch fabrics and handle the storage network. This configuration could support up to 10Gbps of aggregate client traffic while also supporting 10Gbps of aggregate storage traffic. Even more complex network configurations are possible if the cluster requires such performance. See the [Example Configurations](#example-configurations) section for some basic topology examples.

Only Ethernet networks are supported by PVC. More exotic interconnects such as Infiniband are not supported by default, and must be manually set up with Ethernet (e.g. EoIB) layers on top to be usable with PVC.

PVC manages the IP addressing of all nodes itself and creates the required addresses during node daemon startup; thus, the on-boot network configuration of each interface should be set to "manual" with no IP addresses configured. This requirement can however be safely ignored, with the addresses instead specified manually in the networking configurations. PVC nodes use a split (`/etc/network/interfaces.d/<iface>`) network configuration model.
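As a hedged sketch of such a configuration (the interface names, bond members, and vLAN ID are assumptions that will vary per deployment), a node's `/etc/network/interfaces.d/` files might contain:

```
# /etc/network/interfaces.d/bond0 - LACP bond over two physical interfaces
auto bond0
iface bond0 inet manual
    bond-mode 802.3ad
    bond-slaves eno1 eno2
    bond-miimon 100

# /etc/network/interfaces.d/vlan100 - an 802.1q vLAN on top of the bond; no IP
# is configured here, as PVC assigns the addresses itself at node daemon startup
auto bond0.100
iface bond0.100 inet manual
    vlan-raw-device bond0
```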

## Network Layout: Considering the required networks

Nodes in this network are generally assigned IPs automatically based on their node number (e.g. node1 at `.1`, node2 at `.2`, etc.). The network should be large enough to include all nodes sequentially.

The administrator may choose to collocate the storage network on the same physical interface as the cluster network, or on a separate physical interface. This should be decided based on the size of the cluster and the perceived ratios of client network versus storage traffic. In large (>3 node) or storage-intensive clusters, this network should generally be a separate set of fast physical interfaces, separate from both the upstream and cluster networks, in order to maximize and isolate the storage bandwidth. If the administrator does choose to collocate these networks, they may also share the same IP address, thus eliminating any distinction between the Cluster and Storage networks. The PVC software handles this natively when the Cluster and Storage IPs of a node are identical.

### PVC client networks

The first type of client network is the unmanaged bridged network.

With this client network type, PVC does no management of the network. This is left entirely to the administrator. It requires switch support and the configuration of the vLANs on the switchports of each node's physical interfaces before enabling the network.

Generally, the same physical network interface will underlie both the cluster networks as well as bridged client networks. PVC does however support specifying a separate physical device for bridged client networks, for instance to separate these networks onto a different physical interface from the main cluster networks.

#### VXLAN (managed) Client Networks

docs/faq.md

# Frequently Asked Questions about Parallel Virtual Cluster

## General Questions

### What is it?

PVC is a virtual machine management suite designed around high-availability. It can be considered an alternative to ProxMox, VMware, Nutanix, and other similar solutions that manage not just the VMs, but the surrounding infrastructure as well.

### Why would you make this?

The full story can be found in the [about page](https://parallelvirtualcluster.readthedocs.io/en/latest/about), but after becoming frustrated by numerous other management tools, I discovered that what I wanted didn't exist as FLOSS software, so I built it myself.

### Is PVC right for me?

PVC might be right for you if your requirements are:

1. You need KVM-based VMs.
2. You want management of storage and networking (a.k.a. "batteries-included") in the same tool.
3. You want hypervisor-level redundancy, able to tolerate hypervisor downtime seamlessly, for all elements of the stack.

I built PVC for my homelab first, found a perfect usecase with my employer, and think it might be useful to you too.

### Is 3 hypervisors really the minimum?

For a redundant cluster, yes. PVC requires a majority quorum for several subsystems, and the smallest possible majority quorum is 2/3. That said, you can run PVC on a single node for testing/lab purposes without host-level redundancy, should you wish to do so.

## Feature Questions

### Does PVC support Docker/Kubernetes/LXC/etc.?

No. PVC supports only KVM VMs. To run Docker containers, etc., you would need to run a VM which then runs your containers.

### Does PVC have a WebUI?

Not yet. Right now, PVC management is done almost exclusively with an API and the included CLI interface to that API. A WebUI could and likely will be built in the future, but I'm not a frontend developer.

## Storage Questions

### Can I use RAID-5 with PVC?

The short answer is no. The long answer is: Ceph, the storage backend used by PVC, does support "erasure coded" pools which implement a RAID-5-like functionality. PVC does not support this for several reasons. If you use PVC, you must accept at the very least a 2x storage penalty, and for true safety and resiliency a 3x storage penalty, for VM storage. This is a trade-off of the architecture.

### Can I use spinning HDDs with PVC?

You can, but you won't like the results. SSDs are effectively required to obtain any sort of reasonable performance when running multiple VMs. Ideally, datacentre-grade SSDs as well, due to their significantly increased write endurance.

### What Ceph version does PVC use?

PVC requires Ceph 14.x (Nautilus). The official PVC repository includes Ceph 14.2.8. Debian Buster by default includes only 12.x (Luminous).


0. Perform the initial bootstrap. From the `pvc-ansible` repository directory, execute the following `ansible-playbook` command, replacing `<cluster_name>` with the Ansible group name from the `hosts` file. Make special note of the additional `bootstrap=yes` variable, which tells the playbook that this is an initial bootstrap run.

`$ ansible-playbook -v -i hosts pvc.yml -l <cluster_name> -e bootstrap=yes`

**WARNING:** Never rerun this playbook with the `-e bootstrap=yes` option against an active cluster. This will have unintended, disastrous consequences.

0. Wait for the Ansible playbook run to finish. Once completed, the cluster bootstrap will be finished, and all 3 nodes will have rebooted into a working PVC cluster.

0. Install the CLI client on your administrative host, and add and verify connectivity to the cluster; this will also verify that the API is working. You will need to know the cluster upstream floating IP address here, and if you configured SSL or authentication for the API in your `group_vars`, adjust the first command as needed (see `pvc cluster add -h` for details).

`$ pvc cluster add -a <upstream_floating_ip> mycluster`

`$ pvc -c mycluster node list`

We can also set a default cluster by exporting the `PVC_CLUSTER` environment variable, to avoid requiring `-c cluster` with every subsequent command:

`$ export PVC_CLUSTER="mycluster"`

### Part Four - Configuring the Ceph storage cluster

All steps in this and following sections can be performed using either the CLI client or the HTTP API; for clarity, only the CLI commands are shown.

0. Determine the Ceph OSD block devices on each host, via an `ssh` shell. For instance, use `lsblk` or check `/dev/disk/by-path` to show the block devices by their physical SAS/SATA bus location, and obtain the relevant `/dev/sdX` name for each disk you wish to be a Ceph OSD on each host.
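For example, the following standard commands can help identify candidate disks (the output will of course differ per host):

```
# Show block devices with sizes, types, and any existing mountpoints
$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Map physical SAS/SATA bus locations to their /dev/sdX names
$ ls -l /dev/disk/by-path
```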

0. Add each OSD device to each host. The general command is:

`$ pvc storage osd add --weight <weight> <node> <device>`

For example:

`$ pvc storage osd add --weight 1.0 pvchv3 /dev/sdb`

`$ pvc storage osd add --weight 1.0 pvchv3 /dev/sdc`

**NOTE:** On the CLI, the `--weight` argument is optional, and defaults to `1.0`. In the API, it must be specified explicitly. OSD weights determine the relative amount of data which can fit onto each OSD. Under normal circumstances, you would want all OSDs to be of identical size, and hence all should have the same weight. If your OSDs are instead different sizes, the weight should be proportional to the size, e.g. `1.0` for a 100GB disk, `2.0` for a 200GB disk, etc. For more details, see the Ceph documentation.

**NOTE:** OSD commands wait for the action to complete on the node, and can take some time.

**NOTE:** You can add OSDs in any order you wish; for instance, you can add the first OSD to each node and then add the second to each node, or you can add all nodes' OSDs together at once like the example. This ordering does not affect the cluster in any way.

0. Verify that the OSDs were added and are functional (`up` and `in`):

`$ pvc storage osd list`

0. Create an RBD pool. For example, to create a pool named `vms` with 256 placement groups (a good default with 6 OSD disks), run the command as follows:

`$ pvc storage pool add vms 256`

**NOTE:** Ceph placement groups are a complex topic; as a general rule it's easier to grow than shrink, so start small and grow as your cluster grows. The general formula to calculate the ideal number of PGs is `pgs * maxcopies / osds = ~250`, with `pgs` rounded down to the closest power of 2; generally, you want as close to 250 PGs per OSD as possible, but no more than 250. With 3-6 OSDs, 256 is a good number, and with 9+ OSDs, 512 is a good number. Ceph will error if the total number exceeds the limit. For more details see the Ceph documentation and the [placement group calculator](https://ceph.com/pgcalc/).
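As a quick sketch of that arithmetic (an illustrative Python helper only, not part of PVC):

```
# Rough PG-count helper following the formula above:
# pgs * maxcopies / osds ~= 250, with pgs rounded down to a power of 2
def ideal_pgs(osds, maxcopies=3, target_per_osd=250):
    raw = target_per_osd * osds / maxcopies
    pgs = 1
    while pgs * 2 <= raw:  # round down to the closest power of 2
        pgs *= 2
    return pgs

print(ideal_pgs(osds=6))  # 256, matching the guidance for 3-6 OSDs
print(ideal_pgs(osds=9))  # 512, matching the guidance for 9+ OSDs
```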

**NOTE:** As detailed in the [cluster architecture documentation](/cluster-architecture), you can also set a custom replica configuration for each pool if the default of 3 replica copies with 2 minimum copies is not acceptable. See `pvc storage pool add -h` or that document for full details.

0. Verify that the pool was added:

`$ pvc storage pool list`

### Part Five - Creating virtual networks

0. Determine a domain name and an IPv4 and/or IPv6 network for your first client network, as well as any other client networks you may wish to create. These networks should never overlap with the cluster networks. For full details on the client network types, see the [cluster architecture documentation](/cluster-architecture).

0. Create the virtual network. There are many options here, so see `pvc network add -h` for details.

For example, to create the managed (EVPN VXLAN) network `100` with subnet `10.100.0.0/24`, gateway `.1` and DHCP from `.100` to `.199`, run the command as follows:

`$ pvc network add 100 --type managed --description my-managed-network --domain myhosts.local --ipnet 10.100.0.0/24 --gateway 10.100.0.1 --dhcp --dhcp-start 10.100.0.100 --dhcp-end 10.100.0.199`

For another example, to create the static bridged (switch-configured, tagged VLAN, with no PVC management of IPs) network `200`, run the command as follows:

`$ pvc network add 200 --type bridged --description my-bridged-network`

**NOTE:** Network descriptions cannot contain spaces or special characters; keep them short, sweet, and dash- or underscore-delimited.

0. Verify that the network(s) were added:

`$ pvc network list`

0. On the upstream router, configure one of:

a) A BGP neighbour relationship with the cluster upstream floating address to automatically learn routes.

b) Static routes for the configured client IP networks towards the cluster upstream floating address.
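For option b), a hedged illustration on a Linux-based upstream router, reusing the example managed network from above and an assumed upstream floating address of `10.0.0.250`:

```
# Send traffic for the managed client network via the cluster's
# upstream floating address (both addresses are assumptions)
$ ip route add 10.100.0.0/24 via 10.0.0.250
```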

0. On the upstream router, if required, configure NAT for the configured client IP networks.

0. Verify the client networks are reachable by pinging the managed gateway from outside the cluster.

### You're Done!

0. Set all 3 nodes to `ready` state, allowing them to run virtual machines. The general command is:

`$ pvc node ready <node>`

Congratulations, you now have a basic PVC cluster, ready to run your VMs.
For next steps, see the [Provisioner manual](/manuals/provisioner) for details on how to use the PVC provisioner to create new Virtual Machines, as well as the [CLI manual](/manuals/cli) and [API manual](/manuals/api) for details on day-to-day usage of PVC.

README.md

<br/><br/>
<a href="https://github.com/parallelvirtualcluster/pvc"><img alt="License" src="https://img.shields.io/github/license/parallelvirtualcluster/pvc"/></a>
<a href="https://github.com/parallelvirtualcluster/pvc/releases"><img alt="Release" src="https://img.shields.io/github/release-pre/parallelvirtualcluster/pvc"/></a>
<a href="https://parallelvirtualcluster.readthedocs.io/en/latest/?badge=latest"><img alt="Documentation Status" src="https://readthedocs.org/projects/parallelvirtualcluster/badge/?version=latest"/></a>
</p>

## Getting Started

To get started with PVC, please see the [About](https://parallelvirtualcluster.readthedocs.io/en/latest/about/) page for general information about the project, and the [Getting Started](https://parallelvirtualcluster.readthedocs.io/en/latest/getting-started/) page for details on configuring your cluster.

## Changelog

#### v0.9.10

* Moves OSD stats uploading to primary, eliminating reporting failures while hosts are down
* Documentation updates
* Significantly improves RBD locking behaviour in several situations, eliminating cold-cluster start issues and failed VM boot-ups after crashes
* Fixes some timeout delays with fencing
* Fixes bug in validating YAML provisioner userdata

#### v0.9.9

* Adds documentation updates
* Removes single-element list stripping and fixes surrounding bugs
* Adds additional fields to some API endpoints for ease of parsing by clients
* Fixes bugs with network configuration

#### v0.9.8

* Adds support for cluster backup/restore
* Moves location of `init` command in CLI to make room for the above
* Cleans up some invalid help messages from the API

#### v0.9.7

* Fixes bug with provisioner system template modifications

#### v0.9.6

* Fixes bug with migrations

#### v0.9.5

* Fixes bug with line count in log follow
* Fixes bug with disk stat output being None
* Adds short pretty health output
* Documentation updates

#### v0.9.4

* Fixes major bug in OVA parser

#### v0.9.3

* Fixes bugs with image & OVA upload parsing

docs/manuals/ansible.md

The PVC role configures all the dependencies of PVC, including storage and networking:

* Install and configure FRRouting.

* Install and configure the main PVC daemon and API client, including initializing the PVC cluster (`pvc task init`).

## Completion

docs/manuals/provisioner.md

# PVC Provisioner Manual

The PVC provisioner is a subsection of the main PVC API. It interfaces directly with the Zookeeper database using the common client functions, and with the Patroni PostgreSQL database to store details. The provisioner also interfaces directly with the Ceph storage cluster, for mapping volumes, creating filesystems, and installing guests.

Details of the Provisioner API interface can be found in [the API manual](/manuals/api).

- [PVC Provisioner Manual](#pvc-provisioner-manual)
  * [Overview](#overview)
  * [PVC Provisioner concepts](#pvc-provisioner-concepts)
    + [Templates](#templates)
    + [Userdata](#cloud-init-userdata)
    + [Scripts](#provisioning-scripts)
    + [Profiles](#profiles)
  * [Deploying VMs from provisioner scripts](#deploying-vms-from-provisioner-scripts)
  * [Deploying VMs from OVA images](#deploying-vms-from-ova-images)
    + [Uploading an OVA](#uploading-an-ova)
    + [OVA limitations](#ova-limitations)

## Overview

The purpose of the Provisioner API is to provide a convenient way for administrators to automate the creation of new virtual machines on the PVC cluster.

The Provisioner allows the administrator to construct descriptions of VMs, called profiles, which include system resource specifications, network interfaces, disks, cloud-init userdata, and installation scripts. These profiles are highly modular, allowing the administrator to specify arbitrary combinations of the mentioned VM features with which to build new VMs.

The provisioner supports creating VMs based on installation scripts, by cloning existing volumes, and by uploading OVA image templates to the cluster.

Examples in the following sections use the CLI exclusively for demonstration purposes. For details of the underlying API calls, please see the [API interface reference](/manuals/api-reference.html).

Use of the PVC Provisioner is not required. Administrators can always perform their own installation tasks, and the provisioner is not specially integrated, calling various other API commands as though they were run from the CLI or API.

# PVC Provisioner concepts

Before explaining how to create VMs using either OVA images or installer scripts, we must discuss the concepts used to construct the PVC provisioner system.

## Templates

Templates are the building blocks of VMs. Each template type specifies part of the configuration of a VM, and when combined together later into profiles, provide a full description of the VM resources.

Templates are used to provide flexibility for the administrator. For instance, one could specify some standard core resources for different VMs, but then specify a different set of storage devices and networks for each one. This flexibility is at the heart of this system, allowing the administrator to construct a complex set of VM configurations from a few basic templates.

The PVC Provisioner features three types of templates: System Templates, Network Templates, and Disk Templates.

### System Templates

System templates specify the basic resources of the virtual machine: vCPUs, memory, serial/VNC consoles, and PVC configuration metadata (migration methods, node limits, etc.). Each profile requires a single system template.

The simplest valid template will specify a number of vCPUs and an amount of vRAM; additional details are optional and can be specified if required.

Serial consoles are required to make use of the `pvc vm log` functionality, via console logfiles in `/var/log/libvirt` on the nodes. VMs without a serial console show an empty log. Note that the guest operating system must also be configured to provide output to this serial console for this functionality to work as expected.

VNC consoles permit graphical access to the VM. By default, the VNC interface listens only on 127.0.0.1 on its parent node; the VNC bind configuration can override this to listen on other interfaces, including `0.0.0.0` for all.

PVC does not currently support SPICE or any other non-VNC consoles.
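For example, once a VM built from a serial-console template is running, its console output can be followed live with the CLI (a sketch with an assumed VM name; see `pvc vm log -h` for the exact options):

```
# Follow the serial console log of a hypothetical VM named "test1"
$ pvc vm log -f test1
```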

#### Examples

```
$ pvc provisioner template system list
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"

System templates:

Name        ID  vCPUs  vRAM [MB]  Consoles: Serial  VNC    VNC bind  Metadata: Limit  Selector  Autostart  Migration
ext-lg      80  4      8192       False             False  None      None             None      False      None
ext-lg-ser  81  4      8192       True              False  None      None             None      False      None
ext-lg-vnc  82  4      8192       False             True   0.0.0.0   None             None      False      None
ext-sm-lim  83  1      1024       True              False  None      pvchv1,pvchv2    mem       True       live
```

* The first example specifies a template with 4 vCPUs and 8GB of RAM. It has no serial or VNC consoles, and no non-default metadata, forming the most basic possible system template.

* The second example specifies a template with the same vCPU and RAM quantities as the first, but with a serial console as well. VMs using this template will be able to make use of `pvc vm log` as long as their guest operating system is configured to use it.

* The third example specifies a template with an alternate console to the second, in this case a VNC console bound to `0.0.0.0` (all interfaces). VNC ports are always auto-selected due to the dynamic nature of PVC, and the administrator can connect to them once the VM is running by determining the port on the hosting hypervisor (e.g. with `netstat -tl`).

* The fourth example shows the ability to set PVC cluster metadata in a system template. VMs with this template will be forcibly limited to running on the hypervisors `pvchv1` and `pvchv2`, but no others, will explicitly use the `mem` (free memory) selector when choosing migration or deployment targets, will be set to automatically start on reboot of their hypervisor, and will be limited to live migration between nodes. For full details on what these options mean, see `pvc vm meta -h`.

### Network Templates

Network templates specify which PVC networks the virtual machine will be bound to, as well as the method used to calculate MAC addresses for VM interfaces. Networks are specified by their VNI ID within PVC.
A network template requires at least one network VNI to be valid, and is created in two stages. First, `pvc provisioner template network add` adds the template itself, along with the optional MAC template. Second, `pvc provisioner template network vni add` adds a VNI into the network template. VNIs are always shown and created in the order added; to move networks around they must be removed then re-added in the proper order; this will not affect existing VMs provisioned with the template.
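A sketch of this two-stage flow (the template name and VNI are assumptions; see `pvc provisioner template network add -h` and `pvc provisioner template network vni add -h` for the full option lists):

```
# Stage 1: create the network template itself
$ pvc provisioner template network add ext-101

# Stage 2: add a VNI to the template; repeat in order for each desired network
$ pvc provisioner template network vni add ext-101 101
```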

In some cases, it may be useful for the administrator to specify a static MAC address pattern for a set of VMs, for instance if they must get consistent DHCP reservations between rebuilds. Such a MAC address template can be specified when adding a new network template, using a standardized layout and set of interpolated variables. This is an optional feature; if no MAC template is specified, VMs will be configured with random MAC addresses for each interface at deploy time.

#### Examples

```
$ pvc provisioner template network list
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"

Network templates:

Name       ID  MAC template                  Network VNIs
ext-101    80  None                          101
ext-11X    81  None                          110,1101
fixed-mac  82  {prefix}:ff:ff:{vmid}{netid}  1000,1001,1002
```

* The first example shows a simple single-VNI network with no MAC template.

* The second example shows a dual-VNI network with no MAC template. Note the ordering; as mentioned, the first VNI will be provisioned on `eth0`, the second VNI on `eth1`, etc.

* The third example shows a triple-VNI network with a MAC template. The variable names shown are literal, while the `f` values are user-configurable and must be set to valid hexadecimal values by the administrator to uniquely identify the MAC address (in this case, using `ff:ff` for that segment). The variables are interpolated at deploy time as follows:

* The `{prefix}` variable is replaced by the provisioner with a standard prefix (`52:54:01`), which is different from the randomly-generated MAC prefix (`52:54:00`) to avoid accidental overlap of MAC addresses. These OUI prefixes are not assigned to any vendor by the IEEE and thus should not conflict with any (real, standards-compliant) devices on the network.

* The `{vmid}` variable is replaced by a single hexadecimal digit representing the VM's ID, the numerical suffix portion of its name (e.g. `myvm2` will have ID 2); VMs without a suffix numeral in their names have ID 0. VMs with IDs greater than 15 (hexadecimal `f`) will wrap back to 0, so a single MAC template should never be used by more than 16 VMs (numbered 0-15).

* The `{netid}` variable is replaced by a single hexadecimal digit representing the sequential identifier, starting at 0, of the interface within the template (i.e. the first interface is 0, the second is 1, etc.). Like the VM ID, network IDs greater than 15 (hexadecimal `f`) will wrap back to 0, so a single VM should never have more than 16 interfaces.

* The location of the two per-VM variables can be adjusted at the administrator's discretion, or removed if not required (e.g. a single-network template, or a template for a single VM). In such situations, be careful to avoid accidental overlap with other templates' variable portions.

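As a worked illustration of this interpolation, the following hypothetical Python helper (not part of PVC) mimics the rules above; for a VM named `myvm2` (`vmid=2`), the template `{prefix}:ff:ff:{vmid}{netid}` yields `52:54:01:ff:ff:21` on its second interface (`netid=1`):

```
# Hypothetical sketch of the MAC template interpolation described above
def interpolate_mac(template, vmid, netid):
    return (template
            .replace("{prefix}", "52:54:01")               # fixed provisioner prefix
            .replace("{vmid}", format(vmid % 16, "x"))     # VM ID, wrapping past 15
            .replace("{netid}", format(netid % 16, "x")))  # interface ID, wrapping past 15

print(interpolate_mac("{prefix}:ff:ff:{vmid}{netid}", vmid=2, netid=1))
# -> 52:54:01:ff:ff:21
```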
### Disk Templates

Disk templates specify the disk layout, including filesystem and mountpoint for scripted deployments, for the VM. Disks are specified by their virtual disk ID in Libvirt, in either `sdX` or `vdX` format, and sizes are always specified in GB. Disks may also reference other storage volumes, which will then be cloned during provisioning.

For additional flexibility, the volume filesystem and mountpoint are optional; such volumes will be created and attached to the VM but will not be modified during provisioning.

All storage volumes created by the provisioner at deploy time, regardless of source or type, will be named in the format `<vmname>_<id>`, for instance `myvm_sda`.

#### Examples

```
$ pvc provisioner template storage list
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"

Storage templates:

Name           ID  Disk ID  Pool  Source Volume  Size [GB]  Filesystem  Arguments  Mountpoint
standard-ext4  21
                   sda      vms   None           2          ext4        -L=root    /
                   sdb      vms   None           4          ext4        -L=var     /var
                   sdc      vms   None           4          ext4        -L=log     /var/log
large-cloned   22
                   sda      vms   template_sda   None       None        None       None
                   sdb      vms   None           40         None        None       None
```

* The first example shows a simple 3-disk layout suitable for most Linux distributions. Each volume is in pool `vms`, with an `ext4` filesystem, an argument specifying a disk label, and a mountpoint to which the volume will be mounted when deploying the VM. All 3 volumes will be created at deploy time. When deploying VMs using Scripts detailed below, this is the normal format that storage templates should take, to ensure that all block devices are formatted and mounted in the proper place for the script to take over and install the operating system to them.

* The second example shows both a cloned volume and a blank volume. At deploy time, the Source Volume for the `sda` device will be cloned and attached to the VM at `sda`. The second volume will be created at deploy time, but will not be formatted or mounted, and will thus show as an empty block device inside the VM. This type of storage template is more suited to devices that do not use the Script install method, and are instead cloned from a source volume, either another running VM, or a manually-uploaded disk image.

* Unformatted block devices as shown in the second example can be used in any type of storage template, though care should be taken to consider their purpose; unformatted block devices are completely ignored by the Script at deploy time.

## Cloud-Init Userdata

PVC allows the sending of arbitrary cloud-init userdata to VMs on boot-up. It uses an Amazon AWS EC2-style metadata service, listening at the link-local IP `169.254.169.254` on port `80`, to deliver basic VM information and this userdata to the VMs. The metadata to be sent is based dynamically on the assigned profile of the VM at boot time.

Both single-function and multipart cloud-init userdata is supported. Full examples can be found under `/usr/share/pvc/provisioner/examples` on any PVC coordinator node.

The default userdata document "empty" can be used to skip userdata for a profile.
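For instance, a minimal single-function cloud-config userdata document might look like the following sketch (the user name and SSH key are placeholders):

```
#cloud-config
users:
  - name: deploy
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... admin@example.com
    sudo: "ALL=(ALL) NOPASSWD:ALL"
```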

#### Examples

```
$ pvc provisioner userdata list
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"

[...]
basic-ssh  11  Content-Type: text/cloud-config; charset="us-ascii"
[...]
```

* The first example is the default, always-present `empty` document, which is sent to invalid VMs if requested, or can be configured explicitly for profiles that do not require cloud-init userdata, instead of leaving that section of the profile as `None`.

* The second, truncated, example is the start of a normal single-function userdata document. For full details on the contents of these documents, see the cloud-init documentation.

## Provisioning Scripts

The PVC provisioner provides a scripting framework in order to automate VM installation. This is generally the most useful with UNIX-like systems which can be installed over the network via shell scripts. For instance, the script might install a Debian VM using `debootstrap` or a Red Hat VM using `rpmstrap`. The PVC Ansible system will automatically install `debootstrap` on coordinator nodes, to allow out-of-the-box deployment of Debian-based VMs with `debootstrap` and the example script shipped with PVC (see below); any other deployment tool must be installed separately onto all PVC coordinator nodes, or installed by the script itself (e.g. using `os.system('apt-get install ...')`, `requests` to download a script, etc.).

Provisioner scripts are written in Python 3 and are called in a standardized way during the provisioning sequence. A single function called `install` is called during this sequence to perform arbitrary tasks. At execution time, the script is passed several default keyword arguments detailed below, and can also be passed arbitrary arguments defined either in the provisioner profile, or on the `provisioner create` CLI.

A full example script to perform a `debootstrap` Debian installation can be found under `/usr/share/pvc/provisioner/examples` on any PVC coordinator node.

The default script "empty" can be used to skip scripted installation for a profile. Additionally, profiles with no disk mountpoints (and specifically, no root `/` mountpoint) will skip scripted installation.

**WARNING**: It is important to remember that these provisioning scripts will run with the same privileges as the provisioner API daemon (usually root) on the system running the daemon. THIS MAY POSE A SECURITY RISK. However, the intent is that administrators of the cluster are the only ones allowed to specify these scripts, and that they check them thoroughly when adding them to the system, as well as limit access to the provisioning API to trusted sources. If neither of these conditions are possible, for instance if arbitrary users must specify custom scripts without administrator oversight, then the PVC provisioner script system may not be ideal.

**NOTE**: It is often required to perform a `chroot` to perform some aspects of the install process. The PVC script fully supports this, though it is relatively complicated. The example script details how to achieve this.

#### The `install` function

The `install` function is the main entrypoint for a provisioning script, and is the only part of the script that is explicitly called. The provisioner calls this function after setting up the temporary install directory and mounting the volumes. Thus, this script can then perform any sort of tasks required in the VM to install it, and then finishes, after which the main provisioner resumes control to unmount the volumes and finish the VM creation.

It is good practice in these scripts to "fail through", since terminating the script abruptly would affect the entire provisioning flow and thus may leave the half-provisioned VM in an undefined state. Care should be taken to `try`/`except` possible errors, and to attempt to finish the script execution (or `return`) even if some aspect fails.

This function is passed a number of keyword arguments that it can then use during installation. These include those specified by the administrator in the profile, on the CLI at deploy time, as well as a number of default arguments:

##### `vm_name`

The `vm_name` keyword argument contains the name of the new VM from PVC's perspective.

##### `vm_id`

The `vm_id` keyword argument contains the VM identifier (the last numeral of the VM name, or `0` for a VM that does not end in a numeral).

##### `temporary_directory`

The `temporary_directory` keyword argument contains the path to the temporary directory on which the new VM's disks are mounted. The function *must* perform any installation steps to/under this directory.

##### `disks`

The `disks` keyword argument contains a Python list of the configured disks, as dictionaries of values as specified in the Disk template. The function may use these values as appropriate, for instance to specify an `/etc/fstab`.

##### `networks`

The `networks` keyword argument contains a Python list of the configured networks, as dictionaries of values as specified in the Network template. The function may use these values as appropriate, for instance to write an `/etc/network/interfaces` file.
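Putting these together, a minimal skeleton of an `install` function might look like the following sketch; the real `debootstrap` example under `/usr/share/pvc/provisioner/examples` is far more complete, and the file names and steps here are illustrative only:

```
#!/usr/bin/env python3
# Illustrative skeleton of a PVC provisioner script (not the shipped example)
import os

def install(**kwargs):
    vm_name = kwargs['vm_name']
    temp_dir = kwargs['temporary_directory']
    disks = kwargs['disks']

    try:
        # A real script would install an OS here, e.g. with debootstrap,
        # then chroot into temp_dir to configure the guest further.
        # As a placeholder, record the disk layout inside the new root.
        with open(os.path.join(temp_dir, 'disk-layout.txt'), 'w') as fh:
            for disk in disks:
                fh.write(f"{disk}\n")
    except Exception as err:
        # "Fail through": report and return rather than raising, so the
        # provisioner can still unmount the volumes and finish cleanly.
        print(f"Installation step failed for {vm_name}: {err}")
```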

#### Examples

```
$ pvc provisioner script list

Name         ID  Script
empty        1
debootstrap  2   #!/usr/bin/env python3

# debootstrap_script.py - PVC Provisioner example script for Debootstrap
# Part of the Parallel Virtual Cluster (PVC) system
def install(**kwargs):
    vm_name = kwargs['vm_name']
[...]
```

* The first example is the default, always-present `empty` document, which is used if the VM does not specify a valid root mountpoint, or can be configured explicitly for profiles that do not require scripts, instead of leaving that section of the profile as `None`.

* The second, truncated, example is the start of a normal Python install script. The full example is provided in the folder mentioned above on all PVC coordinator nodes.

## Profiles

Provisioner profiles combine the templates, userdata, and scripts together into dynamic configurations which are then applied to the VM when provisioned. The VM retains the record of this profile name in its configuration for the full lifetime of the VM on the cluster; this is primarily used for cloud-init functionality, but may also serve as a convenient administrator reference.

Additional arguments to the installation script can be specified along with the profile, to allow further customization of the installation if required.
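A sketch of profile creation matching the example below; the option names reflect recent PVC versions and are assumptions here, so confirm them with `pvc provisioner profile add -h`:

```
$ pvc provisioner profile add std-large \
    --system-template ext-lg-ser --network-template ext-101 \
    --storage-template standard-ext4 --userdata basic-ssh \
    --script debootstrap --script-arg deb_release=buster
```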

#### Examples

```
$ pvc provisioner profile list
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"

Name       ID  Templates: System  Network  Storage        Data: Userdata  Script       Script Arguments
std-large  41  ext-lg-ser         ext-101  standard-ext4  basic-ssh       debootstrap  deb_release=buster
```

# Deploying VMs from provisioner scripts

Once a profile with a Script value is defined, creating VMs with the provisioner is as simple as specifying a VM name and a profile to use.

```
$ pvc provisioner create test1 std-large
[...]
af1d0682-53e8-4141-982f-f672e2f23261  active   celery@pvchv1  test1  std-large
43d57a2d-8d0d-42f6-90df-cc39956825a9  pending  celery@pvchv1  testX  std-large  False  False
```

The `--wait` option can be given to the create command. This will cause the command to block and provide a visual progress indicator while the provisioning occurs.

```
|
||||
$ pvc provisioner create test2 std-large
|
||||
@ -243,7 +296,7 @@ Waiting for task to start..... done.
|
||||
SUCCESS: VM "test2" with profile "std-large" has been provisioned and started successfully
|
||||
```
|
||||
|
||||
The administrator can also specify whether or not to automatically define and start the VM when launching a provisioner job, using the `--define`/`--no-define` and `--start`/`--no-start` options. The default is to define and start a VM. `--no-define` implies `--no-start` as there would be no VM to start.
|
||||
The administrator can also specify whether or not to automatically define and start the VM when launching a provisioner job, using the `--define`/`--no-define` and `--start`/`--no-start` options. The default is to define and start a VM. `--no-define` implies `--no-start` as there would be no VM to start. Using `--no-start` can be useful if other tasks must be performed before starting the VM for the first time, and `--no-define` can be useful for creating "template" VMs which would then be cloned by other profiles.
|
||||
|
||||
```
|
||||
$ pvc provisioner create test3 std-large --no-define
|
||||
@ -252,8 +305,49 @@ Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"
|
||||
Task ID: 43d57a2d-8d0d-42f6-90df-cc39956825a9
|
||||
```
|
||||
|
||||
A VM set to do so will be defined on the cluster early in the provisioning process, before creating disks or executing the provisioning script, and with the special status "provision". Once completed, if the VM is not set to start automatically, the state will remain "provision" (with the VM not running) until its state is explicitly changed wit the client (or via autostart when its node returns to ready state).
|
||||
Finally, the administrator may specify further, one-time script arguments at install time, to further tune the VM installation (e.g. setting an FQDN or some conditional to trigger additional steps in the script).
|
||||
|
||||
Provisioning jobs are tied to the node that spawned them. If the primary node changes, provisioning jobs will continue to run against that node until they are completed or interrupted, but the active API (now on the new primary node) will not have access to any status data from these jobs, until the primary node status is returned to the original host. The CLI will warn the administrator of this if there are active jobs while running node primary or secondary commands.
|
||||
```
|
||||
$ pvc provisioner create test4 std-large --script-arg vm_fqdn=testhost.example.tld --script-arg my_foo=True
|
||||
Using cluster "local" - Host: "10.0.0.1:7370" Scheme: "http" Prefix: "/api/v1"
|
||||
|
||||
Provisioning jobs cannot be cancelled, either before they start or during execution. The administrator should always let an invalid job either complete or fail out automatically, then remove the erroneous VM with the vm remove command.
|
||||
Task ID: 39639f8c-4866-49de-8c51-4179edec0194
|
||||
```
|
||||
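Within the installation script, such one-time arguments would then surface alongside the standard keyword arguments. A minimal sketch, assuming they are passed through as extra kwargs with string values:

```
def install(**kwargs):
    # --script-arg vm_fqdn=... and --script-arg my_foo=... from the example above;
    # both names are illustrative, and values are assumed to arrive as strings
    vm_fqdn = kwargs.get('vm_fqdn', kwargs['vm_name'])  # fall back to the plain VM name
    if kwargs.get('my_foo') == 'True':
        pass  # trigger the additional, conditional installation steps here
```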
**NOTE**: A VM that is set to do so will be defined on the cluster early in the provisioning process, before creating disks or executing the provisioning script, with the special status `provision`. Once completed, if the VM is not set to start automatically, the state will remain `provision`, with the VM not running, until its state is explicitly changed with the client (or via autostart when its node returns to `ready` state).

**NOTE**: Provisioning jobs are tied to the node that spawned them. If the primary node changes, provisioning jobs will continue to run against that node until they are completed, interrupted, or fail, but the active API (now on the new primary node) will not have access to any status data from these jobs, until the primary node status is returned to the original host. The CLI will warn the administrator of this if there are active jobs while running `node primary` or `node secondary` commands.

**NOTE**: Provisioning jobs cannot be cancelled, either before they start or during execution. The administrator should always let an invalid job either complete or fail out automatically, then remove the erroneous VM with the `vm remove` command.
# Deploying VMs from OVA images

PVC supports deploying virtual machines from industry-standard OVA images. OVA images can be uploaded to the cluster with the `pvc provisioner ova` commands, and deployed via the created profile(s) using the `pvc provisioner create` command detailed above for scripted installs; the process is the same in both cases. Additionally, the profile(s) can be modified to suit your specific needs after creation.

## Uploading an OVA

Once the OVA is uploaded to the cluster with the `pvc provisioner ova upload` command, it will be visible in two different places:

* In `pvc provisioner ova list`, one can see all uploaded OVA images as well as details on their disk configurations.

* In `pvc provisioner profile list`, a new profile will be visible which matches the OVA `NAME` from the upload. This profile will have a "Source" of `OVA <NAME>`, and a system template of the same name. This system template will contain the basic configuration of the VM. You may notice that the other templates and data are set to `N/A`. For full details on this, see the next section.
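For reference, an upload might look like the following sketch; the OVA name, file path, and `--pool` option here are illustrative assumptions, so check `pvc provisioner ova upload --help` for the actual signature:

```
$ pvc provisioner ova upload debian10 /tmp/debian10.ova --pool vms
```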
## OVA limitations

PVC does not implement a *complete* OVA framework. While all basic elements of the OVA are included, the following areas require special attention.

### Networks

Because the PVC provisioner has its own conception of networks separate from the OVA profiles, the administrator must perform this mapping themselves: first create a network template, along with the required networks on the PVC cluster, and then modify the profile of the resulting OVA.

The provisioner profile for the OVA can be safely modified to include this new network template at any time, and the resulting VM will be provisioned with these networks.

This setup was chosen specifically to eliminate corner cases. Most OVA images include a single, "default" network interface, and expect the administrator of the hypervisor to modify this later. You can of course do this, but since PVC has its own conception of networks already in the provisioner, it makes more sense to ignore what the OVA specifies, and allow the administrator full control over this portion of the VM config, before deployment. It is thus always important to be aware of the network requirements of your OVA images, especially if they require specific network configurations, and then create a network template to match.
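As a sketch of that workflow (every name and option flag below is an assumption, to be verified against the CLI help):

```
# Create a network template and attach an existing PVC network (VNI 101) to it
$ pvc provisioner template network add ova-nets --mtu 1500
$ pvc provisioner template network vni add ova-nets 101

# Point the OVA profile at the new network template
$ pvc provisioner profile modify <NAME> --network-template ova-nets
```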
### Storage

During import, PVC splits the OVA into its constituent parts, including any disk images (usually VMDK-formatted). It will then create a separate PVC storage volume for each disk image. These storage volumes are then converted at deployment time from the VMDK format to the PVC native raw format based on their included size in the OVA. Once the storage volume for an actual VM deployment is created, it can then be resized as needed.

Because of this, OVA profiles do not include storage templates like other PVC profiles. A storage template can still be added to such a profile, and the block devices will be added after the main block devices. However, this is generally not recommended; it is far better to modify the OVA to add additional volume(s) before uploading it instead.

**WARNING**: Never adjust the sizes of the OVA VMDK-formatted storage volumes (named `ova_<NAME>_sdX`) or remove them without removing the OVA itself in the provisioner; doing so will prevent the deployment of the OVA, specifically the conversion of the images to raw format at deploy time, and render the OVA profile useless.
@@ -10,6 +10,9 @@
            },
            "type": "object"
        },
        "Cluster Data": {
            "type": "object"
        },
        "ClusterStatus": {
            "properties": {
                "health": {
@@ -156,6 +159,10 @@
        },
        "VMMetadata": {
            "properties": {
                "migration_method": {
                    "description": "The preferred migration method (live, shutdown, none)",
                    "type": "string"
                },
                "name": {
                    "description": "The name of the VM",
                    "type": "string"
@@ -671,6 +678,10 @@
                    }
                },
                "type": "object"
            },
            "volume_count": {
                "description": "The number of volumes in the pool",
                "type": "integer"
            }
        },
        "type": "object"
@@ -954,6 +965,10 @@
                "description": "Internal provisioner template ID",
                "type": "integer"
            },
            "migration_method": {
                "description": "The preferred migration method (live, shutdown, none)",
                "type": "string"
            },
            "name": {
                "description": "Template name",
                "type": "string"
@@ -1158,6 +1173,10 @@
                "description": "Whether the VM has been migrated, either \"no\" or \"from <last_node>\"",
                "type": "string"
            },
            "migration_method": {
                "description": "The preferred migration method (live, shutdown, none)",
                "type": "string"
            },
            "name": {
                "description": "The name of the VM",
                "type": "string"
@@ -1198,6 +1217,10 @@
                "description": "The PVC network type",
                "type": "string"
            },
            "vni": {
                "description": "The VNI (PVC network) of the network bridge",
                "type": "integer"
            },
            "wr_bytes": {
                "description": "The number of write bytes on the interface",
                "type": "integer"
@@ -1397,9 +1420,38 @@
            ]
        }
    },
    "/api/v1/backup": {
        "get": {
            "description": "",
            "responses": {
                "200": {
                    "description": "OK",
                    "schema": {
                        "$ref": "#/definitions/Cluster Data"
                    }
                },
                "400": {
                    "description": "Bad request"
                }
            },
            "summary": "Back up the Zookeeper data of a cluster in JSON format",
            "tags": [
                "root"
            ]
        }
    },
    "/api/v1/initialize": {
        "post": {
            "description": "Note: Normally used only once during cluster bootstrap; checks for the existence of the \"/primary_node\" key before proceeding and returns 400 if found",
            "parameters": [
                {
                    "description": "A confirmation string to ensure that the API consumer really means it",
                    "in": "query",
                    "name": "yes-i-really-mean-it",
                    "required": true,
                    "type": "string"
                }
            ],
            "responses": {
                "200": {
                    "description": "OK",
@@ -3933,6 +3985,13 @@
                    "name": "node_autostart",
                    "required": false,
                    "type": "boolean"
                },
                {
                    "description": "The preferred migration method (live, shutdown, none)",
                    "in": "query",
                    "name": "migration_method",
                    "required": false,
                    "type": "string"
                }
            ],
            "responses": {
@@ -4056,6 +4115,13 @@
                    "name": "node_autostart",
                    "required": false,
                    "type": "boolean"
                },
                {
                    "description": "The preferred migration method (live, shutdown, none)",
                    "in": "query",
                    "name": "migration_method",
                    "required": false,
                    "type": "string"
                }
            ],
            "responses": {
@@ -4127,6 +4193,12 @@
                    "in": "query",
                    "name": "node_autostart",
                    "type": "boolean"
                },
                {
                    "description": "The preferred migration method (live, shutdown, none)",
                    "in": "query",
                    "name": "migration_method",
                    "type": "string"
                }
            ],
            "responses": {
@@ -4319,6 +4391,51 @@
            ]
        }
    },
    "/api/v1/restore": {
        "post": {
            "description": "",
            "parameters": [
                {
                    "description": "A confirmation string to ensure that the API consumer really means it",
                    "in": "query",
                    "name": "yes-i-really-mean-it",
                    "required": true,
                    "type": "string"
                },
                {
                    "description": "The raw JSON cluster backup data",
                    "in": "query",
                    "name": "cluster_data",
                    "required": true,
                    "type": "string"
                }
            ],
            "responses": {
                "200": {
                    "description": "OK",
                    "schema": {
                        "$ref": "#/definitions/Message"
                    }
                },
                "400": {
                    "description": "Bad request",
                    "schema": {
                        "$ref": "#/definitions/Message"
                    }
                },
                "500": {
                    "description": "Restore error or code failure",
                    "schema": {
                        "$ref": "#/definitions/Message"
                    }
                }
            },
            "summary": "Restore a backup over the cluster; destroys the existing data",
            "tags": [
                "root"
            ]
        }
    },
    "/api/v1/status": {
        "get": {
            "description": "",
@@ -5463,6 +5580,19 @@
                    "name": "autostart",
                    "required": false,
                    "type": "boolean"
                },
                {
                    "default": "none",
                    "description": "The preferred migration method (live, shutdown, none)",
                    "enum": [
                        "live",
                        "shutdown",
                        "none"
                    ],
                    "in": "query",
                    "name": "migration_method",
                    "required": false,
                    "type": "string"
                }
            ],
            "responses": {
@@ -5587,6 +5717,19 @@
                    "name": "autostart",
                    "required": false,
                    "type": "boolean"
                },
                {
                    "default": "none",
                    "description": "The preferred migration method (live, shutdown, none)",
                    "enum": [
                        "live",
                        "shutdown",
                        "none"
                    ],
                    "in": "query",
                    "name": "migration_method",
                    "required": false,
                    "type": "string"
                }
            ],
            "responses": {
@@ -5758,6 +5901,19 @@
                    "name": "profile",
                    "required": false,
                    "type": "string"
                },
                {
                    "default": "none",
                    "description": "The preferred migration method (live, shutdown, none)",
                    "enum": [
                        "live",
                        "shutdown",
                        "none"
                    ],
                    "in": "query",
                    "name": "migration_method",
                    "required": false,
                    "type": "string"
                }
            ],
            "responses": {
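Taken together, the `/api/v1/backup` and `/api/v1/restore` endpoints defined above allow a cluster's Zookeeper data to be exported and later re-imported. A sketch of their use with `curl`, assuming the host and prefix from the CLI examples earlier (note that restore is destructive, and that `cluster_data` is passed as a query parameter per the spec):

```
# Back up the cluster's Zookeeper data to a local JSON file
curl -X GET "http://10.0.0.1:7370/api/v1/backup" -o cluster-backup.json

# Restore it over the cluster, destroying the existing data; -G folds the
# URL-encoded file content into the query string while -X keeps the POST method
curl -X POST -G "http://10.0.0.1:7370/api/v1/restore?yes-i-really-mean-it=yes" \
    --data-urlencode "cluster_data@cluster-backup.json"
```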
@@ -76,8 +76,8 @@ class PowerDNSInstance(object):
        self.dns_server_daemon = None

        # Floating upstreams
        self.vni_floatingipaddr, self.vni_cidrnetmask = self.config['vni_floating_ip'].split('/')
        self.upstream_floatingipaddr, self.upstream_cidrnetmask = self.config['upstream_floating_ip'].split('/')

    def start(self):
        self.logger.out(
@@ -93,7 +93,7 @@ class PowerDNSInstance(object):
            '--disable-syslog=yes',  # Log only to stdout (which is then captured)
            '--disable-axfr=no',  # Allow AXFRs
            '--allow-axfr-ips=0.0.0.0/0',  # Allow AXFRs to anywhere
            '--local-address={},{}'.format(self.vni_floatingipaddr, self.upstream_floatingipaddr),  # Listen on floating IPs
            '--local-port=53',  # On port 53
            '--log-dns-details=on',  # Log details
            '--loglevel=3',  # Log info
@@ -54,7 +54,7 @@ import pvcnoded.CephInstance as CephInstance
import pvcnoded.MetadataAPIInstance as MetadataAPIInstance

# Version string for startup output
version = '0.9.10'

###############################################################################
# PVCD - node daemon startup program
@@ -1275,6 +1275,8 @@ def collect_ceph_stats(queue):
    osd_stats = dict()

    for osd in osd_list:
        if d_osd[osd].node == myhostname:
            osds_this_node += 1
            try:
                this_dump = osd_dump[osd]
                this_dump.update(osd_df[osd])
@@ -1284,12 +1286,12 @@ def collect_ceph_stats(queue):
            except KeyError as e:
                # One or more of the status commands timed out, just continue
                logger.out('Failed to parse OSD stats into dictionary: {}'.format(e), state='w')

    # Upload OSD data for the cluster (primary-only)
    if this_node.router_state == 'primary':
        if debug:
            logger.out("Trigger updates for each OSD", state='d', prefix='ceph-thread')

        for osd in osd_list:
            try:
                stats = json.dumps(osd_stats[osd])
                zkhandler.writedata(zk_conn, {
@@ -1298,7 +1300,6 @@ def collect_ceph_stats(queue):
            except KeyError as e:
                # One or more of the status commands timed out, just continue
                logger.out('Failed to upload OSD stats from dictionary: {}'.format(e), state='w')

    ceph_conn.shutdown()
@@ -1355,9 +1356,11 @@ def collect_vm_stats(queue):
        if instance.getdom() is not None:
            try:
                if instance.getdom().state()[0] != libvirt.VIR_DOMAIN_RUNNING:
                    logger.out("VM {} has failed".format(instance.domname), state='w', prefix='vm-thread')
                    raise
            except Exception:
                # Toggle a state "change"
                logger.out("Resetting state to {} for VM {}".format(instance.getstate(), instance.domname), state='i', prefix='vm-thread')
                zkhandler.writedata(zk_conn, {'/domains/{}/state'.format(domain): instance.getstate()})
        elif instance.getnode() == this_node.name:
            memprov += instance.getmemory()
@@ -64,17 +64,28 @@ class NodeInstance(object):
        self.vcpualloc = 0
        # Floating IP configurations
        if self.config['enable_networking']:
            self.upstream_dev = self.config['upstream_dev']
            self.upstream_floatingipaddr = self.config['upstream_floating_ip'].split('/')[0]
            self.upstream_ipaddr, self.upstream_cidrnetmask = self.config['upstream_dev_ip'].split('/')
            self.vni_dev = self.config['vni_dev']
            self.vni_floatingipaddr = self.config['vni_floating_ip'].split('/')[0]
            self.vni_ipaddr, self.vni_cidrnetmask = self.config['vni_dev_ip'].split('/')
            self.storage_dev = self.config['storage_dev']
            self.storage_floatingipaddr = self.config['storage_floating_ip'].split('/')[0]
            self.storage_ipaddr, self.storage_cidrnetmask = self.config['storage_dev_ip'].split('/')
        else:
            self.upstream_dev = None
            self.upstream_floatingipaddr = None
            self.upstream_ipaddr = None
            self.upstream_cidrnetmask = None
            self.vni_dev = None
            self.vni_floatingipaddr = None
            self.vni_ipaddr = None
            self.vni_cidrnetmask = None
            self.storage_dev = None
            self.storage_floatingipaddr = None
            self.storage_ipaddr = None
            self.storage_cidrnetmask = None
        # Threads
        self.flush_thread = None
        # Flags
@@ -349,13 +360,13 @@ class NodeInstance(object):
        # 1. Add Upstream floating IP
        self.logger.out(
            'Creating floating upstream IP {}/{} on interface {}'.format(
                self.upstream_floatingipaddr,
                self.upstream_cidrnetmask,
                'brupstream'
            ),
            state='o'
        )
        common.createIPAddress(self.upstream_floatingipaddr, self.upstream_cidrnetmask, 'brupstream')
        self.logger.out('Releasing write lock for synchronization phase C', state='i')
        zkhandler.writedata(self.zk_conn, {'/locks/primary_node': ''})
        lock.release()
@@ -367,16 +378,25 @@ class NodeInstance(object):
        lock.acquire()
        self.logger.out('Acquired write lock for synchronization phase D', state='o')
        time.sleep(0.2)  # Time for reader to acquire the lock
        # 2. Add Cluster & Storage floating IP
        self.logger.out(
            'Creating floating management IP {}/{} on interface {}'.format(
                self.vni_floatingipaddr,
                self.vni_cidrnetmask,
                'brcluster'
            ),
            state='o'
        )
        common.createIPAddress(self.vni_floatingipaddr, self.vni_cidrnetmask, 'brcluster')
        self.logger.out(
            'Creating floating storage IP {}/{} on interface {}'.format(
                self.storage_floatingipaddr,
                self.storage_cidrnetmask,
                'brstorage'
            ),
            state='o'
        )
        common.createIPAddress(self.storage_floatingipaddr, self.storage_cidrnetmask, 'brstorage')
        self.logger.out('Releasing write lock for synchronization phase D', state='i')
        zkhandler.writedata(self.zk_conn, {'/locks/primary_node': ''})
        lock.release()
@@ -541,13 +561,13 @@ class NodeInstance(object):
        # 5. Remove Upstream floating IP
        self.logger.out(
            'Removing floating upstream IP {}/{} from interface {}'.format(
                self.upstream_floatingipaddr,
                self.upstream_cidrnetmask,
                'brupstream'
            ),
            state='o'
        )
        common.removeIPAddress(self.upstream_floatingipaddr, self.upstream_cidrnetmask, 'brupstream')
        self.logger.out('Releasing read lock for synchronization phase C', state='i')
        lock.release()
        self.logger.out('Released read lock for synchronization phase C', state='o')
@@ -557,16 +577,25 @@ class NodeInstance(object):
        self.logger.out('Acquiring read lock for synchronization phase D', state='i')
        lock.acquire()
        self.logger.out('Acquired read lock for synchronization phase D', state='o')
        # 6. Remove Cluster & Storage floating IP
        self.logger.out(
            'Removing floating management IP {}/{} from interface {}'.format(
                self.vni_floatingipaddr,
                self.vni_cidrnetmask,
                'brcluster'
            ),
            state='o'
        )
        common.removeIPAddress(self.vni_floatingipaddr, self.vni_cidrnetmask, 'brcluster')
        self.logger.out(
            'Removing floating storage IP {}/{} from interface {}'.format(
                self.storage_floatingipaddr,
                self.storage_cidrnetmask,
                'brstorage'
            ),
            state='o'
        )
        common.removeIPAddress(self.storage_floatingipaddr, self.storage_cidrnetmask, 'brstorage')
        self.logger.out('Releasing read lock for synchronization phase D', state='i')
        lock.release()
        self.logger.out('Released read lock for synchronization phase D', state='o')
@@ -35,7 +35,7 @@ import pvcnoded.VMConsoleWatcherInstance as VMConsoleWatcherInstance
import daemon_lib.common as daemon_common


def flush_locks(zk_conn, logger, dom_uuid, this_node=None):
    logger.out('Flushing RBD locks for VM "{}"'.format(dom_uuid), state='i')
    # Get the list of RBD images
    rbd_list = zkhandler.readdata(zk_conn, '/domains/{}/rbdlist'.format(dom_uuid)).split(',')
@@ -56,11 +56,18 @@ def flush_locks(zk_conn, logger, dom_uuid):
        if lock_list:
            # Loop through the locks
            for lock in lock_list:
                if this_node is not None and zkhandler.readdata(zk_conn, '/domains/{}/state'.format(dom_uuid)) != 'stop' and lock['address'].split(':')[0] != this_node.storage_ipaddr:
                    logger.out('RBD lock does not belong to this host (lock owner: {}): freeing this lock would be unsafe, aborting'.format(lock['address'].split(':')[0]), state='e')
                    zkhandler.writedata(zk_conn, {'/domains/{}/state'.format(dom_uuid): 'fail'})
                    zkhandler.writedata(zk_conn, {'/domains/{}/failedreason'.format(dom_uuid): 'Could not safely free RBD lock {} ({}) on volume {}; stop VM and flush locks manually'.format(lock['id'], lock['address'], rbd)})
                    break
                # Free the lock
                lock_remove_retcode, lock_remove_stdout, lock_remove_stderr = common.run_os_command('rbd lock remove {} "{}" "{}"'.format(rbd, lock['id'], lock['locker']))
                if lock_remove_retcode != 0:
                    logger.out('Failed to free RBD lock "{}" on volume "{}": {}'.format(lock['id'], rbd, lock_remove_stderr), state='e')
                    zkhandler.writedata(zk_conn, {'/domains/{}/state'.format(dom_uuid): 'fail'})
                    zkhandler.writedata(zk_conn, {'/domains/{}/failedreason'.format(dom_uuid): 'Could not free RBD lock {} ({}) on volume {}: {}'.format(lock['id'], lock['address'], rbd, lock_remove_stderr)})
                    break
                logger.out('Freed RBD lock "{}" on volume "{}"'.format(lock['id'], rbd), state='o')

    return True
@@ -74,15 +81,14 @@ def run_command(zk_conn, logger, this_node, data):
    # Flushing VM RBD locks
    if command == 'flush_locks':
        dom_uuid = args

        # Verify that the VM is set to run on this node
        if this_node.d_domain[dom_uuid].getnode() == this_node.name:
            # Lock the command queue
            zk_lock = zkhandler.writelock(zk_conn, '/cmd/domains')
            with zk_lock:
                # Flush the lock
                result = flush_locks(zk_conn, logger, dom_uuid, this_node)
                # Command succeeded
                if result:
                    # Update the command queue
@@ -225,6 +231,17 @@ class VMInstance(object):
        except Exception:
            curstate = 'notstart'

        # Handle situations where the VM crashed or the node unexpectedly rebooted
        if self.getdom() is None or self.getdom().state()[0] != libvirt.VIR_DOMAIN_RUNNING:
            # Flush locks
            self.logger.out('Flushing RBD locks', state='i', prefix='Domain {}'.format(self.domuuid))
            flush_locks(self.zk_conn, self.logger, self.domuuid, self.this_node)
            if zkhandler.readdata(self.zk_conn, '/domains/{}/state'.format(self.domuuid)) == 'fail':
                lv_conn.close()
                self.dom = None
                self.instart = False
                return

        if curstate == libvirt.VIR_DOMAIN_RUNNING:
            # If it is running just update the model
            self.addDomainToList()
@@ -243,7 +260,10 @@ class VMInstance(object):
            self.logger.out('Failed to create VM', state='e', prefix='Domain {}'.format(self.domuuid))
            zkhandler.writedata(self.zk_conn, {'/domains/{}/state'.format(self.domuuid): 'fail'})
            zkhandler.writedata(self.zk_conn, {'/domains/{}/failedreason'.format(self.domuuid): str(e)})
            lv_conn.close()
            self.dom = None
            self.instart = False
            return

        lv_conn.close()
@@ -381,6 +401,7 @@ class VMInstance(object):
            })
            migrate_lock_node.release()
            migrate_lock_state.release()
            self.inmigrate = False
            self.logger.out('Aborted migration: {}'.format(reason), state='i', prefix='Domain {}'.format(self.domuuid))

        # Acquire exclusive lock on the domain node key
@@ -124,6 +124,11 @@ def rebootViaIPMI(ipmi_hostname, ipmi_user, ipmi_password, logger):
    )
    ipmi_reset_retcode, ipmi_reset_stdout, ipmi_reset_stderr = common.run_os_command(ipmi_command_reset)

    if ipmi_reset_retcode != 0:
        logger.out('Failed to reboot dead node', state='e')
        print(ipmi_reset_stderr)
        return False

    time.sleep(2)

    # Ensure the node is powered on
@@ -139,14 +144,14 @@ def rebootViaIPMI(ipmi_hostname, ipmi_user, ipmi_password, logger):
    )
    ipmi_start_retcode, ipmi_start_stdout, ipmi_start_stderr = common.run_os_command(ipmi_command_start)

    if ipmi_start_retcode != 0:
        logger.out('Failed to start powered-off dead node', state='e')
        print(ipmi_start_stderr)
        return False

    # Declare success
    logger.out('Successfully rebooted dead node', state='o')
    return True


#