Merge branch 'sriov'

Implement SR-IOV support in PVC.

Closes #130
2021-06-23 00:58:44 -04:00
21 changed files with 2269 additions and 152 deletions


@ -12,6 +12,7 @@
+ [PVC client networks](#pvc-client-networks)
- [Bridged (unmanaged) Client Networks](#bridged--unmanaged--client-networks)
- [VXLAN (managed) Client Networks](#vxlan--managed--client-networks)
- [SR-IOV Client Networks](#sr-iov-client-networks)
- [Other Client Networks](#other-client-networks)
* [Node Layout: Considering how nodes are laid out](#node-layout--considering-how-nodes-are-laid-out)
+ [Node Functions: Coordinators versus Hypervisors](#node-functions--coordinators-versus-hypervisors)
@ -184,6 +185,26 @@ With this client network type, PVC is in full control of the network. No vLAN co
NOTE: These networks may introduce a bottleneck and tromboning if there is a large amount of external and/or inter-network traffic on the cluster. The administrator should consider this carefully when deciding whether to use managed or bridged networks and properly evaluate the inter-network traffic requirements.
#### SR-IOV Client Networks
The third type of client network is the SR-IOV network. SR-IOV (Single-Root I/O Virtualization) is a technique and feature enabled on modern high-performance NICs (for instance, those from Intel or NVIDIA) which allows a single physical Ethernet port (a "PF" in SR-IOV terminology) to be split, at a hardware level, into multiple virtual Ethernet ports ("VF"s), which can then be managed separately. Starting with version 0.9.21, PVC supports SR-IOV PF and VF configuration at the node level, and these VFs can be passed into VMs in two ways.
SR-IOV's main benefit is to offload bridging and network functions from the hypervisor layer, and direct them onto the hardware itself. This can increase network throughput in some situations, as well as provide near-complete isolation of guest networks from the hypervisors (in contrast with bridges which *can* expose client traffic to the hypervisors, and VXLANs which *do* expose client traffic to the hypervisors). For instance, a VF can have a vLAN specified, and the tagging/untagging of packets is then carried out at the hardware layer.
There are, however, caveats to working with SR-IOV. At the most basic level, the biggest difference between SR-IOV and the other two network types is that SR-IOV must be configured on a per-node basis. That is, each node must have SR-IOV explicitly enabled, its specific PF devices defined, and a set of VFs created at PVC startup. Generally, with identical PVC nodes, this will not be a problem, but it is something to consider, especially if the servers are mismatched in any way. It is also possible to configure some nodes with SR-IOV functionality and others without, though care must be taken in this situation to set node limits in the VM metadata of any VMs which use SR-IOV VFs to prevent failed migrations.
PFs are defined in the `pvcnoded.yml` configuration of each node, via the `sriov_device` list. Each PF can have an arbitrary number of VFs (`vfcount`) allocated, though each NIC vendor and model has specific limits. Once configured, PFs (and in particular the `vfcount` attribute in the driver) are effectively immutable, especially with Intel NICs: they cannot be changed easily without completely flushing the node and rebooting it, so care should be taken to select the desired settings as early in the cluster configuration as possible.
Once created, VFs are also managed on a per-node basis. That is, each VF, on each host, even if they have the exact same device names, is managed separately. For instance, the PF `ens1f0` creating a VF `ens1f0v0` on "`hv1`", can have a different configuration from the identically-named VF `ens1f0v0` on "`hv2`". The administrator is responsible for ensuring consistency here, and for ensuring that devices do not overlap (e.g. assigning the same VF name to VMs on two separate nodes which might migrate to each other). PVC will however explicitly prevent two VMs from being assigned to the same VF on the same node, even if this may be technically possible in some cases.
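The per-node ownership rule described above can be illustrated with a minimal Python sketch. It is not PVC's implementation: the `VFRegistry` class, the `VFInUseError` exception, and the in-memory dictionary are hypothetical names chosen for illustration. The only behaviour taken from the text is that a VF is identified by its node and device name together, and that a second VM cannot claim a VF already in use on the same node.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key type: a VF is only unique in combination with its node.
VFKey = Tuple[str, str]  # (node name, VF device name)


class VFInUseError(Exception):
    """Raised when a VF on a given node is already assigned to another VM."""


class VFRegistry:
    """Illustrative in-memory tracker of which VM uses which VF on which node."""

    def __init__(self) -> None:
        self._usage: Dict[VFKey, str] = {}

    def assign(self, node: str, vf: str, vm: str) -> None:
        key = (node, vf)
        owner = self._usage.get(key)
        # Refuse a second VM on the same (node, VF) pair.
        if owner is not None and owner != vm:
            raise VFInUseError(f"{vf} on {node} is already used by {owner}")
        self._usage[key] = vm

    def release(self, node: str, vf: str) -> None:
        self._usage.pop((node, vf), None)

    def owner(self, node: str, vf: str) -> Optional[str]:
        return self._usage.get((node, vf))


registry = VFRegistry()
registry.assign("hv1", "ens1f0v0", "vm-a")
registry.assign("hv2", "ens1f0v0", "vm-b")  # same VF name, different node: allowed

try:
    registry.assign("hv1", "ens1f0v0", "vm-c")  # same node and VF: rejected
    conflict = False
except VFInUseError:
    conflict = True
```

Note that the sketch happily accepts the same VF name on `hv1` and `hv2`; keeping those two assignments compatible with migration remains the administrator's job, as described above.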
When attaching VFs to VMs, there are two supported modes: `macvtap`, and `hostdev`.
`macvtap`, as the name suggests, uses the Linux `macvtap` driver to connect the VF to the VM. Once attached, the vNIC behaves just like a "bridged" network connection above, and like "bridged" connections, the "mode" of the NIC can be specified, defaulting to "virtio" but supporting various emulated devices instead. Note that in this mode, vLANs cannot be configured on the guest side; they must be specified in the VF configuration (`pvc network sriov vf set`) with one vLAN per VF. VMs with `macvtap` interfaces can be live migrated between nodes without issue, assuming there is a corresponding free VF on the destination node, and the SR-IOV functionality is transparent to the VM.
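Besides the CLI, the API definition added in this change exposes VF configuration as a `PUT` on `/api/v1/sriov/vf/{node}/{vf}` with query parameters such as `vlan_id` and `link_state`. The following Python sketch only builds such a request URL and sends nothing. The base URL is a placeholder, the helper name `vf_config_url` is hypothetical, and any authentication the API may require is left out.

```python
from urllib.parse import quote, urlencode

# Placeholder base URL; substitute the address of your own PVC API.
API_BASE = "http://pvc.example.com"

# Query parameters listed for the PUT operation in the API definition.
ALLOWED_SETTINGS = {
    "vlan_id", "vlan_qos", "tx_rate_min", "tx_rate_max",
    "link_state", "spoof_check", "trust", "query_rss",
}


def vf_config_url(node: str, vf: str, **settings) -> str:
    """Build the URL for setting the configuration of a VF on a node."""
    unknown = set(settings) - ALLOWED_SETTINGS
    if unknown:
        raise ValueError(f"unsupported VF settings: {sorted(unknown)}")
    path = f"/api/v1/sriov/vf/{quote(node, safe='')}/{quote(vf, safe='')}"
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(settings.items()))
    return f"{API_BASE}{path}?{query}" if query else f"{API_BASE}{path}"


# One vLAN per VF: tag traffic on ens1f0v0 of hv1 with vLAN 100.
url = vf_config_url("hv1", "ens1f0v0", vlan_id=100, link_state="auto")
```

The resulting URL could then be sent as a `PUT` request with any HTTP client.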
`hostdev` is a direct PCIe passthrough method. With a VF attached to a VM in `hostdev` mode, the virtual PCIe NIC device itself becomes hidden from the node, and is visible only to the guest, where it appears as a discrete PCIe device. In this mode, vLANs and other attributes can be set on the guest side at will, though setting vLANs and other properties in the VF configuration is still supported. The main caveat to this mode is that VMs with connected `hostdev` SR-IOV VFs *cannot be live migrated between nodes*. Only a `shutdown` migration is supported, and, like `macvtap`, an identical PCIe device at the same bus address must be present on the target node. To prevent unexpected failures, PVC will explicitly set the VM metadata for the "migration method" to "shutdown" the first time that a `hostdev` VF is attached to it; if this changes later, the administrator must change this back explicitly.
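The migration consequences of the two attachment modes can be summarised in a small Python sketch. It only illustrates the rule stated above and is not PVC's code: the function name `allowed_migration_methods` and the method label `"live"` are assumptions, while `"shutdown"`, `macvtap`, and `hostdev` come from the text.

```python
from typing import Iterable, List

KNOWN_MODES = {"macvtap", "hostdev"}


def allowed_migration_methods(vf_modes: Iterable[str]) -> List[str]:
    """Return the migration methods usable by a VM, given the attachment
    mode of each SR-IOV VF connected to it."""
    modes = list(vf_modes)
    unknown = set(modes) - KNOWN_MODES
    if unknown:
        raise ValueError(f"unknown SR-IOV attachment mode(s): {sorted(unknown)}")
    # A single hostdev VF is enough to rule out live migration.
    if "hostdev" in modes:
        return ["shutdown"]
    return ["live", "shutdown"]
```

For example, a VM with one `macvtap` VF and one `hostdev` VF is limited to `["shutdown"]`, whereas a VM with only `macvtap` VFs keeps both options, provided the destination node has suitable free VFs.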
Generally speaking, SR-IOV connections are not recommended unless there is a good use case for them. On modern hardware, software bridges are extremely performant, and are much simpler to manage. The functionality is provided for those rare use cases where SR-IOV is absolutely required by the administrator, but care must be taken to understand all the requirements and caveats of SR-IOV before using it in production.
#### Other Client Networks
Future PVC versions may support other client network types, such as direct-routing between VMs.


@ -451,6 +451,12 @@ pvc_nodes:
pvc_bridge_device: bondU
pvc_sriov_enable: True
pvc_sriov_device:
- phy: ens1f0
mtu: 9000
vfcount: 6
pvc_upstream_device: "{{ networks['upstream']['device'] }}"
pvc_upstream_mtu: "{{ networks['upstream']['mtu'] }}"
pvc_upstream_domain: "{{ networks['upstream']['domain'] }}"
@ -901,6 +907,18 @@ The IPMI password for the node management controller. Unless a per-host override
The device name of the underlying network interface to be used for "bridged"-type client networks. For each "bridged"-type network, an IEEE 802.1Q vLAN and bridge will be created on top of this device to pass these networks. In most cases, using the reflexive `networks['cluster']['raw_device']` or `networks['upstream']['raw_device']` from the Base role is sufficient.
#### `pvc_sriov_enable`
* *optional*
Whether to enable or disable SR-IOV functionality.
#### `pvc_sriov_device`
* *optional*
A list of SR-IOV devices. See the Daemon manual for details.
#### `pvc_<network>_*`
The next set of entries is hard-coded to use the values from the global `networks` list. It should not need to be changed under most circumstances. Refer to the previous sections for specific notes about each entry.


@ -146,6 +146,11 @@ pvc:
console_log_lines: 1000
networking:
bridge_device: ens4
sriov_enable: True
sriov_device:
- phy: ens1f0
mtu: 9000
vfcount: 7
upstream:
device: ens4
mtu: 1500
@ -422,6 +427,34 @@ How many lines of VM console logs to keep in the Zookeeper database for each VM.
The network interface device used to create Bridged client network vLANs on. For most clusters, should match the underlying device of the various static networks (e.g. `ens4` or `bond0`), though may also use a separate network interface.
#### `system` → `configuration` → `networking` → `sriov_enable`
* *optional*, defaults to `False`
* *requires* `functions` → `enable_networking`
Enables (or disables) SR-IOV functionality in PVC. If enabled, at least one `sriov_device` entry should be specified.
#### `system` → `configuration` → `networking` → `sriov_device`
* *optional*
* *requires* `functions` → `enable_networking`
Contains a list of SR-IOV PF (physical function) devices and their basic configuration. Each element contains the following entries:
##### `phy`
* *required*
The raw Linux network device with SR-IOV PF functionality.
##### `mtu`
The MTU of the PF device, set on daemon startup.
##### `vfcount`
The number of VF devices to create on this PF. VF devices are then managed via PVC on a per-node basis.
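As a rough illustration of how these entries fit together, the following Python sketch checks a `networking` section that has already been parsed into a dictionary. It is not the daemon's actual validation. The function name `validate_sriov_devices` is hypothetical, `mtu` and `vfcount` are treated as optional positive integers (an assumption, since only `phy` is marked required above), and rejecting an enabled configuration with no devices is a stricter reading of "should" than the text demands.

```python
from typing import Any, Dict, List


def validate_sriov_devices(networking: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the checked SR-IOV PF entries, or an empty list if SR-IOV is disabled."""
    # sriov_enable defaults to False when absent.
    if not networking.get("sriov_enable", False):
        return []
    devices = networking.get("sriov_device") or []
    if not devices:
        raise ValueError("sriov_enable is set but no sriov_device entries are defined")
    checked = []
    for index, device in enumerate(devices):
        phy = device.get("phy")
        if not isinstance(phy, str) or not phy:
            raise ValueError(f"sriov_device[{index}]: 'phy' is required")
        for key in ("mtu", "vfcount"):
            value = device.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if value is not None and (
                isinstance(value, bool) or not isinstance(value, int) or value <= 0
            ):
                raise ValueError(f"sriov_device[{index}]: '{key}' must be a positive integer")
        checked.append(dict(device))
    return checked


# Mirrors the example configuration shown earlier in this file.
example = {
    "bridge_device": "ens4",
    "sriov_enable": True,
    "sriov_device": [{"phy": "ens1f0", "mtu": 9000, "vfcount": 7}],
}
devices = validate_sriov_devices(example)
```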
#### `system` → `configuration` → `networking`
* *optional*


@ -764,6 +764,99 @@
},
"type": "object"
},
"sriov_pf": {
"properties": {
"mtu": {
"description": "The MTU of the SR-IOV PF device",
"type": "string"
},
"phy": {
"description": "The name of the SR-IOV PF device",
"type": "string"
},
"vfs": {
"items": {
"description": "The PHY name of a VF of this PF",
"type": "string"
},
"type": "list"
}
},
"type": "object"
},
"sriov_vf": {
"properties": {
"config": {
"id": "sriov_vf_config",
"properties": {
"link_state": {
"description": "The current SR-IOV VF link state (either enabled, disabled, or auto)",
"type": "string"
},
"query_rss": {
"description": "Whether VF RSS querying is enabled or disabled",
"type": "boolean"
},
"spoof_check": {
"description": "Whether device spoof checking is enabled or disabled",
"type": "boolean"
},
"trust": {
"description": "Whether guest device trust is enabled or disabled",
"type": "boolean"
},
"tx_rate_max": {
"description": "The maximum TX rate of the SR-IOV VF device",
"type": "string"
},
"tx_rate_min": {
"description": "The minimum TX rate of the SR-IOV VF device",
"type": "string"
},
"vlan_id": {
"description": "The tagged vLAN ID of the SR-IOV VF device",
"type": "string"
},
"vlan_qos": {
"description": "The QOS group of the tagged vLAN",
"type": "string"
}
},
"type": "object"
},
"mac": {
"description": "The current MAC address of the VF device",
"type": "string"
},
"mtu": {
"description": "The current MTU of the VF device",
"type": "integer"
},
"pf": {
"description": "The name of the SR-IOV PF parent of this VF device",
"type": "string"
},
"phy": {
"description": "The name of the SR-IOV VF device",
"type": "string"
},
"usage": {
"id": "sriov_vf_usage",
"properties": {
"domain": {
"description": "The UUID of the domain the SR-IOV VF is currently used by",
"type": "boolean"
},
"used": {
"description": "Whether the SR-IOV VF is currently used by a VM or not",
"type": "boolean"
}
},
"type": "object"
}
},
"type": "object"
},
"storage-template": {
"properties": {
"disks": {
@ -1459,8 +1552,15 @@
},
"/api/v1/initialize": {
"post": {
"description": "Note: Normally used only once during cluster bootstrap; checks for the existence of the \"/primary_node\" key before proceeding and returns 400 if found",
"description": "<br/>If the 'overwrite' option is not True, the cluster will return 400 if the `/config/primary_node` key is found. If 'overwrite' is True, the existing cluster<br/>data will be erased and new, empty data written in its place.<br/><br/>All node daemons should be stopped before running this command, and the API daemon started manually to avoid undefined behavior.",
"parameters": [
{
"description": "A flag to enable or disable (default) overwriting existing data",
"in": "query",
"name": "overwrite",
"required": false,
"type": "bool"
},
{
"description": "A confirmation string to ensure that the API consumer really means it",
"in": "query",
@ -4453,6 +4553,181 @@
]
}
},
"/api/v1/sriov/pf": {
"get": {
"description": "",
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/sriov_pf"
}
}
},
"summary": "Return a list of SR-IOV PFs on a given node",
"tags": [
"network / sriov"
]
}
},
"/api/v1/sriov/pf/{node}": {
"get": {
"description": "",
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/sriov_pf"
}
}
},
"summary": "Return a list of SR-IOV PFs on node {node}",
"tags": [
"network / sriov"
]
}
},
"/api/v1/sriov/vf": {
"get": {
"description": "",
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/sriov_vf"
}
}
},
"summary": "Return a list of SR-IOV VFs on a given node, optionally limited to those in the specified PF",
"tags": [
"network / sriov"
]
}
},
"/api/v1/sriov/vf/{node}": {
"get": {
"description": "",
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/sriov_vf"
}
}
},
"summary": "Return a list of SR-IOV VFs on node {node}, optionally limited to those in the specified PF",
"tags": [
"network / sriov"
]
}
},
"/api/v1/sriov/vf/{node}/{vf}": {
"get": {
"description": "",
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/sriov_vf"
}
},
"404": {
"description": "Not found",
"schema": {
"$ref": "#/definitions/Message"
}
}
},
"summary": "Return information about {vf} on {node}",
"tags": [
"network / sriov"
]
},
"put": {
"description": "",
"parameters": [
{
"description": "The vLAN ID for vLAN tagging (0 is disabled)",
"in": "query",
"name": "vlan_id",
"required": false,
"type": "integer"
},
{
"description": "The vLAN QOS priority (0 is disabled)",
"in": "query",
"name": "vlan_qos",
"required": false,
"type": "integer"
},
{
"description": "The minimum TX rate (0 is disabled)",
"in": "query",
"name": "tx_rate_min",
"required": false,
"type": "integer"
},
{
"description": "The maximum TX rate (0 is disabled)",
"in": "query",
"name": "tx_rate_max",
"required": false,
"type": "integer"
},
{
"description": "The administrative link state",
"enum": [
"auto",
"enable",
"disable"
],
"in": "query",
"name": "link_state",
"required": false,
"type": "string"
},
{
"description": "Enable or disable spoof checking",
"in": "query",
"name": "spoof_check",
"required": false,
"type": "boolean"
},
{
"description": "Enable or disable VF user trust",
"in": "query",
"name": "trust",
"required": false,
"type": "boolean"
},
{
"description": "Enable or disable query RSS support",
"in": "query",
"name": "query_rss",
"required": false,
"type": "boolean"
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/Message"
}
},
"400": {
"description": "Bad request",
"schema": {
"$ref": "#/definitions/Message"
}
}
},
"summary": "Set the configuration of {vf} on {node}",
"tags": [
"network / sriov"
]
}
},
"/api/v1/status": {
"get": {
"description": "",
@ -5721,7 +5996,8 @@
"mem",
"vcpus",
"load",
"vms"
"vms",
"none (cluster default)"
],
"in": "query",
"name": "selector",