97329bb90d
Sort Ceph pool data by name
...
There is no guarantee that both commands output the pools in the same
order, so sort them by name first so the iteration over the pools by ID
is successful.
2024-07-22 13:26:27 -04:00
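A minimal sketch of the idea in this commit, assuming two lists of pool records that come from separate commands and may arrive in different orders. The sample data, key names, and the `pair_pools` helper are hypothetical and not taken from the project.

```python
# Hypothetical sample data standing in for the output of two separate commands.
pool_stats = [
    {"name": "vms", "bytes_used": 100},
    {"name": "backups", "bytes_used": 50},
]
pool_details = [
    {"pool_name": "backups", "pool_id": 2},
    {"pool_name": "vms", "pool_id": 1},
]


def pair_pools(stats, details):
    """Sort both lists by pool name so entries at the same index match."""
    stats_sorted = sorted(stats, key=lambda p: p["name"])
    details_sorted = sorted(details, key=lambda p: p["pool_name"])
    return list(zip(stats_sorted, details_sorted))


paired = pair_pools(pool_stats, pool_details)
for stat, detail in paired:
    print(detail["pool_id"], stat["name"], stat["bytes_used"])
```

Sorting both sides by the same key makes index-based iteration safe without assuming anything about the order either command returns.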
dcb9c0d12c
Improve fence handling conditions
...
Use the intermediate output text when judging the fence status, rather
than the retcode of the stop as this should be more reliable.
2024-05-08 10:55:15 -04:00
79ad09ae59
Switch virtual memory free to allocated
...
Avoids incorrect reporting if cache/buffers exceeds normal.
2024-04-19 10:25:33 -04:00
a5763c9d25
Fix possible race condition applying schemas
...
Found an instance where two of these fired too close together, and
caused a fatal error. Use a write lock, and then catch the schema.apply
function in case it fails anyways.
2024-01-11 10:21:01 -05:00
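A rough sketch of the pattern described here: serialize schema application behind a lock, and still catch a failure of the apply call. `threading.Lock` is only a local stand-in for the write lock named in the commit, and `apply_schema_safely` is a hypothetical helper, not the project's function.

```python
import threading

# Stand-in for the write lock mentioned in the commit (assumption: the real
# lock is cluster-wide, not a local thread lock).
schema_lock = threading.Lock()


def apply_schema_safely(apply_func, version):
    """Apply a schema under a lock; return False instead of raising on failure."""
    with schema_lock:
        try:
            apply_func(version)
            return True
        except Exception as e:
            # Catch the failure so a second, racing caller is not fatal.
            print(f"Failed to apply schema version {version}: {e}")
            return False
```

The lock prevents two callers from applying at the same time; the `try`/`except` covers the case where the apply still fails for another reason.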
123c7ce857
Update copyright header on all files for 2024
...
Last release of 2023 is probably the best time to do this.
2023-12-29 11:16:59 -05:00
e654fbba08
Move debug condition handling to Logger
...
Avoids many dozens of conditionals sprinkled throughout the code by
centralizing this check into the main Logger instance.
2023-12-27 13:01:45 -05:00
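A small sketch of what centralizing the debug check can look like. The `Logger` class below, its `out` method, and the `"d"` state code are illustrative assumptions, not the project's actual Logger.

```python
class Logger:
    """Toy logger that decides in one place whether debug lines are emitted."""

    def __init__(self, debug=False):
        self.debug = debug
        self.lines = []  # kept only so the behaviour is easy to inspect

    def out(self, message, state="i"):
        # Debug messages are dropped here, so callers need no conditional.
        if state == "d" and not self.debug:
            return
        line = f"[{state}] {message}"
        self.lines.append(line)
        print(line)


logger = Logger(debug=False)
logger.out("Starting keepalive run")
logger.out("Raw stats: {...}", state="d")  # silently skipped when debug is off
```

Callers then log unconditionally instead of wrapping every debug message in an `if` on the debug setting.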
3e4cc53fdd
Add node network statistics and utilization values
...
Adds a new physical network interface stats parser to the node
keepalives, and leverages this information to provide a network
utilization overview in the Prometheus metrics.
2023-12-21 15:45:01 -05:00
0f24184b78
Explicitly clear resources of fenced node
...
This actually solves the bug originally "fixed" in 5f1432ccdd without
breaking VM resource allocations for working nodes.
2023-12-11 12:14:56 -05:00
1ba37fe33d
Restore VM resource allocation location
...
Commit 5f1432ccdd changed where these happen due to a bug after fencing.
However this completely broke node resource reporting as only the final
instance will be queried here. Revert this change and look further into
the original bug.
2023-12-11 11:52:59 -05:00
1a05077b10
Fix missing f-string
2023-12-11 11:29:49 -05:00
7bc0760b78
Add time to "starting keepalive" message
...
Matches the pvchealthd output and provides a useful message detail to
this otherwise contextless message.
2023-12-10 00:40:32 -05:00
1fb0463dea
Adjust daemon service startup
...
Add healthd, adjust workerd, lower waittime
2023-11-30 03:28:02 -05:00
03a738f878
Move config parser into daemon_lib
...
And reformat/add config values for API.
2023-11-30 00:05:37 -05:00
4a2eba0961
Improve node output messages (from pvchealthd)
...
1. Output startup "list" entries in cyan with s state
2. Add start of keepalive run message
2023-11-29 21:21:51 -05:00
83ceb41138
Add daemon name to Logger entries
2023-11-29 15:18:37 -05:00
2545a7b744
Allow similar for IPMI hostnames
2023-11-28 16:09:01 -05:00
ce907ff26a
Allow specifying static IPs instead of a file
2023-11-28 15:28:31 -05:00
fc3d292081
Add missing subdirectory configs
2023-11-27 13:40:07 -05:00
eab1ae873b
Ensure upstream_gateway key will exist
2023-11-27 13:37:57 -05:00
eaf93cdf96
Readd missing subsystem configurations
2023-11-27 13:33:41 -05:00
c8f4cbb39e
Fix node entry keys
2023-11-27 13:24:01 -05:00
bcc57638a9
Refactor pvcnoded to use new configuration
2023-11-26 15:41:25 -05:00
18e43a9377
Adjust name in worker log output
2023-11-16 02:25:14 -05:00
aef38639cf
Rename pvcapid-worker to pvcworkerd
2023-11-15 20:31:39 -05:00
5f1432ccdd
Fix memory allocation updates and add more debug
...
Previously, we were assigning memalloc/memprov/vcpualloc during an
earlier phase using the main d_domain list. I'm not sure exactly why,
but this was throwing off stats after a fence. Instead, set these values
later on while parsing the actually-active VMs.
2023-11-10 10:29:32 -05:00
d6b8808448
Clean up fencing handler
...
1. Remove all format strings in favour of f-strings
2. Ensure all logger messages have a prefix
3. Add a few more logger messages for clarity
2023-11-10 10:09:54 -05:00
83c4c6633d
Readd RBD lock detection and clearing on startup
...
This is still needed due to the nature of the locks and freeing them on
startup, and to preserve lock=fail behaviour on VM startup.
Also fixes the fencing lock flush to directly use the client library
outside of Celery. I don't like this hack but it seems prudent until we
move fencing to the workers as well.
2023-11-10 01:33:48 -05:00
2c15036f86
Add KeyDB to node startup services
...
Also ensure API worker starts on all nodes, not just coordinators.
2023-11-05 19:26:38 -05:00
30d7e49401
Start API worker with node daemon on coordinators
2023-11-04 13:08:16 -04:00
8b93f9a80e
Handle OSD index errors during stats collection
2023-11-01 21:33:40 -04:00
0769f1ea52
Increase service start time to 10s
2023-10-23 22:24:03 -04:00
457b7bed3d
Handle exceptions in fence migrations
2023-09-16 22:56:09 -04:00
48662e90c1
Remove obsolete monitoring_instance passing
2023-09-15 22:47:45 -04:00
079381c03e
Move printing to end and add runtime
2023-09-15 22:40:09 -04:00
4d51318a40
Make monitoring interval configurable
2023-09-15 16:54:51 -04:00
254303b9d4
Use coordinator_state instead of router_state
...
Makes it much clearer what this variable represents.
2023-09-15 16:47:56 -04:00
40b7d68853
Separate monitoring and move to 60s interval
...
Removes the dependency of the monitoring subsystem on the node
keepalives, and runs the monitoring plugins at a 60s interval to avoid
excessive backups if a plugin takes too long.
Adds its own logs and related items as required.
Finally adds a new required argument to the run() of plugins, the
coordinator state, which can be used by a plugin to determine actions
based on whether the node is a primary, secondary, or non-coordinator.
2023-09-15 16:47:11 -04:00
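A hedged sketch of how a plugin might use the coordinator state passed to `run()`. The class name, the state strings, and the return values are assumptions for illustration; only the idea of passing the coordinator state to `run()` comes from the commit message.

```python
class ExamplePlugin:
    """Hypothetical monitoring plugin; not the project's plugin base class."""

    def run(self, coordinator_state):
        # Assumed state values: "primary", "secondary", or anything else
        # for a non-coordinator node.
        if coordinator_state == "primary":
            return "ran cluster-wide check"
        if coordinator_state == "secondary":
            return "ran standby check"
        return "skipped: not a coordinator"


plugin = ExamplePlugin()
for state in ("primary", "secondary", "client"):
    print(state, "->", plugin.run(state))
```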
cb413e5ce6
[Bookworm] Fix Ceph 16 OSD stat parsing
2023-08-31 00:45:03 -04:00
ed087d83c2
Round cpuload to 2 decimal places
2023-08-29 21:41:44 -04:00
7c07fbefff
Adjust keepalive health printing and ordering
2023-02-24 11:08:30 -05:00
f4eef30770
Add JSON health to cluster data
2023-02-15 15:26:57 -05:00
bc88d764b0
Add logging flag for monitoring plugin output
2023-02-13 22:04:39 -05:00
2ee52e44d3
Move Ceph cluster health reporting to plugin
...
Also removes several outputs from the normal keepalive that were
superfluous/static so that the main output fits on one line.
2023-02-13 12:13:56 -05:00
3c742a827b
Initial implementation of monitoring plugin system
2023-02-13 12:06:26 -05:00
726d0a562b
Update copyright header year
2022-10-06 11:55:27 -04:00
5942aa50fc
Avoid raise/handle deadlocks
...
Can cause log flooding in some edge cases and isn't really needed any
longer. Use a proper conditional followed by an actual error handler.
2022-10-03 14:04:12 -04:00
8d0f26ff7a
Add additional kb_ values to OSD stats
...
Allows for easier parsing later to get e.g. % values and more details on
the used amounts.
2022-08-11 11:06:36 -04:00
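As an illustration of the "% values" use case, the sketch below derives a utilization percentage from raw kilobyte counters. The key names `kb` and `kb_used` and the sample numbers are assumptions chosen for the example, not values confirmed by the commit.

```python
def osd_used_percent(osd_stats):
    """Return used space as a percentage of total, rounded to 2 places."""
    total_kb = osd_stats["kb"]
    if total_kb == 0:
        return 0.0  # avoid dividing by zero for an empty or missing OSD
    return round(osd_stats["kb_used"] / total_kb * 100, 2)


# Hypothetical sample record.
sample = {"kb": 1000000, "kb_used": 250000, "kb_avail": 750000}
print(osd_used_percent(sample))
```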
23b1501f40
Fix linting error F541 f-string placeholders
2021-11-06 03:26:03 -04:00
c41664d2da
Reformat code with Black code formatter
...
Unify the code style along PEP and Black principles using the tool.
2021-11-06 03:02:43 -04:00
2e7b9b28b3
Add some delay and additional tries to fencing
2021-10-27 16:24:17 -04:00