Problems
This page lists problems that have occurred or may occur, with hints for resolving them.
Server failure
Surf servers sometimes lose contact with a storage volume and then fail to shut down properly. The problem manifests as various disk errors. After a hard reboot, things generally work again.
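The disk errors usually surface in the node's kernel log. A minimal sketch, assuming shell access to the affected server; the sample log lines are illustrative, and in practice you would pipe `dmesg` through the filter:

```shell
# find_disk_errors: filter a kernel log for the usual signs of a lost
# storage volume (I/O errors, EXT4 errors, read-only remounts).
find_disk_errors() {
  grep -i -E 'i/o error|ext4-fs error|remounting filesystem read-only'
}

# Illustrative sample input; on a real server: dmesg | find_disk_errors
printf '%s\n' \
  'EXT4-fs (sdl): Remounting filesystem read-only' \
  'usb 1-1: new high-speed USB device number 2' \
  | find_disk_errors
```

Only the EXT4 line survives the filter, which is usually enough to decide whether a hard reboot is warranted.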
Disk pressure
Managed from the Longhorn console.
Memory pressure
Some diagnostics and remediations:
kubectl top pod --all-namespaces --containers --sort-by=memory
kubectl get po --all-namespaces --field-selector 'status.phase==Failed' -o json | kubectl delete -f -
kubectl get po --all-namespaces --field-selector 'status.phase==Pending' -o json | kubectl delete -f -
Note that evicted pods report status.phase Failed (with reason Evicted), so the Failed selector above already covers them; Evicted is not a pod phase.
Also see Grafana.
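To see which pods are actually over budget, the `kubectl top` output can be filtered against a threshold. A sketch, assuming the default `NAMESPACE POD CPU MEMORY` column layout and a hypothetical 500Mi limit; the here-doc stands in for live output:

```shell
# high_memory: print pods whose MEMORY column (last field, in Mi) exceeds
# a threshold. Real use: kubectl top pod --all-namespaces | high_memory 500
high_memory() {
  awk -v limit="$1" 'NR > 1 && $NF ~ /Mi$/ { if ($NF + 0 > limit) print $1, $2, $NF }'
}

high_memory 500 <<'EOF'
NAMESPACE   POD       CPU(cores)   MEMORY(bytes)
default     web-0     10m          750Mi
default     batch-0   5m           120Mi
EOF
```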
Procedure for node (server) reboot
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
After the reboot: kubectl uncordon <node>
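The sequence can be sketched as a small script. The node name and the DRY_RUN guard are illustrative assumptions; with the default here, the script only prints the commands it would run:

```shell
# reboot_node: cordon, drain, and (after an out-of-band reboot) uncordon a
# node. With DRY_RUN=1 (the default here) commands are printed, not executed.
reboot_node() {
  node="$1"
  run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Reboot the machine here (provider console, IPMI, or ssh + reboot).
  run kubectl uncordon "$node"
}

reboot_node worker-1   # placeholder node name
```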
Docs server
Moved to Netlify and MkDocs.
This docs server, docs.jointcyberrange.nl, runs with a PostgreSQL database on a Longhorn volume. Most of the actual content is in a GitLab repository. There should be backups of the volumes.
Longhorn volume backups
Managed from the Longhorn console, see also longhorn-s3.
Longhorn stale replicas
These are supposed to go away automatically after the staleReplicaTimeout, which Longhorn specifies in minutes.
Longhorn readonly
When Longhorn nodes suffer from CPU starvation, their filesystems can be remounted read-only (the node reports a ReadonlyFilesystem condition).
For example:
Unknown condition true of kubernetes node aks-argo16-31656333-vmss000000: condition type is ReadonlyFilesystem, reason is FilesystemIsReadOnly, message is EXT4-fs (sdl): Remounting filesystem read-only
Example approach:
kubectl cordon aks-argo16-31656333-vmss000002
kubectl drain aks-argo16-31656333-vmss000002 --ignore-daemonsets --delete-emptydir-data
Then reboot the failed node through the provider's GUI or CLI, and uncordon it with kubectl uncordon. Detached volumes and pods in error states may be left behind and need manual cleanup.
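The leftover pods in error states can be listed before deleting them by hand. A sketch assuming the standard `kubectl get pods --all-namespaces --no-headers` column layout; the here-doc stands in for live output:

```shell
# stuck_pods: print namespace/name and status of pods that are neither
# Running nor Completed. Real use:
#   kubectl get pods --all-namespaces --no-headers | stuck_pods
stuck_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}

stuck_pods <<'EOF'
default           web-0      1/1   Running            0   3d
default           job-abc    0/1   Error              0   1d
longhorn-system   engine-x   0/1   CrashLoopBackOff   7   1d
EOF
```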
Logging runs out of space
No remediation documented yet.