Problems
This page lists problems that have occurred or may occur, with hints for resolving them.
Server failure
Surf servers sometimes lose contact with a storage volume and then fail to shut down properly. The problem manifests as various disk errors. After a hard reboot, things generally work again.
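The disk errors usually surface in the node's kernel log. A minimal sketch, assuming shell access to the affected server; the sample log lines are illustrative, and in practice you would pipe `dmesg` through the filter:

```shell
# find_disk_errors: filter a kernel log for the usual signs of a lost
# storage volume (I/O errors, EXT4 errors, read-only remounts).
find_disk_errors() {
  grep -i -E 'i/o error|ext4-fs error|remounting filesystem read-only'
}

# Illustrative sample input; on a real server: dmesg | find_disk_errors
printf '%s\n' \
  'EXT4-fs (sdl): Remounting filesystem read-only' \
  'usb 1-1: new high-speed USB device number 2' \
  | find_disk_errors
```

Only the EXT4 line survives the filter, which is usually enough to decide whether a hard reboot is warranted.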
Disk pressure
Managed from the Longhorn console.
Memory pressure
Some diagnostics and remediations:
kubectl top pod --all-namespaces --containers --sort-by=memory
kubectl get po --all-namespaces --field-selector 'status.phase==Failed' -o json | kubectl delete -f -
kubectl get po --all-namespaces --field-selector 'status.phase==Pending' -o json | kubectl delete -f -
Note that evicted pods report status.phase Failed (with reason Evicted), so the Failed selector above already covers them; Evicted is not a pod phase.
Also see Grafana.
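To see which pods are actually over budget, the `kubectl top` output can be filtered against a threshold. A sketch, assuming the default `NAMESPACE POD CPU MEMORY` column layout and a hypothetical 500Mi limit; the here-doc stands in for live output:

```shell
# high_memory: print pods whose MEMORY column (last field, in Mi) exceeds
# a threshold. Real use: kubectl top pod --all-namespaces | high_memory 500
high_memory() {
  awk -v limit="$1" 'NR > 1 && $NF ~ /Mi$/ { if ($NF + 0 > limit) print $1, $2, $NF }'
}

high_memory 500 <<'EOF'
NAMESPACE   POD       CPU(cores)   MEMORY(bytes)
default     web-0     10m          750Mi
default     batch-0   5m           120Mi
EOF
```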
Procedure for node (server) reboot
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
After the reboot: kubectl uncordon <node>
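The sequence can be sketched as a small script. The node name and the DRY_RUN guard are illustrative assumptions; with the default here, the script only prints the commands it would run:

```shell
# reboot_node: cordon, drain, and (after an out-of-band reboot) uncordon a
# node. With DRY_RUN=1 (the default here) commands are printed, not executed.
reboot_node() {
  node="$1"
  run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Reboot the machine here (provider console, IPMI, or ssh + reboot).
  run kubectl uncordon "$node"
}

reboot_node worker-1   # placeholder node name
```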
Docs server
Moved to Netlify and MkDocs.
This docs server, docs.jointcyberrange.nl, runs with a PostgreSQL database on a Longhorn volume. Most of the actual content is in a GitLab repository. There should be backups of the volumes.
Longhorn volume backups
Managed from the Longhorn console, see also longhorn-s3.
Longhorn stale replicas
These are supposed to go away automatically after the staleReplicaTimeout, which Longhorn specifies in minutes.
Longhorn readonly
When Longhorn nodes suffer from CPU starvation, their filesystems can be remounted read-only (the node reports a ReadonlyFilesystem condition).
For example:
Unknown condition true of kubernetes node aks-argo16-31656333-vmss000000: condition type is ReadonlyFilesystem, reason is FilesystemIsReadOnly, message is EXT4-fs (sdl): Remounting filesystem read-only
Example approach:
kubectl cordon aks-argo16-31656333-vmss000002
kubectl drain aks-argo16-31656333-vmss000002 --ignore-daemonsets --delete-emptydir-data
Then reboot the failed node through the provider's GUI or CLI, and uncordon it with kubectl uncordon. Detached volumes and pods in error states may be left behind and need manual cleanup.
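The leftover pods in error states can be listed before deleting them by hand. A sketch assuming the standard `kubectl get pods --all-namespaces --no-headers` column layout; the here-doc stands in for live output:

```shell
# stuck_pods: print namespace/name and status of pods that are neither
# Running nor Completed. Real use:
#   kubectl get pods --all-namespaces --no-headers | stuck_pods
stuck_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " (" $4 ")" }'
}

stuck_pods <<'EOF'
default           web-0      1/1   Running            0   3d
default           job-abc    0/1   Error              0   1d
longhorn-system   engine-x   0/1   CrashLoopBackOff   7   1d
EOF
```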
Logging runs out of space
No remediation documented yet.