Saturday, July 10, 2021

ORA-00800: soft external error, arguments: [Set Priority Failed], [VKTM], [Check traces and OS configuration]

While adding a new standby node to the existing DG configuration, the following was noticed in the new standby database's alert log.
Starting background process VKTM
2021-03-30T00:08:42.781044+10:00
Errors in file /opt/app/oracle/diag/rdbms/dbx12/dbx12/trace/dbx12_vktm_32410.trc  (incident=41):
ORA-00800: soft external error, arguments: [Set Priority Failed], [VKTM], [Check traces and OS configuration], [Check Oracle document and MOS notes], []
Incident details in: /opt/app/oracle/diag/rdbms/dbx12/dbx12/incident/incdir_41/dbx12_vktm_32410_i41.trc
2021-03-30T00:08:42.782979+10:00
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
VKTM started with pid=5, OS id=32410
MOS Doc 2718971.1 gives a workaround for this issue (it seems this document has since been made internal).
The problem was related to the cgroup setup. In a database setup where everything is working fine, the cgroup for VKTM would be /. For example, if the VKTM PID is 5207, the following could be used to find its cgroup setting.
cat /proc/5207/cgroup | grep cpu
10:cpu,cpuacct:/
6:cpuset:/
Any other setting would mean VKTM runs into the above issue.
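As a shortcut, the PID lookup and the cgroup check can be combined into a one-liner (a sketch only, assuming a single database instance on the host so pgrep returns a single PID):
# cat /proc/$(pgrep -f ora_vktm)/cgroup | grep cpu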
One of the solutions is to set the hidden parameter _high_priority_processes='VKTM'. But this was already set, so it was not going to be the solution here.
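For reference, the current value of the hidden parameter can be checked as SYS with the usual x$ksppi/x$ksppcv query, and changed only if Oracle Support advises it (a sketch only; the alter system line is left commented out and an instance restart would be needed for it to take effect):
$ sqlplus -s / as sysdba <<'EOF'
select a.ksppinm name, b.ksppstvl value
  from x$ksppi a, x$ksppcv b
 where a.indx = b.indx
   and a.ksppinm = '_high_priority_processes';
-- alter system set "_high_priority_processes"='VKTM' scope=spfile;
EOF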
The problem server had the following cgroup setting.
# ps -eaf | grep -i vktm | grep -v grep
oracle    3315     1  0 17:53 ?        00:00:00 ora_vktm_dbx12
oracle    3357     1  0 14:19 ?        00:01:40 asm_vktm_+ASM
# cat /proc/3315/cgroup | grep cpu
11:cpuset:/
6:cpu,cpuacct:/user.slice
So the next workaround was to adjust the cgroup real-time CPU runtime settings as below.
# echo 0 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
# echo 950000 > /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us
This seemed to resolve the situation, and the standby instance could be stopped and started without the above error message appearing in the alert log.
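Note that these echo commands do not survive a reboot. If this workaround is kept, the values would need to be made persistent, for example via rc.local (a minimal sketch, assuming rc.local is used on the server; on OEL 7 the file must be executable for rc-local.service to run it):
# cat >> /etc/rc.d/rc.local <<'EOF'
# reallocate real-time CPU runtime from system.slice to user.slice for VKTM
echo 0 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
echo 950000 > /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us
EOF
# chmod +x /etc/rc.d/rc.local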



However, the question remains: why did the cgroup change? This was the first time this error had been encountered. Since this was a case of adding a standby database server to an existing setup, there was a reference point to check against for any changes. First, the OS: the "good" servers had OEL 7.9 while the "problem" server had OEL 7.7. Both servers are Azure VMs.
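The OS release on each server can be confirmed with, for example:
# cat /etc/oracle-release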
Next, the cgroup hierarchy itself was checked. The good server had the following (no user.slice):
 ls -l /sys/fs/cgroup/cpu,cpuacct/
drwxr-xr-x. 3 root root 0 Apr  7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr  7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr  7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr  7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr  7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr  7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr  7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr  7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_user
While the problem server had the following:
drwxr-xr-x.  2 root root 0 Apr  7 09:29 auoms
drwxr-xr-x.  2 root root 0 Apr  7 09:29 auomscollect
-rw-r--r--.  1 root root 0 Apr  7 09:29 cgroup.clone_children
-rw-r--r--.  1 root root 0 Apr  7 09:29 cgroup.procs
-r--r--r--.  1 root root 0 Apr  7 09:29 cgroup.sane_behavior
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.stat
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_all
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_percpu
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_percpu_sys
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_percpu_user
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_sys
-r--r--r--.  1 root root 0 Apr  7 09:29 cpuacct.usage_user
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpu.cfs_period_us
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpu.cfs_quota_us
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpu.rt_period_us
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpu.rt_runtime_us
-rw-r--r--.  1 root root 0 Apr  7 09:29 cpu.shares
-r--r--r--.  1 root root 0 Apr  7 09:29 cpu.stat
-rw-r--r--.  1 root root 0 Apr  7 09:29 notify_on_release
-rw-r--r--.  1 root root 0 Apr  7 09:29 release_agent
drwxr-xr-x. 69 root root 0 Apr  7 09:29 system.slice
-rw-r--r--.  1 root root 0 Apr  6 14:35 tasks
drwxr-xr-x.  2 root root 0 Apr  7 09:29 user.slice
drwxr-xr-x.  2 root root 0 Apr  7 09:29 WALinuxAgent
Besides user.slice, the auoms* directories seem to be the difference between the two. Auoms is an Azure management agent plugin. Could this be the reason the cgroup hierarchy has a user.slice? To test this out, auoms was disabled and the server restarted, as shown below.
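If unsure whether the unit is active before disabling it, a quick check can be run first (a simple sketch; the unit name is the one seen in the listing above):
# systemctl is-enabled auoms.service
# systemctl is-active auoms.service
The service was then stopped and disabled, and the server rebooted: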
# systemctl stop auoms.service
# systemctl disable auoms.service
Removed symlink /etc/systemd/system/multi-user.target.wants/auoms.service.
Removed symlink /etc/systemd/system/auoms.service.
# /sbin/reboot
When the server restarted, the cgroup hierarchy no longer had a user.slice:
drwxr-xr-x. 3 root root 0 Apr  7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr  7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr  7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr  7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr  7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr  7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr  7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr  7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr  7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr  7 09:32 cpuacct.usage_user
The DB started without the VKTM-related error in the alert log (the cgroup values set earlier had not been made persistent, so they did not survive the reboot). VKTM now had / for its cgroup.
 ps ax  | grep vktm
 3314 ?        Ss     0:03 asm_vktm_+ASM
 3506 ?        Ss     0:03 ora_vktm_dbx12
 4664 pts/0    S+     0:00 grep --color=auto vktm

# cat /proc/3506/cgroup | grep cpu
6:cpu,cpuacct:/
4:cpuset:/
There were several more instances to be added to the DG configuration, and on those servers the OS was upgraded from 7.7 to 7.9. This upgrade seems to have removed auoms, and there were no cgroup issues on them.
If facing similar issues, first check what has caused the cgroup change before attempting hidden parameter or cgroup workarounds.
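As a starting point, the checks used in this post can be combined into a short script (a sketch only, assuming a single instance whose VKTM process matches ora_vktm and a cgroup v1 layout as on OEL 7):
#!/bin/bash
# show which cgroup the database VKTM process belongs to (expect / on the cpu,cpuacct line)
VKTM_PID=$(pgrep -f ora_vktm)
echo "VKTM pid: ${VKTM_PID}"
grep cpu /proc/${VKTM_PID}/cgroup

# list top-level directories under the cpu,cpuacct controller to spot unexpected slices
ls -d /sys/fs/cgroup/cpu,cpuacct/*/

# check whether auoms (or a similar agent) is installed and active
rpm -qa | grep -i auoms
systemctl is-active auoms.service 2>/dev/null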