The following error appeared in the alert log while starting the standby database instance:

Starting background process VKTM
2021-03-30T00:08:42.781044+10:00
Errors in file /opt/app/oracle/diag/rdbms/dbx12/dbx12/trace/dbx12_vktm_32410.trc (incident=41):
ORA-00800: soft external error, arguments: [Set Priority Failed], [VKTM], [Check traces and OS configuration], [Check Oracle document and MOS notes], []
Incident details in: /opt/app/oracle/diag/rdbms/dbx12/dbx12/incident/incdir_41/dbx12_vktm_32410_i41.trc
2021-03-30T00:08:42.782979+10:00
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
VKTM started with pid=5, OS id=32410

MOS Doc 2718971.1 gives a workaround for this issue (it seems this document has since been made internal).
The problem was related to the cgroup setup. In a database setup where everything is working fine, the cpu cgroup for the VKTM process would be /. For example, if the VKTM PID is 5207, the following could be used to find out the cgroup setting.
cat /proc/5207/cgroup | grep cpu
10:cpu,cpuacct:/
6:cpuset:/

Any other setting would mean VKTM runs into the above issue.
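To avoid looking up the VKTM PID by hand, the check can be wrapped in a couple of lines. A minimal sketch, assuming the instance name is dbx12 and pgrep is available:

# Sketch: find the VKTM PID for instance dbx12 and print its cpu/cpuset cgroups
VKTM_PID=$(pgrep -f ora_vktm_dbx12)
grep -E 'cpu|cpuset' /proc/${VKTM_PID}/cgroup
# On a healthy setup both lines should end in /, e.g.
#   10:cpu,cpuacct:/
#   6:cpuset:/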
One of the suggested solutions is to set the hidden parameter _high_priority_processes='VKTM'. But this was already set in this environment, so it was not going to be the solution here.
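For completeness, the current value of the hidden parameter can be confirmed from SQL*Plus. A minimal sketch, assuming SYSDBA access (the x$ tables are only visible to SYS):

# Sketch: show the current value of _high_priority_processes
sqlplus -s / as sysdba <<'EOF'
set linesize 150
col name  format a35
col value format a35
select a.ksppinm name, b.ksppstvl value
  from x$ksppi a, x$ksppsv b
 where a.indx = b.indx
   and a.ksppinm = '_high_priority_processes';
EOF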
The problem server had the following cgroup setting.
# ps -eaf | grep -i vktm | grep -v grep
oracle    3315     1  0 17:53 ?        00:00:00 ora_vktm_dbx12
oracle    3357     1  0 14:19 ?        00:01:40 asm_vktm_+ASM
# cat /proc/3315/cgroup | grep cpu
11:cpuset:/
6:cpu,cpuacct:/user.slice

So the next workaround was to set the kernel parameters as below.
# echo 0 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
# echo 950000 > /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us

This seemed to resolve the situation, and it was possible to stop and start the standby instance without the above error message appearing in the alert log.
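Note that anything written under /sys/fs/cgroup is lost at reboot, so if this is the only fix the error can come back after a restart. One way to re-apply the values at boot, a sketch assuming /etc/rc.d/rc.local is in use on this OEL 7 build:

# Sketch: re-apply the RT runtime split at boot via rc.local
cat >> /etc/rc.d/rc.local <<'EOF'
# move the real-time budget from system.slice to user.slice so VKTM can set its priority
echo 0      > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
echo 950000 > /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us
EOF
chmod +x /etc/rc.d/rc.local   # rc.local is not executable by default on EL7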
However, the question remains: why did the cgroup setting change? This was the first time this error had been encountered. Since this was a case of adding a standby database server to an existing setup, there was a reference point to check against for any changes. First, the OS: the "good" servers had OEL 7.9 while the "problem" server had OEL 7.7. Both servers are Azure VMs.
Next, the cgroup settings themselves were checked. The good server had the following (no user.slice):
ls -l /sys/fs/cgroup/cpu,cpuacct/
drwxr-xr-x. 3 root root 0 Apr 7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr 7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_user

While the problem server had the following:
drwxr-xr-x. 2 root root 0 Apr 7 09:29 auoms
drwxr-xr-x. 2 root root 0 Apr 7 09:29 auomscollect
-rw-r--r--. 1 root root 0 Apr 7 09:29 cgroup.clone_children
-rw-r--r--. 1 root root 0 Apr 7 09:29 cgroup.procs
-r--r--r--. 1 root root 0 Apr 7 09:29 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu_sys
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu_user
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_user
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:29 cpu.stat
-rw-r--r--. 1 root root 0 Apr 7 09:29 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:29 release_agent
drwxr-xr-x. 69 root root 0 Apr 7 09:29 system.slice
-rw-r--r--. 1 root root 0 Apr 6 14:35 tasks
drwxr-xr-x. 2 root root 0 Apr 7 09:29 user.slice
drwxr-xr-x. 2 root root 0 Apr 7 09:29 WALinuxAgent

Besides user.slice, the auoms* entries seem to be the difference between the two. Auoms is an Azure management agent plugin. Could this be the reason why the cgroup hierarchy has a user.slice? To test this out, auoms was disabled and the server restarted.
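As a side check, the cgroup each auoms process sits in can also be listed directly. A small sketch (process names taken from the directory names above):

# Sketch: print the cpu cgroup of every auoms-related process
for p in $(pgrep -f auoms); do
    echo "== PID $p ($(cat /proc/$p/comm)) =="
    grep cpu /proc/$p/cgroup
done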
# systemctl stop auoms.service
# systemctl disable auoms.service
Removed symlink /etc/systemd/system/multi-user.target.wants/auoms.service.
Removed symlink /etc/systemd/system/auoms.service.
# /sbin/reboot

When the server restarted, the cgroup hierarchy didn't have a user.slice:
drwxr-xr-x. 3 root root 0 Apr 7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr 7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_user

The DB was starting without the VKTM-related error in the alert log (the kernel parameters set earlier had not been made persistent). VKTM had / for its cgroup.
ps ax | grep vktm
 3314 ?        Ss     0:03 asm_vktm_+ASM
 3506 ?        Ss     0:03 ora_vktm_dbx12
 4664 pts/0    S+     0:00 grep --color=auto vktm
# cat /proc/3506/cgroup | grep cpu
6:cpu,cpuacct:/
4:cpuset:/

There were several instances to be added to the DG, and on the other servers the OS was upgraded from 7.7 to 7.9. This upgrade seems to have removed auoms, and there were no cgroup issues.
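When adding further standby hosts, a quick pre-check along these lines can flag the same condition before the instance is started. A sketch, using the unit and path names seen on the problem server above:

# Sketch: pre-checks on a new standby host
systemctl list-unit-files | grep -i auoms || echo "no auoms unit found"
ls -d /sys/fs/cgroup/cpu,cpuacct/user.slice 2>/dev/null \
    && echo "user.slice present - VKTM may fail to set its priority" \
    || echo "no user.slice under cpu,cpuacct"

If the DB is already up, checking the VKTM process cgroup as shown earlier gives the definitive answer.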
If facing similar issues, first check what has caused the cgroup change before attempting hidden parameter or kernel parameter workarounds.