Monday, February 17, 2014

Oracle Invoked Out-of-Memory Killer (oom-killer)

One node of a 2-node RAC (11.1.0.7) crashed, and the following could be seen in /var/log/messages:
Feb 11 09:46:01 server02 logger: Oracle CSSD waiting for OPROCD to start
Feb 11 22:46:21 server02 kernel: oracle invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Feb 11 22:46:21 server02 kernel:
Feb 11 22:46:21 server02 kernel: Call Trace:
Feb 11 22:46:21 server02 kernel:  [] out_of_memory+0x8e/0x2f3
Feb 11 22:46:21 server02 kernel:  [] __alloc_pages+0x27f/0x308
Feb 11 22:46:21 server02 kernel:  [] getnstimeofday+0x10/0x28
Feb 11 22:46:21 server02 kernel:  [] __do_page_cache_readahead+0x96/0x179
Feb 11 22:46:21 server02 kernel:  [] filemap_nopage+0x14c/0x360
Feb 11 22:46:21 server02 kernel:  [] __handle_mm_fault+0x1fa/0xfaa
Feb 11 22:46:21 server02 kernel:  [] do_page_fault+0x4cb/0x874
Feb 11 22:46:21 server02 kernel:  [] thread_return+0x62/0xfe
Feb 11 22:46:21 server02 kernel:  [] error_exit+0x0/0x84
Feb 11 22:46:21 server02 kernel:
Feb 11 22:46:23 server02 kernel: Mem-info:
Feb 11 22:46:23 server02 kernel: Node 0 DMA per-cpu:
Feb 11 22:46:23 server02 kernel: cpu 0 hot: high 0, batch 1 used:0
Feb 11 22:46:24 server02 kernel: cpu 0 cold: high 0, batch 1 used:0
Feb 11 22:46:25 server02 kernel: cpu 1 hot: high 0, batch 1 used:0
Feb 11 22:46:26 server02 kernel: cpu 1 cold: high 0, batch 1 used:0
Feb 11 22:46:26 server02 kernel: cpu 2 hot: high 0, batch 1 used:0
Feb 11 22:46:26 server02 kernel: cpu 2 cold: high 0, batch 1 used:0
Feb 11 22:46:27 server02 kernel: cpu 3 hot: high 0, batch 1 used:0
Feb 11 22:46:27 server02 kernel: cpu 3 cold: high 0, batch 1 used:0
Feb 11 22:46:28 server02 kernel: Node 0 DMA32 per-cpu:
Feb 11 22:46:29 server02 kernel: cpu 0 hot: high 186, batch 31 used:16
Feb 11 22:46:29 server02 kernel: cpu 0 cold: high 62, batch 15 used:29
Feb 11 22:46:29 server02 kernel: cpu 1 hot: high 186, batch 31 used:26
Feb 11 22:46:29 server02 kernel: cpu 1 cold: high 62, batch 15 used:14
Feb 11 22:46:29 server02 kernel: cpu 2 hot: high 186, batch 31 used:27
Feb 11 22:46:29 server02 kernel: cpu 2 cold: high 62, batch 15 used:32
Feb 11 22:46:29 server02 kernel: cpu 3 hot: high 186, batch 31 used:9
Feb 11 22:46:30 server02 kernel: cpu 3 cold: high 62, batch 15 used:59
Feb 11 22:46:30 server02 kernel: Node 0 Normal per-cpu:
Feb 11 22:46:30 server02 kernel: cpu 0 hot: high 186, batch 31 used:80
Feb 11 22:46:30 server02 kernel: cpu 0 cold: high 62, batch 15 used:54
Feb 11 22:46:30 server02 kernel: cpu 1 hot: high 186, batch 31 used:38
Feb 11 22:46:30 server02 kernel: cpu 1 cold: high 62, batch 15 used:55
Feb 11 22:46:30 server02 kernel: cpu 2 hot: high 186, batch 31 used:30
Feb 11 22:46:30 server02 kernel: cpu 2 cold: high 62, batch 15 used:58
Feb 11 22:46:31 server02 kernel: cpu 3 hot: high 186, batch 31 used:38
Feb 11 22:46:31 server02 kernel: cpu 3 cold: high 62, batch 15 used:51
Feb 11 22:46:31 server02 kernel: Node 0 HighMem per-cpu: empty
Feb 11 22:46:31 server02 kernel: Free pages:       76148kB (0kB HighMem)
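On a node that has been up for a while, /var/log/messages can be large, so it helps to pull out just the OOM events. Below is a minimal Python sketch that assumes the syslog line format shown above. It scans the log for oom-killer invocations and attaches the next "Free pages" summary to each one. The helper name find_oom_events and the regular expressions are hypothetical and written only for this format. Other kernel versions may word these messages differently.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical pattern for the "<process> invoked oom-killer" line in the
# syslog format shown above: "Mon DD HH:MM:SS host kernel: ...".
OOM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<proc>\S+) invoked oom-killer:\s+gfp_mask=(?P<gfp>0x[0-9a-f]+),\s+order=(?P<order>\d+)"
)
# Pattern for the "Free pages: NNNkB (NkB HighMem)" summary in the Mem-info dump.
FREE_RE = re.compile(r"Free pages:\s+(?P<free>\d+)kB \((?P<high>\d+)kB HighMem\)")


def find_oom_events(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Return one dict per oom-killer invocation, with the following Free pages summary attached."""
    events: List[Dict[str, object]] = []
    for line in lines:
        m = OOM_RE.search(line)
        if m:
            events.append({
                "timestamp": m.group("ts"),
                "host": m.group("host"),
                "process": m.group("proc"),
                "gfp_mask": m.group("gfp"),
                "order": int(m.group("order")),
            })
            continue
        f = FREE_RE.search(line)
        # Attach only the first Free pages line seen after each invocation.
        if f and events and "free_kb" not in events[-1]:
            events[-1]["free_kb"] = int(f.group("free"))
            events[-1]["highmem_free_kb"] = int(f.group("high"))
    return events
```

On the affected node it could be run as `with open("/var/log/messages", errors="replace") as fh: print(find_oom_events(fh))`.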


Oracle provides a Metalink note on memory pressure in the lowmem region (452326.1), but this crash was related to the highmem region, so the note wasn't much help. If the crash is not caused by a bug (551991.1), the root cause could be memory pressure (1502301.1). It must first be verified that the system is indeed under memory pressure. System statistics, collected with either sysstat or OSWatcher, can be used to check whether memory consumption differed from normal, for example because of recent application changes.
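A single snapshot does not prove memory pressure, but it is a reasonable starting point. The minimal Python sketch below parses /proc/meminfo and summarises how much memory looks reclaimable and how much swap is in use. The function names and the 5% threshold are arbitrary assumptions for illustration, not values taken from note 1502301.1. The readings should be compared against the sysstat or OSWatcher history for the node.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: numeric value} (mostly kB)."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_pressure_summary(info: Dict[str, int], low_pct: float = 5.0) -> Dict[str, object]:
    """Rough pressure check; low_pct is an arbitrary, assumed threshold."""
    total = info["MemTotal"]
    # Kernels older than 3.14 have no MemAvailable, so fall back to
    # free + buffers + page cache. This overstates reclaimable memory when
    # large SysV shared memory segments (e.g. an SGA not backed by HugePages)
    # are counted in Cached.
    available = info.get(
        "MemAvailable",
        info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0),
    )
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    avail_pct = 100.0 * available / total
    return {
        "available_kb": available,
        "available_pct": round(avail_pct, 1),
        "swap_used_kb": swap_used,
        "under_pressure": avail_pct < low_pct,
    }
```

For example: `with open("/proc/meminfo") as fh: print(memory_pressure_summary(parse_meminfo(fh.read())))`.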
Other solutions include setting the vm.lower_zone_protection kernel parameter (32-bit systems only) or vm.min_free_kbytes, depending on the architecture.
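Before changing either parameter, it helps to check which one the running kernel actually exposes, because vm.lower_zone_protection is not present on every kernel. The Python sketch below maps a dotted sysctl name to its file under /proc/sys, reads the current value if that file exists, and formats a line for /etc/sysctl.conf. The helper names are hypothetical. No value is recommended here, since the right setting depends on the system and should come from the notes listed below.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Translate a dotted sysctl name such as vm.min_free_kbytes into its /proc/sys file."""
    return Path(root, *name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value, or None if this kernel does not expose the parameter."""
    path = sysctl_path(name, root)
    if not path.exists():
        return None
    return path.read_text().strip()


def sysctl_conf_line(name: str, value: int) -> str:
    """Format a persistent entry for /etc/sysctl.conf."""
    return f"{name} = {value}"


# Report which of the two parameters this kernel supports, and their current values.
for param in ("vm.min_free_kbytes", "vm.lower_zone_protection"):
    print(param, read_sysctl(param))
```

A line produced by sysctl_conf_line can be added to /etc/sysctl.conf and loaded with `sysctl -p`.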

Useful Metalink notes
Linux: Out-of-Memory (OOM) Killer [452000.1]
How to Check Whether a System is Under Memory Pressure [1502301.1]
Linux Kernel: The SLAB Allocator [434351.1]
BUG 6167888: RMAN-10038 ERROR ON LINUX AFTER CREATING A LARGE BACKUPSET ON NFS [551991.1]
Linux Kernel Lowmem Pressure Issues and Kernel Structures [452326.1]