
#2717 [patch] Resolve memory leak in agent/mibgroup/host/data_access/swrun_kinfo.c

Labels: freeBSD
Status: closed
Milestone: None
Priority: 5
Updated: 2020-02-01
Created: 2016-06-07
Private: No

Submitted by: Sylvain GALLIANO <sg@efficientip.com>

Steps to reproduce:

  • start snmpd using valgrind --leak-check=full
  • execute several snmpwalk queries such as:

snmpwalk -v 2c -c public 10.0.22.22 HOST-RESOURCES-MIB::hrSWRunTable > /dev/null

  • stop snmpd (valgrind prints its leak summary on exit)

You should see:

==86177== 9,568 bytes in 26 blocks are definitely lost in loss record 337 of 339
==86177==    at 0x4C1EC90: calloc (in /data1/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==86177==    by 0x5127B29: netsnmp_swrun_entry_create (swrun.c:306)
==86177==    by 0x5139FA4: netsnmp_arch_swrun_container_load (swrun_kinfo.c:203)
==86177==    by 0x512770D: netsnmp_swrun_container_load (swrun.c:224)
==86177==    by 0x51273E0: _cache_load (swrun.c:140)
==86177==    by 0x4E3C867: _cache_load (cache_handler.c:700)
==86177==    by 0x4E3C295: netsnmp_cache_check_and_reload (cache_handler.c:542)
==86177==    by 0x4E3C6DA: netsnmp_cache_helper_handler (cache_handler.c:638)
==86177==    by 0x4E52D61: netsnmp_call_handler (agent_handler.c:526)
==86177==    by 0x4E5328F: netsnmp_call_next_handler (agent_handler.c:640)
==86177==    by 0x4E45392: table_helper_handler (table.c:712)
==86177==    by 0x4E52D61: netsnmp_call_handler (agent_handler.c:526)

The issue comes from duplicate index insertions into hrSWRunTable: all threads
are retrieved, so the threads of a multi-threaded process produce duplicate PID indexes.
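
In outline: the swrun cache load allocates a fresh entry for every kinfo_proc record it gets back, but inserting a second entry with an already-used PID index fails, and the rejected entry is never freed. The following stand-in sketch (hypothetical code, not the net-snmp sources; the insert() helper and table are only illustrative) reproduces the same "definitely lost" report under valgrind:

/* leak_sketch.c -- hypothetical stand-in for the swrun load loop.
 * Build and check: cc leak_sketch.c -o leak_sketch
 *                  valgrind --leak-check=full ./leak_sketch */
#include <stdlib.h>

struct entry { int pid; };

#define TABLE_MAX 16
static struct entry *table[TABLE_MAX];
static int count;

/* duplicate-rejecting insert, analogous to net-snmp's CONTAINER_INSERT:
 * returns -1 and leaves ownership with the caller on a duplicate key */
static int insert(struct entry *e)
{
    int i;

    for (i = 0; i < count; i++)
        if (table[i]->pid == e->pid)
            return -1;
    if (count == TABLE_MAX)
        return -1;
    table[count++] = e;
    return 0;
}

int main(void)
{
    /* PID sequence from the debug log further down in this ticket */
    int pids[] = { 827, 825, 808, 808, 808, 808, 805, 804 };
    unsigned i;

    for (i = 0; i < sizeof pids / sizeof *pids; i++) {
        struct entry *e = calloc(1, sizeof *e);
        if (NULL == e)
            return 1;
        e->pid = pids[i];
        insert(e);   /* BUG, as in swrun_kinfo.c: return value ignored */
    }
    return 0;        /* the three rejected 808 entries are now lost */
}

The patch below avoids the duplicate keys at the source by asking the kernel for one entry per process instead of one per thread: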

diff -ru net-snmp-5.7.3-leak/agent/mibgroup/host/data_access/swrun_kinfo.c net-snmp-5.7.3/agent/mibgroup/host/data_access/swrun_kinfo.c
--- net-snmp-5.7.3-leak/agent/mibgroup/host/data_access/swrun_kinfo.c  2016-06-07 18:48:23.000000000 +0200
+++ net-snmp-5.7.3/agent/mibgroup/host/data_access/swrun_kinfo.c       2016-06-07 18:48:40.000000000 +0200
@@ -189,7 +189,7 @@
 #elif defined(HAVE_KVM_GETPROC2)
     proc_table = kvm_getproc2(kd, KERN_PROC_ALL, 0, sizeof(struct kinfo_proc2), &nprocs );
 #else
-    proc_table = kvm_getprocs(kd, KERN_PROC_ALL, 0, &nprocs );
+    proc_table = kvm_getprocs(kd, KERN_PROC_PROC, 0, &nprocs );
 #endif
     for ( i=0 ; i<nprocs; i++ ) {
         if ( 0 == proc_table[i].SWRUN_K_STAT )
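
For reference, the effect of the two flags can be checked with a small standalone program (a sketch only, not part of the patch; it assumes a FreeBSD system with libkvm and is built with cc kvm_flags.c -lkvm). KERN_PROC_ALL yields one kinfo_proc per thread, while KERN_PROC_PROC yields one per process:

/* kvm_flags.c -- compare kvm_getprocs() entry counts for the two ops */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <fcntl.h>
#include <kvm.h>
#include <limits.h>
#include <paths.h>
#include <stdio.h>

static int count_procs(kvm_t *kd, int op)
{
    int n = 0;

    /* the returned table is owned by libkvm; we only need the count */
    if (NULL == kvm_getprocs(kd, op, 0, &n))
        return -1;
    return n;
}

int main(void)
{
    char    errbuf[_POSIX2_LINE_MAX];
    kvm_t  *kd = kvm_openfiles(NULL, _PATH_DEVNULL, NULL, O_RDONLY, errbuf);

    if (NULL == kd) {
        fprintf(stderr, "kvm_openfiles: %s\n", errbuf);
        return 1;
    }
    printf("KERN_PROC_ALL : %d entries (one per thread)\n",
           count_procs(kd, KERN_PROC_ALL));
    printf("KERN_PROC_PROC: %d entries (one per process)\n",
           count_procs(kd, KERN_PROC_PROC));
    kvm_close(kd);
    return 0;
}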

Discussion

  • Niels Baggesen

    Niels Baggesen - 2016-06-13
    • status: open --> pending
    • assigned_to: Niels Baggesen
     
  • Niels Baggesen

    Niels Baggesen - 2016-06-13

    This should not happen :-)
    Can you give me the output from "procstat -at" and/or "ps -auxH -o lwp" when it happens?

     
  • Sylvain GALLIANO

    # ps -auxH -o lwp
    USER    PID %CPU %MEM    VSZ  RSS TT  STAT STARTED        TIME COMMAND             LWP
    root     11 99.0  0.0      0   16 ??  RL    1Jun16 18894:01.88 [idle]           100003
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16   289:18.79 [kernel/swapper] 100000
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/firmware 100010
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/acpi_tas 100015
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/acpi_tas 100016
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/acpi_tas 100017
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/kqueue t 100018
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/ffs_trim 100019
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     0:00.00 [kernel/thread t 100022
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16     3:27.38 [kernel/em0 task 100028
    root      0  0.0  0.0      0  160 ??  DLs   1Jun16   149:37.33 [kernel/dummynet 100032
    root      1  0.0  0.0   6276  228 ??  ILs   1Jun16     0:00.42 /sbin/init --    100002
    root      2  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [crypto]         100011
    root      3  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [crypto returns] 100012
    root      4  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [mpt_recovery0]  100027
    root      5  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [sctp_iterator]  100031
    root      6  0.0  0.0      0   16 ??  DL    1Jun16     0:01.30 [xpt_thrd]       100033
    root      7  0.0  0.0      0   16 ??  DL    1Jun16     0:02.04 [pagedaemon]     100034
    root      8  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [vmdaemon]       100035
    root      9  0.0  0.0      0   16 ??  DL    1Jun16     0:00.04 [pagezero]       100036
    root     10  0.0  0.0      0   16 ??  DL    1Jun16     0:00.00 [audit]          100001
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.00 [intr/swi3: vm]  100004
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.01 [intr/swi1: neti 100005
    root     12  0.0  0.0      0  192 ??  WL    1Jun16   160:10.68 [intr/swi4: cloc 100006
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.00 [intr/swi6: task 100014
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:16.61 [intr/swi2: camb 100020
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.00 [intr/swi5: fast 100021
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.04 [intr/swi6: Gian 100023
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.00 [intr/irq14: ata 100024
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:39.96 [intr/irq15: ata 100025
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:17.51 [intr/irq17: mpt 100026
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.22 [intr/irq1: atkb 100029
    root     12  0.0  0.0      0  192 ??  WL    1Jun16     0:00.00 [intr/swi0: uart 100030
    root     13  0.0  0.0      0   48 ??  DL    1Jun16     0:00.02 [geom/g_event]   100007
    root     13  0.0  0.0      0   48 ??  DL    1Jun16     0:14.13 [geom/g_up]      100008
    root     13  0.0  0.0      0   48 ??  DL    1Jun16     0:48.12 [geom/g_down]    100009
    root     14  0.0  0.0      0   16 ??  DL    1Jun16     0:58.29 [yarrow]         100013
    root     15  0.0  0.0      0   16 ??  DL    1Jun16     0:01.25 [idlepoll]       100037
    root     16  0.0  0.0      0   16 ??  DL    1Jun16     0:05.68 [bufdaemon]      100038
    root     17  0.0  0.0      0   16 ??  DL    1Jun16     0:08.01 [vnlru]          100039
    root     18  0.0  0.0      0   16 ??  DL    1Jun16    11:42.39 [syncer]         100040
    root     19  0.0  0.0      0   16 ??  DL    1Jun16     0:10.25 [softdepflush]   100041
    root    190  0.0  0.1   9972 1040 ??  Is    1Jun16     0:00.00 adjkerntz -i     100052
    root    258  0.0  0.1  12088 1260 ??  Is    1Jun16     0:00.42 dhclient: em0 [p 100044
    _dhcp   296  0.0  0.1  12088 1320 ??  Is    1Jun16     0:00.60 dhclient: em0 (d 100061
    root    447  0.0  0.4  10376 4012 ??  Is    1Jun16     0:00.68 /sbin/devd       100055
    root    751  0.0  0.1  14172 1456 ??  Ss    1Jun16     0:01.38 /usr/sbin/rpcbin 100043
    root    783  0.0  0.2  12088 1876 ??  Is    1Jun16     0:00.01 /usr/sbin/mountd 100054
    root    799  0.0  0.1   9968 1188 ??  Is    1Jun16     0:00.00 nfsuserd: master 100049
    root    801  0.0  0.1   9968 1192 ??  S     1Jun16     0:01.24 nfsuserd: slave  100060
    root    802  0.0  0.1   9968 1192 ??  S     1Jun16     0:00.53 nfsuserd: slave  100056
    root    803  0.0  0.1   9968 1192 ??  S     1Jun16     0:00.50 nfsuserd: slave  100048
    root    804  0.0  0.1   9968 1192 ??  S     1Jun16     0:00.46 nfsuserd: slave  100057
    root    805  0.0  0.1   9964 1524 ??  Is    1Jun16     0:00.03 nfsd: master (nf 100046
    root    808  0.0  0.1   9964 1092 ??  S     1Jun16     0:03.71 nfsd: server (nf 100053
    root    808  0.0  0.1   9964 1092 ??  I     1Jun16     0:00.00 nfsd: server (nf 100062
    root    808  0.0  0.1   9964 1092 ??  I     1Jun16     0:00.00 nfsd: server (nf 100063
    root    808  0.0  0.1   9964 1092 ??  I     1Jun16     0:01.89 nfsd: server (nf 100064
    root    825  0.0  0.1 274196 1296 ??  Ss    1Jun16     0:00.87 /usr/sbin/rpc.st 100067
    root    827  0.0  0.1  14176 1248 ??  Ss    1Jun16     0:01.46 /usr/sbin/rpc.lo 100069
    root    837  0.0  0.3  53272 2940 ??  S     1Jun16    11:48.48 /usr/local/bin/v 100070
    root    987  0.0  0.2  49272 2552 ??  Is    1Jun16     0:00.02 /usr/sbin/sshd   100050
    root    996  0.0  0.2  20352 2064 ??  Ss    1Jun16     0:20.07 sendmail: accept 100066
    smmsp   999  0.0  0.2  20352 2032 ??  Is    1Jun16     0:00.33 sendmail: Queue  100074
    root   8317  0.0  0.6  74532 5876 ??  Is   11:46AM     0:00.04 sshd: sg [priv]  100058
    sg     8320  0.0  0.6  74532 5888 ??  S    11:46AM     0:00.71 sshd: sg@pts/1 ( 100110
    root  63544  0.0  0.5  74532 5316 ??  Is   Tue01PM     0:02.14 sshd: sg [priv]  100094
    sg    63547  0.0  0.5  74532 5436 ??  I    Tue01PM     0:09.39 sshd: sg@pts/0 ( 100098
    root  82014  0.0  0.1  12092 1276 ??  Ss    3Jun16     0:03.95 /usr/sbin/syslog 100078
    root  83642  0.0  0.1  14188 1384 ??  Ss    3Jun16     0:04.45 /usr/sbin/cron - 100047
    root   1069  0.0  0.1  12092 1100 u0  Is+   1Jun16     0:00.00 /usr/libexec/get 100076
    root   1070  0.0  0.1  12092 1100 v0  Is+   1Jun16     0:00.00 /usr/libexec/get 100077
    root   1075  0.0  0.1  45344 1480 v1  Is    1Jun16     0:00.02 login [pam] (log 100082
    root   2964  0.0  0.3  14516 2872 v1  I+    1Jun16     0:00.23 -csh (csh)       100095
    root   1063  0.0  0.1  45344 1480 v2  Is    1Jun16     0:00.01 login [pam] (log 100071
    root   8153  0.0  0.2  14516 2188 v2  I+    1Jun16     0:00.03 -csh (csh)       100090
    root   1064  0.0  0.1  12092 1100 v3  Is+   1Jun16     0:00.00 /usr/libexec/get 100051
    root   1065  0.0  0.1  12092 1100 v4  Is+   1Jun16     0:00.00 /usr/libexec/get 100042
    root   1066  0.0  0.1  12092 1100 v5  Is+   1Jun16     0:00.00 /usr/libexec/get 100073
    root   1067  0.0  0.1  12092 1100 v6  Is+   1Jun16     0:00.00 /usr/libexec/get 100072
    root   1068  0.0  0.1  12092 1100 v7  Is+   1Jun16     0:00.00 /usr/libexec/get 100075
    sg    63548  0.0  0.3  14516 2580  0  Is   Tue01PM     0:00.01 -tcsh (tcsh)     100102
    root  63551  0.0  0.2  45340 1972  0  I    Tue01PM     0:00.01 su -m            100096
    root  63552  0.0  0.3  14516 3104  0  I+   Tue01PM     0:00.35 _su -m (tcsh)    100106
    sg     8321  0.0  0.3  14516 2800  1  Is   11:46AM     0:00.01 -tcsh (tcsh)     100099
    root   8324  0.0  0.2  45340 2172  1  I    11:46AM     0:00.00 su -m            100109
    root   8325  0.0  0.3  14516 3572  1  S    11:46AM     0:00.15 _su -m (tcsh)    100091
    root   8515  0.0  0.2  14464 2156  1  T    11:47AM     0:00.04 emacs work/net-s 100086
    root  26982  0.0  0.2  16300 1776  1  R+   12:17PM     0:00.00 ps -auxH -o lwp  100080
    

    I added a debug log line in swrun_kinfo.c:

            if ( -1 == proc_table[i].SWRUN_K_PID )
                continue;
    #ifdef SWRUN_K_TID
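            /* Note: where SWRUN_K_TID is defined, kernel threads (PPID 0)
             * are re-indexed by their thread id, keeping the index unique;
             * that fallback is evidently not in effect in this build,
             * hence the duplicate PIDs below. */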
            if ( 0 == proc_table[i].SWRUN_K_PPID )
                proc_table[i].SWRUN_K_PID = proc_table[i].SWRUN_K_TID;
    #endif
    
    >>>        fprintf(stderr,"got ID %u\n",proc_table[i].SWRUN_K_PID);
    
            entry = netsnmp_swrun_entry_create(proc_table[i].SWRUN_K_PID);
    

    When running 'snmpwalk -v 2c -c public 10.0.231.14 HOST-RESOURCES-MIB::hrSWRunTable', I got these duplicate lines:

    got ID 827
    got ID 825
    got ID 808
    got ID 808
    got ID 808
    got ID 808
    got ID 805
    got ID 804

    -> 808 is nfsd (multi-threaded)
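
    The duplicates are adjacent in the log. Purely as an illustration (a sketch that assumes kvm_getprocs() returns a process's threads as consecutive entries, which this log suggests but the kvm(3) API does not guarantee), the repeated PIDs could be skipped like this:

    /* dedupe_sketch.c -- illustrative only: replays the PID sequence from
     * the log above and skips adjacent repeats */
    #include <stdio.h>

    int main(void)
    {
        int pids[] = { 827, 825, 808, 808, 808, 808, 805, 804 };
        int last = -1;
        unsigned i;

        for (i = 0; i < sizeof pids / sizeof *pids; i++) {
            if (pids[i] == last)
                continue;              /* another thread of the same process */
            last = pids[i];
            printf("got ID %d\n", pids[i]);    /* 808 now appears only once */
        }
        return 0;
    }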

     
  • Niels Baggesen

    Niels Baggesen - 2016-06-20

    OK, I see the problem. Yes, you are right that replacing KERN_PROC_ALL with KERN_PROC_PROC would fix it.
    On the other hand, I would like to keep the kernel threads visible, but maybe that is not possible with one sweep of the proc table.
    I will have to think a bit more about this.

     
  • Markus Wennrich

    Markus Wennrich - 2017-05-04

    Thank you for the patch.
    We have heavily threaded Java applications and had to restart snmpd every couple of weeks because of this leak.
    The patch works perfectly for us :-)

     
  • Markus Wennrich

    Markus Wennrich - 2017-05-04

    Actually, CONTAINER_INSERT returns -1 if the entry is a duplicate.

    So the correct solution is:

    --- swrun_kinfo.c.orig  2017-05-04 20:37:09.650803000 +0200
    +++ swrun_kinfo.c       2017-05-04 20:38:46.640120000 +0200
    @@ -176,7 +176,7 @@
     netsnmp_arch_swrun_container_load( netsnmp_container *container, u_int flags)
     {
         struct SWRUN_TABLE  *proc_table;
    -    int                  nprocs, i, rc;
    +    int                  nprocs, i;
         char                 buf[BUFSIZ+1], *argv;
         netsnmp_swrun_entry *entry;

    @@ -203,7 +203,12 @@
             entry = netsnmp_swrun_entry_create(proc_table[i].SWRUN_K_PID);
             if (NULL == entry)
                 continue; /* error already logged by function */
    -        rc = CONTAINER_INSERT(container, entry);
    +        if ( -1 == CONTAINER_INSERT(container, entry)) {
    +            // entry didn't get inserted (duplicate key)
    +            free(entry);
    +            continue;
    +        }
    +

             /*
              * There are two possible sources for the command being run:
    

    That way kernel threads get inserted, normal processes get inserted (once), and memory doesn't leak.
    Will upload patch file.

     

    Last edit: Markus Wennrich 2017-05-04
  • Bart Van Assche

    Bart Van Assche - 2020-02-01
    • status: pending --> closed
     
