Skip to content

modbus: Race condition on uart_buf_ctr corrupting RAM #91088

Open
@legoabram

Description

@legoabram

Describe the bug

I've encountered a serious race condition in the Modbus serial driver that can corrupt massive swaths of RAM. The offending function is modbus_rtu_rx_adu. Between the checks against cfg->uart_buf_ctr being greater then or equal to 4 bytes, uart_buf_ctr can be assigned a value of 3 or less. Then when the rx_adu length is calculated, the value overflows and a length of 65535 can be set. This is then directly passed to memcpy which will faithfully destroy the RAM.

Unfortunately, the best I can offer for logs is my dissection of the core dump I took from a failed device. I can reopen this anytime, so feel free to ask for more info. Additionally, the runaway memcpy destroyed ALL of the thread control blocks, so I've had to reverse engineer the stacks from the top down.

The actual crash of our device comes from the stm32 gpio interrupt handler loading an invalid function pointer it's corrupted table. This interrupt is triggered at 16kHz.

There are two stacks of interest: An application thread in the middle of calling modbus_read_holding_regs, and the work queue thread. My best guess for what happened is that the server_work item was triggered or scheduled to run, which then executed once the application thread waited on a semaphore after initializing a transmission. The work item just happened to start processing before any bytes were submitted, (when uart_buf_ctr was still >= 4) but then dropped below during a UART TX interrupt in the 10 instructions between when uart_buf_ctr is loaded once for the check, and once for the math.

The immediately obvious solution would be to only load the value once into a register or stack variable and work on that, but the bigger problem is that it was treating the buffer as RX data in the middle of a TX. As such, I'm not sure how to proceed.

Here is the top of the application stack, which I've broken up into my best guess for function calls where I can, and forgive me for the data dump.

Application Stack:

# More stack
...
0x2401894c <115>: 0x2400bdb0 z_idle_threads in section bss
0x24018950 <116>: 0x00000000 No symbol matches $curr_value.
0x24018954 <117>: 0x2400bbd8 ats_strThread + 24 in section bss
0x24018958 <118>: 0x00000000 No symbol matches $curr_value.
0x2401895c <119>: 0x0815b331 elapsed + 17 in section text
0x24018960 <120>: 0x000000fa No symbol matches $curr_value.
0x24018964 <121>: 0x0815b371 z_add_timeout + 57 in section text
0x24018968 <122>: 0x2400bbc0 ats_strThread in section bss
0x2401896c <123>: 0x00000000 No symbol matches $curr_value.
0x24018970 <124>: 0x000000fa No symbol matches $curr_value.
0x24018974 <125>: 0x240189e0 ats_strThreadStack + 608 in section noinit
0x24018978 <126>: 0x00000003 No symbol matches $curr_value.
0x2401897c <127>: 0x24018a50 ats_strThreadStack + 720 in section noinit
0x24018980 <128>: 0x00000000 No symbol matches $curr_value.
0x24018984 <129>: 0xe000ed00 No symbol matches $curr_value.
0x24018988 <130>: 0x24015344 _kernel in section bss
0x2401898c <131>: 0x00000000 No symbol matches $curr_value.
0x24018990 <132>: 0x00000000 No symbol matches $curr_value.
0x24018994 <133>: 0x0815adb7 z_pend_curr + 47 in section text
0x24018998 <134>: 0x0812f268 arch_swap + 32 in section text
0x2401899c <135>: 0x61000000 No symbol matches $curr_value.

# Lost track of what the stack was doing here.

z_pend_curr
    #  push    {r4, lr}
0x240189a0 <136>: 0x000000fa No symbol matches $curr_value.                 r4
0x240189a4 <137>: 0x0815a741 z_impl_k_sem_take + 61 in section text         lr
z_pend_curr

z_impl_k_sem_take
    # sub     sp, #8
0x240189a8 <138>: 0x000000fa No symbol matches $curr_value.
0x240189ac <139>: 0x00000000 No symbol matches $curr_value.
    # push    {r4, lr}
0x240189b0 <140>: 0x24000a88 mb_ctx_tbl in section datas                    r4
0x240189b4 <141>: 0x081375d7 modbus_tx_wait_rx_adu + 95 in section text     lr

modbus_tx_wait_rx_adu
    # sub     sp, #32
0x240189b8 <142>: 0x2400bbc0 ats_strThread in section bss
0x240189bc <143>: 0x0816bf99 add_to_waitq_locked + 11 in section text
0x240189c0 <144>: 0x00000000 No symbol matches $curr_value.
0x240189c4 <145>: 0xe000ed00 No symbol matches $curr_value.
0x240189c8 <146>: 0x24015344 _kernel in section bss
0x240189cc <147>: 0x00000000 No symbol matches $curr_value.
0x240189d0 <148>: 0x4000ae84 No symbol matches $curr_value.
0x240189d4 <149>: 0x0815adb7 z_pend_curr + 47 in section text
    # push    {r4, lr}
0x240189d8 <150>: 0x24000a88 mb_ctx_tbl in section datas                    r4
0x240189dc <151>: 0x08139725 mbc_send_cmd + 29 in section text              lr

mbc_send_cmd
    # sub     sp, #36 ; 0x24
0x240189e0 <152>: 0xffffffff No symbol matches $curr_value.
0x240189e4 <153>: 0x0815a741 z_impl_k_sem_take + 61 in section text
0x240189e8 <154>: 0x2400bfa0 slice_timeouts in section bss
0x240189ec <155>: 0x2400bdb0 z_idle_threads in section bss
0x240189f0 <156>: 0x00000000 No symbol matches $curr_value.
0x240189f4 <157>: 0x0815aa71 z_reset_time_slice + 25 in section text
0x240189f8 <158>: 0x00000000 No symbol matches $curr_value.
0x240189fc <159>: 0x00000000 No symbol matches $curr_value.
0x24018a00 <160>: 0x83c06d74 No symbol matches $curr_value.
    # stmdb   sp!, {r4, r5, r6, r7, r8, r9, lr}
0x24018a04 <161>: 0x24000a88 mb_ctx_tbl in section datas                    r4
0x24018a08 <162>: 0x00000095 No symbol matches $curr_value.                 r5
0x24018a0c <163>: 0x00000001 No symbol matches $curr_value.                 r6
0x24018a10 <164>: 0x24000aa4 mb_ctx_tbl + 28 in section datas               r7
0x24018a14 <165>: 0x24018a50 ats_strThreadStack + 720 in section noinit     r8
0x24018a18 <166>: 0x00000003 No symbol matches $curr_value.                 r9
0x24018a1c <167>: 0x0816613d modbus_read_holding_regs + 83 in section text  lr

modbus_read_holding_regs
    # stmdb   sp!, {r3, r4, r5, r6, r7, r8, r9, lr}
0x24018a20 <168>: 0x24018a50 ats_strThreadStack + 720 in section noinit     r3
0x24018a24 <169>: 0xfffffff8 No symbol matches $curr_value.                 r4
0x24018a28 <170>: 0x7c3f928a No symbol matches $curr_value.                 r5
0x24018a2c <171>: 0xc0600af2 No symbol matches $curr_value.                 r6
0x24018a30 <172>: 0x24018a48 ats_strThreadStack + 712 in section noinit     r7
0x24018a34 <173>: 0x00000000 No symbol matches $curr_value.                 r8
0x24018a38 <174>: 0x00000000 No symbol matches $curr_value.                 r9
0x24018a3c <175>: 0x081562f7 ats_ThreadEntry + 487 in section text          lr

ats_ThreadEntry
    # sub     sp, #92 ; 0x5c
0x24018a40 <176>: 0x00000003 No symbol matches $curr_value.
0x24018a44 <177>: 0x00000000 No symbol matches $curr_value.
0x24018a48 <178>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a4c <179>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a50 <180>: 0x00000080 No symbol matches $curr_value.
0x24018a54 <181>: 0xaaaa0003 No symbol matches $curr_value.
0x24018a58 <182>: 0x00000100 No symbol matches $curr_value.
0x24018a5c <183>: 0x00000001 No symbol matches $curr_value.
0x24018a60 <184>: 0x0000008d No symbol matches $curr_value.
0x24018a64 <185>: 0x00000000 No symbol matches $curr_value.
0x24018a68 <186>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a6c <187>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a70 <188>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a74 <189>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a78 <190>: 0x00000002 No symbol matches $curr_value.
0x24018a7c <191>: 0x08174960 No symbol matches $curr_value.
0x24018a80 <192>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a84 <193>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a88 <194>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a8c <195>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a90 <196>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a94 <197>: 0xaaaaaaaa No symbol matches $curr_value.
0x24018a98 <198>: 0xaaaaaaaa No symbol matches $curr_value.

    # stmdb   sp!, {r4, r5, r6, r7, r8, r9, lr}
0x24018a9c <199>: 0x08156111 ats_ThreadEntry + 1 in section text    r4
0x24018aa0 <200>: 0x00000000 No symbol matches $curr_value.         r5
0x24018aa4 <201>: 0x00000000 No symbol matches $curr_value.         r6
0x24018aa8 <202>: 0x00000000 No symbol matches $curr_value.         r7
0x24018aac <203>: 0x00000000 No symbol matches $curr_value.         r8
0x24018ab0 <204>: 0x00000000 No symbol matches $curr_value.         r9
0x24018ab4 <205>: 0x08161347 z_thread_entry + 13 in section text    lr

z_thread_entry
0x24018ab8 <206>: 0x00000000 No symbol matches $curr_value.
0x24018abc <207>: 0xaaaaaaaa No symbol matches $curr_value.

System Work Queue Stack:

# More Stack
...
0x2401b518 <198>: 0x000013ff No symbol matches $curr_value.
0x2401b51c <199>: 0x00000000 No symbol matches $curr_value.
0x2401b520 <200>: 0x24000ae8 mb_ctx_tbl + 96 in section datas
0x2401b524 <201>: 0x0813793b modbus_rtu_rx_adu + 67 in section text
    # modbus_rtu_rx_adu + 67 matches the LR for 0x8137936 <modbus_rtu_rx_adu+62>:    bl      0x8120f6c <memcpy>
0x2401b528 <202>: 0x08120ffe memcpy + 146 in section text
    # I'm uncertain, but I think this is the PC that was currently executing in this thread when the external interrupt fired.
0x2401b52c <203>: 0x21000000 No symbol matches $curr_value.
0x2401b530 <204>: 0x0815aac8 z_reschedule + 32 in section text
0x2401b534 <205>: 0x00000708 No symbol matches $curr_value.
0x2401b538 <206>: 0x5cc1fcfd No symbol matches $curr_value.
0x2401b53c <207>: 0x00830000 No symbol matches $curr_value.
0x2401b540 <208>: 0x2401b5a8 sys_work_q_stack + 936 in section noinit
0x2401b544 <209>: 0x0816cac4 __device_dts_ord_63 in section device_area
0x2401b548 <210>: 0x00000000 No symbol matches $curr_value.
0x2401b54c <211>: 0x00002710 No symbol matches $curr_value.
0x2401b550 <212>: 0x2401b5c0 sys_work_q_stack + 960 in section noinit
0x2401b554 <213>: 0x0813c925 can_mcan_send + 1 in section text
0x2401b558 <214>: 0x00000000 No symbol matches $curr_value.
0x2401b55c <215>: 0x00000000 No symbol matches $curr_value.
0x2401b560 <216>: 0x00000000 No symbol matches $curr_value.
0x2401b564 <217>: 0x0813b9d7 z_impl_can_send + 39 in section text
0x2401b568 <218>: 0x081307b9 receive_can_tx + 1 in section text
0x2401b56c <219>: 0x24001b88 app_ICC_CANISOTP + 40 in section bss
0x2401b570 <220>: 0x00000000 No symbol matches $curr_value.
0x2401b574 <221>: 0x00000000 No symbol matches $curr_value.

modbus_rtu_rx_adu
    # sub     sp, #56 ; 0x38
0x2401b578 <222>: 0x00000000 No symbol matches $curr_value.
0x2401b57c <223>: 0x081664dd net_buf_alloc_len + 185 in section text
0x2401b580 <224>: 0x2401c260 _net_buf_isotp_rx_sf_ff_pool + 56 in section noinit
0x2401b584 <225>: 0x08166415 fixed_data_alloc + 39 in section text
0x2401b588 <226>: 0x2401c260 _net_buf_isotp_rx_sf_ff_pool + 56 in section noinit
0x2401b58c <227>: 0x2401b5bc sys_work_q_stack + 956 in section noinit
0x2401b590 <228>: 0x00000000 No symbol matches $curr_value.
0x2401b594 <229>: 0x08166399 data_alloc + 37 in section text
0x2401b598 <230>: 0x00000000 No symbol matches $curr_value.
0x2401b59c <231>: 0xe000ed00 No symbol matches $curr_value.
0x2401b5a0 <232>: 0x24015344 _kernel in section bss
0x2401b5a4 <233>: 0x00000000 No symbol matches $curr_value.
0x2401b5a8 <234>: 0x24001a08 isotp_rx_sf_ff_pool in section net_buf_pool_area
0x2401b5ac <235>: 0x0815adb7 z_pend_curr + 47 in section text
    # push    {r4, r5, r6, lr}
0x2401b5b0 <236>: 0x24000038 modbus_serial_cfg in section datas                 r4
0x2401b5b4 <237>: 0x24000a88 mb_ctx_tbl in section datas                        r5
0x2401b5b8 <238>: 0x00000000 No symbol matches $curr_value.                     r6
0x2401b5bc <239>: 0x08137b63 modbus_serial_rx_adu + 35 in section text          lr

modbus_serial_rx_adu
0x2401b5c0 <240>: 0x00000000 No symbol matches $curr_value.
0x2401b5c4 <241>: 0x00000000 No symbol matches $curr_value.
0x2401b5c8 <242>: 0x00000000 No symbol matches $curr_value.
0x2401b5cc <243>: 0x00000000 No symbol matches $curr_value.
0x2401b5d0 <244>: 0x00000000 No symbol matches $curr_value.
0x2401b5d4 <245>: 0x08165d59 modbus_serial_rx_off + 11 in section text
0x2401b5d8 <246>: 0x24000ad0 mb_ctx_tbl + 72 in section datas
0x2401b5dc <247>: 0x08165e7f modbus_serial_rx_disable + 7 in section text
0x2401b5e0 <248>: 0x24000ad0 mb_ctx_tbl + 72 in section datas                   r4
0x2401b5e4 <249>: 0x081374f3 modbus_rx_handler + 55 in section text             lr

modbus_rx_handler
0x2401b5e8 <250>: 0x00000000 No symbol matches $curr_value.
0x2401b5ec <251>: 0x00000000 No symbol matches $curr_value.
0x2401b5f0 <252>: 0xffffffff No symbol matches $curr_value.
0x2401b5f4 <253>: 0xffffffff No symbol matches $curr_value.
0x2401b5f8 <254>: 0x00000000 No symbol matches $curr_value.
0x2401b5fc <255>: 0x0815adb1 z_pend_curr + 41 in section text
0x2401b600 <256>: 0x00000000 No symbol matches $curr_value.
0x2401b604 <257>: 0x0815b1c7 z_sched_wait + 19 in section text
0x2401b608 <258>: 0x24000ad0 mb_ctx_tbl + 72 in section datas           r4
0x2401b60c <259>: 0x2400bfc0 k_sys_work_q in section bss                r5
0x2401b610 <260>: 0x00000000 No symbol matches $curr_value.             r6
0x2401b614 <261>: 0x0815a913 work_queue_main + 139 in section text      lr

work_queue_main
0x2401b618 <262>: 0xffffffff No symbol matches $curr_value.
0x2401b61c <263>: 0xffffffff No symbol matches $curr_value.
0x2401b620 <264>: 0x00000000 No symbol matches $curr_value.
0x2401b624 <265>: 0x2400bfc0 k_sys_work_q in section bss
0x2401b628 <266>: 0x0815a889 work_queue_main + 1 in section text
0x2401b62c <267>: 0x00000000 No symbol matches $curr_value.
0x2401b630 <268>: 0x00000000 No symbol matches $curr_value.
0x2401b634 <269>: 0x08161347 z_thread_entry + 13 in section text

z_thread_entry
0x2401b638 <270>: 0x00000000 No symbol matches $curr_value.
0x2401b63c <271>: 0xaaaaaaaa No symbol matches $curr_value.

Then a couple of GDB outputs to show what happened. These, thankfully, weren't corrupted.

(gdb) p mb_ctx_tbl[0].iface_lock
$22 = {wait_q = {waitq = {{head = 0x24000aa4 <mb_ctx_tbl+28>, next = 0x24000aa4 <mb_ctx_tbl+28>}, {
        tail = 0x24000aa4 <mb_ctx_tbl+28>, prev = 0x24000aa4 <mb_ctx_tbl+28>}}}, owner = 0x2400bbc0 <ats_strThread>,
  lock_count = 1, owner_orig_prio = 0}
(gdb) p *mb_ctx_tbl[0].cfg
$24 = {dev = 0x816c9fc <__device_dts_ord_156>, rtu_timeout = 2005,
  uart_buf_ptr = 0x24000091 <modbus_serial_cfg+89> "ç\003`ª", de = 0x24000d08 <de_gpios_cfg_0>, re = 0x0,
  rtu_timer = {timeout = {node = {{head = 0x0, next = 0x0}, {tail = 0x0, prev = 0x0}},
      fn = 0x815b541 <z_timer_expiration_handler>, dticks = 0}, wait_q = {waitq = {{
          head = 0x24000068 <modbus_serial_cfg+48>, next = 0x24000068 <modbus_serial_cfg+48>}, {
          tail = 0x24000068 <modbus_serial_cfg+48>, prev = 0x24000068 <modbus_serial_cfg+48>}}},
    expiry_fn = 0x8137a2d <rtu_tmr_handler>, stop_fn = 0x0, period = {ticks = 0}, status = 1,
    user_data = 0x24000a88 <mb_ctx_tbl>}, uart_buf_ctr = 3,
  uart_buf = "\001ê\000\000\003\025ç\003`ª", '\000' <repeats 244 times>}
(gdb) p mb_ctx_tbl[0].rx_adu
$26 = {trans_id = 0, proto_id = 0, length = 65535, unit_id = 1 '\001', fc = 234 'ê',
  data = "\000\000\003\025ç\003`ª", '\000' <repeats 242 times>, crc = 0}

Regression

  • This is a regression.

Steps to reproduce

Forgive me, but I have no way to reliably replicate this, and in fact I can barely get it to happen with any useable frequency.

Relevant log output

Impact

Major – Severely degrades functionality; workaround is difficult or unavailable.

Environment

  • OS: Windows 11
  • Zephyr 3.5.0-rc3 (bbf46b9)
  • Zephyr SDK 0.17.0
  • STM32H753VI

Additional Context

The target heavily uses DMA and has a 16kHz interrupt from an external pin to retrieve data over SPI from 2 ADCs. This likely heavily affects scheduling and such.

Metadata

Metadata

Assignees

Labels

area: modbusbugThe issue is a bug, or the PR is fixing a bugpriority: lowLow impact/importance bug

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions