Overflow occurs after 35 epochs of GPU training #441

@haexin

Description

/home/hwx1410890/.local/lib/python3.10/site-packages/mindspore/lib/libmindspore_ops.so: undefined symbol: _ZN9mindspore20CheckAndConvertUtils16CheckTypeIdsSameERKSsRKSt6vectorINS_6TypeIdESaIS4_EES2_

mindyolo 0.5.0 + Python 3.11 + MindSpore 2.5.0
2025-04-10 16:49:32,284 [INFO] Epoch 35/300, Step 4/4, imgsize (640, 640), loss: 0.2778, lbox: 0.1035, lobj: 0.0723, lcls: 0.1021, cur_lr: 0.008844999596476555
2025-04-10 16:49:32,287 [INFO] Epoch 35/300, Step 4/4, step time: 23330.36 ms
2025-04-10 16:49:32,977 [INFO] Saving model to ./runs/2025.04.10-15.40.05/weights/yolov5s-35_4.ckpt
2025-04-10 16:49:32,978 [INFO] Epoch 35/300, epoch time: 1.57 min.
2025-04-10 16:51:08,503 [INFO] Epoch 36/300, Step 4/4, imgsize (640, 640), loss: 0.2732, lbox: 0.1003, lobj: 0.0710, lcls: 0.1019, cur_lr: 0.008812000043690205
2025-04-10 16:51:08,507 [INFO] Epoch 36/300, Step 4/4, step time: 23882.45 ms
2025-04-10 16:51:09,268 [INFO] Saving model to ./runs/2025.04.10-15.40.05/weights/yolov5s-36_4.ckpt
2025-04-10 16:51:09,268 [INFO] Epoch 36/300, epoch time: 1.60 min.
2025-04-10 16:52:44,639 [INFO] Epoch 37/300, Step 4/4, imgsize (640, 640), loss: 0.2813, lbox: 0.1001, lobj: 0.0793, lcls: 0.1019, cur_lr: 0.00877899955958128
2025-04-10 16:52:44,643 [INFO] Epoch 37/300, Step 4/4, step time: 23843.84 ms
2025-04-10 16:52:45,312 [INFO] Saving model to ./runs/2025.04.10-15.40.05/weights/yolov5s-37_4.ckpt
2025-04-10 16:52:45,312 [INFO] Epoch 37/300, epoch time: 1.60 min.
2025-04-10 16:54:20,197 [WARNING] overflow, still update, loss scale adjust to 1024.0
2025-04-10 16:54:20,202 [INFO] Epoch 38/300, Step 4/4, imgsize (640, 640), loss: 0.2788, lbox: 0.1005, lobj: 0.0768, lcls: 0.1014, cur_lr: 0.00874600000679493
2025-04-10 16:54:20,205 [INFO] Epoch 38/300, Step 4/4, step time: 23723.17 ms
2025-04-10 16:54:20,973 [INFO] Saving model to ./runs/2025.04.10-15.40.05/weights/yolov5s-38_4.ckpt
2025-04-10 16:54:20,974 [INFO] Epoch 38/300, epoch time: 1.59 min.
2025-04-10 16:55:39,614 [WARNING] overflow, still update, loss scale adjust to 1024.0
2025-04-10 16:55:39,620 [INFO] Epoch 39/300, Step 4/4, imgsize (640, 640), loss: nan, lbox: nan, lobj: nan, lcls: nan, cur_lr: 0.00871300045400858
2025-04-10 16:55:39,624 [INFO] Epoch 39/300, Step 4/4, step time: 19662.64 ms
2025-04-10 16:55:40,264 [INFO] Saving model to ./runs/2025.04.10-15.40.05/weights/yolov5s-39_4.ckpt
2025-04-10 16:55:40,264 [INFO] Epoch 39/300, epoch time: 1.32 min.
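
For anyone triaging a log like the one above, here is a minimal sketch that finds the overflow warnings and the first epoch whose loss is no longer finite. It uses only the Python standard library. The function name `scan_log` and both regular expressions are assumptions based on the log format pasted in this issue; they are not part of mindyolo or MindSpore and may need adjusting for other versions.

```python
import math
import re

# Patterns are assumptions derived from the log lines in this issue.
LOSS_RE = re.compile(r"Epoch (\d+)/\d+, Step (\d+)/\d+, .*?loss: ([^,\s]+),")
OVERFLOW_RE = re.compile(
    r"\[WARNING\] overflow, still update, loss scale adjust to ([\d.]+)"
)


def scan_log(lines):
    """Return (overflow_scales, first_nan_epoch) for an iterable of log lines.

    overflow_scales: loss-scale values reported by each overflow warning.
    first_nan_epoch: epoch number of the first non-finite loss, or None.
    """
    overflow_scales = []
    first_nan_epoch = None
    for line in lines:
        m = OVERFLOW_RE.search(line)
        if m:
            overflow_scales.append(float(m.group(1)))
            continue
        m = LOSS_RE.search(line)
        if m and first_nan_epoch is None:
            loss = float(m.group(3))  # float() accepts the literal "nan"
            if not math.isfinite(loss):
                first_nan_epoch = int(m.group(1))
    return overflow_scales, first_nan_epoch


# Four lines copied from the log above.
sample = [
    "2025-04-10 16:54:20,197 [WARNING] overflow, still update, loss scale adjust to 1024.0",
    "2025-04-10 16:54:20,202 [INFO] Epoch 38/300, Step 4/4, imgsize (640, 640), loss: 0.2788, lbox: 0.1005, lobj: 0.0768, lcls: 0.1014, cur_lr: 0.00874600000679493",
    "2025-04-10 16:55:39,614 [WARNING] overflow, still update, loss scale adjust to 1024.0",
    "2025-04-10 16:55:39,620 [INFO] Epoch 39/300, Step 4/4, imgsize (640, 640), loss: nan, lbox: nan, lobj: nan, lcls: nan, cur_lr: 0.00871300045400858",
]
scales, nan_epoch = scan_log(sample)
print(scales, nan_epoch)
```

On the excerpt above this reports two overflow warnings at loss scale 1024.0 and the first NaN loss at epoch 39, which suggests that the checkpoint saved at epoch 37 (yolov5s-37_4.ckpt) is the last one written before any overflow warning.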
