linux-kernel-fault-codes

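Linux Kernel 6.6.0 Fault Error Codes and System Impact: Complete Reference Manual

Scope: every trace event rasdaemon consumes, plus the error-code definitions behind it: MCE / AER / EDAC / CXL / extlog / non-standard / memory-failure / devlink / block / signal
Baseline: Linux kernel 6.6.0 (/work/work/git_code/linux)
Nature: kernel-side error-code dictionary with severity, system impact, and source provenance (file:line)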


0. Overview: where the 18 trace events rasdaemon consumes originate in the kernel

Trace event | Header / source file | Trigger point
mce_record | include/trace/events/mce.h | arch/x86/kernel/cpu/mce/core.c
mc_event | include/ras/ras_event.h:98-163 | drivers/edac/edac_mc.c:929 edac_raw_mc_handle_error()
aer_event | include/ras/ras_event.h:269-339 | drivers/pci/pcie/aer.c
cxl_aer_uncorrectable_error | drivers/cxl/core/trace.h:51 | drivers/cxl/core/pci.c
cxl_aer_correctable_error | drivers/cxl/core/trace.h:99 | drivers/cxl/core/pci.c
cxl_overflow | drivers/cxl/core/trace.h:127 | drivers/cxl/core/mbox.c
cxl_generic_event | drivers/cxl/core/trace.h:225 | drivers/cxl/core/mbox.c:863 cxl_event_trace_record()
cxl_general_media | drivers/cxl/core/trace.h:315 | same as above
cxl_dram | drivers/cxl/core/trace.h:398 | same as above
cxl_memory_module | drivers/cxl/core/trace.h:547 | same as above
cxl_poison | drivers/cxl/core/trace.h:643 | drivers/cxl/core/mbox.c
arm_event | include/ras/ras_event.h:171-208 | drivers/ras/ras.c:24 log_arm_hw_error(); ghes.c:507
extlog_mem_event | include/ras/ras_event.h:27-77 | drivers/acpi/acpi_extlog.c:178
non_standard_event | include/ras/ras_event.h:219-253 | drivers/ras/ras.c:17; ghes.c:676
memory_failure_event | include/ras/ras_event.h:399-423 | mm/memory-failure.c:1323
signal_generate | include/trace/events/signal.h:50-80 | kernel/signal.c
devlink_health_report | include/trace/events/devlink.h:81-107 | net/devlink/health.c
block_rq_error | include/trace/events/block.h:165-170 | block/blk-core.c

1. MCE: Machine Check Exception (x86)

1.1 mce_record trace event fields

include/trace/events/mce.h:12-70, mirroring struct mce (arch/x86/include/uapi/asm/mce.h:13-39):

Field | Type | Source MSR / register
cpu | u32 | m->extcpu
mcgcap | u64 | MCG_CAP
mcgstatus | u64 | MCG_STATUS
bank | u8 | 0…63 (MCA bank index)
status | u64 | MCi_STATUS
ipid | u64 | MCA_IPID (SMCA only)
addr | u64 | MCi_ADDR
misc | u64 | MCi_MISC
synd | u64 | MCA_SYND (SMCA only)
cs | u8 | code segment
ip | u64 | instruction pointer
tsc | u64 | RDTSC
cpuvendor | u32 | X86_VENDOR_* (INTEL=0, AMD=2, HYGON=9, ZHAOXIN=10, …)
cpuid | u32 | CPUID 1.EAX (family/model/stepping)
walltime | u64 | m->time (seconds)
socketid | u32 | CPU socket id
apicid | u32 | initial APIC ID

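1.2 MCG_STATUS bits (global machine-check status)

arch/x86/include/asm/mce.h:23-27

Bit | Name | Meaning
0 | MCG_STATUS_RIPV | Restart IP valid: execution can resume at m->ip
1 | MCG_STATUS_EIPV | Error IP valid: m->ip points at the faulting instruction
2 | MCG_STATUS_MCIP | Machine Check In Progress: software must clear it once handling is done
3 | MCG_STATUS_LMCES | LMCE Signaled: the MCE was delivered as a Local MCE

MCG_CAP feature bits (mce.h:12-21):

Bit | Name | Meaning
7:0 | MCG_BANKCNT_MASK | Number of MCA banks
8 | MCG_CTL_P | MCG_CTL is present
9 | MCG_EXT_P | Extended registers (MCG_EIP etc.) are present
10 | MCG_CMCI_P | CMCI supported
23:16 | MCG_EXT_CNT_MASK | Number of extended registers
24 | MCG_SER_P | Software Error Recovery supported (UCNA/SRAR)
26 | MCG_ELOG_P | Extended error log supported
27 | MCG_LMCE_P | Local MCE supported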

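1.3 MCI_STATUS bits (per-bank MCA status)

arch/x86/include/asm/mce.h:32-52

Bit | Name | Meaning | Severity
63 | MCI_STATUS_VAL | Register contents valid (check this first) |
62 | MCI_STATUS_OVER | An earlier error was dropped (overflow) |
61 | MCI_STATUS_UC | Uncorrected error | Uncorrected
60 | MCI_STATUS_EN | Error enabled (reporting configured) |
59 | MCI_STATUS_MISCV | MCi_MISC valid |
58 | MCI_STATUS_ADDRV | MCi_ADDR valid |
57 | MCI_STATUS_PCC | Processor Context Corrupt: the kernel cannot continue | Fatal
56 | MCI_STATUS_S | Signaled: the MCE was delivered through an exception/CMCI |
55 | MCI_STATUS_AR | Action Required: software must act (e.g. offline the page) | Recoverable w/ action
52:38 | MCI_STATUS_CEC_MASK | Corrected Error Count |
31:16 | MSCOD | Model-Specific Error Code |
15:0 | MCACOD | MCA Error Code (filter bit 12 is ignored) |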
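
To make the bit layout above concrete, here is a minimal user-space sketch that splits a raw 64-bit status value into the fields from the table. The function name decode_mci_status and the returned dictionary shape are illustrative assumptions, not kernel or rasdaemon APIs; only the bit positions come from the table.

```python
# Bit positions copied from the MCI_STATUS table above.
MCI_STATUS_FLAGS = {
    63: "VAL", 62: "OVER", 61: "UC", 60: "EN", 59: "MISCV",
    58: "ADDRV", 57: "PCC", 56: "S", 55: "AR",
}


def decode_mci_status(status: int) -> dict:
    """Split a raw 64-bit MCi_STATUS value into named fields (hypothetical helper)."""
    flags = [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "cec": (status >> 38) & 0x7FFF,    # bits 52:38, corrected error count
        "mscod": (status >> 16) & 0xFFFF,  # bits 31:16, model-specific code
        "mcacod": status & 0xEFFF,         # bits 15:0 with filter bit 12 masked out
    }
```

Masking with 0xEFFF rather than 0xFFFF mirrors the MCACOD mask, so a code with the filter bit set compares equal to the same code without it.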

AMD-specific bits (mce.h:48-52):

Bit | Name | Meaning
55 | MCI_STATUS_TCC | Task Context Corrupt (MCA extension)
53 | MCI_STATUS_SYNDV | MCi_SYND valid
44 | MCI_STATUS_DEFERRED | Uncorrected but deferred (no exception; logged via APIC)
43 | MCI_STATUS_POISON | Access to poisoned data (data poisoning / SUCCOR)
40 | MCI_STATUS_SCRUB | Found during a patrol scrub

MCACOD macros (mce.h:73-80):

Name | Value | Meaning
MCACOD | 0xefff | MCACOD field mask
MCACOD_SCRUB | 0x00C0 | Memory scrubbing error (codes 0xC0…0xCF)
MCACOD_SCRUBMSK | 0xeff0 | Scrub mask (skips bit 12)
MCACOD_L3WB | 0x017A | L3 explicit writeback error
MCACOD_DATA | 0x0134 | Data Load (SRAR)
MCACOD_INSTR | 0x0150 | Instruction Fetch (SRAR)

1.4 MCI_MISC bits

mce.h:82-92

Macro | Value | Meaning
MCI_MISC_ADDR_LSB(m) | m & 0x3f | Address LSB (least significant valid address bit)
MCI_MISC_ADDR_MODE(m) | (m>>6) & 7 | Address mode
MCI_MISC_ADDR_SEGOFF | 0 | Segment offset
MCI_MISC_ADDR_LINEAR | 1 | Linear
MCI_MISC_ADDR_PHYS | 2 | Physical (the mode mce_usable_address requires)
MCI_MISC_ADDR_MEM | 3 | Memory
MCI_MISC_ADDR_GENERIC | 7 | Generic
MCI_ADDR_PHYSADDR | GENMASK_ULL(phys_bits-1, 0) | Physical address mask
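
The two MCI_MISC accessors above are plain shifts and masks, so a short sketch may help when decoding a misc value by hand. decode_mci_misc and align_to_lsb are hypothetical helper names; the alignment step (clearing the address bits below the reported LSB) is shown as an assumption about how a consumer would use the field, not as a quotation of kernel code.

```python
# Address-mode encodings copied from the MCI_MISC table above.
ADDR_MODES = {0: "SEGOFF", 1: "LINEAR", 2: "PHYS", 3: "MEM", 7: "GENERIC"}


def decode_mci_misc(misc: int) -> tuple:
    """Return (address LSB, address mode name) from a raw MCi_MISC value."""
    lsb = misc & 0x3F            # MCI_MISC_ADDR_LSB(m)
    mode = (misc >> 6) & 7       # MCI_MISC_ADDR_MODE(m)
    return lsb, ADDR_MODES.get(mode, "reserved")


def align_to_lsb(addr: int, lsb: int) -> int:
    """Clear the address bits below the reported LSB (assumed consumer-side step)."""
    return (addr >> lsb) << lsb
```

For example, a misc value of 0x8C decodes to LSB 12 in PHYS mode, meaning the address is only meaningful at 4 KiB granularity.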

MCI_CTL2 (mce.h:94-96):

Bit | Name | Meaning
30 | MCI_CTL2_CMCI_EN | CMCI interrupt enable
14:0 | MCI_CTL2_CMCI_THRESHOLD_MASK | CMCI threshold

1.5 Severity levels: enum severity_level

arch/x86/kernel/cpu/mce/internal.h:11-21

Value | Constant | Meaning
0 | MCE_NO_SEVERITY | Invalid or not enabled (!VAL or !EN): dropped
1 | MCE_DEFERRED_SEVERITY (= MCE_UCNA_SEVERITY) | Deferred error or UCNA: kept
2 | MCE_KEEP_SEVERITY | CE or non-fatal UC: kept
3 | MCE_SOME_SEVERITY | Action optional, cause unknown
4 | MCE_AO_SEVERITY | Action Optional (scrub / L3WB)
5 | MCE_UC_SEVERITY | Uncorrected, no AR
6 | MCE_AR_SEVERITY | Action Required (the page must be offlined)
7 | MCE_PANIC_SEVERITY | Fatal: triggers a panic

1.6 Intel severity table (severity.c:38-218)

Core rules (evaluated in order; first match wins):

Severity | Trigger condition | Message
NO | !MCI_STATUS_VAL | "Invalid"
NO | !MCI_STATUS_EN (exception context) | "Not enabled"
PANIC | MCI_STATUS_PCC set | "Processor context corrupt"
PANIC | MCG_STATUS_MCIP clear | "MCIP not set in MCA handler"
PANIC | RIPV and EIPV both clear | "Neither restart nor error IP"
PANIC | RIPV clear in KERNEL/KERNEL_RECOV | "In kernel and no restart IP"
KEEP | UC clear | "Corrected error"
AO | MCI_UC_AR … MCACOD_SCRUBMSK == UC … |
AO | MCI_UC_AR … MCACOD == UC … |
AO | SKX stepping ≥ 4, banks 13-18, ADDR … 0x001000c0 |
UCNA | MCI_UC_SAR == UC (S, AR clear) | "Uncorrected no action required"
PANIC | OVER+UC+AR+!S | "Illegal combination (UCNA with AR=1)"
KEEP | S clear in SER mode | "Non signaled machine check"
PANIC | OVER+UC+S+AR | "Action required with lost events"
KEEP | UC+S+AR+ADDRV, RIPV | "Action required but unaffected thread is continuable"
AR | UC+S+AR+ADDRV+MCACOD_DATA, KERNEL_RECOV | "Data load in error recoverable area of kernel"
AR | UC+S+AR+ADDRV+MCACOD_DATA, USER | "Data load error in a user process"
AR | UC+S+AR+ADDRV+MCACOD_INSTR, USER | "Instruction fetch error in a user process"
PANIC | UC+S+AR+ADDRV+MCACOD_DATA, KERNEL | "Data load in unrecoverable area of kernel"
PANIC | UC+S+AR+ADDRV+MCACOD_INSTR, KERNEL | "Instruction fetch error in kernel"
PANIC | UC+S+AR, no MCACOD | "Action required: unknown MCACOD"
SOME | UC+S, no AR | "Action optional: unknown MCACOD"
SOME | OVER+UC+S | "Action optional with lost events"
PANIC | OVER+UC | "Overflowed uncorrected"
PANIC | UC, KERNEL | "Uncorrected in kernel"
UC | UC, no AR | "Uncorrected"
SOME | catch-all | "No match"

Composite macros (severity.c:67-70):

  • MCI_UC_S = MCI_STATUS_UC | MCI_STATUS_S
  • MCI_UC_AR = MCI_STATUS_UC | MCI_STATUS_AR
  • MCI_UC_SAR = MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR
  • MCI_ADDR = MCI_STATUS_ADDRV | MCI_STATUS_MISCV

1.7 AMD severity (severity_amd(), severity.c:310-368)

Check | Severity | Message
MCI_STATUS_PCC | MCE_PANIC_SEVERITY | "Processor Context Corrupt"
MCI_STATUS_DEFERRED | MCE_DEFERRED_SEVERITY |
!MCI_STATUS_UC | MCE_KEEP_SEVERITY | "Corrected or deferred-by-UC=0"
OVER && !overflow_recov | MCE_PANIC_SEVERITY | "Overflowed uncorrected without MCA Overflow Recovery"
!succor | MCE_PANIC_SEVERITY | "Uncorrected without MCA Recovery"
error_context == IN_KERNEL | MCE_PANIC_SEVERITY | "Uncorrected unrecoverable in kernel context"
Default | MCE_AR_SEVERITY |
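
The AMD checks above form a short decision ladder, which the following sketch walks in table order. It is a simplified illustration under stated assumptions: severity_amd_sketch, its keyword arguments, and the string return values are hypothetical, and the context argument stands in for the kernel's error_context() result.

```python
MCI_STATUS_OVER = 1 << 62
MCI_STATUS_UC = 1 << 61
MCI_STATUS_PCC = 1 << 57
MCI_STATUS_DEFERRED = 1 << 44


def severity_amd_sketch(status: int, *, overflow_recov: bool,
                        succor: bool, context: str) -> str:
    """Walk the AMD checks in the order the table lists them."""
    if status & MCI_STATUS_PCC:
        return "PANIC"      # processor context corrupt
    if status & MCI_STATUS_DEFERRED:
        return "DEFERRED"
    if not (status & MCI_STATUS_UC):
        return "KEEP"       # corrected error
    if (status & MCI_STATUS_OVER) and not overflow_recov:
        return "PANIC"      # overflow without MCA Overflow Recovery
    if not succor:
        return "PANIC"      # uncorrected without MCA Recovery
    if context == "IN_KERNEL":
        return "PANIC"      # unrecoverable in kernel context
    return "AR"             # default: action required
```

Order matters: a deferred error is classified before the UC test, so it never reaches the panic branches.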

error_context() (severity.c:274-307):

  • IN_USER: (cs & 3) == 3
  • IN_KERNEL: no RIPV
  • IN_KERNEL_RECOV: uaccess/copy with an extable fixup; sets MCE_IN_KERNEL_RECOV / MCE_IN_KERNEL_COPYIN

1.8 Common functions

Function | Location | Behavior
mce_usable_address() | core.c:462-482 | True when ADDRV && MISCV && LSB ≤ PAGE_SHIFT && ADDR_MODE == PHYS
mce_is_correctable() | core.c:525-537 | True when UC is clear and the error is not DEFERRED
mce_is_memory_error() | core.c:485-514 | AMD: SMCA_UMC/UMC_V2 with XEC=0, or K8 bank 4 with XEC=0x8. Intel: (status & 0xef80) == BIT(7) …

1.9 The 33 AMD SMCA bank types

arch/x86/include/asm/mce.h:299-334 (enum smca_bank_types):

Index | Name | Short name
0 | SMCA_LS | load_store
1 | SMCA_LS_V2 | load_store
2 | SMCA_IF | insn_fetch
3 | SMCA_L2_CACHE | l2_cache
4 | SMCA_DE | decode_unit
5 | SMCA_RESERVED | reserved
6 | SMCA_EX | execution_unit
7 | SMCA_FP | floating_point
8 | SMCA_L3_CACHE | l3_cache
9 | SMCA_CS | coherent_slave
10 | SMCA_CS_V2 | coherent_slave
11 | SMCA_PIE | pie
12 | SMCA_UMC | umc
13 | SMCA_UMC_V2 | umc_v2
14 | SMCA_PB | param_block
15 | SMCA_PSP | psp
16 | SMCA_PSP_V2 | psp
17 | SMCA_SMU | smu
18 | SMCA_SMU_V2 | smu
19 | SMCA_MP5 | mp5
20 | SMCA_MPDMA | mpdma
21 | SMCA_NBIO | nbio
22 | SMCA_PCIE | pcie
23 | SMCA_PCIE_V2 | pcie
24 | SMCA_XGMI_PCS | xgmi_pcs
25 | SMCA_NBIF | nbif
26 | SMCA_SHUB | shub
27 | SMCA_SATA | sata
28 | SMCA_USB | usb
29 | SMCA_GMI_PCS | gmi_pcs
30 | SMCA_XGMI_PHY | xgmi_phy
31 | SMCA_WAFL_PHY | wafl_phy
32 | SMCA_GMI_PHY | gmi_phy
33 | N_SMCA_BANK_TYPES | (sentinel)

HWID/MCATYPE mapping (amd.c:160-219):

Bank | HWID | MCATYPE
SMCA_LS | 0xB0 | 0x0
SMCA_LS_V2 | 0xB0 | 0x10
SMCA_IF | 0xB0 | 0x1
SMCA_L2_CACHE | 0xB0 | 0x2
SMCA_DE | 0xB0 | 0x3
SMCA_EX | 0xB0 | 0x5
SMCA_FP | 0xB0 | 0x6
SMCA_CS | 0x2E | 0x0
SMCA_PIE | 0x2E | 0x1
SMCA_CS_V2 | 0x2E | 0x2
SMCA_UMC | 0x96 | 0x0
SMCA_UMC_V2 | 0x96 | 0x1
SMCA_PB | 0x05 | 0x0
SMCA_PSP | 0xFF | 0x0
SMCA_PSP_V2 | 0xFF | 0x1
SMCA_SMU | 0x01 | 0x0
SMCA_SMU_V2 | 0x01 | 0x1
SMCA_MP5 | 0x01 | 0x2
SMCA_MPDMA | 0x01 | 0x3
SMCA_NBIO | 0x18 | 0x0
SMCA_PCIE | 0x46 | 0x0
SMCA_PCIE_V2 | 0x46 | 0x1
SMCA_XGMI_PCS | 0x50 | 0x0
SMCA_NBIF | 0x6C | 0x0
SMCA_SHUB | 0x80 | 0x0
SMCA_SATA | 0xA8 | 0x0
SMCA_USB | 0xAA | 0x0
SMCA_GMI_PCS | 0x241 | 0x0
SMCA_XGMI_PHY | 0x259 | 0x0
SMCA_WAFL_PHY | 0x267 | 0x0
SMCA_GMI_PHY | 0x269 | 0x0

Macro: HWID_MCATYPE(hwid, mcatype) = ((hwid) << 16) | (mcatype) (amd.c:74)

  • MCI_IPID_MCATYPE = 0xFFFF0000
  • MCI_IPID_HWID = 0xFFF
  • MCI_CONFIG_MCAX = 0x1 (indicates the MCA extension)
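
A rough sketch of how the HWID_MCATYPE key can be rebuilt from the ipid field of an mce_record follows. It assumes the two IPID masks apply to the upper 32 bits of the 64-bit MCA_IPID value (HWID in bits 43:32, MCATYPE in bits 63:48); that placement, the helper names, and the abbreviated lookup table are assumptions for illustration, with the key values taken from the mapping table above.

```python
MCI_IPID_MCATYPE = 0xFFFF0000
MCI_IPID_HWID = 0xFFF


def hwid_mcatype(hwid: int, mcatype: int) -> int:
    """Same packing as the HWID_MCATYPE(hwid, mcatype) macro."""
    return (hwid << 16) | mcatype


# A handful of rows from the mapping table; a real decoder would list all of them.
SMCA_KEYS = {
    hwid_mcatype(0xB0, 0x0): "SMCA_LS",
    hwid_mcatype(0x2E, 0x1): "SMCA_PIE",
    hwid_mcatype(0x96, 0x0): "SMCA_UMC",
    hwid_mcatype(0x96, 0x1): "SMCA_UMC_V2",
    hwid_mcatype(0x46, 0x0): "SMCA_PCIE",
}


def bank_type_from_ipid(ipid: int) -> str:
    """Look up the bank type from a 64-bit IPID (assumes fields sit in the high word)."""
    high = ipid >> 32
    key = hwid_mcatype(high & MCI_IPID_HWID, (high & MCI_IPID_MCATYPE) >> 16)
    return SMCA_KEYS.get(key, "unknown")
```

Keying on the packed value keeps the lookup a single dictionary access, the same shape as the kernel's table of HWID_MCATYPE entries.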

1.10 AMD threshold / deferred-error bits

amd.c:32-56

Constant | Value | Meaning
THRESHOLD_MAX | 0xFFF | 12-bit maximum count
NR_BLOCKS | 5 | Maximum MISC blocks per bank
INT_TYPE_APIC | 0x00020000 | Interrupt type = APIC LVT
MASK_VALID_HI | 0x80000000 | MCi_MISC high word: block valid
MASK_CNTP_HI | 0x40000000 | Counter present
MASK_LOCKED_HI | 0x20000000 | Threshold locked
MASK_LVTOFF_HI | 0x00F00000 | LVT offset (bits 23:20)
MASK_COUNT_EN_HI | 0x00080000 | Counter enable
MASK_INT_TYPE_HI | 0x00060000 | Interrupt type field
MASK_OVERFLOW_HI | 0x00010000 | Overflow bit
MASK_ERR_COUNT_HI | 0x00000FFF | Error count (bits 11:0)
MASK_BLKPTR_LO | 0xFF000000 | Block pointer in the low word
MCG_XBLK_ADDR | 0xC0000400 | Extended block base address
MSR_CU_DEF_ERR | 0xC0000410 | Deferred-error configuration MSR
MASK_DEF_LVTOFF | 0x000000F0 | Deferred-error LVT offset
MASK_DEF_INT_TYPE | 0x00000006 | Deferred-error interrupt type
DEF_LVT_OFF | 0x2 | Default LVT offset
DEF_INT_TYPE_APIC | 0x2 | Default: APIC
SMCA_THR_LVT_OFF | 0xF000 | SMCA threshold LVT offset

K8/Family 0x15 bank 4 sub-blocks (amd.c:357-374):

  • 0x00000413 → "dram"
  • 0xc0000408 → "ht_links"
  • 0xc0000409 → "l3_cache"

1.11 AMD SMCA MSR addresses

arch/x86/include/asm/mce.h:113-132

MSR | Address
MSR_AMD64_SMCA_MC0_CTL | 0xc0002000
MSR_AMD64_SMCA_MC0_STATUS | 0xc0002001
MSR_AMD64_SMCA_MC0_ADDR | 0xc0002002
MSR_AMD64_SMCA_MC0_MISC0 | 0xc0002003
MSR_AMD64_SMCA_MC0_CONFIG | 0xc0002004
MSR_AMD64_SMCA_MC0_IPID | 0xc0002005
MSR_AMD64_SMCA_MC0_SYND | 0xc0002006
MSR_AMD64_SMCA_MC0_DESTAT | 0xc0002008 (deferred-error status)
MSR_AMD64_SMCA_MC0_DEADDR | 0xc0002009 (deferred-error address)
MSR_AMD64_SMCA_MC0_MISC1 | 0xc000200a
MSR_AMD64_SMCA_MCx_* | MC0 address + 0x10 * x

MCA_CONFIG bits used (amd.c:274-334):

  • bit 32 (high & 0x1) = McaX: enables the SMCA register layout
  • bit 5 (low & 0x20) = DeferredIntTypeSupported
  • bits 38:37 = DeferredIntType (0x1 = APIC)
  • bit 8 (low & 0x100) = McaLsbInStatusSupported

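1.12 CMCI (Corrected Machine Check Interrupt)

intel.c:66-79 + mce.h:30, 95-96

Constant | Value | Meaning
CMCI_THRESHOLD | 1 | Default threshold
CMCI_POLL_INTERVAL | 30*HZ | Normal polling interval
CMCI_STORM_INTERVAL | HZ | Storm interval
CMCI_STORM_THRESHOLD | 15 | Storm threshold (events per second)

CMCI storm states (intel.c:75-79):

Value | Name | Meaning
0 | CMCI_STORM_NONE | No storm
1 | CMCI_STORM_ACTIVE | Storm in progress; switched to polling
2 | CMCI_STORM_SUBSIDED | Storm over; about to return to interrupt mode

MCG_CMCI_P = BIT_ULL(10): the CMCI-supported bit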

1.13 mce-inject operation types

inject.c:47-61 (enum injection_type):

Value | Constant | Trigger
0 | SW_INJ | mce_log(&i_mce): decode only, safe
1 | HW_INJ | int $18 raises a #MC exception: a real MCE, panics when PCC is set
2 | DFR_INT_INJ | int $DEFERRED_ERROR_VECTOR: AMD deferred-error APIC interrupt
3 | THR_INT_INJ | int $THRESHOLD_APIC_VECTOR: AMD threshold APIC interrupt
4 | N_INJ_TYPES | sentinel

Injection context flags (mce.h:98-105):

  • MCJ_CTX_MASK = 3, MCJ_CTX(flags) = (flags) & 3
  • MCJ_CTX_RANDOM = 0 (default)
  • MCJ_CTX_PROCESS = 1 (process context)
  • MCJ_CTX_IRQ = 2 (IRQ context)
  • MCJ_NMI_BROADCAST = 0x4
  • MCJ_EXCEPTION = 0x8
  • MCJ_IRQ_BROADCAST = 0x10

Debugfs files (inject.c:696-710): status, misc, addr, synd, ipid, bank, flags, cpu, README

1.14 mce.kflags (kernel-only)

mce.h:137-158

Bit | Name | Meaning
0 | MCE_HANDLED_CEC | Handled by the CEC
1 | MCE_HANDLED_UC | Handled by the UC handler
2 | MCE_HANDLED_EXTLOG | Handled by Extended Log
3 | MCE_HANDLED_NFIT | Handled by NFIT
4 | MCE_HANDLED_EDAC | Handled by EDAC
5 | MCE_HANDLED_MCELOG | Handled by /dev/mcelog
6 | MCE_IN_KERNEL_RECOV | Recoverable in the kernel
7 | MCE_IN_KERNEL_COPYIN | Occurred during copy_from_user

1.15 mce_notifier_prios

mce.h:176-186 (notifier chain priorities)

Priority | Constant | Purpose
0 | MCE_PRIO_LOWEST | Default, print-only
1 | MCE_PRIO_MCELOG | Legacy /dev/mcelog device
2 | MCE_PRIO_EDAC | EDAC subsystem
3 | MCE_PRIO_NFIT | NVDIMM Firmware Interface
4 | MCE_PRIO_EXTLOG | Extended error log
5 | MCE_PRIO_UC | UC handler (page offlining)
6 | MCE_PRIO_EARLY | Earliest consumer (emits trace_mce_record)
7 | MCE_PRIO_CEC = MCE_PRIO_HIGHEST | Corrected error collector

1.16 mcp_flags (polling flags)

mce.h:256-262

Flag | Meaning
MCP_TIMESTAMP | Stamp the TSC when reading the bank
MCP_UC | Log UC errors
MCP_DONTLOG | Clear without logging
MCP_QUEUE_LOG | Only queue into the genpool (used at boot)

1.17 Vendor flags / quirks

internal.h:139-179

Bit | Name | Meaning
0 | overflow_recov | MCA overflow recovery (always present on F15h 00h-0fh)
1 | succor | AMD S/W UnCorrectable COntainment & Recovery
2 | smca | AMD Scalable MCA
3 | zen_ifu_quirk | Zen IFU poison-consumption EIPV/RIPV bug
4 | amd_threshold | AMD-style threshold banks
5 | p5 | Pentium family-5
6 | winchip | Centaur Winchip C6
7 | snb_ifu_quirk | Sandy Bridge IFU EIPV/RIPV bug
8 | skx_repmov_quirk | Skylake/Cascade/Cooper Lake REP MOVS bug

1.18 CPU vendor enumeration

arch/x86/include/asm/processor.h: X86_VENDOR_INTEL=0, X86_VENDOR_AMD=2, X86_VENDOR_CENTAUR=5, X86_VENDOR_HYGON=9, X86_VENDOR_ZHAOXIN=10, X86_VENDOR_UNKNOWN

1.19 Kernel cmdline MCE options

core.c:2238-2290

Option | Effect
mce | Enable Pentium P5 MCE
mce=off | Disable all MCE handling
mce=no_cmci | Disable CMCI
mce=no_lmce | Disable LMCE
mce=dont_log_ce | Do not log CEs
mce=print_all | Print every MCE to the console
mce=ignore_ce | Disable CE polling and CMCI
mce=bootlog / nobootlog | Toggle logging of pre-boot MCEs
mce=bios_cmci_threshold | Do not program the CMCI threshold
mce=recovery | Force-enable copy_mc_fragile()
mce=TOLERANCELEVEL | Monarch timeout (microseconds)
nomce | Alias for mce=off

1.20 Other constants

Name | Value | Location
MAX_NR_BANKS | 64 | mce.h:227
MCE_LOG_MIN_LEN | 32 | mce.h:107
MCE_LOG_SIGNATURE | "MACHINECHECK" | mce.h:108
MCE_POOLSZ | 2*PAGE_SIZE (8 KiB) | genpool.c:22
MCE_OVERFLOW | 0 | mce.h:107
NBCFG | 0x44 | inject.c:45

/dev/mcelog ioctls (uapi/asm/mce.h:41-43):

  • MCE_GET_RECORD_LEN = _IOR('M', 1, int)
  • MCE_GET_LOG_LEN = _IOR('M', 2, int)
  • MCE_GETCLEAR_FLAGS = _IOR('M', 3, int)

CPUHP states (core.c:2767-2778):

  • CPUHP_X86_MCE_DEAD = "x86/mce:dead" → mce_cpu_dead()
  • CPUHP_AP_ONLINE_DYN = "x86/mce:online" → mce_cpu_online()

2. PCIe AER

2.1 aer_event trace event fields

include/ras/ras_event.h:269-339

TP_PROTO(const char *dev_name,
         const u32 status,
         const u8 severity,
         const u8 tlp_header_valid,
         struct aer_header_log_regs *tlp)

Field | Type | Description
dev_name | string | Device slot name ([domain:]bus:device.function)
status | u32 | Correctable or Uncorrectable status register value
severity | u8 | AER_* enum value
tlp_header_valid | u8 | Whether a TLP header was captured
tlp_header | u32[4] | The 4 DWORDs of the TLP header log

2.2 AER severity enumeration

include/linux/aer.h:14-17

Value | Name | Meaning
0 | AER_NONFATAL | Uncorrected, non-fatal
1 | AER_FATAL | Uncorrected, fatal
2 | AER_CORRECTABLE | Corrected
3 | DPC_FATAL | Fatal via Downstream Port Containment (driver-internal)

2.3 Correctable Error Status (PCI_ERR_COR_STATUS = 0x10)

include/uapi/linux/pci_regs.h:748-802

Bit | Mask | Name | Meaning
0 | 0x00000001 | PCI_ERR_COR_RCVR | Receiver Error (PHY)
6 | 0x00000040 | PCI_ERR_COR_BAD_TLP | Bad TLP
7 | 0x00000080 | PCI_ERR_COR_BAD_DLLP | Bad DLLP
8 | 0x00000100 | PCI_ERR_COR_REP_ROLL | REPLAY_NUM Rollover
12 | 0x00001000 | PCI_ERR_COR_REP_TIMER | Replay Timer Timeout
13 | 0x00002000 | PCI_ERR_COR_ADV_NFAT | Advisory Non-Fatal
14 | 0x00004000 | PCI_ERR_COR_INTERNAL | Corrected Internal
15 | 0x00008000 | PCI_ERR_COR_LOG_OVER | Header Log Overflow

AER_MAX_TYPEOF_COR_ERRS = 16
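
As an illustration of how the status field of an aer_event maps onto the names above, the sketch below expands a correctable status word into the set bits. decode_cor_status is a hypothetical helper; the masks are the ones listed in the table.

```python
# Masks copied from the Correctable Error Status table above.
PCI_ERR_COR_BITS = {
    0x00000001: "PCI_ERR_COR_RCVR",
    0x00000040: "PCI_ERR_COR_BAD_TLP",
    0x00000080: "PCI_ERR_COR_BAD_DLLP",
    0x00000100: "PCI_ERR_COR_REP_ROLL",
    0x00001000: "PCI_ERR_COR_REP_TIMER",
    0x00002000: "PCI_ERR_COR_ADV_NFAT",
    0x00004000: "PCI_ERR_COR_INTERNAL",
    0x00008000: "PCI_ERR_COR_LOG_OVER",
}


def decode_cor_status(status: int) -> list:
    """Return the names of all correctable-error bits set in status."""
    return [name for mask, name in PCI_ERR_COR_BITS.items() if status & mask]
```

The same pattern works for the uncorrectable register by swapping in the PCI_ERR_UNC_* masks.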

2.4 Uncorrectable Error Status (PCI_ERR_UNCOR_STATUS = 0x04)

Bit | Mask | Name | Meaning
0 | 0x00000001 | PCI_ERR_UNC_UND | Undefined
4 | 0x00000010 | PCI_ERR_UNC_DLP | Data Link Protocol Error
5 | 0x00000020 | PCI_ERR_UNC_SURPDN | Surprise Down
12 | 0x00001000 | PCI_ERR_UNC_POISON_TLP | Poisoned TLP
13 | 0x00002000 | PCI_ERR_UNC_FCP | Flow Control Protocol
14 | 0x00004000 | PCI_ERR_UNC_COMP_TIME | Completion Timeout
15 | 0x00008000 | PCI_ERR_UNC_COMP_ABORT | Completer Abort
16 | 0x00010000 | PCI_ERR_UNC_UNX_COMP | Unexpected Completion
17 | 0x00020000 | PCI_ERR_UNC_RX_OVER | Receiver Overflow
18 | 0x00040000 | PCI_ERR_UNC_MALF_TLP | Malformed TLP
19 | 0x00080000 | PCI_ERR_UNC_ECRC | ECRC Error
20 | 0x00100000 | PCI_ERR_UNC_UNSUP | Unsupported Request
21 | 0x00200000 | PCI_ERR_UNC_ACSV | ACS Violation
22 | 0x00400000 | PCI_ERR_UNC_INTN | Uncorrectable Internal
23 | 0x00800000 | PCI_ERR_UNC_MCBTLP | MC Blocked TLP
24 | 0x01000000 | PCI_ERR_UNC_ATOMEG | AtomicOp Egress Blocked
25 | 0x02000000 | PCI_ERR_UNC_TLPPRE | TLP Prefix Blocked

AER_MAX_TYPEOF_UNCOR_ERRS = 27

2.5 AER Capability register layout

Offset | Register
0x04 | PCI_ERR_UNCOR_STATUS
0x08 | PCI_ERR_UNCOR_MASK
0x0c | PCI_ERR_UNCOR_SEVER (bit set = Fatal)
0x10 | PCI_ERR_COR_STATUS
0x14 | PCI_ERR_COR_MASK
0x18 | PCI_ERR_CAP
0x1c | PCI_ERR_HEADER_LOG (16 bytes)
0x2c | PCI_ERR_ROOT_COMMAND
0x30 | PCI_ERR_ROOT_STATUS
0x34 | PCI_ERR_ROOT_ERR_SRC

PCI_ERR_CAP 字段

  • 0x1f PCI_ERR_CAP_FEP(x) — First Error Pointer(5-bit)
  • 0x20 PCI_ERR_CAP_ECRC_GENC — ECRC Generation Capable
  • 0x40 PCI_ERR_CAP_ECRC_GENE — ECRC Generation Enable
  • 0x80 PCI_ERR_CAP_ECRC_CHKC — ECRC Check Capable
  • 0x100 PCI_ERR_CAP_ECRC_CHKE — ECRC Check Enable

PCI_ERR_ROOT_COMMAND

  • 0x01 PCI_ERR_ROOT_CMD_COR_EN
  • 0x02 PCI_ERR_ROOT_CMD_NONFATAL_EN
  • 0x04 PCI_ERR_ROOT_CMD_FATAL_EN

PCI_ERR_ROOT_STATUS

  • 0x01 PCI_ERR_ROOT_COR_RCV
  • 0x02 PCI_ERR_ROOT_MULTI_COR_RCV
  • 0x04 PCI_ERR_ROOT_UNCOR_RCV
  • 0x08 PCI_ERR_ROOT_MULTI_UNCOR_RCV
  • 0x10 PCI_ERR_ROOT_FIRST_FATAL
  • 0x20 PCI_ERR_ROOT_NONFATAL_RCV
  • 0x40 PCI_ERR_ROOT_FATAL_RCV
  • 0xf8000000 PCI_ERR_ROOT_AER_IRQ

AER_ERR_STATUS_MASK = PCI_ERR_ROOT_UNCOR_RCV | PCI_ERR_ROOT_COR_RCV | PCI_ERR_ROOT_MULTI_COR_RCV | PCI_ERR_ROOT_MULTI_UNCOR_RCV (aer.c:104-107)

PCI_ERR_ROOT_ERR_SRC decoding (aer.c:101-102):

  • ERR_COR_ID(d) = d & 0xffff
  • ERR_UNCOR_ID(d) = d >> 16
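
The two macros above split the 32-bit Error Source Identification register into a pair of 16-bit IDs. The sketch below applies them and then formats each ID as bus:device.function, on the assumption that the ID uses the standard PCI Requester ID layout (bus in bits 15:8, device in 7:3, function in 2:0). format_requester_id is a hypothetical helper name.

```python
def err_cor_id(d: int) -> int:
    """ERR_COR_ID(d): source of the first correctable error."""
    return d & 0xFFFF


def err_uncor_id(d: int) -> int:
    """ERR_UNCOR_ID(d): source of the first uncorrectable error."""
    return d >> 16


def format_requester_id(rid: int) -> str:
    """Render a 16-bit ID as bus:dev.fn (assumes the Requester ID layout)."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"
```

With a register value of 0x03000118, the correctable source formats as 01:03.0 and the uncorrectable source as 03:00.0.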

2.6 AER recovery

State | Trigger
pci_channel_io_normal | Non-Fatal → pcie_do_recovery()
pci_channel_io_frozen | Fatal → pcie_do_recovery()

aer_recover_ring holds 16 entries (AER_RECOVER_RING_SIZE=16, aer.c:972) and is drained by aer_recover_work_func. aer_root_reset() chooses FLR (PCI_EXP_TYPE_RC_EC) or pci_bus_error_reset() (RP / Downstream).

2.7 Agent / Layer classification

aer.c:399-432

Agents (4): AER_AGENT_RECEIVER(0), _REQUESTER(1), _COMPLETER(2), _TRANSMITTER(3)
Layers (3): AER_PHYSICAL_LAYER_ERROR(0), _DATA_LINK_LAYER_ERROR(1), _TRANSACTION_LAYER_ERROR(2)

Macros: AER_GET_AGENT(), AER_GET_LAYER_ERROR()

2.8 AER_LOG_TLP_MASKS

aer.c:88-93

#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP | PCI_ERR_UNC_ECRC | \
                           PCI_ERR_UNC_UNSUP | PCI_ERR_UNC_COMP_ABORT | \
                           PCI_ERR_UNC_UNX_COMP | PCI_ERR_UNC_MALF_TLP)

2.9 AER_INJECT

drivers/pci/pcie/aer_inject.c

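struct aer_error_inj {
    u8 bus, dev, fn;
    u32 uncor_status, cor_status;
    /* four TLP header log DWORDs */
    u32 header_log0, header_log1, header_log2, header_log3;
    u32 domain;
};

The /dev/aer_inject character device (aer_inject.c:512-516) intercepts pci_ops, writes the root port's AER registers, and raises the root IRQ.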

2.10 CPER AER section

include/linux/cper.h:246-253, 509-542

struct cper_sec_pcie {
    u64 validation_bits;   // CPER_PCIE_VALID_AER_INFO = 0x80
    u8 aer_info[96];       // uncor_status, uncor_mask, uncor_severity,
                           // cor_status, cor_mask, cap_control,
                           // header_log[4], root_command, root_status
    // ... other fields
};

cper_severity_to_aer() mapping (aer.c:750-761):

  • CPER_SEV_RECOVERABLE → AER_NONFATAL
  • CPER_SEV_FATAL → AER_FATAL
  • anything else → AER_CORRECTABLE
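
The mapping above is small enough to restate as a lookup with a default. The sketch uses the numeric values this manual gives for the CPER and AER enums (sections 6.1 and 2.2); the Python function name simply mirrors the kernel function for readability and is not the kernel code itself.

```python
# CPER severities (section 6.1) and AER severities (section 2.2).
CPER_SEV_RECOVERABLE, CPER_SEV_FATAL = 0, 1
CPER_SEV_CORRECTED, CPER_SEV_INFORMATIONAL = 2, 3
AER_NONFATAL, AER_FATAL, AER_CORRECTABLE = 0, 1, 2


def cper_severity_to_aer(cper_severity: int) -> int:
    """Map a CPER severity onto an AER severity, defaulting to correctable."""
    return {
        CPER_SEV_RECOVERABLE: AER_NONFATAL,
        CPER_SEV_FATAL: AER_FATAL,
    }.get(cper_severity, AER_CORRECTABLE)
```

Note that CPER_SEV_INFORMATIONAL also lands on AER_CORRECTABLE, because only the first two severities have dedicated branches.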

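3. EDAC / Memory Controller

3.1 enum hw_event_mc_err_type

include/linux/edac.h:113-119

Value | Name | Meaning | System impact
0 | HW_EVENT_ERR_CORRECTED | Single- or multi-bit error corrected by ECC in hardware | Corrected; may trigger a page scrub
1 | HW_EVENT_ERR_UNCORRECTED | Not corrected by ECC, but not fatal | Page is poisoned; the process may be killed; data unusable
2 | HW_EVENT_ERR_DEFERRED | Data poisoned, handled non-urgently | Page is offlined preventively
3 | HW_EVENT_ERR_FATAL | Unrecovered uncorrectable error | panic / machine check
4 | HW_EVENT_ERR_INFO | Informational (per the CPER spec) | Diagnostics

3.2 enum dev_type (DRAM device width)

include/linux/edac.h:72-81

Value | Name | Meaning
0 | DEV_UNKNOWN | Unknown
1 | DEV_X1 | 1-bit data width
2 | DEV_X2 | 2-bit
3 | DEV_X4 | 4-bit (typical)
4 | DEV_X8 | 8-bit (typical)
5 | DEV_X16 | 16-bit
6 | DEV_X32 | 32-bit
7 | DEV_X64 | 64-bit

DEV_FLAG_X1 … DEV_FLAG_X64 are the derived bit flags (edac.h:83-90). x4/x8 devices typically use EDAC_S4ECD4ED / EDAC_S8ECD8ED to enable chipkill.

3.3 enum edac_mc_layer_type (hierarchy layers)

include/linux/edac.h:346-352

Value | Name | Meaning
0 | EDAC_MC_LAYER_BRANCH | FB-DIMM branch
1 | EDAC_MC_LAYER_CHANNEL | Memory channel
2 | EDAC_MC_LAYER_SLOT | DIMM slot
3 | EDAC_MC_LAYER_CHIP_SELECT | Rank (csrow)
4 | EDAC_MC_LAYER_ALL_MEM | All memory (firmware-driven)

Display names (edac_mc.c:794-800): "branch" / "channel" / "slot" / "csrow" / "memory"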

3.4 enum mem_type (DRAM type, for reference)

include/linux/edac.h:191-221

Value | Name | Type
0 | MEM_EMPTY |
1 | MEM_RESERVED |
2 | MEM_UNKNOWN |
3 | MEM_FPM | ~1995
4 | MEM_EDO | ~1998
5 | MEM_BEDO | Burst EDO
6 | MEM_SDR | SDR SDRAM
7 | MEM_RDR | Registered SDR
8 | MEM_DDR |
9 | MEM_RDDR |
10 | MEM_RMBS | Rambus
11 | MEM_DDR2 |
12 | MEM_FB_DDR2 | Fully-Buffered
13 | MEM_RDDR2 |
14 | MEM_XDR | Rambus XDR
15 | MEM_DDR3 |
16 | MEM_RDDR3 |
17 | MEM_LRDDR3 |
18 | MEM_LPDDR3 |
19 | MEM_DDR4 |
20 | MEM_RDDR4 |
21 | MEM_LRDDR4 |
22 | MEM_LPDDR4 |
23 | MEM_DDR5 |
24 | MEM_RDDR5 |
25 | MEM_LRDDR5 |
26 | MEM_NVDIMM |
27 | MEM_WIO2 |
28 | MEM_HBM2 |

3.5 enum edac_type (ECC capability)

include/linux/edac.h:265-276

Value | Name | Meaning
0 | EDAC_UNKNOWN |
1 | EDAC_NONE |
2 | EDAC_RESERVED |
3 | EDAC_PARITY |
4 | EDAC_EC | Detection only
5 | EDAC_SECDED | Single-error correction + double-error detection
6 | EDAC_S2ECD2ED |
7 | EDAC_S4ECD4ED | chipkill x4
8 | EDAC_S8ECD8ED | chipkill x8
9 | EDAC_S16ECD16ED | chipkill x16

3.6 enum scrub_type (scrub modes)

include/linux/edac.h:301-312: SCRUB_UNKNOWN, SCRUB_NONE, SCRUB_SW_PROG, SCRUB_SW_SRC, SCRUB_SW_PROG_SRC, SCRUB_SW_TUNABLE, SCRUB_HW_PROG, SCRUB_HW_SRC, SCRUB_HW_PROG_SRC, SCRUB_HW_TUNABLE

3.7 EDAC_OPSTATE states

include/linux/edac.h:26-29, 326-330

Value | Name | Meaning
-1 | EDAC_OPSTATE_INVAL | Invalid / uninitialized
0 | EDAC_OPSTATE_POLL | Polling
1 | EDAC_OPSTATE_NMI | NMI
2 | EDAC_OPSTATE_INT | Interrupt

Internal mci->op_state values (edac.h:603): 0x100 OP_ALLOC, 0x201 OP_RUNNING_POLL, 0x202 OP_RUNNING_INTERRUPT, 0x203 OP_RUNNING_POLL_INTR, 0x300 OP_OFFLINE

3.8 Per-platform driver constants

Platform | File | Key constants
Skylake | drivers/edac/skx_common.c | NUM_CHANNELS, NUM_DIMMS; MTR[12:13]=rank, MTR[2:4]=row+12, MTR[0:1]=col+10; label "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u"
Icelake/SPR | drivers/edac/i10nm_base.c | I10NM_REVISION "v0.0.6", I10NM_HBM_IMC_MMIO_SIZE=0x9000, I10NM_IS_HBM_PRESENT(reg)=bits[27:30], I10NM_IS_HBM_IMC(reg)=bit 29
SB/IVB/HSW/BDW | drivers/edac/sb_edac.c | NUM_CHANNELS=6, MAX_DIMMS=3, KNL_MAX_CHAS=38, KNL_MAX_CHANNELS=6, KNL_MAX_EDCS=8; chipset enum: SANDY_BRIDGE, IVY_BRIDGE, HASWELL, BROADWELL, KNIGHTS_LANDING
Apollo Lake / Denverton | drivers/edac/pnd2_edac.c | APL_NUM_CHANNELS=4, DNV_NUM_CHANNELS=2, DNV_MAX_DIMMS=2; PMI ports apl_dports[4]={0x18,0x10,0x11,0x19}, dnv_dports[2]={0x10,0x12}
E7520/i3100 | drivers/edac/e752x_edac.c | E752X_NR_CSROWS=8; chip enum: E7520=0, E7525=1, E7320=2, I3100=3; PCI device IDs: 0x3590/0x3591/0x359E/0x3593/0x3592/0x35B0/0x35B1; err regs: 0x40-0x84
AMD64 | drivers/edac/amd64_edac.c | DCTs: K8=1, F10h=2, F16h=1, F15h=dynamic; DCT_CFG_SEL F1x10C; DCSB_CS_ENABLE; F10_NB_ARRAY_DRAM (injection)
X-Gene | drivers/edac/xgene_edac.c | MCU_MAX_RANK=8, MCU_RANK_STRIDE=0x40; MCUGECR=0x0110 (MCU_GECR_DEMANDUCINTREN_MASK=BIT(0), etc.); MCUGESR=0x0114 (MCU_GESR_ADDRNOMATCH_ERR_MASK=BIT(7), etc.); MCUESRR0=0x0314 (MCU_ESRR_MULTUCERR_MASK=BIT(3), BACKUCERR_MASK=BIT(2), DEMANDUCERR_MASK=BIT(1), CERR_MASK=BIT(0))

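3.9 mc_event trace event fields

include/ras/ras_event.h:98-163

Field | Type | Meaning
error_type | u32 | HW_EVENT_ERR_* severity
msg | string | Human-readable error message
label | string | DIMM label
error_count | u16 | Number of errors
mc_index | u8 | MC index
top_layer | s8 | Top-layer index
middle_layer | s8 | Middle-layer index
lower_layer | s8 | Lower-layer index
address | long | Physical address (page << PAGE_SHIFT | offset)
grain_bits | u8 | log2(grain)
syndrome | long | ECC syndrome
driver_detail | string | Extra driver-specific information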

4. CXL (Compute Express Link)

4.1 The 8 CXL trace events (v6.6.0; cxl_memory_sparing does not exist yet)

# | Event | Location
1 | cxl_aer_uncorrectable_error | drivers/cxl/core/trace.h:51
2 | cxl_aer_correctable_error | drivers/cxl/core/trace.h:99
3 | cxl_overflow | drivers/cxl/core/trace.h:127
4 | cxl_generic_event | drivers/cxl/core/trace.h:225
5 | cxl_general_media | drivers/cxl/core/trace.h:315
6 | cxl_dram | drivers/cxl/core/trace.h:398
7 | cxl_memory_module | drivers/cxl/core/trace.h:547
8 | cxl_poison | drivers/cxl/core/trace.h:643

4.2 CXL AER Uncorrectable bits

drivers/cxl/core/trace.h:17-31 (note that the prefix is CXL_RAS_UC_*, not CXL_AER_UE_*)

Bit | Symbol | Meaning
0 | CXL_RAS_UC_CACHE_DATA_PARITY | Cache data parity
1 | CXL_RAS_UC_CACHE_ADDR_PARITY | Cache address parity
2 | CXL_RAS_UC_CACHE_BE_PARITY | Cache byte-enable parity
3 | CXL_RAS_UC_CACHE_DATA_ECC | Cache data ECC UC
4 | CXL_RAS_UC_MEM_DATA_PARITY | Memory data parity
5 | CXL_RAS_UC_MEM_ADDR_PARITY | Memory address parity
6 | CXL_RAS_UC_MEM_BE_PARITY | Memory byte-enable parity
7 | CXL_RAS_UC_MEM_DATA_ECC | Memory data ECC UC
8 | CXL_RAS_UC_REINIT_THRESH | REINIT threshold
9 | CXL_RAS_UC_RSVD_ENCODE | Unrecognized encoding received
10 | CXL_RAS_UC_POISON | Poison received from the peer
11 | CXL_RAS_UC_RECV_OVERFLOW | Receiver overflow
14 | CXL_RAS_UC_INTERNAL_ERR | Device internal error
15 | CXL_RAS_UC_IDE_TX_ERR | IDE transmit error
16 | CXL_RAS_UC_IDE_RX_ERR | IDE receive error

Mask: cxl.h:130 = GENMASK(16,14) | GENMASK(11,0); show_uc_errs() is at trace.h:33-49

4.3 CXL AER Correctable bits

trace.h:81-87 (CXL_RAS_CE_*)

Bit | Symbol | Meaning
0 | CXL_RAS_CE_CACHE_DATA_ECC | Cache ECC corrected
1 | CXL_RAS_CE_MEM_DATA_ECC | Memory ECC corrected
2 | CXL_RAS_CE_CRC_THRESH | CRC threshold
3 | CLX_RAS_CE_RETRY_THRESH (kernel misspelling: CLX instead of CXL) | Retry threshold
4 | CXL_RAS_CE_CACHE_POISON | Cache poison received
5 | CXL_RAS_CE_MEM_POISON | Memory poison received
6 | CXL_RAS_CE_PHYS_LAYER_ERR | PHY error

Mask: cxl.h:137 = GENMASK(6,0); show_ce_errs() is at trace.h:89-97

4.4 CXL Event Record Type UUIDs (mbox.c:843-861)

UUID | Type | Trace event | CXL spec
fbcd0a77-c260-417f-85a9-088b1621eba6 | General Media (GMER) | cxl_general_media | 3.0 §8.2.9.2.1.1
601dcbb3-9c06-4eab-b8af-4e9bfb5c9624 | DRAM (DER) | cxl_dram | 3.0 §8.2.9.2.1.2
fe927475-dd59-4339-a586-79bab113b774 | Memory Module (MMER) | cxl_memory_module | 3.0 §8.2.9.2.1.3
anything else | Generic | cxl_generic_event |

4.5 memory_event_type (GMER/DER type)

trace.h:278-285

Value | Symbol | Meaning
0x00 | CXL_GMER_MEM_EVT_TYPE_ECC_ERROR | ECC error
0x01 | CXL_GMER_MEM_EVT_TYPE_INV_ADDR | Invalid address
0x02 | CXL_GMER_MEM_EVT_TYPE_DATA_PATH_ERROR | Data path error

(Note: v6.6.0 has no memory_event_sub_type enum.)

4.6 transaction_type

trace.h:287-302

Value | Symbol | Meaning
0x00 | CXL_GMER_TRANS_UNKNOWN | Unknown
0x01 | CXL_GMER_TRANS_HOST_READ | Host read
0x02 | CXL_GMER_TRANS_HOST_WRITE | Host write
0x03 | CXL_GMER_TRANS_HOST_SCAN_MEDIA | Host scan-media
0x04 | CXL_GMER_TRANS_HOST_INJECT_POISON | Host inject-poison
0x05 | CXL_GMER_TRANS_INTERNAL_MEDIA_SCRUB | Internal media scrub
0x06 | CXL_GMER_TRANS_INTERNAL_MEDIA_MANAGEMENT | Internal media mgmt

4.7 cxl_event_log_type (severity)

cxlmem.h:620-626, cxl.h:161-170

Value | Name | Status register bit
0x00 | CXL_EVENT_TYPE_INFO | CXLDEV_EVENT_STATUS_INFO = BIT(0)
0x01 | CXL_EVENT_TYPE_WARN | CXLDEV_EVENT_STATUS_WARN = BIT(1)
0x02 | CXL_EVENT_TYPE_FAIL | CXLDEV_EVENT_STATUS_FAIL = BIT(2)
0x03 | CXL_EVENT_TYPE_FATAL | CXLDEV_EVENT_STATUS_FATAL = BIT(3)
0x04 | CXL_EVENT_TYPE_MAX | sentinel

CXLDEV_EVENT_STATUS_ALL = 0x0F (OR of the four status bits)

The cxl_event_thread() ISR (pci.c:619) reads the 32-bit status and, for each set bit, issues a Get Event Records mailbox command (mbox.c:1023), walking the logs in FATAL → INFO order.
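
The status-to-log walk described above can be sketched as a fixed-order scan of the four status bits. logs_to_drain is a hypothetical name and the function only returns log names; the real handler issues mailbox commands for each one.

```python
# Status bits in the order the handler is described as walking them.
CXLDEV_EVENT_STATUS = [
    ("FATAL", 1 << 3),
    ("FAIL", 1 << 2),
    ("WARN", 1 << 1),
    ("INFO", 1 << 0),
]
CXLDEV_EVENT_STATUS_ALL = 0x0F


def logs_to_drain(status: int) -> list:
    """List the event logs with pending records, most severe first."""
    status &= CXLDEV_EVENT_STATUS_ALL   # ignore bits outside the four logs
    return [name for name, bit in CXLDEV_EVENT_STATUS if status & bit]
```

Scanning most-severe-first means fatal records are fetched before informational ones when several logs are pending at once.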

4.8 DPA Flags

trace.h:255-263

Symbol | Value | Meaning
CXL_DPA_FLAGS_MASK | 0x3F | Low 6-bit mask
CXL_DPA_MASK | ~CXL_DPA_FLAGS_MASK | Upper bits
CXL_DPA_VOLATILE | BIT(0) | Volatile
CXL_DPA_NOT_REPAIRABLE | BIT(1) | Not repairable

4.9 Event Descriptor Flags

trace.h:269-276

Symbol | Value | Meaning
CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT | BIT(0) | (kernel spelling: UNCORECTABLE)
CXL_GMER_EVT_DESC_THRESHOLD_EVENT | BIT(1) | Threshold triggered
CXL_GMER_EVT_DESC_POISON_LIST_OVERFLOW | BIT(2) | Poison list overflow

4.10 Health Status Flags (DHI)

trace.h:487-494

Symbol | Value | Meaning
CXL_DHI_HS_MAINTENANCE_NEEDED | BIT(0) | Maintenance needed
CXL_DHI_HS_PERFORMANCE_DEGRADED | BIT(1) | Performance degraded
CXL_DHI_HS_HW_REPLACEMENT_NEEDED | BIT(2) | Hardware replacement needed

(v6.6.0 does not define MEM_CAPACITY_DEGRADED.)

4.11 Media Status Enum (DHI)

trace.h:496-527

Value | Symbol | Meaning
0x00 | CXL_DHI_MS_NORMAL | Normal
0x01 | CXL_DHI_MS_NOT_READY | Not ready
0x02 | CXL_DHI_MS_WRITE_PERSISTENCY_LOST | Write persistency lost
0x03 | CXL_DHI_MS_ALL_DATA_LOST | All data lost
0x04 | CXL_DHI_MS_WRITE_PERSISTENCY_LOSS_EVENT_POWER_LOSS | Persistency lost on power loss
0x05 | CXL_DHI_MS_WRITE_PERSISTENCY_LOSS_EVENT_SHUTDOWN | Persistency lost on shutdown
0x06 | CXL_DHI_MS_WRITE_PERSISTENCY_LOSS_IMMINENT | Persistency loss imminent
0x07 | CXL_DHI_MS_WRITE_ALL_DATA_LOSS_EVENT_POWER_LOSS | All data lost on power loss
0x08 | CXL_DHI_MS_WRITE_ALL_DATA_LOSS_EVENT_SHUTDOWN | All data lost on shutdown
0x09 | CXL_DHI_MS_WRITE_ALL_DATA_LOSS_IMMINENT | All-data loss imminent

4.12 DHI Add-Status bit fields

trace.h:529-545

Symbol | Value | Meaning
CXL_DHI_AS_NORMAL | 0x0 | Normal
CXL_DHI_AS_WARNING | 0x1 | Warning
CXL_DHI_AS_CRITICAL | 0x2 | Critical
CXL_DHI_AS_LIFE_USED(as) | as & 0x3 | Life used, 2 bits
CXL_DHI_AS_DEV_TEMP(as) | (as & 0xC) >> 2 | Temperature, 2 bits
CXL_DHI_AS_COR_VOL_ERR_CNT(as) | (as & 0x10) >> 4 | Corrected volatile errors, 1 bit
CXL_DHI_AS_COR_PER_ERR_CNT(as) | (as & 0x20) >> 5 | Corrected persistent errors, 1 bit
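
The add-status byte packs four small fields, so a worked decode may be useful. The sketch applies the four accessor expressions from the table and names the 2-bit levels with the Normal/Warning/Critical encodings; decode_dhi_add_status and its dictionary keys are illustrative assumptions.

```python
# Level encodings from the table (CXL_DHI_AS_NORMAL / WARNING / CRITICAL).
LEVELS = {0x0: "normal", 0x1: "warning", 0x2: "critical"}


def decode_dhi_add_status(add_status: int) -> dict:
    """Split the DHI additional-status byte into its four fields."""
    return {
        "life_used": LEVELS.get(add_status & 0x3, "reserved"),
        "dev_temp": LEVELS.get((add_status & 0xC) >> 2, "reserved"),
        "cor_vol_err_cnt": LEVELS[(add_status & 0x10) >> 4],  # 1 bit: normal/warning
        "cor_per_err_cnt": LEVELS[(add_status & 0x20) >> 5],  # 1 bit: normal/warning
    }
```

A value of 0x26, for instance, decodes to critical life-used, warning temperature, normal volatile-error count, and warning persistent-error count.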

4.13 Memory Module Event Types

trace.h:467-480

Value | Symbol | Meaning
0x00 | CXL_MMER_HEALTH_STATUS_CHANGE | Health changed
0x01 | CXL_MMER_MEDIA_STATUS_CHANGE | Media status changed
0x02 | CXL_MMER_LIFE_USED_CHANGE | Life-used threshold crossed
0x03 | CXL_MMER_TEMP_CHANGE | Temperature threshold crossed
0x04 | CXL_MMER_DATA_PATH_ERROR | Data path error
0x05 | CXL_MMER_LSA_ERROR | Label Storage Area error

Sparing flags (absent in v6.6.0, added later):

  • HARD_SPARING, QUERY_RESOURCES, DEVICE_INITIATED: only in newer kernels

4.14 Poison List Source Enum

cxlmem.h:778-782

Value | Symbol | Meaning
0 | CXL_POISON_SOURCE_UNKNOWN | Unknown
1 | CXL_POISON_SOURCE_EXTERNAL | External (peer)
2 | CXL_POISON_SOURCE_INTERNAL | Internal
3 | CXL_POISON_SOURCE_INJECTED | Injected (host)
7 | CXL_POISON_SOURCE_VENDOR | Vendor (4-6 reserved)

Storage: the source occupies the low 3 bits of the 64-bit poison address. cxlmem.h:763-764:

  • CXL_POISON_START_MASK = GENMASK_ULL(63, 6)
  • CXL_POISON_SOURCE_MASK = GENMASK(2, 0)

Length unit: CXL_POISON_LEN_MULT = 64 bytes. Flags at cxlmem.h:773-775, plus the list limit:

  • CXL_POISON_FLAG_MORE = BIT(0)
  • CXL_POISON_FLAG_OVERFLOW = BIT(1)
  • CXL_POISON_FLAG_SCANNING = BIT(2)
  • CXL_POISON_LIST_MAX = 1024
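
Because the poison source shares a 64-bit word with the start address, decoding a poison record takes two masks and a multiply. The sketch below is a minimal illustration: decode_poison_record and its argument names are hypothetical, and the length argument is assumed to be the raw record length counted in CXL_POISON_LEN_MULT units.

```python
CXL_POISON_START_MASK = ((1 << 64) - 1) & ~0x3F   # GENMASK_ULL(63, 6)
CXL_POISON_SOURCE_MASK = 0x7                      # GENMASK(2, 0)
CXL_POISON_LEN_MULT = 64

POISON_SOURCES = {0: "Unknown", 1: "External", 2: "Internal",
                  3: "Injected", 7: "Vendor"}


def decode_poison_record(address: int, length: int) -> dict:
    """Separate start DPA, source, and byte length of one poison record."""
    return {
        "dpa": address & CXL_POISON_START_MASK,
        "source": POISON_SOURCES.get(address & CXL_POISON_SOURCE_MASK, "Reserved"),
        "bytes": length * CXL_POISON_LEN_MULT,
    }
```

Bits 5:3 of the address word fall outside both masks, so they are dropped from the start address and never interpreted as a source.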

4.15 Common Event Record Flags (header)

trace.h:165-174

Symbol | Value | Meaning
CXL_EVENT_RECORD_FLAG_PERMANENT | BIT(2) | Permanent
CXL_EVENT_RECORD_FLAG_MAINT_NEEDED | BIT(3) | Maintenance needed
CXL_EVENT_RECORD_FLAG_PERF_DEGRADED | BIT(4) | Performance degraded
CXL_EVENT_RECORD_FLAG_HW_REPLACE | BIT(5) | Hardware replacement

4.16 CXLMDEV Memory Device Status

cxlmem.h:11-32

Symbol | Value | Meaning
CXLMDEV_STATUS_OFFSET | 0x0 |
CXLMDEV_DEV_FATAL | BIT(0) | Device fatal
CXLMDEV_FW_HALT | BIT(1) | Firmware halted
CXLMDEV_STATUS_MEDIA_STATUS_MASK | GENMASK(3,2) |
CXLMDEV_MS_NOT_READY | 0 |
CXLMDEV_MS_READY | 1 |
CXLMDEV_MS_ERROR | 2 |
CXLMDEV_MS_DISABLED | 3 |
CXLMDEV_MBOX_IF_READY | BIT(4) |
CXLMDEV_RESET_NEEDED_MASK | GENMASK(7,5) |
CXLMDEV_RESET_NEEDED_NOT | 0 |
CXLMDEV_RESET_NEEDED_COLD | 1 |
CXLMDEV_RESET_NEEDED_WARM | 2 |
CXLMDEV_RESET_NEEDED_HOT | 3 |
CXLMDEV_RESET_NEEDED_CXL | 4 |

CXLMDEV_DEV_FATAL / CXLMDEV_FW_HALT trigger cxl_err() (pci.c:77-80) → dev_err_ratelimited
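
The status register above mixes single-bit flags with two multi-bit fields. A small decoder, written under the assumption that the caller already holds the raw register value, could look like this; decode_memdev_status and the dictionary keys are hypothetical names.

```python
MEDIA_STATUS = {0: "NOT_READY", 1: "READY", 2: "ERROR", 3: "DISABLED"}
RESET_NEEDED = {0: "NOT", 1: "COLD", 2: "WARM", 3: "HOT", 4: "CXL"}


def decode_memdev_status(status: int) -> dict:
    """Decode the flags and fields of the memory device status register."""
    return {
        "dev_fatal": bool(status & (1 << 0)),          # CXLMDEV_DEV_FATAL
        "fw_halt": bool(status & (1 << 1)),            # CXLMDEV_FW_HALT
        "media_status": MEDIA_STATUS[(status >> 2) & 0x3],   # GENMASK(3,2)
        "mbox_if_ready": bool(status & (1 << 4)),      # CXLMDEV_MBOX_IF_READY
        "reset_needed": RESET_NEEDED.get((status >> 5) & 0x7, "reserved"),  # GENMASK(7,5)
    }
```

Either of the first two flags being set corresponds to the condition the driver reports through cxl_err().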

4.17 Mailbox Return Codes (cxlmem.h:142-174)

CMD_CMD_RC_TABLE defines the 31 CXL_MBOX_CMD_RC_* codes:

SUCCESS, BACKGROUND, INPUT, UNSUPPORTED, INTERNAL, RETRY, BUSY, MEDIADISABLED, FWINPROGRESS, FWOOO, FWAUTH, FWSLOT, FWROLLBACK, FWRESET, HANDLE, PADDR(-EFAULT), POISONLMT, MEDIAFAILURE, ABORT, SECURITY, PASSPHRASE, MBUNSUPPORTED, PAYLOADLEN, LOG, INTERRUPTED, FEATUREVERSION, FEATURESELVALUE, FEATURETRANSFERIP, FEATURETRANSFEROOO, RESOURCEEXHAUSTED, EXTLIST

4.18 Mailbox Opcodes for Event (cxlmem.h:493-531)

Opcode | Symbol | Spec
0x0100 | CXL_MBOX_OP_GET_EVENT_RECORD | 3.0 §8.2.9.2.2
0x0101 | CXL_MBOX_OP_CLEAR_EVENT_RECORD | 3.0 §8.2.9.2.3
0x0102 | CXL_MBOX_OP_GET_EVT_INT_POLICY | 3.0 §8.2.9.2.4
0x0103 | CXL_MBOX_OP_SET_EVT_INT_POLICY | 3.0 §8.2.9.2.4
0x4300 | CXL_MBOX_OP_GET_POISON | 3.0 §8.2.9.8.4.1
0x4301 | CXL_MBOX_OP_INJECT_POISON | 3.0 §8.2.9.8.4.2
0x4302 | CXL_MBOX_OP_CLEAR_POISON | 3.0 §8.2.9.8.4.3
0x4303 | CXL_MBOX_OP_GET_SCAN_MEDIA_CAPS | 3.0
0x4304 | CXL_MBOX_OP_SCAN_MEDIA | 3.0 (marked deprecated in the UAPI)
0x4305 | CXL_MBOX_OP_GET_SCAN_MEDIA | 3.0 (marked deprecated in the UAPI)
0x4400 | CXL_MBOX_OP_SANITIZE | 3.0 §8.2.9.8.5.1
0x4401 | CXL_MBOX_OP_SECURE_ERASE | 3.0 §8.2.9.8.5.2
0x4500-0x4505 | Security commands | 3.0
0x10000 | CXL_MBOX_OP_MAX | sentinel

4.19 Get Event Payload Flags (cxlmem.h:604-615)

Symbol | Value
CXL_GET_EVENT_FLAG_OVERFLOW | BIT(0)
CXL_GET_EVENT_FLAG_MORE_RECORDS | BIT(1)

struct cxl_get_event_payload {
    u8 flags;
    u8 reserved1;
    __le16 overflow_err_count;
    __le64 first_overflow_timestamp;
    __le64 last_overflow_timestamp;
    __le16 record_count;
    u8 reserved2[10];
    struct cxl_event_record_raw records[];
};

The cxl_overflow event is emitted when flags & CXL_GET_EVENT_FLAG_OVERFLOW is set.
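
To show how the fixed header of this payload lines up byte by byte, the sketch below unpacks it with Python's struct module. It assumes the structure is packed with no padding and little-endian throughout (suggested by the __le16/__le64 field types), which gives a 32-byte header before the records array; parse_get_event_header is a hypothetical helper.

```python
import struct

# flags, reserved1, overflow_err_count, first/last overflow timestamps,
# record_count, reserved2[10]: assumed packed, little-endian, 32 bytes total.
HEADER = struct.Struct("<BBHQQH10s")

CXL_GET_EVENT_FLAG_OVERFLOW = 1 << 0
CXL_GET_EVENT_FLAG_MORE_RECORDS = 1 << 1


def parse_get_event_header(buf: bytes) -> dict:
    """Decode the fixed part of a Get Event Records payload."""
    flags, _rsvd1, overflow_cnt, first_ts, last_ts, record_count, _rsvd2 = \
        HEADER.unpack_from(buf)
    return {
        "overflow": bool(flags & CXL_GET_EVENT_FLAG_OVERFLOW),
        "more_records": bool(flags & CXL_GET_EVENT_FLAG_MORE_RECORDS),
        "overflow_err_count": overflow_cnt,
        "first_overflow_timestamp": first_ts,
        "last_overflow_timestamp": last_ts,
        "record_count": record_count,
    }
```

The overflow count and timestamps are only meaningful when the overflow flag is set, which is the same condition under which the trace event fires.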

4.20 Event Interrupt Modes (cxlmem.h:213-217)

enum cxl_event_int_mode {
    CXL_INT_NONE     = 0x00,
    CXL_INT_MSI_MSIX = 0x01,
    CXL_INT_FW       = 0x02
};

cxl_event_config_msgnums() (pci.c:677) hard-sets all 4 severities to CXL_INT_MSI_MSIX

4.21 PMEM Security State Flags (cxlmem.h:819-824)

Symbol | Value
CXL_PMEM_SEC_STATE_USER_PASS_SET | 0x01
CXL_PMEM_SEC_STATE_MASTER_PASS_SET | 0x02
CXL_PMEM_SEC_STATE_LOCKED | 0x04
CXL_PMEM_SEC_STATE_FROZEN | 0x08
CXL_PMEM_SEC_STATE_USER_PLIMIT | 0x10
CXL_PMEM_SEC_STATE_MASTER_PLIMIT | 0x20

4.22 DVSEC identifiers (cxlpci.h:15-56)

ID | Symbol | Spec
0x1E98 | PCI_DVSEC_VENDOR_ID_CXL | CXL 2.0 §8.1
0 | CXL_DVSEC_PCIE_DEVICE | 2.0 §8.1.3
2 | CXL_DVSEC_FUNCTION_MAP | 2.0 §8.1.4
3 | CXL_DVSEC_PORT_EXTENSIONS | 2.0 §8.1.5
4 | CXL_DVSEC_PORT_GPF | 2.0 §8.1.6
5 | CXL_DVSEC_DEVICE_GPF | 2.0 §8.1.7
7 | CXL_DVSEC_PCIE_FLEXBUS_PORT | 2.0 §8.1.8
8 | CXL_DVSEC_REG_LOCATOR | 2.0 §8.1.9

Device DVSEC control bits: CXL_DVSEC_MEM_CAPABLE=BIT(2), CXL_DVSEC_HDM_COUNT_MASK=GENMASK(5,4), CXL_DVSEC_MEM_ENABLE=BIT(2), CXL_DVSEC_MEM_INFO_VALID=BIT(0), CXL_DVSEC_MEM_ACTIVE=BIT(1)

4.23 RAS register offsets and masks (cxl.h:129-145)

Offset | Register | Mask
0x0 | CXL_RAS_UNCORRECTABLE_STATUS_OFFSET | GENMASK(16,14) OR GENMASK(11,0)
0x4 | CXL_RAS_UNCORRECTABLE_MASK_OFFSET | GENMASK(16,14) OR GENMASK(11,0)
0x8 | CXL_RAS_UNCORRECTABLE_SEVERITY_OFFSET |
0xC | CXL_RAS_CORRECTABLE_STATUS_OFFSET | GENMASK(6,0)
0x10 | CXL_RAS_CORRECTABLE_MASK_OFFSET |
0x14 | CXL_RAS_CAP_CONTROL_OFFSET | GENMASK(5,0) (FE pointer)
0x18 | CXL_RAS_HEADER_LOG_OFFSET |

Name | Value
CXL_RAS_CAPABILITY_LENGTH | 0x58
CXL_HEADERLOG_SIZE | SZ_512
CXL_HEADERLOG_SIZE_U32 | SZ_512 / sizeof(u32)

4.24 Misc

Name | Value | Location
CXL_EVENT_RECORD_DATA_LENGTH | 0x50 | cxlmem.h:594
CXL_EVENT_GEN_MED_COMP_ID_SIZE | 0x10 | cxlmem.h:645
CXL_EVENT_DER_CORRECTION_MASK_SIZE | 0x20 | cxlmem.h:664
CXL_CLEAR_EVENT_MAX_HANDLES | U8_MAX | cxlmem.h:639
CXL_FW_TRANSFER_ALIGNMENT | 128 | cxlmem.h:323

Important: the v6.6.0 kernel has no CXL_1_1 / CXL_2_0 / CXL_3_0 version macros; 3.0 is hard-coded.


5. extlog (Extended Log)

5.1 extlog_mem_event trace event fields

include/ras/ras_event.h:27-77; triggered from drivers/acpi/acpi_extlog.c:178

Field | Type | Meaning
err_seq | u32 | MCE extlog error sequence number
etype | u8 | CPER memory error type, or ~0 (invalid)
sev | u8 | CPER severity
pa | u64 | Physical address of the error, or ~0ull (invalid)
pa_mask_lsb | u8 | __ffs64(pa_mask), or ~0
fru_id | guid_t (16 bytes) | FRU GUID
fru_text | string (≤20 characters) | FRU text
data | cper_mem_err_compact | Compacted CPER memory error fields
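
The pa_mask_lsb field is easier to reason about with a concrete computation. The sketch below shows a userspace equivalent of __ffs64() and how the result bounds the affected region. The helper names are hypothetical, __builtin_ctzll is a GCC/Clang builtin, and returning 0xff for an empty mask is an assumption made here to mirror the "~0 means invalid" convention in the table.

```c
#include <stdint.h>

/* Index of the least significant set bit of a physical address mask.
 * Assumption: an empty mask is reported as 0xff, i.e. (u8)~0. */
static uint8_t pa_mask_lsb(uint64_t pa_mask)
{
    if (pa_mask == 0)
        return 0xff;
    return (uint8_t)__builtin_ctzll(pa_mask);
}

/* Size in bytes of the region implied by the lowest set mask bit;
 * 0 when the lsb value is not a valid bit index. */
static uint64_t affected_bytes(uint8_t lsb)
{
    return lsb < 64 ? (1ULL << lsb) : 0;
}
```

For a mask of 0xFFFFFFFFFFFFF000 the lowest set bit is 12, so the error is localized to a 4096-byte region around pa.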

6. CPER (Common Platform Error Record)

6.1 cper_severity enum

include/linux/cper.h:38-43

Value | Name | Meaning
0 | CPER_SEV_RECOVERABLE | Recoverable
1 | CPER_SEV_FATAL | Fatal
2 | CPER_SEV_CORRECTED | Corrected
3 | CPER_SEV_INFORMATIONAL | Informational

cper_severity_str() mapping (drivers/firmware/efi/cper.c:58-69):

  • RECOVERABLE → “recoverable”
  • FATAL → “fatal”
  • CORRECTED → “corrected”
  • INFORMATIONAL → “info”
  • anything else → “unknown”
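
A consumer that receives the numeric sev field can reproduce this mapping with a small lookup. The function below is a minimal sketch; the name cper_sev_name is hypothetical, and only the values and strings listed above are assumed.

```c
/* Map a CPER severity value (0..3) to the string listed above.
 * Out-of-range values fall back to "unknown". */
static const char *cper_sev_name(unsigned int sev)
{
    static const char *const names[] = {
        "recoverable", /* 0: CPER_SEV_RECOVERABLE */
        "fatal",       /* 1: CPER_SEV_FATAL */
        "corrected",   /* 2: CPER_SEV_CORRECTED */
        "info",        /* 3: CPER_SEV_INFORMATIONAL */
    };

    return sev < sizeof(names) / sizeof(names[0]) ? names[sev] : "unknown";
}
```

Note that the numeric order is not a severity order: 1 (fatal) is worse than 2 (corrected), so comparisons such as sev > 1 do not mean "more severe".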

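6.2 CPER memory error type strings

drivers/firmware/efi/cper.c:189-213, mem_err_type_strs[]

Index | String | Meaning | Impact
0 | “unknown” | Unknown | Indeterminate
1 | “no error” | Spurious | -
2 | “single-bit ECC” | Single-bit CE | Corrected
3 | “multi-bit ECC” | Multi-bit UE | Page poisoned
4 | “single-symbol chipkill ECC” | Single-symbol chipkill | Corrected
5 | “multi-symbol chipkill ECC” | Multi-symbol chipkill failure | UE
6 | “master abort” | Bus master abort | Transaction aborted
7 | “target abort” | Bus target abort | Transaction aborted
8 | “parity error” | Bus parity | Integrity violated
9 | “watchdog timeout” | Device watchdog | Transaction stalled
10 | “invalid address” | Invalid address | UE
11 | “mirror Broken” | Mirror broken | Mirroring disabled
12 | “memory sparing” | Spare rank activated | Corrected
13 | “scrub corrected error” | Corrected by patrol scrub | Corrected
14 | “scrub uncorrected error” | UE found by patrol scrub | Page poisoned
15 | “physical memory map-out event” | DIMM map-out | Device removed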

6.3 CPER memory validation bits

include/linux/cper.h:215-244

Value | Name | Field
0x0001 | CPER_MEM_VALID_ERROR_STATUS | error_status
0x0002 | CPER_MEM_VALID_PA | physical_addr
0x0004 | CPER_MEM_VALID_PA_MASK | physical_addr_mask
0x0008 | CPER_MEM_VALID_NODE | node
0x0010 | CPER_MEM_VALID_CARD | card
0x0020 | CPER_MEM_VALID_MODULE | module
0x0040 | CPER_MEM_VALID_BANK | bank
0x0080 | CPER_MEM_VALID_DEVICE | device
0x0100 | CPER_MEM_VALID_ROW | row
0x0200 | CPER_MEM_VALID_COLUMN | column
0x0400 | CPER_MEM_VALID_BIT_POSITION | bit_pos
0x0800 | CPER_MEM_VALID_REQUESTOR_ID | requestor_id
0x1000 | CPER_MEM_VALID_RESPONDER_ID | responder_id
0x2000 | CPER_MEM_VALID_TARGET_ID | target_id
0x4000 | CPER_MEM_VALID_ERROR_TYPE | error_type
0x8000 | CPER_MEM_VALID_RANK_NUMBER | rank
0x10000 | CPER_MEM_VALID_CARD_HANDLE | mem_array_handle
0x20000 | CPER_MEM_VALID_MODULE_HANDLE | mem_dev_handle
0x40000 | CPER_MEM_VALID_ROW_EXT | Extended row
0x80000 | CPER_MEM_VALID_BANK_GROUP | Bank group
0x100000 | CPER_MEM_VALID_BANK_ADDRESS | Bank address
0x200000 | CPER_MEM_VALID_CHIP_ID | Chip ID

Masks and shifts:

  • CPER_MEM_EXT_ROW_MASK = 0x3, CPER_MEM_EXT_ROW_SHIFT = 16
  • CPER_MEM_BANK_ADDRESS_MASK = 0xff
  • CPER_MEM_BANK_GROUP_SHIFT = 8
  • CPER_MEM_CHIP_ID_SHIFT = 5
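
The masks and shifts above only make sense next to the fields they slice, so here is a hedged sketch of how a decoder might combine them. It assumes that the two extension bits of the extended byte become row bits 17:16, that the 16-bit bank field packs bank address in the low byte and bank group in the high byte, and that the chip ID sits in the top bits of the extended byte. The macro and function names are local stand-ins, not kernel identifiers.

```c
#include <stdint.h>

/* Local copies of the values listed above (names shortened on purpose). */
#define MEM_VALID_ROW         0x0100
#define MEM_VALID_ROW_EXT     0x40000
#define MEM_EXT_ROW_MASK      0x3
#define MEM_EXT_ROW_SHIFT     16
#define MEM_BANK_ADDRESS_MASK 0xff
#define MEM_BANK_GROUP_SHIFT  8
#define MEM_CHIP_ID_SHIFT     5

/* Full row number: the 16-bit row plus, when CPER_MEM_VALID_ROW_EXT is set,
 * two extension bits placed at bits 17:16. */
static uint32_t mem_full_row(uint64_t valid, uint16_t row, uint8_t extended)
{
    uint32_t r = row;

    if (valid & MEM_VALID_ROW_EXT)
        r |= (uint32_t)(extended & MEM_EXT_ROW_MASK) << MEM_EXT_ROW_SHIFT;
    return r;
}

/* Low byte of the bank field: bank address. */
static uint8_t mem_bank_address(uint16_t bank)
{
    return (uint8_t)(bank & MEM_BANK_ADDRESS_MASK);
}

/* High byte of the bank field: bank group. */
static uint8_t mem_bank_group(uint16_t bank)
{
    return (uint8_t)(bank >> MEM_BANK_GROUP_SHIFT);
}

/* Top three bits of the extended byte: chip ID. */
static uint8_t mem_chip_id(uint8_t extended)
{
    return (uint8_t)(extended >> MEM_CHIP_ID_SHIFT);
}
```

A decoder should still check the matching CPER_MEM_VALID_* bit before trusting any of these values; the sketch only does so for the row extension.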

6.4 CPER Section Type GUIDs

include/linux/cper.h:154-200

GUID | Section type | Scope
9876CCAD-47B4-4bdb-B65E-16F193C4F3DB | CPER_SEC_PROC_GENERIC | Generic processor
DC3EA0B0-A144-4797-B95B-53FA242B6E1D | CPER_SEC_PROC_IA | Intel x86/x64
E429FAF1-3CB7-11D4-0BCA-070080C73C88 | CPER_SEC_PROC_IPF | Itanium
E19E3D16-BC11-11E4-9CAA-C2051D5D46B0 | CPER_SEC_PROC_ARM | ARM
A5BC1114-6F64-4EDE-B863-3E83ED7C83B1 | CPER_SEC_PLATFORM_MEM | Platform memory
D995E954-BBC1-430F-AD91-B44DCB3C6F35 | CPER_SEC_PCIE | PCIe
81212A96-09ED-4996-9471-8D729C8E69ED | CPER_SEC_FW_ERR_REC_REF | Firmware error record ref
C5753963-3B84-4095-BF78-EDDAD3F9C9DD | CPER_SEC_PCI_X_BUS | PCI/PCI-X bus
EB5E4685-CA66-4769-B6A2-26068B001326 | CPER_SEC_PCI_DEV | PCI component/device
5B51FEF7-C79D-4434-8F1B-AA62DE3E2C64 | CPER_SEC_DMAR_GENERIC | Generic DMAr
71761D37-32B2-45cd-A7D0-B0FEDD93E8CF | CPER_SEC_DMAR_VT | Intel VT-d DMAr
036F84E1-7F37-428c-A79E-575FDFAA84EC | CPER_SEC_DMAR_IOMMU | IOMMU DMAr

6.5 CPER Section Flags

include/linux/cper.h:123-146

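Flag | Value | Meaning
CPER_SEC_VALID_FRU_ID | 0x1 | -
CPER_SEC_VALID_FRU_TEXT | 0x2 | -
CPER_SEC_PRIMARY | 0x0001 | Section is directly associated with the error
CPER_SEC_CONTAINMENT_WARNING | 0x0002 | Error may have propagated
CPER_SEC_RESET | 0x0004 | Component must be reinitialized
CPER_SEC_ERROR_THRESHOLD_EXCEEDED | 0x0008 | Error threshold exceeded
CPER_SEC_RESOURCE_NOT_ACCESSIBLE | 0x0010 | Resource could not be queried because of a conflict
CPER_SEC_LATENT_ERROR | 0x0020 | Error has been contained but not corrected
CPER_SEC_REV | 0x0100 | -

CPER_SEC_VALID_* are section validation bits and CPER_SEC_REV is the section revision, so they overlap numerically with the flag bits but belong to different fields.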

6.6 CPER Record Flags

include/linux/cper.h:97-101

Flag | Value | Meaning
CPER_HW_ERROR_FLAGS_RECOVERED | 0x1 | Error has been recovered
CPER_HW_ERROR_FLAGS_PREVERR | 0x2 | Error comes from a previous boot
CPER_HW_ERROR_FLAGS_SIMULATED | 0x4 | Test injection

6.7 CPER Processor Validation Bits

include/linux/cper.h:201-213

Value | Name | Field
0x0001 | CPER_PROC_VALID_TYPE | proc_type
0x0002 | CPER_PROC_VALID_ISA | proc_isa
0x0004 | CPER_PROC_VALID_ERROR_TYPE | proc_error_type
0x0008 | CPER_PROC_VALID_OPERATION | -
0x0010 | CPER_PROC_VALID_FLAGS | -
0x0020 | CPER_PROC_VALID_LEVEL | -
0x0040 | CPER_PROC_VALID_VERSION | -
0x0080 | CPER_PROC_VALID_BRAND_INFO | -
0x0100 | CPER_PROC_VALID_ID | -
0x0200 | CPER_PROC_VALID_TARGET_ADDRESS | -
0x0400 | CPER_PROC_VALID_REQUESTOR_ID | -
0x0800 | CPER_PROC_VALID_RESPONDER_ID | -
0x1000 | CPER_PROC_VALID_IP | -

6.8 CPER PCIe Validation Bits

include/linux/cper.h:246-255

Value | Name | Field
0x1 | CPER_PCIE_VALID_PORT_TYPE | -
0x2 | CPER_PCIE_VALID_VERSION | -
0x4 | CPER_PCIE_VALID_COMMAND_STATUS | -
0x8 | CPER_PCIE_VALID_DEVICE_ID | -
0x10 | CPER_PCIE_VALID_SERIAL_NUMBER | -
0x20 | CPER_PCIE_VALID_BRIDGE_CONTROL_STATUS | -
0x40 | CPER_PCIE_VALID_CAPABILITY | -
0x80 | CPER_PCIE_VALID_AER_INFO | aer_info[] populated

CPER_PCIE_SLOT_SHIFT = 3

6.9 CPER ARM Error Types

include/linux/cper.h:273-315

Value | Name
0 | CPER_ARM_CACHE_ERROR
1 | CPER_ARM_TLB_ERROR
2 | CPER_ARM_BUS_ERROR
3 | CPER_ARM_VENDOR_ERROR (= CPER_ARM_MAX_TYPE)

The ARM validation bits and the err_info field shifts are defined in cper.h:257-315 (CPER_ARM_ERR_TRANSACTION_SHIFT=16, OPERATION_SHIFT=18, and so on).

6.10 CPER MCE creator / section UUIDs

arch/x86/kernel/cpu/mce/apei.c:127-132

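Name | UUID
CPER_CREATOR_MCE | 75a574e3-5052-4b29-8a8e-be2c6490b89d
CPER_SECTION_TYPE_MCE | fe08ffbe-95e4-4be7-bc73-4096044a38fc

apei_write_mce() (apei.c:144-174) writes the record to ERST with record flags CPER_HW_ERROR_FLAGS_PREVERR, section flags CPER_SEC_PRIMARY, and severity CPER_SEV_FATAL. These are three separate fields, not one OR-ed value.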

6.11 CPER Notification Type GUIDs

include/linux/cper.h:60-91

Name | Meaning
CPER_NOTIFY_CMC | Corrected Machine Check
CPER_NOTIFY_CPE | Corrected Platform Error
CPER_NOTIFY_MCE | Machine Check Exception
CPER_NOTIFY_PCIE | PCI Express Error
CPER_NOTIFY_INIT | INIT Record (IPF)
CPER_NOTIFY_NMI | Non-Maskable Interrupt
CPER_NOTIFY_BOOT | BOOT Error Record
CPER_NOTIFY_DMAR | DMA Remapping Error

7. APEI / GHES

7.1 ghes_severity enum

include/acpi/ghes.h:51-56

Value | Name | Meaning | Effect on mce.status
0x0 | GHES_SEV_NO | None / unknown | -
0x1 | GHES_SEV_CORRECTED | Corrected | MCI_STATUS_UC not set
0x2 | GHES_SEV_RECOVERABLE | Recoverable | sets MCI_STATUS_UC (apei.c:53-54)
0x3 | GHES_SEV_PANIC | Fatal / panic | sets MCI_STATUS_UC and MCI_STATUS_PCC

GHES_EXITING = 0x0002 (ghes.h:16): flag in ghes->flags, set while the GHES source is being torn down

CPER → GHES mapping (ghes.c:291-302):

  • CPER_SEV_INFORMATIONAL → GHES_SEV_NO
  • CPER_SEV_CORRECTED → GHES_SEV_CORRECTED
  • CPER_SEV_RECOVERABLE → GHES_SEV_RECOVERABLE
  • CPER_SEV_FATAL → GHES_SEV_PANIC
  • any other value → GHES_SEV_PANIC
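
The mapping matters because the two enums are ordered differently: GHES severities increase monotonically with badness, CPER severities do not. The sketch below restates the mapping as a function. The enum and function names are local stand-ins, and treating unknown input as the worst case is an assumption based on how the kernel handles it, as far as I can tell.

```c
/* Local copies of the numeric values from sections 6.1 and 7.1. */
enum { CPER_RECOVERABLE = 0, CPER_FATAL = 1, CPER_CORRECTED = 2, CPER_INFORMATIONAL = 3 };
enum { GHES_NO = 0, GHES_CORRECTED = 1, GHES_RECOVERABLE = 2, GHES_PANIC = 3 };

/* Translate a CPER severity into the monotonically ordered GHES severity. */
static int cper_to_ghes(int cper_sev)
{
    switch (cper_sev) {
    case CPER_INFORMATIONAL:
        return GHES_NO;
    case CPER_CORRECTED:
        return GHES_CORRECTED;
    case CPER_RECOVERABLE:
        return GHES_RECOVERABLE;
    case CPER_FATAL:
        return GHES_PANIC;
    default:
        /* Assumption: unknown severities are treated as the worst case. */
        return GHES_PANIC;
    }
}
```

After translation, comparisons such as sev >= GHES_RECOVERABLE are meaningful, which is what the APEI → MCE bridge relies on.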

7.2 APEI → MCE bridge

apei_mce_report_mem_error() (apei.c:29-64):

status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f
misc = (MCI_MISC_ADDR_PHYS << 6) | lsb
bank = -1  // synthetic

SMCA: apei_smca_report_x86_error() (apei.c:66-125) reads a block of 6 SMCA registers (MCA_STATUS, MCA_ADDR, MCA_MISC, MCA_CONFIG, MCA_IPID, MCA_SYND); MCA_CONFIG is part of the layout but is not copied into struct mce.
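
To make the synthetic record concrete, the sketch below assembles the status and misc values in userspace. The bit positions are the architectural MCi_STATUS layout (VAL=63, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and MCI_MISC_ADDR_PHYS is taken to be 2; the macro and function names are hypothetical, and the rule "UC from recoverable upward, PCC only for panic" reflects my reading of apei.c rather than anything stated in the snippet above.

```c
#include <stdint.h>

/* MCi_STATUS bits (local names; positions per the x86 MCA architecture). */
#define ST_VAL   (1ULL << 63)
#define ST_UC    (1ULL << 61)
#define ST_EN    (1ULL << 60)
#define ST_MISCV (1ULL << 59)
#define ST_ADDRV (1ULL << 58)
#define ST_PCC   (1ULL << 57)
#define MISC_ADDR_PHYS 2ULL /* MCi_MISC address mode: physical address */

enum { SEV_NO, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };

/* Build the fake "memory read error" status; 0x9f is the MCA error code. */
static uint64_t synth_status(int ghes_sev)
{
    uint64_t status = ST_VAL | ST_EN | ST_ADDRV | ST_MISCV | 0x9f;

    if (ghes_sev >= SEV_RECOVERABLE)
        status |= ST_UC;  /* uncorrected */
    if (ghes_sev >= SEV_PANIC)
        status |= ST_PCC; /* processor context corrupt (assumed) */
    return status;
}

/* Address mode in bits 8:6, least significant valid address bit in bits 5:0. */
static uint64_t synth_misc(unsigned int lsb)
{
    return (MISC_ADDR_PHYS << 6) | lsb;
}
```

A corrected error therefore yields 0x9C0000000000009F, and a page-granular error (lsb = 12) yields misc = 0x8C, which is handy when matching rasdaemon output against firmware-first reports.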

7.3 APEI HEST Source Types

include/acpi/actbl1.h:1406-1419

Value | Name | Meaning
0 | ACPI_HEST_TYPE_IA32_CHECK | IA-32 machine check source
1 | ACPI_HEST_TYPE_IA32_CORRECTED_CHECK | IA-32 corrected machine check
2 | ACPI_HEST_TYPE_IA32_NMI | IA-32 NMI
6 | ACPI_HEST_TYPE_AER_ROOT_PORT | PCIe AER root port
7 | ACPI_HEST_TYPE_AER_ENDPOINT | PCIe AER endpoint
8 | ACPI_HEST_TYPE_AER_BRIDGE | PCIe AER bridge
9 | ACPI_HEST_TYPE_GENERIC_ERROR | GHES
10 | ACPI_HEST_TYPE_GENERIC_ERROR_V2 | GHES v2 (with read_ack register)
11 | ACPI_HEST_TYPE_IA32_DEFERRED_CHECK | IA-32 deferred machine check

Values 3-5 are unused.

7.4 APEI HEST Notification Types

include/acpi/actbl1.h:1491-1505

Value | Name | Introduced in
0 | ACPI_HEST_NOTIFY_POLLED | -
1 | ACPI_HEST_NOTIFY_EXTERNAL | -
2 | ACPI_HEST_NOTIFY_LOCAL | -
3 | ACPI_HEST_NOTIFY_SCI | -
4 | ACPI_HEST_NOTIFY_NMI | -
5 | ACPI_HEST_NOTIFY_CMCI | ACPI 5.0
6 | ACPI_HEST_NOTIFY_MCE | ACPI 5.0
7 | ACPI_HEST_NOTIFY_GPIO | ACPI 6.0
8 | ACPI_HEST_NOTIFY_SEA | ACPI 6.1, ARM
9 | ACPI_HEST_NOTIFY_SEI | ACPI 6.1, ARM
10 | ACPI_HEST_NOTIFY_GSIV | ACPI 6.1
11 | ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED | ACPI 6.2

8. arm_event trace event

include/ras/ras_event.h:171-208; triggered from drivers/ras/ras.c:24 log_arm_hw_error(), called from ghes.c:507

Field | Type | Meaning
mpidr | u64 | Multiprocessor Affinity Register
midr | u64 | Main ID Register
running_state | u32 | Processor running state (bit 0 set = running)
psci_state | u32 | PSCI state
affinity | u8 | Affinity level

Validation bits (include/linux/cper.h:257-260): CPER_ARM_VALID_AFFINITY_LEVEL, CPER_ARM_VALID_MPIDR, CPER_ARM_VALID_RUNNING_STATE


9. non_standard_event trace event

include/ras/ras_event.h:219-253; triggered from drivers/ras/ras.c:17, called from ghes.c:676

Field | Type | Meaning
sec_type[16] | char array | Section type GUID, raw bytes
fru_id[16] | char array | FRU ID GUID
fru_text | string | FRU text
sev | u8 | CPER severity
len | u32 | Length of the raw data
buf | dynamic u8 | Raw error data (hex dump)

10. signal_generate trace event

include/trace/events/signal.h:50-80

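Field | Type | Meaning
sig | int | Signal number
errno | int | siginfo si_errno
code | int | siginfo si_code
comm[] | char[16] | Target task comm
pid | pid_t | Target PID
group | int | 1 = sent process-wide (shared pending queue), 0 = sent to a single thread
result | int | TRACE_SIGNAL_* result

10.1 TRACE_SIGNAL_* result codes

include/trace/events/signal.h:27-33

Value | Name | Meaning
0 | TRACE_SIGNAL_DELIVERED | Delivered
1 | TRACE_SIGNAL_IGNORED | Ignored
2 | TRACE_SIGNAL_ALREADY_PENDING | Already pending
3 | TRACE_SIGNAL_OVERFLOW_FAIL | sigqueue overflow
4 | TRACE_SIGNAL_LOSE_INFO | siginfo lost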

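10.2 SIGBUS codes (si_code values)

include/uapi/asm-generic/siginfo.h:251-258

Value | Name | Meaning
1 | BUS_ADRALN | Invalid address alignment
2 | BUS_ADRERR | Nonexistent physical address
3 | BUS_OBJERR | Object-specific hardware error
4 | BUS_MCEERR_AR | Hardware memory error consumed on a machine check (action required)
5 | BUS_MCEERR_AO | Hardware memory error detected but not consumed (action optional)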
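
When correlating signal_generate events with memory failures, the interesting question is usually whether a SIGBUS came from hardware poison and whether the data was already consumed. The helpers below are a small sketch of that classification. They use private enum names with the values from the table instead of the libc macros, so the names are hypothetical.

```c
#include <stdbool.h>

/* si_code values for SIGBUS, copied from the table above. */
enum bus_code {
    CODE_ADRALN = 1,
    CODE_ADRERR = 2,
    CODE_OBJERR = 3,
    CODE_MCEERR_AR = 4,
    CODE_MCEERR_AO = 5,
};

/* Symbolic name for a SIGBUS si_code, or "unknown". */
static const char *bus_code_name(int si_code)
{
    switch (si_code) {
    case CODE_ADRALN:    return "BUS_ADRALN";
    case CODE_ADRERR:    return "BUS_ADRERR";
    case CODE_OBJERR:    return "BUS_OBJERR";
    case CODE_MCEERR_AR: return "BUS_MCEERR_AR";
    case CODE_MCEERR_AO: return "BUS_MCEERR_AO";
    default:             return "unknown";
    }
}

/* True when the SIGBUS reports a hardware memory error. */
static bool bus_code_is_hwpoison(int si_code)
{
    return si_code == CODE_MCEERR_AR || si_code == CODE_MCEERR_AO;
}

/* True when poisoned data was consumed and the access cannot simply be retried. */
static bool bus_code_action_required(int si_code)
{
    return si_code == CODE_MCEERR_AR;
}
```

In a trace consumer these would be applied to the code field only when sig equals SIGBUS, since si_code values are signal-specific.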

11. devlink_health_report trace event

include/trace/events/devlink.h:81-107

Field | Type | Meaning
bus_name | string | Bus name of the devlink device
dev_name | string | Name of the devlink device
driver_name | string | Driver name
reporter_name | string | Health reporter name
msg | string | Message string

11.1 devlink_health_reporter_state enum

include/net/devlink.h:714-717

Value | Name | Meaning
0 | DEVLINK_HEALTH_REPORTER_STATE_HEALTHY | Healthy
1 | DEVLINK_HEALTH_REPORTER_STATE_ERROR | Error state

(v6.6 has no REPLAYING state; it was added later.)


12. block_rq_error trace event

include/trace/events/block.h:165-170 (an instance of the block_rq_completion event class, lines 105-135)

Field | Type | Meaning
dev | dev_t | Block device (MAJOR, MINOR)
sector | sector_t | Starting sector
nr_sector | u32 | Number of 512-byte sectors
error | int | errno produced by blk_status_to_errno()
rwbs[] | char | R/W/B/S flags
cmd | dynamic char | Command string

12.1 blk_status_t enum (BLK_STS_*)

include/linux/blk_types.h:99-179

Value | Name | errno | Meaning
0 | BLK_STS_OK | 0 | Success
1 | BLK_STS_NOTSUPP | -EOPNOTSUPP | op not supported
2 | BLK_STS_TIMEOUT | -ETIMEDOUT | timeout
3 | BLK_STS_NOSPC | -ENOSPC | critical space allocation
4 | BLK_STS_TRANSPORT | -ENOLINK | recoverable transport
5 | BLK_STS_TARGET | -EREMOTEIO | critical target
6 | BLK_STS_RESV_CONFLICT | -EBADE | reservation conflict
7 | BLK_STS_MEDIUM | -ENODATA | critical medium
8 | BLK_STS_PROTECTION | -EILSEQ | protection (DIF/DIX integrity)
9 | BLK_STS_RESOURCE | -ENOMEM | kernel resource
10 | BLK_STS_IOERR | -EIO | generic I/O
11 | BLK_STS_DM_REQUEUE | -EREMCHG | DM internal retry
12 | BLK_STS_AGAIN | -EAGAIN | nonblocking retry
13 | BLK_STS_DEV_RESOURCE | -EBUSY | device-specific resource
14 | BLK_STS_ZONE_RESOURCE | - | zone resource
15 | BLK_STS_ZONE_OPEN_RESOURCE | -ETOOMANYREFS | too many open zones
16 | BLK_STS_ZONE_ACTIVE_RESOURCE | -EOVERFLOW | too many active zones
17 | BLK_STS_OFFLINE | -ENODEV | device offline
18 | BLK_STS_DURATION_LIMIT | -ETIME | command duration limit exceeded
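
Because block_rq_error carries the errno rather than the blk_status_t, a consumer that wants the BLK_STS_* name has to map it back. The sketch below does that reverse lookup for the errno values in the table. The struct and function names are hypothetical, the errno constants are the Linux ones from <errno.h>, and the lookup is only a best guess: -EIO in particular is a catch-all, so it cannot be attributed to BLK_STS_IOERR with certainty.

```c
#include <errno.h>
#include <stddef.h>

struct blk_err {
    int err;         /* positive errno value */
    const char *sts; /* BLK_STS_* name from the table above */
};

static const struct blk_err blk_errs[] = {
    { EOPNOTSUPP,   "BLK_STS_NOTSUPP" },
    { ETIMEDOUT,    "BLK_STS_TIMEOUT" },
    { ENOSPC,       "BLK_STS_NOSPC" },
    { ENOLINK,      "BLK_STS_TRANSPORT" },
    { EREMOTEIO,    "BLK_STS_TARGET" },
    { EBADE,        "BLK_STS_RESV_CONFLICT" },
    { ENODATA,      "BLK_STS_MEDIUM" },
    { EILSEQ,       "BLK_STS_PROTECTION" },
    { ENOMEM,       "BLK_STS_RESOURCE" },
    { EIO,          "BLK_STS_IOERR" },
    { EREMCHG,      "BLK_STS_DM_REQUEUE" },
    { EAGAIN,       "BLK_STS_AGAIN" },
    { EBUSY,        "BLK_STS_DEV_RESOURCE" },
    { ETOOMANYREFS, "BLK_STS_ZONE_OPEN_RESOURCE" },
    { EOVERFLOW,    "BLK_STS_ZONE_ACTIVE_RESOURCE" },
    { ENODEV,       "BLK_STS_OFFLINE" },
    { ETIME,        "BLK_STS_DURATION_LIMIT" },
};

/* Best-guess BLK_STS_* name for the error field of block_rq_error.
 * Accepts the value with either sign. */
static const char *blk_sts_from_errno(int error)
{
    int e = error < 0 ? -error : error;

    if (e == 0)
        return "BLK_STS_OK";
    for (size_t i = 0; i < sizeof(blk_errs) / sizeof(blk_errs[0]); i++) {
        if (blk_errs[i].err == e)
            return blk_errs[i].sts;
    }
    return "unknown";
}
```

A linear scan is enough here since the table has fewer than twenty entries and the lookup runs once per error event.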

13. memory_failure_event trace event

include/ras/ras_event.h:399-423; triggered from mm/memory-failure.c:1323

Field | Type | Meaning
pfn | unsigned long | Page Frame Number
type | int | enum mf_action_page_type
result | int | enum mf_result

13.1 MF_MSG_* page types

include/linux/mm.h:3917-3938 + include/ras/ras_event.h:356-376

Value | Name | String
0 | MF_MSG_KERNEL | “reserved kernel page”
1 | MF_MSG_KERNEL_HIGH_ORDER | “high-order kernel page”
2 | MF_MSG_SLAB | “kernel slab page”
3 | MF_MSG_DIFFERENT_COMPOUND | “different compound page after locking”
4 | MF_MSG_HUGE | “huge page”
5 | MF_MSG_FREE_HUGE | “free huge page”
6 | MF_MSG_UNMAP_FAILED | “unmapping failed page”
7 | MF_MSG_DIRTY_SWAPCACHE | “dirty swapcache page”
8 | MF_MSG_CLEAN_SWAPCACHE | “clean swapcache page”
9 | MF_MSG_DIRTY_MLOCKED_LRU | “dirty mlocked LRU page”
10 | MF_MSG_CLEAN_MLOCKED_LRU | “clean mlocked LRU page”
11 | MF_MSG_DIRTY_UNEVICTABLE_LRU | “dirty unevictable LRU page”
12 | MF_MSG_CLEAN_UNEVICTABLE_LRU | “clean unevictable LRU page”
13 | MF_MSG_DIRTY_LRU | “dirty LRU page”
14 | MF_MSG_CLEAN_LRU | “clean LRU page”
15 | MF_MSG_TRUNCATED_LRU | “already truncated LRU page”
16 | MF_MSG_BUDDY | “free buddy page”
17 | MF_MSG_DAX | “dax page”
18 | MF_MSG_UNSPLIT_THP | “unsplit thp”
19 | MF_MSG_UNKNOWN | “unknown page”

13.2 mf_action_result / enum mf_result

include/linux/mm.h:3910-3915

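Value | Name | String | Meaning
0 | MF_IGNORED | “Ignored” | Cannot be handled
1 | MF_FAILED | “Failed” | Handling failed
2 | MF_DELAYED | “Delayed” | Will be handled later
3 | MF_RECOVERED | “Recovered” | Recovered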
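
For alerting purposes the result field is usually collapsed into "page isolated or not". The sketch below decodes the value and applies one possible policy. The function names are hypothetical, and treating Ignored and Failed as "needs attention" is a policy choice made for this example, not something the kernel defines.

```c
#include <stdbool.h>

/* Strings for enum mf_result, indexed by value (from the table above). */
static const char *const mf_result_names[] = {
    "Ignored",   /* 0: MF_IGNORED */
    "Failed",    /* 1: MF_FAILED */
    "Delayed",   /* 2: MF_DELAYED */
    "Recovered", /* 3: MF_RECOVERED */
};

/* Name of a memory_failure_event result, or "unknown" when out of range. */
static const char *mf_result_name(int result)
{
    if (result < 0 || result > 3)
        return "unknown";
    return mf_result_names[result];
}

/* Example policy (an assumption): anything other than Delayed or Recovered
 * means the poisoned page was not isolated and deserves an alert. */
static bool mf_needs_attention(int result)
{
    return result != 2 && result != 3;
}
```

Pairing this with the MF_MSG_* page type gives a compact log line such as "dirty LRU page: Recovered".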

13.3 enum mf_flags

include/linux/mm.h:3826-3834

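Value | Name | Meaning
1<<0 | MF_COUNT_INCREASED | refcount already incremented
1<<1 | MF_ACTION_REQUIRED | Action required
1<<2 | MF_MUST_KILL | Process must be killed
1<<3 | MF_SOFT_OFFLINE | Soft offline
1<<4 | MF_UNPOISON | Unpoison request
1<<5 | MF_SW_SIMULATED | Software-simulated injection
1<<6 | MF_NO_RETRY | Do not retry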

13.4 HWPOISON page flag

include/linux/page-flags.h

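  • PG_hwpoison: “hardware poisoned page. Don’t touch”. It is an enum pageflags entry, so its bit index depends on the kernel configuration.
  • __PG_HWPOISON: the corresponding mask
  • SetPageHWPoison(page) / ClearPageHWPoison(page) / PageHWPoison(page): flag ops
  • SetPageHWPoisonTakenOff(page) / ClearPageHWPoisonTakenOff(page): mark a poisoned page that has been taken off the buddy allocator
  • PageHasHWPoisoned(page): compound page check (some subpage is poisoned)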

13.5 hwpoison-inject debug interface

mm/hwpoison-inject.c

  • Path: /sys/kernel/debug/hwpoison/
  • Files: corrupt-pfn (calls hwpoison_inject(), which passes MF_SW_SIMULATED), unpoison-pfn
  • Filters: corrupt-filter-enable, corrupt-filter-dev-major/-minor, corrupt-filter-flags-mask/-value, corrupt-filter-memcg
  • Requires CAP_SYS_ADMIN, otherwise -EPERM; !pfn_valid() returns -ENXIO
  • A -EOPNOTSUPP result is reported as 0

14. ERST (APEI Error Record Serialization Table)

include/acpi/apei.h:13-51

ERST stores raw cper_record_header blobs (signature = “CPER”). ioctl interface:

Name | Value | Meaning
APEI_ERST_INVALID_RECORD_ID | 0xffffffffffffffffULL | “no record” sentinel
APEI_ERST_CLEAR_RECORD | _IOW('E', 1, u64) | Clear a record by id
APEI_ERST_GET_RECORD_COUNT | _IOR('E', 2, u32) | Total number of records

ERST has no “record type” enum; the format is identified by cper_record_header.signature == "CPER".


15. pstore

include/linux/pstore.h:28-44 + fs/pstore/platform.c:44-54

15.1 enum pstore_type_id

Value | Name | String | Meaning
0 | PSTORE_TYPE_DMESG | “dmesg” | Kernel log snapshot
1 | PSTORE_TYPE_MCE | “mce” | MCE records
2 | PSTORE_TYPE_CONSOLE | “console” | Console output
3 | PSTORE_TYPE_FTRACE | “ftrace” | ftrace ring buffer
4 | PSTORE_TYPE_PPC_RTAS | “rtas” | PowerPC RTAS
5 | PSTORE_TYPE_PPC_OF | “powerpc-ofw” | PowerPC Open Firmware
6 | PSTORE_TYPE_PPC_COMMON | “powerpc-common” | PowerPC common
7 | PSTORE_TYPE_PMSG | “pmsg” | Userspace pmsg
8 | PSTORE_TYPE_PPC_OPAL | “powerpc-opal” | PowerPC OPAL
9 | PSTORE_TYPE_MAX | (sentinel) | End of list

15.2 PSTORE_FLAGS_* (frontends, pstore.h:205-208)

Flag | Value
PSTORE_FLAGS_DMESG | BIT(0)
PSTORE_FLAGS_CONSOLE | BIT(1)
PSTORE_FLAGS_FTRACE | BIT(2)
PSTORE_FLAGS_PMSG | BIT(3)

16. Kernel side → rasdaemon side error code cross-reference

Domain | Kernel source | Code count | Origin
MCE | arch/x86/include/asm/mce.h | ~50 (bits + helpers) | MSR 0xc0002000+
MCE Severity | arch/x86/kernel/cpu/mce/severity.c | 22 rules | Table-driven
AER | include/uapi/linux/pci_regs.h | 8 CE + 16 UE bits | PCI config 0x04/0x10
EDAC | include/linux/edac.h | 5 err_type + 4 layers + 29 mem_type + 9 edac_type + 9 scrub + 3 op_state | Driver implementations
CXL | drivers/cxl/core/trace.h + cxlmem.h | 15 UC + 7 CE + 16 mailbox + 30 RC + 3 mem_evt + 7 trans + 5 health + 10 media + 8 dev_status | Device registers
extlog | include/ras/ras_event.h | 16 err_type + 4 sev | ACPI 6456 interface
CPER | include/linux/cper.h | 4 sev + 16 mem err_type + 22 mem valid + 12 section type | UEFI spec
APEI | include/acpi/ghes.h + actbl1.h | 4 sev + 12 HEST type + 12 HEST notify | ACPI tables
ARM | include/ras/ras_event.h | 4 err type (CPER) | CPER PEI section
memory-failure | include/linux/mm.h | 20 MF_MSG + 4 MF result + 7 mf_flags | Kernel
hwpoison | include/linux/page-flags.h | PG_hwpoison, 1 bit | Kernel page flag
signal | include/trace/events/signal.h + uapi/asm-generic/siginfo.h | 5 result + 5 BUS code | Kernel
devlink | include/trace/events/devlink.h + include/net/devlink.h | 2 reporter_state | Kernel
block | include/linux/blk_types.h | 19 BLK_STS_* | Kernel
ERST | include/acpi/apei.h | 3 ioctl definitions | ACPI tables
pstore | include/linux/pstore.h | 9 type + 4 flags | Kernel

Total: roughly 400+ error codes originate in the kernel sources.


17. Key file index

File | Lines | Purpose
arch/x86/include/asm/mce.h | 334 | MCE status bits / SMCA enum / MSR addresses
arch/x86/include/uapi/asm/mce.h | 43 | struct mce / ioctls
arch/x86/kernel/cpu/mce/internal.h | 250+ | severity_level / vendor flags / mca_config
arch/x86/kernel/cpu/mce/core.c | 2800+ | Main dispatch, severity decision, CPUHP, cmdline
arch/x86/kernel/cpu/mce/intel.c | 1000+ | CMCI implementation, Intel filter quirks
arch/x86/kernel/cpu/mce/amd.c | 800+ | smca_hwid_mcatypes, threshold handling
arch/x86/kernel/cpu/mce/severity.c | 400+ | Intel/AMD severity tables
arch/x86/kernel/cpu/mce/inject.c | 800+ | mce-inject / injection_type
arch/x86/kernel/cpu/mce/genpool.c | 150+ | MCE_POOLSZ / genpool
arch/x86/kernel/cpu/mce/apei.c | 200+ | APEI/MCE bridge
arch/x86/kernel/cpu/mce/dev-mcelog.c | 380+ | /dev/mcelog
arch/x86/kernel/cpu/mce/p5.c winchip.c | 100+ | Legacy MCE
include/trace/events/mce.h | 70 | mce_record trace event
include/linux/cper.h | 560+ | All CPER definitions
include/acpi/ghes.h | 80+ | GHES_SEV_* / bridge interface
include/acpi/apei.h | 80+ | ERST ioctls
include/acpi/actbl1.h | 1500+ | HEST table
include/uapi/linux/pci_regs.h | 800+ | PCI_ERR_* bits
include/linux/aer.h | 50+ | AER severity / header structure
drivers/pci/pcie/aer.c | 1400+ | AER ISR / recovery / classification
drivers/pci/pcie/aer_inject.c | 600+ | AER software injection
drivers/pci/pcie/err.c | 280+ | pcie_do_recovery()
include/linux/edac.h | 600+ | All EDAC enums / structures
drivers/edac/edac_mc.c | 1300+ | Main dispatch
drivers/edac/skx_common.c i10nm_base.c sb_edac.c | 1000-3500 each | Intel per-generation decoding
drivers/edac/amd64_edac.c | 3500+ | AMD64 decoding
drivers/cxl/core/trace.h | 700+ | 8 cxl trace events
drivers/cxl/cxlmem.h | 830+ | mailbox / events / poison
drivers/cxl/cxl.h | 470+ | RAS registers / event log type
drivers/cxl/cxlpci.h | 100+ | DVSEC / regloc type
drivers/cxl/core/mbox.c | 1200+ | mailbox dispatch
drivers/cxl/core/memdev.c | 800+ | memdev driver
drivers/cxl/pci.c | 800+ | CXL.io AER / event ISR
include/uapi/linux/cxl_mem.h | 50+ | UAPI CXL commands
include/ras/ras_event.h | 430+ | All RAS trace events
include/linux/mm.h | 4000+ | mf_result / mf_action_page_type / mf_flags
mm/memory-failure.c | 1500+ | memory_failure / page handling
mm/hwpoison-inject.c | 110+ | debugfs hwpoison injection
include/linux/blk_types.h | 300+ | blk_status_t / REQ_*
include/trace/events/signal.h | 80+ | signal_generate / TRACE_SIGNAL_*
include/trace/events/devlink.h | 170+ | devlink_health_* trace events
include/trace/events/block.h | 200+ | block_rq_error
include/net/devlink.h | 1000+ | All of devlink
include/linux/pstore.h | 300+ | All of pstore

18. Debugging and quick lookup

Question | Where to look
MCE error bit definitions | arch/x86/include/asm/mce.h
MCE severity decision | arch/x86/kernel/cpu/mce/severity.c
SMCA bank → HWID mapping | arch/x86/kernel/cpu/mce/amd.c:160-219
AER error bits | include/uapi/linux/pci_regs.h:748-802
AER severity mapping | drivers/pci/pcie/aer.c:750-761 (cper_severity_to_aer)
EDAC error type strings | edac_mc.c:121-136 (mc_event_error_type)
CXL error bits | drivers/cxl/core/trace.h:17-31 (UC), 81-87 (CE)
CXL event dispatch | drivers/cxl/core/mbox.c:843-861
CPER memory err_type strings | drivers/firmware/efi/cper.c:189-213
GHES bridge | drivers/acpi/apei/ghes.c
signal si_code | include/uapi/asm-generic/siginfo.h:251-258
blk_status errno | block/blk-core.c:151-178 (blk_status_to_errno)
memory_failure strings | mm/memory-failure.c:870-897
hwpoison injection | mm/hwpoison-inject.c
ERST ioctl | include/acpi/apei.h:13-51
Tracepoints | /sys/kernel/debug/tracing/events/<system>/

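19. Key facts

  1. All of rasdaemon's error codes originate in the kernel: it is a consumer, not a definer
  2. CPER is the common transport format: almost all hardware platform errors (PEI sections, memory sections, PCIe sections) use CPER
  3. APEI is the firmware → kernel bridge: errors enter the kernel from firmware through GHES / AER notifications
  4. Severity is decided in the kernel: the error_type that rasdaemon sees has already been classified by the kernel
  5. Recovery is carried out by the kernel: rasdaemon only observes; the few EDAC PFA writes to sysfs are the exception
  6. SMCA has 26 bank types, routed by the (HWID, MCATYPE) pair
  7. CXL in v6.6.0 has no sparing events; they were added in newer kernels
  8. The CXL kernel macros are named CXL_RAS_UC_* / CXL_RAS_CE_*, not CXL_AER_UE_*
  9. arm_event decodes only the PEI section, not the SMMU/GIC/CCI/CCN-specific sections
  10. hwpoison is a page flag, not an enum; use it together with the mf_action_page_type strings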