背景
MySQL 连接数爆满是生产环境中较为常见的紧急故障。当 Too many connections 错误出现时,意味着 MySQL 服务端无法为新的客户端请求分配连接句柄,新的业务请求将被直接拒绝。这类故障的诱因多样,可能是突发流量、慢查询阻塞、连接泄漏或参数配置不当。如果运维人员没有明确的排障思路,很容易在紧急状态下反复尝试重启服务,不仅治标不治本,还可能造成数据损坏或业务中断。
本文以 MySQL 8.0.39 为基准版本,结合 2026 年主流监控体系(Prometheus + AlertManager + Grafana),提供一套从紧急止血到根因定位再到长期治理的完整处理流程。内容覆盖 Linux 系统层诊断、MySQL 状态分析、连接泄漏定位、参数调优以及监控告警配置,力图做到“一文在手,排障无忧”。
前置知识要求: 熟悉 Linux 基础命令(top、ps、netstat)、了解 MySQL 基本架构(连接管理器、线程池、存储引擎)。
1. 问题现象与紧急评估
1.1 典型临床表现
MySQL 连接数爆满时,业务侧通常会收到以下几类反馈:
-
应用日志中出现
SQLSTATE[HY000] [2002] Can't connect to MySQL server或ERROR 1040 (HY000): Too many connections -
部分接口响应超时,但数据库直连查询正常
-
数据库主机本身可以登录,但普通账号无法建立新连接,只有
root@localhost尚可登录
紧急评估第一步:确认主机层是否存活。
# 检查 MySQL 进程是否存活
ps aux | grep mysqld | grep -v grep
# 检查端口是否监听
ss -tlnp | grep 3306
# 或
netstat -tlnp | grep 3306
# 检查主机负载
uptime
w
紧急评估第二步:确认连接数状态。
# 登录 MySQL(使用 socket 直连,不依赖 TCP 连接池)
mysql -u root -p -S /var/lib/mysql/mysql.sock -e "SHOW PROCESSLIST;" 2>/dev/null | wc -l
如果 SHOW PROCESSLIST 无法执行(连接数已满),但系统层面 MySQL 进程存活,可以尝试:
# 直接查看 MySQL error log 中的连接拒绝记录
tail -100 /var/log/mysql/error.log | grep -i "too many connections"
# 查看操作系统层建立的 MySQL 连接数
ss -ant | grep :3306 | wc -l
# 查看每个 IP 的连接数分布
ss -ant | grep :3306 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
紧急评估第三步:计算连接数与 max_connections 的比例。
连接数爆满不一定意味着达到 max_connections 上限。以下两种情况都表现为连接不足:
-
max_connections = 151(MySQL 5.7/8.0 默认值),当前连接 145 -
实际连接 90,但因
wait_timeout设置过短,80个连接处于 Sleep 状态占位
-- 查看当前连接数与最大值
SHOWSTATUSLIKE'Threads_connected';
SHOWVARIABLESLIKE'max_connections';
-- 查看连接状态分布
SHOWSTATUSLIKE'Threads%';
-- 输出示例:
-- Threads_connected: 147
-- Threads_running: 12
-- Threads_cached: 5
Threads_connected 与 Threads_running 的区别必须理解清楚:
-
Threads_connected:已建立但未关闭的连接数,包含睡眠连接 -
Threads_running:正在执行 SQL 的连接数(不包含等待 I/O 的连接)
真正危险的是 Threads_running 飙升。如果 Threads_running 接近或达到 max_connections,说明服务器资源已经严重过载。
2. 快速止血操作
警告: 以下操作具有侵入性,实施前请确认已有备份或已通知相关业务方。
2.1 杀掉空闲睡眠连接
大多数连接数爆满场景下,大量连接处于 Sleep 状态(等待客户端发送新请求)。这些连接如果没有设置 wait_timeout,可能长期占用连接句柄。杀掉睡眠连接是最快的止血手段。
-- 杀掉所有睡眠连接(执行前确认无长查询事务)
KILL CONNECTION IF EXISTS (
SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep' AND TIME > 30
);
如果上述语法不支持,使用游标方式:
-- MySQL 8.0 版本
SELECT CONCAT('KILL ', ID, ';')
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
AND TIME > 30
INTO OUTFILE '/tmp/kill_sleep.sql';
SOURCE /tmp/kill_sleep.sql;
Shell 一键清理脚本:
#!/bin/bash
# kill_mysql_sleep_conn.sh
# 用途:清理 MySQL 超时睡眠连接
# 用法:./kill_mysql_sleep_conn.sh 30
SOCKET="/var/lib/mysql/mysql.sock"
THRESHOLD=${1:-30} # 睡眠超时秒数,默认30秒
mysql -u root -p"$(cat /etc/mysql_root_pwd 2>/dev/null || echo '')" -S "$SOCKET" -N -e "
SELECT CONCAT('KILL ', ID, ';')
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
AND TIME > $THRESHOLD;
" 2>/dev/null | mysql -u root -p"$(cat /etc/mysql_root_pwd 2>/dev/null || echo '')" -S "$SOCKET" 2>/dev/null
echo "$(date '+%Y-%m-%d %H:%M:%S') - 已清理睡眠连接"
2.2 临时提高 max_connections
如果清理睡眠连接后仍不够用,需要临时提高 max_connections。这是“边止血边排障”的策略,但不应作为长期方案。
-- 动态调整(无需重启 MySQL)
SET GLOBAL max_connections = 500;
也可以在 MySQL 配置文件 my.cnf 中临时修改:
# 备份原配置
cp /etc/mysql/my.cnf /etc/mysql/my.cnf.bak.$(date +%Y%m%d%H%M%S)
# 临时提高连接数
sed -i '/^\[mysqld\]/a max_connections=500' /etc/mysql/my.cnf
# 验证配置已生效
mysql -u root -S "$SOCKET" -e "SHOW VARIABLES LIKE 'max_connections';"
2.3 杀掉特定慢查询连接
如果连接数爆满是由少数慢查询引起的(常见于大事务或全表扫描),识别并杀掉这些查询可以快速缓解。
-- 识别执行时间最长的查询
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 100) ASQUERY
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND != 'Sleep'
ORDERBYTIMEDESC
LIMIT10;
#!/bin/bash
# kill_long_query.sh
# 用途:杀掉执行时间超过指定秒数的查询
# 用法:./kill_long_query.sh 60
SOCKET="/var/lib/mysql/mysql.sock"
QUERY_TIME=${1:-60}
mysql -u root -S "$SOCKET" -N -e "
SELECT CONCAT('KILL QUERY ', ID, ';')
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND != 'Sleep'
AND TIME > $QUERY_TIME;
" 2>/dev/null | mysql -u root -S "$SOCKET" 2>/dev/null
echo "$(date) - 已终止执行时间超过 ${QUERY_TIME} 秒的查询"
2.4 紧急情况下的连接数实时监控脚本
#!/bin/bash
# mysql_conn_monitor.sh
# 用途:实时监控 MySQL 连接数,超过阈值时告警
# 建议配合 cron 使用:* * * * * /opt/scripts/mysql_conn_monitor.sh
SOCKET="/var/lib/mysql/mysql.sock"
WARN_THRESHOLD=100
CRIT_THRESHOLD=130
# 获取当前连接数
CURRENT_CONN=$(mysql -u root -S "$SOCKET" -N -e "SHOW STATUS LIKE 'Threads_connected';" 2>/dev/null | awk '{print $2}')
MAX_CONN=$(mysql -u root -S "$SOCKET" -N -e "SHOW VARIABLES LIKE 'max_connections';" 2>/dev/null | awk '{print $2}')
RUNNING=$(mysql -u root -S "$SOCKET" -N -e "SHOW STATUS LIKE 'Threads_running';" 2>/dev/null | awk '{print $2}')
echo"$(date '+%Y-%m-%d %H:%M:%S') - 连接数: ${CURRENT_CONN}/${MAX_CONN}, 运行中: ${RUNNING}"
if [ "$CURRENT_CONN" -ge "$CRIT_THRESHOLD" ]; then
echo"CRITICAL: 连接数 ${CURRENT_CONN} 超过临界值 ${CRIT_THRESHOLD}"
# 触发告警(根据实际告警渠道调整)
curl -X POST "http://alertmanager:9093/api/v1/alerts" \
-H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"MySQL_Too_Many_Connections","severity":"critical"},"annotations":{"summary":"MySQL连接数告警"}}]' 2>/dev/null
elif [ "$CURRENT_CONN" -ge "$WARN_THRESHOLD" ]; then
echo"WARNING: 连接数 ${CURRENT_CONN} 超过警告值 ${WARN_THRESHOLD}"
fi
3. 根因分析
3.1 慢查询导致连接堆积
慢查询是连接数爆满最常见的根因。当一个查询执行时间过长(比如30秒),在这30秒内,该连接一直处于 Sending data 或 Sorting result 状态,新请求无法复用该连接,后续请求只能创建新连接。如果并发慢查询持续到来,连接数会快速打满。
排查慢查询的方法:
-- 查看最近的重查询(基于 performance_schema)
SELECT
DIGEST,
COUNT_STAR,
SUM_TIMER_WAIT / 1000000000000 AS SUM_SECONDS,
AVG_TIMER_WAIT / 1000000000000 AS AVG_SECONDS,
LEFT(SQL_TEXT, 200) AS SQL_SAMPLE
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
# 查看 MySQL 慢查询日志
tail -100 /var/log/mysql/slow.log | grep -i "Query_time"
慢查询的典型场景:
-
大事务未提交:一个事务开启后执行了大量操作,但未提交或回滚,导致锁持有时间过长。
-- 查看未提交事务
SELECT
trx_id,
trx_state,
trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS TRX_RUNNING_SEC,
trx_query,
trx_rows_modified,
trx_tables_locked
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_started;
-
缺失索引导致全表扫描:一条查询扫描全表10万行数据返回10行结果。
-- 查找可能缺少索引的查询
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
INDEX_NAME,
SEQ_IN_INDEX,
COLUMN_NAME
FROM INFORMATION_SCHEMA.STATISTICS
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
AND INDEX_NAME != 'PRIMARY'
ORDER BY OBJECT_SCHEMA, OBJECT_NAME, SEQ_IN_INDEX;
-
深分页查询:大量使用
LIMIT offset, n导致回表次数过多。
-- 检测深分页模式
SELECT
LEFT(INFO, 100) AS QUERY,
COUNT(*) AS EXEC_COUNT
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE INFO LIKE '%LIMIT%'
GROUP BY LEFT(INFO, 100)
ORDER BY EXEC_COUNT DESC
LIMIT 10;
3.2 连接泄漏(应用程序未关闭连接)
应用程序在使用完连接后未正确关闭,导致连接被浪费。如果应用程序每处理一个请求就泄漏一个连接,很快就会耗尽连接池。
诊断连接泄漏的脚本:
#!/bin/bash
# detect_conn_leak.sh
# 用途:检测是否存在连接泄漏(同一 IP/用户持续增长连接数)
SOCKET="/var/lib/mysql/mysql.sock"
mysql -u root -S "$SOCKET" -N -e "
SELECT
USER,
HOST,
COUNT(*) AS CONN_COUNT,
MAX(TIME) AS MAX_TIME,
MIN(TIME) AS MIN_TIME
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
GROUP BY USER, HOST
HAVING COUNT(*) > 10
ORDER BY CONN_COUNT DESC;
" 2>/dev/null
如果某个应用的连接数持续增长且 MAX_TIME 很大,说明该应用可能存在连接泄漏。
3.3 短连接频繁创建
某些语言(PHP、Python)的应用程序默认使用短连接模式,每次请求都创建新连接,用完即关闭。如果请求并发量突然上升,连接创建的速度可能超过 MySQL 的处理能力,导致连接队列积压。
判断是否为短连接问题:
-- 查看连接建立的峰值时间
SELECT
DATE_FORMAT(CREATED, '%Y-%m-%d %H:%i') AS MINUTE,
COUNT(*) AS NEW_CONNECTIONS
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
GROUP BY DATE_FORMAT(CREATED, '%Y-%m-%d %H:%i')
ORDER BY NEW_CONNECTIONS DESC
LIMIT 10;
# 查看 MySQL 的连接建立速率
mysql -u root -S "$SOCKET" -e "SHOW STATUS LIKE 'Connections';" 2>/dev/null
mysql -u root -S "$SOCKET" -e "SHOW STATUS LIKE 'Aborted_connects';" 2>/dev/null
如果 Aborted_connects 持续上升,说明大量连接建立失败,可能就是短连接被打满。
应对短连接问题的配置:
# my.cnf
[mysqld]
# 允许更多连接
max_connections = 1000
# 减少睡眠连接占用时间
wait_timeout = 600
interactive_timeout = 600
# 减少连接握手失败重试
connect_timeout = 10
# 启用连接池(MySQL 8.0 企业版)或使用 ProxySQL
3.4 max_connections 设置过小
MySQL 8.0 默认 max_connections = 151,这个值对于中等流量的生产环境远远不够。但盲目增大 max_connections 也会带来问题:每个连接都会占用 MySQL 线程栈内存(默认 thread_stack = 256KB),151个连接就占用约38MB,对于 8核16GB 的 MySQL 主机问题不大,但 10000 个连接就会占用约2.5GB 内存。
-- 查看当前 max_connections 和实际使用
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Max_used_connections';
Max_used_connections 是 MySQL 启动以来同时使用的最大连接数。如果这个值接近 max_connections,说明连接数上限确实不够用。
合理计算 max_connections:
#!/bin/bash
# calc_max_connections.sh
# 根据服务器资源计算合理的 max_connections
TOTAL_MEM=$(free -m | awk '/^Mem:/{print $2}')
THREAD_STACK=$(mysql -u root -S /var/lib/mysql/mysql.sock -N -e "SHOW VARIABLES LIKE 'thread_stack';" 2>/dev/null | awk '{print $2}' | sed 's/k$//')
# 每个连接占用内存估算(thread_stack + 连接缓冲区)
CONN_MEM_KB=$((THREAD_STACK + 2048)) # 约2MB per connection
# MySQL 全局缓冲区占用(估算)
MYSQL_BUF_KB=4096000 # 约4GB
# 可用于连接的内存
AVAILABLE_MEM_KB=$((TOTAL_MEM * 1024 - MYSQL_BUF_KB))
# 计算最大连接数
MAX_CONN=$((AVAILABLE_MEM_KB / CONN_MEM_KB / 2)) # 留一半余量
echo"服务器总内存: ${TOTAL_MEM}MB"
echo"MySQL 全局缓冲: ${MYSQL_BUF_KB}KB"
echo"单连接内存占用: ${CONN_MEM_KB}KB"
echo"建议 max_connections: ${MAX_CONN}"
4. 监控与预警配置
4.1 Prometheus + MySQL Exporter 监控体系
MySQL 8.0 推荐使用 mysqld_exporter 或 mysqld_exporter 的继任者 Prometheus MySQL Exporter(Percona 出品)来采集指标。
安装 MySQL Exporter(基于 Prometheus 官方 2026 最新版):
#!/bin/bash
# install_mysql_exporter.sh
EXPORTER_VERSION="0.16.0"
DOWNLOAD_URL="https://github.com/prometheus/mysqld_exporter/releases/download/v${EXPORTER_VERSION}/mysqld_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz"
# 下载并安装
cd /tmp
curl -LO "$DOWNLOAD_URL"
tar xzf mysqld_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv mysqld_exporter-${EXPORTER_VERSION}.linux-amd64/mysqld_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/mysqld_exporter
# 创建专用监控用户(MySQL 8.0)
mysql -u root -e "
CREATE USER IF NOT EXISTS 'exporter'@'localhost' IDENTIFIED BY 'StrongExporterPass2026!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
"
# 配置 exporter 连接参数
sudo tee /etc/mysql_exporter.cnf > /dev/null <<'EOF'
[client]
user=exporter
password=StrongExporterPass2026!
host=localhost
port=3306
socket=/var/lib/mysql/mysql.sock
EOF
sudo chmod 644 /etc/mysql_exporter.cnf
# 创建 systemd 服务
sudo tee /etc/systemd/system/mysql-exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus MySQL Exporter
After=network.target mysql.service
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/mysqld_exporter --config.my-cnf=/etc/mysql_exporter.cnf --collect.info_schema.processlist --collect.info_schema.innodb_tablespaces --collect.info_schema.innodb_metrics
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable mysql-exporter
sudo systemctl start mysql-exporter
4.2 AlertManager 告警规则
以下 Prometheus AlertManager 规则专门针对 MySQL 连接数告警:
# mysql_connection_alerts.yml
# 放置位置:/etc/prometheus/rules/mysql_connection_alerts.yml
groups:
-name:MySQL连接数告警规则
rules:
# 连接数警告
-alert:MySQLConnectionsHigh
expr:mysql_global_status_threads_connected/mysql_global_variables_max_connections>0.8
for:5m
labels:
severity:warning
annotations:
summary:"MySQL 连接数过高"
description:"当前连接数 {{ $value | humanizePercentage }},接近上限。连接数: {{ $value }}"
# 连接数紧急
-alert:MySQLConnectionsCritical
expr:mysql_global_status_threads_connected/mysql_global_variables_max_connections>0.95
for:1m
labels:
severity:critical
annotations:
summary:"MySQL 连接数即将打满"
description:"当前连接数 {{ $value | humanizePercentage }},立即处理!"
# 拒绝连接
-alert:MySQLConnectionRefused
expr:rate(mysql_global_status_connection_errors_total[5m])>10
for:5m
labels:
severity:critical
annotations:
summary:"MySQL 连接拒绝率上升"
description:"MySQL 连接错误率 {{ $value }}/s,请立即检查。"
# 慢查询告警
-alert:MySQLSlowQueries
expr:rate(mysql_global_status_slow_queries[5m])>5
for:5m
labels:
severity:warning
annotations:
summary:"MySQL 慢查询增加"
description:"慢查询速率 {{ $value }}/s,请检查慢查询日志。"
# 运行中连接数过高
-alert:MySQLThreadsRunningHigh
expr:mysql_global_status_threads_running>20
for:5m
labels:
severity:warning
annotations:
summary:"MySQL 活跃连接数过高"
description:"{{ $value }} 个连接正在执行查询,可能存在慢查询。"
4.3 Grafana 连接数监控面板
以下 JSON 可导入 Grafana 创建 MySQL 连接数专用面板:
{
"dashboard": {
"title": "MySQL 连接数监控面板",
"panels": [
{
"title": "当前连接数 / 最大连接数",
"type": "gauge",
"targets": [
{
"expr": "mysql_global_status_threads_connected",
"legendFormat": "当前连接"
},
{
"expr": "mysql_global_variables_max_connections",
"legendFormat": "最大连接"
}
]
},
{
"title": "连接数趋势",
"type": "graph",
"targets": [
{
"expr": "mysql_global_status_threads_connected",
"legendFormat": "已连接"
},
{
"expr": "mysql_global_status_threads_running",
"legendFormat": "运行中"
},
{
"expr": "mysql_global_status_threads_cached",
"legendFormat": "缓存"
}
]
},
{
"title": "连接建立速率",
"type": "graph",
"targets": [
{
"expr": "rate(mysql_global_status_connections[5m])",
"legendFormat": "连接速率"
}
]
}
]
}
}
4.4 企业微信/Slack 告警脚本
#!/bin/bash
# mysql_alert_notify.sh
# 用途:发送 MySQL 连接数告警到企业微信
WEBHOOK_URL="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY_HERE"
ALERT_NAME="$1"
ALERT_MSG="$2"
SEVERITY="$3"
case"$SEVERITY"in
critical) COLOR="FF0000" ;;
warning) COLOR="FFA500" ;;
*) COLOR="00FF00" ;;
esac
curl -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "{
\"msgtype\": \"markdown\",
\"markdown\": {
\"content\": \"**MySQL 告警 [${SEVERITY}]**\n**告警名称**: ${ALERT_NAME}\n**详情**: ${ALERT_MSG}\n**时间**: $(date '+%Y-%m-%d %H:%M:%S')\"
}
}" 2>/dev/null
5. 长期解决方案
5.1 连接池配置优化
连接泄漏和短连接问题的根本解决方案是在应用层配置连接池。推荐使用以下配置:
HikariCP(Java 应用):
// application.yml
spring:
datasource:
hikari:
maximum-pool-size: 20 # 最大连接数,建议 = (核心数 * 2) + 磁盘数
minimum-idle: 5 # 最小空闲连接
connection-timeout: 30000 # 获取连接超时30秒
idle-timeout: 600000 # 空闲超时10分钟
max-lifetime: 1800000 # 连接最大生命周期30分钟
connection-test-query: SELECT 1
PgBouncer(适用于 MySQL 的连接池):
; pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=3306 dbname=mydb
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50 # 每个用户的连接数
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 5
server_idle_timeout = 600
# 启动 PgBouncer
pgbouncer -d /etc/pgbouncer/pgbouncer.ini
# 应用连接 PgBouncer
mysql -h 127.0.0.1 -P 6432 -u app_user -p mydb
5.2 max_connections 合理设置
根据实际业务量和服务器资源设置合理的 max_connections。建议通过公式估算:
max_connections = (可用内存 - MySQL全局缓冲) / (单连接内存 + 连接缓冲区)
典型推荐值:
|
服务器规格 |
max_connections 推荐值 |
|---|---|
|
4核8GB |
300-500 |
|
8核16GB |
500-800 |
|
16核32GB |
800-1500 |
|
32核64GB |
1500-3000 |
-- 永久设置 max_connections(my.cnf)
SET PERSIST max_connections = 800;
-- 或手动编辑 /etc/mysql/my.cnf
[mysqld]
max_connections = 800
5.3 wait_timeout 和 interactive_timeout 设置
这两个参数控制空闲连接的超时时间。设置过短会导致连接频繁断开重建,增加 MySQL 负担;设置过长则会导致空闲连接占用连接数。
-- 查看当前设置
SHOW VARIABLES LIKE '%timeout';
-- 推荐配置(根据业务特点调整)
SET PERSIST wait_timeout = 600; -- 10分钟
SET PERSIST interactive_timeout = 600; -- 10分钟
# my.cnf
[mysqld]
wait_timeout = 600
interactive_timeout = 600
5.4 应用程序连接管理规范
应用代码中必须遵循以下连接管理规范:
正确写法(Python mysql-connector):
import mysql.connector
from mysql.connector import pooling
# 创建连接池
connection_pool = pooling.MySQLConnectionPool(
pool_name="app_pool",
pool_size=10,
pool_reset_session=True,
host="localhost",
database="mydb",
user="app_user",
password="app_password"
)
def query_data(sql, params=None):
conn = None
cursor = None
try:
conn = connection_pool.get_connection()
cursor = conn.cursor(dictionary=True)
cursor.execute(sql, params or ())
return cursor.fetchall()
except mysql.connector.Error as e:
print(f"数据库错误: {e}")
raise
finally:
if cursor:
cursor.close()
if conn:
conn.close() # 关键:必须归还连接到池
错误写法(导致连接泄漏):
# 错误:未使用 with 语句,且未在 finally 中关闭连接
def query_data_bad(sql):
conn = mysql.connector.connect(
host="localhost",
database="mydb",
user="app_user",
password="app_password"
)
cursor = conn.cursor()
cursor.execute(sql)
result = cursor.fetchall()
# 忘记关闭 cursor 和 conn
return result
6. 排障命令速查表
6.1 紧急诊断命令
# 一行命令查看 MySQL 连接数关键指标
mysql -u root -S /var/lib/mysql/mysql.sock -e "
SELECT
'当前连接数' AS METRIC,
@@Threads_connected AS VALUE
UNION ALL
SELECT '最大连接数', @@max_connections
UNION ALL
SELECT '运行中线程', @@Threads_running
UNION ALL
SELECT '缓存线程', @@Threads_cached;
"
6.2 进程级诊断
-- 查看所有连接详情
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 200) ASQUERY
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND != 'Daemon'
ORDERBYTIMEDESC;
-- 按用户分组统计连接数
SELECT
USER,
COUNT(*) ASCOUNT,
MAX(TIME) AS MAX_TIME,
SUM(TIME) AS TOTAL_TIME
FROM INFORMATION_SCHEMA.PROCESSLIST
GROUPBYUSER
ORDERBYCOUNTDESC;
-- 查看正在执行的查询及其耗时
SELECT
ID,
USER,
LEFT(INFO, 100) ASQUERY,
TIME,
STATE
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Query'
ORDERBYTIMEDESC
LIMIT20;
6.3 锁等待诊断
-- 查看当前锁等待
SELECT
r.trx_id,
r.trx_mysql_thread_id,
r.trx_query,
r.trx_state,
r.trx_started,
TIMESTAMPDIFF(SECOND, r.trx_started, NOW()) AS WAIT_SEC,
l.lock_index,
l.lock_table,
l.lock_type
FROM INFORMATION_SCHEMA.INNODB_TRX r
JOIN INFORMATION_SCHEMA.INNODB_LOCKS l ON r.trx_id = l.lock_trx_id;
-- 查看所有锁
SELECT
object_schema,
object_name,
index_name,
lock_type,
lock_mode,
lock_status,
lock_data
FROM INFORMATION_SCHEMA.INNODB_LOCKS;
6.4 性能诊断
-- 查看 Buffer Pool 命中率
SHOWSTATUSLIKE'Innodb_buffer_pool_read%';
-- 计算命中率
SELECT
(1 - (Variable_value / (Variable_value + Variable_value))) AS HIT_RATE
FROM (
SELECT
VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) requests,
(
SELECT
VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) reads;
-- 查看表锁争用
SHOWSTATUSLIKE'Table_locks%';
6.5 完整诊断脚本
#!/bin/bash
# mysql_diagnosis.sh
# MySQL 连接数爆满完整诊断脚本
SOCKET="/var/lib/mysql/mysql.sock"
OUTPUT="/tmp/mysql_diagnosis_$(date +%Y%m%d_%H%M%S).txt"
exec > >(tee "$OUTPUT") 2>&1
echo"========== MySQL 连接数爆满诊断报告 =========="
echo"生成时间: $(date)"
echo""
echo"【1. 连接数概览】"
mysql -u root -S "$SOCKET" -N -e "
SELECT 'Threads_connected:', Threads_connected FROM performance_schema.global_status WHERE variable_name='Threads_connected';
SELECT 'max_connections:', @@max_connections;
SELECT 'Threads_running:', Threads_running FROM performance_schema.global_status WHERE variable_name='Threads_running';
" 2>/dev/null
echo""
echo"【2. 连接数 TOP 10 用户】"
mysql -u root -S "$SOCKET" -N -e "
SELECT USER, COUNT(*) AS CNT FROM INFORMATION_SCHEMA.PROCESSLIST GROUP BY USER ORDER BY CNT DESC LIMIT 10;
" 2>/dev/null
echo""
echo"【3. 运行时间最长的查询 TOP 10】"
mysql -u root -S "$SOCKET" -N -e "
SELECT ID, USER, LEFT(INFO,100) AS QUERY, TIME, STATE FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND='Query' ORDER BY TIME DESC LIMIT 10;
" 2>/dev/null
echo""
echo"【4. 锁等待情况】"
mysql -u root -S "$SOCKET" -N -e "
SELECT COUNT(*) AS LOCK_WAIT_COUNT FROM INFORMATION_SCHEMA.INNODB_TRX WHERE trx_state='LOCK WAIT';
" 2>/dev/null
echo""
echo"【5. 慢查询统计(过去1小时)】"
mysql -u root -S "$SOCKET" -N -e "
SELECT COUNT(*) AS SLOW_QUERY_COUNT FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(), INTERVAL 1 HOUR);
" 2>/dev/null
echo""
echo"【6. 建议】"
CONN_RATIO=$(mysql -u root -S "$SOCKET" -N -e "SELECT (Threads_connected / @@max_connections) FROM performance_schema.global_status WHERE variable_name='Threads_connected';" 2>/dev/null)
if (( $(echo"$CONN_RATIO > 0.8" | bc -l) )); then
echo"- 当前连接数超过上限的 80%,建议立即清理或提高 max_connections"
fi
echo""
echo"报告已保存至: $OUTPUT"
7. 实战场景与案例分析
7.1 场景一:促销期间突发流量导致连接数打满
某电商平台在双十一大促期间,瞬时流量是平时的50倍,MySQL 连接数在几分钟内从200飙升到151上限。
故障现象:
# 监控告警
[CRITICAL] MySQL_Too_Many_Connections - 连接数 151/151,使用率 100%
# 应用日志
SQLSTATE[HY000] [2002] Can't connect to MySQL server on 'db-master'
ERROR 1040 (HY000): Too many connections
应急处理步骤:
# 1. 立即确认 MySQL 存活
ps aux | grep mysqld | grep -v grep
systemctl status mysql
# 2. 临时提高连接数上限(边止血边分析)
mysql -u root -S /var/lib/mysql/mysql.sock -e "SET GLOBAL max_connections = 500;"
# 3. 快速清理睡眠连接
mysql -u root -S /var/lib/mysql/mysql.sock -e "
SELECT CONCAT('KILL ', ID, ';')
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep' AND TIME > 10;
" | mysql -u root -S /var/lib/mysql/mysql.sock
# 4. 分析根因:确认是突发流量还是慢查询
mysql -u root -S /var/lib/mysql/mysql.sock -e "
SELECT COUNT(*) AS SLEEP_CONNECTIONS
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep';
"
根因分析:
-- 促销期间用户大量访问商品详情页
-- 应用层未启用连接池,每个请求创建一个新连接
-- 连接用完后未及时关闭,堆积大量 Sleep 连接
-- 查看连接来源分布
SELECT
USER,
HOST,
COUNT(*) AS CNT,
MAX(TIME) AS MAX_TIME
FROM INFORMATION_SCHEMA.PROCESSLIST
GROUPBYUSER, HOST
ORDERBY CNT DESC
LIMIT10;
解决方案:
# 应用配置:启用连接池
# application.yml
spring:
datasource:
hikari:
maximum-pool-size: 50 # 从默认10提高到50
minimum-idle: 10
connection-timeout: 5000 # 5秒超时
idle-timeout: 300000 # 5分钟空闲
7.2 场景二:慢查询导致连接堆积
某 SaaS 平台财务模块某次导出操作,触发了一条跨表联合查询,因缺少索引导致全表扫描,单次执行耗时超过3分钟。
故障现象:
# 监控告警
[WARNING] MySQL_Threads_running_High - 运行中连接数 45/151
# 应用日志
部分用户反馈导出操作超时
后台任务队列积压
诊断过程:
-- 查看执行时间最长的查询
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
LEFT(INFO, 200) ASQUERY
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND != 'Sleep'
ORDERBYTIMEDESC
LIMIT10;
-- 输出示例:
-- | ID | USER | HOST | DB | COMMAND | TIME | STATE | QUERY |
-- | 1234 | app | 10.0.0.100 | finance | Query | 185 | Copying to tmp table | SELECT ... FROM orders o JOIN products p ... |
定位慢查询:
-- performance_schema 分析
SELECT
DIGEST,
COUNT_STAR,
SUM_TIMER_WAIT / 1000000000000AS SUM_SECONDS,
AVG_TIMER_WAIT / 1000000000000AS AVG_SECONDS,
LEFT(SQL_TEXT, 200) AS SQL_SAMPLE
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT LIKE'%orders%products%'
ORDERBY SUM_TIMER_WAIT DESC
LIMIT5;
发现根因:
-- orders 表缺少 product_id 索引
SHOW INDEX FROM orders;
-- EXPLAIN 分析
EXPLAIN SELECT o.*, p.name
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.user_id = 12345;
-- 输出显示:type=ALL(全表扫描),rows=5000000
解决方案:
-- 添加索引
ALTER TABLE orders ADD INDEX idx_product_id (product_id);
-- 验证索引
SHOW INDEX FROM orders;
EXPLAIN SELECT o.*, p.name
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.user_id = 12345;
-- 现在 type=ref, rows=10
7.3 场景三:连接泄漏导致连接数持续增长
某 Java 应用部署在 Tomcat 容器中,偶发数据库连接泄漏,每次泄漏1-2个连接。
故障现象:
# 监控趋势
06:00 连接数: 50
08:00 连接数: 75
10:00 连接数: 100
12:00 连接数: 130
14:00 连接数: 151 -- 连接数打满
诊断过程:
#!/bin/bash
# detect_conn_leak.sh
mysql -u root -S /var/lib/mysql/mysql.sock -e "
SELECT
USER,
HOST,
COUNT(*) AS CONN_COUNT,
MAX(TIME) AS MAX_TIME,
MIN(TIME) AS MIN_TIME
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
GROUP BY USER, HOST
ORDER BY CONN_COUNT DESC;
"
# 输出示例:
# +------+--------------+------------+-----------+-----------+
# | USER | HOST | CONN_COUNT| MAX_TIME | MIN_TIME |
# +------+--------------+------------+-----------+-----------+
# | app | 10.0.0.50 | 45 | 3600 | 120 |
# | app | 10.0.0.51 | 40 | 3580 | 150 |
# +------+--------------+------------+-----------+-----------+
# app 用户连接数持续增长,且 TIME 很大,说明连接未正确关闭
根因定位:
应用代码中存在数据库连接未正确关闭的逻辑:
// 问题代码(会导致连接泄漏)
public void queryData() {
Connection conn = null;
try {
conn = dataSource.getConnection();
// 执行查询
return query(conn);
} catch (SQLException e) {
// 只记录异常,未关闭连接
log.error("Query failed", e);
}
// 方法结束时 conn 未关闭
}
解决方案:
// 正确代码:使用 try-with-resources
public void queryData() {
try (Connection conn = dataSource.getConnection()) {
return query(conn);
} catch (SQLException e) {
log.error("Query failed", e);
}
}
// Connection 会在方法结束时自动关闭
临时止血脚本:
#!/bin/bash
# scheduled_kill_leak.sh
# 定时清理泄漏的连接(配合修复代码使用)
SOCKET="/var/lib/mysql/mysql.sock"
THRESHOLD=3600 # 连接时间超过1小时的视为泄漏
while true; do
# 杀掉超时连接
mysql -u root -S "$SOCKET" -N -e "
SELECT CONCAT('KILL ', ID, ';')
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE COMMAND = 'Sleep'
AND TIME > $THRESHOLD;
" | mysql -u root -S "$SOCKET" 2>/dev/null
sleep 300 # 每5分钟执行一次
done
7.4 场景四:max_connections 设置过小
某数据仓库初期配置 max_connections = 151,随着业务增长,连接数经常接近上限。
评估过程:
#!/bin/bash
# evaluate_max_connections.sh
# 查看历史最大连接数
mysql -u root -e "SHOW STATUS LIKE 'Max_used_connections';"
# 查看服务器资源
TOTAL_MEM=$(free -m | awk '/^Mem:/{print $2}')
CPU_CORES=$(nproc)
echo"服务器总内存: ${TOTAL_MEM}MB"
echo"CPU核心数: ${CPU_CORES}"
# 计算合理上限
# MySQL 8.0,每个连接约占用 256KB stack + 4MB buffers
# 保守估算:单连接 ~4MB
# 可用于连接的内存:总内存 * 50%
AVAILABLE_MEM_MB=$((TOTAL_MEM / 2))
RECOMMENDED_MAX_CONN=$((AVAILABLE_MEM_MB / 4))
echo"建议 max_connections: ${RECOMMENDED_MAX_CONN}"
调整方案:
-- 临时调整
SET GLOBAL max_connections = 1000;
-- 永久调整(写入配置文件)
-- /etc/mysql/my.cnf
[mysqld]
max_connections = 1000
7.5 场景五:ProxySQL 连接池故障
使用 ProxySQL 作为中间层时,ProxySQL 本身成为连接瓶颈。
诊断:
# 登录 ProxySQL 管理界面
mysql -h 127.0.0.1 -P 6032 -u admin -pAdminPass
# 查看连接池状态
Admin> SELECT * FROM stats_mysql_connection_pool;
# 查看各后端 MySQL 的健康状态
Admin> SELECT * FROM mysql_servers;
# 查看连接错误
Admin> SELECT * FROM stats_mysql_errors;
常见 ProxySQL 问题:
-- 问题1: 后端 MySQL 连接数过多
-- 解决:调整 ProxySQL 的 pool_size
UPDATE mysql_servers SET max_connections=100 WHERE hostgroup_id=10;
LOAD MYSQL SERVERS TO RUNTIME;
-- 问题2: ProxySQL max_connections 过小
-- 解决:调整 ProxySQL 全局限制
SET VARIABLES max_connections=2000;
8. 总结
MySQL 连接数爆满是“症状”而非“病根”。运维工程师在处理此类故障时,必须按照以下优先级行动:
第一优先级(止血): 确认 MySQL 进程存活,通过 SHOW PROCESSLIST 定位占用连接的来源,杀掉长时间睡眠的连接和执行慢的查询。
第二优先级(定因): 分析是慢查询导致、连接泄漏、短连接冲击还是 max_connections 配置过小。不同根因对应不同处置方法。
第三优先级(治本): 从根源解决问题——优化慢查询、修复连接泄漏代码、引入连接池或调整参数配置。
第四优先级(预防): 配置 Prometheus + MySQL Exporter 监控体系,设置合理的告警阈值,将连接数监控纳入日常巡检。
在实际生产环境中,建议运维团队提前准备好上述脚本,放在 /opt/scripts/mysql/ 目录下,并将相关命令整理成标准操作手册(Standard Operation Procedure)。当故障真正发生时,团队成员可以按照手册步骤执行,避免在高压环境下出现操作失误。
本文基于 MySQL 8.0.39、Prometheus 3.0、Grafana 11.0 环境编写,测试于 CentOS Stream 9 和 Ubuntu 24.04 LTS。
3139

被折叠的 条评论
为什么被折叠?



