开头还是介绍一下群,如果感兴趣PolarDB ,MongoDB ,MySQL ,PostgreSQL ,Redis, OceanBase, Sql Server等有问题,有需求都可以加群群内有各大数据库行业大咖,可以解决你的问题。加群请联系 liuaustin3 ,(共2800人左右 1 + 2 + 3 + 4 +5 + 6 + 7 + 8 +9)(1 2 3 4 5 6 7群均已爆满,开8群260+ 9群)
在使用PolarDB for MySQL的过程中,我们遇到一个问题,PolarDB 8.02的小版本8.02.2.24 在添加索引的时候,会有部分情况无法添加索引,添加索引失败。具体表现是如下图:
在添加索引后会停滞,如果你等不及,直接KILL 添加索引的会话(实际上不KILL长时间也会索引添加失败),则BUG触发,稍后会返回如下报错信息,然后在怎么添加索引都不在报错而是直接无响应。
Fail to get table lock on replica; You can set polar_support_mdl_sync_preemption = ON;
但即使你打开了 polar_support_mdl_sync_preemption = ON; 也无济于事,还会在添加索引的时候报这个错误,同时也不是每次都会报错,而是在有大事务情况下,在读节点上很长时间无法运行完毕的情况下,会出现这个报错,而且只要出现这个报错,就会导致后续在添加索引的失败。解决方案为重启从节点。
这里提示:解决问题的关键是升级数据库版本,从8.02.2.24升级到 8.0.2.2.27,实际上8.0.2.2.24的BUG不少,这里列举一下。
1 优化文件系统在DDL过程中频繁创建和删除表文件时的性能稳定性。 2 新增高优先级DDL能力,保证DDL执行成功。
这两个部分是在8.02.2.25中修复的。
故障的原理是
1 在添加索引的情况下,基于POLARDB的原理,shared storage的架构,所以主从是一个存储,在这样的情况下,如果在主库将添加索引的SESSION KILL,会卡主,因为主库和从库之间有线程,从库不返回,主库就一直等导致的。
2 解决方案只能强行重启从节点,给主库一个信号,然后如果此时在建立索引则会建立索引立即成功。
在排查POLARDB FOR MySQL的数据库时,可以利用如下的命令
/ force_node='p-2zexq8l8pzx8m6i' / select * from information_schema.processlist where db = 'scm_suppl'; / force_node='p-2zec58b6uxm16a1' / select * from information_schema.processlist where db = 'scm_suppl'; / force_node='p-2zeij811u776s5m' / select * from information_schema.processlist where db = 'scm_supply';
在具体查询PolarDB的各个节点的信息,需要在语句前加入force_node 来具体查询各个节点的具体的状态信息。
查看DDL 工作的进度
SELECT THREAD_ID,EVENT_ID, EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED,(WORK_COMPLETED/WORK_ESTIMATED)*100 as PROGRESS FROM performance_schema.events_stages_current;
如一些常见的问题
解决方案 1 及时进行从节点的事务的COMMIT ,如何发现没有及时的COMMIT 在从节点执行,并发现和解决
1 在主节点运行 /
force_node='主节点名' / show processlist; 如果在其中发现了如图中的位置写着,wait for syncing with replicas。 那么就证明的DDL 在被从库卡着
2 此时你需要去各个从库来对
select * from information_schema.innodb_log_mdl_slot where slot_state = "SLOT_ACQUIRING"
来去找到从库那个语句在让主库的DDL等待的原因。
你可以开启 polar_slave_work_on_nonblock_mdl_mode = on
他的主要目的是为了解决从库中的长事务的问题,而不是长SQL的问题,长SQL这个解决不了,你还的自己在从库发现长时间运行的SQL并且杀死他,而不是依靠polar_slave_work_on_nonblock_mdl_mode 来解决问题。
同时在进行DDL操作可以通过POLARDB 中的如下语句来查看对应的操作的进度。
SELECT THREAD_ID,EVENT_ID, EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED,(WORK_COMPLETED/WORK_ESTIMATED)*100 as PROGRESS FROM performance_schema.events_stages_current; +-----------+----------+------------------------------------------------------+----------------+----------------+----------+ | THREAD_ID | EVENT_ID | EVENT_NAME | WORK_COMPLETED | WORK_ESTIMATED | PROGRESS | +-----------+----------+------------------------------------------------------+----------------+----------------+----------+ | 3057989 | 13 | stage/innodb/alter table ( read PK and internal sort) | 56634 | 330135 | 17.1548 | +-----------+----------+------------------------------------------------------+----------------+----------------+----------+ 1 row in set (0.00 sec)
然后通过如下的语句,将3057989放入可以查询到对应的语句
SELECT esc.THREAD_ID, esc.EVENT_NAME, esc.WORK_COMPLETED, esc.WORK_ESTIMATED, pl.INFO FROM performance_schema.events_stages_current esc LEFT JOIN performance_schema.threads th ON esc.thread_id = th.thread_id LEFT JOIN information_schema.PROCESSLIST pl ON th.PROCESSLIST_ID = pl.ID; +-----------+------------------------------------------------------+----------------+----------------+-----------------------------------------------------------------------------------------+ | THREAD_ID | EVENT_NAME | WORK_COMPLETED | WORK_ESTIMATED | INFO | +-----------+------------------------------------------------------+----------------+----------------+-----------------------------------------------------------------------------------------+ | 3057989 | stage/innodb/alter table ( read PK and internal sort) | 77034 | 330519 | ALTER TABLE test.test ALGORITHM=INPLACE, ADD testA VARCHAR(20) NOT NULL DEFAULT 'testA' | +-----------+------------------------------------------------------+----------------+----------------+-----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
还可以通过获得 medtadata_locks 信息来查看具体performance_schema.threads 中持有的MDL线程
SELECT * FROM performance_schema.metadata_locks; +-------------+--------------------+------------------+-------------+-----------------------+---------------------+---------------+-------------+------------------------+-----------------+----------------+ | OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE | OWNER_THREAD_ID | OWNER_EVENT_ID | +-------------+--------------------+------------------+-------------+-----------------------+---------------------+---------------+-------------+------------------------+-----------------+----------------+ | GLOBAL | NULL | NULL | NULL | 139949462878336 | INTENTION_EXCLUSIVE | STATEMENT | GRANTED | sql_base.cc:3103 | 3055785 | 1 | | TABLE | test | test | NULL | 139931318980224 | SHARED_WRITE | TRANSACTION | GRANTED | sql_parse.cc:6479 | 3055785 | 1 | | COMMIT | NULL | NULL | NULL | 139931318980480 | INTENTION_EXCLUSIVE | EXPLICIT | GRANTED | handler.cc:1669 | 3055785 | 1 | | TABLE | performance_schema | metadata_locks | NULL | 139934227366144 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:6479 | 3057612 | 1 | | GLOBAL | NULL | NULL | NULL | 139934216849664 | INTENTION_EXCLUSIVE | STATEMENT | GRANTED | sql_base.cc:5519 | 3057989 | 13 | | SCHEMA | test | NULL | NULL | 139934216849408 | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | sql_base.cc:5506 | 3057989 | 13 | | TABLE | test | test | NULL | 139934216848640 | SHARED_UPGRADABLE | TRANSACTION | GRANTED | sql_parse.cc:6479 | 3057989 | 13 | | BACKUP LOCK | NULL | NULL | NULL | 139934216849280 | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | sql_base.cc:5526 | 3057989 | 13 | | TABLESPACE | NULL | test / test | NULL | 139934216848384 | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | lock.cc:815 | 3057989 | 13 | | TABLE | test | #sql-17d9_2ea89a | NULL | 139934216848896 | EXCLUSIVE | STATEMENT | GRANTED | sql_table.cc:15054 | 3057989 | 13 | | GLOBAL | NULL | NULL | NULL | 139934216850176 | INTENTION_EXCLUSIVE | TRANSACTION | GRANTED | dictionary_impl.cc:416 | 3057989 | 13 | | TABLESPACE | NULL | test / test | NULL | 139934216849920 | EXCLUSIVE | TRANSACTION | GRANTED | dictionary_impl.cc:397 | 3057989 | 13 | +-------------+--------------------+------------------+-------------+-----------------------+---------------------+---------------+-------------+------------------------+-----------------+----------------+ 12 rows in set (0.00 sec) SELECT * FROM performance_schema.threads WHERE THREAD_ID = "3057612" in performance_schema.metadata_locks table ";
如果系统中存在Wait for syncing with replicas,PolarDB采用集群架构,因此在主节点上执行DDL操作时,需要等待所有只读节点释放对应的MDL锁。若您在使用show processlist时,观察到DDL操作处于Wait for syncing with replicas的状态,则说明在只读节点上持有了对应表的MDL锁。
在主库进行查找发现 Wait for syncing with replicas
/*force_node= 'pi-bp10k7631d6k3****' */ SHOW PROCESSLIST;
+-----------+-----------------+-----------------------+--------------------+----------------+---------+--------------------------------+-------------------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +-----------+-----------------+-----------------------+--------------------+----------------+---------+--------------------------------+-------------------------------------------------------------+ | 98 | event_scheduler | localhost | NULL | Daemon | 1307512 | Waiting on empty queue | NULL | | 109 | replicator | 11.111.XX.XX:62549 | NULL | Polar Log Dump | 1 | Reading log from innodb | NULL | | 113 | replicator | 11.111.XX.XX:62560 | NULL | Polar Log Ack | 1 | Receiving from client | NULL | | 133 | replicator | 10.13.64.70:42712 | NULL | Polar Log Dump | 1 | Reading log from innodb | NULL | | 138 | replicator | 10.13.64.70:42723 | NULL | Polar Log Ack | 0 | Receiving from client | NULL | 3064011 | root | 127.0.0.1:59703 | NULL | Sleep | 742 | | NULL | | 3064013 | root | 127.0.0.1:59710 | NULL | Sleep | 21 | | NULL | | 3064015 | root | 127.0.0.1:59713 | NULL | Sleep | 21 | | NULL | | 3064018 | root | 127.0.0.1:59716 | NULL | Sleep | 1 | | NULL | | 3067041 | zyg_root | 172.17.28.253:48594 | test | Query | 6 | Wait for syncing with replicas | alter table t1 add column d varchar(10),algorithm = inplace | +-----------+-----------------+-----------------------+--------------------+----------------+---------+--------------------------------+-------------------------------------------------------------+ 21 rows in set (0.00 sec)
然后转到从节点,通过metadata_locks 语句找到对应的进程在通过 performance_schema.threads表来进一步查找对应的线程信息。最后决定是如何将这些阻塞DDL的操作仅先处理。
/*force_node= 'pi-bp186ko4o21wl****' */ SELECT * FROM performance_schema.metadata_locks; +-------------+--------------------+----------------+-------------+-----------------------+---------------------+---------------+-------------+--------------------+-----------------+----------------+ | OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | COLUMN_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_DURATION | LOCK_STATUS | SOURCE | OWNER_THREAD_ID | OWNER_EVENT_ID | +-------------+--------------------+----------------+-------------+-----------------------+---------------------+---------------+-------------+--------------------+-----------------+----------------+ | TABLE | test | t1 | NULL | 139394298895872 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3513381 | 1 | | TABLE | test | t1 | NULL | 139394298602240 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3519277 | 1 | | TABLE | test | t1 | NULL | 139917548369664 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3519279 | 1 | | TABLE | test | t1 | NULL | 139394296661888 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3519278 | 1 | | TABLE | test | t1 | NULL | 139394297595520 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3519276 | 1 | | SCHEMA | test | NULL | NULL | 139464322084864 | INTENTION_EXCLUSIVE | EXPLICIT | GRANTED | sql_table.cc:17404 | 57 | 1 | | TABLE | test | t1 | NULL | 139464322084992 | EXCLUSIVE | EXPLICIT | PENDING | sql_table.cc:17410 | 57 | 1 | | TABLE | performance_schema | metadata_locks | NULL | 139394296038784 | SHARED_READ | TRANSACTION | GRANTED | sql_parse.cc:7688 | 3518506 | 1 |
+-------------+--------------------+----------------+-------------+-----------------------+---------------------+---------------+-------------+--------------------+-----------------+----------------+ 8 rows in set (0.00 sec) /*force_node= 'pi-bp186ko4o21wl****' */SELECT * FROM performance_schema.threads WHERE THREAD_ID IN (3519278,3513381,3519279,3519276,3519277)\G *************************** 1. row *************************** THREAD_ID: 3513381 NAME: thread/sql/one_connection TYPE: FOREGROUND PROCESSLIST_ID: 538961413 PROCESSLIST_USER: zyg_root PROCESSLIST_HOST: 172.17.28.253 PROCESSLIST_DB: test PROCESSLIST_COMMAND: Connect PROCESSLIST_TIME: 103 PROCESSLIST_STATE: User sleep PROCESSLIST_INFO: select *,sleep(60) from t1 PARENT_THREAD_ID: NULL ROLE: NULL INSTRUMENTED: YES HISTORY: YES CONNECTION_TYPE: TCP/IP THREAD_OS_ID: 63826 RESOURCE_GROUP: NULL
这里我简短的画一个处理问题的流程图