Recovering Failed Greenplum Nodes: When It Rains, It Pours?

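> One of our Greenplum clusters ran into a minor problem: the primary segments on one segment host failed and their mirrors were automatically promoted to primary, while another host reported that the database PID did not exist even though everything on it was working normally. This post records how we repaired it. `(Some content has been desensitized, so parts of the output are incomplete.)`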

# Symptoms

During routine inspections, `gpstate` is the command we use most often: it displays information about a running Greenplum Database instance.

```linux
$ gpstate -m
20250320:07:00:02:024403 gpstate:[INFO]:-Starting gpstate with args: -m
20250320:07:00:02:024403 gpstate:[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250320:07:00:02:024403 gpstate:[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-b3:23:56'
20250320:07:00:02:024403 gpstate:[INFO]:-Obtaining Segment details from master...
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[INFO]:--Current GPDB mirror list and status
20250320:07:00:02:024403 gpstate:[INFO]:--Type = Group
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[INFO]:-   Mirror   Datadir             Port    Status              Data Status
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-02   /data1/m1/gpseg0    43000   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-02   /data1/m2/gpseg1    43001   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-03   /data1/m1/gpseg2    43000   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-03   /data1/m2/gpseg3    43001   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-04   /data1/m1/gpseg4    43000   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-04   /data1/m2/gpseg5    43001   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-05   /data1/m1/gpseg6    43000   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-05   /data1/m2/gpseg7    43001   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-06   /data1/m1/gpseg8    43000   Acting as Primary   Not In Sync
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-06   /data1/m2/gpseg9    43001   Acting as Primary   Not In Sync
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-01   /data1/m1/gpseg10   43000   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:-   mpp-01   /data1/m2/gpseg11   43001   Passive             Synchronized
20250320:07:00:02:024403 gpstate:[INFO]:--------------------------------------------------------------
20250320:07:00:02:024403 gpstate:[WARNING]:-2 segment(s) configured as mirror(s) are acting as primaries
20250320:07:00:02:024403 gpstate:[WARNING]:-2 mirror segment(s) acting as primaries are not synchronized

$ gpstate -e
20250320:07:00:02:024653 gpstate:[INFO]:-Starting gpstate with args: -e
20250320:07:00:02:024653 gpstate:[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250320:07:00:02:024653 gpstate:[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-b3:23:56'
20250320:07:00:02:024653 gpstate:[INFO]:-Obtaining Segment details from master...
20250320:07:00:02:024653 gpstate:[INFO]:-Gathering data from segments...
20250320:07:00:06:024653 gpstate:[WARNING]:-pg_stat_replication shows no standby connections
20250320:07:00:06:024653 gpstate:[WARNING]:-pg_stat_replication shows no standby connections
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Segment Mirroring Status Report
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Segments with Primary and Mirror Roles Switched
20250320:07:00:06:024653 gpstate:[INFO]:-   Current Primary   Port    Mirror            Port
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-06            43000   znhcy-edcmpp-05   42000
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-06            43001   znhcy-edcmpp-05   42001
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Unsynchronized Segment Pairs
20250320:07:00:06:024653 gpstate:[INFO]:-   Current Primary   Port    Mirror            Port
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-06            43000   znhcy-edcmpp-05   42000
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-06            43001   znhcy-edcmpp-05   42001
20250320:07:00:06:024653 gpstate:[INFO]:-----------------------------------------------------
20250320:07:00:06:024653 gpstate:[INFO]:-Downed Segments (may include segments where status could not be retrieved)
20250320:07:00:06:024653 gpstate:[INFO]:-   Segment   Port    Config status   Status
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-02    43000   Up              Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-02    43001   Up              Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-02    42000   Up              Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-02    42001   Up              Process error -- database process may be down
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-05    42000   Down            Down in configuration
20250320:07:00:06:024653 gpstate:[INFO]:-   mpp-05    42001   Down            Down in configuration
```

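From this output we can see:

- The database PIDs on host mpp-02 are reported as faulty, yet replication between its segments and their peers on other hosts is still in sync.

- The primaries gpseg8 and gpseg9 on host mpp-05 are marked down, and their mirrors on host mpp-06 have been promoted to primary.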
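For routine inspections, the same reading of the `gpstate -m` table can be automated. The following is a minimal sketch, not part of the original procedure: it assumes the row layout shown in the output above (host, data directory, port, then status text), and the function name `unhealthy_mirrors` is hypothetical.

```python
import re

# Matches one data row of the `gpstate -m` mirror table:
#   <log prefix>[INFO]:-   <host>   <datadir>   <port>   <status text>
# Header and separator lines do not match because the second column
# must start with "/" (a data directory path).
ROW = re.compile(r"\[INFO\]:-\s+(\S+)\s+(/\S+)\s+(\d+)\s+(.+?)\s*$")


def unhealthy_mirrors(gpstate_m_output):
    """Return (host, datadir, port, status) for every mirror row whose
    status is anything other than 'Passive Synchronized'."""
    problems = []
    for line in gpstate_m_output.splitlines():
        match = ROW.search(line)
        if not match:
            continue  # header, separator, warning, or unrelated line
        host, datadir, port, status = match.groups()
        status = " ".join(status.split())  # collapse column padding
        if status != "Passive Synchronized":
            problems.append((host, datadir, int(port), status))
    return problems
```

Fed the output above, this would report the two rows on mpp-06 whose status is `Acting as Primary Not In Sync`.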

# Repair

In Greenplum, the `gprecoverseg` utility recovers primary or mirror segment instances that have been marked as down. There is one prerequisite: `mirroring must be enabled on the cluster`.

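On mpp-02, `ps -ef|grep postgres` showed that the relevant processes did exist, yet gpstate still reported that the database PID did not exist. Since every gpseg on mpp-02 had a corresponding mirror that was in sync, we killed the postgres processes and rebooted mpp-02. Looking back, this step was probably unnecessary, because it triggered another problem later on.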
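Before killing anything, one low-risk cross-check is to compare the PID recorded in a segment's `postmaster.pid` (PostgreSQL writes the postmaster PID on the first line of that file) with the processes actually running. The sketch below is an assumption about how one might do this by hand, not a description of what gpstate does internally. The function name is hypothetical, and it would have to run on the segment host itself as a user who can read the data directory.

```python
import os


def postmaster_pid_alive(datadir):
    """Read the PID on the first line of <datadir>/postmaster.pid and
    report whether a process with that PID currently exists.

    Returns True or False, or None when the file is missing or does
    not start with a usable PID. POSIX only (relies on signal 0).
    """
    pid_file = os.path.join(datadir, "postmaster.pid")
    try:
        with open(pid_file) as handle:
            pid = int(handle.readline().strip())
    except (OSError, ValueError):
        return None
    if pid <= 0:
        return None
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False     # stale PID file: no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

For example, `postmaster_pid_alive("/data1/p1/gpseg2")` on mpp-02 would distinguish a stale or missing PID file from a postmaster that is genuinely gone.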

## Recovering the failed segments with gprecoverseg

After mpp-02 rebooted, we ran gprecoverseg to recover the gpseg instances on mpp-02 and mpp-05.

```linux
$ gprecoverseg
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Starting gprecoverseg with args:
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020 03:23:56'
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery type = Standard
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 1 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg0
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg0
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 2 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg1
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-01
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg1
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 3 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p1/gpseg2
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m1/gpseg2
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 4 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-02
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p2/gpseg3
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-03
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m2/gpseg3
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 5 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p1/gpseg8
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m1/gpseg8
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43000
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:-Recovery 6 of 6
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance host = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance address = mpp-05
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance directory = /data1/p2/gpseg9
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Failed instance port = 42001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-06
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/m2/gpseg9
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Source instance port = 43001
20250321:20:05:58:022067 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:05:58:022067 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:06:01:022067 gprecoverseg:-[INFO]:-6 segment(s) to recover
20250321:20:06:01:022067 gprecoverseg:-[INFO]:-Ensuring 6 failed segment(s) are stopped
20250321:20:06:05:022067 gprecoverseg:-[INFO]:-3033: /data1/p1/gpseg8
20250321:20:06:08:022067 gprecoverseg:-[INFO]:-3035: /data1/p2/gpseg9
20250321:20:06:23:022067 gprecoverseg:-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Updating configuration with new mirrors
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Updating mirrors
20250321:20:06:24:022067 gprecoverseg:-[INFO]:-Running pg_rewind on required mirrors
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-Starting mirrors
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-era is None
20250321:20:13:45:022067 gprecoverseg:-[INFO]:-Commencing parallel segment instance startup, please wait...
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Process results...
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Triggering FTS probe
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Updating segments for streaming is completed.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-For segments updated successfully, streaming will continue in the background.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-Use gpstate -s to check the streaming progress.
20250321:20:18:38:022067 gprecoverseg:-[INFO]:-******************************************************************

$ gprecoverseg -r
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Starting gprecoverseg with args: -r
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852'
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852) on x86_64-unknown-linux-gnu, compiled by piled on Jun 11 2020 03:23:56'
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Recovery type = Rebalance
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 1 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m1/gpseg2
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 2 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p1/gpseg2
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 3 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-03
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m2/gpseg3
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 4 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-02
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p2/gpseg3
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 5 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m1/gpseg8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 6 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p1/gpseg8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42000
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 7 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-06
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/m2/gpseg9
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 43001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:-Unbalanced segment 8 of 8
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance host = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance address = mpp-05
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance directory = /data1/p2/gpseg9
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Unbalanced instance port = 42001
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Balanced role = Primary
20250321:20:19:45:023457 gprecoverseg:-[INFO]:- Current role = Mirror
20250321:20:19:45:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:19:45:023457 gprecoverseg:-[WARNING]:-This operation will cancel queries that are currently executing.
20250321:20:19:45:023457 gprecoverseg:-[WARNING]:-Connections to the database however will not be interrupted.
20250321:20:19:47:023457 gprecoverseg:-[INFO]:-Getting unbalanced segments
20250321:20:19:47:023457 gprecoverseg:-[INFO]:-Stopping unbalanced primary segments...
20250321:20:20:48:023457 gprecoverseg:-[INFO]:-Triggering segment reconfiguration
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Starting segment synchronization
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-=============================START ANOTHER RECOVER=========================================
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852'
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build commit:7118e8aca825b743dd9477d19406fcc06fa53852) on x86_64-unknown-linux-gnu, compiled by piled on Jun 11 2020 03:23:56'
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Obtaining Segment details from master...
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Heap checksum setting is consistent between master and the segments that are candidates for recoverseg
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Greenplum instance recovery parameters
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery type = Standard
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 1 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg2
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg2
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 2 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-03
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg3
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-02
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg3
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 3 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m1/gpseg8
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p1/gpseg8
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42000
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Recovery 4 of 4
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Synchronization mode = Incremental
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance host = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance address = mpp-06
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance directory = /data1/m2/gpseg9
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Failed instance port = 43001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance host = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance address = mpp-05
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance directory = /data1/p2/gpseg9
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Source instance port = 42001
20250321:20:20:52:023457 gprecoverseg:-[INFO]:- Recovery Target = in-place
20250321:20:20:52:023457 gprecoverseg:-[INFO]:----------------------------------------------------------
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-4 segment(s) to recover
20250321:20:20:52:023457 gprecoverseg:-[INFO]:-Ensuring 4 failed segment(s) are stopped
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Ensuring that shared memory is cleaned up for stopped segments
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Updating configuration with new mirrors
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Updating mirrors
20250321:20:20:56:023457 gprecoverseg:-[INFO]:-Running pg_rewind on required mirrors
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-Starting mirrors
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-era is None
20250321:20:21:03:023457 gprecoverseg:-[INFO]:-Commencing parallel segment instance startup, please wait...
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Process results...
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Triggering FTS probe
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Updating segments for streaming is completed.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-For segments updated successfully, streaming will continue in the background.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Use gpstate -s to check the streaming progress.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-==============================END ANOTHER RECOVER==========================================
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-******************************************************************
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-The rebalance operation has completed successfully.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-There is a resynchronization running in the background to bring all
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-segments in sync.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-Use gpstate -e to check the resynchronization progress.
20250321:20:21:05:023457 gprecoverseg:-[INFO]:-**********************************************************************
```
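A recovery plan this long is easier to review when each "Recovery N of M" stanza is condensed to one line. The sketch below assumes the `key = value` layout printed by gprecoverseg in the log above; the names `summarize_plan` and `describe` are hypothetical helpers, not part of any Greenplum tooling.

```python
import re

# Start of one stanza, e.g. "...[INFO]:-Recovery 1 of 6"
STANZA = re.compile(r"\[INFO\]:-Recovery \d+ of \d+")
# The six fields we keep; "address" and "Recovery Target" lines are ignored.
FIELD = re.compile(
    r"\[INFO\]:-\s*(Failed instance|Recovery Source instance)\s+"
    r"(host|directory|port)\s*=\s*(\S+)"
)


def summarize_plan(log_text):
    """Return one dict per recovery stanza with failed_* and source_* keys."""
    plans, current = [], None
    for line in log_text.splitlines():
        if STANZA.search(line):
            current = {}
            plans.append(current)
            continue
        match = FIELD.search(line)
        if match and current is not None:
            side = "failed" if match.group(1).startswith("Failed") else "source"
            current[side + "_" + match.group(2)] = match.group(3)
    return plans


def describe(plan):
    """Render one stanza as 'failed instance <- recovery source'."""
    return (
        "{failed_host}:{failed_directory}:{failed_port} <- "
        "{source_host}:{source_directory}:{source_port}"
    ).format(**plan)
```

Applied to the first stanza above, `describe` would yield `mpp-02:/data1/m1/gpseg0:43000 <- mpp-01:/data1/p1/gpseg0:42000`.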

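The log is long, so to summarize: six segment instances were down (rebooting mpp-02 took down its 2 primaries and 2 mirrors, on top of the original 2 primaries on mpp-05). We ran `gprecoverseg` to bring the failed segment instances back, then `gprecoverseg -r` to return every segment to the preferred role assigned to it at system initialization.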
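The order matters: the rebalance should only start once the recovered segments have caught up. The following sketch shows one way to script that sequence. It is an illustration under stated assumptions, not the procedure we actually ran: `run` is a hypothetical callable that executes a command and returns its stdout, the `-a` (no prompt) flag is added here for unattended use, and readiness is inferred from the absence of the `Unsynchronized Segment Pairs` and `Downed Segments` sections that `gpstate -e` printed in the output earlier.

```python
import subprocess
import time

# Section titles taken from the `gpstate -e` output shown earlier; while
# either is present, at least one segment is still down or catching up.
BLOCKING_SECTIONS = ("Unsynchronized Segment Pairs", "Downed Segments")


def ready_to_rebalance(gpstate_e_output):
    """True when `gpstate -e` lists no unsynchronized or downed segments."""
    return not any(title in gpstate_e_output for title in BLOCKING_SECTIONS)


def run_cli(argv):
    """Default runner: execute a command and return its stdout."""
    return subprocess.run(
        argv, check=True, capture_output=True, text=True
    ).stdout


def recover_then_rebalance(run=run_cli, poll_seconds=30, max_polls=120,
                           sleep=time.sleep):
    """Recover failed segments, wait for sync, then restore preferred roles.

    Returns True if the rebalance was issued, False if the segments did
    not synchronize within max_polls checks.
    """
    run(["gprecoverseg", "-a"])            # step 1: recover down segments
    for _ in range(max_polls):
        if ready_to_rebalance(run(["gpstate", "-e"])):
            run(["gprecoverseg", "-r", "-a"])  # step 2: rebalance roles
            return True
        sleep(poll_seconds)
    return False
```

Passing `run` and `sleep` as parameters keeps the polling logic testable without a live cluster. Note that `gprecoverseg -r` cancels running queries, as the warning in the log above states, so an unattended run like this is best reserved for a maintenance window.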

At this point gpstate shows the whole cluster as healthy:

```linux
$ gpstate -m
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -m
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--Current GPDB mirror list and status
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--Type = Group
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   Mirror   Datadir             Port    Status    Data Status
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-02   /data1/m1/gpseg0    43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-02   /data1/m2/gpseg1    43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-03   /data1/m1/gpseg2    43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-03   /data1/m2/gpseg3    43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-04   /data1/m1/gpseg4    43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-04   /data1/m2/gpseg5    43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-05   /data1/m1/gpseg6    43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-05   /data1/m2/gpseg7    43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-06   /data1/m1/gpseg8    43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-06   /data1/m2/gpseg9    43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-01   /data1/m1/gpseg10   43000   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:-   mpp-01   /data1/m2/gpseg11   43001   Passive   Synchronized
20250321:21:47:01:016024 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

$ gpstate -c
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -c
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------
20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--Current GPDB mirror list and status

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--Type = Group

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Status Data State Primary Datadir Port Mirror Datadir Port

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-01 /data1/p1/gpseg0 42000 mpp-02 /data1/m1/gpseg0 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-01 /data1/p2/gpseg1 42001 mpp-02 /data1/m2/gpseg1 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-02 /data1/p1/gpseg2 42000 mpp-03 /data1/m1/gpseg2 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-02 /data1/p2/gpseg3 42001 mpp-03 /data1/m2/gpseg3 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-03 /data1/p1/gpseg4 42000 mpp-04 /data1/m1/gpseg4 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-03 /data1/p2/gpseg5 42001 mpp-04 /data1/m2/gpseg5 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-04 /data1/p1/gpseg6 42000 mpp-05 /data1/m1/gpseg6 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-04 /data1/p2/gpseg7 42001 mpp-05 /data1/m2/gpseg7 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-05 /data1/p1/gpseg8 42000 mpp-06 /data1/m1/gpseg8 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-05 /data1/p2/gpseg9 42001 mpp-06 /data1/m2/gpseg9 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-06 /data1/p1/gpseg10 42000 mpp-01 /data1/m1/gpseg10 43000

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:- Primary Active, Mirror Available Synchronized mpp-06 /data1/p2/gpseg11 42001 mpp-01 /data1/m2/gpseg11 43001

20250321:21:47:11:016943 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

$ gpstate -e

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -e

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 11 2020102123:56'

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Gathering data from segments...

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-----------------------------------------------------

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-Segment Mirroring Status Report

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-----------------------------------------------------

20250321:21:47:14:017039 gpstate:mpp-01:-[INFO]:-All segments are running normally

```

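Just as we were about to call it a day, a colleague on the monitoring team told us that port 5432 on mpp-02 was still showing as DOWN. What?

Time for a quick look: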

```linux

$ gpstate -f

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Starting gpstate with args: -f

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 6.8.1 build

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-master Greenplum Version: 'PostgreSQL 9.4.24 (Greenplum Database 6.8.1 build on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Jun 20250321:21:56:14:009720

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Obtaining Segment details from master...

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-Standby master details

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-----------------------

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby address = mpp-02

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby data directory = /data1/master/gpseg-1

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:- Standby port = 5432

20250321:21:56:14:009720 gpstate:mpp-01:-[WARNING]:-Standby PID = 0 <<<<<<<<

20250321:21:56:14:009720 gpstate:mpp-01:-[WARNING]:-Standby status = Standby process not running <<<<<<<<

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--pg_stat_replication

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:-No entries found.

20250321:21:56:14:009720 gpstate:mpp-01:-[INFO]:--------------------------------------------------------------

```

Sure enough, the standby master was not running. According to the official documentation, [gpinitstandby](.html "gpinitstandby") can bring it back.

```linux

$ gpinitstandby -n

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Validating environment and parameters for standby initialization...

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Checking for data directory /data1/master/gpseg-1 on mpp-02

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:------------------------------------------------------

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master initialization parameters

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:------------------------------------------------------

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master hostname = mpp-01

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master data directory = /data1/master/gpseg-1

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum master port = 5432

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master hostname = mpp-02

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master port = 5432

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum standby master data directory = /data1/master/gpseg-1

20250321:57:21:09:010314 gpinitstandby:mpp-01:-[INFO]:-Greenplum update system catalog = On

20250321:57:21:11:010314 gpinitstandby:mpp-01:-[INFO]:-Syncing Greenplum Database extensions to standby

20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-The packages on mpp-02 are consistent.

20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Adding standby master to catalog...

20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Database catalog updated successfully.

20250321:57:21:12:010314 gpinitstandby:mpp-01:-[INFO]:-Updating pg_hba.conf file...

20250321:57:21:13:010314 gpinitstandby:mpp-01:-[INFO]:-pg_hba.conf files updated successfully.

20250321:57:21:16:010314 gpinitstandby:mpp-01:-[INFO]:-Starting standby master

20250321:57:21:16:010314 gpinitstandby:mpp-01:-[INFO]:-Checking if standby master is running on host: mpp-02 in directory: /data1/master/gpseg-1

20250321:57:22:42:010314 gpinitstandby:mpp-01:-[WARNING]:-Could not start standby master

20250321:57:22:42:010314 gpinitstandby:mpp-01:-[INFO]:-Cleaning up pg_hba.conf backup files...

20250321:57:22:43:010314 gpinitstandby:mpp-01:-[INFO]:-Backup files of pg_hba.conf cleaned up successfully.

20250321:57:22:43:010314 gpinitstandby:mpp-01:-[INFO]:-Successfully created standby master on mpp-02

```

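No luck: reactivating the standby master had no effect. Grasping at straws by this point, we removed the standby master and then added mpp-02 back as the standby node, with the same result. The master process on mpp-02 still would not start!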
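One detail in the `gpinitstandby` log is easy to miss: the run ended with "Successfully created standby master" even though it had just warned "Could not start standby master", so its last line is not a reliable health signal. Below is a minimal sketch of a check based on `gpstate -f` output instead. The function name `standby_is_running` is hypothetical, and the sketch assumes that a healthy standby is reported with a non-zero `Standby PID` value.

```shell
# standby_is_running: reads `gpstate -f` output on stdin.
# Exit status 0 only if a non-zero standby PID is reported and the
# "Standby process not running" warning is absent.
standby_is_running() {
  local out pid
  out="$(cat)"
  # gpstate flags a dead standby with this message.
  if printf '%s\n' "$out" | grep -q 'Standby process not running'; then
    return 1
  fi
  # Extract the number after "Standby PID =" (spacing may vary).
  pid="$(printf '%s\n' "$out" \
    | sed -n 's/.*Standby PID[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
    | head -n 1)"
  [ -n "$pid" ] && [ "$pid" -gt 0 ]
}
```

A possible use after any recovery: `gpstate -f | standby_is_running || echo "standby master is NOT running"`.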

## The pg_hba.conf pitfall

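The standby master still would not start, but the system itself was running normally. My manager, seeing that it was late and my thinking had become muddled, sent me home.

The next day happened to be the weekend, but I refused to accept that the standby could not be brought up. Checking pg_log on mpp-01 and mpp-02 turned up the clue: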

```linux

]# more gpdb-2025-03-21_212117.csv

2025-03-21 21:21:17.736282 CST,,,p25377,th167159936,,,,0,,,seg-1,,,,,"LOG","F0000","invalid authentication method ""0.0.0.0/0""",,,,,"line 107 of configuration file ""/data1/master/gpseg-1/pg_hba.conf""",,0,,"hba.c",1206,

2025-03-21 21:21:17.736434 CST,,,p25377,th167159936,,,,0,,,seg-1,,,,,"FATAL","F0000","could not load pg_hba.conf",,,,,,,0,,"postmaster.c",1460,

```

The pg_hba.conf file is misconfigured! Here is the offending entry:

```

...

local all backup 0.0.0.0/0 md5

...

```

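Sure enough, the entry is wrong: a `local` record takes no address field, so `0.0.0.0/0` was read as the authentication method. The colleagues involved told us that the line was added when a backup appliance was being integrated, and that the appliance never ended up backing up this database. Had it not been for this chain of accidents, the bad entry might have stayed there indefinitely. We then commented out the line and re-ran `gpinitstandby -n` to reactivate the standby master.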
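A check like the following would have caught the entry before it could block a restart. This is a rough sketch, not a full pg_hba.conf validator: the function name `check_local_hba` is made up, and it only looks for the specific mistake seen here, a `local` record (format: `local database user auth-method`) whose fourth field looks like an address.

```shell
# check_local_hba: reads a pg_hba.conf on stdin and prints every "local"
# record whose 4th field contains ".", "/" or ":" (i.e. looks like an
# address instead of an auth method). Exit status is 1 if any are found.
check_local_hba() {
  awk '
    NF == 0 || $1 ~ /^#/ { next }          # skip blank lines and comments
    $1 == "local" && $4 ~ "[./:]" {
      printf "line %d: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

For example, `check_local_hba < /data1/master/gpseg-1/pg_hba.conf` could be run on the master before initializing or restarting a standby, since a running postmaster does not necessarily expose a bad entry until the next start.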

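# Retrospective

Back when we first noticed that the database PID on mpp-02 was reported as missing, could `gprecoverseg` have fixed that fault as well?

> I took over this Greenplum cluster midway myself, so if any expert knows a better way to handle this problem, please message me privately. Thanks.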