Discussion:
[Drbd-dev] DRBD Bug Report
Guo, Lei
2016-10-25 01:23:42 UTC
Version: 9.0.5/9.0.4
File: drbd_state.c

Two nodes are set up: node 1 is primary, node 2 is secondary.
On node 2, the command "drbdadm down r0" returns an error.

[error] Failed to disconnected or detach the r0.
cmd_result=[11], cmd_output=[r0: State change failed: (-10) State change was refused by peer node
additional info from kernel:failed to disconnect


The possible bug is as follows.
static enum outdate_what outdate_on_disconnect(struct drbd_connection *connection)
{
        struct drbd_resource *resource = connection->resource;

        if (connection->fencing_policy >= FP_RESOURCE &&
            resource->role[NOW] != connection->peer_role[NOW]) {
                if (resource->role[NOW] == R_PRIMARY)
                        return OUTDATE_PEER_DISKS;
                if (connection->peer_role[NOW] != R_PRIMARY) <--------- should be "if (connection->peer_role[NOW] == R_PRIMARY)"
                        return OUTDATE_DISKS;
        }
        return OUTDATE_NOTHING;
}
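
For reference, this is how the function would read with the suggested comparison applied (a sketch based only on the snippet above; the surrounding DRBD 9 sources may differ):

static enum outdate_what outdate_on_disconnect(struct drbd_connection *connection)
{
        struct drbd_resource *resource = connection->resource;

        if (connection->fencing_policy >= FP_RESOURCE &&
            resource->role[NOW] != connection->peer_role[NOW]) {
                /* This node is primary, the peer is not: outdate the peer's disks. */
                if (resource->role[NOW] == R_PRIMARY)
                        return OUTDATE_PEER_DISKS;
                /* The peer is primary, this node is not: outdate the local disks. */
                if (connection->peer_role[NOW] == R_PRIMARY)
                        return OUTDATE_DISKS;
        }
        return OUTDATE_NOTHING;
}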


Thank you in advance for looking into the above.
-------
郭磊 Guo Lei
Development Dept.III (3-2-3)
Nanjing Fujitsu Nanda Software Tech. Co., Ltd.(FNST)
TEL: +86+25-86630566-9437
E-mail: guol-***@cn.fujistu.com
Lars Ellenberg
2016-10-31 10:58:52 UTC
> Version: 9.0.5/9.0.4
> File: drbd_state.c
> Two nodes are set up: node 1 is primary, node 2 is secondary.
> On node 2, the command “drbdadm down r0” returns an error.
> [error] Failed to disconnected or detach the r0.
> cmd_result=[11], cmd_output=[r0: State change failed: (-10) State change was refused by peer node
> additional info from kernel:failed to disconnect
> The possible bug is as follows.
> static enum outdate_what outdate_on_disconnect(struct drbd_connection *connection)
> {
>         struct drbd_resource *resource = connection->resource;
>         if (connection->fencing_policy >= FP_RESOURCE &&
>             resource->role[NOW] != connection->peer_role[NOW]) {
>                 if (resource->role[NOW] == R_PRIMARY)
>                         return OUTDATE_PEER_DISKS;
>                 if (connection->peer_role[NOW] != R_PRIMARY) <--------- should be “if (connection->peer_role[NOW] == R_PRIMARY)”
Yes.
And I thought I fixed that some weeks ago.
But apparently I did not push a test case,
and it got lost during the last merge/release cycle.
Pushed now.

Note though that this is by far not the only thing
that is broken with fencing policies enabled on DRBD 9.

I was operating on the assumption that there were only a few missing
parts regarding Pacemaker (and other) integration. It turned out I
was wrong, and DRBD 9 + fencing policies is still pretty much completely
broken in the module itself. We are working on it.
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
Farhan Khan
2018-04-04 07:04:49 UTC
Hello, I would like to report a bug:

Scenario:
- 2 nodes running as primary and secondary on CentOS 7.4 with Pacemaker
and Corosync, with a dedicated Gigabit link between the nodes

Bug description and replication:
- Disconnect the link between nodes
     --> fence-peer is called and secondary node is correctly fenced
- Do not write any new data to primary while the link is broken
- Reconnect the link
     --> nodes reconnect and re-sync correctly
     --> after-resync-target is NOT called and location constraint remains

Further steps to manually trigger the after-resync-target handler:
- Write some data to the primary
     --> correctly syncs with the secondary
     --> after-resync-target is still not called

Temporary workaround:
- Write a script to write some data to the primary node when it gets
disconnected, using something like "dd if=/dev/zero of=speetest bs=64k
count=1 conv=fdatasync"
     --> this triggers after-resync-target and the location constraint is
successfully removed

my config:
resource r0 {
    device               /dev/drbd0 minor 0;
    disk                 /dev/sda4;
    meta-disk            internal;
    on storage1 {
        node-id 0;
        address          ipv4 172.26.1.1:7790;
    }
    on storage2 {
        node-id 1;
        address          ipv4 172.26.1.2:7790;
    }
    net {
        protocol           C;
        sndbuf-size        0;
        max-buffers      8000;
        max-epoch-size   8000;
        after-sb-0pri    discard-least-changes;
        after-sb-1pri    discard-secondary;
        after-sb-2pri    call-pri-lost-after-sb;
        fencing          resource-only;
    }
    disk {
        md-flushes        no;
        al-extents       3833;
        resync-rate      90M;
        c-plan-ahead       2;
        c-fill-target     2M;
        c-max-rate       100M;
        c-min-rate       25M;
    }
    handlers {
        fence-peer       /usr/lib/drbd/crm-fence-peer.9.sh;
        after-resync-target /usr/lib/drbd/crm-unfence-peer.9.sh;
        pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh;
    }
}
Lars Ellenberg
2018-04-04 10:37:26 UTC
> - 2 nodes running as primary and secondary on CentOS 7.4 with Pacemaker
> and Corosync, with a dedicated Gigabit link between the nodes
> - Disconnect the link between nodes
>      --> fence-peer is called and secondary node is correctly fenced
> - Do not write any new data to primary while the link is broken
> - Reconnect the link
>      --> nodes reconnect and re-sync correctly
>      --> after-resync-target is NOT called and location constraint remains
Because it likely did not sync anything (there was nothing to
sync), it did not become a sync target, so there was nothing for an
"after resync target" handler to do?

You should use the "unfence-peer" handler to unfence.
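
For example, the handlers section of the config above could then look roughly like this (only a sketch; the script paths are taken from the original config and may differ on other installations):

    handlers {
        # crm-unfence-peer.9.sh moved from after-resync-target to unfence-peer
        fence-peer          /usr/lib/drbd/crm-fence-peer.9.sh;
        unfence-peer        /usr/lib/drbd/crm-unfence-peer.9.sh;
        pri-lost-after-sb   /usr/lib/drbd/notify-pri-lost-after-sb.sh;
    }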
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT