I have the following network topology between Linux (RHEL 8.6) hosts, with no physical network appliances between the hosts.
Output of bridge link show on Host A:
...
33: br0-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
33: br0-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
34: br1-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1 state forwarding priority 32 cost 100
34: br1-1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1
...
36: br0-2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
36: br0-2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
37: br1-2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1 state blocking priority 32 cost 100
37: br1-2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1
...
39: br0-3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 100
39: br0-3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0
40: br1-3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1 state blocking priority 32 cost 100
40: br1-3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1
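To make it easier to compare which ports end up blocking after each reboot, the output above can be reduced to one line per port. This is only a sketch that assumes the bridge link show format shown here; the function name port_states is made up for illustration and is not part of iproute2.

```shell
# Summarise STP port states from `bridge link show`-style output on stdin.
# Prints "<port> <bridge> <state>" for every line that carries a "state" field;
# lines without a state field are skipped.
# NOTE: port_states is a hypothetical helper name, not an iproute2 command.
port_states() {
    awk '{
        br = ""; st = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "master") br = $(i + 1)   # bridge the port is enslaved to
            if ($i == "state")  st = $(i + 1)   # STP state of the port
        }
        if (st != "") {
            port = $2
            sub(/:$/, "", port)                 # strip the trailing colon
            print port, br, st
        }
    }'
}
```

It could be used as bridge link show | port_states | grep blocking, before and after disconnecting a host, to see which port changed state.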
- The SRIOV VFs are separated by VLANs, and each VF has a unique MAC address.
- No bridge is set up on Host B/C/D; the SRIOV VFs there are normal Linux network interfaces with IP addresses in the same subnet as the connected bridge on Host A.
- When Host B is physically disconnected from Host A, the blocking link between Host A and Host C (SRIOV VF1) becomes forwarding. When Host C is subsequently disconnected from Host A, the blocking link between Host A and Host D (SRIOV VF1) becomes forwarding.
- Which of the VF0/VF1 links between Host A and Host C/D ends up blocking appears to be random: after a reboot it can be the VF0 links that are blocking while the VF1 links are forwarding.
- When STP is disabled on the bridges using nmcli, all ports become forwarding for a short while before the node becomes inaccessible, with this console output:
bnxt_en <PCI address> br0-1: TX timeout detected, starting reset!
bnxt_en <PCI address> br1-3: TX timeout detected, starting reset!
bnxt_en <PCI address> br0-2: TX timeout detected, starting reset!
...
So it seems that disabling STP really does introduce a loop.
The question is: how can we fix this without disabling STP?