Updated: May 21, 2019
The last post introduced basic BGP bringup on a Cumulus box with OSPF as the IGP. Let's now move towards a BGP unnumbered design and understand how that works.
We will use the same network topology as before:
The idea behind BGP unnumbered is to use IPv6 link-local addressing on a hop-by-hop basis. When you're building an L3 fabric, what is the goal of the underlay? Outside of any multicast replication that may be required, the main goal (from a unicast perspective) is to provide connectivity from one tunnel endpoint to another. Typically, you would use something like OSPF or IS-IS to advertise the loopbacks of the tunnel endpoints and thus provide connectivity from one loopback to another.
Now, with that premise in mind, let's break it down some more. What is really done on a per-hop basis? Each node simply does an L3 lookup, resolves the next hop's address, rewrites the L2 header and forwards the packet on towards the next hop. This entire process can be lifted away from an IGP and done via BGP itself, by using link-local IPv6 addressing and RFC 5549, which allows you to advertise an IPv4 NLRI with an IPv6 next hop. And how do you resolve the IPv6 next hop? Using the IPv6 neighbor discovery process.
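To make the per-hop logic concrete, here is a minimal Python sketch of what a single node does: a longest-prefix match on the IPv4 destination, then resolution of the IPv6 link-local next hop through a neighbor table to get the MAC address for the L2 rewrite. The prefixes, interface names, addresses and MAC addresses in the tables are hypothetical and only illustrate the idea; they are not taken from the lab.

```python
import ipaddress

# Hypothetical routing table: IPv4 prefix -> (IPv6 link-local next hop, egress interface).
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): ("fe80::5200:ff:fe03:1", "swp1"),
    ipaddress.ip_network("198.51.100.0/24"): ("fe80::5200:ff:fe04:1", "swp2"),
}

# Hypothetical neighbor table, as IPv6 neighbor discovery would populate it.
# Link-local addresses are only unique per link, so the key includes the interface.
NEIGHBORS = {
    ("fe80::5200:ff:fe03:1", "swp1"): "50:00:00:03:00:01",
    ("fe80::5200:ff:fe04:1", "swp2"): "50:00:00:04:00:01",
}


def forward(dst):
    """Return the egress interface and destination MAC for an IPv4 packet, or None."""
    addr = ipaddress.ip_address(dst)
    # Step 1: L3 lookup (longest-prefix match).
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    next_hop, interface = ROUTES[best]
    # Step 2: resolve the next hop's link-layer address.
    mac = NEIGHBORS.get((next_hop, interface))
    if mac is None:
        return None  # a real node would start neighbor discovery here
    # Step 3: the L2 rewrite uses this MAC as the destination and the frame leaves on `interface`.
    return {"out_interface": interface, "dst_mac": mac}


print(forward("192.0.2.10"))
```

Notice that nothing in this lookup needs an IPv4 address on the link itself; the next hop only has to be resolvable to a MAC address.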
Let's start putting some of these pieces together now. First, we enable IPv6 ND on the point to point links and disable RA suppression (which appears to be enabled by default).
This adds the following to the '/etc/frr/frr.conf' file:
From 'net show interface <>' you can confirm the link-local IPv6 address, the MAC address associated with this interface, the Router Advertisement (abbreviated to 'RA' going forward) interval and so on:
Let's mimic the configuration on SPINE1 as well now.
Again, look at 'net show interface swp1' to confirm the MAC address and the IPv6 link-local address:
Take a look at the neighbor discovery process between LEAF1 and SPINE1 now.
First, SPINE1 sends out a neighbor solicitation (abbreviated to 'NS' going forward) message with its own address as the target. This is sent to a well-known multicast address:
After a back-and-forth exchange of RAs, another NS is sent by SPINE1, but this time with a target address of 'fe80::5200:ff:fe03:1', which corresponds to the link-local IPv6 address assigned to swp1 of LEAF1. Notice how the ICMPv6 option also specifies the link-layer address, which corresponds to the MAC address of swp1 on SPINE1.
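The addresses involved in this exchange can be derived by hand. The Python sketch below builds a link-local address from a MAC address using modified EUI-64, then computes the solicited-node multicast group and the Ethernet multicast MAC that an address-resolution NS for that target would normally use. The MAC address '50:00:00:03:00:01' is an assumption, worked backwards from the link-local address quoted above on the premise that it was generated with EUI-64; whether this particular NS was actually sent to the solicited-node group should be checked against the capture.

```python
import ipaddress


def link_local_from_mac(mac):
    """Build an fe80::/64 address from a MAC address using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


def solicited_node_multicast(addr):
    """Return ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of addr."""
    low24 = int(ipaddress.IPv6Address(str(addr))) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group):
    """Map an IPv6 multicast group to its Ethernet address (33:33 + last 32 bits)."""
    last4 = ipaddress.IPv6Address(str(group)).packed[-4:]
    return "33:33:" + ":".join(f"{b:02x}" for b in last4)


# Assumed MAC for swp1 on LEAF1 (inferred, not taken from the lab output).
target = link_local_from_mac("50:00:00:03:00:01")
group = solicited_node_multicast(target)
print(target, group, multicast_mac(group))
```

With the assumed MAC, the derived link-local address matches the target address seen in the NS.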
LEAF1 responds to this NS with a Neighbor Advertisement (abbreviated to 'NA' going forward) message.
Notice that the link-layer address in the NA sent by LEAF1 is the MAC address of its port, swp1. SPINE1 can now use this information to build its IP neighbor table. The same process happens the other way around, with LEAF1 sending an NS and SPINE1 responding with an NA. At the end of this, both should have their IP neighbor tables correctly populated.
You can confirm this using:
Let's bring up BGP over this link now. The configuration needs to be modified a little, since these links no longer have any IPv4 addresses (apart from their default link-local IPv4 addresses). Instead of specifying an IP address in the BGP neighbor statement, Cumulus allows you to specify the interface (port) name.
A packet capture shows us the bringup sequence for BGP between these two boxes:
Let's break this down quickly - initially we see several TCP resets. Why is that? Because the BGP port was not open yet on SPINE1 (BGP had not been configured there at that point), any TCP SYN with a destination port of 179 (BGP) was rejected by SPINE1. Once the configuration is complete on both sides, we see the 3-way TCP handshake complete and the OPEN messages being sent.
Among other capabilities exchanged in the OPEN message, an important one is highlighted in the capture - the extended next hop encoding. This allows an IPv4 NLRI to have an IPv6 next hop. You need to make sure this capability is exchanged. To force this, you can use the 'net add bgp neighbor <interface> capability extended-nexthop' command on a Cumulus box.
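For reference, here is a rough Python sketch of how this capability is laid out on the wire according to RFC 5549: capability code 5, a length byte, and then one or more 6-byte tuples of NLRI AFI, NLRI SAFI and next hop AFI. The function names are made up for illustration and this is not how FRR implements it; it only shows the bytes you would expect to find in the OPEN message for "IPv4 unicast NLRI with an IPv6 next hop".

```python
import struct

CAP_EXTENDED_NEXT_HOP = 5  # capability code assigned by RFC 5549
AFI_IPV4, AFI_IPV6 = 1, 2
SAFI_UNICAST = 1


def encode_extended_next_hop(tuples):
    """Encode (nlri_afi, nlri_safi, nexthop_afi) tuples as a capability TLV."""
    value = b"".join(
        struct.pack("!HHH", nlri_afi, nlri_safi, nexthop_afi)
        for nlri_afi, nlri_safi, nexthop_afi in tuples
    )
    return struct.pack("!BB", CAP_EXTENDED_NEXT_HOP, len(value)) + value


def decode_extended_next_hop(data):
    """Decode a capability TLV back into a list of tuples."""
    code, length = struct.unpack("!BB", data[:2])
    if code != CAP_EXTENDED_NEXT_HOP or length % 6 != 0:
        raise ValueError("not an extended next hop capability")
    value = data[2:2 + length]
    return [struct.unpack("!HHH", value[i:i + 6]) for i in range(0, length, 6)]


# IPv4 unicast NLRI may carry an IPv6 next hop.
cap = encode_extended_next_hop([(AFI_IPV4, SAFI_UNICAST, AFI_IPV6)])
print(cap.hex())
```

If you expand the capability in the packet capture, these are the three fields to look for.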
Using a similar approach, we can complete our BGP peerings for this entire infrastructure. Remember to make each LEAF switch a route reflector client of the SPINE switches (otherwise an update from a LEAF switch will not be sent to the other LEAF switches by the SPINE because of iBGP peering rules). At the end of this, each SPINE should see three peerings - one to each of the LEAF switches:
Each SPINE has three BGP neighbors, as expected. We have not advertised the host subnets yet so let's do that and take a packet capture to analyze how this is advertised.
Take a look at the following capture taken on LEAF2 as it receives a BGP update from SPINE1:
The NLRI describes an IPv4 subnet but the next hop is an IPv6 address. How cool is that? Look at the RIB/FIB on LEAF2 to confirm how this is installed:
The RIB installs the prefix against the link-local IPv6 address, while the FIB installs it against the link-local IPv4 address.
After advertising all host subnets, LEAF1's RIB looks like this:
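To tie the update seen in the capture back to the RFCs, here is a small Python sketch that builds the value of an MP_REACH_NLRI attribute (RFC 4760) carrying an IPv4 unicast prefix with an IPv6 next hop, as RFC 5549 permits. The prefix '192.0.2.0/24' is a documentation prefix used purely for illustration and the function name is hypothetical. The sketch encodes a single 16-byte next hop; a real implementation may send a 32-byte next hop (global plus link-local), so compare the length against your own capture.

```python
import ipaddress
import struct

AFI_IPV4 = 1
SAFI_UNICAST = 1


def encode_mp_reach_ipv4_over_ipv6(prefix, next_hop):
    """Build the MP_REACH_NLRI value (without the attribute flags/type/length header)."""
    net = ipaddress.IPv4Network(prefix)
    nh = ipaddress.IPv6Address(next_hop).packed  # 16 bytes
    # NLRI: prefix length in bits, then only as many prefix bytes as needed.
    prefix_bytes = net.network_address.packed[: (net.prefixlen + 7) // 8]
    nlri = bytes([net.prefixlen]) + prefix_bytes
    # AFI (2 bytes), SAFI (1), next hop length (1), next hop, reserved (1), NLRI.
    return struct.pack("!HBB", AFI_IPV4, SAFI_UNICAST, len(nh)) + nh + b"\x00" + nlri


value = encode_mp_reach_ipv4_over_ipv6("192.0.2.0/24", "fe80::5200:ff:fe03:1")
print(value.hex())
```

The AFI/SAFI still say "IPv4 unicast"; only the next hop length and contents give away that the next hop is IPv6.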
PC1 should be able to ping PC2 and PC3 now:
And there it is. A thing of beauty!