Cumulus Basics Part VI - VXLAN L2VNIs with BGP EVPN

Updated: May 21, 2019

Now that we've covered the basics of BGP unnumbered in the last post, we'll start building a VXLAN-based fabric with BGP EVPN. Our topology remains the same, with a minor change - PC2 has now been moved into the same subnet as PC1, 10.1.1.0/24:





The PCs are lightweight emulations here. Take note of the MAC addresses assigned to them, as they will be important and referenced constantly throughout the post:



Start by activating the neighbors against the L2VPN EVPN AFI/SAFI. Outside of this, I am re-doing some of the configuration from the previous post for completeness' sake: we are adding a loopback per switch, adding the BGP neighbors, and marking each LEAF switch as a route reflector client on the SPINEs.
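The original screenshots aren't reproduced here, so below is a rough NCLU sketch of what this looks like on LEAF1 and on a SPINE. The AS number (65000), the interface names (swp1/swp2 towards the SPINEs, swp1-swp3 towards the LEAFs) and the loopback addressing are assumptions for illustration, not necessarily the exact values used in this series:

    # LEAF1 - loopback, iBGP unnumbered sessions towards the SPINEs,
    # and activation of those sessions for the L2VPN EVPN AFI/SAFI
    net add loopback lo ip address 1.1.1.1/32
    net add bgp autonomous-system 65000
    net add bgp neighbor swp1 interface remote-as internal
    net add bgp neighbor swp2 interface remote-as internal
    net add bgp l2vpn evpn neighbor swp1 activate
    net add bgp l2vpn evpn neighbor swp2 activate
    net commit

    # SPINE1 - same idea (loopback omitted, same pattern as the LEAFs),
    # with each LEAF-facing session marked as a route reflector client
    # under the EVPN address family
    net add bgp autonomous-system 65000
    net add bgp neighbor swp1 interface remote-as internal
    net add bgp neighbor swp2 interface remote-as internal
    net add bgp neighbor swp3 interface remote-as internal
    net add bgp l2vpn evpn neighbor swp1 activate
    net add bgp l2vpn evpn neighbor swp1 route-reflector-client
    net add bgp l2vpn evpn neighbor swp2 activate
    net add bgp l2vpn evpn neighbor swp2 route-reflector-client
    net add bgp l2vpn evpn neighbor swp3 activate
    net add bgp l2vpn evpn neighbor swp3 route-reflector-client
    net commit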



The LEAF1 configuration must be appropriately replicated on the other two LEAF switches (modify interfaces as needed and add unique loopback IPs).


Once this is done, confirm that the BGP peerings for L2VPN EVPN are up. Each SPINE should have three peerings, one to each LEAF switch:
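On a SPINE, the standard NCLU summary for this address family lists the sessions and their states:

    net show bgp l2vpn evpn summary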



Remember that the underlay provides reachability from one tunnel endpoint to another. Since we are using BGP unnumbered as the underlay itself, we need to advertise the loopback of each LEAF switch into the IPv4 unicast address family.
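One simple way to do this on each LEAF is a network statement for its own loopback (1.1.1.1 here for LEAF1, as assumed earlier):

    net add bgp network 1.1.1.1/32
    net commit

('net add bgp redistribute connected' is an alternative, at the cost of pulling in every connected prefix.)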



Each LEAF switch should now be aware of the loopbacks of the other two LEAF switches.
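The routing table on any LEAF is a quick way to confirm this - the other loopbacks should show up as BGP routes:

    net show route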



Confirm reachability from loopback to loopback. Running this test from one of the LEAF switches should be enough:
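For example, from LEAF1, sourcing the ping from its own loopback (addresses as assumed earlier):

    ping -I 1.1.1.1 2.2.2.2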



Now, we start with a simple premise - PC1 cannot ping PC2:



Our goal is to allow PC1 to talk to PC2 via a VXLAN fabric. The idea is straightforward - you map your VLAN to a VNID (VXLAN network ID) and associate it with a virtual interface that acts as your entry/exit point for VXLAN packets.


The configuration on a Cumulus box for this is like so (scroll right to understand what each configuration is doing):
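The original screenshot isn't reproduced here, but going by the description that follows, the NCLU commands are roughly these (the inline comments are mine; the interface name, VNID and tunnel IP match what the post describes):

    net add vxlan vni10 vxlan id 10010                  # create an interface named 'vni10' and map it to VNID 10010
    net add vxlan vni10 bridge access 10                # VLAN 10 is mapped to this VNI
    net add vxlan vni10 bridge learning off             # disable data-plane MAC learning; BGP EVPN handles remote MACs
    net add vxlan vni10 vxlan local-tunnelip 1.1.1.1    # source IP for VXLAN encapsulation (LEAF1's loopback)
    net add bridge bridge ports vni10                   # add vni10 to the VLAN-aware bridge
    net commit

Note that 'net add bgp l2vpn evpn advertise-all-vni' also needs to be enabled on the LEAF switches (if it wasn't already part of the earlier BGP configuration) so that locally learnt MACs for this VNI are exported into BGP.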



It is also important to understand what changes these commands translate to, so take a look at the diff when you commit them:
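'net pending' shows that diff before you commit. Roughly speaking (attribute ordering and any pre-existing bridge settings aside), the commands above render into stanzas like these in /etc/network/interfaces:

    auto vni10
    iface vni10
        bridge-access 10
        bridge-learning off
        vxlan-id 10010
        vxlan-local-tunnelip 1.1.1.1

    auto bridge
    iface bridge
        bridge-ports swp3 vni10
        bridge-vids 10
        bridge-vlan-aware yes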



The changes we made create a new virtual interface named 'vni10' (the name we gave in the first command), associate VLAN 10 and a VXLAN network identifier (VNID) of 10010 with it, disable MAC learning on it, set a source IP of 1.1.1.1 for it and add the interface to the VLAN-aware bridge.


Mimic this configuration on LEAF2 - the only change you need to make there is the local tunnel-ip that is used (configure that as LEAF2's loopback, 2.2.2.2).
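In other words, assuming the rest of the naming is identical, the one command that differs on LEAF2 would be:

    net add vxlan vni10 vxlan local-tunnelip 2.2.2.2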


Control-plane flow:

Let's try and understand the flow of control-plane learning now. As a baseline, we have not learnt PC1's or PC2's MAC, nor are there any type-2 routes in the BGP table:
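On either LEAF, this baseline can be taken with two standard NCLU show commands - the first shows the bridge MAC table, the second the EVPN routes in BGP:

    net show bridge macs
    net show bgp l2vpn evpn route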



I am going to restart PC1 now. This forces a gratuitous ARP (GARP) to be sent out when it comes up, and this GARP causes PC1's MAC to be learnt on swp3 of LEAF1:



Once the MAC is learnt, it gets pushed into the BGP table as well:



This is now sent as an update to the SPINEs, and the SPINEs reflect it to LEAF2. A packet capture on LEAF2 shows this BGP UPDATE message:
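To reproduce the capture, run tcpdump on LEAF2's SPINE-facing interface (swp1 is an assumed name here) filtered on BGP, writing to a file that can then be opened in Wireshark:

    sudo tcpdump -i swp1 -s 0 -w /tmp/bgp-evpn.pcap port 179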



There is a lot of interesting information in this update. The NLRI is an EVPN NLRI, describing a MAC advertisement route (type-2). The RD is 1.1.1.1:2 and you can see the MAC address inside this NLRI, with no IP address included. Also, look at the final attribute, the extended communities - this includes the RT (which is a combination of the AS and the VNID itself). The extended community also describes the encapsulation, which is of type VXLAN.


Immediately after this update, another update is sent with the IP address included this time:



LEAF2 gets this update and installs it in its BGP table as well against the RD that was in the update (1.1.1.1:2):



From here, the type-2 route gets pushed into the MAC address table and associated with vni10, with the 'offload' flag set and the VXLAN destination IP set to 1.1.1.1:
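The kernel FDB for vni10 is one place to see this directly - remote MAC entries installed by the EVPN control-plane show the remote tunnel endpoint as 'dst' (and, as noted above, the 'offload' flag):

    bridge fdb show dev vni10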



You can see that with the BGP EVPN control-plane, the data-plane is already built and ready without any conversation between the hosts.


Data-plane flow:

From a data-plane perspective, when PC1 pings PC2, it tries to resolve the IP address first. This is achieved via an ARP request that goes through the fabric, typically in one of two ways - either with head-end replication on the ingress tunnel endpoint (the ARP is replicated and sent as unicast VXLAN packets to each remote tunnel endpoint) or via multicast (the underlay needs to be multicast-aware). We are not going to go into BUM (broadcast/unknown unicast/multicast) traffic handling right now since that needs to be a separate post in itself.


Assuming that ARP is resolved, PC1 generates a unicast ICMP request that hits LEAF1. LEAF1 does a MAC table lookup and finds that this needs to be sent over vni10, with a VXLAN encapsulation, a VNID of 10010 and a destination IP of 2.2.2.2:



On the wire, the packets look like this:
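To see these frames on the wire yourself, capture on a SPINE-facing interface (name assumed again) and filter on the default VXLAN UDP port, 4789:

    sudo tcpdump -i swp1 -n -e udp port 4789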



It is an ICMP header inside an inner IP (and Ethernet) header, inside a VXLAN header, inside a UDP header, inside an outer IP header. The logic is that the outermost IP header carries the packet from one VXLAN tunnel endpoint to the other. At the far end, the outer IP header is stripped off (since that endpoint owns the destination IP address), revealing the UDP and VXLAN headers, which are subsequently stripped off as well. Another MAC lookup is done on the inner frame and a forwarding decision is made based on that:



In this case, the MAC table says that the packet needs to be forwarded out of swp3. This gets the packet to PC2, which responds with an ICMP reply. The entire data-plane process happens again in the reverse direction, with LEAF2 now being the source of the VXLAN encapsulation:



Notice how the outermost IP header now has a source IP address of 2.2.2.2. Effectively, you can visualize this as a bi-directional tunnel carrying traffic over a routed infrastructure between two hosts in the same subnet.





I hope this was informative. I'll see y'all in the next post!


For any queries, concerns or conversations, please email contact@theasciiconstruct.com