Cumulus Basics Part VI - VXLAN L2VNIs with BGP EVPN

Updated: May 21, 2019

Now that we've covered the basics of BGP unnumbered in the last post, we'll start building a VXLAN-based fabric with BGP EVPN. Our topology remains the same, with a minor change - PC2 has now been moved into the same subnet as PC1, 10.1.1.0/24:





The PCs are lightweight emulations here. Take note of the MAC addresses assigned to them, as they will be important and referenced constantly throughout the post:



Start by activating the neighbors against the L2VPN EVPN AFI/SAFI. Outside of this, I am re-doing some of the configuration from the previous post for completeness' sake: we are adding a loopback per switch, adding the BGP neighbors, and marking each LEAF switch as a route reflector client on the SPINEs.
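The original screenshots aren't reproduced here, so below is a rough NCLU sketch of what this looks like on LEAF1 and on a SPINE. The AS number (65000), the interface names (swp1/swp2 towards the SPINEs, swp1-swp3 towards the LEAFs) and the loopback addressing are assumptions for illustration, not necessarily the exact values used in this series:

    # LEAF1 - loopback, iBGP unnumbered sessions towards the SPINEs,
    # and activation of those sessions for the L2VPN EVPN AFI/SAFI
    net add loopback lo ip address 1.1.1.1/32
    net add bgp autonomous-system 65000
    net add bgp neighbor swp1 interface remote-as internal
    net add bgp neighbor swp2 interface remote-as internal
    net add bgp l2vpn evpn neighbor swp1 activate
    net add bgp l2vpn evpn neighbor swp2 activate
    net commit

    # SPINE1 - same idea (loopback omitted, same pattern as the LEAFs),
    # with each LEAF-facing session marked as a route reflector client
    # under the EVPN address family
    net add bgp autonomous-system 65000
    net add bgp neighbor swp1 interface remote-as internal
    net add bgp neighbor swp2 interface remote-as internal
    net add bgp neighbor swp3 interface remote-as internal
    net add bgp l2vpn evpn neighbor swp1 activate
    net add bgp l2vpn evpn neighbor swp1 route-reflector-client
    net add bgp l2vpn evpn neighbor swp2 activate
    net add bgp l2vpn evpn neighbor swp2 route-reflector-client
    net add bgp l2vpn evpn neighbor swp3 activate
    net add bgp l2vpn evpn neighbor swp3 route-reflector-client
    net commit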



The LEAF1 configuration must be appropriately replicated on the other two LEAF switches (modify interfaces as needed and add unique loopback IPs).


Once this is done, confirm that the BGP peerings for L2VPN EVPN are up. Each SPINE should have three peerings, one to each LEAF switch:
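On a SPINE, the standard NCLU summary for this address family lists the sessions and their states:

    net show bgp l2vpn evpn summary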



Remember that the underlay provides reachability from one tunnel endpoint to another. Since we are using BGP unnumbered as the underlay itself, we need to advertise the loopback of each LEAF switch into the IPv4 unicast address family.
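One simple way to do this on each LEAF is a network statement for its own loopback (1.1.1.1 here for LEAF1, as assumed earlier):

    net add bgp network 1.1.1.1/32
    net commit

('net add bgp redistribute connected' is an alternative, at the cost of pulling in every connected prefix.)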



Each LEAF switch should now be aware of the loopbacks of the other two LEAF switches.
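The routing table on any LEAF is a quick way to confirm this - the other loopbacks should show up as BGP routes:

    net show route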



Confirm reachability from loopback to loopback. Running this test from one of the LEAF switches should be enough:
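For example, from LEAF1, sourcing the ping from its own loopback (addresses as assumed earlier):

    ping -I 1.1.1.1 2.2.2.2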



Now, we start with a simple premise - PC1 cannot ping PC2:



Our goal is to allow PC1 to talk to PC2 via a VXLAN fabric. The idea is straightforward - you map your VLAN to a VNID (VXLAN network ID) and associate it with a virtual interface that acts as your entry/exit point for VXLAN packets.


The configuration on a Cumulus box for this is like so (scroll right to understand what each configuration is doing):
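The original screenshot isn't reproduced here, but going by the description that follows, the NCLU commands are roughly these (the inline comments are mine; the interface name, VNID and tunnel IP match what the post describes):

    net add vxlan vni10 vxlan id 10010                  # create an interface named 'vni10' and map it to VNID 10010
    net add vxlan vni10 bridge access 10                # VLAN 10 is mapped to this VNI
    net add vxlan vni10 bridge learning off             # disable data-plane MAC learning; BGP EVPN handles remote MACs
    net add vxlan vni10 vxlan local-tunnelip 1.1.1.1    # source IP for VXLAN encapsulation (LEAF1's loopback)
    net add bridge bridge ports vni10                   # add vni10 to the VLAN-aware bridge
    net commit

Note that 'net add bgp l2vpn evpn advertise-all-vni' also needs to be enabled on the LEAF switches (if it wasn't already part of the earlier BGP configuration) so that locally learnt MACs for this VNI are exported into BGP.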



It is also important to understand what changes these commands translate to, so take a look at the diff when you commit them:
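'net pending' shows that diff before you commit. Roughly speaking (attribute ordering and any pre-existing bridge settings aside), the commands above render into stanzas like these in /etc/network/interfaces:

    auto vni10
    iface vni10
        bridge-access 10
        bridge-learning off
        vxlan-id 10010
        vxlan-local-tunnelip 1.1.1.1

    auto bridge
    iface bridge
        bridge-ports swp3 vni10
        bridge-vids 10
        bridge-vlan-aware yes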



The changes we made create a new virtual interface named 'vni10' (the name we gave in the first command), associate VLAN 10 and a VXLAN network identifier (VNID) of 10010 with it, disable MAC learning on it, set a source IP of 1.1.1.1 for it and add the interface to the VLAN-aware bridge.


Mimic this configuration on LEAF2 - the only change you need to make there is the local tunnel-ip that is used (configure that as LEAF2's loopback, 2.2.2.2).
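In other words, assuming the rest of the naming is identical, the one command that differs on LEAF2 would be:

    net add vxlan vni10 vxlan local-tunnelip 2.2.2.2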


Control-plane flow:

Let's try and understand the flow of control-plane learning now. As a baseline, we have not learnt PC1's or PC2's MAC, nor are there any type-2 routes in the BGP table:
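On either LEAF, this baseline can be taken with two standard NCLU show commands - the first shows the bridge MAC table, the second the EVPN routes in BGP:

    net show bridge macs
    net show bgp l2vpn evpn route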



I am going to restart PC1 now. This forces a gratuitous ARP (GARP) to be sent out when it comes up, and this GARP causes PC1's MAC to be learnt on swp3 of LEAF1:



Once the MAC is learnt, it gets pushed into the BGP table as well:



This is now sent as an update to the SPINEs, and the SPINEs reflect it to LEAF2. A packet capture on LEAF2 shows this BGP UPDATE message:
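To reproduce the capture, run tcpdump on LEAF2's SPINE-facing interface (swp1 is an assumed name here) filtered on BGP, writing to a file that can then be opened in Wireshark:

    sudo tcpdump -i swp1 -s 0 -w /tmp/bgp-evpn.pcap port 179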



There is a lot of interesting information in this update. The NLRI is an EVPN NLRI, describing a MAC advertisement route (type-2). The RD is 1.1.1.1:2 and you can see the MAC address inside this NLRI, with no IP address included. Also, look at the final attribute, the extended communities - this includes the RT (which is a combination of the AS and the VNID itself). The extended community also describes the encapsulation, which is of type VXLAN.


Immediately after this update, another update is sent with the IP address included this time:



LEAF2 gets this update and installs it in its BGP table as well against the RD that was in the update (1.1.1.1:2):



From here, the type-2 route gets pushed into the MAC address table and associated with vni10, with the 'offload' flag set and the VXLAN destination IP set to 1.1.1.1:
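The kernel FDB for vni10 is one place to see this directly - remote MAC entries installed by the EVPN control-plane show the remote tunnel endpoint as 'dst' (and, as noted above, the 'offload' flag):

    bridge fdb show dev vni10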



You can see that with the BGP EVPN control-plane, the data-plane is already built and ready without any conversation between the hosts.


Data-plane flow:

From a data-plane perspective, when PC1 pings PC2, it tries to resolve the IP address first. This is achieved via an ARP request that goes through the fabric, typically in one of two ways - either with head-end replication on the ingress tunnel endpoint (the ARP is replicated and sent as unicast VXLAN packets to each remote tunnel endpoint) or via multicast (the underlay needs to be multicast-aware). We are not going to go into BUM (broadcast/unknown unicast/multicast) traffic handling right now since that needs to be a separate post in itself.


Assuming that ARP is resolved, PC1 generates a unicast ICMP request that hits LEAF1. LEAF1 does a MAC table lookup and finds that this needs to be sent over vni10, with a VXLAN encapsulation, a VNID of 10010 and a destination IP of 2.2.2.2:



On the wire, the packets look like this:
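To see these frames on the wire yourself, capture on a SPINE-facing interface (name assumed again) and filter on the default VXLAN UDP port, 4789:

    sudo tcpdump -i swp1 -n -e udp port 4789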



It is an ICMP header inside an inner IP (and Ethernet) header, inside a VXLAN header, inside a UDP header, inside an outer IP header. The logic is that the outermost IP header carries the packet from one VXLAN tunnel endpoint to the other. At the far end, the outer IP header is stripped off (since that endpoint owns the destination IP address), revealing the UDP and VXLAN headers, which are subsequently stripped off as well. Another MAC lookup is done on the inner frame and a forwarding decision is made based on that:



In this case, the MAC table says that the packet needs to be forwarded out of swp3. This gets the packet to PC2, which responds with an ICMP reply. The entire data-plane process happens again in the reverse direction, with LEAF2 now being the source of the VXLAN encapsulation:



Notice how the outermost IP header now has a source IP address of 2.2.2.2. Effectively, you can visualize this as a bi-directional tunnel carrying traffic over a routed infrastructure between two hosts in the same subnet.





I hope this was informative. I'll see y'all in the next post!


For any queries, concerns or conversations, please email contact@theasciiconstruct.com