We continue on with the same topology as the last post:
Our goal is to have PC1 talk to PC3 over the VXLAN fabric that we have built out. VXLAN routing can broadly be done using two methods - asymmetric and symmetric IRB. In this post, we will discuss the asymmetric methodology.
The logic behind asymmetric IRB is that the routing between VXLANs/VNIs is done at each LEAF. This is analogous to inter-VLAN routing where the routing from one VLAN to another happens on the first routed hop itself. To facilitate said inter-VLAN routing, you are required to have the L3 interfaces for both VLANs defined on the first hop (LEAF1 here, for example), so that the box is capable of taking in the packet from one VLAN, routing it into the other VLAN where it can then ARP for the end host directly.
Following the same logic, for inter-VXLAN routing, you are required to have the L3 interface for both VNIs on the first routed hop. Let's break this down into chunks and start our configuration.
First, we need to have L3 interfaces for the VLANs themselves that will act as gateways for the end hosts:
Notice how the same IP address is defined at both LEAF switches for the same VLAN. Typically, what you should do here is define another virtual address with a virtual mac address common to all LEAF switches sharing this VLAN (VRR, in Cumulus terms). However, for simplicity sake, we are not doing that here.
Next, the routed interfaces for the corresponding VXLANs (or rather VNIs) are also created on both LEAF1 and LEAF3:
You can confirm the RD/RT values that are auto-generated for these VNIs using:
Remember, the default gateway of the hosts is the corresponding SVI IP address of the LEAF switch:
So, how does this all work?
PC1 initiates a ping. It first does an 'and' operation to determine if the destination is in the same subnet or not. In this case, it finds out that it is not, so it forwards the packet to its default gateway. In order to do this, it must first ARP for the default gateway, resolve that and then build the ICMP request packet.
This ARP request causes PC1s mac address to be learnt on swp3 of LEAF1. 'Zebra' notifies this to BGP; BGP adds this into its EVPN table and sends an update regarding this to its peers. This BGP update is carried to LEAF3, where BGP populates its EVPN table and then informs 'Zebra', which pushes this into the L2 table. This entire control-plane learning process can be visualized like so:
Debugs on the box can confirm this process. This is the first time we are introducing debugging on Cumulus boxes, so it's good to make a note of this. Debugs are typically redirected to the '/var/log/frr/frr.log' file but you need to set your logging level correctly first:
'sudo vtysh' allows you to enter a Cisco IOS type prompt. From here, you set the syslog level to 'debug'. This allows for debug logs to be logged in the /var/log/frr/frr.log file. Enable the relevant debugs now:
From the log file, I have taken relevant snippets of the debugs. On LEAF1, when PC1s mac address is first learnt:
Similar debugs from LEAF3, run in parallel:
You can now look at the BGP EVPN table to confirm what was installed:
Also confirm the entries in the mac address table, which should now include the mac address of the the host learnt via BGP EVPN:
Remember that the presence of the mac address in the BGP EVPN table is purely controlled by the presence of the same mac address in the L2 table. When the mac is added to the L2 table, a notification is sent to BGP to add it to its EVPN table and advertise the NLRI. The mac address is subjected to its normal mac ageing timers in the L2 table - if/when the mac expires and it gets deleted from the L2 table, again, BGP is notified of the same and it sends a withdraw message for that mac address.
This concludes the workings of the control-plane.
Once PC1 resolves its default gateway, it generates the native ICMP packet. LEAF1 gets this:
Since the destination mac address is of LEAF1, it strips off the Ethernet header and does a route lookup on the destination IP address in the IP header.
The destination is directly connected, so LEAF1 can ARP for it. Here's where things get a bit interesting - if you take a packet capture, you would see no ARPs generated. This is because of an enhancement that reduces flooding within the VXLAN fabric (called ARP suppression). These VTEPs (VXLAN tunnel endpoints) maintain an arp-cache that can be populated via a type-2 mac+ip route. This local ARP cache can be used to proxy ARP (note - this is supposed to be disabled by default and enabled using the command 'net add vxlan <name> bridge arp-nd-suppress on' but the behavior on 3.7.5 appears to indicate that it is enabled by default).
You can confirm what is there in the arp-cache using the following command:
LEAF1 has all the required information to forward this packet. It encapsulates this with the relevant headers and forwards the packet. Assuming that the link to SPINE1 is chosen as the hash, the following packet is sent:
The VXLAN header contains the egress VNI - 10030 in this case. The outer IP header has a source IP address of the VTEP encapsulating the packet (18.104.22.168, in this case) and a destination IP address of the peer VTEP (22.214.171.124, in this case).
Assuming SPINE1 gets it, it simply does a route lookup for the destination (since the destination mac address belongs to itself):
It rewrites the layer2 header and forwards the packet to LEAF3. LEAF3 decapsulates it and forwards the native ICMP packet to PC3. The data-plane process can be visualized like so:
The same process happens backwards from PC3 to PC1 now. The egress VNI (once LEAF3 encapsulates the ICMP reply) would be 10010 in this case. The return packet from LEAF3 is this:
This has been fun, hasn't it? We looked at the basic workflows for both data-plane and control-plane of an asymmetric IRB based VXLAN fabric and successfully routed traffic between VNIs 10010 and 10030.