MLAG or MC-LAG (multi-chassis link aggregation) is a fairly common deployment model at the access/leaf layer of both Enterprise and Data Center networks, offered by most leading vendors under different names (vPC, VSS, StackWise Virtual, and so on).
The general idea is to offer redundancy at the access layer by pairing together two access/leaf switches into a common logical switch, from the perspective of any devices downstream. Details of Cumulus' implementation can be found here.
For this post, we're going to be using the following topology (tested with Cumulus VX 4.2):
We have three servers, Server5, Server6 and Server3 in VLAN 10, with another server, Server2, in VLAN 20. Server5 is uplinked to both MLAG peers, while Server6 is an orphan device, off of Leaf4 only.
We also have external connectivity via PE1, again, connected only to one of the MLAG peers - Leaf3, in this case. PE1 is advertising 18.104.22.168/32, an external network, to Leaf3.
Logically, for BGP peering, the spines share a common AS, while each leaf has its own unique AS (the MLAG peers share one AS between them, which is what makes their peering over the peer-link iBGP). This is a standard eBGP design, meant to avoid BGP path hunting issues.
Each of these devices has a loopback configured. For the leafs, these loopbacks are the VTEP IPs. The MLAG pair has unique loopbacks, but also a shared anycast CLAG VTEP IP (similar to a secondary IP):
This CLAG anycast IP is configured under the loopback itself, for both Leaf3 and Leaf4:
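As a sketch, the relevant stanza in /etc/network/interfaces would look something like this (the addresses here are hypothetical):

```
auto lo
iface lo inet loopback
    # Unique loopback/VTEP IP of this peer (hypothetical value)
    address 10.0.0.13/32
    # Shared anycast VTEP IP, configured identically on both MLAG peers
    # (hypothetical value)
    clagd-vxlan-anycast-ip 10.0.0.34
```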
Each of the MLAG peers form an eBGP peering with the spines, and an iBGP peering with each other. This iBGP peering is important for failure conditions (we'll look at this in a little bit).
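As a rough sketch of what this looks like in FRR, using BGP unnumbered (interface names and the AS number are hypothetical, with the MLAG pair sharing an AS so that the session over the peer-link SVI is iBGP):

```
router bgp 65034
 ! eBGP, unnumbered, towards the spines
 neighbor swp51 interface remote-as external
 neighbor swp52 interface remote-as external
 ! iBGP towards the MLAG peer, over the peer-link SVI
 neighbor peerlink.4094 interface remote-as internal
 !
 address-family l2vpn evpn
  neighbor swp51 activate
  neighbor swp52 activate
  neighbor peerlink.4094 activate
  advertise-all-vni
```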
When a MAC address is learnt over the MLAG bond, it is synced to the MLAG peer as well. Both peers insert the entry into their BGP EVPN tables and advertise it out. As an example, Server5's MAC address is learnt by both Leaf3 and Leaf4 and advertised via BGP EVPN to the spines and to each other, over the iBGP peering.
There are two big things to remember with MLAG and BGP EVPN advertisements:
Type-2 EVPN prefixes are sent using the anycast VTEP IP address as the next-hop.
Type-5 EVPN prefixes are sent using the local VTEP IP address as the next-hop (the default behavior in Cumulus Linux; other vendors provide a knob to optionally enable this).
The first big why - why do we need an anycast VTEP IP address for the MLAG peers? It allows for easy BGP filtering - remember, when Leaf3 advertises a prefix into BGP EVPN, it adds its own AS number, since the update is being sent to eBGP peers (Spine1/Spine2). When Leaf4 gets this update, it denies it because it sees its own AS in the AS path - basic BGP loop prevention.
However, this doesn't apply to the iBGP peering that is created over the peer-link. This is where the anycast VTEP IP is useful - because the next-hop is an address owned by both peers, they will drop any BGP NLRI carrying this next-hop (due to the self next-hop/martian check). This matters because we wouldn't want the MLAG peers to see each other as next-hops (over VXLAN) for locally attached hosts.
With a simple BGP updates debug, you can confirm that the peers drop this because of the self next-hop check:
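This debug can be enabled from vtysh (a hypothetical hostname is shown); the denied update should then appear in the FRR log, typically /var/log/frr/frr.log, with the self next-hop/martian check as the reason:

```
leaf4# debug bgp updates in
```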
Remember, even routes for orphan devices are advertised with this anycast VTEP address as the next-hop. For example, in our case, Server6 is an orphan device, yet Leaf4 sends the BGP EVPN update with the anycast VTEP address:
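The type-2 (MAC/IP) routes and their next-hops can be inspected with NCLU, for example:

```
cumulus@leaf4:~$ net show bgp l2vpn evpn route type macip
```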
On remote VTEPs (taking Leaf5, as an example), this is installed with 22.214.171.124 as the next-hop:
This can cause traffic for this prefix to hash towards Leaf3 (which does not have a direct connection to Server6). Let's take an example of Server3 pinging Server6.
Because this is a same-subnet ping, Server3 ARPs for the destination directly. Leaf5 responds because it already has an entry for Server6 in its EVPN ARP cache:
Server3 can now send the ICMP request to Leaf5. The destination MAC address is 00:00:00:00:00:06. Leaf5 does a lookup in its MAC address table and sends the packet out with VNI 10010, towards a destination of 126.96.36.199 (the anycast VTEP address, owned by both Leaf3 and Leaf4):
A packet capture confirms the L2VNI:
This can be hashed towards Leaf3. Leaf3 simply does a MAC address lookup (for 00:00:00:00:00:06) now, and finds that it is reachable via the peer-link:
Thus, visually, the path of the packet in this case would be:
The next big why - why are type-5 EVPN routes sent with the local VTEP address, instead of the anycast VTEP address? It is quite common to see external prefixes advertised via only one of the MLAG peers, and not both. In such cases, advertising these type-5 prefixes with the anycast VTEP address can potentially blackhole traffic (because traffic may be ECMP'd to the peer that does not have a route to these external prefixes). Of course, this can be fixed with a per-VRF iBGP peering between the MLAG peers, but that doesn't scale well and adds a lot of administrative overhead.
In general, the 'advertise-pip' BGP EVPN option is used for this - the local VTEP IP address is the 'Primary IP'. Cumulus Linux introduced this in their 4.0 release and it is enabled by default.
However, it needs to be configured in a particular way for it to work. Prior to 4.0, you could use the 'hwaddress' option to specify a MAC address for an interface. From 4.0 onward, you need to use the 'address-virtual' option to create the common router MAC that is shared between the two MLAG peers. This allows each MLAG peer to retain its unique system MAC while sharing this common router MAC.
This change is done under the SVI that maps to the L3VNI:
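As a sketch, assuming vlan4001 is the SVI mapped to the L3VNI and 44:38:39:ff:40:94 is the shared CLAG MAC (both hypothetical values), the /etc/network/interfaces change looks like:

```
auto vlan4001
iface vlan4001
    # Shared anycast router MAC, configured identically on both MLAG peers
    # (hypothetical value)
    address-virtual 44:38:39:ff:40:94
    vlan-id 4001
    vlan-raw-device bridge
```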
You should now see the router MAC changed to this common anycast MAC address (general practice is to just set it as the CLAG MAC address), while the system MAC is retained.
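This can be verified per L3VNI with NCLU (the VNI value here is hypothetical):

```
cumulus@leaf3:~$ net show bgp l2vpn evpn vni 104001
```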
This gives a lot of good information - it tells you that 'advertise-pip' is enabled, what the system IP is, and what the system MAC and the router MAC are. Thus, for type-5 prefixes, the system IP and system MAC are used, and for type-2 prefixes (regardless of whether the host is an orphan host), the router MAC and the anycast VTEP IP address are used.
The type-5 routes are now generated using the system IP and MAC itself.
This is advertised only with the L3VNI. On other VTEPs, this should be installed in the VRF table:
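On a remote VTEP, the type-5 routes and the resulting VRF routes can be checked with commands along these lines (the VRF name is hypothetical):

```
cumulus@leaf5:~$ net show bgp l2vpn evpn route type prefix
cumulus@leaf5:~$ net show route vrf TENANT1
```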
Before we wrap this up, it is important to talk about some failure scenarios with MLAG. An important design consideration (and we'll see this more prominently when we talk about Ethernet Segments in EVPN), is that losing a downlink (towards the server itself) has no control-plane implications. There is absolutely no control-plane convergence because of this - it is purely a data plane forwarding change.
For example, say Leaf3's interface going to Server5 goes down. Traffic destined for Server5 can still be hashed towards Leaf3; it would just be forwarded over the peer-link. This is why capacity planning of the peer-link is equally important.
A second failure scenario to consider is what happens if all fabric links go down - for example, Leaf3 loses all its spine-facing links, and a packet from Server5 (destined to Server2) is hashed to Leaf3.
This is where the iBGP peering is useful. Prior to this event, the route is received via the spines, and via the iBGP peering to the MLAG peer.
After the link down event (and route withdrawals), the traffic is simply routed via the peer-link. If the iBGP peering was missing, traffic would be blackholed on Leaf3.
In the next post, we'll look at how Ethernet Segment based EVPN multi-homing acts as an alternative to MLAGs.