Asymmetric routing with SR Linux in EVPN VXLAN fabrics
This post dives deeper into the asymmetric routing model on SR Linux. The topology in use is a 3-stage Clos fabric with BGP EVPN and VXLAN, with server s1 single-homed to leaf1, s2 dual-homed to leaf2 and leaf3, and s3 single-homed to leaf4. Hosts s1 and s2 are in the same subnet, 172.16.10.0/24, while s3 is in a different subnet, 172.16.20.0/24. Thus, this post demonstrates Layer 2 extension over a routed fabric as well as how Layer 3 services are deployed over the same fabric, with an asymmetric routing model.
The physical topology is shown below:
The Containerlab file used for this is shown below:
name: srlinux-asymmetric-routing
topology:
nodes:
spine1:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
spine2:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
leaf1:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
leaf2:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
leaf3:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
leaf4:
kind: nokia_srlinux
image: ghcr.io/nokia/srlinux:24.7.1
s1:
kind: linux
image: ghcr.io/srl-labs/network-multitool
exec:
- ip addr add 172.16.10.1/24 dev eth1
- ip route add 172.16.20.0/24 via 172.16.10.254
s2:
kind: linux
image: ghcr.io/srl-labs/network-multitool
exec:
- ip link add bond0 type bond mode 802.3ad
- ip link set eth1 down
- ip link set eth2 down
- ip link set eth1 master bond0
- ip link set eth2 master bond0
- ip addr add 172.16.10.2/24 dev bond0
- ip link set eth1 up
- ip link set eth2 up
- ip link set bond0 up
- ip route add 172.16.20.0/24 via 172.16.10.254
s3:
kind: linux
image: ghcr.io/srl-labs/network-multitool
exec:
- ip addr add 172.16.20.3/24 dev eth1
- ip route add 172.16.10.0/24 via 172.16.20.254
links:
- endpoints: ["leaf1:e1-1", "spine1:e1-1"]
- endpoints: ["leaf1:e1-2", "spine2:e1-1"]
- endpoints: ["leaf2:e1-1", "spine1:e1-2"]
- endpoints: ["leaf2:e1-2", "spine2:e1-2"]
- endpoints: ["leaf3:e1-1", "spine1:e1-3"]
- endpoints: ["leaf3:e1-2", "spine2:e1-3"]
- endpoints: ["leaf4:e1-1", "spine1:e1-4"]
- endpoints: ["leaf4:e1-2", "spine2:e1-4"]
- endpoints: ["leaf1:e1-3", "s1:eth1"]
- endpoints: ["leaf2:e1-3", "s2:eth1"]
- endpoints: ["leaf3:e1-3", "s3:eth2"]
- endpoints: ["leaf4:e1-3", "s3:eth1"]
Note
The server/host (image used is ghcr.io/srl-labs/network-multitool) login credentials are user/multit00l.
The end goal of this post is to ensure that host s1 can communicate with both s2 (same subnet) and s3 (different subnet) using an asymmetric routing model. To that end, the following IPv4 addressing is used (with the IRB addressing following a distributed, anycast model):
Resource             IPv4 scope
Underlay             198.51.100.0/24
system0 interface    192.0.2.0/24
VNI 10010            172.16.10.0/24
VNI 10020            172.16.20.0/24
server s1            172.16.10.1/24
server s2            172.16.10.2/24
server s3            172.16.20.3/24
irb0.10 interface    172.16.10.254/24
irb0.20 interface    172.16.20.254/24
Reviewing the asymmetric routing model
When routing between VNIs in a VXLAN fabric, there are two major routing models that can be used - asymmetric and symmetric. Asymmetric routing, which is the focus of this post, uses a bridge-route-bridge model: the ingress leaf bridges the packet into the Layer 2 domain, routes it from one VLAN/VNI to another and then bridges the packet across the VXLAN fabric to the destination. The asymmetry is in the number of lookups needed on the ingress and egress leafs - on the ingress leaf, a MAC lookup, an IP lookup and then another MAC lookup are performed, while on the egress leaf, only a MAC lookup is performed. Since the ingress leaf routes the packet locally, there is asymmetry in the VNI as well - the VNI in the packet from the source to the destination is different from the VNI used when the destination responds back to the source.
Such a design naturally implies that both the source and the destination IRBs (and the corresponding Layer 2 domains and bridge tables) must exist on all leafs hosting servers that need to communicate with each other. While this increases the operational state on the leafs themselves (ARP state and MAC address state are stored everywhere), it does offer configuration and operational simplicity.
Configuration walkthrough
With a basic understanding of the asymmetric routing model, let's start to configure this fabric. This configuration walkthrough includes building out the entire fabric from scratch - only the base configuration, loaded with Containerlab by default, exists on all nodes.
Point-to-point interfaces
The underlay of the fabric includes the physically connected point-to-point interfaces between the leafs and the spines, the IPv4/IPv6 addressing used for these interfaces and a routing protocol, deployed to distribute the loopback (system0) addresses across the fabric, with the simple end goal of achieving reachability between these loopback addresses. The configuration for these point-to-point addresses is shown below from all the nodes.
Notice that the configuration for multiple interfaces is shown with a single command using the concept of ranges. Different ways of doing this are shown, with one style used for the leafs and another for the spines. With interface ethernet-1/{1,2}, the comma separation allows the user to enter any set of numbers (contiguous or not), which are subsequently expanded. Thus, this expands to interface ethernet-1/1 and interface ethernet-1/2. On the other hand, you can also provide a contiguous range of numbers by using .., as shown for the spines. In that case, interface ethernet-1/{1..4} implies ethernet-1/1 through ethernet-1/4.
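As a reference point, a minimal sketch of what the leaf1 side of this could look like is shown below, entered as set commands in candidate mode. The /31 point-to-point addressing and the 192.0.2.11/32 system address are illustrative assumptions derived from the addressing plan above, and exact paths may vary slightly between SR Linux releases.
enter candidate
set / interface ethernet-1/{1,2} admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/1 subinterface 0 ipv4 address 198.51.100.0/31
set / interface ethernet-1/2 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/2 subinterface 0 ipv4 address 198.51.100.2/31
set / interface system0 admin-state enable
set / interface system0 subinterface 0 ipv4 admin-state enable
set / interface system0 subinterface 0 ipv4 address 192.0.2.11/32
commit now
The remaining sketches in this post assume the same candidate-mode workflow and omit the enter candidate and commit now steps for brevity.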
Note
Remember, by default, there is no global routing instance/table in SR Linux. A network-instance of type default must be configured, and these interfaces, including the system0 interface, need to be added to this network-instance for point-to-point connectivity.
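Building on the note above, a sketch of binding these subinterfaces to the default network-instance on a leaf might look like this (same caveat on release-specific paths):
set / network-instance default type default
set / network-instance default admin-state enable
set / network-instance default interface ethernet-1/1.0
set / network-instance default interface ethernet-1/2.0
set / network-instance default interface system0.0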
Underlay and overlay BGP
For the underlay, eBGP is used to advertise the system0 interface addresses. SR Linux adapts eBGP behavior for the L2VPN EVPN AFI/SAFI: the next-hop address is not modified at every eBGP hop, and routes are originated with the system0 interface address as the next-hop rather than the Layer 3 interface address over which the peering is formed. Because of this, the EVPN address-family can simply be enabled over the same peering, leveraging MP-BGP functionality. BGP is configured under the default network-instance since this is for the underlay in the global routing table.
The BGP configuration from all nodes is shown below:
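Since the full per-node configuration is lengthy, here is a hedged sketch of the leaf1 side only. The AS numbers (65411 for leaf1, 65500 for the spines) and the spine neighbor addresses are inferred from the route outputs later in this post; the leaf-export/leaf-import policy names are assumptions, and exact attribute paths may differ slightly by release.
set / network-instance default protocols bgp autonomous-system 65411
set / network-instance default protocols bgp router-id 192.0.2.11
set / network-instance default protocols bgp group spine peer-as 65500
set / network-instance default protocols bgp group spine export-policy leaf-export
set / network-instance default protocols bgp group spine import-policy leaf-import
set / network-instance default protocols bgp group spine afi-safi ipv4-unicast admin-state enable
set / network-instance default protocols bgp group spine afi-safi evpn admin-state enable
set / network-instance default protocols bgp neighbor 198.51.100.1 peer-group spine
set / network-instance default protocols bgp neighbor 198.51.100.3 peer-group spine
The spines follow the same structure with a leaf peer-group, plus the option described next:
set / network-instance default protocols bgp afi-safi evpn evpn inter-as-vpn true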
On the spines, the configuration option inter-as-vpn must be set to true under the protocols bgp afi-safi evpn evpn hierarchy. Since the spines are not configured as VTEPs and act as pure IP forwarders in this design, there are no Layer 2 or Layer 3 VXLAN constructs on the spines, and therefore no route targets configured for EVPN route import. By default, EVPN routes that match no local import route target are rejected and not advertised to the other leafs; the inter-as-vpn configuration option overrides this behavior.
The BGP configuration defines a peer-group called spine on the leafs and leaf on the spines to build out common configuration that can be applied across multiple neighbors. These peer-groups enable both the IPv4-unicast and EVPN address-families, using MP-BGP to establish a single peering for both families. In addition to this, export and import policies are defined, controlling what routes are exported and imported.
The following packet capture also confirms the MP-BGP capabilities exchanged with the BGP OPEN messages, where both IPv4 unicast and L2VPN EVPN capabilities are advertised:
Routing policies for the underlay and overlay
The configuration of the routing policies used for export and import of BGP routes is shown below. Since the policies for the leafs are the same across all leafs and the policies for the spines are the same across all spines, the configuration is only shown from two nodes, leaf1 and spine1, using them as references.
Just as ranges can be used to pull configuration state from multiple interfaces, a wildcard * can be used to select multiple routing policies. The wildcard spine-* matches both policies named spine-import and spine-export.
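As an illustration, the leaf-side policies could be structured along the lines below. The prefix-set and policy names on the leafs are assumptions, and a complete policy would typically also account for the EVPN address-family and explicit default actions; treat this as a sketch rather than the exact configuration used in the lab.
set / routing-policy prefix-set loopbacks prefix 192.0.2.0/24 mask-length-range 32..32
set / routing-policy policy leaf-export statement 10 match prefix-set loopbacks
set / routing-policy policy leaf-export statement 10 action policy-result accept
set / routing-policy policy leaf-import statement 10 match prefix-set loopbacks
set / routing-policy policy leaf-import statement 10 action policy-result accept
The spine-import and spine-export policies follow the same pattern, accepting the loopback range in both directions.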
Host connectivity and ESI LAG
With BGP configured, we can start to deploy connectivity to the servers and configure the necessary VXLAN constructs for end-to-end connectivity. The interfaces to the servers are configured as untagged interfaces. Since server s2 is multi-homed to leaf2 and leaf3, this segment is configured as an Ethernet Segment mapped to a LAG interface (a sketch follows this list). This includes:
Mapping the physical interface to a LAG interface (lag1, in this case).
The LAG interface is configured with the required LACP properties - mode active and a system-mac of 00:00:00:00:23:23. This LAG interface is also configured with a subinterface of type bridged.
An Ethernet Segment defined under the system network-instance protocols evpn ethernet-segments hierarchy.
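On leaf2 and leaf3, this could be sketched as follows. Only the LACP mode and system-mac come from the description above; the Ethernet Segment name and ESI value are placeholders for illustration, and paths may vary slightly by release.
set / interface ethernet-1/3 ethernet aggregate-id lag1
set / interface lag1 admin-state enable
set / interface lag1 lag lag-type lacp
set / interface lag1 lag lacp lacp-mode ACTIVE
set / interface lag1 lag lacp system-id-mac 00:00:00:00:23:23
set / interface lag1 subinterface 0 type bridged
set / system network-instance protocols evpn ethernet-segments bgp-instance 1 ethernet-segment ES-s2 admin-state enable
set / system network-instance protocols evpn ethernet-segments bgp-instance 1 ethernet-segment ES-s2 esi 00:00:00:00:00:00:23:23:00:01
set / system network-instance protocols evpn ethernet-segments bgp-instance 1 ethernet-segment ES-s2 multi-homing-mode all-active
set / system network-instance protocols evpn ethernet-segments bgp-instance 1 ethernet-segment ES-s2 interface lag1
set / system network-instance protocols bgp-vpn bgp-instance 1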
On each leaf, VXLAN tunnel-interfaces are created next. Since this is asymmetric routing, every VNI must exist on every leaf that needs to route between the respective VNIs. Because the end goal is only to have server s1 communicate with s2 and s3, leaf1 and leaf4 are configured with two logical interfaces, one for VNI 10010 and another for VNI 10020, while leaf2 and leaf3 are configured with VNI 10010 only.
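On leaf1 and leaf4, this could look as follows (leaf2 and leaf3 would only define the first logical interface). The vxlan1.1-to-VNI-10010 and vxlan1.2-to-VNI-10020 mapping is consistent with the bridge-table output shown later in this post.
set / tunnel-interface vxlan1 vxlan-interface 1 type bridged
set / tunnel-interface vxlan1 vxlan-interface 1 ingress vni 10010
set / tunnel-interface vxlan1 vxlan-interface 2 type bridged
set / tunnel-interface vxlan1 vxlan-interface 2 ingress vni 10020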
IRBs are deployed using an anycast, distributed gateway model, implying that all leafs are configured with the same IP address and MAC address for a specific IRB subinterface. These IRB subinterfaces act as the default gateway for the endpoints. For our topology, we will create two subinterfaces, irb0.10 and irb0.20, corresponding to hosts mapped to VNIs 10010 and 10020, respectively. The configuration of these IRB interfaces is shown below:
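A sketch for leaf1 and leaf4 is shown below (leaf2 and leaf3 would only carry irb0.10). The anycast gateway MAC values are illustrative placeholders; everything else maps directly to the options described after this block.
set / interface irb0 subinterface 10 admin-state enable
set / interface irb0 subinterface 10 anycast-gw anycast-gw-mac 02:AB:10:00:00:10
set / interface irb0 subinterface 10 ipv4 admin-state enable
set / interface irb0 subinterface 10 ipv4 address 172.16.10.254/24 anycast-gw true
set / interface irb0 subinterface 10 ipv4 arp learn-unsolicited true
set / interface irb0 subinterface 10 ipv4 arp host-route populate dynamic
set / interface irb0 subinterface 10 ipv4 arp evpn advertise dynamic
set / interface irb0 subinterface 20 admin-state enable
set / interface irb0 subinterface 20 anycast-gw anycast-gw-mac 02:AB:20:00:00:20
set / interface irb0 subinterface 20 ipv4 admin-state enable
set / interface irb0 subinterface 20 ipv4 address 172.16.20.254/24 anycast-gw true
set / interface irb0 subinterface 20 ipv4 arp learn-unsolicited true
set / interface irb0 subinterface 20 ipv4 arp host-route populate dynamic
set / interface irb0 subinterface 20 ipv4 arp evpn advertise dynamic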
There is a lot going on here, so let's break down some of the configuration options:
anycast-gw [true|false]
When this is set to true, the IPv4 address is associated to the anycast gateway MAC address and this MAC address is used to respond to any ARP requests for that IPv4 address. This also allows the same IPv4 address to be configured on other nodes for the same broadcast domain, essentially suppressing duplicate IP detection.
anycast-gw anycast-gw-mac [mac-address]
The MAC address configured with this option is the anycast gateway MAC address and is associated with the IP address for that subinterface. If this is omitted, the anycast gateway MAC address is auto-derived from the VRRP MAC address group range.
arp learn-unsolicited [true|false]
This enables the node to learn IP-to-MAC bindings from any ARP packet it receives (including gratuitous ARPs), not just from replies to ARP requests that the node itself sent.
arp host-route populate dynamic
This enables the node to insert a host route (/32 for IPv4 and /128 for IPv6) into the routing table from dynamic ARP entries.
arp evpn advertise [dynamic|static]
This enables the node to advertise EVPN Type-2 MAC+IP routes from dynamic or static ARP entries.
MAC VRFs on leafs
Finally, MAC VRFs are created on the leafs to create a broadcast domain and a corresponding bridge table for Layer 2 learning. Since, by default, a MAC VRF corresponds to a single broadcast domain and bridge table, only one Layer 2 VNI can be mapped to it. Thus, on leaf1 and leaf4, two MAC VRFs are created - one for VNI 10010 and another for VNI 10020. Under the MAC VRF, there are several important things to consider (a configuration sketch follows this list):
The Layer 2 subinterface is bound to the MAC VRF using the interface configuration option.
The corresponding IRB subinterface is bound to the MAC VRF using the interface configuration option.
The VXLAN tunnel subinterface is bound to the MAC VRF using the vxlan-interface configuration option.
BGP EVPN learning is enabled for the MAC VRF using the protocols bgp-evpn hierarchy and the MAC VRF is bound to an EVI (EVPN virtual instance).
The ecmp configuration option determines how many VTEPs can be considered for load-balancing by the local VTEP (more on this in the validation section). This is for overlay ECMP (for multihomed hosts).
Route distinguishers and route targets are configured for the MAC VRF using the protocols bgp-vpn hierarchy.
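Sketching this for macvrf1 on leaf1: the EVI, RD and route-target values below are assumptions (only target:20:20 for the second MAC VRF is confirmed by the route output later), and on leaf2 and leaf3 the bound subinterface would be lag1.0 instead of ethernet-1/3.0.
set / network-instance macvrf1 type mac-vrf
set / network-instance macvrf1 admin-state enable
set / network-instance macvrf1 interface ethernet-1/3.0
set / network-instance macvrf1 interface irb0.10
set / network-instance macvrf1 vxlan-interface vxlan1.1
set / network-instance macvrf1 protocols bgp-evpn bgp-instance 1 admin-state enable
set / network-instance macvrf1 protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.1
set / network-instance macvrf1 protocols bgp-evpn bgp-instance 1 evi 10
set / network-instance macvrf1 protocols bgp-evpn bgp-instance 1 ecmp 2
set / network-instance macvrf1 protocols bgp-vpn bgp-instance 1 route-distinguisher rd 192.0.2.11:1
set / network-instance macvrf1 protocols bgp-vpn bgp-instance 1 route-target export-rt target:10:10
set / network-instance macvrf1 protocols bgp-vpn bgp-instance 1 route-target import-rt target:10:10
On leaf1 and leaf4, macvrf2 follows the same pattern with irb0.20, vxlan1.2, EVI 20 and target:20:20.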
This completes the configuration walkthrough section of this post. Next, we'll cover the control plane and data plane validation.
Control plane & data plane validation
When the hosts come online, they typically send a GARP to ensure there is no duplicate IP address in their broadcast domain. This enables the locally attached leafs to learn the IP-to-MAC binding and build an ARP entry in the ARP cache table (since the arp learn-unsolicited configuration option is set to true). This, in turn, is advertised as an EVPN Type-2 MAC+IP route for remote leafs to learn this as well and eventually insert the IP-to-MAC binding as an entry in their ARP caches.
On leaf1, we can confirm that it has learnt the IP-to-MAC binding for server s1 (locally attached) and s3 (attached to remote leaf, leaf4).
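If you want to check this yourself, the ARP cache can be inspected per IRB subinterface; a command along these lines should work on recent releases (output omitted here):
show arpnd arp-entries interface irb0 subinterface 10
show arpnd arp-entries interface irb0 subinterface 20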
The ARP entry for server s3 (172.16.20.3) is learnt via the EVPN Type-2 MAC+IP route received from leaf4, as shown below.
--{ + running }--[ ]--
A:leaf1# show network-instance default protocols bgp routes evpn route-type 2 ip-address 172.16.20.3 detail
---------------------------------------------------------------------------------------------------------------------------
Show report for the EVPN routes in network-instance "default"
---------------------------------------------------------------------------------------------------------------------------
Route Distinguisher: 192.0.2.14:2
Tag-ID : 0
MAC address : AA:C1:AB:9F:EF:E2
IP Address : 172.16.20.3
neighbor : 198.51.100.1
Received paths : 1
Path 1: <Best,Valid,Used,>
ESI : 00:00:00:00:00:00:00:00:00:00
Label : 10020
Route source : neighbor 198.51.100.1 (last modified 4d18h49m3s ago)
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
Invalid Reason : None
Tie Break Reason : none
Path 1 was advertised to (Modified Attributes):
[ 198.51.100.3 ]
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65411, 65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
---------------------------------------------------------------------------------------------------------------------------
Route Distinguisher: 192.0.2.14:2
Tag-ID : 0
MAC address : AA:C1:AB:9F:EF:E2
IP Address : 172.16.20.3
neighbor : 198.51.100.3
Received paths : 1
Path 1: <Valid,>
ESI : 00:00:00:00:00:00:00:00:00:00
Label : 10020
Route source : neighbor 198.51.100.3 (last modified 4d18h49m0s ago)
Route preference : No MED, No LocalPref
Atomic Aggr : false
BGP next-hop : 192.0.2.14
AS Path : i [65500, 65414]
Communities : [target:20:20, bgp-tunnel-encap:VXLAN]
RR Attributes : No Originator-ID, Cluster-List is []
Aggregation : None
Unknown Attr : None
Invalid Reason : None
Tie Break Reason : peer-router-id
---------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
This is an important step for asymmetric routing. Consider a situation where server s1 wants to communicate with s3. When the IP packet hits leaf1, it will attempt to resolve the destination IP address via an ARP request, as the destination subnet is directly attached locally (via the irb0.20 interface), as shown below.
Since this IRB interface exists on leaf4 as well, the ARP reply will be consumed by it, never reaching leaf1, and thus, creating a failure in the ARP process. To circumvent this problem associated with an anycast, distributed IRB model, the EVPN Type-2 MAC+IP routes are used to populate the ARP cache.
Let's consider two flows to understand the data plane forwarding in such a design - server s1 communicating with s2 (same subnet) and s1 communicating with s3 (different subnet).
Since s1 is in the same subnet as s2, when communicating with s2, s1 will try to resolve its IP address directly via an ARP request. This is received on leaf1 and punted to the CPU via irb0.10. Since L2 proxy-arp is not enabled, the arp_nd_mgr process picks up the ARP request and responds with its own anycast gateway MAC address while suppressing the ARP request from being flooded into the fabric. A packet capture of this ARP reply is shown below.
Once this ARP process completes, server s1 generates an ICMP request (since we are testing communication between hosts using the ping tool). When this IP packet arrives on leaf1, it performs a routing lookup (since the destination MAC address is leaf1's own anycast gateway MAC), and this lookup hits the 172.16.10.0/24 prefix, as shown below. Since this is a directly attached route, it is further resolved into a MAC address via the ARP table and the packet is then bridged towards the destination. This MAC address points to an Ethernet Segment, which in turn resolves into VTEPs 192.0.2.12 and 192.0.2.13.
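To trace that resolution on leaf1 yourself, the bridge table of the corresponding MAC VRF can be dumped (assuming the same macvrf1 naming used in the sketches above); s2's MAC entry should point to the Ethernet Segment destination rather than a single VTEP:
show network-instance macvrf1 bridge-table mac-table all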
A packet capture of the in-flight packet (as leaf1 sends it to spine1) is shown below, confirming that the ICMP request is VXLAN-encapsulated with a VNI of 10010. It also confirms that, because of the L3 proxy-arp approach to suppressing ARP in an EVPN VXLAN fabric, the source MAC address in the inner Ethernet header is the anycast gateway MAC address.
The communication between server s1 and s3 follows a similar pattern - the packet is received in macvrf1 (mapped to VNI 10010), and since the destination MAC address is the anycast gateway MAC address owned by leaf1, it is routed locally into VNI 10020 (since irb0.20 is locally attached) and then bridged across to the destination, as confirmed below:
--{ + running }--[ ]--
A:leaf1# show network-instance default route-table ipv4-unicast route 172.16.20.3
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IPv4 unicast route table of network instance default
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
| Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop | Backup Next- | Backup Next-hop |
| | | | | | Network | | | (Type) | Interface | hop (Type) | Interface |
| | | | | | Instance | | | | | | |
+========================+=======+============+======================+==========+==========+=========+============+===============+===============+===============+==================+
| 172.16.20.0/24 | 5 | local | net_inst_mgr | True | default | 0 | 0 | 172.16.20.254 | irb0.20 | | |
| | | | | | | | | (direct) | | | |
+------------------------+-------+------------+----------------------+----------+----------+---------+------------+---------------+---------------+---------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
--{ + running }--[ ]--
A:leaf1# show network-instance * bridge-table mac-table mac AA:C1:AB:9F:EF:E2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac-table of network instance macvrf2
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mac : AA:C1:AB:9F:EF:E2
Destination : vxlan-interface:vxlan1.2 vtep:192.0.2.14 vni:10020
Dest Index : 322085950242
Type : evpn
Programming Status : Success
Aging : N/A
Last Update : 2024-10-14T01:05:54.000Z
Duplicate Detect time : N/A
Hold down time remaining: N/A
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--{ + running }--[ ]--
Tip
Notice how the previous output used a wildcard for the network-instance name instead of a specific name (show network-instance * bridge-table ...). This is useful since the operator may not always know which MAC VRF is used for forwarding; the wildcard searches across all network-instances to determine where the MAC address was learned.
The following packet capture confirms that the in-flight packet has been routed on the ingress leaf itself (leaf1) and the VNI, in the VXLAN header, is 10020.
Summary
Asymmetric routing uses a bridge-route-bridge model where the packet, from the source, is bridged into the ingress leaf's L2 domain, routed into the destination VLAN/VNI and then bridged across the VXLAN fabric to the destination.
Such a model requires both the source and destination IRBs and the corresponding L2 bridge domains (and L2 VNIs) to exist on all leafs that want to participate in routing between the VNIs. While this is operationally simpler, it does add state, since all leafs must maintain all IP-to-MAC bindings (in the ARP table) and all MAC addresses in the bridge table.