In this post, we deploy a multivendor EVPN L2 overlay fabric, with BGP in the underlay as well. The entire fabric deployment is automated with Ansible, and Containerlab is used to define and deploy the actual topology.
Introduction and topology
This post continues to build and showcase the power of containerlab. In this post, we will create a multivendor EVPN topology for L2 overlays. The vendors included are Arista (vEOS), Juniper (vQFX), Cumulus (Cumulus VX) and Cisco (N9Kv). Every piece of software used here is free (some might be behind a login, so you'll have to register to get access to image downloads).
The topology is as follows:
Arista vEOS is used for both the spines and leaf1. Cumulus VX is leaf2, Cisco N9Kv is leaf3, and Juniper vQFX is leaf4. Behind each leaf is a host, and all hosts are in the same 10.100.100.0/24 subnet. The router-IDs for each spine and leaf, along with the host details, are as follows. The router-ID, with a /32 mask, also makes up the loopback of each device:
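Device   Router-ID        ASN     Attached host
spine1   192.168.101.1    65500   -
spine2   192.168.101.2    65500   -
leaf1    192.168.100.1    64521   h1 (10.100.100.1/24)
leaf2    192.168.100.2    64522   h2 (10.100.100.2/24)
leaf3    192.168.100.3    64523   h3 (10.100.100.3/24)
leaf4    192.168.100.4    64524   h4 (10.100.100.4/24)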
The end goal is to get all the hosts talking to each other, and also to understand whether these vendors play nice with each other when deploying L2 overlays with BGP EVPN and VXLAN (we're in for a few interesting surprises here).
Building the docker images
Juniper vQFX
Juniper vQFX can be built using their qcow2 images that are available publicly. I've already written about this in an earlier blog post with complete instructions on how this is built and what is happening behind the scenes.
Cumulus VX
Cumulus Networks (now NVIDIA) does not officially maintain Docker images for Cumulus VX. However, Michael (@networkop) maintains some unofficial images here. You can simply pull one of these using 'docker pull'.
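For example, pulling the 4.4.0 image used in this post:
root@aninchat-ubuntu:~/clabs# docker pull networkop/cx:4.4.0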
root@aninchat-ubuntu:~/clabs# docker images | grep cx
networkop/cx 5.0.1 4d6152fa636b 2 weeks ago 805MB
networkop/cx 4.4.0 468cdd1a4be5 7 months ago 772MB
He maintains a fairly up-to-date list of corresponding docker images, but I've pulled an older one because I am not too fond of the new NVUE interface that Cumulus Linux has shifted to - I prefer the older NCLU, which 4.4.x still runs.
Arista vEOS
Arista vEOS can be downloaded for free from Arista's software download site. This is locked behind a guest registration though, so you will need to go through that process to get access to the image. Once you have the vEOS image (it should be a vmdk file), place it in the vrnetlab/veos folder.
root@aninchat-ubuntu:~/vrnetlab/veos# pwd
/root/vrnetlab/veos
root@aninchat-ubuntu:~/vrnetlab/veos# ls -l
total 491984
-rw-r--r-- 1 root root 1028 Feb 5 15:41 Makefile
-rw-r--r-- 1 root root 1281 Feb 5 15:41 README.md
drwxr-xr-x 2 root root 4096 Feb 16 06:35 docker
-rw-r--r-- 1 root root 503775232 Feb 16 06:35 vEOS-lab-4.27.2F.vmdk
You can now trigger the docker image build using 'make'. Since I already have the image built, it doesn't do much for me at this point.
root@aninchat-ubuntu:~/vrnetlab/veos# make
Makefile:18: warning: overriding recipe for target 'docker-pre-build'
../makefile.include:18: warning: ignoring old recipe for target 'docker-pre-build'
for IMAGE in vEOS-lab-4.27.2F.vmdk; do \
echo "Making $IMAGE"; \
make IMAGE=$IMAGE docker-build; \
done
Making vEOS-lab-4.27.2F.vmdk
make[1]: Entering directory '/root/vrnetlab/veos'
Makefile:18: warning: overriding recipe for target 'docker-pre-build'
../makefile.include:18: warning: ignoring old recipe for target 'docker-pre-build'
rm -f docker/*.qcow2* docker/*.tgz* docker/*.vmdk* docker/*.iso
# checking if ZTP config contains a string (DISABLE=True) in the file /zerotouch-config
# if it does, we don't need to write this file
Checking ZTP status
ZTPOFF=DISABLE=True; \
echo "docker-pre-build: ZTPOFF is $ZTPOFF" && \
if [ "$ZTPOFF" != "DISABLE=True" ]; then \
echo "Disabling ZTP" && docker run --rm -it -e LIBGUESTFS_DEBUG=0 -v $(pwd):/work cmattoon/guestfish -a vEOS-lab-4.27.2F.vmdk -m /dev/sda2 write /zerotouch-config "DISABLE=True"; \
fi
docker-pre-build: ZTPOFF is DISABLE=True
Building docker image using vEOS-lab-4.27.2F.vmdk as vrnetlab/vr-veos:4.27.2F
cp ../common/* docker/
make IMAGE=$IMAGE docker-build-image-copy
make[2]: Entering directory '/root/vrnetlab/veos'
Makefile:18: warning: overriding recipe for target 'docker-pre-build'
../makefile.include:18: warning: ignoring old recipe for target 'docker-pre-build'
cp vEOS-lab-4.27.2F.vmdk* docker/
make[2]: Leaving directory '/root/vrnetlab/veos'
(cd docker; docker build --build-arg http_proxy= --build-arg https_proxy= --build-arg IMAGE=vEOS-lab-4.27.2F.vmdk -t vrnetlab/vr-veos:4.27.2F .)
Sending build context to Docker daemon 503.8MB
Step 1/11 : FROM ubuntu:20.04
---> 54c9d81cbb44
Step 2/11 : LABEL maintainer="Kristian Larsson <kristian@spritelink.net>"
---> Using cache
---> b8a0857a144e
Step 3/11 : LABEL maintainer="Roman Dodin <dodin.roman@gmail.com>"
---> Using cache
---> a22af11cc083
Step 4/11 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> 1735e5bccc44
Step 5/11 : RUN apt-get update -qy && apt-get upgrade -qy && apt-get install -y bridge-utils iproute2 python3-ipy socat qemu-kvm tcpdump tftpd-hpa ssh inetutils-ping dnsutils openvswitch-switch iptables telnet && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 43deb024677e
Step 6/11 : ARG IMAGE
---> Using cache
---> 36e30f700548
Step 7/11 : COPY $IMAGE* /
---> Using cache
---> 858ae07ca107
Step 8/11 : COPY *.py /
---> Using cache
---> 62058063f86c
Step 9/11 : EXPOSE 22 80 161/udp 443 830 5000 6030 10000-10099 57400
---> Using cache
---> 53c2102098d6
Step 10/11 : HEALTHCHECK CMD ["/healthcheck.py"]
---> Using cache
---> 9b3ebcc4ab71
Step 11/11 : ENTRYPOINT ["/launch.py"]
---> Using cache
---> 11aed84d6be0
Successfully built 11aed84d6be0
Successfully tagged vrnetlab/vr-veos:4.27.2F
make[1]: Leaving directory '/root/vrnetlab/veos'
Cisco N9Kv
Cisco's N9Kv is also available for free (again, locked behind an account registration). Similar to the earlier images, download the qcow2 image and move it to the vrnetlab/n9kv folder.
root@aninchat-ubuntu:~/vrnetlab/n9kv# pwd
/root/vrnetlab/n9kv
root@aninchat-ubuntu:~/vrnetlab/n9kv# ls -l
total 1934160
-rw-r--r-- 1 root root 266 Feb 5 15:41 Makefile
-rw-r--r-- 1 root root 585 Feb 5 15:41 README.md
drwxr-xr-x 2 root root 4096 Feb 16 15:54 docker
-rw-r--r-- 1 root root 1980563456 Feb 16 06:51 nxosv.9.3.9.qcow2
Be sure to use the 'n9kv' folder and not the 'nxos' folder - the 'nxos' folder is for the older Titanium images. Once the image is copied here, trigger 'make' to build the docker image for this.
root@aninchat-ubuntu:~/vrnetlab/n9kv# make
for IMAGE in nxosv.9.3.9.qcow2; do \
echo "Making $IMAGE"; \
make IMAGE=$IMAGE docker-build; \
done
Making nxosv.9.3.9.qcow2
make[1]: Entering directory '/root/vrnetlab/n9kv'
rm -f docker/*.qcow2* docker/*.tgz* docker/*.vmdk* docker/*.iso
Building docker image using nxosv.9.3.9.qcow2 as vrnetlab/vr-n9kv:9.3.9
cp ../common/* docker/
make IMAGE=$IMAGE docker-build-image-copy
make[2]: Entering directory '/root/vrnetlab/n9kv'
cp nxosv.9.3.9.qcow2* docker/
make[2]: Leaving directory '/root/vrnetlab/n9kv'
(cd docker; docker build --build-arg http_proxy= --build-arg https_proxy= --build-arg IMAGE=nxosv.9.3.9.qcow2 -t vrnetlab/vr-n9kv:9.3.9 .)
Sending build context to Docker daemon 1.985GB
Step 1/12 : FROM ubuntu:20.04
---> 54c9d81cbb44
Step 2/12 : LABEL maintainer="Kristian Larsson <kristian@spritelink.net>"
---> Using cache
---> b8a0857a144e
Step 3/12 : LABEL maintainer="Roman Dodin <dodin.roman@gmail.com>"
---> Using cache
---> a22af11cc083
Step 4/12 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> 1735e5bccc44
Step 5/12 : RUN apt-get update -qy && apt-get upgrade -qy && apt-get install -y bridge-utils iproute2 python3-ipy socat qemu-kvm tcpdump tftpd-hpa ssh inetutils-ping dnsutils openvswitch-switch iptables telnet && rm -rf /var/lib/apt/lists/*
---> Using cache
---> 43deb024677e
Step 6/12 : ARG IMAGE
---> Using cache
---> 36e30f700548
Step 7/12 : COPY $IMAGE* /
---> Using cache
---> 66fbaa5d2045
Step 8/12 : COPY OVMF.fd /
---> Using cache
---> 87c96b2ebf08
Step 9/12 : COPY *.py /
---> Using cache
---> 756902fd2036
Step 10/12 : EXPOSE 22 80 161/udp 443 830 5000 6030 10000-10099 57400
---> Using cache
---> df3be4a02195
Step 11/12 : HEALTHCHECK CMD ["/healthcheck.py"]
---> Using cache
---> 5143e36e8998
Step 12/12 : ENTRYPOINT ["/launch.py"]
---> Using cache
---> 0eb634fdcd16
Successfully built 0eb634fdcd16
Successfully tagged vrnetlab/vr-n9kv:9.3.9
make[1]: Leaving directory '/root/vrnetlab/n9kv'
Hosts
For hosts, I like to use Michael's host docker image (networkop/host), which is based on Ubuntu 20.04. You can pull the latest tag ('ifreload') from Docker Hub.
Note that the 'binds' block allows us to bind a file in the host OS to the container itself. In this case, I am simply pre-configuring the network interfaces for the fabric hosts by copying relevant configuration into the /etc/network/interfaces file of the container.
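A minimal sketch of what this looks like for h1 in the containerlab topology file (only the node definition is shown here; the rest of the topology is elided):
topology:
  nodes:
    h1:
      kind: linux
      image: networkop/host:ifreload
      binds:
        - hosts/h1_interfaces:/etc/network/interfaces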
For example, let's take host h1 and its file, hosts/h1_interfaces:
root@aninchat-ubuntu:~/clabs/multivendor# cat hosts/h1_interfaces
auto lo
iface lo inet loopback
auto eth1
iface eth1 inet static
address 10.100.100.1
netmask 255.255.255.0
All my hosts are in the same subnet, so h2 is 10.100.100.2/24, h3 is 10.100.100.3/24 and h4 is 10.100.100.4/24.
Deploying the lab
We can now deploy this lab using 'containerlab deploy':
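The topology file name below is a placeholder - point '-t' at this lab's topology file:
root@aninchat-ubuntu:~/clabs/multivendor# containerlab deploy -t <topology-file>.yml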
Remember to wait until the containers are reported as healthy. If they remain unhealthy, something went wrong with the docker build and you should probably revisit that step.
root@aninchat-ubuntu:~/clabs/multivendor# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
86191b18f7f9 networkop/ignite:dev "/usr/local/bin/igni…" 8 minutes ago Up 7 minutes ignite-b01b941136a18c36
b628b4de29b9 vrnetlab/vr-veos:4.27.2F "/launch.py --userna…" 8 minutes ago Up 7 minutes (healthy) 22/tcp, 80/tcp, 443/tcp, 830/tcp, 5000/tcp, 6030/tcp, 10000-10099/tcp, 57400/tcp, 161/udp clab-evpn-l2-multivendor-spine1
8a317c31929d vrnetlab/vr-n9kv:9.3.9 "/launch.py --userna…" 8 minutes ago Up 7 minutes (healthy) 22/tcp, 80/tcp, 443/tcp, 830/tcp, 5000/tcp, 6030/tcp, 10000-10099/tcp, 57400/tcp, 161/udp clab-evpn-l2-multivendor-leaf3
ff2871985b4e vrnetlab/vr-veos:4.27.2F "/launch.py --userna…" 8 minutes ago Up 8 minutes (healthy) 22/tcp, 80/tcp, 443/tcp, 830/tcp, 5000/tcp, 6030/tcp, 10000-10099/tcp, 57400/tcp, 161/udp clab-evpn-l2-multivendor-leaf1
3f426a4c2db4 vrnetlab/vr-veos:4.27.2F "/launch.py --userna…" 8 minutes ago Up 7 minutes (healthy) 22/tcp, 80/tcp, 443/tcp, 830/tcp, 5000/tcp, 6030/tcp, 10000-10099/tcp, 57400/tcp, 161/udp clab-evpn-l2-multivendor-spine2
690414536e2b vrnetlab/vr-vqfx:20.2R1.10 "/launch.py --userna…" 8 minutes ago Up 7 minutes (healthy) 22/tcp, 830/tcp, 5000/tcp, 10000-10099/tcp, 161/udp clab-evpn-l2-multivendor-leaf4
e54edb974d39 networkop/host:ifreload "/entrypoint.sh" 8 minutes ago Up 7 minutes clab-evpn-l2-multivendor-h1
11a73b7514c0 networkop/host:ifreload "/entrypoint.sh" 8 minutes ago Up 7 minutes clab-evpn-l2-multivendor-h2
a02ead147994 networkop/host:ifreload "/entrypoint.sh" 8 minutes ago Up 7 minutes clab-evpn-l2-multivendor-h4
9f22eff7b0ca networkop/host:ifreload "/entrypoint.sh" 8 minutes ago Up 7 minutes clab-evpn-l2-multivendor-h3
Automating the fabric bringup
In order to automate the fabric bringup, I've written a (terrible) Ansible script. The script can be found here. It configures the p2p interfaces between the leafs and the spines, underlay routing (BGP), overlay routing (BGP EVPN), and all of the necessary VXLAN and VLAN configuration for an L2 overlay.
Ansible inventory and variables for fabric deployment
The inventory for the script is built from the IP addresses that were statically assigned in the containerlab topology file. The inventory is also crucial for grouping devices together, allowing group variables to kick in:
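Shape-wise, it looks something like this (management IPs are placeholders; the 'eos' group name matches the group_vars file discussed below, and the other group names are assumed analogous):
[eos]
spine1 ansible_host=<mgmt-ip>
spine2 ansible_host=<mgmt-ip>
leaf1  ansible_host=<mgmt-ip>

[cumulus]
leaf2  ansible_host=<mgmt-ip>

[nxos]
leaf3  ansible_host=<mgmt-ip>

[junos]
leaf4  ansible_host=<mgmt-ip>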
We're essentially defining fabric-specific attributes that will be used to build the configuration of each device via a Jinja2 template in the Ansible playbook.
Grouping devices together also allows us to define group variables, under 'group_vars':
These mostly contain connection details, plus a few other variables I've set to drive conditionals in the Ansible playbook - the eos.yml file from 'group_vars' is a good example.
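A representative sketch of what it carries (the credential values here are assumptions; 'platform' is the variable used by the playbook's conditionals and template lookups):
ansible_connection: network_cli
ansible_network_os: eos
ansible_user: admin
ansible_password: admin
platform: eos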
To confirm that your inventory is built as expected and that all of the variables are inherited correctly, use the 'ansible-inventory' command as follows (only a snippet is shown here for brevity):
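The invocation itself ('inventory' is a placeholder for whatever the inventory file is named in the repo):
root@aninchat-ubuntu:~/clabs/multivendor# ansible-inventory -i inventory --list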
Jinja2 templates for building device configuration
Jinja2 templates are a great way to build the actual configuration. They also lend themselves very well to a multivendor deployment, where each vendor may have a different configuration style and you cannot have one pre-defined set for all network operating systems. All templates are stored under the 'templates' folder:
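Roughly, the layout looks like this (only the two underlay_interfaces.j2 files below are confirmed in this post; the remaining file names are elided):
templates/
  eos/
    underlay_interfaces.j2
    ...
  junos/
    ...
  nxos/
    underlay_interfaces.j2
    ...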
The directory is structured per vendor/OS (excluding Cumulus; see the quirks section below for that). For every vendor, I have a Jinja2 template for each part of the configuration. For example, let's take EOS and NXOS to compare.
For EOS:
root@aninchat-ubuntu:~/clabs/multivendor# cat templates/eos/underlay_interfaces.j2
{% for interface in interfaces -%}
interface {{ interface }}
{% if interface == 'Loopback0' -%}
   ip address {{ interfaces[interface].address }}
{% else -%}
   no switchport
   ip address {{ interfaces[interface].address }}
{% endif %}
{% endfor %}
For NXOS:
root@aninchat-ubuntu:~/clabs/multivendor# cat templates/nxos/underlay_interfaces.j2
feature lldp
{% for interface in interfaces -%}
interface {{ interface }}
{% if interface == 'loopback0' -%}
  ip address {{ interfaces[interface].address }}
{% else -%}
  no switchport
  ip address {{ interfaces[interface].address }}
  no shutdown
{% endif %}
{% endfor -%}
As you can see, the configuration lines in the EOS template are indented with three spaces, while the NXOS template uses two. This matches how each NOS renders its own running configuration, and it becomes crucial for idempotency: if the pushed configuration doesn't exactly match what the device itself would render, the config modules will report a change on every run. This Jinja2 template is finally pulled into the playbook with a template lookup.
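A sketch of that task (mirroring the fuller Cumulus/non-Cumulus block shown later in this section):
- name: configure underlay interfaces
  when: platform != 'cumulus'
  cli_config:
    config: "{{ lookup('template', 'templates/{{ platform }}/underlay_interfaces.j2') }}"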
'{{ platform }}' is substituted at run time with the actual platform variable from the flat inventory. This simple hierarchy allows for an easy lookup into a vendor-specific template.
Small quirks and other noteworthy points regarding fabric automation
The playbook can now be executed to automate the full deployment. A couple of noteworthy things, specific to some of these vendors and their quirks:
With Arista's EOS (and vEOS), you need to enable the multi-agent service model (the default is ribd, which is single-agent). The catch is that after enabling this, the device must be reloaded - this is true even for the actual hardware platforms, not just vEOS. As part of this automation, I am enabling multi-agent, but you, as the user, must reload the box at least once to get the BGP EVPN peerings to come up (you'll see IPv4 unicast BGP come up, but no other address family will work until a reload).
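For reference, the knob is this single line of EOS configuration (the reload afterwards is still on you):
service routing protocols model multi-agent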
Juniper's vQFX takes a while to bring up all of the network interfaces (the 'xe' interfaces), so be patient. You can verify whether the interfaces have come up using the 'show interfaces terse' command. Until then, do not run the Ansible playbook.
Outside of these things, there were certain places where my hands felt tied while building the actual Ansible playbook. For example, for the Cumulus automation, I use the 'community.network.nclu' Ansible module. This module can take a template for configuration; however, it doesn't seem to accept a template path and instead expects you to define the template inline, which is really odd. This bloats up the playbook considerably - an example (taken from the playbook) below:
- name: configure interfaces on network devices
  block:
    - name: configure interfaces on non-cumulus switches
      when: platform != 'cumulus'
      cli_config:
        config: "{{ lookup('template', 'templates/{{ platform }}/underlay_interfaces.j2') }}"
    - name: configure interfaces on cumulus devices
      when: platform == 'cumulus'
      community.network.nclu:
        template: |
          {% for interface in interfaces -%}
          {% if interface == 'loopback0' -%}
          add loopback lo ip address {{ interfaces[interface].address }}
          {% else -%}
          add interface {{ interface }} ip address {{ interfaces[interface].address }}
          {% endif -%}
          {% endfor -%}
        commit: yes
Along the same lines, I could use the network_cli connection for all other vendors, but not for Cumulus. Because of this, each task in my playbook had to be divided into blocks, where one block configures the non-Cumulus switches and the other configures the Cumulus switches, gated by simple conditional checks (the 'when' statement in Ansible).
Verification of the EVPN fabric
Basic connectivity checks
In this section, let's look at the fabric and confirm whether everything was automated correctly. Both spines have an IPv4 unicast peering with all the leafs:
// spine1
spine1#show bgp ipv4 unicast summary
BGP summary information for VRF default
Router identifier 192.168.101.1, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.10.10.1 4 64521 115 117 0 0 01:32:25 Estab 1 1
10.10.10.3 4 64522 1860 2172 0 0 01:32:25 Estab 1 1
10.10.10.5 4 64523 97 115 0 0 01:32:25 Estab 1 1
10.10.10.7 4 64524 207 223 0 0 01:32:26 Estab 1 1
// spine2
spine2#show bgp ipv4 unicast summary
BGP summary information for VRF default
Router identifier 192.168.101.2, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.10.10.9 4 64521 119 119 0 0 01:34:07 Estab 1 1
10.10.10.11 4 64522 1893 2211 0 0 01:34:07 Estab 1 1
10.10.10.13 4 64523 99 118 0 0 01:34:06 Estab 1 1
10.10.10.15 4 64524 211 226 0 0 01:34:06 Estab 1 1
The EVPN peering is also up:
// spine1
spine1#show bgp evpn summary
BGP summary information for VRF default
Router identifier 192.168.101.1, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
192.168.100.1 4 64521 116 119 0 0 01:33:50 Estab 2 2
192.168.100.2 4 64522 1887 2211 0 0 01:33:50 Estab 3 3
192.168.100.3 4 64523 99 117 0 0 01:33:49 Estab 2 2
192.168.100.4 4 64524 211 229 0 0 01:33:51 Estab 2 2
// spine2
spine2#show bgp evpn summary
BGP summary information for VRF default
Router identifier 192.168.101.2, local AS number 65500
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
192.168.100.1 4 64521 120 121 0 0 01:34:18 Estab 2 2
192.168.100.2 4 64522 1896 2219 0 0 01:34:16 Estab 3 3
192.168.100.3 4 64523 100 118 0 0 01:34:15 Estab 2 2
192.168.100.4 4 64524 212 230 0 0 01:34:16 Estab 2 2
Let's take leaf1 now and confirm that it has received the loopbacks of all the other leafs, and that it has reachability to those loopbacks:
leaf1#show ip route
VRF: default
Codes: C - connected, S - static, K - kernel,
O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
N2 - OSPF NSSA external type2, B - Other BGP Routes,
B I - iBGP, B E - eBGP, R - RIP, I L1 - IS-IS level 1,
I L2 - IS-IS level 2, O3 - OSPFv3, A B - BGP Aggregate,
A O - OSPF Summary, NG - Nexthop Group Static Route,
V - VXLAN Control Service, M - Martian,
DH - DHCP client installed default route,
DP - Dynamic Policy Route, L - VRF Leaked,
G - gRIBI, RC - Route Cache Route
Gateway of last resort is not set
C 10.0.0.0/24 is directly connected, Management1
C 10.10.10.0/31 is directly connected, Ethernet1
C 10.10.10.8/31 is directly connected, Ethernet2
C 192.168.100.1/32 is directly connected, Loopback0
B E 192.168.100.2/32 [200/0] via 10.10.10.0, Ethernet1
via 10.10.10.8, Ethernet2
B E 192.168.100.3/32 [200/0] via 10.10.10.0, Ethernet1
via 10.10.10.8, Ethernet2
B E 192.168.100.4/32 [200/0] via 10.10.10.0, Ethernet1
via 10.10.10.8, Ethernet2
B E 192.168.101.1/32 [200/0] via 10.10.10.0, Ethernet1
B E 192.168.101.2/32 [200/0] via 10.10.10.8, Ethernet2
We see all the loopbacks in there, which is good. Let's ping to confirm reachability, using leaf1's loopback as the source:
leaf1#ping 192.168.100.2 source 192.168.100.1
PING 192.168.100.2 (192.168.100.2) from 192.168.100.1 : 72(100) bytes of data.
80 bytes from 192.168.100.2: icmp_seq=1 ttl=63 time=9.03 ms
80 bytes from 192.168.100.2: icmp_seq=2 ttl=63 time=14.8 ms
80 bytes from 192.168.100.2: icmp_seq=3 ttl=63 time=9.03 ms
80 bytes from 192.168.100.2: icmp_seq=4 ttl=63 time=9.23 ms
80 bytes from 192.168.100.2: icmp_seq=5 ttl=63 time=12.1 ms
--- 192.168.100.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 41ms
rtt min/avg/max/mdev = 9.034/10.847/14.807/2.305 ms, pipe 2, ipg/ewma 10.474/9.926 ms
leaf1#ping 192.168.100.3 source 192.168.100.1
PING 192.168.100.3 (192.168.100.3) from 192.168.100.1 : 72(100) bytes of data.
80 bytes from 192.168.100.3: icmp_seq=1 ttl=254 time=11.6 ms
80 bytes from 192.168.100.3: icmp_seq=2 ttl=254 time=12.1 ms
80 bytes from 192.168.100.3: icmp_seq=3 ttl=254 time=8.81 ms
80 bytes from 192.168.100.3: icmp_seq=4 ttl=254 time=7.50 ms
80 bytes from 192.168.100.3: icmp_seq=5 ttl=254 time=8.88 ms
--- 192.168.100.3 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 47ms
rtt min/avg/max/mdev = 7.502/9.783/12.108/1.774 ms, pipe 2, ipg/ewma 11.764/10.594 ms
leaf1#ping 192.168.100.4 source 192.168.100.1
PING 192.168.100.4 (192.168.100.4) from 192.168.100.1 : 72(100) bytes of data.
80 bytes from 192.168.100.4: icmp_seq=1 ttl=63 time=112 ms
80 bytes from 192.168.100.4: icmp_seq=2 ttl=63 time=200 ms
80 bytes from 192.168.100.4: icmp_seq=3 ttl=63 time=192 ms
80 bytes from 192.168.100.4: icmp_seq=4 ttl=63 time=185 ms
80 bytes from 192.168.100.4: icmp_seq=5 ttl=63 time=176 ms
--- 192.168.100.4 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 42ms
rtt min/avg/max/mdev = 112.935/173.590/200.128/31.282 ms, pipe 5, ipg/ewma 10.747/143.784 ms
Analyzing the EVPN routes for interoperability issues
Clue #1
Let's start by looking at some EVPN routes now. I am obviously aware of some of the interoperability issues in this network, so I'll start dropping hints to help you along the way.
First, because I am using ingress replication, I should see a type-3 IMET route generated by each leaf. We'll stay on leaf1 for this analysis:
leaf1#show bgp evpn route-type imet
BGP routing table information for VRF default
Router identifier 192.168.100.1, local AS number 64521
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
S - Stale, c - Contributing to ECMP, b - backup
% - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric LocPref Weight Path
* >Ec RD: 192.168.100.2:2 imet 192.168.100.2
192.168.100.2 - 100 0 65500 64522 i
* ec RD: 192.168.100.2:2 imet 192.168.100.2
192.168.100.2 - 100 0 65500 64522 i
* >Ec RD: 192.168.100.3:3 imet 192.168.100.3
192.168.100.3 - 100 0 65500 64523 i
* ec RD: 192.168.100.3:3 imet 192.168.100.3
192.168.100.3 - 100 0 65500 64523 i
* > RD: 192.168.100.1:1 imet 192.168.100.1
- - - 0 i
* >Ec RD: 192.168.100.4:4 imet 10100 192.168.100.4
192.168.100.4 - 100 0 65500 64524 i
* ec RD: 192.168.100.4:4 imet 10100 192.168.100.4
192.168.100.4 - 100 0 65500 64524 i
Notice anything weird in the output here? Anything that stands out? Let's look at two routes for comparison - the imet routes from leaf2 and leaf4:
We should also see some type-2 macip routes since all my hosts are up and running:
leaf1#show bgp evpn route-type mac-ip
BGP routing table information for VRF default
Router identifier 192.168.100.1, local AS number 64521
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
S - Stale, c - Contributing to ECMP, b - backup
% - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop
Network Next Hop Metric LocPref Weight Path
* >Ec RD: 192.168.100.3:3 mac-ip aac1.ab48.5bcf
192.168.100.3 - 100 0 65500 64523 i
* ec RD: 192.168.100.3:3 mac-ip aac1.ab48.5bcf
192.168.100.3 - 100 0 65500 64523 i
* >Ec RD: 192.168.100.2:2 mac-ip aac1.ab93.c413
192.168.100.2 - 100 0 65500 64522 i
* ec RD: 192.168.100.2:2 mac-ip aac1.ab93.c413
192.168.100.2 - 100 0 65500 64522 i
* >Ec RD: 192.168.100.2:2 mac-ip aac1.ab93.c413 fe80::a8c1:abff:fe93:c413
192.168.100.2 - 100 0 65500 64522 i
* ec RD: 192.168.100.2:2 mac-ip aac1.ab93.c413 fe80::a8c1:abff:fe93:c413
192.168.100.2 - 100 0 65500 64522 i
* > RD: 192.168.100.1:1 mac-ip aac1.abde.327e
- - - 0 i
* >Ec RD: 192.168.100.4:4 mac-ip 10100 aac1.abe8.a87d
192.168.100.4 - 100 0 65500 64524 i
* ec RD: 192.168.100.4:4 mac-ip 10100 aac1.abe8.a87d
192.168.100.4 - 100 0 65500 64524 i
This is good - we've learned a type-2 macip route for all of our hosts. But again, something is odd here: one of the leafs is slightly different from the others. Which one?
Clue #2
Since we're using ingress replication, we should have a flood list containing all of our leafs, built from the type-3 imet routes. Let's look at the flood list on leaf1 (on EOS, 'show interfaces vxlan 1' displays the headend replication flood list):
That's interesting - leaf4 (vQFX) is not listed in the flood list. Can you figure out why?
Clue #3
We looked at the type-2 macip routes, but we never actually looked at the MAC address table. Since this is an L2-only overlay, we should see the host MAC addresses installed in this table, with a VTEP as the next hop.
Let's look at leaf1's table now:
leaf1#show mac address-table unicast
Mac Address Table
------------------------------------------------------------------
Vlan Mac Address Type Ports Moves Last Move
---- ----------- ---- ----- ----- ---------
100 aac1.ab48.5bcf DYNAMIC Vx1 1 2:36:43 ago
100 aac1.ab93.c413 DYNAMIC Vx1 1 2:37:12 ago
100 aac1.abde.327e DYNAMIC Et3 1 2:32:54 ago
Total Mac Addresses for this criterion: 3
There are only three addresses, but there should be four. You probably know which one is missing - it is h4, the host behind leaf4. But why?
What's really happening
You've probably solved it by now, but I'll break it down anyway. If you go back and check, leaf1 did receive both the type-2 macip route for the host, h4, and the type-3 imet route for ingress replication of BUM traffic.
Let's confirm again, and take a detailed look at it this time:
If you look closely enough, you see that there is a value attached to the MAC address for the type-2 route and to the loopback (or vtep) address for the type-3 imet route.
See that '10100' in there? Well, that's the Ethernet Tag ID. Let's take a BGP packet capture and look at this in the BGP update itself to confirm. I'm going to clear the BGP peerings on leaf1 so the updates are sent again and we can capture them.
This is the type-3 imet route update:
And this is the type-2 macip route update:
In both those cases, we see that the update is tagged with an Ethernet Tag ID of 10100, which corresponds to the VNI.
Remember that an Ethernet Tag ID essentially identifies a broadcast domain in an EVPN instance. However, there are multiple ways that such an instance can be represented. The most common ways are:
VLAN based service
VLAN bundle service
VLAN-aware bundle service
By default, Juniper QFX (and vQFX) runs the VLAN-aware bundle service, which causes a non-zero Ethernet Tag ID to be set in the BGP EVPN updates. This is where the interoperability issues come in - my configuration on the Arista vEOS leaf is based on the VLAN-based service type, and EOS is built to drop updates with a non-zero Ethernet Tag ID when configured for VLAN-based service. This behavior isn't mandated in the RFC, so each vendor has its own implementation. In the same setup, Cumulus VX accepts these updates (it can only do the VLAN-based service type), and the N9Kv has a knob that allows it to accept non-zero Ethernet Tags (it, too, can only do the VLAN-based service type).
// leaf2 - Cumulus VX
root@6ff6c9ba2374bf95:mgmt:~# net show bridge macs | grep a8:7d
100 bridge vni10100 aa:c1:ab:e8:a8:7d extern_learn 03:03:07
untagged vni10100 aa:c1:ab:e8:a8:7d 192.168.100.4 self, extern_learn 03:03:07
// leaf3 - N9Kv
leaf3# show mac address-table
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link,
(T) - True, (F) - False, C - ControlPlane MAC, ~ - vsan
VLAN MAC Address Type age Secure NTFY Ports
---------+-----------------+--------+---------+------+----+------------------
* 100 aac1.ab48.5bcf dynamic 0 F F Eth1/3
C 100 aac1.ab93.c413 dynamic 0 F F nve1(192.168.100.2)
C 100 aac1.abde.327e dynamic 0 F F nve1(192.168.100.1)
C 100 aac1.abe8.a87d dynamic 0 F F nve1(192.168.100.4)
G - 5237.8000.1b08 static - F F sup-eth1(R)
As you can see, both these devices have accepted the routes, despite a non-zero Ethernet Tag ID.
Final thoughts and looking forward
Juniper QFX code was eventually enhanced to interoperate with other vendors' behaviors - with the newer routing-instance type of 'mac-vrf', you can now build any service type you want.
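As a rough sketch of that model (the instance name, VLAN name, RD, and route-target here are made up; VLAN 100 and VNI 10100 match this lab), a vlan-based mac-vrf on Junos looks something like this:
set routing-instances macvrf-v100 instance-type mac-vrf
set routing-instances macvrf-v100 service-type vlan-based
set routing-instances macvrf-v100 protocols evpn encapsulation vxlan
set routing-instances macvrf-v100 vtep-source-interface lo0.0
set routing-instances macvrf-v100 route-distinguisher 192.168.100.4:1
set routing-instances macvrf-v100 vrf-target target:64524:10100
set routing-instances macvrf-v100 vlans v100 vlan-id 100
set routing-instances macvrf-v100 vlans v100 vxlan vni 10100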
If any of you actually deployed this in your labs, you'd see that h1 cannot ping h4:
root@h1:~# ping 10.100.100.4
PING 10.100.100.4 (10.100.100.4) 56(84) bytes of data.
From 10.100.100.1 icmp_seq=1 Destination Host Unreachable
From 10.100.100.1 icmp_seq=2 Destination Host Unreachable
From 10.100.100.1 icmp_seq=3 Destination Host Unreachable
Until... h4 pings h1, and then suddenly it starts working!
root@h4:~# ping 10.100.100.1
PING 10.100.100.1 (10.100.100.1) 56(84) bytes of data.
64 bytes from 10.100.100.1: icmp_seq=1 ttl=64 time=167 ms
64 bytes from 10.100.100.1: icmp_seq=2 ttl=64 time=104 ms
64 bytes from 10.100.100.1: icmp_seq=3 ttl=64 time=104 ms
^C
--- 10.100.100.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 103.765/124.966/166.897/29.652 ms
Interesting, interesting. Any guesses why? We'll look at this in detail in part 2 of this EVPN multivendor deployment, so stay tuned!