VXLAN Flood and Learn

Updated: Feb 25, 2019



Unicast underlay


The objective of the unicast underlay design is to ensure that the leaf switches are reachable from each other. Specifically, the loopbacks being used as the source of the VXLAN tunnel interface (the NVE, or Network Virtualization Edge, interface) should be pingable when sourced from the local loopback. For example, if N7K1-LEAF1 uses Loopback0 as the source of its NVE interface and N7K2-LEAF1 uses Loopback0 as well, then we should be able to ping between the loopbacks.


OSPF will be our choice of IGP to provide this unicast reachability between leaf switches.
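

For reference, here is a minimal sketch of the underlying OSPF configuration assumed to already be in place between the spine and leaf switches (process tag 1 and area 0.0.0.0, matching the loopback configuration below; the fabric interface shown is just one example):


N7K1-LEAF1(config)# feature ospf

N7K1-LEAF1(config)# router ospf 1

N7K1-LEAF1(config)# int eth4/9

N7K1-LEAF1(config-if)# ip router ospf 1 area 0.0.0.0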


As a first step, configure these loopbacks on both leaf switches and advertise them into OSPF (note that OSPF is already running between the spine and leaf switches).


N7K1-LEAF1# show run int lo0


!Command: show running-config interface loopback0

!Time: Thu Jan 11 11:07:10 2018


version 7.3(2)D1(1)


interface loopback0

ip address 1.1.1.11/32

ip router ospf 1 area 0.0.0.0


N7K2-LEAF1# show run int lo0


!Command: show running-config interface loopback0

!Time: Thu Jan 11 11:07:56 2018


version 7.3(2)D1(1)


interface loopback0

ip address 1.1.1.21/32

ip router ospf 1 area 0.0.0.0


Ping between loopbacks to confirm unicast reachability.


N7K1-LEAF1# ping 1.1.1.21 source 1.1.1.11

PING 1.1.1.21 (1.1.1.21) from 1.1.1.11: 56 data bytes

64 bytes from 1.1.1.21: icmp_seq=0 ttl=253 time=1.238 ms

64 bytes from 1.1.1.21: icmp_seq=1 ttl=253 time=0.906 ms

64 bytes from 1.1.1.21: icmp_seq=2 ttl=253 time=0.907 ms

64 bytes from 1.1.1.21: icmp_seq=3 ttl=253 time=0.899 ms

64 bytes from 1.1.1.21: icmp_seq=4 ttl=253 time=0.83 ms


--- 1.1.1.21 ping statistics ---

5 packets transmitted, 5 packets received, 0.00% packet loss

round-trip min/avg/max = 0.83/0.956/1.238 ms


This confirms unicast reachability for the underlay.


Multicast underlay


We have confirmed PIM sparse-mode is running between the spine and leaf switches. N77K1-SPINE1 is statically configured as the RP on all boxes.
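

As a hedged sketch, the static RP configuration applied on each box would look something like this (the RP address and group range match the outputs below):


N7K1-LEAF1(config)# feature pim

N7K1-LEAF1(config)# ip pim rp-address 11.11.11.11 group-list 224.0.0.0/4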


N7K1-LEAF1# show ip pim interface brief

PIM Interface Status for VRF "default"

Interface IP Address PIM DR Address Neighbor Border

Count Interface


Ethernet4/9 10.1.1.1 10.1.1.2 1 no

Ethernet4/10 10.1.1.5 10.1.1.6 1 no


N7K2-LEAF1# show ip pim interface brief

PIM Interface Status for VRF "default"

Interface IP Address PIM DR Address Neighbor Border

Count Interface


Ethernet4/9 10.1.1.9 10.1.1.10 1 no

Ethernet4/10 10.1.1.13 10.1.1.14 1 no


N7K1-LEAF1# show ip pim rp

PIM RP Status Information for VRF "default"

BSR disabled

Auto-RP disabled

BSR RP Candidate policy: None

BSR RP policy: None

Auto-RP Announce policy: None

Auto-RP Discovery policy: None


RP: 11.11.11.11, (0),

uptime: 00:02:58 priority: 0,

RP-source: (local),

group ranges:

224.0.0.0/4


N7K2-LEAF1# show ip pim rp

PIM RP Status Information for VRF "default"

BSR disabled

Auto-RP disabled

BSR RP Candidate policy: None

BSR RP policy: None

Auto-RP Announce policy: None

Auto-RP Discovery policy: None


RP: 11.11.11.11, (0),

uptime: 00:03:09 priority: 0,

RP-source: (local),

group ranges:

224.0.0.0/4


Remember to enable PIM sparse-mode on the loopback interfaces as well:


N7K1-LEAF1(config)# int lo0

N7K1-LEAF1(config-if)# ip pim sparse-mode


N7K2-LEAF1(config)# int lo0

N7K2-LEAF1(config-if)# ip pim sparse-mode


It is important to test if your multicast underlay is functioning as expected. A simple way to do this is to have one of the leaf switches join a multicast group (basically, acting as a client) and have the other leaf switch ping this group. If your underlay is correctly configured, then you should see responses to these pings.


N7K2-LEAF1(config)# int lo0

N7K2-LEAF1(config-if)# ip igmp join-group 224.1.1.1


On the RP, we can see this join was received and processed successfully:


N77K1-SPINE1# show ip mroute 224.1.1.1

IP Multicast Routing Table for VRF "default"


(*, 224.1.1.1/32), uptime: 00:01:02, pim ip

Incoming interface: loopback0, RPF nbr: 11.11.11.11

Outgoing interface list: (count: 1)

Ethernet5/22, uptime: 00:01:02, pim


Now from the other leaf, initiate a ping:


N7K1-LEAF1# ping multicast 224.1.1.1 interface eth4/9 source 10.1.1.1

PING 224.1.1.1 (224.1.1.1) from 10.1.1.1: 56 data bytes

Request 0 timed out

64 bytes from 10.1.1.9: icmp_seq=1 ttl=253 time=1.454 ms

64 bytes from 10.1.1.9: icmp_seq=2 ttl=253 time=1.23 ms

64 bytes from 10.1.1.9: icmp_seq=3 ttl=253 time=0.987 ms

64 bytes from 10.1.1.9: icmp_seq=4 ttl=253 time=0.964 ms


--- 224.1.1.1 ping multicast statistics ---

5 packets transmitted,

From member 10.1.1.9: 4 packets received, 20.00% packet loss

--- in total, 1 group member responded ---


You can confirm (S,G) state on the RP:


N77K1-SPINE1# show ip mroute 224.1.1.1

IP Multicast Routing Table for VRF "default"


(*, 224.1.1.1/32), uptime: 00:03:34, pim ip

Incoming interface: loopback0, RPF nbr: 11.11.11.11

Outgoing interface list: (count: 1)

Ethernet5/22, uptime: 00:03:34, pim


(10.1.1.1/32, 224.1.1.1/32), uptime: 00:00:40, ip mrib pim

Incoming interface: Ethernet5/21, RPF nbr: 10.1.1.1

Outgoing interface list: (count: 1)

Ethernet5/22, uptime: 00:00:40, pim


This confirms that the multicast underlay is working.


Mapping VLANs to VNIs


First, with any Nexus platform, you need to ensure that the correct features are enabled. For VXLAN Flood and Learn, you need to enable the following:


N7K2-LEAF1(config)# feature vni // for VLAN to VNI mapping

N7K2-LEAF1(config)# feature nv overlay // for NVE interface


N7K1-LEAF1(config)# feature vni

N7K1-LEAF1(config)# feature nv overlay


** note, features may differ per platform **


Now, here is where things get a little different. Depending on the platform, VXLAN configuration can be done in one of two modes:


VSI (VXLAN Service Instance) mode

VLAN mode


VLAN mode is very simple - you just map the VNI to the VLAN directly under the VLAN configuration:


*snip*


vlan <>

vn-segment <>


*snip*
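

As a concrete example (hypothetical values, not from this topology), mapping VLAN 10 to VNI 777710 in VLAN mode would simply be:


vlan 10

vn-segment 777710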


VSI mode requires more work: you need to create an encapsulation profile and bridge-domains (BDs), and map VLANs to VNIs with the help of these. The concept of a BD here is different from the existing concept of hardware BDs for VLANs; in the case of VXLAN, a bridge-domain is purely a software construct.


We need to create a service instance under the ports facing the end hosts on the leaf switches (VTEPs). An encapsulation profile is then mapped to the service instance, with the encapsulation profile itself defining what VLAN is mapped to what VNI. A simple example (not from this topology):


LEAF# show running-config interface Ethernet1/41


interface Ethernet1/41

description Spirent 7/10

no shutdown

service instance 1 vni

no shutdown

encapsulation profile vsi_std default


Notice how there is no access/trunk configuration defined on this port. So, does this mean the packet should always come in tagged with the appropriate VLAN ID? Yes, it does. The service instance relies on the dot1q ID to map the VLAN to its associated VNI. This design reflects how VMs typically work: one physical box may host multiple VMs, with each VM potentially in a different VLAN. Packets leave that box tagged in the dot1q header with the VLAN ID of whichever VM is sending.
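

In that scenario, a single encapsulation profile could carry one dot1q-to-VNI mapping per VM VLAN. A hypothetical sketch, assuming multiple dot1q statements are permitted in one profile (the second VLAN/VNI pair is made up):


N7K1-LEAF1(config)# encapsulation profile vni MULTI_VM_VLANS

N7K1-LEAF1(config-vni-encap-prof)# dot1q 10 vni 777710

N7K1-LEAF1(config-vni-encap-prof)# dot1q 20 vni 777720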


You can also define a specific VNI for untagged packets:


N7K1-LEAF1(config)# encapsulation profile vni VLAN_10_to_VNI_777710

N7K1-LEAF1(config-vni-encap-prof)# ?


dot1q Encapsulation Dot1q under service instance

no Negate a command or set its defaults

untagged Untagged frame vni mapping <====== for untagged packets

end Go to exec mode

exit Exit from command interpreter

pop Pop mode from stack or restore from name

push Push current mode to stack or save it under name

where Shows the cli context you are in
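

For example, untagged frames could be mapped to their own VNI. The syntax below is assumed from the help output above (not verified in this lab), with a made-up VNI:


N7K1-LEAF1(config-vni-encap-prof)# untagged vni 777700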


Let’s go ahead and do this step by step now:


First, create an encapsulation profile that will map VLAN 10 to VNI 777710.


N7K1-LEAF1(config)# encapsulation profile vni VLAN_10_to_VNI_777710

N7K1-LEAF1(config-vni-encap-prof)# dot1q 10 vni 777710


Now, create a BD and map the VNI to that BD:


N7K1-LEAF1(config)# system bridge-domain 100 // declares the BD

N7K1-LEAF1(config)# vni 777710 // creates the VNI

N7K1-LEAF1(config-vni)# exit


N7K1-LEAF1(config)# bridge-domain 100 // goes into BD mode

N7K1-LEAF1(config-bdomain)# member vni 777710 // maps VNI to BD

N7K1-LEAF1(config-bdomain)# end
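

At this point, you can optionally sanity-check the declared bridge-domains and their VNI membership (output omitted here):


N7K1-LEAF1# show bridge-domain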


Lastly, put all this together under the host facing interface by creating a service instance for the respective interface and applying the encapsulation profile under it:


N7K1-LEAF1(config)# int eth4/1

N7K1-LEAF1(config-if)# service instance 1 vni

ERROR: Config for Ethernet4/1 not allowed on layer 2


We clearly cannot apply a service instance while the port is in ‘switchport’ mode. So, undo that and then apply it.


N7K1-LEAF1(config)# int eth4/1

N7K1-LEAF1(config-if)# no switchport

N7K1-LEAF1(config-if)# service instance 1 vni

N7K1-LEAF1(config-if-srv-vni)# encapsulation profile VLAN_10_to_VNI_777710 default

N7K1-LEAF1(config-if-srv-vni)# no shut


So, the final configuration would be:


N7K1-LEAF1# show run int eth4/1


!Command: show running-config interface Ethernet4/1

!Time: Fri Jan 12 09:36:20 2018


version 7.3(2)D1(1)


interface Ethernet4/1

no shutdown

service instance 1 vni

no shutdown

encapsulation profile VLAN_10_to_VNI_777710 default


No STP instance is built against the VLAN itself for that interface (which makes sense since we no longer have it in ‘switchport’ mode and there is no explicit VLAN association under the interface). Instead, it is built against the BD that the VNI (for that VLAN) is mapped to:


N7K1-LEAF1# show spanning-tree vlan 10

Spanning tree instance(s) for vlan does not exist.


N7K1-LEAF1# show spanning-tree bridge-domain 100


BD0100

Spanning tree enabled protocol rstp

Root ID Priority 32868

Address 8478.ac0d.3643

This bridge is the root

Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec


Bridge ID Priority 32868 (priority 32768 sys-id-ext 100)

Address 8478.ac0d.3643

Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec


Interface Role Sts Cost Prio.Nbr Type

---------------- ---- --- --------- -------- --------------------------------

VSI-Eth4/1.1 Desg FWD 2 128.513 P2p


** note, mac address learning will also be done against the BD and not the VLAN ID **


If you want to look at the details of the VNI <-> BD mappings, which interfaces have service instances applied, whether they are up, which encapsulation profile is being used, and so on, you can use:


N7K1-LEAF1# show service instance vni interface et4/1


VSI Admin Status Oper Status #BD

------------------------------------------------------------------

VSI-Ethernet4/1.1 Up Up 1


N7K1-LEAF1# show service instance vni interface et4/1 detail


VSI: VSI-Ethernet4/1.1

If-index: 0x35180001

Admin Status: Up

Oper Status: Up

Auto-configuration Mode: No

encapsulation profile vni VLAN_10_to_VNI_777710

dot1q 10 vni 777710

Dot1q VNI BD

------------------

10 777710 100


This covers the VLAN to VNI mapping section. Go ahead and do the same on the other leaf switch and verify the configuration.


Creating the NVE interface


The NVE interface is a logical tunnel endpoint for VXLAN. It does not have an IP address – you simply need to specify the source of the tunnel, map the VNIs that you want to allow over this VXLAN tunnel, and specify a corresponding multicast group for each VNI (this group will be used for BUM traffic – more on this later).


N7K1-LEAF1# show run int nve1


!Command: show running-config interface nve1

!Time: Fri Jan 12 11:53:33 2018


version 7.3(2)D1(1)


interface nve1

no shutdown

source-interface loopback0 // source interface for VTEP

member vni 777710 // VNIs this NVE will be used for

mcast-group 224.10.10.10 // mcast group used for BUM traffic


N7K2-LEAF1# show run int nve1


!Command: show running-config interface nve1

!Time: Fri Jan 12 11:55:52 2018


version 7.3(2)D1(1)


interface nve1

no shutdown

source-interface loopback0

member vni 777710

mcast-group 224.10.10.10
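

Before looking at the multicast state, it is worth sanity-checking the NVE interface and VNI status on each leaf (outputs omitted here):


N7K1-LEAF1# show nve interface

N7K1-LEAF1# show nve vni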


At this point, you should see both these switches join the multicast tree for 224.10.10.10 on the RP:


N77K1-SPINE1# show ip mroute 224.10.10.10

IP Multicast Routing Table for VRF "default"


(*, 224.10.10.10/32), uptime: 00:02:51, pim ip

Incoming interface: loopback0, RPF nbr: 11.11.11.11

Outgoing interface list: (count: 2)

Ethernet5/22, uptime: 00:00:43, pim

Ethernet5/21, uptime: 00:02:51, pim


N77K1-SPINE1# show cdp neighbor interface eth5/22

Capability Codes: R - Router, T - Trans-Bridge, B - Source-Route-Bridge

S - Switch, H - Host, I - IGMP, r - Repeater,

V - VoIP-Phone, D - Remotely-Managed-Device,

s - Supports-STP-Dispute


Device-ID Local Intrfce Hldtme Capability Platform Port ID

N7K2-LEAF1(JAF1703AGDP)

Eth5/22 139 R S s N7K-C7004 Eth4/9


N77K1-SPINE1# show cdp neighbor interface eth5/21


Capability Codes: R - Router, T - Trans-Bridge, B - Source-Route-Bridge

S - Switch, H - Host, I - IGMP, r - Repeater,

V - VoIP-Phone, D - Remotely-Managed-Device,

s - Supports-STP-Dispute


Device-ID Local Intrfce Hldtme Capability Platform Port ID

N7K1-LEAF1(JAF1703AGDQ)

Eth5/21 131 R S s N7K-C7004 Eth4/9


This completes the configuration for VXLAN Flood and Learn. Confirm end-to-end reachability now:


Host2#ping 10.0.0.1 source 10.0.0.2 repeat 10

Type escape sequence to abort.

Sending 10, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:

Packet sent with a source address of 10.0.0.2

.!!!!!!!!!

Success rate is 90 percent (9/10), round-trip min/avg/max = 1/1/4 ms

Multicast states on each box:


N7K1-LEAF1# show ip mroute 224.10.10.10

IP Multicast Routing Table for VRF "default"


(*, 224.10.10.10/32), uptime: 00:17:26, nve ip pim

Incoming interface: Ethernet4/9, RPF nbr: 10.1.1.2

Outgoing interface list: (count: 1)

nve1, uptime: 00:17:26, nve


(1.1.1.11/32, 224.10.10.10/32), uptime: 00:17:26, nve mrib ip pim

Incoming interface: loopback0, RPF nbr: 1.1.1.11

Outgoing interface list: (count: 1)

Ethernet4/9, uptime: 00:01:53, pim


(1.1.1.21/32, 224.10.10.10/32), uptime: 00:03:31, ip mrib pim

Incoming interface: Ethernet4/9, RPF nbr: 10.1.1.2

Outgoing interface list: (count: 1)

nve1, uptime: 00:03:31, mrib


N77K1-SPINE1# show ip mroute 224.10.10.10

IP Multicast Routing Table for VRF "default"


(*, 224.10.10.10/32), uptime: 00:54:57, pim ip

Incoming interface: loopback0, RPF nbr: 11.11.11.11

Outgoing interface list: (count: 2)

Ethernet5/21, uptime: 00:06:40, pim

Ethernet5/22, uptime: 00:54:57, pim



(1.1.1.11/32, 224.10.10.10/32), uptime: 00:01:22, pim mrib ip

Incoming interface: Ethernet5/21, RPF nbr: 10.1.1.1, internal

Outgoing interface list: (count: 2)

Ethernet5/21, uptime: 00:00:19, pim, (RPF)

Ethernet5/22, uptime: 00:01:22, pim



(1.1.1.21/32, 224.10.10.10/32), uptime: 00:01:41, pim mrib ip

Incoming interface: Ethernet5/22, RPF nbr: 10.1.1.9, internal

Outgoing interface list: (count: 1)

Ethernet5/21, uptime: 00:01:41, pim


N7K2-LEAF1# show ip mroute 224.10.10.10

IP Multicast Routing Table for VRF "default"


(*, 224.10.10.10/32), uptime: 00:18:17, nve ip pim

Incoming interface: Ethernet4/9, RPF nbr: 10.1.1.10

Outgoing interface list: (count: 1)

nve1, uptime: 00:18:17, nve


(1.1.1.11/32, 224.10.10.10/32), uptime: 00:04:16, pim mrib ip

Incoming interface: Ethernet4/9, RPF nbr: 10.1.1.10

Outgoing interface list: (count: 1)

nve1, uptime: 00:04:16, mrib


(1.1.1.21/32, 224.10.10.10/32), uptime: 00:18:17, nve mrib ip pim

Incoming interface: loopback0, RPF nbr: 1.1.1.21

Outgoing interface list: (count: 1)

Ethernet4/9, uptime: 00:03:33, pim
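

Since peer VTEPs in Flood and Learn are discovered through data-plane learning, each leaf should also now list the other as a peer (output omitted here):


N7K1-LEAF1# show nve peers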


VXLAN packet flow – BUM traffic


Host2 initiates a ping to Host1. Since Host1’s IP address is unresolved, an ARP request is generated. The ARP reaches N7K1-LEAF1 and needs to be VXLAN encapsulated and sent towards the other leaf. N7K1-LEAF1 looks at the VLAN-to-VNI mapping, determines what multicast group is being used for this VNI, and then adds the required headers to the packet.


The VXLAN encapsulation adds a VXLAN header (which carries the VNI and some reserved bits), inside a UDP header (which has a hash-based source port and the well-known destination port of 4789), inside an IP header and an Ethernet header.
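

For the ARP in this example, the resulting packet on the wire would look roughly like this (the outer destination is the multicast group configured for the VNI; 0100.5e0a.0a0a is the multicast mac derived from 224.10.10.10):


Outer Ethernet : src = N7K1-LEAF1 egress mac, dst = 0100.5e0a.0a0a

Outer IP       : src = 1.1.1.11 (lo0), dst = 224.10.10.10

UDP            : src = hash-based, dst = 4789

VXLAN          : VNI = 777710

Inner Ethernet : src = 0002.0002.0002, dst = ffff.ffff.ffff

Payload        : ARP request (who has 10.0.0.1? tell 10.0.0.2)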



This packet is multicast routed through the VXLAN core and reaches N7K2-LEAF1, which, after stripping the appropriate headers, sends the naked ARP packet out to Host1.



VXLAN packet flow – unicast


Once the ARP is resolved, Host2 can build an ICMP packet destined to Host1. This consists of an ICMP header within an IP header (source IP address 10.0.0.2, destination IP address 10.0.0.1) within an Ethernet header (source mac address 0002.0002.0002, destination mac address 0001.0001.0001).


This is then VXLAN encapsulated. The outer IP header will have a destination IP address of Loopback0 of N7K2-LEAF1 and a source IP address of Loopback0 of N7K1-LEAF1. The outer Ethernet header will have a source mac address of N7K1-LEAF1 and a destination mac address of the next hop towards Loopback0 of N7K2-LEAF1. The packet is unicast routed across the underlay until it reaches N7K2-LEAF1. The headers are then stripped off one by one: the Ethernet header is stripped because the destination mac address is locally owned, the IP header is stripped because the destination IP address is locally owned, and the UDP header is stripped, with its destination port also informing the box that what follows is a VXLAN encapsulated packet. Finally, the VNI in the VXLAN header tells the box what VLAN (and BD) this is for.
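

Putting that together, the encapsulated ICMP echo from Host2 to Host1 would look roughly like this:


Outer Ethernet : src = N7K1-LEAF1 egress mac, dst = next-hop mac towards 1.1.1.21

Outer IP       : src = 1.1.1.11 (lo0), dst = 1.1.1.21 (lo0)

UDP            : src = hash-based, dst = 4789

VXLAN          : VNI = 777710

Inner Ethernet : src = 0002.0002.0002, dst = 0001.0001.0001

Inner IP       : src = 10.0.0.2, dst = 10.0.0.1

Payload        : ICMP echo request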



Drawbacks of Flood and Learn


One of the biggest problems with Flood and Learn is the flooding that is required to learn mac addresses. It also requires an IP-multicast-aware core (or head-end replication on the VTEPs themselves) to support this flooding. This presents scalability issues: for example, if a VTEP needs to talk to another VTEP for a given VNI, they need to use the same multicast group. This means that every mac will be learnt by every VTEP joined to that multicast group, regardless of whether the end hosts ever need to talk to each other.


In our topology, Host1 pings a non-existent IP in its subnet:


Host1#ping 10.0.0.5 source 10.0.0.1 repeat 100

Type escape sequence to abort.

Sending 100, 100-byte ICMP Echos to 10.0.0.5, timeout is 2 seconds:

Packet sent with a source address of 10.0.0.1

.............................


*snip*


Because of the data-driven learning, N7K1-LEAF1 is also forced to learn the source mac and install it in its mac table, even though the packet is not really destined for it or for a host behind it.


N7K2-LEAF1# show mac address-table

Note: MAC table entries displayed are getting read from software.

Use the 'hardware-age' keyword to get information related to 'Age'


Legend:

* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN entry

(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info

VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID

---------+-----------------+--------+---------+------+----+------------------

G - 8478.ac0d.3443 static - F F sup-eth1(R)

* 100 0001.0001.0001 dynamic ~~~ F F VSI-Eth4/1.1


N7K1-LEAF1# show mac address-table

Note: MAC table entries displayed are getting read from software.

Use the 'hardware-age' keyword to get information related to 'Age'


Legend:

* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC

age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN entry

(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info

VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID

---------+-----------------+--------+---------+------+----+------------------

G - 8478.ac0d.3643 static - F F sup-eth1(R)

* 100 0001.0001.0001 dynamic ~~~ F F nve1/1.1.1.21(R)


This doesn't look like a problem when you look at it from just one mac’s perspective; however, scale this up and it becomes a potential scalability issue. The only way for N7K1-LEAF1 to avoid learning this mac is to not use the same multicast group, but that would essentially cut off any communication between these two VTEPs, which is not something you want either.

For any queries, concerns or conversations, please email contact@theasciiconstruct.com