Recently we ran into call disconnections at a site where IP telephony had just been integrated into the communications network. Our client had switched from an analog system to IP telephony and understandably had high expectations.
But after a short introductory or test period, strange things started to happen. Calls were disconnected and Internet access was impossible. In short, every day between 15:00 and 16:00, the client was cut off from the rest of the world.
The first question that occurs to an engineer in such cases is how a telephone system can work fine for a while and then start to fail. Had anything been changed (added or removed) in the network in the last few days? As users were growing more and more annoyed, we had to act quickly and efficiently. The only clue we had was that, in the previous few days, about 10 new remote locations had been added, all accessing the Internet through the central location and its proxy server.
Inspection of our NIL Monitor system showed that individual traffic patterns, or traffic peaks, were being copied to all ports on the main switch. This gave us well-grounded suspicions of several possible causes:
- a temporary loop in the network
- excessive broadcast/multicast traffic
- a full CAM table on the switch
- legitimate unicast traffic whose destination MAC address could not be found in the CAM table
After a detailed analysis of the traffic with NIL Monitor, we found the most logical explanation to be a destination MAC address that was missing from the CAM table of the switch.
Our next suspicion fell on the proxy servers. We sniffed the traffic of one of the clients. Because all the clients were connected in the same VLAN segment as the proxy servers, capturing the traffic was not difficult, even though we had not configured any traffic capture on the switch. All the traffic going through the proxy server was copied and sent to all ports on the switch. A closer look at the packet structure confirmed our suspicion: the destination MAC address simply did not exist in the CAM table. How did this happen?
To access the Internet, a client sends an ARP request for the IP address configured as its proxy server (for example, 10.10.10.1). The request is sent as broadcast traffic through the whole VLAN segment. The proxy server that owns this IP address answers with an ARP reply containing its MAC address. And this is where the confusion starts. The MAC address in the reply comes from the Microsoft NLB functionality; it is not the actual MAC address configured on the network card.
For example, assume that NLB puts the MAC address 02bf.xxxx.yyyy in the ARP reply, although the actual MAC address of one of the cards in the NLB cluster is 0202.xxxx.yyyy. As you probably know, a switch learns MAC addresses from the source address of the frames a card sends. But here, the ARP reply advertises a completely different MAC address. When the client receives the ARP reply, it simply receives instructions as to where to send its packets (in our example, to the MAC address 02bf.xxxx.yyyy). The client encapsulates the packet with all the necessary overhead and sends it to the main switch. The switch receives the frame and looks up the destination MAC address in its CAM table. But from the proxy server it has learned only the MAC address 0202.xxxx.yyyy. The destination MAC address of the frame is not in the table, so at this moment the switch behaves like a hub and floods the traffic to all ports.
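The mechanism described above can be illustrated with a small simulation. This is only a sketch of generic learning-switch behaviour, not the logic of any particular switch model; the class name `LearningSwitch`, the port numbers, and the concrete MAC addresses are hypothetical values chosen to match the pattern in the example.

```python
BROADCAST = "ffff.ffff.ffff"


class LearningSwitch:
    """Toy layer-2 switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.cam = {}  # MAC address -> port on which it was learned

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the list of ports it is sent out of."""
        # The switch learns only from the SOURCE address of a frame.
        self.cam[src_mac] = in_port
        if dst_mac != BROADCAST and dst_mac in self.cam:
            return [self.cam[dst_mac]]
        # Broadcast or unknown unicast: flood to every port except the ingress.
        return [p for p in self.ports if p != in_port]


# Hypothetical addresses following the pattern used in the text.
CLUSTER_MAC = "02bf.0a0a.0a01"  # address advertised inside the ARP reply
NIC_MAC = "0202.0a0a.0a01"      # source address the proxy's frames carry
CLIENT_MAC = "0011.2233.4455"

switch = LearningSwitch(ports=[1, 2, 3, 4])

# 1. The client on port 1 broadcasts its ARP request.
arp_request_out = switch.receive(1, CLIENT_MAC, BROADCAST)

# 2. The proxy on port 4 replies. The frame's source is NIC_MAC, so that is
#    what the switch learns; CLUSTER_MAC appears only in the ARP payload.
arp_reply_out = switch.receive(4, NIC_MAC, CLIENT_MAC)

# 3. The client sends data to CLUSTER_MAC, which the switch has never seen
#    as a source address, so the frame is flooded.
data_out = switch.receive(1, CLIENT_MAC, CLUSTER_MAC)

print("ARP request sent to ports:", arp_request_out)
print("ARP reply sent to ports:  ", arp_reply_out)
print("Data frame sent to ports: ", data_out)
```

In this model every data frame from the client keeps being flooded, because nothing ever sends a frame with `CLUSTER_MAC` as its source and the switch therefore never learns it.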
Nice, isn’t it? Try to imagine 50 remote locations and a central location, all eagerly trying to access the Internet between 15:00 and 16:00. Would you want to be part of this network?
The solution to this problem was network segmentation. The proxy servers (and all other servers using the NLB functionality) were moved to a new VLAN segment. As first aid, Cisco also recommends placing a hub between the switch and the servers, but that is only a short-term solution, and it would not have been possible in our environment, where the servers are connected over Gigabit links. Before we carried out the network segmentation (which requires renumbering the servers, so we did it after working hours), we worked around the problem by switching off one proxy server and configuring static MAC address entries on all switches in the VLAN segment.
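Why a static entry stops the flooding can be shown with a short lookup sketch. It assumes a simplified forwarding table in which static entries are consulted alongside learned ones; the function `egress_ports`, the port numbers, and the MAC addresses are hypothetical and are not taken from any switch configuration.

```python
def egress_ports(learned, static, ports, in_port, dst_mac):
    """Return the ports a unicast frame is forwarded to.

    learned: MAC -> port entries learned from source addresses
    static:  MAC -> port entries configured by the administrator
    """
    table = {**learned, **static}  # static entries override learned ones
    if dst_mac in table:
        return [table[dst_mac]]
    # Unknown unicast: flood to every port except the one it arrived on.
    return [p for p in ports if p != in_port]


PORTS = [1, 2, 3, 4]
CLUSTER_MAC = "02bf.0a0a.0a01"  # hypothetical address clients send to
learned = {
    "0011.2233.4455": 1,  # hypothetical client
    "0202.0a0a.0a01": 4,  # hypothetical source MAC of the proxy's card
}

# Without a static entry the frame is flooded.
before = egress_ports(learned, {}, PORTS, in_port=1, dst_mac=CLUSTER_MAC)

# With a static entry pointing at the proxy's port it goes to that port only.
after = egress_ports(learned, {CLUSTER_MAC: 4}, PORTS, in_port=1,
                     dst_mac=CLUSTER_MAC)

print("Before static entry:", before)
print("After static entry: ", after)
```

In this simplified model a static entry maps the address to a single port, so only one server behind that address can receive the traffic, which fits with one proxy server having been switched off during the workaround.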
Note: In this story, NLB was configured in unicast mode; with multicast NLB, the problems would have been considerably smaller.