Comcast 2gig/200M Service Upgrade Issues
I recently upgraded my home Comcast internet service from 1gig/40meg to their newest 2gig/200meg service. This is a significant upgrade for Comcast as it requires some upgrades to field installed upstream amps to allow a greater amount of upstream bandwidth. I was surprised to see that the hardware work was completed in my area, but happy to give it a try.
One oddity of the service is that Comcast requires you to run their cable modem if you want the 2gig/200meg service combined with unlimited data. They specifically mention it during sign-up, and specifically mention that if you switch to your own modem the service will downgrade back to the 1gig service with a 1.2TB data cap.
A day after ordering the service a new XB8 modem arrived. I am running a Ubiquiti UDM Pro firewall, which has 10G SFP+ interfaces for both LAN and WAN. In my case the LAN side is connected to a Ubiquiti 10G core switch that has fiber drops to other wiring closets in my house, as well as 10G connections to my office machines and 40G connections to the server racks.
The XB8 modem has a single copper 2.5G capable port (marked with the red bar), and most new 10G-BaseT SFP+ modules will down-convert to 2.5G wire speeds. I connected up the 2.5G port from the cable modem to my UDMP via an SFP+ module, and the Comcast side indicated a 2500 mbit connection. Interestingly on the UDMP software side the interface appears as a 10G interface because the SFI interface from the device to the SFP+ module is still running at 10G. I’d be curious to dig into that a bit more to see what that module does for speed conversion and throttling. With the connection up and running, and with the cable modem in the default NON-BRIDGE mode, my firewall got a 10.0.0.4 address as I would expect. Some quick speed tests from my office machine gave me acceptable >2 gig download speeds, and 100-200mbit upload speeds. Upload speeds were more volatile in part because of the end nodes for the speed test being saturated, but still overall acceptable results.
Given my use case I then switched the XB8 over to BRIDGE mode. It is a simple toggle in the Comcast interface (address 10.0.0.1) in which the modem appears to restart. After restart the UDMP got an external Comcast address (24.x.x.x) over DHCP, and internet connectivity seems great.
A few hours later I noticed the UDMP was saying the ‘internet’ connection was not working, and sure enough if I did a ping of 1.1.1.1 it would drop 90% of the packets. You could still open some web pages due to the amazing resilience of TCP but there was clearly something causing lots of packet drops.
My first thought was that the 2.5G interface might be having problems. The 10GBaseT SFP+ module I was using was quite old, so perhaps its 2.5G support was not perfect. As soon as I unplugged the connection from the SFP+ and put it into the UDM Pro’s 1G-only copper port, the internet connectivity was restored and looked perfect. OK, time to order a new 10GT SFP+.
Five minutes later, the same internet drop happened again. I unplugged the drop and plugged it back in, and that instantly fixed it. Hmm. That is odd… Is it exactly 5 mins? Stopwatch out. Yep, exactly 5 mins after re-plugging, the problem happens. Resetting the interface fixes it for 5 more mins, so my first thought was something with the UDMP interface. I rerouted the UDMP interface through the core switch on a dedicated VLAN so the newer core switch would be the connection point for the Comcast connection, but it made no difference. If I soft down the interface and bring it back up on the Comcast side, it doesn’t fix anything, but if I do it on the UDMP side it does. Interesting!
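A stopwatch works, but the same timing can also be logged automatically. The sketch below is one way to do that, assuming a Linux `ping` that accepts `-c` and `-W`; the names `probe` and `watch_link` and the `INTERVAL` variable are made up for illustration and are not part of any tool mentioned here.

```shell
#!/bin/sh
# probe HOST: succeed if a single ping gets a reply within one second.
# Assumes Linux-style ping flags (-c count, -W timeout in seconds).
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# watch_link HOST COUNT: probe HOST COUNT times, pausing INTERVAL seconds
# (default 1) between probes, and print "<epoch> <host> up|down" only when
# the state changes. The gap between an "up" line and the following "down"
# line is how long the link survived.
watch_link() {
    host="$1"
    count="$2"
    state="unknown"
    i=0
    while [ "$i" -lt "$count" ]; do
        if probe "$host"; then new="up"; else new="down"; fi
        if [ "$new" != "$state" ]; then
            echo "$(date +%s) $host $new"
            state="$new"
        fi
        i=$((i + 1))
        sleep "${INTERVAL:-1}"
    done
}

# Example (run right after re-plugging the WAN cable):
#   watch_link 1.1.1.1 600
```

Because only state changes are printed, a dropout that really is tied to a 5-minute timer should show up as an "up" line followed by a "down" line roughly 300 seconds later.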
I did notice the problem didn’t start until I switched to BRIDGED mode, so I tried switching back out of BRIDGED mode. That fixed it. I can get back to 2.5g and everything is working great. No 5 min dropout. Could BRIDGE mode be the problem?
At this point, the network engineer in me said ‘get a pcap going’. Since I’m running through the core switch I configure another SFP+ port to be a mirror port, and then connect that port to an interface on another laptop that has Wireshark on it.
I switched the cable modem back to BRIDGED mode and started capturing from power on.
My first thought: when does the UDMP send the ARP requests for the default gateway? Filtering on just DHCP I can see the UDMP getting an IP address at timestamp 64, and just after that the UDMP does an ARP request for the gateway 24.20.70.1. This looks just as you would expect. One thing that does stick out is that gateway’s MAC address: IETF-VRRP-VRID_32. That is an indication that the other side is using VRRP (Virtual Router Redundancy Protocol). Not surprising, but a hint at the unusual behavior.
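Wireshark shows that name because IPv4 VRRP uses a reserved virtual MAC of the form 00:00:5e:00:01:XX, where the last octet is the virtual router ID. As a small sketch, the helper below (the name `vrrp_vrid_from_mac` is invented for this example) checks whether a MAC falls in that range and prints the VRID in decimal. One assumption to flag: as far as I understand Wireshark's name resolution, the suffix it prints is the last octet in hex, so "VRID_32" would correspond to a MAC ending in :32, which is VRID 50 in decimal.

```shell
#!/bin/sh
# vrrp_vrid_from_mac MAC: if MAC is an IPv4 VRRP virtual MAC
# (00:00:5e:00:01:XX), print the VRID (XX) in decimal and return 0.
# Otherwise print nothing and return 1.
vrrp_vrid_from_mac() {
    # Normalise to lowercase and colon separators.
    mac=$(printf '%s' "$1" | tr 'A-F-' 'a-f:')
    case "$mac" in
        00:00:5e:00:01:[0-9a-f][0-9a-f])
            # ${mac##*:} is the last octet; printf converts hex to decimal.
            printf '%d\n' "0x${mac##*:}"
            ;;
        *)
            return 1
            ;;
    esac
}

# Assumed MAC, derived from the hex reading of the Wireshark suffix:
vrrp_vrid_from_mac 00:00:5e:00:01:32
```

A gateway MAC in this range only says that some VRRP group answers for the address; it does not say which physical router will carry the return traffic.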
With the ARP at 64 seconds, I start doing some ICMP traffic. At almost exactly 300 seconds later (Timestamp=364s), the ICMPs start to fail. Some other traffic is still working, but this specific ICMP path is failing. Clearly something has timed out, and it seems like perhaps my MAC address has timed out in one of the paths from one of the VRRP member routers.
I SSH into the UDMP and check the ARP table, which has the entry for 24.20.70.1. I use the ‘arp -d’ command to remove that entry which causes the UDMP to send a new ARP request for 24.20.70.1. I see that traffic, and suddenly the ICMPs start working again!
I took a look at the statistics from this capture session, and you can see the outbound traffic going from the UDMP to the VRRP MAC address, but the return traffic is split between two Juniper routers (MACs 0xef and 0x6c). These two return paths are not equal in usage, which is not a surprise since whatever hashing picks the return path behind the VRRP address isn’t going to guarantee an even split over such a small set of connections.
I dug a bit more, and sure enough the ICMPs I was sending were all coming back from one of the two Junipers (the 0x6c one). It seems that most of the traffic from that particular Juniper gets dropped if the ARP request does not happen (and presumably reset a MAC table somewhere along the way) every 5 mins.
Looking at the traffic from the two source MACs on the return path confirms that the traffic from that second Juniper drops after the potential MAC timeout.
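The same per-router comparison can be reproduced outside the Wireshark GUI. The sketch below assumes the capture has been exported as "relative time, source MAC" pairs, for example with tshark's `-T fields` output; the function name `frames_by_src_mac` and the file name `wan.pcap` are placeholders, not anything from the original capture.

```shell
#!/bin/sh
# frames_by_src_mac CUTOFF: read "relative_time src_mac" pairs on stdin and
# print, for each source MAC, how many frames arrived before CUTOFF seconds
# and how many arrived at or after it. Output is sorted by MAC.
frames_by_src_mac() {
    cutoff="$1"
    awk -v cutoff="$cutoff" '
        NF >= 2 {
            seen[$2] = 1
            if ($1 + 0 < cutoff + 0) before[$2]++
            else after[$2]++
        }
        END {
            for (mac in seen)
                printf "%s before=%d after=%d\n", mac, before[mac], after[mac]
        }' | sort
}

# Example, splitting at the 364 s mark where the ICMPs started to fail
# (wan.pcap is a hypothetical capture file):
#   tshark -r wan.pcap -Y icmp -T fields -e frame.time_relative -e eth.src \
#       | frames_by_src_mac 364
```

If one router's "after" count collapses while the other keeps climbing, that matches the pattern described above: only the return path through one of the two routers goes quiet.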
The FIX:
ARP timeouts are a bit complex. While there are OS-level settings (in this case you can see them in /proc/sys/net/ for the interface of interest), things like base_reachable_time_ms don’t provide everything you need. An entry in the ARP cache might not be refreshed via ARP if an upper-level protocol updates the status of that entry. That can happen when a packet is received successfully based on the use of an entry. As a result, having those OS timers set to something like 30s might not actually result in a new ARP every 30s, especially on an active link. On the UDM I was able to see >20 min intervals between ARPs if traffic is flowing.
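For anyone who wants to look at those timers, here is a minimal sketch that prints a few of the per-interface neighbour settings on a Linux box. It assumes the usual /proc/sys/net/ipv4/neigh/<interface>/ layout; the function name `show_neigh_timers` and the `NEIGH_ROOT` override (handy for trying it against a scratch directory) are invented for this example.

```shell
#!/bin/sh
# show_neigh_timers IFACE: print selected IPv4 neighbour (ARP) timers for
# IFACE as key=value lines. Returns 1 if the interface has no settings
# directory. NEIGH_ROOT is a hypothetical override for the sysctl path.
show_neigh_timers() {
    iface="$1"
    dir="${NEIGH_ROOT:-/proc/sys/net/ipv4/neigh}/$iface"
    if [ ! -d "$dir" ]; then
        echo "no neighbour settings found for $iface" >&2
        return 1
    fi
    for key in base_reachable_time_ms gc_stale_time delay_first_probe_time; do
        if [ -r "$dir/$key" ]; then
            printf '%s=%s\n' "$key" "$(cat "$dir/$key")"
        fi
    done
    return 0
}

# Example on the UDM Pro, where the SFP+ WAN port is eth9:
#   show_neigh_timers eth9
# The per-entry state and timers can be inspected with:
#   ip -s neigh show dev eth9
```

These values describe when an entry is considered stale, not when an ARP request is guaranteed to go out, which is why short timers alone do not force a refresh on a busy link.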
Since I need to guarantee a new ARP every 5 mins I created a CRON job that runs every 4 mins and deletes the particular arp entry for the gateway in the internet interface. In my case (the UDMPro) the SFP+ WAN interface is eth9. I added the following command to the crontab:
sudo arp -d `arp -i eth9 | awk 'BEGIN { FS="[ ]" } ; NR==1 {print $1 }'`
That command will delete the ARP entry for the first entry in the WAN interface table, which in this case is the default route (because the IP is .1). This is a serious hack, and not something you would want to rely on long term.
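Relying on the gateway being the first row of the ARP table is fragile, so a slightly more defensive variant is sketched here: look up the default gateway for the interface from the routing table and delete only that entry. This is an illustration, not what is running on my UDM Pro; it assumes both `ip` (iproute2) and `arp` are available, and the function names and the script path in the crontab comment are hypothetical.

```shell
#!/bin/sh
# default_gateway: read "ip route" output on stdin and print the address
# that follows "via" on the first default route, if any.
default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# flush_gateway_arp IFACE: delete the ARP entry for IFACE's default gateway
# so that the next outbound packet triggers a fresh ARP request.
flush_gateway_arp() {
    iface="$1"
    gw=$(ip -4 route show default dev "$iface" | default_gateway)
    if [ -z "$gw" ]; then
        echo "no default gateway found on $iface" >&2
        return 1
    fi
    arp -i "$iface" -d "$gw"
}

# Example crontab entry, every 4 minutes (hypothetical script path; the
# script would define the functions above and call: flush_gateway_arp eth9):
#   */4 * * * * /root/flush-gateway-arp.sh
```

It is still the same hack, just one that keeps working if the ARP table ordering changes or the gateway address is not the first entry.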
It does work, and a week later everything is working great with the ARP refresh. It is a surprising issue, perhaps obscured because not as many people run the modem in bridged mode, and because Windows tends to ARP much more often. It is possible that other firewalls have lower ARP thresholds that mitigate this problem. There isn’t a magic ARP timeout that is ‘correct’ and the actual implementations vary a lot.
I talked about this issue at length with my friend Eric Rosenberry, who happens to be Director of Network Architecture at Ziply. He suggested that it could be some layer 2 network such as EVPN sitting between the VRRP routers and my device. Fortunately he had a few contacts who know a few other contacts that may be able to pass along this interesting problem to some of the client network engineers at Comcast. I’ll update if I hear back.
I did find after a bit of searching that other people have seen this issue with the UDM Pro and Comcast:
10 thoughts on “Comcast 2gig/200M Service Upgrade Issues”
I am having the same trouble but with different equipment and can’t figure out how to fix it. I have a Synology router in bridge mode to XB8. Almost all Macs and an xbox. Xbox drops connections almost exactly every 5 minutes – browsing on Macs more unpredictable but several times an hour. I don’t think I have access to changing arp settings on Synology router. I tried ssh’ing in to it and the following:
sysctl -w net.link.ether.inet.max_age=120
and it complained it didn’t have that key. I don’t know what flavor of underlying OS it uses or how to make such a change – UI has nothing I can find.
I also don’t know that you can change this on the XBOX, even though it’s “sort of” windows.
When I use your command in SSH I get “arp:: Unknown host”
Nice. I thought I was having issues with my SFPs (per Comcast’s advice) and kept trying different network cards and SFPs, using pfSense for routing. I was getting the same bug on 1Gb and 2.5Gb ports so I was quite confused.
I only get the issue after a couple hours but it’s been there every morning when I wake up. Normally I reboot my modem to clear it, I was able to resolve it this time by clearing the ARP entries for that interface instead.
I set up the same cron you did. Hopefully this fixes it for good and I can return all these extra SFPs.
Yeah, it does look like a hardware problem when you first look at it. A few other people have said the behaviour changes with a new version of UniFi OS, so I’ll upgrade and see if that makes a difference.
Thank you for publishing this blog. This post was very helpful in diagnosing my own problem. I signed up for the 2000/200 service and using the XB8 modem here in the Atlanta area. I’m having the same issue in bridge mode, using the Amplifi alien router. I don’t understand networking like you do. I hope Comcast can address this issue soon. Otherwise, I plan to downgrade my service.
Thank you so much for this posting. I am in Fort Lauderdale, FL and I am having the same exact issue myself. I am using a UDM-SE with a XB7 modem. However, I am having the same exact issue and it seems many Comcast/UI users are having the same issue.
Just another data point on this. I just switched to a Comcast XB8 in bridge mode with a UDM running UniFi OS v3.0.20 and the ARP issue seems to be resolved. I could not tell you if it was a fix on the Comcast side or from Ubiquiti though.
Interesting. It seemed at my location that the problem went away, but a week later it came back and I had to add back in the ARP cache deletes. I’ll have to check and see what version of UniFi OS I’m on right now, as it is possible that the latest version has some changes that fix this.
Jeff did you have any luck fixing this? I just moved my parents to a UDM and ran right into this issue. I would really like to have passthrough on but without the need for the CRON job
I recently identified a cause of randomly (days to weeks) dropping WAN connections. This is a serious bug that is likely impacting many Comcast customers.
Likely exclusive to Comcast customers that use DHCP (may not happen with static IPs) and have the modem in bridge mode. Perhaps the failure is only seen if the customer is using the 2G down/200M up plan. I found the bug occurs regardless of the modem manufacturer (tested 3 XB8s and 4 different CODA56 modems), so it appears not to be a modem issue but maybe some combination of their profiling, their gateways or their DHCP servers. In my Netgate router (running pfSense), if the WAN’s DHCP lease time is changed from what the server/modem assigned, then precisely at the 50% point of the lease, the monitoring of the WAN’s status will no longer receive replies (such as a ping test to 8.8.8.8), and the router will consider the port failed. The WAN connection will come back online only when one of these events occurs: (1) the DHCP lease is manually released/renewed; (2) the lease expires and renews; (3) the port is unplugged/replugged; (4) the WAN is disabled/re-enabled; or (5) the router or the modem is rebooted.
The up/down of the WAN will cycle at the time set for the lease, with 1/2 being up and the other 1/2 down. Reducing or increasing the cycle time of sending ARPs does not change the cycle or results.