Comcast 2gig/200M Service Upgrade Issues

I recently upgraded my home Comcast internet service from 1gig/40meg to their newest 2gig/200meg service. This is a significant upgrade for Comcast, as it requires upgrades to the field-installed upstream amps to allow more upstream bandwidth. I was surprised to see that the hardware work was already completed in my area, but happy to give it a try.

One oddity of the service is that Comcast requires you to run their cable modem if you want the 2gig/200m service combined with unlimited data. They call this out during sign-up, and also mention that if you switch to your own modem the service will downgrade back to the 1gig service with a 1.2TB data cap.

A day after ordering the service a new XB8 modem arrived. I am running a Ubiquiti UDM Pro firewall, which has 10G SFP+ interfaces for both LAN and WAN. In my case the LAN side is connected to a Ubiquiti 10G core switch that has fiber drops to the other wiring closets in my house, as well as 10G connections to my office machines and 40G connections to the server racks.

The XB8 modem has a single copper 2.5G capable port (marked with the red bar), and most new 10GBase-T SFP+ modules will down-convert to 2.5G wire speeds. I connected the 2.5G port on the cable modem to my UDMP via an SFP+ module, and the Comcast side indicated a 2500 Mbit connection. Interestingly, on the UDMP software side the interface appears as a 10G interface because the SFI interface from the device to the SFP+ module is still running at 10G. I’d be curious to dig into that a bit more to see what that module does for speed conversion and throttling. With the connection up and running, and with the cable modem in the default NON-BRIDGE mode, my firewall got an address as I would expect. Some quick speed tests from my office machine gave me acceptable >2 gig download speeds and 100-200 Mbit upload speeds. Upload speeds were more volatile, in part because the end nodes for the speed test were saturated, but the results were still acceptable overall.

Given my use case I then switched the XB8 over to BRIDGE mode. It is a simple toggle in the Comcast interface, after which the modem appears to restart. After the restart the UDMP got an external Comcast address (24.x.x.x) over DHCP, and internet connectivity seemed great.

A few hours later I noticed the UDMP was saying the ‘internet’ connection was not working, and sure enough a ping test would drop 90% of the packets. You could still open some web pages thanks to the amazing resilience of TCP, but there was clearly something causing lots of packet drops.

My first thought was that the 2.5G interface might be having problems. The 10GBase-T SFP+ module I was using was quite old, so perhaps its 2.5G support was not perfect. As soon as I unplugged the connection from the SFP+ and put it into the UDM Pro’s 1G-only copper port, the internet connectivity was restored and looked perfect. Ok. Time to order up a new 10GBase-T SFP+.

5 mins later, the same internet drop happened again. I unplugged the drop and plugged it back in, and that instantly fixed it. Hmm. That is odd… Is it exactly 5 mins? Stopwatch out. Yep, exactly 5 mins after re-plugging, the problem happens. Resetting the interface fixes it for 5 more mins, so my first thought was something with the UDMP interface. I rerouted the UDMP interface through the core switch on a dedicated VLAN so the newer core switch would be the connection point for the Comcast connection, but it made no difference. With that setup, if I soft-down the interface and bring it back up on the Comcast side it doesn’t fix anything, but if I do it on the UDMP side it does. Interesting!

I did notice the problem didn’t start until I switched to BRIDGE mode, so I tried switching back out of BRIDGE mode. That fixed it. I could get back to 2.5G and everything was working great. No 5 min dropout. Could BRIDGE mode be the problem?

At this point, the network engineer in me said ‘get a pcap going’. Since I’m running through the core switch I configure another SFP+ port to be a mirror port, and then connect that port to an interface on another laptop that has Wireshark on it.

I switched the cable modem back to BRIDGED mode and started capturing from power on.

My first question: when does the UDMP send the ARP requests for the default gateway? Filtering on just DHCP I can see the UDMP getting an IP address at timestamp 64, and just after that the UDMP does an ARP request for the gateway. This looks just as you would expect. One thing that does stick out is the gateway’s MAC address: IETF-VRRP-VRID_32. That is an indication that the other side is using VRRP (Virtual Router Redundancy Protocol). Not surprising, but a hint at the unusual behavior.

The ARP at 64 seconds is the last before the manual one at 528.

With the ARP at 64 seconds, I start doing some ICMP traffic. At almost exactly 300 seconds later (Timestamp=364s), the ICMPs start to fail. Some other traffic is still working, but this specific ICMP path is failing. Clearly something has timed out, and it seems like perhaps my MAC address has timed out in one of the paths from one of the VRRP member routers.

ICMP working as expected.

ICMP Fails starting at timestamp 364.

I SSH into the UDMP and check the ARP table, which has the entry for the gateway. I use the ‘arp -d’ command to remove that entry, which causes the UDMP to send a new ARP request for the gateway. I see that traffic in the capture, and suddenly the ICMPs start working again!

The ARP at timestamp 528 fixes the ICMP return path.

I took a look at the statistics from this capture session, and you can see the outbound traffic going from the UDMP to the VRRP MAC address, but the return traffic is split between two Juniper routers (MACs ending in 0xef and 0x6c). These two return paths are not equal in usage, which is not a surprise since the hashing used to pick between the VRRP members isn’t going to guarantee an even split across such a small set of connections.

I dug a bit more, and sure enough the ICMPs I was sending were all coming back from one of the two Junipers (the 0x6c one). It seems that most of the traffic from that particular Juniper gets dropped unless an ARP request happens (and presumably resets a MAC table somewhere along the way) every 5 mins.
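One way to sanity-check the timeout theory is to model it. The sketch below assumes (and this is only a hypothesis, nothing confirmed by Comcast) that something upstream forgets my MAC address 300 seconds after the last ARP request it saw, and replays the ARP timestamps from the capture against that rule. The function name and the 300 s constant are mine, inferred from the capture, not a known setting on any device.

```python
# Toy model of the suspected upstream MAC aging.
# AGING_SECONDS is an assumption inferred from the capture
# (ARP at 64 s, ICMP failures from 364 s), not a documented value.
from bisect import bisect_right

AGING_SECONDS = 300


def return_path_alive(t, arp_times, aging=AGING_SECONDS):
    """True if an ARP request was sent less than `aging` seconds before t."""
    arps = sorted(arp_times)
    i = bisect_right(arps, t)  # number of ARPs sent at or before time t
    if i == 0:
        return False  # no ARP has been sent yet
    return t - arps[i - 1] < aging


capture_arps = [64, 528]  # ARP timestamps seen in the capture

for t in (100, 363, 364, 500, 528, 600):
    print(t, return_path_alive(t, capture_arps))

# Forcing a fresh ARP every 240 s (the 4 minute idea) stays inside the window.
forced_arps = range(64, 3600, 240)
print(all(return_path_alive(t, forced_arps) for t in range(64, 3600)))
```

Under that assumption the model fails at 364 s and recovers at 528 s, which matches what the capture shows. It does not prove where the timeout lives, only that a 300 s aging timer is consistent with the data.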

The statistics from this small capture session. You can see both outbound and inbound paths.

Looking at the traffic from the two source MACs on the return path confirms that the traffic drops from that second Juniper after the potential MAC timeout.

This return path has normal packet distribution the entire time.
You can see the significant falloff in packets from this return path during the interval from 364-528.

The FIX:

ARP timeouts are a bit complex. While there are OS level settings (in this case you can see them in /proc/sys/net/ for the interface of interest), things like base_reachable_time_ms don’t provide everything you need. An entry in the ARP cache might not be refreshed via ARP if an upper level protocol confirms that the entry is still good, which can happen when a packet is received successfully over a connection that uses that entry. As a result, having those OS timers set to something like 30s might not actually result in a new ARP every 30s, especially on an active link. On the UDM I was able to see >20 min intervals between ARPs while traffic was flowing.

Since I need to guarantee a new ARP every 5 mins, I created a cron job that runs every 4 mins and deletes the ARP entry for the gateway on the internet interface. In my case (the UDM Pro) the SFP+ WAN interface is eth9. I added the following command to the crontab:

sudo arp -d `arp -i eth9 | awk 'BEGIN { FS="[ ]" } ; NR==1 {print $1 }'`

That command will delete the ARP entry for the first entry in the WAN interface table, which in this case is the default route (because the IP is .1). This is a serious hack, and not something you would want to rely on long term.
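Relying on the first row of the ARP table is the fragile part. A slightly safer variation, sketched here in Python, is to look up the default gateway for the WAN interface and delete only that entry. The parsing assumes the usual `ip route show default` output shape (`default via <ip> dev <iface> ...`). The sample text and the 192.0.2.1 address are made up for illustration, and I have not verified what the UDM Pro actually prints.

```python
# Sketch: find the default gateway for one interface from the text output of
# `ip route show default`, then build (not run) the arp command that would
# flush its entry. Function names are illustrative only.

def default_gateway(route_output, iface):
    """Return the gateway IP of the default route on `iface`, or None."""
    for line in route_output.splitlines():
        fields = line.split()
        # Expected shape: default via <gateway> dev <iface> ...
        if len(fields) >= 5 and fields[0] == "default" and fields[1] == "via":
            if fields[3] == "dev" and fields[4] == iface:
                return fields[2]
    return None


def flush_command(route_output, iface):
    """Return the argv list for deleting the gateway's ARP entry, or None."""
    gateway = default_gateway(route_output, iface)
    if gateway is None:
        return None  # no default route on this interface, so do nothing
    return ["arp", "-i", iface, "-d", gateway]


# Made-up sample output; 192.0.2.1 is a documentation-only address.
sample = "default via 192.0.2.1 dev eth9 proto dhcp metric 100\n"
print(flush_command(sample, "eth9"))
```

On the box itself the route text would come from running `ip route show default`, and the returned list would be executed as root from the same 4 minute cron job. It is still a hack, just one that does nothing when the table does not look the way it expects.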

It does work, and a week later everything is working great with the ARP refresh. It is a surprising issue, perhaps obscured by the fact that not many people run the modem in bridge mode, and because Windows tends to ARP much more often. It is possible that other firewalls have lower ARP thresholds that mitigate this problem. There isn’t a magic ARP timeout that is ‘correct’, and the actual implementations vary a lot.

I talked about this issue at length with my friend Eric Rosenberry, who happens to be Director of Network Architecture at Ziply. He suggested that it could be some layer 2 network layer like EVPN in between the VRRP router and my device. Fortunately he had a few contacts who know a few other contacts that may be able to pass along this interesting problem to some of the client network engineers at Comcast. I’ll update if I hear back.

I did find after a bit of searching that other people have seen this issue with the UDM Pro and Comcast:

An old calculator

When I was in middle school my parents got me a Sharp PC-1401 (a slightly earlier version of this 1403) for Christmas. I had a Sharp EL-506 calculator, and the PC-1401/3 was an extension of that style but included a BASIC language interpreter and a full QWERTY keyboard. It was the precursor of the laptop, and I carried it around every day. I spent quite a bit of time writing BASIC routines to calculate astronomical ephemerides of various kinds. As you can imagine this behavior was quite the magnet for attention from the ladies.

It is a surprisingly easy to use interface, and it has terrific battery life. I used it for a long time before eventually moving onto the classic HP48. Of course now the watch I wear has 10,000 times the processing power, but for those times, it was something.

Nerd on.

Upgrading to 2G/200M home internet.

Interesting results with the newest Comcast offering at home: 2 gig download and 200m upload. The upload seems a little jumpy, but of course this is during peak times.

The DOCSIS3.1 modem has a 2.5G ethernet port, but fortunately one of my 10G SFP+ BaseTs supports 2.5G which works in the UDM Pro. From there 10G to everything else.

Power outage and battery system check

We had a power outage last week that gave me a good chance to test out my battery system. Like any successful test it uncovered a few bugs. The battery system is based around the Enphase Smartswitch combined with a bank of 4x 10.2kWh battery packs that use LiFePO4 cells, along with 53 380W LG solar panels all using IQ7 microinverters.

The concept is that the batteries and the solar microinverters can create a standalone power grid when main power is unavailable. The house can then run on battery power, with the solar providing additional power and charging.

On the positive side, when the power dropped out, I didn’t notice. On the negative side, I didn’t notice! Normally I would have seen an alert on my phone, but I added a small distribution switch for some experiments in the upper garage a while back and didn’t notice that I put it on the non-gen-bat circuit. When power dropped off that switch went down, and the Enphase system couldn’t send out a note to say we were off grid power.

It is an easy fix to change that to a POE powered switch so it is not only on the gen power circuits, but also on the APC Symmetra UPS. I have also added some detection circuitry that will do some home automation tasks in that case, including an announcement over the PA.

The biggest downside of not knowing the power was out was that I continued to use power with abandon! I had a few extra larger servers powered up, plus laundry, dishwasher, and a bunch of other things kicking. I was drawing about 10kW at the time, which is a pretty typical draw for me during the day. 40kWh won’t last long at 10kW: only about four hours at that rate.
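The back-of-the-envelope math is simple enough to script. This is a rough sketch that treats the full nameplate capacity as usable and ignores inverter losses, so real numbers would be somewhat lower. The 3 kW figure is a made-up "after shutting things down" load for comparison, not a measurement.

```python
# Rough battery runtime estimate: usable energy divided by net load.
# Assumes nameplate capacity is fully usable and conversion is lossless.

def runtime_hours(capacity_kwh, load_kw, solar_kw=0.0):
    """Hours until empty at a constant load, with optional constant solar input."""
    net_kw = load_kw - solar_kw
    if net_kw <= 0:
        return float("inf")  # solar covers the load, the batteries never drain
    return capacity_kwh / net_kw


bank_kwh = 4 * 10.2  # four 10.2 kWh packs

print(round(runtime_hours(bank_kwh, 10), 2))  # at the 10 kW daytime draw
print(round(runtime_hours(bank_kwh, 3), 1))   # hypothetical reduced 3 kW load
```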

Once I figured it out, I did some server-shutdowns to reduce usage for things I didn’t need. The batteries ended up lasting through to the next morning before I fired up the generator.

I am going to automate some system shutdown for things that are non-essential, especially my 40+ drive backup SAN, and the math array. I also have a cell modem connection for sending notifications in the case both the power and the internet are offline.

Fun stuff for sure!

Magnetic Fields and the variation

Here is a helpful map if you find yourself walking with a compass and needing to find true north. The lines represent the degree difference between true north and magnetic north.

Here in the PNW the difference is about 15 degrees, while in the middle of the US the difference goes to 0.
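For anyone actually doing this with a compass, the conversion is just an offset. The sketch below uses the common convention that easterly declination is positive, so true bearing = magnetic bearing + declination. The 15 degree value is the rough PNW figure from the map, taken as easterly; check a current chart for a real trip since the lines move over time.

```python
# Convert between compass (magnetic) bearings and true bearings.
# Convention assumed here: easterly declination is positive.

def magnetic_to_true(magnetic_deg, declination_east_deg):
    """True bearing for a given compass reading."""
    return (magnetic_deg + declination_east_deg) % 360


def true_to_magnetic(true_deg, declination_east_deg):
    """Compass reading to follow for a desired true bearing."""
    return (true_deg - declination_east_deg) % 360


PNW_DECLINATION = 15  # degrees east, approximate

print(true_to_magnetic(0, PNW_DECLINATION))   # compass heading for true north
print(magnetic_to_true(0, PNW_DECLINATION))   # where the needle's north really points
```

So in the PNW, walking true north means following a compass heading of about 345 degrees, while in the middle of the US the two are the same.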

Perhaps the most amazing observation is the latitude of the southern pole, which is close to 65 degrees. An eventual geomagnetic reversal will be something to see, albeit it might take a thousand years to complete.

As long as the poles stay out of the right half plane we will be ok! Happy New Year!

A little bit of snow

The fog gives it that Close Encounters of the Third Kind look. On a related note, these shoes with the Michelin snow tire tread work really well on ice.

Ecoflow Battery + Generator Testing

I had mentioned a few months back that I was testing out an EcoFlow Delta Pro portable battery pack. I used it when doing some remote astrophotography late in the summer. It is a pretty compact unit that has 3.6kWh of storage along with a 3.6kW inverter, 1.6kW solar input, and the usual DC outputs.

When I used it this summer I paired it with a small Honda 1800W generator, which worked well. The biggest downside was the lack of integration. Ecoflow has their own generator that is designed to be a companion to the battery unit, so I picked one up.

It is a similar generator to the Honda, being 1800W AC output plus some DC outputs. The biggest advantage is the combination of the autostart (it has an electronic starter) and the integration into the battery unit. It can also run on propane which is very convenient for stored fuel situations.

You can configure the battery pack to control the generator. If the battery charge drops to a set level it will automatically fire up the generator to recharge the batteries, turning the generator off once things are charged. That makes a super automated power system that doesn’t require much intervention, and is very efficient. Using just a generator alone is very inefficient if you have varying and low loads. With the battery/genset combo you get the advantage of only running the generator at nearly full load to recharge, which is where the engine is most efficient.

There is also the small advantage that the generator has a special DC output for that charging, which removes a second AC/DC conversion.

Happy Birthday to me!

Today I am completing my 52nd trip around the fusion ball in the sky. A friend mentioned the yearly trip in an email recently, and it reminded me of something interesting I learned many years ago. When I was around 8 or 9 I bought a book called ‘Practical Astronomy with your Calculator’, which was a fantastic book explaining how to use a calculator to determine the position of objects in the sky, the locations of the planets, the times of eclipses, and other similar things. Of all of the chapters the one on ‘time’ was the most interesting.

The time for the earth to go around the sun is on average 365.24219 days. The extra bit beyond the 365 ( the .24219) was the reason Julius Caesar created the leap year in 45 BC. If you add one day every 4 years you get a mean year of 365.25 days, which is pretty close to the actual. The small error wasn’t seen as that critical, and indeed this leap year scheme lasted until the late 1500s. In 1582 Pope Gregory XIII created the modern Gregorian calendar by adding an extra rule to the Julian leap year system. A leap day would be added every 4 years, but additionally every 400 years three leap days would be skipped (not added). He chose to make it such that if a year was divisible by 100, but not by 400, it would not be a leap year. This meant that the years 1700, 1800 and 1900 were not leap years even though they were divisible by 4. That small change removed 3 leap days every 400 years, which brings the average length of the year down to 365.2425, which is very close to the actual mean 365.24219.
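The two leap year rules are easy to check numerically. This sketch counts leap days over one 400 year cycle under each rule and compares the resulting mean year to the 365.24219 figure above. The function names are mine, and the drift numbers are just arithmetic on that one figure, not a precise astronomical statement.

```python
# Compare the Julian and Gregorian leap year rules over a 400 year cycle.
from fractions import Fraction

TROPICAL_YEAR = Fraction(36524219, 100000)  # 365.24219 days, as quoted in the text


def is_leap_julian(year):
    """Julian rule: every 4th year is a leap year."""
    return year % 4 == 0


def is_leap_gregorian(year):
    """Gregorian rule: century years are leap years only if divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def mean_year(is_leap, cycle=range(1601, 2001)):
    """Exact mean calendar year length in days over the given cycle."""
    leap_days = sum(1 for y in cycle if is_leap(y))
    return 365 + Fraction(leap_days, len(cycle))


for name, rule in (("Julian", is_leap_julian), ("Gregorian", is_leap_gregorian)):
    mean = mean_year(rule)
    drift = mean - TROPICAL_YEAR  # days gained per year against the seasons
    print(name, float(mean), "-> one day of drift in about",
          round(float(1 / drift)), "years")
```

The Julian rule drifts by a day in roughly 128 years, which is how the error grew to 10 days by 1582, while the Gregorian rule takes a bit over 3,000 years to drift by a day.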

Perhaps even more interesting was that in 1582 the errors of the previous years had accumulated enough to have a noticeable effect on the start of the seasons and the calculation of Easter. To get us back on track Pope Gregory ordered that Thursday October 4th, 1582 would be followed by Friday October 15th, 1582. The dates between those two don’t exist. Pope Gregory deleted 10 days from the calendar to get us back on track. It is an oddity of the calendar that most people don’t notice.

There was a cool party trick I learned for quickly calculating the ‘day of the week’ from any given date. You can ask someone their birthdate and quickly tell them what day of the week it was on (a Monday for instance). However, that trick only works for dates on or after October 15th, 1582. Equally, no one can have a birthdate of October 5th, 1582, because it doesn’t exist in the Gregorian calendar.
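This isn’t the mental shortcut itself, but Zeller’s congruence is a standard formula that does the same job and is easy to code up. The sketch below rejects anything before October 15th, 1582, since the formula here is the Gregorian version. Note that Python’s own datetime module quietly extends the Gregorian calendar backwards, so it will happily return a weekday for October 5th, 1582.

```python
# Day of the week for a Gregorian date using Zeller's congruence.
# In Zeller's numbering 0 is Saturday, 1 is Sunday, ... 6 is Friday.

DAY_NAMES = ["Saturday", "Sunday", "Monday", "Tuesday",
             "Wednesday", "Thursday", "Friday"]


def day_of_week(year, month, day):
    """Return the weekday name for a date on or after 1582-10-15."""
    if (year, month, day) < (1582, 10, 15):
        raise ValueError("the Gregorian calendar starts on 1582-10-15")
    # January and February count as months 13 and 14 of the previous year.
    if month < 3:
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    h = (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 + 5 * j) % 7
    return DAY_NAMES[h]


print(day_of_week(1582, 10, 15))  # first day of the Gregorian calendar
print(day_of_week(2000, 1, 1))
```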

Cheers my friends to another year of interesting and challenging problems to solve, people to meet, and the past to appreciate.