Troubleshooting Slow Internet Connections with PRTG and Palo Alto


“The Internet out here sucks!”  Surely we’ve all heard this time and time again from our users, no matter how big a pipe we’re willing to pay for.  Oftentimes it’s even perceived as poor overall network performance, or worse, as a sign that the IT department just doesn’t know what they’re doing at all.  As more and more services move to the cloud, maintaining high-speed connectivity to the Internet is becoming more of a necessity and less of a luxury.

The problem is determining what exactly is causing the “slow” connection.  It very well could be a faulty piece of gear, inadequate rate-limiting policies, or even the other side of the connection that you have no control over.  But in my experience, it’s very often just an overused link.  Of course, to prove that, we need data.  Some firewalls and routers will provide you with real-time statistics for the interfaces they service, but not all will.  Those that do often show only a point in time, namely the moment you’re watching.  But you sure don’t want to sit there and watch a chart all day until the issue shows up, and then check with your user to see if they’re seeing the problem too.  Enter PRTG.

PRTG will help you monitor this and much more data over time: long periods, short periods, weird periods, whatever you want.  Being able to graph this data has helped me solve numerous issues, and it’s especially helpful for convincing a vendor that you know what you’re doing and that your data is accurate.

Here’s the scenario.  A client has a handful of remote sites, each with its own Internet connection, and users at many of them complain that the Internet is slow.  One user in particular claims it’s unusable from the time they get there until after lunch, but after lunch there are no issues at all.  So, we add the site’s firewall to PRTG, make sure all the interfaces are being monitored, and the next day we get this:

PRTG Graph

As you may surmise, the site in question has a 25Mb/s connection.  That graph stays pegged until about 11:30AM and then drops down to a manageable 5-15Mb/s usage.  So obviously, we have a saturation issue.  But the pattern is a little odd.  We let the system continue to run for another few days; every single day, we saw the exact same pattern, like clockwork.
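If you would rather confirm the pattern from exported numbers than by eyeballing the graph, a minimal sketch like the one below can flag the hours of the day in which average usage sits near the link's capacity.  The sample data, the function name `saturated_hours`, and the 90% threshold are all made up for illustration; this is not a PRTG API, just plain Python run over (timestamp, Mb/s) pairs however you happen to export them.

```python
from datetime import datetime

def saturated_hours(samples, link_mbps, threshold=0.9):
    """Return the hours of the day whose average usage is at or above
    threshold * link_mbps. `samples` is an iterable of (datetime, mbps)."""
    by_hour = {}
    for timestamp, mbps in samples:
        by_hour.setdefault(timestamp.hour, []).append(mbps)
    limit = threshold * link_mbps
    return sorted(
        hour for hour, values in by_hour.items()
        if sum(values) / len(values) >= limit
    )

# Hypothetical samples: pegged near 25 Mb/s in the morning, light afterwards.
samples = (
    [(datetime(2024, 1, 8, hour, 0), 24.5) for hour in range(8, 12)]
    + [(datetime(2024, 1, 8, hour, 0), 10.0) for hour in range(12, 17)]
)

busy = saturated_hours(samples, link_mbps=25)
print(busy)  # [8, 9, 10, 11]
```

Running the same check over several days of samples is a quick way to show that the saturation window repeats at the same hours each day.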

Great, we know we need a bigger pipe, right?  Actually, 25Mb/s seems like plenty for the user base at this site, especially considering the load after lunch is just fine.  To look a little closer, we can use data from our Palo Alto firewalls to see what exactly is using all that bandwidth.  First we need to create a custom report.  You can find this under Monitor > Manage Custom Reports.  We’re going to want to create a new one that looks like this:

PA Report

You can of course add columns as you like, but this will get us what we’re after.  We can also adjust the timeframe to whatever we want, in this case, just the morning.  We can even schedule it to run at a regular interval and have it emailed to anyone who may be interested.  Now, you can achieve a similar goal with NetFlow, but with Palo Alto App-ID, we can get much more granular.  Here is what we found for the first 30 minutes of that spike.

5.2GB of Windows Updates!  In a corporate environment that uses WSUS?  If you do the math, that pretty much maxes out that 25Mb/s connection for the entire half hour.  On top of that, it’s apparently happening every single day.  We ended up discovering that the computer labs at this site were using Deep Freeze.  The image in use was misconfigured: not only did it pull Windows Updates from the Internet instead of WSUS, it also didn’t store the updates in thaw space.  So every morning, when the machines were rebooted, they had to re-download all those updates.  Once we fixed this issue, the Internet connection out there was never reported as “slow” again.  At least not yet, but if it is, we’ll already have the tools in place to troubleshoot it right away.
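For anyone who wants to see the math spelled out, here is a quick sketch.  It assumes the report's 5.2GB is decimal gigabytes (10^9 bytes) and that the traffic was spread evenly over the 30 minutes; the function name `average_mbps` is just for illustration.

```python
def average_mbps(gigabytes: float, minutes: float) -> float:
    """Average throughput in Mb/s for `gigabytes` transferred over `minutes`."""
    bits = gigabytes * 1e9 * 8      # assumption: decimal GB, 8 bits per byte
    seconds = minutes * 60
    return bits / seconds / 1e6     # bits per second -> megabits per second

LINK_MBPS = 25                      # the site's link speed from the article

rate = average_mbps(5.2, 30)
utilization = rate / LINK_MBPS
print(f"{rate:.1f} Mb/s average, {utilization:.0%} of a {LINK_MBPS} Mb/s link")
# 23.1 Mb/s average, 92% of a 25 Mb/s link
```

That is Windows Update traffic alone eating more than nine tenths of the pipe before anyone opens a browser.  If the report is actually counting binary gigabytes, the figure lands even closer to the full 25Mb/s.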

While this was a special case involving specially imaged machines, you can see how this report can also show you other bandwidth hogs.  We see a lot of Java, Adobe, and Google updates in addition to Windows Updates.  All of these can either be managed centrally or at least be put under a QoS policy so they stop wasting your bandwidth.  One cool thing we’ve seen from Palo Alto firewalls is that App-ID can actually identify Apple updates too.  So you can pretty easily prevent a new version of iOS from crippling your connection.