Fault Monitoring vs. Performance Monitoring
Every IT administrator knows that users typically complain about two things: it doesn’t work, or it’s slow. When it doesn’t work, it’s usually because something is down, and we can rely on fault monitoring tools to notify us. But where do we start when users complain of poor performance? And what tools are available to help us? In these situations, performance monitoring tools might be just what you need.
Knowledge is power. We all know that. And when you’re tasked with managing a network, this statement is truer than ever. You need to know what’s going on at all times. The problem with networks is that you can’t readily see what’s going on: it all happens at blazing speed within copper wires or glass fibres. This is where monitoring tools come in handy. However, with so many different types of monitoring tools available, each with its own pros and cons, picking the right one for your specific situation can be a challenge. Today, to help shed some light on this complex subject, we’ll compare two popular types of network monitoring: fault monitoring and performance monitoring.
Our objective today is to compare two very different types of monitoring technologies and, more importantly, to emphasize how different they are. So different, in fact, that they are not mutually exclusive and can even be thought of as complementary. With one of them centralized and the other distributed, they provide vastly different kinds of insight, different points of view. So, we’ll first describe both approaches to monitoring and outline their respective characteristics before comparing them.
Fault Monitoring In A Nutshell
There’s one thing about monitoring, any kind of monitoring: it seems like everyone has their own idea of what it is, or what it should be. Fortunately, this is not so much the case with fault monitoring, and pretty much everyone agrees on what it is. In one sentence, fault monitoring is the process of detecting, isolating, and sometimes correcting malfunctions in networks. The process involves maintaining and examining error logs, receiving and processing error notifications, tracing and identifying faults, and carrying out diagnostic tests.
There are two main types of fault monitoring: active and passive. Passive fault monitoring works by collecting alerts from networking devices (often sent as SNMP traps), which are raised whenever something abnormal happens. This type of fault monitoring relies on the networking devices themselves to identify faults and notify a centralized monitoring system. In a way, a passive fault monitoring system is simply a glorified notification system that can send various types of notifications based on the detected faults. The main drawback of such systems is that a complete failure of a piece of equipment will often leave it unable to send out an alert, and the problem might go undetected for some time.
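To make the passive model concrete, here is a minimal sketch in Python. A real passive fault monitor would receive and decode SNMP traps on UDP port 162; this sketch skips the wire protocol and simulates the central handler that receives device notifications and routes them to alerting actions. All device names, severities, and messages are illustrative.

```python
# Passive fault monitoring sketch: devices push fault notifications
# ("traps") to a central handler, which decides how to alert.
# Real systems would decode SNMP traps; this handler is a simulation.

from dataclasses import dataclass, field

@dataclass
class Trap:
    device: str    # which device raised the fault (illustrative name)
    severity: str  # e.g. "warning" or "critical"
    message: str

@dataclass
class PassiveFaultMonitor:
    alerts: list = field(default_factory=list)

    def handle_trap(self, trap: Trap) -> None:
        # Route the notification based on severity. A real tool might
        # email, page the on-call, or open a ticket here.
        if trap.severity == "critical":
            self.alerts.append(f"PAGE on-call: {trap.device}: {trap.message}")
        else:
            self.alerts.append(f"LOG: {trap.device}: {trap.message}")

monitor = PassiveFaultMonitor()
monitor.handle_trap(Trap("switch-01", "warning", "fan speed degraded"))
monitor.handle_trap(Trap("router-03", "critical", "link down"))
print(monitor.alerts[-1])  # the critical trap triggers a page
```

Note the weakness the article describes: this handler only reacts to traps that arrive. A device that dies outright sends nothing, and `handle_trap` is never called for it.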
Active fault monitoring was invented to address the shortcomings of the passive approach. Instead of relying on devices to report faults, it actively (hence the name) monitors them. It will, for instance, use ping to ensure that devices are up and can be reached, and raise an alarm whenever something abnormal is detected, such as a device that no longer responds. It is just as centralized as passive fault monitoring, as all tests are typically performed from the centralized monitoring system.
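The active approach can be sketched just as simply. Real tools usually use ICMP ping, which needs raw-socket privileges; this sketch substitutes a TCP connect test, which any unprivileged process can run. The throwaway local listener stands in for a monitored device, so the whole example is self-contained; all addresses are illustrative.

```python
# Active fault monitoring sketch: the central system polls devices
# itself instead of waiting for them to report. A TCP connect test
# stands in for ICMP ping here.

import socket

def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Stand-in for a live device: a throwaway listener on localhost.
server = socket.socket()
server.bind(("127.0.0.1", 0))            # pick any free port
server.listen(1)
port = server.getsockname()[1]

up = is_reachable("127.0.0.1", port)     # device answers: no alarm
server.close()                           # simulate the device failing
down = is_reachable("127.0.0.1", port)   # no answer: raise an alarm
print(up, down)
```

Unlike the passive handler, this probe notices a device that has failed completely, because the monitoring system, not the device, initiates the test.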
The best type of fault monitoring is, of course, one that combines both approaches. But active or passive, fault monitoring has a few disadvantages. First, it is a reactive process: it will notify you that something is broken once it is broken. This is undeniably a good thing, but wouldn’t you rather be notified as soon as something is not working well instead of not working at all? Another disadvantage is that all components must be explicitly monitored. If a networking device is not configured to send its alerts to a passive monitoring system, or is not polled by an active one, no one will ever be notified of issues with it. It is also almost impossible to add fault monitoring to devices you don’t control, such as your service provider’s routers and switches.
How About Performance Monitoring?
While fault monitoring is clearly defined, things are not so clear with performance monitoring. In fact, it seems like every monitoring tool vendor has its own definition of the term. For instance, some vendors of bandwidth usage monitoring tools will claim their tool is a performance monitoring tool. To a certain extent, they are right. After all, available bandwidth is a valid measure of a network’s performance. Or is it really?
We often compare networks to highways. After all, there are many similarities between the two. So, imagine a four-lane highway that has been designed to carry up to 7 600 vehicles per hour. The equivalent of bandwidth monitoring would be counting the vehicles passing a given point. For the purpose of this analogy, let’s say that we’ve counted 3 800 vehicles in an hour. This tells us that our highway is used at 50% of its capacity.
While this is a very useful piece of information, it says nothing about how long a vehicle will take to go from one point to another along that highway. Some will argue that, knowing there is ample capacity and knowing the speed limit, travel time can easily be extrapolated. And while it could, it would only be an estimate, not a real figure. It would not, for instance, take into account any slowdowns caused by bad weather or a stalled vehicle somewhere along the way. The only way to have a real account of how long it takes to go from one point to another is to get into a car and time the ride from the starting point to the ending point.
A Real-life Example
Suppose you manage a cloud-based infrastructure which hosts business-critical applications and services. While fault monitoring would be useful and notify you if one of the components or systems stopped responding, you probably expect more from your monitoring solution. A true performance monitoring tool would go a step further and test that the actual performance of these systems and services is sufficient to support the productivity of their users.
True performance monitoring tools measure the actual performance of the complete network path by generating simulated traffic that mimics what users are doing and measuring exactly how long it takes to reach its destination. What you get from these systems is a real assessment of your network’s performance from the end user’s point of view. This includes the performance of the devices you don’t have access to but that are in the network path, such as your service provider’s equipment.
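This is the "get into a car and time the ride" idea from the highway analogy, and it can be sketched in a few lines. The "application" below is a local TCP echo server standing in for a real service, so the example runs anywhere; a real agent would target production endpoints across the actual network path. Names and the payload are illustrative.

```python
# Synthetic-transaction sketch: generate traffic that mimics a user
# request and time the full round trip, end to end. A local echo
# server stands in for the real service being monitored.

import socket
import threading
import time

def echo_server(sock: socket.socket) -> None:
    conn, _ = sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))      # echo the request back

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

def timed_transaction(payload: bytes) -> float:
    """Run one simulated user transaction; return round-trip time in ms."""
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(payload)
        reply = c.recv(1024)
    assert reply == payload                # the transaction must succeed
    return (time.perf_counter() - start) * 1000

rtt_ms = timed_transaction(b"GET /login")
print(f"synthetic transaction took {rtt_ms:.2f} ms")
```

The measured time includes every hop and every device in the path, whether you control it or not, which is exactly what bandwidth counting cannot give you.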
Comparing the two
Both fault monitoring and performance monitoring have advantages. For instance, fault monitoring is often easier to set up. This is particularly true with active systems that will typically feature some sort of auto-discovery mechanism. You only need to supply these systems with a range of IP addresses to scan and they will do the rest. Passive fault monitoring systems, on the other hand, can be more complex to set up as each piece of equipment must be configured to send its alerts to the monitoring tool. Another drawback of these systems is that not all equipment can send out its alerts, although most can. There might also be some equipment along the path that you don’t control such as your service provider’s routers.
As for setting up performance monitoring systems, it tends to be somewhat more complex, but in reality it can be as simple as you want it to be. For example, you could deploy monitoring agents only in problematic locations. Performance monitoring systems will typically require more planning to identify where to deploy agents and what types of traffic to generate, but putting them in place won’t require any change to any networking equipment. And to make things easier, some systems will have predefined tests that you won’t even have to configure. You simply need to deploy your agents, add them to the monitoring tool, and it can take care of the rest.
There are a few places where performance monitoring really shines. One of them is the ability it provides to proactively detect upcoming issues. The slightest performance degradation can trigger an alert, giving administrators enough time to determine the cause of the issue and potentially fix it before it has too much impact, often before it is noticed by users.
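The proactive part comes down to comparing recent measurements against a known-good baseline instead of waiting for outright failure. Here is a minimal sketch of that logic; the baseline, threshold factor, and latency samples are all illustrative numbers, and a real tool would learn the baseline and tune the threshold per test.

```python
# Proactive degradation detection sketch: alert when recent latency
# drifts above a baseline, well before anything stops responding.

from statistics import mean

def check_degradation(samples_ms, baseline_ms, threshold=1.5):
    """Alert when the recent average exceeds baseline by the threshold factor."""
    recent = mean(samples_ms[-5:])           # average of the last 5 tests
    if recent > baseline_ms * threshold:
        return f"ALERT: avg {recent:.0f} ms vs baseline {baseline_ms:.0f} ms"
    return None                              # within normal range

baseline = 20.0                              # normal round-trip time (ms)
healthy = [19, 22, 18, 21, 20]
degrading = [19, 22, 35, 41, 48]             # a creeping slowdown

print(check_degradation(healthy, baseline))    # nothing to report
print(check_degradation(degrading, baseline))  # fires before total failure
```

Contrast this with fault monitoring: a ping-based check would still report every one of those degrading samples as "up".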
Another important strength of performance monitoring is how it can perform tests continuously. The best systems can be configured to run tests as often as required to offer the needed level of protection. The test interval can be as short as 500 ms, that is, twice per second. With such a testing frequency, we can virtually talk about real-time performance monitoring.
But one of the biggest advantages of performance monitoring tools is that they will detect issues anywhere along the path between the agents, which could include service provider devices. Although it usually features some form of centralized console used for configuration and reporting, often providing some form of dashboard, performance monitoring is not centralized: testing is typically carried out between agents, with the results sent to the console, which rarely takes an active part in the actual monitoring.
Performance monitoring is typically more complex than fault monitoring. But this complexity brings a level of sophistication that simply cannot be matched otherwise. Performance monitoring is like having an army of administrators constantly running tests from various points on your network. In fact, it’s actually better than an army of administrators because it can run tests much faster than any human could and it will react faster to any abnormal test result.
One word of caution, though. When shopping for a performance monitoring tool, you need to make absolutely sure that this is really what you’re getting. Far too many monitoring tools of various types call themselves performance monitoring tools, but many of them, while they provide a useful type of network monitoring such as bandwidth usage, don’t truly monitor performance. At least not in the sense we intend it: a tool that measures the true performance of the network by simulating real user activity rather than extrapolating it from other measurements.