Performance Monitoring
The first step is to examine your rules, routers and firewalls to identify which are most susceptible to risk, which you rely on the most and which you need the most from. This will help you prioritize where to focus your resources.
A standard firewall metric that will probably spring to mind is "availability." It tells you how much of the time the box was up, for example, 99.9 percent. This is obviously a good metric to track; however, in my opinion, it has limited applicability on its own. It suggests that everything’s fine with only 0.1 percent downtime, but it doesn’t tell you what went wrong, how to fix it, or how to improve performance and avoid it happening again. It simply states the obvious: valuable uptime was lost.
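To make the 0.1 percent figure concrete, the following minimal sketch converts an availability percentage into the downtime it implies. The function name `downtime_hours` and the choice of a 365-day year and a 30-day month are illustrative assumptions, not part of any firewall product.

```python
def downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the hours of downtime implied by an availability figure.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (100.0 - availability_percent) / 100.0


HOURS_PER_YEAR = 365 * 24      # assumes a non-leap year
HOURS_PER_30_DAYS = 30 * 24    # assumes a 30-day reporting month

yearly = downtime_hours(99.9, HOURS_PER_YEAR)
monthly = downtime_hours(99.9, HOURS_PER_30_DAYS)

print(f"99.9% over a year:  {yearly:.2f} hours of downtime")
print(f"99.9% over 30 days: {monthly * 60:.1f} minutes of downtime")
```

Under these assumptions, "three nines" still allows close to nine hours of downtime a year, and the number alone says nothing about when or why that downtime occurred.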
That’s not to say that basic metrics aren’t valuable. Some standard baseline performance metrics deliver exceptionally useful data, and every firewall team should be tracking them: CPU utilization, memory utilization, connections passed, connections dropped and simultaneous connections. These dimensions are all important when examining your firewall’s current performance and comparing it with how it behaved previously (yesterday, last week or last month) to determine whether there’s a significant change warranting further investigation. They are also key inputs to a capacity-planning exercise to pinpoint whether a firewall is overloaded. Performance metrics may indicate that a hardware upgrade is needed, but it is worth first checking whether the firewall configuration can be optimized, as there may be underutilized capacity elsewhere.
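One simple way to decide whether today's reading is a "significant change" from the baseline is to compare it with the spread of recent history. The sketch below flags a value that lies more than a chosen number of standard deviations from the historical mean. The three-sigma threshold, the function name and the sample CPU figures are all assumptions made for illustration; real baselines often need to account for time-of-day and weekly patterns.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations away from the mean of `history`.

    Hypothetical helper; the threshold is an assumption, not a standard.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any difference counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Made-up daily peak CPU utilization (percent) for the previous week.
cpu_last_week = [41.0, 43.5, 39.8, 42.1, 40.7, 44.0, 42.3]

flag_normal = deviates_from_baseline(cpu_last_week, 43.0)
flag_spike = deviates_from_baseline(cpu_last_week, 78.0)

print("43% CPU needs investigation:", flag_normal)
print("78% CPU needs investigation:", flag_spike)
```

The same check can be applied to memory utilization, dropped connections or simultaneous connections by passing in the corresponding history.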
A more sophisticated approach to tracking firewall performance is to use an external testing product that streams traffic through the firewall to a collector and records the throughput, latency and jitter that the firewall and network impose on the packet stream. This live bandwidth monitoring can be an important part of understanding whether a firewall is cleanly passing performance-sensitive traffic such as VoIP and videoconferencing.
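As a rough illustration of what such a collector computes, the sketch below derives throughput, mean latency and jitter from per-packet send and receive timestamps. It assumes the sender and collector clocks are synchronized, and it uses a simplified jitter definition (the mean absolute difference between consecutive one-way latencies) rather than the smoothed estimator used by RTP. The `Probe` record and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One test packet as seen by a hypothetical collector."""
    sent_s: float       # send time in seconds (sender clock)
    received_s: float   # receive time in seconds (collector clock)
    size_bytes: int


def stream_stats(probes):
    """Summarize a probe stream; assumes clocks are synchronized."""
    if len(probes) < 2:
        raise ValueError("need at least two probes")
    latencies = [p.received_s - p.sent_s for p in probes]
    mean_latency = sum(latencies) / len(latencies)
    # Simplified jitter: average change in latency between neighbours.
    diffs = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
    jitter = sum(diffs) / len(diffs)
    # Throughput over the window from first send to last receive.
    elapsed = probes[-1].received_s - probes[0].sent_s
    total_bits = 8 * sum(p.size_bytes for p in probes)
    return {
        "mean_latency_s": mean_latency,
        "jitter_s": jitter,
        "throughput_bps": total_bits / elapsed,
    }


# Made-up sample: four 1250-byte packets sent 20 ms apart.
probes = [
    Probe(0.000, 0.010, 1250),
    Probe(0.020, 0.032, 1250),
    Probe(0.040, 0.049, 1250),
    Probe(0.060, 0.071, 1250),
]
stats = stream_stats(probes)
print(f"mean latency: {stats['mean_latency_s'] * 1000:.2f} ms")
print(f"jitter:       {stats['jitter_s'] * 1000:.2f} ms")
print(f"throughput:   {stats['throughput_bps'] / 1000:.1f} kbit/s")
```

For voice and video traffic, the jitter figure is often the more telling of the three, since a firewall can show healthy throughput while still delivering packets unevenly.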
Change Management Monitoring
Nothing stays the same for long, and as your IT environment changes, so does your firewall. You need to change, create, disable or even delete rules. Change can affect availability, either positively or negatively, and as this is one of the main things a firewall must provide, metrics that provide meaningful data that can be acted upon are invaluable.
Configuration updates happen in a number of ways:
Firewalls do not have a change-management process built into them, so documenting changes has never become a best (or even a standard) practice for many organizations. If a firewall administrator makes a change because of an emergency or some other business disruption, chances are they are under pressure to make it happen as quickly as possible, and process goes out the window. But what if this change cancels out a prior policy change, resulting in downtime? By monitoring the number of planned versus unplanned changes, you can determine how well the team is anticipating users' requirements and proactively managing the firewalls, as opposed to making "seat of the pants" updates. Another great metric is the percentage of changes resulting in outages, as this provides feedback on how well the operational team understands the changes they’re making and their impact, and whether they’re using some method or tool to verify changes before they’re made.
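Both change metrics described above are simple ratios once each change is recorded with two facts: whether it was planned and whether it caused an outage. The sketch below assumes a hypothetical `Change` record and made-up sample data; in practice the records would come from whatever ticketing or change log the team keeps.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """A single firewall change (hypothetical record layout)."""
    change_id: str
    planned: bool
    caused_outage: bool


def change_metrics(changes):
    """Return planned, unplanned and outage-causing changes as percentages."""
    total = len(changes)
    if total == 0:
        raise ValueError("no changes recorded")
    unplanned = sum(1 for c in changes if not c.planned)
    outages = sum(1 for c in changes if c.caused_outage)
    return {
        "planned_pct": 100.0 * (total - unplanned) / total,
        "unplanned_pct": 100.0 * unplanned / total,
        "outage_pct": 100.0 * outages / total,
    }


# Made-up month of changes: six planned, two emergency, one outage.
changes = [
    Change("C-001", planned=True, caused_outage=False),
    Change("C-002", planned=True, caused_outage=False),
    Change("C-003", planned=False, caused_outage=True),
    Change("C-004", planned=True, caused_outage=False),
    Change("C-005", planned=True, caused_outage=False),
    Change("C-006", planned=False, caused_outage=False),
    Change("C-007", planned=True, caused_outage=False),
    Change("C-008", planned=True, caused_outage=False),
]
metrics = change_metrics(changes)
print(f"planned:   {metrics['planned_pct']:.1f}%")
print(f"unplanned: {metrics['unplanned_pct']:.1f}%")
print(f"outages:   {metrics['outage_pct']:.1f}%")
```

Tracking these percentages per month, rather than as a single snapshot, shows whether the team is moving toward or away from planned, verified changes.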
Another really useful metric, although rarely tracked, is the mean time to recovery (MTTR): in other words, how quickly, on average, the team restored service across your outages. This metric is a good gauge of your team’s familiarity with and understanding of the firewall's configuration, and whether that understanding is improving or diminishing. A rising MTTR could also be an indicator that the configuration is getting too complex or unruly. If you’ve read "The Visible Ops Handbook," you’ll remember that 80 percent of all outages are caused by configuration adjustments and that 80 percent of the MTTR is spent identifying what changed. Therefore it stands to reason that if the team understands exactly what changed, they should be able to isolate the failure point within a minute and restore service in less than five. Ultimately, the goal is to avoid downtime in the first place.
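MTTR is just the average outage duration, so it can be computed from a list of outage start and restore times. The following is a minimal sketch; the function name and the sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(outages):
    """Average duration of outages given (start, restored) datetime pairs.

    Hypothetical helper for illustration only.
    """
    if not outages:
        raise ValueError("no outages recorded")
    durations = []
    for start, restored in outages:
        if restored < start:
            raise ValueError("restore time precedes outage start")
        durations.append(restored - start)
    return sum(durations, timedelta()) / len(durations)


# Made-up outage log: durations of 4, 12 and 8 minutes.
outages = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 9, 4)),
    (datetime(2024, 3, 9, 14, 30), datetime(2024, 3, 9, 14, 42)),
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 20, 22, 23)),
]
mttr = mean_time_to_recovery(outages)
print(f"MTTR: {mttr.total_seconds() / 60:.1f} minutes")
```

Comparing this figure from one reporting period to the next gives the trend the paragraph above describes.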