
When “Uptime” is not enough: the case for Wire Data

In an interview with Network World, ExtraHop founder and CEO Jesse Rothstein pointed out that “resource utilization is not the same thing as performance.” In this day and age of hypervisors and virtualization, if your CPU isn’t occasionally in the yellow, you aren’t getting the most out of your hardware. During the recent, and very public, launch of healthcare.gov, I can almost guarantee that every system and system owner involved was tracking machine utilization. We have all been on that tech call where you start to hear “CPU and memory looks good on the back end, disk queue length is fine” or “I don’t see any dropped packets or errors in the log files”.

I think the introduction of wire data into the operational toolset can provide even more stability to what is currently a three-legged stool of agent data, machine data, and synthetic transaction data. Wire data offers a fourth leg that can help stabilize IT operations and provide more valuable data by giving you the performance of actual transactions as they happen on the wire. I myself have not installed an OS (except Xen, KVM, or vSphere) on bare metal for a non-Exchange, non-SQL Server workload in almost ten years. This influx of hypervisors is going to force us to look beyond the traditional monitoring of system resources and begin to monitor what is actually important: the transaction.

When I worked for a large healthcare provider, after using ExtraHop for a few days I realized that we were not in the business of selling CPU, memory, or disk performance. At our core, we were in the business of selling transactions, and none of the incumbent tools could do (nor were they designed to do) an adequate job of telling me how long users waited for a SQL or RESTful transaction to complete until ExtraHop began to expose wire data metrics. Traditional APM has blind spots that only the wire can reveal. And while Wireshark is a great tool, often only a handful of people know how to use it effectively (at least at the Level I Operations tier), and the amount of data to sift through on today’s 10Gb, 40Gb, and faster networks makes it somewhat like drinking from a fire hose.

My CPU looks good, Memory looks good, Disk I/O is good, why are my applications slow? (What’s in YOUR blind spot?)


While traditional APM solutions will give you things like logs, CPU, disk, and memory statistics, the kinds of problems that can elude traditional APM include (see the sketch after this list for how wire data can surface one of them):

  • Malware
  • Slow DNS Performance
  • DNS Lookup Failures
  • CIFS Errors
  • Duplicate IPs
  • Slow DB Queries (Ever added more RAM to a SQL Server that just needed the tables indexed?)
  • Slow RESTful URIs
  • TCP Retransmission Timeouts
  • Database Errors (Schema changes)
  • .conf files that have been fat-fingered with the wrong servername
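
To make the point concrete, here is a minimal sketch (not ExtraHop’s implementation, just an illustration) of how even a simple script run against a packet capture can surface two of the blind spots above, slow DNS and lookup failures, from nothing but wire data. The file name capture.pcap and the 100 ms threshold are assumptions made for the example.

```python
# Illustrative only: report slow DNS lookups and failures straight from the wire.
# Assumes the scapy library and a capture file named capture.pcap.
from scapy.all import rdpcap, DNS, IP, UDP

SLOW_MS = 100  # hypothetical threshold for calling a lookup "slow"

pending = {}   # (client IP, DNS transaction id) -> (query name, time sent)

for pkt in rdpcap("capture.pcap"):
    if not (pkt.haslayer(IP) and pkt.haslayer(UDP) and pkt.haslayer(DNS)):
        continue
    dns = pkt[DNS]
    if dns.qr == 0 and dns.qd is not None:          # query from a client
        name = dns.qd.qname.decode(errors="replace")
        pending[(pkt[IP].src, dns.id)] = (name, float(pkt.time))
    elif dns.qr == 1:                               # response back to the client
        key = (pkt[IP].dst, dns.id)
        if key in pending:
            name, sent = pending.pop(key)
            elapsed_ms = (float(pkt.time) - sent) * 1000
            if dns.rcode != 0:
                print(f"FAILED lookup {name} (rcode={dns.rcode})")
            elif elapsed_ms > SLOW_MS:
                print(f"SLOW lookup {name}: {elapsed_ms:.1f} ms")
```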


All of the metrics above are things that ExtraHop shined a light on that were previously blind spots in our environment. In the absence of wire data, these issues would have been largely invisible to us and could have run amok even while our CPU, memory, and disk I/O looked great. ExtraHop provides not only errors but performance metrics that are neatly parsed and can be forwarded to an existing Operational Intelligence platform like Splunk. If the traditional three-legged stool of agent data, synthetic transactions, and machine data doesn’t show you the problem, add the fourth leg to your operational platform and get visibility into an entirely new set of metrics.
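
As a rough idea of what that forwarding can look like, once a wire data event has been parsed, getting it into a platform like Splunk can be as simple as writing it to a TCP data input. The sketch below is hypothetical: the host splunk.example.com, port 1514, and the event fields are assumptions for illustration, not ExtraHop’s actual integration.

```python
# Hypothetical sketch: forward one parsed wire-data event to a Splunk TCP data
# input. Host, port, and field names are illustrative assumptions.
import json
import socket
import time

event = {
    "time": time.time(),
    "event_type": "slow_db_query",
    "client": "10.0.1.25",
    "server": "sql01.example.com",
    "process_time_ms": 2350,
}

with socket.create_connection(("splunk.example.com", 1514)) as splunk:
    splunk.sendall((json.dumps(event) + "\n").encode("utf-8"))
```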

At the end of the day, what good are “five nines” when an end user stares at an hourglass for 30 seconds as they tab through menus, or looks at “page loading” for 2 minutes after clicking “Purchase”?

ExtraHop positions you to know, at a minimum, the process time of every single layer 4 conversation. Depending on the licensing, you can get down to the very query that is running slow and gain insight into distributed applications, allowing you to see their performance as they talk to one another.
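
To illustrate what “process time of a layer 4 conversation” means, here is a rough sketch, and only a sketch, of the idea: for each conversation, measure the gap between a client request payload and the first server payload that answers it. The capture file name and the server port (8080) are assumptions; a real wire data platform does this continuously, at line rate, and with far more rigor.

```python
# Rough illustration of per-conversation "process time": the delay between the
# last client payload seen and the first server payload that follows it.
# Assumes scapy, a file named capture.pcap, and an application on port 8080.
from scapy.all import rdpcap, IP, TCP

SERVER_PORT = 8080
awaiting_reply = {}  # (client IP, client port) -> time the request was seen

for pkt in rdpcap("capture.pcap"):
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP) and len(pkt[TCP].payload)):
        continue
    ip, tcp = pkt[IP], pkt[TCP]
    if tcp.dport == SERVER_PORT:                      # client -> server request
        awaiting_reply[(ip.src, tcp.sport)] = float(pkt.time)
    elif tcp.sport == SERVER_PORT:                    # server -> client response
        key = (ip.dst, tcp.dport)
        if key in awaiting_reply:
            ms = (float(pkt.time) - awaiting_reply.pop(key)) * 1000
            print(f"{key[0]}:{key[1]} waited {ms:.1f} ms for the server")
```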

Conclusion:
The truth is, uptime isn’t good enough anymore. Many of us have gone for years without visibility into performance and errors on the wire. While traditional methods still have their place in operations, they lack the visibility that wire data can provide. Do we really care what system resources look like as long as our transactions are responding smartly? And does it matter how good those resources look if users are waiting too long for transactions to finish? I think operationally we may be at a crossroads in terms of understanding what is really important. We have done the best we can with the tools we have had for the last 15 years, but as systems have matured and become more distributed, we need tools that can adapt at the speed of the cloud. Wire data should be the FIRST thing we look at because we can deploy it without agents. It does not care if you push code updates once a week, and you don’t have to re-record synthetic transactions or install new agents when the next version of Windows comes out. If you’ve got an IP address…you’re good!

The time has come for operations teams to take a good hard look at wire data and come to grips with just how powerful it really is.

Thanks for reading, and have a wonderful holiday.

John M. Smith