Making the news the last few weeks has been the problems associated with the Healthcare.gov website. Being interested in the Healthcare Exchanges and seeing what might be available I decided to sign up. While doing so I decided to connect to healthcare.gov from my lab so that I could see the results in Extrahop and try to get some wire data of the experience.
There were definitely some areas that were slower than others but luckily I signed in during the AM on the east coast so I am guessing the site was not extensively busy at the time. Currently I am waiting to hear back on my eligibility and they sent me a document to download but I am currently unable to download it as it times out repeatedly. Outside of that, the initial sign up took about 15 minutes and while there were some slowdowns, it was not so bad compared to other municipal sites (save those I hosted as a Federal Employee at the CDC which snapped smartly to all requests J )
While several pundits and are having a great time making fun of the Federal Government and the issues with the Healthcare.gov site, as a former Federal Employe and Contractor for over ten years I can tell you that I worked 50+ hours a week routinely and while there are well noted inefficiencies in the Federal Government some of the smartest people I ever worked with were feds. I am NOT dancing on the sorrows of anyone and I have NO DOUBT that people are busting their asses to make the end user experience as productive as possible. Regardless of how you feel about ObamaCare, we paid for this site to be up and it needs to work as well as possible, if you are involved with this project, I feel ya.
That said, it’s no secret that I am a big fan of wire data and Extrahop and this article is an attempt to promote it and with that, I will go into detail on how I used Extrahop to gain wire data (no agents installed on my workstation, all data was taken directly from the wire) and I will provide info on what could be done if Extrahop were located behind their firewalls and aggregated.
My lab setup: I have a span set up on a Cisco 3550 switch (it’s a lab but it should work on your SDN, Nexus or 6500) that grabs data from the uplink to my firewall. Extrahop has the ability to handle 20GB/s on an appliance and they can Cluster several of them if you want to aggregate several of them and manage them from a single console. For my test, I launched a published desktop within my Citrix farm and signed up for the Obamacare site from behind the span.
While signing up to look at the Healthcare Exchanges I did two things, first Extrahop grabbed the Layer 7 visibility and provided performance metrics on all non-encrypted URI stems. I also used Extrahop Triggers to send the following Events to a Splunk Syslog Server for parsing and reporting.
I will try to keep my suggestions and opinions to a minimum but I do have some suggestions that I will include later. For the most part I just want to report on what Extrahop was able to grab from the wire and either report in the Extrahop Console or Report to Splunk.
Data in the Extrahop Console:
From the Extrahop console I drilled into the Citrix server that I signed up for Healthcare.gov on. From there I am given a menu of options
Layer 4 TCP:
When I first start to troubleshoot an issue, after verifying Layer 1 and Layer 2 integrity (in this case, my lab is in a two post rack two feet from me so I can verify that) I dive into Layer 4 where Extrahop really starts to give you a solid holistic view of your environment. There are two views within the Layer 4 environment, the first is the L4 TCP node. The L4 TCP Node provides a quick holistic view of the Layer 4 bandwidth data both open, established and closed sessions as well as reporting on the Aborted sessions. You also see a graphing of the RoundTrip time.
If an Extrahop appliance were on a spanned port inside the healthcare.gov network, similar metrics could be provided for web farms, back end Database connections and SOAP/RESTful api calls. For this test, I only had the ability to grab the wire data from the client perspective. There would be a much larger breadth of data were the Extrahop appliance located on the healthcare.gov side. Also included in the L4 TCP node view are graphs on inbound/outbound congestion and inbound/outbound throttling.
Layer 4 TCP > Details:
The second node is the Layer 4 Details node. I am a Systems Admin first and a Network Engineer a distant second. While I pride myself on being a generalist, I usually ask for help when looking at Layer 4 details just to make sure I know what I am looking at. I will give you my best effort observation of the L4 TCP > Details node.
Looking below, you see drill-down options on Accepted, Connected, Closed, Expired and Established Sessions. On the In and Out grids I generally look at Resets, Retransmission Timeouts (RTOs), Zero Windows Abords and Dropped Segments. A more experienced Network Engineer may focus on other metrics. Again, if we had an Extrahop Appliance on the inside, we would see the wire data for the actual web server.
As you can see below, when we look at the L4 Detail data we see a much higher number of outbound (From my client to Healthcare.gov) Aborts, Dropped Segments, Resets and Retransmissions. If you had a web farm, you could trigger this data and find a problem node in the group. You can click on any of the linked metrics to drill in to see which hosts are dropping segments, Aborting Connections, etc.
L7 Protocol Node:
The L7 Protocols node provides a holistic view of the protocol utilization during the specified time period as well as the peer devices. From the client perspective, you can see the sites that are providing data either in iframes or the client is sent to directly as a result of healthcare.gov redirecting. You see two charts of incoming and outgoing protocol usage broken down by L7 technology. Also, below you see a list of peer devices, I generally look here as well to see if a CRL or OCSP service is not responding fast enough and delaying my site or if I have an infected iFrame that is sending a user to a rouge site. We will get more into peer performance in the trigger section.
From here you can drill into actual Packets and Throughput per Protocol as well as take an interesting look at Turn Timing (also discussed in the trigger section) where you can see the performance of specific protocols.
Layer 7 Protocols > Turn Timing:
Within the turn timing you can see the Network In (Client to Server) Processing Time (Server Performance or Time waiting to respond) and Network Out (Server Response back to the Client).
If the Appliance were on the inside, this could be very valuable to see if there were back end systems that were not responding. From the Client Perspective, looking at the information below, it appears the servers themselves (processing time) seemed to actually perform relatively well on average (we will get into more detail on the triggers) and we seemed to have issues with the web servers responding back to the client. Keep in mind that this data is JUST from the client to the Healthcare.gov site. It would e considerably more valuable to have information on the performance inside Healthcare.gov.
Layer 7 Protocols > Details:
The details page can also be very valuable, if you are using a specific server for all of your images in your web farm you can take a look at the bandwidth and find out if a specific server has larger images than you have planned on. Also, you can ensure that all of the peer communications are with appropriate IP Addresses. If you are integrating with outside partners, you are only as secure as they are. Sometimes it’s best to periodically verify who your peer nodes are.
DNS: *Disclosure I forgot to use an external DNS Server so in the initial test, my DNS Server was local and therefore did not traverse the span and was not logged in Extrahop. I went back and added an external DNS Server and went back to the site to do some browsing to get these metrics
Few people fully realize the extent to which slow DNS resolution can wreck an application. In the DNS Node you can quickly get a glance at the number of requests and the performance of your DNS Servers as well as drill down into errors and timeouts. If your Web server is consuming a RESTFUL API and the DNS Resolution takes 200ms and the API is called several thousand times a minute, you could see a lot of waiting while using the web app. As previously stated, if I had an Extrahop Appliance inside the healthcare.gov network we could see if the web front end were having trouble resolving any names of the tiered API’s they are consuming.
While the majority of the site is delivered via SSL there are a few actions that are delivered by HTTP, the HTTP Node provides a holistic view of the overall environment. In my case, I was the Client so I would set my Metric Type to Client and look at the data. From there I have drilldowns for Errors, URIs and Referrers, if I were looking at a Healthcare.gov Webserver I would select the Server Metric type and look at the same data. You see below I have Status Codes, methods transaction metrics and transaction rates readily available. If you put the SSL Keys on the Extrahop Appliance (like you would in wireshark) you can also get the Layer 7 performance of Every URI stem that is being delivered via SSL. This could then be used to alert you to slow web applications or downstream API farms where you are consuming web servers from 3rd party partners. I understand that exporting SSL keys is EXTREMELY taboo in the Federal space but I believe you can remove the keys once you have finished troubleshooting.
Due to the data being encrypted there isn’t as much SSL Data as there is with other protocols. When you click on the SSL Node you can ensure that web servers have been configured with FIPS Compliant Ciphers and you can double check key lengths by Clicking on the “Certificates” link. From this menu you will see the Session details, if sessions are being aborted, which versions are being used (root out non-compliant SSLv2/v3 Certs). If I had an appliance on the inside, I would look at the “Aborted” Metric within the “Session Details” area.
Extrahop Trigger Data with Splunk Integration:
My favorite feature of Extrahop is it’s trigger function. Extrahop has the ability to fire off a syslog message, custom metrics or even a pcap compliant capture based on a set of criteria you give it. In the case of Healthcare.gov, they could set a trigger that states, only syslog REST transactions that take longer than 200ms or Alert me when a database transaction occurs for specific tables in a database. Because I am looking at healthcare.gov from a client perspective I can only provide triggers on the Client end but if I had an appliance inside, I could see not only the Client interaction with the site but I could trigger on downstream performance of Databases, IBM Queueing, SOAP/REST calls and Slow DNS Lookups.
As I stated previously, I have triggers on the following. In the case of Healthcare.gov I may also look at the performance of the DNS Servers. Currently, I am only reporting on the DNS failures and not the performance. This can be added in less than a minute.
Let’s examine a few of these triggers and see what we can glean from the information.
Within the TCP Flow Triggers I want to look at the FLOW_TURN data as this gives us a good indication of where potential bottlenecks are and how long a client waited for a response from a server. In the FLOW_TURN trigger I am going to grab the following metrics and map them by average to ServerIP. When I make a request there are a number of potential bottlenecks that need to be monitored on the wire. DNS Performance, the client request, the server processing time and then finally the server response. I can wait on any one or all four of them. Within this first Splunk query I am going to look at the client request performance, server processing time and server response. The triggers I use for FLOW_TURN can be found in the Triggers/Downloads section of the blog.
What am I doing in this Query:
The query below uses a Regular expression to convert the ServerIP into “ip”. The “ip” field can be passed to a reverse lookup function within Splunk that will give me the hostname of the ServerIP field. If you are not a big RegEX person there is plenty of RegEx material you can research. I have always found that if you just hit <Shift> and the Number keys enough times, eventually what you are looking for will come up on the screen…J. If you are not interested in learning RegEx then you can simply copy the query below.
FLOW_TURN | Search ClientIP=”192.168.1.82″ | rex field=_raw “ServerIP=(?<ip>.[^:]+)\sServerPort” | stats count(_time) as Instances avg(TurnReqXfer) avg(TurnRespXfer) avg(tprocess) by ip | lookup dnsLookup ip
Below you see the results of the query above, I have taken the time to sort by the slowest return time and you see that the server odlinks.govedelivery.com had the slowest average response turn time of more than 3 seconds per response. The saving grace is that there were only three instances of it. Far more concerning is 191ms tprocess metric that occurred 676 times.(Please note that it is hitting the Akamai front end, I AM NOT saying Akamai is the bottleneck but there may be a back end server that is causing the slowdown. Again, if I had an appliance on the inside, I could get this metric for each server in the web farm. That said, the nearly 200ms tprocess time is the time that Extrhop observed before the server sent a response packet. This can give you an indication of how long the server took to respond, either due to DNS Resolution (how many DNS Suffixes are in YOUR IP Configuration!?) or just time processing the information.
Once you have the data in Splunk it becomes like “Baseball Stats” where you can get the average Response Transfer time of west coast customers between midnight and six AM on weekdays. The amount of stats you can query is dizzying. From the data below, I can see the average performance by Server.
One other point I want to make about the FLOW_TURN trigger is that it is very valuable in SSL Environments where you cannot get performance metrics because the data is encrypted. While I do not have URI Stems and SOAP/REST calls I do have the basic Layer 4 performance data which can be very valuable in instances where using the Private Key on the appliance is not possible.
*Please note that the data below is what was collected while I was signing up on healthcare.gov. While I was not knowingly doing anything outside of that, some connections may have been made to other sites that are not affiliated with healthcare.gov and may show up in the stats below. There is NO HIDING from wire data and without knowing the application I am not sure who to exclude. Please make a note of it.
Due to the site conducting all but the a few packets in SSL the HTTP data is actually quite lite. I did want to point out what you can get from the HTTP data and show how you can correlate it with a big data back end like Splunk. You can essentially trigger any HTTP Header value as well as the performance of web applications using the HTTP_RESPONSE trigger. Below you are seeing the performance of URI stems for the healthcare.gov site as well as the performance of the CRLs. In instances where environments are “Locked Down” the lack of access to CRL’s and OCSP can have a negative impact on Web Applications. Here we note that there are no performance issues with CRLs or OCSP sites. The only Healthcare.gov URI that we see is the initial site. Everything thereafter was encrypted
Note: With the keys installed on the Extrahop Appliance, you can see the performance of each URI stem and quickly identify web services that are not performing properly.
The Splunk Query: Note I am removing some “noise” from the results. I had bing as my search and I had to go to gmail to verify my login.
|HTTP_* |search ClientIP=”192.168.1.82″ Host!=”mail.google.com” Host!=”gmail.com” Host!=”api.bing.com” | table _time eh_event ServerIP Host HTTP_uri tprocess|
Most of the relevant data was from the SSL_OPEN trigger, the only unique item I was triggering from SSL_RECORD was Key Size, I am not sure you can even get a 1024 bit Key anymore, but all keys were 2048 bit so I will not include SSL_RECORD in this article.
While it does not say much in terms of performance, it is sometimes nice to just make sure those Certificates that people are using are what you expect, especially when you are working with 3rd parties and partners. This will allow you to ensure that all SOAP/RESTful web services are meeting the FIPS encryption standards.
|The Splunk Query: Note we are, once again, using REGEX to parse out the “ip” so that we can perform a reverse lookup.
eh_event=”SSL_OPEN” | rex field=_raw “ServerIP=(?<ip>.[^:]+)\sServerPort”| eval SSL_EXP=strftime(SSL_EXP,”%+”) | stats count(_time) as Instances by ip SSL_VERSION SSL_CIPHER SSL_SUBJECT SSL_EXP | lookup dnsLookup ip
I am certain that there is no end of “Arm Chair” architects offering DHHS advice. Like I said, I am NOT dancing on anyone’s sorrows here. As a former DHHS (CDC) employee myself I know that they are working around the clock to fix any issues users are having. I feel like the first step in that process is to start gathering Operation Intelligence and Extrahop can do that without any impact on the existing Server architecture. As I stated throughout the post, the data I have is Client side and the data they could collect inside the healthcare.gov network would be orders-of-magnitude more valuable.
There are a few VERY important things to note for Feds/Govies (or anyone else) who want to leverage Extrahop’s Wire Data
- It does NOT require any agents, there will be minimal, if any, changes to the incumbent C & A framework and you should only have to get the Appliance approved. This means that you can call them today, get the appliance rack mounted and tie it into your span and start looking at data without doing ANY configuration changes to the servers.
- They have a free Discovery Edition that you can use to perform your own Proof of Concept that is a VM.
- As I stated previously, they can handle up to 20GB/s of data per appliance and they can be clustered so that they are centrally managed as well as aggregated.
- It will integrate with your existing Splunk environment or any other Syslog server that you have in place. I have used it with both KIWI and Splunk.
- It augments existing INFOSEC strategies by allowing real-time access to wire data to find Malware, DNS Cache Poisoning (Pharming) and Session hijacking within seconds.