Monthly Archives: October 2013

Using Extrahop Triggers to Monitor Databases for Leakage and Performance

To continue with the INFOSEC posts, I wanted to demonstrate how you can use Extrahop triggers to monitor your database connectivity and tell when data is actually being stolen from your back end databases. In many cases sensitive data (data that you will get sued, fined, embarrassed or fired over in the event it is compromised) lives on database servers. From an INFOSEC perspective, it is important that back end databases are only accessed by the systems that are supposed to access them. Let's say we have a web tiered application that connects to a CRM database holding all of my company's leads. If I see an IP Address from the users' segment connecting to my back end database, that is something I should be concerned about. Even if my organization has taken steps to layer the network so that only appropriate hosts can talk to one another, you still have the issue of someone potentially compromising one of the web servers and then running queries and stealing data from the compromised web server. To prevent this I need to know what type of SQL traffic is expected (it is actually rather rare to see "select *" in a well written application). Taking stock of the types of queries that are being run against your data, and from whom and where, is an important step in preventing data leakage due to SQL Injection or a trusted box getting compromised.

Outside of INFOSEC you also have the benefit of being able to see which queries are taking the longest time to run. If you have Splunk you can use some RegEx to parse out the performance by table (I will show you a video), which could give you an indication that a table needs indexing. Using triggers you can log and report on the following:

  • Table Performance
  • Processing time by Server
  • Processing time by Client
  • Total queries by Client
  • Total queries by server
  • Processing time by Query (which Queries take the longest time to complete)

Imagine doing an application upgrade or a schema update and being able to go to your stored procedures and see before and after performance without needing to run profiler. All of this data can be collected, parsed and reported on without a single agent being installed and without anyone touching an incumbent system.

The Triggers:
The two triggers that you need for this are located in the Triggers section and they can be copied and pasted into your Extrahop Discovery Edition. Once you have loaded the triggers you can then see the SQL traffic traverse the span and using the Console trigger you can see the data in the Console.
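If you would like a feel for what such a trigger looks like before you download them, below is a minimal sketch modeled on the FLOW_TICK/FLOW_TURN triggers shown later on this page. The DB_RESPONSE event and the DB.statement/DB.tprocess property names are assumptions about the trigger API on your firmware (check the Trigger API documentation); the syslog field names simply match what the Splunk queries later in this post expect.

// Sketch only: assumes this trigger is assigned to the DB_RESPONSE event and that
// DB.statement and DB.tprocess exist on your firmware -- verify in the Trigger API docs.
log("DB Statement " + DB.statement)
RemoteSyslog.info(
" eh_event=EH_DB_TRIGGER" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" ServerPort="+Flow.server.port+
" Statement="+DB.statement+
" ProcessTime="+DB.tprocess
)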

Note the simple Query below:

And you see the same query below: you have the IP Address of who made the query, the actual query and the amount of time it took. (I will show this in the video too)

 

General Punk busting:
In addition to being able to see the overall performance of each SQL query, you will be able to audit exactly what queries have been run against your critical databases and even critical tables. In the graphic below a user (myself) is attempting to select critical data from a fictional table called CreditCardData. Note the time I ran the query (the Splunk server is not synced up with my AD domain, so I am off by a few seconds).

What I look for in the results:
The first thing I note is that we see the query for PII running and I see the IP Address. The important thing to ask yourself is, "Does that IP Address look right? Is that my front-end E-Commerce server verifying payment information or is that some clown on the network?" The next thing I ask myself is, "Does that query look like something that was compiled into a stored procedure or written into the application, or is this someone who has compromised a trusted server and is running ad hoc queries?"

SPLUNK QUERY: (Note the RegEx to parse out the Statement)
EH_DB_TRIGGER | rex field=_raw "Statement=(?<STMT>.[^:]+)\sProcessTime" | table _time ClientIP STMT

Another way to keep track of exactly who is accessing your SQL Servers is to keep an average ProcessTime by ClientIP and ServerIP.

What I look for in the results:
Question 1: Are all of the IPs appropriate for SQL queries?
Question 2: What is 192.168.1.98 doing to 192.168.1.205 that its queries are taking several times longer to process?
Question 3: What is 192.168.1.205? Is that a proper database or has someone gone rogue?

SPLUNK QUERY:
EH_DB_TRIGGER | stats avg(ProcessTime) by ClientIP ServerIP

 


So let's say we suspect 192.168.1.98 of possible malfeasance; I can now query the Extrahop data for every query that client has run for the last 24 hours. What we note from the query below is that this particular IP address has been engaging in some very undesirable behavior, and by the time you have finished your tirade of obscenities you can call security and have them deliver him or her a cardboard box and escort them out of the building. Either way, you have adequate digital evidence for termination or, if needed, prosecution, as the log itself is fully intact on the Syslog server.

SPLUNK QUERY:
EH_DB_TRIGGER | search ClientIP="192.168.1.98" | rex field=_raw "Statement=(?<STMT>.[^:]+)\sProcessTime" | stats count(STMT) by ClientIP ServerIP STMT

 

Conclusion:
In the past, to get this type of data I have had to run the very invasive SQL Profiler. That tool can take up to 20% of your server's resources and you cannot run it on a long term basis. Using Extrahop's wire data you are able to collect all of this information (I have cross referenced it to SQL Profiler and in all cases the metrics were EXACTLY the same) and get access to very meaningful SQL data without impacting any systems. As always, this is completely agentless and requires no reconfiguration of any SQL Server or any client accessing the server. If you add 6 web servers to your web farm to accommodate extra front end capacity, you don't have to worry about installing more agents; if they have an IP Address, you will see the data.

More importantly, while we have tools to detect viruses, malware and spyware, they will not defend against a malicious employee or a trusted system that has become compromised. As part of the Human Algorithm, periodically inspecting the behavior of critical systems that house sensitive data is very important and should be part of the overall INFOSEC strategy. Extrahop has better videos/documentation on monitoring SQL performance, but when used with triggers you can easily compare SQL performance and see changes (better or worse) as they happen in real-time. If you have ever been a Sys Admin who has taken a beating because a table needed indexing, you know what I mean. Let's be honest, they go after the systems folks first, network folks second, then they look at the software. I am not indicting developers; there just hasn't been a great deal of visibility until now. At my previous employer, we shared Extrahop metrics with Systems, Network, INFOSEC AND Developers. It is better to know that application slowness is due to a single server in a back end database cluster before you double the amount of RAM, add more spindles/IOPS and upgrade the switch.

Also, please note that while this article uses SQL Server in the examples, Extrahop supports the major DB vendors (DB2, Oracle and MySQL) as well.

Thanks for reading

John

The “Arm-chair Architect”: Healthcare Dot Gov

Making the news the last few weeks have been the problems associated with the Healthcare.gov website. Being interested in the Healthcare Exchanges and wanting to see what might be available, I decided to sign up. While doing so I connected to healthcare.gov from my lab so that I could see the results in Extrahop and get some wire data of the experience.

My Experience:
There were definitely some areas that were slower than others, but luckily I signed in during the AM on the east coast, so I am guessing the site was not extensively busy at the time. Currently I am waiting to hear back on my eligibility; they sent me a document to download, but I am unable to download it as it times out repeatedly. Outside of that, the initial sign up took about 15 minutes, and while there were some slowdowns, it was not so bad compared to other municipal sites (save those I hosted as a Federal Employee at the CDC, which snapped smartly to all requests).

While several pundits are having a great time making fun of the Federal Government and the issues with the Healthcare.gov site, as a former Federal Employee and Contractor for over ten years I can tell you that I routinely worked 50+ hours a week, and while there are well noted inefficiencies in the Federal Government, some of the smartest people I ever worked with were feds. I am NOT dancing on the sorrows of anyone and I have NO DOUBT that people are busting their asses to make the end user experience as productive as possible. Regardless of how you feel about ObamaCare, we paid for this site and it needs to work as well as possible; if you are involved with this project, I feel ya.

That said, it's no secret that I am a big fan of wire data and Extrahop, and this article is an attempt to promote it. With that, I will go into detail on how I used Extrahop to gain wire data (no agents installed on my workstation; all data was taken directly from the wire) and I will provide info on what could be done if Extrahop were located behind their firewalls and aggregated.

My lab setup: I have a span set up on a Cisco 3550 switch (it's a lab, but this should work on your SDN, Nexus or 6500) that grabs data from the uplink to my firewall. Extrahop can handle 20Gb/s on an appliance, and you can cluster several of them if you want to aggregate them and manage them from a single console. For my test, I launched a published desktop within my Citrix farm and signed up for the Obamacare site from behind the span.

While signing up to look at the Healthcare Exchanges I did two things: first, Extrahop grabbed the Layer 7 visibility and provided performance metrics on all non-encrypted URI stems; second, I used Extrahop Triggers to send the following events to a Splunk syslog server for parsing and reporting.

  • FLOW_TICK
  • FLOW_TURN
  • HTTP_REQUEST
  • HTTP_RESPONSE
  • SSL_OPEN
  • SSL_RECORD

Findings:

I will try to keep my suggestions and opinions to a minimum but I do have some suggestions that I will include later. For the most part I just want to report on what Extrahop was able to grab from the wire and either report in the Extrahop Console or Report to Splunk.

Data in the Extrahop Console:
From the Extrahop console I drilled into the Citrix server that I signed up for Healthcare.gov on. From there I am given a menu of options

Layer 4 TCP:
When I first start to troubleshoot an issue, after verifying Layer 1 and Layer 2 integrity (in this case, my lab is in a two post rack two feet from me, so I can verify that), I dive into Layer 4, where Extrahop really starts to give you a solid holistic view of your environment. There are two views within the Layer 4 environment; the first is the L4 TCP node. The L4 TCP node provides a quick holistic view of the Layer 4 bandwidth data for open, established and closed sessions, as well as reporting on aborted sessions. You also see a graph of the Round Trip Time.

If an Extrahop appliance were on a spanned port inside the healthcare.gov network, similar metrics could be provided for web farms, back end Database connections and SOAP/RESTful api calls. For this test, I only had the ability to grab the wire data from the client perspective. There would be a much larger breadth of data were the Extrahop appliance located on the healthcare.gov side. Also included in the L4 TCP node view are graphs on inbound/outbound congestion and inbound/outbound throttling.

Layer 4 TCP > Details:
The second node is the Layer 4 Details node.  I am a Systems Admin first and a Network Engineer a distant second. While I pride myself on being a generalist, I usually ask for help when looking at Layer 4 details just to make sure I know what I am looking at. I will give you my best effort observation of the L4 TCP > Details node.

Looking below, you see drill-down options on Accepted, Connected, Closed, Expired and Established Sessions. On the In and Out grids I generally look at Resets, Retransmission Timeouts (RTOs), Zero Windows, Aborts and Dropped Segments. A more experienced Network Engineer may focus on other metrics. Again, if we had an Extrahop Appliance on the inside, we would see the wire data for the actual web server.

As you can see below, when we look at the L4 Detail data we see a much higher number of outbound (From my client to Healthcare.gov) Aborts, Dropped Segments, Resets and Retransmissions. If you had a web farm, you could trigger this data and find a problem node in the group. You can click on any of the linked metrics to drill in to see which hosts are dropping segments, Aborting Connections, etc.

L7 Protocol Node:
The L7 Protocols node provides a holistic view of the protocol utilization during the specified time period as well as the peer devices. From the client perspective, you can see the sites that are providing data either in iframes or that the client is sent to directly as a result of healthcare.gov redirecting. You see two charts of incoming and outgoing protocol usage broken down by L7 technology. Below that you see a list of peer devices; I generally look here as well to see if a CRL or OCSP service is not responding fast enough and delaying my site, or if I have an infected iframe that is sending a user to a rogue site. We will get more into peer performance in the trigger section.

From here you can drill into actual Packets and Throughput per Protocol as well as take an interesting look at Turn Timing (also discussed in the trigger section) where you can see the performance of specific protocols.

Layer 7 Protocols > Turn Timing:
Within the turn timing you can see Network In (client to server), Processing Time (server performance, or time spent waiting to respond) and Network Out (the server response back to the client).

If the appliance were on the inside, this could be very valuable for seeing whether there were back end systems that were not responding. From the client perspective, looking at the information below, the servers themselves (processing time) seemed to actually perform relatively well on average (we will get into more detail in the triggers), and we seemed to have issues with the web servers responding back to the client. Keep in mind that this data is JUST from the client to the Healthcare.gov site. It would be considerably more valuable to have information on the performance inside Healthcare.gov.

Layer 7 Protocols > Details:
The details page can also be very valuable, if you are using a specific server for all of your images in your web farm you can take a look at the bandwidth and find out if a specific server has larger images than you have planned on. Also, you can ensure that all of the peer communications are with appropriate IP Addresses. If you are integrating with outside partners, you are only as secure as they are. Sometimes it’s best to periodically verify who your peer nodes are.


DNS: *Disclosure: I forgot to use an external DNS Server, so in the initial test my DNS Server was local and therefore did not traverse the span and was not logged in Extrahop. I went back, added an external DNS Server and returned to the site to do some browsing to get these metrics.

Few people fully realize the extent to which slow DNS resolution can wreck an application. In the DNS node you can quickly get a glance at the number of requests and the performance of your DNS Servers, as well as drill down into errors and timeouts. If your web server is consuming a RESTful API, the DNS resolution takes 200ms, and the API is called several thousand times a minute, you could see a lot of waiting while using the web app. As previously stated, if I had an Extrahop Appliance inside the healthcare.gov network we could see if the web front end were having trouble resolving any names of the tiered APIs they are consuming.


HTTP:
While the majority of the site is delivered via SSL, there are a few actions that are delivered over HTTP; the HTTP node provides a holistic view of the overall environment. In my case, I was the client, so I set my Metric Type to Client and looked at the data. From there I have drilldowns for Errors, URIs and Referrers; if I were looking at a Healthcare.gov web server I would select the Server Metric type and look at the same data. You see below I have Status Codes, methods, transaction metrics and transaction rates readily available. If you put the SSL keys on the Extrahop Appliance (like you would in Wireshark) you can also get the Layer 7 performance of every URI stem that is being delivered via SSL. This could then be used to alert you to slow web applications or downstream API farms where you are consuming web services from 3rd party partners. I understand that exporting SSL keys is EXTREMELY taboo in the Federal space, but I believe you can remove the keys once you have finished troubleshooting.

SSL:
Due to the data being encrypted there isn't as much SSL data as there is with other protocols. When you click on the SSL node you can ensure that web servers have been configured with FIPS compliant ciphers, and you can double check key lengths by clicking on the "Certificates" link. From this menu you will see the session details: whether sessions are being aborted and which versions are being used (root out non-compliant SSLv2/v3 certs). If I had an appliance on the inside, I would look at the "Aborted" metric within the "Session Details" area.

Extrahop Trigger Data with Splunk Integration:
My favorite feature of Extrahop is its trigger function. Extrahop has the ability to fire off a syslog message, custom metrics or even a pcap-compliant capture based on a set of criteria you give it. In the case of Healthcare.gov, they could set a trigger that says, only syslog REST transactions that take longer than 200ms, or alert me when a database transaction occurs against specific tables in a database. Because I am looking at healthcare.gov from a client perspective I can only provide triggers on the client end, but if I had an appliance inside, I could see not only the client interaction with the site but also trigger on downstream performance of databases, IBM Queueing, SOAP/REST calls and slow DNS lookups.
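To make that first example concrete, here is a minimal sketch of what such a gated trigger could look like on the HTTP_RESPONSE event. The 200ms threshold is arbitrary, and HTTP.tprocess and HTTP.uri are my assumptions about the property names the trigger API exposes, so verify them against the Trigger API documentation.

// Sketch only: syslog HTTP transactions slower than 200ms (threshold is arbitrary).
// HTTP.tprocess and HTTP.uri are assumed property names -- verify in the Trigger API docs.
if (HTTP.tprocess > 200) {
RemoteSyslog.info(
" eh_event=SLOW_HTTP" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" HTTP_uri="+HTTP.uri+
" tprocess="+HTTP.tprocess
)
}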

As I stated previously, I have triggers on the following. In the case of Healthcare.gov I may also look at the performance of the DNS Servers. Currently, I am only reporting on the DNS failures and not the performance. This can be added in less than a minute.

  • FLOW_TICK/FLOW_TURN
  • HTTP_REQUEST/HTTP_RESPONSE
  • SSL_OPEN/SSL_RECORD

Let’s examine a few of these triggers and see what we can glean from the information.

FLOW data:
Within the TCP Flow triggers I want to look at the FLOW_TURN data, as this gives us a good indication of where potential bottlenecks are and how long a client waited for a response from a server. In the FLOW_TURN trigger I am going to grab the following metrics and map them by average to ServerIP. When I make a request there are a number of potential bottlenecks that need to be monitored on the wire: DNS performance, the client request, the server processing time and finally the server response. I can wait on any one or all four of them. Within this first Splunk query I am going to look at the client request performance, server processing time and server response. The triggers I use for FLOW_TURN can be found in the Triggers/Downloads section of the blog.

What am I doing in this Query:
The query below uses a regular expression to extract the ServerIP into a field called "ip". The "ip" field can be passed to a reverse lookup function within Splunk that will give me the hostname for the ServerIP field. If you are not a big RegEx person there is plenty of RegEx material you can research. I have always found that if you just hit <Shift> and the number keys enough times, eventually what you are looking for will come up on the screen. If you are not interested in learning RegEx then you can simply copy the query below.

Splunk Query:
FLOW_TURN | search ClientIP="192.168.1.82" | rex field=_raw "ServerIP=(?<ip>.[^:]+)\sServerPort" | stats count(_time) as Instances avg(TurnReqXfer) avg(TurnRespXfer) avg(tprocess) by ip | lookup dnsLookup ip

Below you see the results of the query above. I have taken the time to sort by the slowest return time, and you see that the server odlinks.govedelivery.com had the slowest average response turn time at more than 3 seconds per response. The saving grace is that there were only three instances of it. Far more concerning is the 191ms tprocess metric that occurred 676 times. (Please note that it is hitting the Akamai front end; I AM NOT saying Akamai is the bottleneck, but there may be a back end server that is causing the slowdown. Again, if I had an appliance on the inside, I could get this metric for each server in the web farm.) That said, the nearly 200ms tprocess time is the time that Extrahop observed before the server sent a response packet. This can give you an indication of how long the server took to respond, either due to DNS resolution (how many DNS suffixes are in YOUR IP configuration!?) or just time processing the information.

Once you have the data in Splunk it becomes like “Baseball Stats” where you can get the average Response Transfer time of west coast customers between midnight and six AM on weekdays. The amount of stats you can query is dizzying. From the data below, I can see the average performance by Server.

One other point I want to make about the FLOW_TURN trigger is that it is very valuable in SSL Environments where you cannot get performance metrics because the data is encrypted. While I do not have URI Stems and SOAP/REST calls I do have the basic Layer 4 performance data which can be very valuable in instances where using the Private Key on the appliance is not possible.

*Please note that the data below is what was collected while I was signing up on healthcare.gov. While I was not knowingly doing anything outside of that, some connections may have been made to other sites that are not affiliated with healthcare.gov and may show up in the stats below. There is NO HIDING from wire data and without knowing the application I am not sure who to exclude. Please make a note of it.

HTTP_REQUEST/HTTP_RESPONSE:
Due to the site conducting all but a few packets over SSL, the HTTP data is actually quite light. I did want to point out what you can get from the HTTP data and show how you can correlate it with a big data back end like Splunk. You can essentially trigger on any HTTP header value as well as the performance of web applications using the HTTP_RESPONSE trigger. Below you are seeing the performance of URI stems for the healthcare.gov site as well as the performance of the CRLs. In instances where environments are "locked down", the lack of access to CRLs and OCSP can have a negative impact on web applications. Here we note that there are no performance issues with CRL or OCSP sites. The only Healthcare.gov URI that we see is the initial site; everything thereafter was encrypted.
Note: With the keys installed on the Extrahop Appliance, you can see the performance of each URI stem and quickly identify web services that are not performing properly.
The Splunk Query:
Note I am removing some "noise" from the results. I had Bing as my search provider and I had to go to Gmail to verify my login.

HTTP_* | search ClientIP="192.168.1.82" Host!="mail.google.com" Host!="gmail.com" Host!="api.bing.com" | table _time eh_event ServerIP Host HTTP_uri tprocess


SSL_OPEN/SSL_RECORD:
Most of the relevant data came from the SSL_OPEN trigger; the only unique item I was triggering on from SSL_RECORD was key size. I am not sure you can even get a 1024 bit key anymore, but all keys were 2048 bit, so I will not include SSL_RECORD in this article.

While it does not say much in terms of performance, it is sometimes nice to just make sure those Certificates that people are using are what you expect, especially when you are working with 3rd parties and partners. This will allow you to ensure that all SOAP/RESTful web services are meeting the FIPS encryption standards.

The Splunk Query: Note we are, once again, using REGEX to parse out the “ip” so that we can perform a reverse lookup.
eh_event="SSL_OPEN" | rex field=_raw "ServerIP=(?<ip>.[^:]+)\sServerPort" | eval SSL_EXP=strftime(SSL_EXP,"%+") | stats count(_time) as Instances by ip SSL_VERSION SSL_CIPHER SSL_SUBJECT SSL_EXP | lookup dnsLookup ip

Conclusion:
I am certain that there is no end of "Arm Chair" architects offering DHHS advice. Like I said, I am NOT dancing on anyone's sorrows here. As a former DHHS (CDC) employee myself, I know that they are working around the clock to fix any issues users are having. I feel like the first step in that process is to start gathering Operational Intelligence, and Extrahop can do that without any impact on the existing server architecture. As I stated throughout the post, the data I have is client side, and the data they could collect inside the healthcare.gov network would be orders of magnitude more valuable.

There are a few VERY important things to note for Feds/Govies (or anyone else) who want to leverage Extrahop’s Wire Data

  • It does NOT require any agents; there will be minimal, if any, changes to the incumbent C & A framework and you should only have to get the appliance approved. This means that you can call them today, get the appliance rack mounted, tie it into your span and start looking at data without making ANY configuration changes to the servers.
  • They have a free Discovery Edition, a VM that you can use to perform your own Proof of Concept.
  • As I stated previously, they can handle up to 20Gb/s of data per appliance, and appliances can be clustered so that they are centrally managed as well as aggregated.
  • It will integrate with your existing Splunk environment or any other Syslog server that you have in place. I have used it with both KIWI and Splunk.
  • It augments existing INFOSEC strategies by allowing real-time access to wire data to find Malware, DNS Cache Poisoning (Pharming) and Session hijacking within seconds.

You can check out the Discovery Edition here: (Please tell them John@wiredata.net sent you)
http://www.extrahop.com/discovery/
Thanks for Reading

John

The Human Algorithm

Ever feel like this guy? (Above.) Today, breaches are becoming far more complicated than just a virus; they involve practices like social engineering and phishing as well as the old, tried and true injection based hacks like Cross Site Scripting and SQL Injection. As these breaches become more sophisticated, the tools we use to combat them need to evolve. Additionally, we need to come to grips with the idea that we are not going to stop every single breach. Malware, viruses, social engineering and inadvertent clicks on malicious links are becoming like the common cold; we need to approach breaches with the idea of when and not if. The fact that I have a local police/fire department does not mean that I don't lock my doors, look through the window to see who it is before I open the door, or install smoke alarms. Sadly, while INFOSEC as a discipline has evolved over the last 15 years, I fear it has made those of us who are responsible for hosting systems a bit lazy and dependent on the existing INFOSEC apparatus, and we have stopped thinking about our own security. To expect the INFOSEC group in your organization to take responsibility for your systems' security is not entirely different than me going to the mayor in my city and asking that a policeman and a fireman be present in my house at all times; it's just not very practical. For that reason, I need to take some responsibility for securing my home and property and use police and fire for emergencies. In 2013, Schnucks had customer credit card numbers and expiration dates siphoned out of their network for a period of nearly four months, according to Network World. Now, I don't have any inside knowledge of the Schnucks breach and, as with most breaches, they are very tight lipped about how they were compromised, but if it were a case of APT or malware, the data would have to get FROM their systems TO someone else's system.

Using Extrahop you can grab data from the wire and report on egress traffic and identify anomalies or things that just don't belong. But to do this there needs to be a paradigm shift in how INFOSEC operates: instead of just focusing on preventing breaches, how about focusing on quickly determining if you are breached? It is estimated that between 30 and 70 percent of malware goes initially undetected by anti-virus software. Like the poor guy in the graphic above, anti-virus companies do a lot of whack-a-mole, and in the absence of a crystal ball, I am not sure what else they are supposed to do. If we are going to depend solely on shrink-wrapped products to protect our digital intellectual property we might as well also call Miss Cleo's Psychic hotline once a day to find out if we are breached. At the end of the day, we need a resurgence of the human being, and in most cases the best candidate for this is... well... you. You should have a full understanding of which systems have PII, financial or PHI data on them and can more easily identify the pattern behavior that fits and does not fit. For example, if you see packets from your SQL server housing patient data making connections to Belarus, that MIGHT be something out of the ordinary. The simple act of having a report sent to you once or twice a day, or providing parsed real-time data to your operations staff and teaching them how to interpret it (the way insurance companies have been training fraud investigators for 20 years), can help folks spot things that don't make sense or don't look right.

Anyone who watched Sesame Street as a kid understands how to spot an anomaly.

Some of these packets is doin’ their own thing!
For the Gen-Xers who remember Sesame Street: once you have data and are able to collate it and logically represent it, finding problems, breaches and malware becomes much easier. If you look below, you see that three kids are playing baseball and one kid is playing football. If you look to the right, you can see that we have a number of packets going to Beijing. My lab is located in the US and there is no reason anyone should be visiting Beijing websites (unless you are writing a blog and need some data). I am not trying to insult anyone's intelligence here; I am simply pointing out how easy this is to do with Extrahop and Splunk. I did not change a single configuration on any servers to increase the Apache log level, I did not have to put my ASA in debug mode and I did not have to install any agents. Extrahop grabbed all of this right off the wire and handed it off to Splunk for parsing and geocoding.

 


In today's post I want to talk about four Extrahop triggers that can help you take part in being your own blue team and taking responsibility for the security of your servers.

FLOW_TICK
SSL_OPEN
HTTP_REQUEST
HTTP_RESPONSE

FLOW_TICK: Incoming Data
First let's look at our incoming data. Using the FLOW_TICK feature of Extrahop I can see incoming sessions by external client, internal server and port. Integrating it with Splunk allows you to take it even further by geocoding the IP information and performing reverse lookups, letting you see where connections are coming from and what their DNS names look like.

What I look for/observe from this data:
I have always felt that if an IP Address does not have a DNS record then there are a couple of possibilities: they are either up to no good, OR their ISP does not properly set up its DNS, which makes me wonder if they are paying attention to what their subscribers are doing. At any rate, no reverse lookup is always a red flag for me.

Next I look at connections from the Russian Federation and China as those likely stuck out to you as well. This is a home lab and those addresses are obviously performing recon as there is utterly no reason whatsoever anyone in China, Japan or Russia would have any desire to look at my home lab.

The last problematic entry is obviously the Shodan queries. InfoSec practitioners are probably chuckling at seeing it, but singlehop2.shodanhq.com (Shodan) is basically an intelligence gathering service that posts your open ports on a google-like website where hackers can go and check for them. You do not want your systems on Shodan. You can email them and ask them to stop performing recon on your IPs; it won't be the first time I have done it.

Splunk Query:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ClientIP=(?<ip>.[^:]+)\sServerIP" | geoip ip | stats count(_time) as Instances by ServerIP ip ServerPort ip_city ip_region_name ip_country_name | lookup dnsLookup ip | search ip_city!=""

A Quick Walk-thru of monitoring ClientIP with Extrahop:

What the data looks like in Splunk:
BYOBT

 

FLOW_TICK: Outgoing Data:

Perhaps even more important, in the case of Schnucks, is the monitoring of EGRESS traffic, which Extrahop can do at gigabits per second. This provides me visibility into where traffic is going. There are numerous free CSV files with blacklists of malware sites, and the existing security practitioners within your company likely know where to find them and can set up a lookup table for you in Splunk if one does not already exist. Basically, Extrahop, using the same FLOW_TICK trigger previously mentioned, can log the outgoing traffic, giving me the ClientIP, ServerIP and port, while Splunk provides geographic information (City, State/Region, Country) and, using the same reverse lookup, the DNS host name as well.

What I look/observe for from this data:
First, you want to make sure you understand EVERY SINGLE SYSTEM that has critical, private, financial or any other type of digital intellectual property. From there, you want to note the egress patterns of those systems and see if they are making any external connections. The first thing I would ask myself is, outside of patches and updates, why the hell would any system I have that holds sensitive data EVER make an outgoing connection? By outgoing I mean outside the local intranet; however, that does not mean you should not look for someone copying data internally THEN taking it out on their laptop. The second thing to look for is whether someone has fallen victim to a phishing scam and clicked a link that goes outside your organization to a country known for state sponsored cybercrime, or to a site that just doesn't look right.

Below you see a series of connections to Google; here is where you may see an abundance of data under the "Instances" column that does not make sense. If you see a high number of packets to a host/IP that doesn't have anything to do with your business, then you need to run it down and find out what is being sent to, or downloaded from, it.

Splunk Query:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ServerIP=(?<ip>.[^:]+)\sServerPort" | geoip ip | stats count(_time) as Instances by ClientIP ip ServerPort ip_city ip_region_name ip_country_name | lookup dnsLookup ip | search ip_city!=""

A Quick Walk-thru of monitoring ServerIP with Splunk/Extrahop:

What the data looks like in Splunk:

SSL_OPEN:
Using the Extrahop triggers you can get a real-time view of all SSL connections made within your network and outside of your network. Below you see a list of servers that are accessed from inside the LAN. You see the SSL version (it is a bit cryptic, but 769 is TLS 1.0, 770 is TLS 1.1 and 771 is TLS 1.2). As you can see in the screenshot, we have the server IP, the SSL version, and we can also see the cipher.

What to look/observe for in this data:

This will help in identifying sites with weak ciphers as well as help with enforcing cipher standards internally. You can query for KEYSIZE as well (it's in the trigger), but I did not include that, as I don't think you can even request a 1024 bit certificate anymore, can you? Also visible below is the expiration date of the SSL certs, alerting you if someone is using an outdated certificate or if your own certificates are about to expire. As a lot of you are aware, most folks click right through the cert warnings and go straight to the site.
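For reference, the query below consumes syslog fields from an SSL_OPEN trigger roughly like the sketch that follows. SSL.version and SSL.cipher are the property names I believe the event exposes, and the certificate subject/expiration properties in particular are assumptions, so confirm the exact names in the Trigger API documentation before relying on this.

// Sketch only: SSL.version/SSL.cipher and especially the certificate subject and
// expiration property names are ASSUMED -- confirm them in the Trigger API docs.
RemoteSyslog.info(
" eh_event=SSL_OPEN" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" ServerPort="+Flow.server.port+
" SSL_VERSION="+SSL.version+              // 769 = TLS 1.0, 770 = TLS 1.1, 771 = TLS 1.2
" SSL_CIPHER="+SSL.cipher+
" SSL_SUBJECT="+SSL.certificateSubject+   // assumed property name
" SSL_EXP="+SSL.certificateExpiration     // assumed; epoch seconds for strftime()
)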

Splunk Query:
eh_event="SSL_OPEN" | eval SSL_EXP=strftime(SSL_EXP,"%+") | table ServerIP SSL_VERSION SSL_CIPHER SSL_SUBJECT SSL_EXP

HTTP_REQUEST/HTTP_RESPONSE:
I will be the first to admit I am not in the INFOSEC-proper mold. I worked for a year doing event correlation in the early 2000s, but as far as noticing what an HTML injection looks like, I am really not that guy. BUT, what I can do is get an idea of what my HTTP traffic should look like, and I can tell if someone is injecting new header values or issuing an unauthorized redirect.

What to look/observe for in the data:
The Extrahop HTTP triggers can give you a broad level of visibility into your HTTP traffic, alerting you to potentially compromised websites or even catching malware that is trying to sneak out over HTTP. I can also see, in real time, if a cookie is assigned to more than one IP Address and note if I have a session hijacking issue going on. I look at URI stems, and within Splunk you could likely match them up with known malicious code. I look for a high amount of traffic on odd ports (trying to sneak out). I look for User-Agents like "Python" and other values that could indicate someone using Metasploit or some other hacking tool, as most of your User-Agents should be of the "Mozilla" variety. As we have done previously, I might geocode the data to see where the users are actually going, or I might also include the HOST header value. I highly recommend reading the Trigger API documentation for Extrahop because you can trigger on a lot more than I can speak to with any level of expertise. If you are an INFOSEC practitioner and can comment on what to look for, please do.
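The two queries below consume fields from an HTTP_REQUEST trigger along the lines of the following sketch. The property names on the right (HTTP.uri, HTTP.query, HTTP.userAgent and the cookie value in particular) are my assumptions about the trigger API and should be checked against the Trigger API documentation; the field names on the left are simply what the Splunk queries expect.

// Sketch only: HTTP.uri, HTTP.query, HTTP.userAgent and HTTP.cookie are ASSUMED
// property names -- confirm the real ones in the Trigger API documentation.
RemoteSyslog.info(
" eh_event=HTTP_REQ" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" ServerPort="+Flow.server.port+
" HTTP_uri="+HTTP.uri+
" HTTP_query="+HTTP.query+
" User_Agent="+HTTP.userAgent+
" CookieID="+HTTP.cookie
)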

Splunk Query: (the table was too big to include the GEO data)
HTTP_REQ | geoip ServerIP | table ClientIP ServerIP ServerPort Payload HTTP_uri HTTP_query User_Agent CookieID ServerIP_country_name | search ServerIP_country_name!=""

Below is some of the traffic I generated to China (purposely)


HTTP_REQUEST : Cookie Watching
Another nice tool is to run a query that counts the total number of IP Addresses using each CookieID. If you EVER see more than one IP using the same Cookie ID you need to alert your INFOSEC team post haste! This is a very strong indication that someone is hijacking sessions.

Catching Hijacked PHP Sessions in seconds with Extrahop/Splunk:

What I look/observe for in the data:
As I stated, you want to check and make sure no cookies are in use by more than one IP Address. Remember, this is real time; you don't have to go back through Apache/IIS logs to look at it after the fact, it will be evident in real-time. If you have an Operations Center, this is one of the items I would alert on or have readily available to watch.

Splunk Query:
HTTP_REQUEST | stats distinct_count(ClientIP) by CookieID

Conclusion:
As I stated, I did not have to make any configuration changes to any systems (outside of a port span on the switch) and I did not have to install any agents to get this data in real time. In light of some of the breaches I have looked at this year, one common theme has been that there didn't seem to be anyone watching the door. No matter what malfeasance is going on, to steal information it is going to HAVE to come across the wire (outside of blatant hardware theft). If it happens on the wire, Extrahop positions you to see it within seconds of it happening and take immediate steps to mitigate within minutes instead of days, weeks and, in some cases, months. For me, there are two types of systems: compromised, and "about-to-get" compromised. I am not saying that we should abandon existing preventative measures and policies, as we are bound by the regulatory framework for our existing verticals, but I think the time has come for hosting operations to start taking some role in their own security. We are already seeing underwriters balk at paying for breaches due to what they perceive as inadequate steps to protect.

When I purchased my homeowners insurance in Florida (any Floridian knows we are the "whipping post" of the homeowners insurance industry) they actually denied my policy because I did not have an arm-rail on my back steps. After complaining profusely I agreed to have an arm-rail installed (the great irony being that, while moving in, I fell ass-over-tea-kettle due to the lack of the very arm-rail my agent was requiring). The point is, as insurance carriers and umbrella policy writers become more technically savvy, just like with my house, and just as the banks have had to do more fraud investigation with credit cards, expect them to start demanding a more proactive approach. Algorithm-based information security is just a part of keeping your intellectual property secure and likely won't be enough for regulators and umbrella policy writers in the future.

In the case of Schnucks, Liberty Mutual (Schnucks' umbrella policy writer) is already balking at paying for the breach, stating that it is not "property damage". Sony is in a similar predicament with Zurich over the PlayStation Network breach. In the end, corporations are going to want breaches covered under their umbrella policies, or there will be supplemental policies for data breaches, and it is going to take the insurance industry about 6 seconds to start requiring non-algorithm based security and a more proactive approach. As the threat landscape changes, the tools and methods you use need to change, but expect the regulatory framework to change as well. Having logs that you consult days or weeks after a breach is not proactive enough to protect intellectual property; you need to be able to provide a precise narrative to a person who can interpret it. You cannot get more proactive than grabbing the data right off the wire, and Extrahop appliances can support up to 20 Gbps and can be clustered to support even more.

My next post will also be INFOSEC based covering DNS, Databases and CIFS.

Thanks for reading, I hope you enjoy the new site.

John M. Smith

 To read about Extrahop’s Security position check out:
http://www.extrahop.com/solutions/by-project/security-audit-compliance/

To download a free Discovery Edition of Extrahop  follow the link below:
http://www.extrahop.com/discovery/

If you have ANY questions about how to set this up, don’t hesitate to reach out to me at jmsazboy@wiredata.net

 

 

Let the finger pointing BEGIN! (..and end) Canary Herding With Extrahop FLOW_TURN

In IT, dependable metrics become our canary in a coal mine; we use them as indicators of issues. Miners with a dead canary didn't know exactly how much gas they had been exposed to or exactly how bad it was, but they knew they needed to get the hell out of there. In the world of operational intelligence, we can use metrics as indicators of which parts of the proverbial shaft are having issues and need to be adjusted, sealed off or abandoned altogether. To continue in the same vein as my previous post, I wanted to discuss the benefits of the FLOW_TURN trigger when you are trying to get a baseline of the performance of specific servers and transactions and you don't want to drill into Layer 7 data so much as check the Layer 4 performance between two hosts. The FLOW_TURN trigger allows you to take the next step in Layer 4 flow metrics by looking at the following:

Request Transfer: Time it took for the client to make the request
Response Transfer: Time it took for the server to respond
Request Bytes: Size of the Request
Response Bytes: Size of the Response
Transaction Process Time: The time it took for the transaction to complete. You may have a fast network with acceptable request and response times but you may note serious tprocess times which could indicate the kind of server delay we discussed in some of the Edgesight posts.

In today’s Virtualized environment you may see things like:

  • A four port NIC with a 4x1Gb port channel plugged into a 133MHz bus
  • 20 or more VMs sharing a 1GB Port Channel
  • Backups and volume mirroring going on over the production network.

These are things that may manifest themselves as slowness of the application or slow responses from your clients or servers. What the FLOW_TURN metric gives you is the ability to see the basic transport speeds of the client and server as well as the process time of the transaction. Setting up a trigger to harvest this data will lay the foundation for quality historical data on the baseline performance of specific servers during specific times of the day. The trigger itself is a few lines of code.

log("ProcTime " + Turn.tprocess)
RemoteSyslog.info(
" eh_event=FLOW_TURN" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" ServerPort="+Flow.server.port+
" ServerName="+Flow.server.device.dnsNames[0]+
" TurnReqXfer="+Turn.reqXfer+
" TurnRespXfer="+Turn.rspXfer+
" tprocess="+Turn.tprocess
)

Then you assign the trigger to the specific servers that you want to monitor (if you are using the Developer Edition of Extrahop in a home lab, just assign it to all) and you will start collecting metrics. In my case I am using Splunk to collect Extrahop metrics, as it is the standard for big data archiving and fast queries. Below you see the results of the following query:
sourcetype="Syslog" FLOW_TURN | stats count(_time) as Total_Sessions avg(tprocess) avg(TurnReqXfer) avg(TurnRespXfer) by ClientIP ServerIP ServerPort

This will produce a grid view like the one below:
Note in this grid below you see the client/server and port as well as the total sessions. With that you then see the Transfer metrics for both the Client and Server as well as the process time. The important things to note here:

  • If you have a really long avg(tprocess) time, double check the number of sessions. A single instance of a 30,000ms tprocess is not as big of a deal as 60,000 instances of an 800ms avg(tprocess). Also keep in mind that database servers that are performing data warehousing may have high avg(tprocess) metrics because they are building reports.
  • Note the ClientIP Subnets as you may have an issue with an MDF where clients from a specific floor or across a frame relay connection are experiencing high avg(TurnReqXfer) numbers.

If you want to see the average request transfer time by Subnet use the following Query: (I only have one subnet in my lab so I only had one result)

sourcetype="Syslog" FLOW_TURN | rex field=_raw "ClientIP=(?<subnet>\d+\.\d+\.\d+\.)" | stats avg(TurnReqXfer) by subnet

If you want to track a server's transaction process time you would use the query below:

sourcetype="Syslog" FLOW_TURN ServerIP="192.168.1.61" | timechart avg(tprocess) span=1m

Note in the graph below you can see the transaction process time for the server 192.168.1.61 throughout the day. This can give you a baseline so that you know when you are out of whack (or when the canary has died).

Conclusion:
I am not trying to say that what we do for a living is as simple as swinging a hammer in a coal mine, but for the longest time this type of wire data has not been readily accessible unless you had a "tools team" working full time on a seven figure investment in a mega APM product. This took me less than 15 minutes to set up, and I was able to quickly get a holistic view of the performance of my servers as well as start to build baselines so that I know when the servers are out of the norm. I have had my fill of APM products that need an entourage to deploy or a dozen drill downs to answer a simple question: is my server out of whack?

In the absence of data, people fill the gaps with whatever they want, and they will take creative license to speculate. The systems team will blame the code and the network, the network team will blame the server and the code, and the developers will blame the systems admins and the network team. With this simple canary herding tool, I can now fill that gap with actual data.

If the Client or Server transfer times are slow we can ask the Network team to look into it, if the tprocess time is slow it could be a SQL table indexing issue or a server resource issue. If nothing else, you have initial metrics to start with and a way to monitor if they go over a certain threshold. When integrated with a big-data platform like Splunk, you have long term baseline data to reference.

A lot of the time there is no question the canary has died; it's just a matter of figuring out which canary died.

Extrahop now has a Discovery Edition that you can download and test for free (including the FLOW_TICK and FLOW_TURN triggers).

http://www.extrahop.com/discovery/

Thanks for reading!!!

John M. Smith


Go with the Flow! Extrahop’s FLOW_TICK feature

I was test driving the new 3.10 firmware of ExtraHop and I noticed a new feature that I had not seen before (it may have been there in 3.9 and I just missed it). There is a new trigger event called FLOW_TICK that basically monitors connectivity between two devices at Layer 4, allowing you to see the response times between two devices regardless of the Layer 7 protocol. This can be very valuable if you just want to see whether there is a network related issue in the communication between two nodes. Say you have an HL7 interface or a SQL Server that an application connects to. You are now able to capture flows between those two devices, or even look at the round trip time of tiered applications from the client, to the web farm, to the back end database. When you integrate it with Splunk you get an excellent table or chart of the conversation between the nodes.

The Trigger:
The first step is to set up a trigger and select the "FLOW_TICK" event.

Then click on the Editor and enter the following text (you can copy/paste the text and it should appear as in the graphic below):

log("RTT " + Flow.roundTripTime)
RemoteSyslog.info(
" eh_event=FLOW_TICK" +
" ClientIP="+Flow.client.ipaddr+
" ServerIP="+Flow.server.ipaddr+
" ServerPort="+Flow.server.port+
" ServerName="+Flow.server.device.dnsNames[0]+
" RTT="+Flow.roundTripTime
)

Integration with Splunk:
So if you have your integration with Splunk set up, you can start consulting your Splunk interface to see the performance of your layer 4 conversations using the following Text:
sourcetype="Syslog" FLOW_TICK | stats count(_time) as TotalSessions avg(RTT) by ClientIP ServerIP ServerPort

This should give you a table that looks like this: (Note you have the Client/Server the Port and the total number of sessions as well as the Round Trip Time)

If you want to narrow your search down you can simply put a filter into the first part of your Splunk Query: (Example, if I wanted to just look at SQL Traffic I would type the following Query)
sourcetype="Syslog" FLOW_TICK 1433 | stats count(_time) as TotalSessions avg(RTT) by ClientIP ServerIP ServerPort

By adding the 1433 (or whatever port you want to filter on) you can restrict to just that port. You can also enter in the IP Address you wish to filter on as well.

INFOSEC Advantage:
Perhaps an even better function of the FLOW_TICK event is the ability to monitor egress points within your network. One of my soapbox issues in INFOSEC is the fact that practitioners beat their chests about what incoming packets they block, but until recently, the few that got in could take whatever the hell they wanted and leave unmolested. Even a mall security guard knows that nothing is actually stolen until it leaves the building. If a system is infected with malware you have the ability, when you integrate Extrahop with Splunk and the Google Maps add-on, to see outgoing connections over odd ports. If you see a client on your server segment (not workstation segment) making 6,000 connections to a server in China over port 8016, that is, maybe, something you should look into.

When you integrate with the Splunk Google Maps add-on you can use the following search:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ServerIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats avg(RTT) by ClientIP IP ServerPort IP_city IP_region_name IP_country_name

This will yield the following table: (Note that you can see a number of connections leaving the network to make connections in China and New Zealand, the Chinese connections I made on purpose for this lab and the New Zealand connections are NTP connections embedded into XenServer)

If you suspected you were infected with Malware and you wanted to see which subnets were infected you would use the following Splunk Query:
sourcetype="Syslog" FLOW_TICK %MalwareDestinationAddress% | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ClientIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats count(_time) by NetID

Geospatial representation:
Even better, if you want to do some big-time geospatial analysis with Extrahop and Splunk, you can use the Google Maps application by entering the following query into Splunk:
sourcetype="Syslog" FLOW_TICK | rex field=_raw "ServerIP=(?<IP>.[^:]+)\sServerPort" | rex field=_raw "ClientIP=(?<NetID>\b\d{1,3}\.\d{1,3}\.\d{1,3})" | geoip IP | stats avg(RTT) by ClientIP NetID IP ServerPort IP_city IP_region_name IP_country_name | geoip IP

Conclusion:
I apologize for the RegEx on the ServerIP field; for some reason I wasn't getting consistent results with my data. You should be able to geocode the ServerIP field without any issues. As you can see, FLOW_TICK gives you the ability to monitor the Layer 4 communications between any two hosts, and when you integrate it with Splunk you get some outstanding reporting. You could actually look at the average round trip time to a specific SQL Server or web server by subnet. This could quickly allow you to diagnose issues in the MDF or determine whether you have a problem on the actual server. From an INFOSEC standpoint, this is fantastic; your INFOSEC team would love to get this kind of data on a daily basis. Previously, I used a custom Edgesight query to deliver a report to me that I would look over every morning to see if anything looked inconsistent. If you see an IP making a 3389 connection to an IP on FIOS or COMCAST, then you know they are RDPing home. More importantly, the idea that an INFOSEC team is going to be able to be responsible for everyone's security is absurd. We, as Sys Admins and Shared Services folks, need to take responsibility for our own security. Periodically validating egress is a great way to find out quickly if malware is running amok on your network.

Thanks for reading

John M. Smith