I have been working on a new Citrix bundle for ExtraHop customers and potential customers for a few weeks now. ExtraHop complements and expands on HDX Insight by extending your visibility into your Citrix infrastructure. In addition to ICA metrics such as ICA channel impact, latency and launch times, we give you the ability to break the information out by subnet or campus (friendly name) and to build custom dashboards that empower lower-paid staff and keep Director/VP types informed. In this post I want to discuss the Citrix Alpha Bundle and go over what it has so far and what can be included in it.
How ExtraHop works:
For those who do not know, ExtraHop is a completely agentless wire data analytics platform that provides L7 operational intelligence by passively observing L7 conversations. We do not poll SNMP or WMI libraries. ExtraHop works from a SPAN port, observes traffic, and provides information based on what it observes at up to 20 Gbps. In the case of Citrix, we do considerably more than observe the ICA channel, and in this post I will show you just how much valuable information relevant to Citrix engineers and architects is available on the wire.
How are we gathering custom Metrics?
We gather these metrics from the spanned port through a mechanism called triggers. ExtraHop allows us to logically create device groups, collate your VDAs (ICA listeners for either XenDesktop or XenApp) into them, and apply triggers specifically to those device groups. Device groups can be created using a variety of methods, including IP address, naming convention and type (ICA, DB, HTTP, FTP, etc.), or they can be static.
Now, to earmark specific metrics to JUST your Citrix environment, you apply the necessary triggers to the device group.
(All triggers are pre-written, so you won’t have to write anything; worst case, you may have to edit some of them.)
The triggers we have written create what is called an “App Container”. Below is a screenshot of the types of metrics that we are gathering in our triggers. It does not even remotely cover all that we can do, but I will explain a few of the metrics for you.
Citrix Infrastructure Page
Infrastructure Metrics: (all metrics drill down)
- Slow Citrix Launches: The trigger increments the Slow Launch counter for any Citrix launch that takes longer than 30 seconds. You can change this threshold within the trigger itself.
- CIFS Errors: We log the CIFS errors so that you can see things like “Access Denied” or “Path Not Found”. Anyone who has had a DFS share removed, and then sat through the ensuing 90-second logon while Windows pines for the missing drive letter, knows what I am talking about.
- Fast Citrix Launches: A better name is probably normal Citrix launches. For the existing trigger set this is anything under 30 seconds. Your thresholds can be customized.
- DDC1/DDC2 Registrations: The sample data did not have any XenDesktop, but this is a custom XML trigger written to count the number of times a VDA registers with a DDC so that you can see the distribution of your VDAs across your XenDesktop infrastructure.
- DNS Response Errors: Self-explanatory, but be prepared if you have not looked at your DNS before. Because DNS failures happen on the wire, this is a huge blind spot for agent-based solutions. I was shocked the first time I actually saw my DNS failure rate.
- DNS Timeouts: Even more damning than response errors, these are flat-out failures. These, like response errors, can indicate Active Directory issues due to an overworked DC/GC or misconfigured Sites and Services.
- Citrix I/O Stalls: While we do not measure CPU/Disk/Memory, we can see I/O related issues via the zero windows. When a system is experiencing I/O binding, it will close the TCP window. When this happens we will see it and it is an indication of an I/O related issue.
- Server I/O Stalls: This is basically the opposite of a Citrix I/O stall. If a back-end database server is acting up and the person on the phone says “Citrix”, everyone else plays the Citrix “Get out of jail free” card and the call is sent to the Citrix team. This metric lets the Citrix team see that the back-end server is having I/O-related issues, so they do not waste their time doing someone else’s troubleshooting, which in my 16 years of supporting Citrix was about 70% of the time.
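The slow/fast launch counters described above boil down to a threshold check. Here is a minimal sketch of that logic in Python. It is not the actual trigger code from the bundle; the function names and sample launch times are made up for illustration, and only the 30-second threshold comes from the trigger set described here.

```python
from collections import Counter

# Threshold described in the post: launches over 30 seconds count as slow.
SLOW_LAUNCH_THRESHOLD_S = 30.0


def classify_launch(launch_seconds, threshold=SLOW_LAUNCH_THRESHOLD_S):
    """Return 'slow' when the launch time exceeds the threshold, else 'fast'.

    Assumption: a launch of exactly `threshold` seconds is treated as fast.
    """
    return "slow" if launch_seconds > threshold else "fast"


def count_launches(launch_times, threshold=SLOW_LAUNCH_THRESHOLD_S):
    """Tally slow and fast launches for a list of launch times in seconds."""
    counts = Counter({"slow": 0, "fast": 0})
    for seconds in launch_times:
        counts[classify_launch(seconds, threshold)] += 1
    return counts


# Hypothetical launch times in seconds.
sample = [12.4, 31.0, 8.9, 45.2, 30.0]
counts = count_launches(sample)
print(counts["slow"], counts["fast"])
```

Passing a different `threshold` is the equivalent of editing the value inside the trigger.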
Launch Metrics: (Chart includes drill down)
When you click on the chart under Launch Metrics you are given a list of launch/login metrics, broken out as follows.
- Launch by Subnet: We collect this to see if there is an issue with a specific subnet.
- Launch by App: Some applications have wrappers that launch, or parameters that make external connections, so we provide launch times by application.
- Launch by Server: We provide this metric so that you can easily see if login issues are specific to a particular server.
- Launch by User: This lets you validate that a specific user is having issues, or notice things like “hey, all these users belong to the accounting OU”, in which case maybe there is an issue with a drive letter or login script.
- Login by User: This is the login info, meaning how fast A/D logged them in.
Login vs. Launch:
At a previous employer, we noted that a really long load (launch) time accompanied by a really long login time meant we needed to look at the A/D infrastructure (Sites and Services, Domain Controllers), while a long load time accompanied by a short login time indicated issues such as long profile loads. The idea is that the login time should be around 90% of the load time, meaning that not much goes on post-login.
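That rule of thumb can be written down as a small decision function. This is only a sketch of the heuristic, not something the bundle ships. The function name and return labels are hypothetical, and using the 90% figure as a hard cut-off is my own simplification.

```python
def diagnose_launch(launch_seconds, login_seconds,
                    slow_launch_s=30.0, login_share=0.9):
    """Suggest where to look first for a slow launch.

    Assumptions: 30 seconds marks a slow launch, and a login that takes
    at least `login_share` of the launch time means the login dominated.
    """
    if launch_seconds <= slow_launch_s:
        return "ok"
    if login_seconds >= login_share * launch_seconds:
        # Long launch and long login: look at Sites and Services, DCs.
        return "check A/D infrastructure"
    # Long launch but short login: time was spent after authentication.
    return "check profile load / post-login"


print(diagnose_launch(60, 55))
print(diagnose_launch(60, 10))
print(diagnose_launch(20, 18))
```

A launch that is slow mostly because of the login points at the directory side, while one with a quick login points at whatever runs afterwards.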
XML Broker Performance:
We are one of the few platforms that can provide visibility into the performance of the XML broker. While not part of the ICA channel, it is an important part of your overall Citrix infrastructure. Slow XML brokering can cause slow launches, applications not being painted, etc. We can also report on the STA errors that we see, because we have visibility into XML traffic between the NetScaler/Web Interface and the Secure Ticket Authority (STA).
If you have applications that are not run directly on your Citrix servers and you are not using hosts files, DNS performance is extremely important. The drill down into the DNS Performance chart provides performance by client and by server. If you see a specific DNS server that is having issues, you may be able to escalate it to the A/D and DNS teams.
Citrix Latency Metrics Page
- Latency by Subnet: Many network engineers will geo-spatially organize their campus/enterprise/WAN by network ID. One of the Operational Intelligence benefits of ExtraHop is that we can use triggers to logically organize the performance by subnet allowing a Citrix team to break out the performance by Subnet. If given a list of friendly names, we can actually provide a mapping of location-to-NETID. Example: 192.168.252.0 is your 3rd floor on your main campus. We can provide the actual friendly name if you want. This can be very useful in quickly identifying problem areas, especially for larger Citrix environments.
- High Latency Users: For this metric, any user who crosses the 300ms threshold is placed into the High Latency area. The idea is that this chart should be sparse, so if you see a large amount of data here you may need to investigate further. Also, double check the users you note here against the chart below, which shows overall user latency. A user’s latency may have been high for a moment because they wandered too far from an access point or because of overall network issues, yet the Latency by UserID chart may show that their overall latency was acceptable.
- Latency by UserID: This metric is to measure the latency by user ID regardless of how good/bad it was.
- Latency by Client IP: This is the latency by individual client IP. I think that I may change this to include the latency by VDA (XenDesktop or XenApp). This can be valuable to know if a specific set of VDA listeners is having issues.
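To make the difference between the High Latency Users and Latency by UserID views concrete, here is a small Python sketch. The user names and samples are invented, and this is not how the trigger stores its data. Only the 300ms threshold comes from the metric described above.

```python
from collections import defaultdict
from statistics import mean

# Threshold described above: samples over 300 ms count as high latency.
HIGH_LATENCY_MS = 300


def summarize_latency(samples):
    """Build per-user mean latency and a count of high-latency samples.

    `samples` is an iterable of (user_id, latency_ms) pairs.
    """
    per_user = defaultdict(list)
    for user, ms in samples:
        per_user[user].append(ms)
    return {
        user: {
            "mean_ms": mean(values),
            "spikes": sum(1 for v in values if v > HIGH_LATENCY_MS),
        }
        for user, values in per_user.items()
    }


def transient_only(summary):
    """Users who spiked over the threshold but whose mean was acceptable."""
    return sorted(
        user for user, s in summary.items()
        if s["spikes"] > 0 and s["mean_ms"] <= HIGH_LATENCY_MS
    )


# Hypothetical samples: alice spikes once, bob is consistently slow.
samples = [
    ("alice", 40), ("alice", 35), ("alice", 420), ("alice", 45),
    ("bob", 350), ("bob", 380), ("bob", 410),
    ("carol", 50), ("carol", 60),
]
summary = summarize_latency(samples)
print(transient_only(summary))
```

Here alice shows up in the high latency view because of one bad sample, but her overall figure is fine, whereas bob is a real problem in both views.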
Below is the drill down for the Latency by Subnet chart. This will allow you to see if you have an issue with a specific subnet within your organization. Example: You get a rash of calls about type-ahead delays, the helpdesk/first responder does not put together that they are all from the same topological area. The information below will allow the Citrix engineer to quickly diagnose the issue if the problem is a faulty switch in an MDF or an issue with a LEC over an MPLS cloud. Below we have set the netmask to /24 but that can be changed to accommodate however you have subnetted your environment.
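The subnet grouping itself is easy to reason about outside the product. The sketch below uses Python's standard `ipaddress` module to bucket client IPs by a configurable netmask and to swap in a friendly name where one is known. The latency samples and helper names are made up, and the friendly name reuses the 192.168.252.0 example from above. It illustrates the idea only and is not the trigger code.

```python
import ipaddress
from collections import defaultdict
from statistics import mean

# Friendly-name mapping, using the example network mentioned earlier.
FRIENDLY_NAMES = {"192.168.252.0/24": "Main campus, 3rd floor"}


def subnet_of(client_ip, prefix_len=24):
    """Return the network that contains client_ip, e.g. '10.1.5.0/24'."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return str(network)


def latency_by_subnet(samples, prefix_len=24, names=None):
    """Average latency per subnet, labelled with a friendly name if known.

    `samples` is an iterable of (client_ip, latency_ms) pairs.
    """
    names = names or {}
    buckets = defaultdict(list)
    for client_ip, ms in samples:
        net = subnet_of(client_ip, prefix_len)
        buckets[names.get(net, net)].append(ms)
    return {label: mean(values) for label, values in buckets.items()}


# Hypothetical latency samples in milliseconds.
samples = [
    ("192.168.252.10", 40), ("192.168.252.77", 60),
    ("10.1.5.20", 310), ("10.1.5.21", 290),
]
report = latency_by_subnet(samples, names=FRIENDLY_NAMES)
print(report)
```

Changing `prefix_len` corresponds to changing the /24 netmask to match however you have subnetted your environment.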
Citrix PVS Performance Page
I haven’t had a great deal of PVS experience outside of what I have set up in my lab. In the last few years I had sort of morphed into a DevOps/Netscaler/INFOSEC role with the groups I was in. That said, because we are on the wire we are able to see the turn timing for your PVS traffic. I won’t go into the same detail here as with the previous two pages but what you are looking at is a heat map of your PVS turn times. In general I have noticed, other than when things are not working well, that the turn timing should be in the single digits. I will practice breaking my PVS environment to see what else I can look at. I have tested this with a few customers but their PVS environments were working fine and no matter how many times I ask them to break them they just don’t seem compelled to. I have also included Client/Server request transfer time as well as size to allow the Citrix team to check for anomalies.
Detecting Blue Screening Images:
One thing I have come across while being on a team that used PVS was that occasionally something would go wrong and systems would blue screen. Most reboots happen overnight, so without some sort of manual task it can be difficult to get into work and know right away which servers did not come up the previous night. Below is the use of a DHCP trigger that counts the number of requests. In the Sesame Street spirit, when you look below you can sort of see that “one of these kids is doin’ his own thing….”. Note that most of the PVS-driven systems have a couple of DHCP requests and 192.168.1.156 has 30. Why? Because I created a MAC address record on the PVS server and PXE booted a XenServer image from my VMware Workstation, which produced a blue screen.
In those environments with hundreds or even thousands of servers, the ability to see blue screening systems (or systems that are perpetually rebooting) can be very valuable. The information below is from our new Universal Payload Analysis event that the TME team wrote for us to gather DHCP statistics.
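Separately from the trigger output, the "one of these kids" check can be sketched in a few lines of Python: count DHCP requests per client and flag anything far above the typical count. The multiplier, the minimum count, and every address except 192.168.1.156 are assumptions made for illustration. This is not the Universal Payload Analysis trigger itself.

```python
from collections import Counter
from statistics import median


def flag_reboot_loops(dhcp_request_ips, multiplier=5, min_requests=10):
    """Return {ip: count} for clients with far more DHCP requests than usual.

    Assumed heuristic: a client is suspicious when it made at least
    `min_requests` requests and more than `multiplier` times the median.
    """
    counts = Counter(dhcp_request_ips)
    if not counts:
        return {}
    typical = median(counts.values())
    return {
        ip: n for ip, n in counts.items()
        if n >= min_requests and n > multiplier * typical
    }


# One entry per observed DHCP request; only .156 with 30 is from the post.
observed = (
    ["192.168.1.150"] * 2
    + ["192.168.1.151"] * 2
    + ["192.168.1.152"] * 3
    + ["192.168.1.156"] * 30
)
suspects = flag_reboot_loops(observed)
print(suspects)
```

With hundreds of servers you would tune the two parameters to your own reboot schedule, but the principle is the same: healthy PVS targets ask for an address a couple of times, while a system stuck in a reboot loop keeps asking.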
Other things we can add
My lab is pretty small and I don’t have any apps like PeopleSoft, Oracle Financials or basic client/server applications. Even so, ExtraHop has the ability to map out your HTTP/database/tiered applications for you and make sure that you see the performance of all of your enterprise applications as they pertain to Citrix. By adding the HTTP request/response events we can see ALL URIs and their performance as well as any 500-series errors. We can see slow stored procedures for database calls that are made from the Citrix servers. You can also classify SOAP/REST-based calls by placing those applications in their own App Container. This positions your team to report on the performance of downstream applications, which can be a sore spot for Citrix teams when they are held accountable for performance just because the front end was published on Citrix.
Empowering First Responders
When you have a small lab of 3-4 VDAs and a limited amount of demo data it is a little tough to get too detailed here, but I wanted to show ways that we can empower some of your first responders. One of the challenges with Citrix support is that it can get very expensive really fast. Normally, calls go from a helpdesk to a level 2 and, if it is still an issue, to a level 3 engineer. With Citrix, calls have a habit of going from the helpdesk directly to the Citrix engineer, and this can make supporting it very expensive. If we can position first responders to resolve the issue during the initial call, it can be a considerable savings to the organization. For this we have created a “Citrix Operations” App Container. While it is somewhat limited here, the idea is that we can put specific information in it that could make supporting remote Citrix users much easier.
Below you see a list of metrics. The App Container allows a service desk resource to click on Open/Closed sessions and get the following information.
From this page, the first responder can see if there are any database errors and, if we want, we can add the HTTP 500 errors, all in real time. The engine updates every 30 seconds, so by the time the user calls in, the first responder can go in and see that user’s Citrix experience. This can be custom retrofitted for your specific applications/environment.
So why are you writing this?
So can I haz it?
In fact, yes, you can have it for free if you like. We offer a Discovery Edition with ICA enabled that keeps up to 24 hours of data but will do everything that you see in this post. You have a few options. If you don’t want to go down the sales cycle you can download the Discovery Edition (you will get a call from inside sales to pick your brain). Or you can get an evaluation of either a physical or virtual appliance, but for that I have to get your area rep involved (you will not regret working with us and we will not make the process painful). Because we are passive and gather information with no agents, we can sit back and observe your environment with zero impact on your servers. If you want to set this up, just shoot me an email at email@example.com and I will provide you the Citrix Alpha bundle after you have downloaded the Discovery Edition or requested an evaluation.
The Discovery Edition can be downloaded from the link below. The entire process was pretty painless; I went through it a few weeks ago. After you sign up we can get you access to the documentation you will need and a forum account.
Thanks for reading and please let me know if you would like to contribute.