This guest post is written by BizTalk and Azure integration consultant, technical writer and #aimsperformancepro Eva De Jong.
Recently, I was asked to help set up monitoring using AIMS for one of our customers at Motion 10. This was a very exciting project for me, since I have written a lot about AIMS agents but had never really worked with real (customer) data.
I started by downloading and installing the agents, then gave them time to learn how the machines are supposed to behave, so that they can recognize when the environment deviates from its normal behavior patterns.
After this two-week period, I started building dashboards and reports based on my findings, combined with (my own) best practices for reports on performance / monitoring with AIMS.
Getting started with BizTalk, OS and SQL monitoring can be a bit confusing because of the many performance counters. Luckily, I have some experienced colleagues in all these areas, so I got to create some cool dashboards.
Below you’ll find the dashboards as I would set them up at the BizTalk, SQL and operating system level.
BizTalk Performance dashboard
Block 1: CPU performance (node level), representing the CPU load on each of the nodes in the BizTalk Group. BizTalk automatically load-balances its workload; however, some adapters (like FTP and SMTP) are not ‘cluster aware’ and can only be active on one single node per BizTalk group. Other processes not related to BizTalk services might also cause fluctuations on a single node. When behaving normally, the CPU load should be evenly distributed over the nodes.
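To make the "evenly distributed" check concrete, here is a minimal sketch that flags nodes whose CPU load sits far from the group average. It is not part of AIMS; the function name `cpu_imbalance`, the node names and the 15-point tolerance are assumptions chosen purely for illustration.

```python
from statistics import mean


def cpu_imbalance(node_loads, tolerance=15.0):
    """Return nodes whose CPU load (in %) deviates from the group mean
    by more than `tolerance` percentage points, with the deviation."""
    avg = mean(node_loads.values())
    return {
        node: round(load - avg, 1)
        for node, load in node_loads.items()
        if abs(load - avg) > tolerance
    }


# Hypothetical readings for a four-node BizTalk group.
sample = {"BTS01": 35.0, "BTS02": 38.0, "BTS03": 72.0, "BTS04": 33.0}
print(cpu_imbalance(sample))  # {'BTS03': 27.5}
```

A node that shows up here is not necessarily a problem: it may simply be the one hosting an adapter that is not cluster aware.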
Block 2: Memory usage (node level), Over time you might see a gradual increase of memory usage. In that case it’s good to have a look at the memory usage per host. Memory leaks in e.g. badly written adapters or pipeline components might cause a host instance to consume more and more memory which can cause throttling or low-memory exceptions over time.
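A gradual increase like this can be hard to judge by eye. As a rough sketch, assuming you can export evenly spaced memory samples for a host instance, a least-squares slope tells you how much memory grows per sample. The names `slope_per_sample` and `looks_like_leak`, and the threshold value, are hypothetical and would need tuning per environment.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_leak(values, min_slope):
    """True when memory grows faster than `min_slope` per sample."""
    return slope_per_sample(values) > min_slope


# Hypothetical memory usage (MB) of one host instance, one sample per hour.
memory_mb = [100, 110, 120, 130]
print(slope_per_sample(memory_mb))      # 10.0
print(looks_like_leak(memory_mb, 5.0))  # True
```

A positive slope over a few hours can be normal warm-up, so this is only meaningful over a longer window, for example several days.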
Block 3: BizTalk metrics (group level), This graph plots some of the most important BizTalk load / performance indicators on group level.
• The blue line shows the message count over all receive ports on a given time interval.
• The orange line shows the number of megabytes of data transferred in all the receive ports.
• The green line represents the average message delay for messages in the Message Box.
• The red line shows whether throttling has occurred. Focus especially on the peaks to notice an increase in momentary throttling.
These graphs usually are strongly related to each other. High message count and high message volume increase the chance of throttling and message delay.
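If you want to put a number on how strongly two of these lines move together, a Pearson correlation over the same time intervals is one simple option. The sketch below assumes you have exported two equally long series; the function name `pearson` and the sample values are made up for illustration and do not come from the customer environment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series (-1.0 to 1.0).
    Returns 0.0 when either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


# Hypothetical per-interval values: message count and throttling events.
message_counts = [120, 340, 560, 900, 300]
throttling = [0, 1, 3, 7, 1]
print(round(pearson(message_counts, throttling), 2))
```

A value close to 1 supports the idea that throttling follows load; a value near 0 suggests the throttling has another cause worth investigating.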
Block 4: Performance overview of the BizTalk Group, this last block gives direct feedback on any current issues. Throttling will usually increase the number of suspended messages.
(Host) Throttling dashboard
On the throttling dashboard, I’ve created a block on top for global throttling; Message publishing throttling vs. Message delivery throttling. Beneath are the four separate BizTalk servers combined with all host throttling – delivery and publishing.
Operating System Dashboard
On the operating system dashboard I have created a Disk C – Memory – Network – CPU graph for all servers in the environment. Underneath you’ll see an overview of all errors and anomalies on a per server, per day basis, allowing you to spot trouble right away.
SQL Performance Dashboard
The SQL performance dashboard was created with some help from our SQL expert. I did not know enough about all SQL performance counters to create an intelligent dashboard, but I wanted to make sure my customer had the best insights I could offer him. It would have been a shame not to use AIMS’ full potential.
See our result below:
Block 1: SQL CPU vs. Memory, SQL Server is rarely CPU intensive. Performance issues are almost always storage I/O bound. Although the CPU can spike for short periods to 100%, high CPU is a symptom of another issue, for example, high page swapping or I/O issues. Most of the RAM usage in SQL Server is buffer cache. The size of the buffer cache can offset some I/O issues. More RAM almost always equates to better performance.
Block 2, Memory usage: Target Server memory (how fast is memory growing towards its set target) vs. Size of Physical Memory vs. Max server memory gives us some insight into memory usage.
Block 3, Writes and reads: Logical reads vs. logical writes vs. physical reads vs. physical writes. This combination tells us something about the disk usage. A peak in reads can be caused by reports or stored procedures that are not stored in memory. A peak in physical writes can represent an update, create or delete statement being executed.
Block 4, Memory: Internal memory pressure on system level vs. internal memory pressure on process level vs. Non-Uniform Memory Access (NUMA) nodes with memory dangerously low vs. Page life expectancy vs. memory grants pending vs. Page IO latch waits. SQL Server groups its schedulers into NUMA nodes that map to the grouping of CPUs, based on the hardware NUMA boundary exposed by Windows.
Page life expectancy represents the usage of cache. Page IO latch waits are the time between reads and writes in which data must be locked and cannot be adjusted. When Page IO latch waits are peaking, transfer time is up, which can indicate a memory issue (writing the data) or a disk issue (fetching the data). Combine this with page life expectancy: when page life expectancy stays the same, you should look for the initiator at disk level; when page life expectancy drops, it’s probably memory.
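The rule of thumb above can be written down as a tiny decision function. This is only a sketch of the reasoning in this post, not an AIMS feature; the name `triage_io_latch`, the baseline comparison and the 50% drop ratio are assumptions you would adjust to your own environment.

```python
def triage_io_latch(latch_peaking, ple_now, ple_baseline, drop_ratio=0.5):
    """Suggest where to look first when Page IO latch waits peak.

    ple_now / ple_baseline: current and normal page life expectancy (seconds).
    drop_ratio: fraction of the baseline below which PLE counts as dropped.
    """
    if not latch_peaking:
        return "no action"
    if ple_now < ple_baseline * drop_ratio:
        return "memory"  # PLE dropped: pages are being pushed out of cache
    return "disk"        # PLE stable: look for the initiator at disk level


print(triage_io_latch(True, ple_now=300, ple_baseline=3000))   # memory
print(triage_io_latch(True, ple_now=2900, ple_baseline=3000))  # disk
```

The output is a starting point for investigation rather than a diagnosis; the other counters in this block (memory grants pending, NUMA nodes with low memory) help confirm it.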
Block 5: CPU, Kernel mode time vs. user mode time vs. average signal wait vs. CPU utilization. Kernel mode time (OS internals, strictly separated from user processes like SQL Server) in combination with user mode time gives an indication of CPU utilization. When kernel mode time peaks, there is probably something wrong at the OS level. Average signal wait tells us how fast a task is assigned a CPU core. When average signal wait peaks, there’s probably pressure on the CPU.
Block 6: Logical writes by SQL Procedure vs. adhoc queries, compares writes and why / when they peak.
Block 7: Database performance, Deadlocks vs. Full scans vs. page splits. A deadlock occurs when two processes are waiting on each other. This can be an indicator of index trouble. Full scans happen when no (appropriate) indexes are found; SQL Server must then scan the entire table or index. Page splits tell us something about the density of the data and index pages in the data files of the SQL Server databases.
Monitor anything with AIMS
So, as you can see, AIMS can be used to monitor (almost) anything. And using these dashboards, I'm able to get real-time analytics on my BizTalk, SQL databases and OS which can be used to confirm the health of my integration environment and support technology and even business decisions.