Solutions like Microsoft Exchange and Microsoft Lync have increasingly become the preferred medium for employees to communicate with one another and with the outside world. Any degradation in performance or unavailability of these services is noticed immediately, and the business impact is reduced employee productivity, which translates into lost revenue and brand value.
To realize the business value of these investments, end users must be able to use Exchange and Lync effectively to achieve business results. Most of our customers ask us:
How do you monitor End User Experience for Exchange and Lync solutions?
Traditionally, when people speak about end user experience, the discussion is generally focused on the following:
- application response times;
- aesthetic design of the application & ease of use;
- application availability
By focusing on application response times and availability, customers in practice monitor only the resource utilization of infrastructure such as the network and servers within the data center. Tools of this kind are excellent at providing a view of infrastructure and application component availability within the data center, but they miss the complete picture of service quality from the perspective that matters most: how the end user actually experiences the service.
The end user experience evolves with every interaction users have with the solutions and with the people managing them. Monitoring is of little help if the solution has not been architected optimally or does not account for the peaks and valleys of the business. Change management and a communication strategy, along with focused service delivery, can help prevent the dissatisfaction that arises from transformational changes in the environment, operational downtime, or the unavailability of key services and resources. The key elements for managing end user experience are:
Architecture & Design
Every business and its users have a unique set of requirements when it comes to the behavior and quality of service they expect from Exchange and Lync. In our experience, analyzing the business and technical requirements and aligning the architecture accordingly is the key.
From a Messaging and Collaboration perspective, in addition to the required features, there are three key areas that we factor in during the architecture and design phase because they directly affect the availability and experience of these services for end users: Network, Associated Infrastructure and Server Platform.
The network is where we spend a lot of time carefully planning for latency and bandwidth, as most Exchange/Lync solutions are deployed in a centralized topology based out of one or two major datacenters across the globe. The reason is that most of the backend Messaging & Collaboration infrastructure is dedicated, whereas the network (primarily the WAN) is shared across multiple technologies and users.
From a latency perspective, the key consideration is how strongly the clients (for example, Outlook) that connect to the backend infrastructure depend on network latency.
|Performance based upon Network Latency (ms)|
|Outlook 2010 (MAPI-Online)||0-110 ms||110-190 ms||190-270 ms|
|Outlook 2010 (OA-Cached)||0-200 ms||200-410 ms||410-630 ms|
|Outlook 2010 (MAPI-Cached)||0-320 ms||320-560 ms||560-810 ms|
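The latency bands in the table above can be turned into a simple lookup. The following is a minimal Python sketch, not part of any Microsoft tooling: the names `LATENCY_BANDS_MS` and `latency_band` are hypothetical. Because the table does not label its three columns, the sketch only returns the column position (1 for the lowest-latency band, 3 for the highest). Treating a boundary value as belonging to the lower band is also an assumption.

```python
from typing import Optional

# Upper bounds (ms) of the three latency bands per Outlook 2010 mode,
# copied from the table above. The table does not name the bands, so
# they are identified by column position only (assumption).
LATENCY_BANDS_MS = {
    "MAPI-Online": (110, 190, 270),
    "OA-Cached": (200, 410, 630),
    "MAPI-Cached": (320, 560, 810),
}


def latency_band(mode: str, latency_ms: float) -> Optional[int]:
    """Return 1, 2 or 3 for the band a latency falls into.

    Returns None if the latency is negative or beyond the last band
    listed for that mode.
    """
    if latency_ms < 0:
        return None
    for position, upper_ms in enumerate(LATENCY_BANDS_MS[mode], start=1):
        if latency_ms <= upper_ms:
            return position
    return None
```

For example, a measured latency of 150 ms falls in the second band for `"MAPI-Online"` but still in the first band for `"OA-Cached"`, which reflects how much more latency the cached modes tolerate.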
From a bandwidth perspective, user profiling is of the utmost importance, as it determines what the network utilization per user is going to be.
The major factors we consider for bandwidth calculations are:
- messages – Sent/Received;
- address book downloads;
- free/busy lookups;
- logon traffic;
- OST re-sync;
The bandwidth requirements for a heavy user who sends and receives 150 messages a day, each approximately 50 KB in size, are:
|Heavy User Profile [Sent: 30 | Received: 120 | Avg Message Size: 50 KB]|
|Outlook 2010 (OA-Cached)||8739.5||5.15|
|Outlook 2010 (MAPI-Cached)||8450.2||4.98|
|Outlook 2010 (MAPI-Online)||10763.1||6.35|
The bandwidth requirements for a medium user who sends and receives 100 messages a day, each approximately 50 KB in size, are:
|Medium User Profile [Sent: 20 | Received: 80 | Avg Message Size: 50 KB]|
|Outlook 2010 (OA-Cached)||6207.2||3.66|
|Outlook 2010 (MAPI-Cached)||6011.7||3.54|
|Outlook 2010 (MAPI-Online)||7273.3||4.29|
The overall bandwidth requirement must cater to the total number of in-scope users at each geographic location, across the various user profiles.
Similarly, network bandwidth utilization for Lync depends on three key factors, namely the signaling, media and content traffic that flows across the network between geographic locations, again based upon user profiling. The types of traffic that are factored in, each with its own bandwidth requirement, are:
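As a rough illustration of that aggregation, the sketch below multiplies the number of users in each profile by a per-user figure and sums the results for one location. It is a simplified example under stated assumptions: `site_bandwidth` and the user counts are hypothetical, and the per-user values are taken from the second numeric column of the profile tables above, whose unit the tables do not state, so the total is in that same unit.

```python
from typing import Dict

# Per-user figures for Outlook 2010 (OA-Cached), taken from the second
# numeric column of the heavy and medium profile tables above. The
# tables do not state the unit, so results share that unstated unit.
PER_USER_OA_CACHED = {"heavy": 5.15, "medium": 3.66}


def site_bandwidth(user_counts: Dict[str, int],
                   per_user: Dict[str, float]) -> float:
    """Sum (number of users x per-user figure) over all profiles."""
    return sum(count * per_user[profile]
               for profile, count in user_counts.items())


# Hypothetical location with 200 heavy and 500 medium users.
total = site_bandwidth({"heavy": 200, "medium": 500}, PER_USER_OA_CACHED)
print(f"Estimated requirement for the location: {total:.1f}")
```

A plain sum like this ignores concurrency and headroom, so it is best read as a starting point for sizing rather than a final figure.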
- SIP signaling (Includes presence)
- Instant Messaging
- Peer-to-peer audio and video
- Multiparty audio/ video
- PSTN audio
- Desktop/application sharing
|Lync – Quality Metric|
|Quality Metric (average)||Good||Acceptable||Poor|
|Round Trip Time (RTT)||0-150 ms||150–200 ms||200-500 ms|
|Jitter||0-20 ms||20-30 ms||30-45 ms|
|Packet Loss Ratio||0% – 3%||3% – 5%||5% – 10%|
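The quality bands above lend themselves to a small classifier. The following Python sketch is illustrative only: `rate_metric` and `rate_call` are hypothetical names, and three choices are assumptions rather than anything the table specifies. A boundary value receives the better rating, values beyond the Poor range are still rated Poor, and the overall rating is the worst of the three individual ratings.

```python
# Upper bounds of the Good / Acceptable / Poor bands from the table above.
LYNC_BANDS = {
    "rtt_ms": (150, 200, 500),
    "jitter_ms": (20, 30, 45),
    "packet_loss_pct": (3, 5, 10),
}
RATINGS = ("Good", "Acceptable", "Poor")  # ordered from best to worst


def rate_metric(metric: str, value: float) -> str:
    """Rate one averaged metric against the table's bands.

    Assumptions: a boundary value gets the better rating, and anything
    above the Acceptable band (including values past the Poor range)
    is rated Poor.
    """
    good_max, acceptable_max, _poor_max = LYNC_BANDS[metric]
    if value <= good_max:
        return "Good"
    if value <= acceptable_max:
        return "Acceptable"
    return "Poor"


def rate_call(rtt_ms: float, jitter_ms: float, packet_loss_pct: float) -> str:
    """Overall rating, taken as the worst individual rating (assumption)."""
    ratings = [
        rate_metric("rtt_ms", rtt_ms),
        rate_metric("jitter_ms", jitter_ms),
        rate_metric("packet_loss_pct", packet_loss_pct),
    ]
    return max(ratings, key=RATINGS.index)
```

With these rules, a call averaging 120 ms RTT, 25 ms jitter and 1% packet loss is rated Acceptable, because jitter is the weakest of the three metrics.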
The Server Platform and Associated Infrastructure, comprising the Exchange/Lync applications and servers, reverse proxies, load balancers, WAN optimizers, directory services, anti-spam, anti-virus, and monitoring and reporting (to name a few), are relatively easier to architect and size, based upon the key in-scope factors of user profiling, feature set and hardware/software.
As part of Architecture and Design, we emphasize the following key principles and processes to improve the end user experience after the infrastructure is deployed:
- HA & DR considerations along with RPO/RTO
- Performance and Volume testing
- User Acceptance Testing
- Operations, Monitoring and Reporting
- Software Lifecycle Management
Industry-leading analysts have established that in 74% of the reported help desk cases, IT first learns about performance and availability problems when the users call the Help Desk.
When we effectively monitor infrastructure performance and user experience, we not only ensure that end users get the high performance and quality of experience (QoE) they deserve, but also help the business achieve better outcomes through increased visibility. Effective monitoring also helps enterprises attain the service levels demanded by users.
The users of a messaging system primarily interact with the system for the following key activities:
- Client Connectivity – To retrieve their mails from Exchange
- Mail Routing – Send and Receive mails
- Address Book Lookups – While sending mails or to search for specific people in the organization
- Free/Busy lookups – Check whether another person or resource is available during a certain period
Any degradation of performance in these services is noticed by end users immediately. In contrast, a backend issue such as a database corruption or the unavailability of a server does not affect users if the solution is architected properly.
End user experience is determined by the responsiveness their Outlook clients get when accessing the backend Exchange Client Access servers. The metrics below help monitor overall client access connectivity:
|Exchange – Outlook – Client Access Metric|
|RPC Averaged Latency||< 250 ms||< 5 min|
|RPC Requests||< 40||Always|
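One way to act on a threshold paired with a time window is to alert only when the counter stays over its limit for a sustained period. The sketch below assumes that the last column of the table means the threshold may be exceeded for less than five minutes before it counts as a problem. That reading is an interpretation, not something the table states, and `longest_breach_seconds` and `rpc_latency_alert` are hypothetical helper names.

```python
from typing import Iterable, Optional, Tuple


def longest_breach_seconds(samples: Iterable[Tuple[float, float]],
                           threshold: float) -> float:
    """Longest continuous stretch, in seconds, at or above the threshold.

    `samples` are (timestamp_seconds, value) pairs in time order. A
    stretch is measured from its first to its last breaching sample.
    """
    longest = 0.0
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value >= threshold:
            if breach_start is None:
                breach_start = timestamp
            longest = max(longest, timestamp - breach_start)
        else:
            breach_start = None
    return longest


def rpc_latency_alert(samples: Iterable[Tuple[float, float]],
                      threshold_ms: float = 250.0,
                      max_breach_s: float = 300.0) -> bool:
    """True if latency stayed at or above 250 ms for 5 minutes or more
    (assumed interpretation of the table's two columns)."""
    return longest_breach_seconds(samples, threshold_ms) >= max_breach_s
```

Short spikes therefore do not raise an alert, while a latency that stays high across consecutive samples spanning five minutes does.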
Address Book Lookups
The responsiveness of the service while users look up the Global Address List (GAL) is the second most important factor from an end user experience perspective. The counters below are critical for monitoring address book service connections and performance:
|Exchange – Outlook – Address Book Metric|
|NSPI RPC Browse Requests Average Latency||< 1000 ms||< 5 min|
|NSPI RPC Requests Average Latency||< 1000 ms||< 5 min|
|Referral RPC Requests Average Latency||< 1000 ms||< 5 min|
The Availability service in Exchange 2010 lets Microsoft Office Outlook clients check the availability of people and resources. The metric below shows the average time to process a free/busy request, in seconds, and helps assess the client load and the responsiveness of the service itself.
|Exchange – Outlook – Free/Busy Metric|
|Average Time to Process a Free Busy Request||< 5 sec||Always|
A degraded end user experience with client connectivity or lookups impacts productivity, but issues in mail routing or delivery can result in a loss of revenue or even goodwill. Not only does a failure in mail delivery get noticed internally, it also catches the eye of competitors and market analysts, resulting in negative publicity. The following counters are critical to ensuring that there is no email loss or delay.
|Exchange – Mail Routing Metric|
|Aggregate Delivery Queue Length (All Queues)||< 3000||< 5 min|
|Active Remote Delivery Queue Length||< 250||Always|
|Active Mailbox Delivery Queue Length||< 250||Always|
|Submission Queue Length||< 100||Always|
|Active Non-SMTP Delivery Queue Length||< 250||Always|
|Retry Mailbox Delivery Queue Length||< 100||Always|
|Retry Non-SMTP Delivery Queue Length||< 100||Always|
|Retry Remote Delivery Queue Length||< 100||Always|
|Unreachable Queue Length||< 100||Always|
|Poison Queue Length||0||Always|
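A simple way to use these thresholds is to compare a sampled set of queue lengths against them and list the counters that are out of bounds. The following Python sketch is an illustration, not a monitoring product: `QUEUE_LIMITS` and `queues_over_limit` are hypothetical names, how the sample is collected is left open, and the last column of the table is ignored here. The Poison Queue row, which must be exactly 0, is modeled as a limit of 1 on integer queue lengths.

```python
from typing import Dict, List

# Exclusive upper limits taken from the mail routing table above.
# A row "< N" is stored as N. The Poison Queue must be 0, which for
# integer queue lengths is the same as "< 1".
QUEUE_LIMITS = {
    "Aggregate Delivery Queue Length (All Queues)": 3000,
    "Active Remote Delivery Queue Length": 250,
    "Active Mailbox Delivery Queue Length": 250,
    "Submission Queue Length": 100,
    "Active Non-SMTP Delivery Queue Length": 250,
    "Retry Mailbox Delivery Queue Length": 100,
    "Retry Non-SMTP Delivery Queue Length": 100,
    "Retry Remote Delivery Queue Length": 100,
    "Unreachable Queue Length": 100,
    "Poison Queue Length": 1,
}


def queues_over_limit(sample: Dict[str, int]) -> List[str]:
    """Return the sorted names of counters at or above their limit.

    Counters in `sample` that are not in the table are skipped.
    """
    return sorted(
        name for name, value in sample.items()
        if name in QUEUE_LIMITS and value >= QUEUE_LIMITS[name]
    )
```

For a sample with 40 messages in the submission queue, 100 in the unreachable queue and 2 in the poison queue, the function reports the poison and unreachable queues and leaves the submission queue out.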
Of course, both connectivity and access performance are subject to the performance of the underlying hardware. The primary things we consider to ensure system performance are processor, disk and memory utilization, as well as the availability of application services.
Lync provides real-time communications, and users notice instantly if the solution becomes unavailable. If Lync Server becomes unavailable for some reason, it affects:
- Meetings, which may need to be cancelled and rescheduled.
- Onsite workers, who may become less efficient.
- Remote workers
|Lync – Quality Metric|
|Round Trip Time (RTT)||<200 ms||95%|
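The target above can be checked against collected measurements with a few lines of code. The sketch below assumes the row means that at least 95% of RTT samples should be below 200 ms, which is one plausible reading of the table rather than a documented definition. `rtt_compliance` and `meets_rtt_target` are hypothetical names, and the source of the samples is left open.

```python
from typing import Sequence


def rtt_compliance(rtt_samples_ms: Sequence[float],
                   limit_ms: float = 200.0) -> float:
    """Fraction of samples whose RTT is strictly below the limit."""
    if not rtt_samples_ms:
        raise ValueError("at least one RTT sample is required")
    within = sum(1 for rtt in rtt_samples_ms if rtt < limit_ms)
    return within / len(rtt_samples_ms)


def meets_rtt_target(rtt_samples_ms: Sequence[float],
                     limit_ms: float = 200.0,
                     target: float = 0.95) -> bool:
    """True if the share of samples under the limit reaches the target
    (assumed meaning of the 95% figure in the table)."""
    return rtt_compliance(rtt_samples_ms, limit_ms) >= target
```

With 20 samples, one slow measurement still meets the target, while two slow measurements bring compliance down to 90% and miss it.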
The key to monitoring Lync Server is to leverage a variety of features that provide real-time visibility into the health of a Lync Server environment.
|Lync – Monitoring Scenarios/Features|
|Synthetic transactions||Run using PowerShell from various geographic locations to test the availability of user scenarios such as sign-in, presence, IM and conferencing.|
|Call reliability alerts||Database queries of Call Detail Records (CDRs) to produce alerts that indicate when a high number of users experience connectivity issues for peer-to-peer calls or basic conferencing functionality.|
|Media quality alerts||Database queries that look at Quality of Experience (QoE) reports to produce alerts that pinpoint scenarios where users are most likely to experience compromised media quality during calls and conferences.|
|Component health alerts||Monitoring of server components via event logs and performance counters to indicate conditions such as services not running, high failure rates, connectivity issues and high message latency.|
|Dependency health monitoring||Lync Server can fail for a variety of external reasons, including IIS availability, CPU and memory usage, processes, and disk metrics.|
Service Delivery and Change Management
Understanding and managing user expectations by means of effective service delivery processes, change management, training and a communication strategy is another key aspect for us in ensuring a better end user experience and higher satisfaction. For example, keeping users informed about planned downtime or any changes to their operating environment makes them more likely to participate and cooperate, which improves satisfaction.
User-centric, proactive IT management helps address critical problems at their core with performance baselines, proactive incident analysis, isolation of impacted users, and root cause analysis. Finally, adherence to SLAs, coupled with an effective process for handling feedback, ensures that end users know they are being taken care of, resulting in an improved end user experience.