Posts in Category "Clustering"

Connect on VMware – some deployment tips

Issue: VMware is ubiquitous in the enterprise, and while it opens up huge potential for managing the Connect infrastructure, a deployment must be planned and executed with an eye toward robustness.

This advice is gleaned from conversations with senior members of our operations team as well as from support cases generated by various customers with on-premise VMware deployments of Connect.

One of the most important and often overlooked points about virtualization is making certain that VMware is compatible with all the underlying components of the server and network architecture. The infrastructure supporting VMware must be verified by VMware under their Hardware Certification Program or their Partner Verified and Supported Products (PVSP) program; be sure to use certified hardware.

Here is the link to the compatibility reference:  http://www.vmware.com/resources/compatibility

With Connect you must consider both Tomcat and FMS. The former can run on almost anything, while the latter is more demanding: RTMP can be acutely affected by latency and packet transmission problems. If you notice unexpected latency or a surprise FMS crash with Connect 9.1, a good place to start is the network components: sniff for packet transmission issues and make sure the vNIC of each guest VM is configured to use VMXNET3.

With reference to recommendations and best practices, it really depends on the VMware infrastructure adopted. The following references serve as a guide for an enhanced environment:

Enterprise Java Applications on VMware – Best Practices Guide: http://www.vmware.com/resources/techresources/1087

Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: https://www.vmware.com/resources/techresources/10220

Performance Best Practices for VMware vSphere 5.1: https://www.vmware.com/resources/techresources/10329

The key with Network Storage is speed. If you lose connectivity to the shared storage then only what is cached on the origins will be available.

Shared storage requirements

  • Disk specs: 10,000–15,000 RPM — Fibre Channel preferred
  • Network link: TCP/IP — 1GB I/O throughput or better
  • Controller: Dual controllers with Active/Active multipath capability
  • Protocol: CIFS or equivalent

Avoid virtualizing the Connect database if possible.

I have seen that in some overtaxed customer VMware environments, latency among the servers on port 8507 (and 8506) can cause problems. Intra-cluster latency (server-to-server communication) should never exceed 2-3 ms; when it does, we see intermittent crashes. I had one customer with a particularly weak infrastructure whose crashes I could predict: he was doing backups and running other tasks at a certain time each week that would tax and hamper network connectivity for about an hour. These tasks were so all-consuming on the network that they turned every cluster resource into an individual asset on its own island. The log traces bore this out and we knew with precision what was going on. He knew he needed to upgrade his infrastructure, and in the meantime we worked out a reaction plan to deal with the issue; it included:

  1. Place a higher than normal percentage of cache on each server to limit invoking shared storage
  2. Set the JDBC driver reconnection string for Database connectivity
  3. Plan Connect usage around these maintenance activities and, when possible, do Connect maintenance at the same time. This was not very difficult as the tasks ran after hours, but for a global operation it was still not a given.
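
To keep an eye on the 2-3 ms guideline above, a rough probe can time how long it takes to open a TCP connection to each cluster peer on port 8507. This is only a sketch: TCP connect time is an approximation of round-trip latency, the 3 ms limit is simply the upper end of the range quoted above, and the peer host name in the usage note is hypothetical.

```python
import socket
import time

# Assumption: 3 ms is the upper bound of the 2-3 ms guidance in the text.
MAX_LATENCY_MS = 3.0


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the time in milliseconds needed to open a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms, limit_ms=MAX_LATENCY_MS):
    """Report average and worst latency, and whether the worst is within the limit."""
    worst = max(samples_ms)
    average = sum(samples_ms) / len(samples_ms)
    return {"avg_ms": average, "max_ms": worst, "ok": worst <= limit_ms}


def probe(host, port=8507, count=10, pause_s=0.5):
    """Sample the connect latency to one cluster peer several times."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_ms(host, port))
        time.sleep(pause_s)
    return summarize(samples)
```

Calling probe("connect02.example.com") from each server in turn, ideally once during the backup window and once outside it, gives a before-and-after picture of how far the cluster drifts from the guideline.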

Tunneling with RTMP encapsulated in HTTP (RTMPT) should be avoided as it causes latency

Tunneling with RTMP encapsulated in HTTP (RTMPT) should be avoided as it causes latency that can have a negative impact on user experience in a Connect meeting. In rare circumstances, the latency that comes with tunneling can become so acute that it renders Connect unusable for affected clients. This performance hit is one of the primary reasons we continue to deploy Connect Edge servers: they can often replace the third-party proxy servers that are frequently the cause of tunneling latency.

While the amount of acceptable latency depends on what one is doing in the room, RTMPT tunneling affects most activities. Some activities, such as screen sharing, are more bandwidth-intensive than others, such as presenting an uploaded PowerPoint from within a meeting room; the high latency that comes with RTMPT tunneling affects the former more than the latter. VoIP is often the first place where the effects of high latency are felt. Here is some feedback from a test I did while on site with a client who was dealing with tunneling because of their refusal to route RTMP around a third-party proxy:

External user tunneling during test:
Spike at 3.10/3.02 sec
Latency 403/405 ms up to 3.53/3.52 sec up .064 down 118
When latency peaks to 2.6/2.4 sec I get a mild interruption to the audio
Video pauses momentarily when the latency spikes

Internal user with direct connection:
2 msec / 1 msec Up 0.08 kbits down 127 kbits
No pauses, delays or spikes

Tunneling should only be considered as a fallback mechanism or safety net to allow connections when RTMP is blocked due to something unplanned, or for a few remote clients who must negotiate specific network obstacles. When RTMPT is the default by network design, you will need to limit your activities within Connect to those features that use the least bandwidth.

The picture below shows a case where a direct connection over RTMPS on port 443 is being blocked somewhere on the client’s network, so Connect uses its built-in fallback path: tunneling RTMP encapsulated in HTTPS. This is usually caused by proxy servers, firewalls, or both. Any application-aware appliance on the network that sees the RTMPS traffic on 443 and recognizes that it is not HTTPS is a potential obstacle; although RTMPS traffic on port 443 is disguised as HTTPS, it may still be blocked. The result is tunneling, indicated by the “T” in the output:

[Image: connection output with tunneling]

Compare with a connection without tunneling:

no-tunnel

The recommended step for anyone experiencing persistent tunneling is for their network engineers to trust the source IP addresses of the Adobe hosted, ACMS, managed ISP or on-premise Connect/FMS server VIPs in order to prevent the blocking of RTMPS traffic. The problem stems from network policies that require all traffic to go through a proxy, yet RTMPS is not supported by any third-party proxy server; the result is tunneling, with its high latency and drops. RTMPS must be allowed to stream around the proxy to avoid the overhead and latency of tunneling encapsulated within HTTPS, and a static bypass works well for this. Attempts to cache the stream add no value.

Other options will depend on the capabilities of the third-party proxy servers in the affected client infrastructure. Blue Coat ProxySG is one of the popular proxy server options in our niche. On networks that employ Blue Coat ProxySG servers and suffer latency from tunneling RTMP encapsulated in HTTP, sniff tests done by support representatives have shown what happens when an affected client attempts to connect to an Adobe Connect meeting. The client establishes explicit HTTP connections, based on PAC file settings in the system registry, to the Blue Coat ProxySG pool through a hardware-based load-balancing device (HLD). It also establishes transparent HTTP and SSL connections through Blue Coat ProxySG, via WCCP GRE redirect, to several Adobe Connect servers.

The problem manifests with RTMPS when the client attempts to establish an SSL connection directly to the destination host without going through the PAC file proxy settings. A Blue Coat ProxySG is commonly configured to perform an SSL intercept on both explicit and transparent HTTPS traffic. When it examines the content after decrypting the SSL payload from the client, it returns an exception and closes the connection because the request does not contain an HTTP component and cannot be parsed for policy evaluation. As a workaround, other than using static bypass, it is possible to create a proxy service with the destination set to the Adobe Connect server IP range on port 443 and to set the proxy setting to TCP-Tunnel with Early Intercept enabled. This allows Blue Coat ProxySG to intercept and tunnel the traffic without considering whether it is RTMPS or HTTPS.

Watch for a more comprehensive article on this topic forthcoming.

Stunnel does not Startup with Connect

Problem: stunnel does not start up with Connect

Although stunnel can be installed as a service, it does not load the stunnel.conf file when started that way. One workaround is to not set the services to start automatically, and instead to run these batch files at startup:

Note: This tech-note assumes stunnel is installed in c:\Connect\9.0.0.1\; be sure to adapt the scripts accordingly.

Origin server startup.bat:

@ECHO ON
net start FMS
net start FMSAdmin
net start ConnectPro
net start CPTelephonyService
c:\Connect\9.0.0.1\stunnel\stunnel.exe stunnel.conf
@ECHO OFF

Origins stop.bat:

@ECHO ON
net stop ConnectPro
net stop CPTelephonyService
net stop FMSAdmin
net stop FMS /y
@ECHO OFF

If you have remote Edge servers, use these; they include cache-clearing maintenance.

Edges start.bat:

@ECHO ON
net start fms
REM Pause about 10 seconds before starting fmsadmin (loopback ping used as a timer)
ping 127.0.0.1 -n 11 >nul
net start fmsadmin
c:\breeze\edgeserver\stunnel\stunnel.exe stunnel.conf
@ECHO OFF

Edges stop.bat:

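@ECHO ON
net stop fmsadmin
REM Each ping below is a timer: -n 11 waits about 10 seconds, -n 21 about 20
ping 127.0.0.1 -n 11 >nul
net stop fms
ping 127.0.0.1 -n 21 >nul
REM Clear the Edge HTTP cache while the services are stopped
del /Q /S c:\breeze\edgeserver\win32\cache\http\*.*
ping 127.0.0.1 -n 11 >nul
@ECHO OFF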

Run > gpedit.msc
Local Computer Policy > Computer Configuration > Windows Settings > Scripts (Startup/Shutdown)
The batch files are assigned there as startup and shutdown scripts; they also remain available to be run manually.

Providing Diagnostic Data to Expedite Solutions for Connect Meeting Issues

Issue: Anything that may happen during a meeting that has a negative effect on end-user experience.

Solution: In Connect 9.1 we have a great diagnostic option in the meeting room. You can immediately pull logs from any meeting to diagnose:

If you click Help > About Adobe Connect while holding down the Ctrl key, the debug logs will appear in the meeting room and you will have the option to copy them to your clipboard.

log-mtg.fw

 

log-mtg1.fw

Send me these along with the RTMP string, which appears when you click Help > About Adobe Connect while holding down the Shift key. This is most helpful when captured from the client experiencing the extreme latency.

rtmp-mtg.fw

Now if you want to take it even one step further and provide a client-side view of the meeting:

The instructions for enabling client-side logging are here: http://helpx.adobe.com/adobe-connect/kb/enable-logging-acrobat-connect-professional.html

Providing all this data, along with the date and time (including time zone) and the Meeting URL of any issue, will greatly expedite analysis and solution.

Resource Constraints cause Connection Read Error in Logs on Clustered Connect Servers

Issue: FCSj_IO:4 (x) – Connection read error: -1 LP: 5345 RP: 8506 URI: rtmp://localhost:8506/meetingapp/7/12345678

I have seen that in some VMware environments that are very overtaxed for resources, latency among the clustered Connect servers on port 8507 (and also on 8506, though 8506 does not cause this error) can cause problems. Intra-cluster latency should never exceed 2-3 ms. When it does, we see intermittent errors and can also see crashes.

I had one unnamed customer with a particularly weak infrastructure whose crashes I could predict: he was doing backups and running other tasks at a certain time each week that would severely hamper network connectivity for about an hour. These tasks were so all-consuming on the network that they turned every Connect cluster resource into an individual asset on its own island. The Connect logs bore this out; we knew with precision what was going on and could predict his call or email based on his maintenance schedule. He knew he needed to upgrade his infrastructure, and in the meantime we worked out a reaction plan to deal with the issue; it included:

  • Place a higher than normal percentage of cache on each server to limit invoking shared storage during maintenance (see page 57)
  • Set the JDBC driver reconnection string for Database connectivity robustness
  • Plan heavy Connect usage around network and server maintenance activities and when possible, do your Connect server maintenance activities at the same time as well.
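
The correlation between the read errors and the maintenance window can be checked with a small script that counts the errors per hour. This is a sketch, not a supported tool: it assumes each log line begins with a "[MM-DD HH:MM:SS]" timestamp, which may not match your logs, so adjust the pattern to whatever prefix your Connect logs actually use.

```python
import re
from collections import Counter

# Assumption: each log line starts with a "[MM-DD HH:MM:SS]" timestamp.
# The error text itself is taken from the Issue line above.
LINE_RE = re.compile(
    r"^\[(?P<day>\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\].*"
    r"Connection read error: -1 LP: (?P<lp>\d+) RP: (?P<rp>\d+)"
)


def read_errors_by_hour(lines):
    """Count 'Connection read error' entries per (day, hour) bucket."""
    buckets = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            buckets[(match.group("day"), match.group("hour"))] += 1
    return buckets


def busiest_hours(buckets, top=3):
    """Return the buckets with the most errors, highest count first."""
    return buckets.most_common(top)
```

Feeding it the lines of a log file (for example, with open(path) as f: read_errors_by_hour(f)) and comparing the busiest hours against the backup schedule shows quickly whether the errors cluster around maintenance.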

Connect & Unified Voice (UV) Traffic Flow Diagram

Issue: Plan for the flow of traffic to enable UV among the various components in any Connect deployment: Connect, Flash Media Gateway (FMG), Session Initiation Protocol (SIP)

There are numerous documents on the topic of Unified Voice (UV) with Connect:

This diagram shows the flow of traffic and the protocols used for UV with Connect and is offered as a planning and a troubleshooting tool; click on the diagram to expand it for viewing:

 

Connect_FMG_Flow

The Adobe Connect Deployment Guide on the F5 Website needs Updating

Issue: Be careful when following the Adobe Connect Deployment Guide posted on the F5 website. While the article is helpful, there are some ambiguities that can lead to trouble. I have tried to get their deployment guide updated but have not succeeded. The LTM is the most popular load-balancing device and SSL accelerator in the Connect niche, and when it is set up properly it works splendidly. Here are corrections, updates and things to watch out for when deploying Connect behind an LTM:

1. Do not use an HTTP profile for an RTMP VIP. An HTTP profile for RTMP VIPs may affect playback of video as well as break remote Edge connectivity. Remember that you have two servers running on each box, a Tomcat application server and an FMS server. Do not treat the FMS server as though it were an application server; RTMP is a streaming protocol that requires a TCP profile at the HLD VIP.

2. Use the health monitor documented here for LTM.

3. Do not use session-awareness or sticky sessions even if you use SSL. The Round Robin algorithm should float freely to the Tomcat application pool.

4. Do not use Nagle’s Algorithm with SSL; it will have a negative effect on performance.

Review this general Connect pool/cluster configuration tutorial before configuring BIG-IP LTM with Connect: Adobe® Connect™ server pools/clusters and hardware-based load-balancing devices with SSL acceleration

Preparing Connect Servers for SSL 2048 Certificates

Problem: When a Connect server is running with untrusted, expired or private SSL certificates, Connect Meeting rooms will not launch. Preparing for the transition from 1024-bit to 2048-bit SSL certificates is very important for your on-premise, SSL-enabled Connect servers.

When you click on a Connect Meeting URL, the initial browser that opens spawns a second browser (the Connect meeting addin):

connecting1.fw

It is this hand-off between browsers that requires a fully trusted public certificate to complete; the Meeting will hang upon loading if the certificate is untrusted:

connecting.fw

During this hand-off between browser sessions, there is no opportunity to click your way through an untrusted connection. The Meeting will simply hang.

Preparing your on-premise, SSL-enabled Connect servers for the transition from 1024-bit certificates to 2048-bit certificates is very important. Failure to upgrade your certificates as required will result in Meeting rooms hanging. There is a great FAQ page on the subject on the Symantec website: 1024-bit Migration FAQs. Adobe’s SSL configuration documents and tutorials show where and how the SSL certificates are installed, either on hardware-based load-balancing devices/SSL accelerators or in stunnel:

If you run stunnel directly on the Connect server, the transition to 2048-bit certificates will produce a greater CPU footprint, so the comparison between software-based SSL and hardware-based offloaded and accelerated solutions like LTM is worth considering. The new 2048-bit certificates carry a 70% penalty on CPU load compared to current utilization. Check how much CPU stunnel currently uses with 1024-bit certificates and plan for 70% more than that.
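
The planning arithmetic is simple enough to script. In this sketch the 70% penalty comes from the paragraph above, while the 80% ceiling for total CPU load and the sample readings in the usage note are assumptions chosen only for illustration.

```python
# The 70% penalty is the figure quoted in the text above.
PENALTY = 0.70


def projected_cpu(current_percent, penalty=PENALTY):
    """Estimate stunnel CPU use after moving from 1024-bit to 2048-bit certificates."""
    return current_percent * (1.0 + penalty)


def has_headroom(current_percent, other_load_percent, ceiling=80.0):
    """Check that projected stunnel load plus all other load stays under a ceiling.

    The 80% default ceiling is an assumption, not an Adobe recommendation.
    """
    return projected_cpu(current_percent) + other_load_percent <= ceiling
```

For example, if stunnel uses 12% CPU today, projected_cpu(12.0) gives roughly 20.4%. With another 50% consumed by Connect itself, has_headroom(12.0, 50.0) is true, whereas a server where stunnel already uses 20% would exceed the assumed ceiling.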

If you are not sure whether you are currently running 1024 or 2048 certificates, use this handy tool from Symantec to check: Check your certificate installation

If your account is hosted by Adobe, then you are all set. When I plug in the domain name of an Adobe Connect hosted account for one of our training partners, Rexi Media, I get the following output:

reximedia.adobeconnect.com
Certificate information
Common name: *.adobeconnect.com
SAN: *.adobeconnect.com
Valid from: 2013-Feb-27 00:00:00
Valid to: 2014-Feb-28 23:59:59
Organization: Adobe Systems Incorporated
Organizational unit: DMBU Systems Engineering
City/locality: San Francisco
State/province: California
Country: US
Serial number: 7b8f272555087f6102773df671c95c3c
Algorithm type: SHA1withRSA
Key size: 2048
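
If you need to check several servers, the text output above can be screened with a short script. This sketch only parses "Label: value" lines in the layout shown above; it does not contact the Symantec tool or any server, and the date parsing assumes English month abbreviations.

```python
from datetime import datetime


def parse_report(text):
    """Turn 'Label: value' lines from the checker output into a dict."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(": ")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def check_certificate(fields, today):
    """Return a list of problems found in a parsed report (empty if none)."""
    problems = []
    if int(fields["Key size"]) < 2048:
        problems.append("key shorter than 2048 bits")
    expires = datetime.strptime(fields["Valid to"], "%Y-%b-%d %H:%M:%S")
    if expires < today:
        problems.append("certificate expired")
    return problems
```

Passing the pasted report through parse_report and then check_certificate(fields, datetime.now()) returns an empty list for a certificate that is both current and 2048-bit.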

Brad’s Short-list for Connect Cluster SaaS Monitoring Options

There are many monitoring options worth considering when deciding how to keep track of Connect server resources in a cluster. Articles describing clustered environments are on the Connect Users Community: http://www.connectusers.com. Simply search the Users Community using keywords such as cluster, pool, edge and SSL.

To effectively monitor your Connect cluster, SaaS options can sometimes be more cost-effective than home-spun solutions; here are some staff picks with some commentary:

Sumologic – It resembles Splunk. The main difference is that Sumologic is hosted and managed externally, while Splunk is hosted and managed on-premise. With Sumologic, there is no need for software licensing, hardware investments or internal administrator expertise. Splunk offers a similar service called splunk>storm, but it is not as mature as Sumologic and lacks some of the alerting capability found in Sumologic.

Loggly - An alternative to Sumologic could be Loggly which offers a similar service; it seems that the alerting service is not exactly built in.  It requires a little more work and is called AlertBirds.

Note: It is possible to take an on-premise option like Cacti and port it to Sumologic, so you could effectively kill 2 birds with one stone.  You can setup a forwarder in 30 seconds and be searching the logs in no time at all.

Monitis – Provides capabilities similar to those of Nagios along with external monitoring.  The Monitis community writes custom monitors thereby enriching the options.

LogicMonitor – An alternative to Monitis could be something like LogicMonitor. You may be able to port your existing Nagios checks over to it (check and verify). This is a simple solution: installing the monitor and setting up basic checks such as CPU, Memory, Bandwidth, Disk Usage, Disk IO and external ping, http, https and udp monitors would take all of 20 minutes.

Pingdom – An alternative to RedAlert at a lower cost. It is trusted by millions, is easy to use, and has more endpoints than comparable options. Setup takes five minutes.

The beauty of a SaaS monitoring solution is that you do not need to worry about scaling your monitoring solution every time you scale your Connect architecture. You can have a single solution for 20 Connect clusters instead of having to add Cacti servers, Nagios servers, Splunk architecture and licensing to handle the additional monitoring needs that come with expansion. With a SaaS solution, there is no build-out time. You can literally have 20 monitors up and running in under an hour, and work on adding additional ones at your leisure in between casts with your new Deceiver 8 Fly Combo.

For basic on-premise monitoring, make sure you use standard perfmon counters for things like CPU and Memory. For meeting count and meeting user monitoring, you may use the FMSAdmin API with scripts that make various calls, parse the data and pass it to an option such as Cacti. To ensure robustness, the FMSAdmin service should be restarted routinely. You could also pull similar data directly from the Connect database, but this is not without risk, as Connect updaters and upgrades can introduce changes that may require rework of your custom counters.
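
As a rough illustration of the FMSAdmin approach, the sketch below builds an administration API URL and flattens the numeric values of an XML reply into a dictionary that a collector such as Cacti could ingest. Treat the details as assumptions to verify against your own server: the port (1111), the auser/apswd parameters and the getServerStats method follow the Flash Media Server administration API as I understand it, HTTP access to that method may need to be enabled in the FMS configuration, and the element names in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def admin_url(host, method, user, password, port=1111, **params):
    """Build an FMS administration API URL (port and parameter names assumed)."""
    query = {"auser": user, "apswd": password}
    query.update(params)
    return "http://%s:%d/admin/%s?%s" % (host, port, method, urlencode(query))


def numeric_fields(xml_text):
    """Flatten every leaf element holding a number into {tag: value}."""
    root = ET.fromstring(xml_text)
    values = {}
    for element in root.iter():
        text = (element.text or "").strip()
        if len(element) == 0 and text:
            try:
                values[element.tag] = float(text)
            except ValueError:
                pass  # skip non-numeric leaves such as status codes
    return values
```

The URL from admin_url("localhost", "getServerStats", "admin", "secret") can be fetched with urllib.request.urlopen, and the response body passed to numeric_fields. A reply containing, say, a connected element with the value 42 would yield {"connected": 42.0}.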

Adobe Connect Servers and Hardware-based Load-balancing Devices

This updated article offers a best-practice configuration of a basic Connect pool/cluster behind a high-end, application-aware HLD such as F5 BIG-IP LTM. It does not discuss SSL acceleration, and it does not describe all the possible configurations; it offers a general working example of a basic HTTP/RTMP non-SSL cluster/pool of Connect servers.

Adobe Connect Server Pools and Hardware-based Load-balancing Devices: http://www.connectusers.com/tutorials/2009/04/load_balancing/index.php