
Clustering Connect Servers with Multiple Availability Zones

Challenge: When a Connect cluster/pool is split across two datacenters, it is prudent to create backup meeting rooms in the alternate datacenter; you do not want the backup and the primary meeting rooms hosted in the same datacenter.

Connect Edge servers fill the gap of extending resources to remote offices in enterprise proxy mode, or of splitting external client traffic from internal traffic in reverse proxy mode; there is no federated option to spread an origin server cluster out geographically. Very little latency is tolerable between Connect origin servers. That said, it is possible to split an origin cluster across two datacenters for additional redundancy if the latency between the two datacenters is consistently 3 milliseconds or less. Intra-cluster communication on ports 8507 and 8506 must be unhampered for a cluster to work properly.
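As a quick sanity check before committing to this topology, you can measure TCP connect times from an origin server in one datacenter to a peer in the other on the intra-cluster ports; a connect time well under 3ms is a rough proxy for acceptable round-trip latency. Here is a minimal Python sketch (the peer hostname is a placeholder for one of your own origin servers):

import socket
import time

PEER = "connect-dc2.example.com"  # placeholder: an origin server in the other datacenter
PORTS = (8506, 8507)              # Connect intra-cluster ports

for port in PORTS:
    samples = []
    for _ in range(10):
        start = time.perf_counter()
        # Open and immediately close a TCP connection, timing the handshake
        with socket.create_connection((PEER, port), timeout=2):
            pass
        samples.append((time.perf_counter() - start) * 1000)
    print("port %d: avg %.2f ms, max %.2f ms" % (port, sum(samples) / len(samples), max(samples)))

This does not replace a proper latency test between the datacenters, but it confirms both that the ports are reachable and that the handshake time is in the right ballpark.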

Spreading a cluster across two local datacenters with less than 3ms of latency requires that all servers point to the same database. Basically I am describing a regular cluster, except with half of the servers in one building and the other half in another building nearby. When you do this, it is prudent to make sure that the backup meeting rooms spawned in support of every active primary meeting room are always created on servers in the datacenter opposite the active primary meeting. You can do this in the Connect database in PPS_ENUM_DATA_HOSTS: configure the hosts in one datacenter with one pool_id, and give the servers in the second datacenter another pool_id. Then add this setting to the custom.ini on each origin server:

APP_SERVER_POOLING=true

This setting is necessary to configure failover to take advantage of multiple availability zones.
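For illustration, the split might look like this (the host names are placeholders, and the exact column layout of PPS_ENUM_DATA_HOSTS varies by Connect version):

Datacenter A: connect1, connect2 -> pool_id 1
Datacenter B: connect3, connect4 -> pool_id 2

With the pools defined this way and APP_SERVER_POOLING enabled, a primary meeting hosted on a pool 1 server gets its backup room on a pool 2 server, and vice versa.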

Configuring application-level health monitors for Connect on BIG-IP Local Traffic Manager

Challenge: To make sure that the BIG-IP LTM performs failover if one of the application servers in a cluster/pool should hang, you will want to make certain that the VIP that points to the application server pool is configured with an application-level health monitor. If you simply probe the health of the Connect servers with a default health monitor at the level of the IP stack, there are cases in which the BIG-IP LTM might send traffic to a server whose application is non-responsive but that still seems alive to lower-level probing mechanisms such as ping. Always set the health monitor to probe for an actual string of content on the Connect server; all high-end hardware-based load-balancing devices (HLD) offer application-level health monitoring. The configuration of the monitor may not always be intuitive, as each HLD has a different interface and a different means of probing an application, but the following guidance will help you get an appropriate monitor in place on BIG-IP LTM as well as demonstrate the concept for implementation with other HLDs.

Example: Consider that you have three server pools and three VIPs. The only VIP and pool combination that needs an application-level health monitor for failover is the Connect application HTTPS server VIP and pool:

HTTPS VIP: connect.adobe.com 10.10.10.1:443 points to Connect servers 192.168.0.1:443 and 192.168.0.2:443

The probe or health monitor should point to a string on each Connect Enterprise server in its pool to check the health of each server. If one of the servers in the pool becomes non-responsive, the monitor will mark the server down and the HLD will redirect all traffic to the remaining server.

The Connect Meeting server VIP/pool combinations do not need a health monitor because the Connect application server pool handles failover for the Connect meeting rooms:

RTMPS VIP: meeting1.adobe.com 10.10.10.2:443 points to Connect Meeting server meeting1 192.168.0.1:1935

RTMPS VIP: meeting2.adobe.com 10.10.10.3:443 points to Connect Meeting server meeting2 192.168.0.2:1935

Because there is only one server in each pool, there is no place for the HLD to redirect meeting traffic should one of the Acrobat Connect Professional meeting servers fail to respond. The only reason to probe the Connect Meeting server VIP/pool combinations might be to trigger an email message to an administrator to warn that one of the Connect Meeting servers is problematic and that the application pool has triggered failover.

What to do: The best resource on the servers to point your application-level health monitor at is the testbuilder diagnostic page:

/servlet/testbuilder

The testbuilder page will send back the “status-ok” string.

It is best to point the health monitor to the testbuilder page rather than a simple HTML string, because testbuilder is actually probing the Connect Enterprise database to make sure there is a healthy connection. If there is any problem with the Connect server application, then testbuilder will not report the “status-ok” string.

Each HLD has a different interface for configuring these monitors, and each one does the check differently; the following example works with F5 BIG-IP LTM against testbuilder:

send "GET /servlet/testbuilder HTTP/1.1\nHost: \nConnection: Close \n\r\n"
"status-ok"

For an LTM-based monitor called connect_testbuilder:

monitor connect_testbuilder {
   defaults from http
   recv "status-ok"
   send "GET /servlet/testbuilder HTTP/1.1\nHost: \nConnection: Close \n\r\n"
}
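Before relying on the monitor, it is worth confirming from a host that can reach the Connect servers that testbuilder answers as expected. Here is a minimal Python sketch of the same check the monitor performs (the hostname is a placeholder; adjust the scheme and port to match your deployment, and note that certificate verification applies if you probe over HTTPS):

import urllib.request

# Placeholder: point this at each Connect origin server in the pool in turn.
url = "https://connect1.example.com/servlet/testbuilder"

with urllib.request.urlopen(url, timeout=5) as response:
    body = response.read().decode("utf-8", errors="replace")

# A healthy server returns a body containing the "status-ok" string.
print("healthy" if "status-ok" in body else "NOT healthy:\n" + body)

Run it against each pool member individually rather than against the VIP, so that a failure clearly identifies the problem server.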

Consider this: You should also place an HTML file in the Connect /common directory on each Connect Enterprise server and point to that file (test access to the HTML file via a browser to be sure that it does not require a log-in; the help sub-directory under the common directory is also OK with all prior versions of Connect). This option should be used along with testbuilder as a separate and supporting health check. The following example shows an HTML file called healthmonitortarget.html containing the string “You are being served HTML”:

send "GET /common/help/healthmonitortarget.html"

"You are being served HTML"
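A minimal file for this purpose might look like the following (the file name and content are only examples; they simply need to match what the monitor sends and expects):

<html>
<body>You are being served HTML</body>
</html>

The file is served by the Connect web tier itself, so it provides a second, independent signal alongside the database-aware testbuilder check.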

Note: With reference to the testbuilder output behavior: if, for example, Connect receives a SQL exception (DB down, or another SQL anomaly), it does not immediately change the output string of testbuilder; Connect tries to reconnect, and only when it cannot reconnect to SQL does it fail over and ultimately restart Connect. If you have the JDBC retry settings in place for SQL robustness in the custom.ini file (and you should add them if you do not), then, in theory, this is desirable behavior:

DB_URL_CONNECTION_RETRY_COUNT=15
DB_URL_CONNECTION_RETRY_DELAY=30
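Assuming the retry delay is expressed in seconds, these values give Connect roughly 15 × 30 = 450 seconds (about seven and a half minutes) of reconnection attempts before it gives up, fails over, and restarts.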

If Connect were more aggressive in changing the testbuilder output (in other words, if testbuilder were more sensitive to every acute interruption in the DB connection or the state of the Connect server), then it could trigger superfluous failover at the application VIP on the HLD by marking down a server that merely has a brief reconnection attempt to the Connect DB in progress. If Connect fails over and restarts because of a DB connection problem and is still unable to connect to the DB once the server restarts, testbuilder will show the following output and trigger the health monitor on the HLD appropriately:

"2
status-critical"

If you set up the health monitors as described above, then whatever testbuilder may miss, the HTML health monitor will pick up, and vice versa. The key thing is to test the health monitors vigorously and inspect the Connect debug logs for any errors they may generate. Since each HLD acts differently, and SSL profiles and other variables often affect behavior, it is prudent to test the health monitors under all server failover conditions.

Be careful: There are hazards that come with health monitors; the first concern is that when they are set up incorrectly, the cluster becomes instantly inaccessible through the VIP because the HLD marks all the servers down. Another hazard is pointing the health check at a page that generates entries in the Connect database; for example, hitting the Connect log-in page with a health monitor every few seconds bloats the database with superfluous logging activity. Don't be like the Second Lieutenant who runs into battle with the loaded .45 pointed at his foot.

Troubleshooting: When a health check fails, you can look in the Connect debug log for information about the cause. Here is an example of log output from a health monitor pointing to a Connect server-based resource; the monitor in this case works, but it generates this benign error indicating that log-in credentials were requested of the HLD health check (which is incapable of parsing a log-in). The testbuilder page does not require a log-in, and neither should any HTML-based health target. The bug that causes the debug log output depicted below is fixed in Connect 9.1 (bug 3561205):

[10-10 22:35:02] http-bio-8443-exec-103 (INFO) [null:192.168.1.10] GET https://common/healthmonitortarget.html
[10-10 22:35:02] http-bio-8443-exec-103 (INFO) [breeztxwq7xp4icyvy8d7:192.168.1.10]
[10-10 22:35:02] http-bio-8443-exec-103 (ERROR) Not able to read the account's INSECURE_CONNECTION property. Exception :
com.macromedia.airspeed.StatusException$NoAccess$NoLogin: