Understanding High Availability on the NSX Edge Services Gateway

Hello Fellow NSX Operators!

Before I jump into the HA commands, let me preface with a few words about NSX Edge Services Gateway High Availability (simply HA going forward).  You will need to understand the heartbeat path and which health events commonly affect your infrastructure.  You may find yourself troubleshooting HA many times when the real cause is a change or degradation in the underlying hosts, storage or network.  Be careful with those red herrings.  When HA is implemented with a solid understanding of the underlying infrastructure and its variations, you can enjoy peace of mind knowing the edge network services are highly available.

This article covers the following HA topics:
– Implementation considerations
– Troubleshooting commands
– Proactively monitoring HA via syslog


Edge HA topology graphic from the NSX Network Virtualization design guide

A few HA facts/points/considerations/recommendations…

HA Topology
– It uses an Active/Standby topology.
– When HA is enabled, a second VM is deployed.  The new VM will only be networked to communicate with the primary.
– When HA is disabled, the 2nd VM is destroyed.
– HA appliances will be deployed based on the user-defined mappings (these settings are not dynamic).
– Edge mappings are most easily managed using /api/4.0/edges/<edgeId>/appliances with the REST API.
– Changing appliance settings will trigger an OVF redeployment of the edge.

HA IP Configuration
– Optional.  If not configured, NSX will assign a valid /30 IP pair using an RFC3927 network.
– If configured manually, valid subnets are system enforced. and is not valid. and is valid.

HA vNic Selection
– Optional, it can be left to ANY.
– A minimum of one edge interface is required before enabling HA.
– For maximum availability, the recommendation is to dedicate a network to the HA vNIC heartbeating.
– Sharing a vNIC will work without problems as long as the network is available and not overloaded.

HA Timeouts and Heartbeating
– The default deadtime is 6 seconds.
– The current recommended deadtime is 15 seconds (with a 3 second polling frequency).  A longer deadtime trades slower service failover for increased resiliency to lost heartbeats.
– Heartbeats are sent using UDP port 694 (the IANA registered port for heartbeats).
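
To put a number on that tradeoff, here is a small sketch of the arithmetic. It assumes one heartbeat is sent per polling interval and that the peer is only declared dead once the full deadtime passes without one; that is my reading of the two settings, and the function name is made up for illustration.

```python
def missed_heartbeats_tolerated(deadtime_s: float, poll_s: float) -> int:
    """Whole polling intervals that fit inside the deadtime window."""
    if poll_s <= 0:
        raise ValueError("polling frequency must be positive")
    return int(deadtime_s // poll_s)


# Recommended settings from the list above: 15 second deadtime, 3 second polling.
print(missed_heartbeats_tolerated(15, 3))
```

Under those assumptions the recommended settings ride out roughly five consecutive lost heartbeats before a failover, at the cost of waiting up to 15 seconds when the peer really is gone.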

HA Appliance Anti-affinity
– Host anti-affinity is handled by the system.  When HA is enabled, a cluster DRS rule is added automatically with the name anti-affinity-rule-edge-#, where # is the edge ID.
– Storage anti-affinity is not handled by default.  For maximum availability of the edge pair, configure the edge appliances to deploy to different physical storage resources.  This is especially important in infrastructure that uses centralized storage.


Troubleshooting ESG HA with CLI-based Edge Commands

show service highavailability example output

 nsxe-0> show service highavailability
 Highavailability Status: running
 Highavailability Unit Name: nsxe-0
 Highavailability Unit State: active
 Highavailability Interface(s): vNic_5
 Unit Poll Policy:
    Frequency: 3 seconds
    Deadtime: 15 seconds
 Stateful Sync-up Time: 10 seconds
 Highavailability Healthcheck Status:
    Peer host [vse-1 ]: good
    This host [vse-0 ]: good
 Highavailability Stateful Logical Status:
 File-Sync running
 Connection-Sync running
 xmit        xerr  rcv       rerr
 51219548828 0     42990848  0
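
If you collect this output regularly (over SSH, for example), a few lines of Python can turn it into something you can assert on. This is only a sketch: the sample text is trimmed from the output above, the function name is mine, and the parsing assumes every field of interest sits on a single "key: value" line.

```python
SAMPLE = """\
Highavailability Status: running
Highavailability Unit Name: nsxe-0
Highavailability Unit State: active
Highavailability Interface(s): vNic_5
Unit Poll Policy:
   Frequency: 3 seconds
   Deadtime: 15 seconds
Highavailability Healthcheck Status:
   Peer host [vse-1 ]: good
   This host [vse-0 ]: good
"""


def parse_ha_status(text: str) -> dict:
    """Collect 'key: value' lines; section headers with no value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


status = parse_ha_status(SAMPLE)
# Treat the pair as healthy only if HA is running and both health checks are good.
healthy = status["Highavailability Status"] == "running" and all(
    value == "good" for key, value in status.items() if "host [" in key
)
print(status["Highavailability Unit State"], healthy)
```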

show service highavailability connection-sync example output

nsxe-0> show service highavailability connection-sync
connections local:
current active connections: 12693
connections created:            368613263  failed: 0
connections updated:           21695297    failed: 0
connections destroyed:        368600570  failed: 0

connections peer:
current active connections: 0
connections created:          26571 failed: 0
connections updated:         1024 failed: 0
connections destroyed:        26571 failed: 0

traffic processed:
1248602045934 Bytes 6285222215 Pckts

UDP traffic (active device=vNic_5):
51255382200 Bytes sent 43018912 Bytes recv
590146284 Pckts sent 2518471 Pckts recv
0 Error send 0 Error recv

message tracking:
0 Malformed msgs 5863 Lost msgs

show service highavailability link example output

vse-0> show service highavailability link
Local IP Address:
Peer IP Address:

debug packet display / “sniffing” HA heartbeats

Filter using the High Availability vNIC from the root command “show service highavailability”

nsxe-0> debug packet display interface vNic_# port_694
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_5, link-type EN10MB (Ethernet), capture size 65535 bytes
17:22:50.357722 IP > UDP, length 189
17:22:52.709253 IP > UDP, length 189
17:22:53.360327 IP > UDP, length 190
17:22:55.711667 IP > UDP, length 203
17:22:55.711715 IP > UDP, length 189
17:22:55.742631 IP > UDP, length 203
17:22:56.353520 IP > UDP, length 189
17:22:58.716886 IP > UDP, length 189
17:22:59.357186 IP > UDP, length 189
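
A quick way to sanity-check a capture like this is to look at the gaps between packets. The sketch below uses the first few lines of the capture above; the function name is mine, and it assumes the capture does not cross midnight. Note that the capture interleaves packets from both peers, so the gaps are not a per-node heartbeat interval.

```python
from datetime import datetime

CAPTURE = """\
17:22:50.357722 IP > UDP, length 189
17:22:52.709253 IP > UDP, length 189
17:22:53.360327 IP > UDP, length 190
17:22:55.711667 IP > UDP, length 203
"""


def packet_gaps(capture: str) -> list:
    """Seconds between consecutive captured packets, in capture order."""
    stamps = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in capture.splitlines()
        if line.strip()
    ]
    return [round((b - a).total_seconds(), 6) for a, b in zip(stamps, stamps[1:])]


gaps = packet_gaps(CAPTURE)
print(gaps, "largest gap:", max(gaps))
```

A largest gap that creeps toward the configured deadtime would be worth a closer look.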

Viewing Historical HA System Events for an Edge in the Web Client


  • Open the vCenter Web Client
  • Open Networking & Security
  • In NSX Edges, double-click the Edge
  • Select the Monitor tab
  • Select System Events
  • On the search widget, click the arrow, click Select Columns…
  • Deselect All > Check Module > Type HighAvailability > Click Ok

REST API-based Commands

Query HA Configuration Details on an Edge

GET https://nsxm-ip/api/4.0/edges/edge-#/highavailability/config Example Output

<?xml version="1.0" encoding="UTF-8"?>

Delete Edge HA Configuration on an Edge

DELETE https://nsxm-ip/api/4.0/edges/edge-#/highavailability/config
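
If you would rather script these calls, here is a minimal sketch that only builds the two requests and does not send them. The manager name and edge ID are placeholders and the helper name is made up; authentication and certificate handling are deliberately left out and will depend on your environment.

```python
import urllib.request


def ha_config_request(manager: str, edge_id: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request against an edge's HA config resource."""
    if method not in ("GET", "DELETE"):
        raise ValueError("only GET and DELETE are covered in this post")
    url = f"https://{manager}/api/4.0/edges/{edge_id}/highavailability/config"
    return urllib.request.Request(url, method=method)


for verb in ("GET", "DELETE"):
    req = ha_config_request("nsxm-ip", "edge-1", verb)
    print(req.get_method(), req.full_url)
```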


Monitoring High Availability Health Proactively

– Open vCenter Log Insight, Splunk or your log aggregation solution of choice.
– Build a view of all edge logging (use regex or glob based matches to filter according to your naming convention).

Heartbeat Drops
– Examine matches on the text “lost packet”. Build an alerting rule based on your results.
– When the infrastructure is healthy, there should not be any HA packets lost.

Example match

Sep 19 11:34:14 nsxe-0 ha[]: [default]: [1371]: WARN: 1 lost packet(s) for [nsxe-0] [37:39]
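
If your log tool accepts regular expressions, a pattern along these lines pulls out the count and the node. It is a sketch written against the single example line above, so treat the exact message layout as an assumption and test it against your own logs; the names are mine.

```python
import re

# Written against the example line above; adjust if your edges log differently.
LOST = re.compile(r"WARN: (\d+) lost packet\(s\) for \[([^\]]+)\]")


def lost_packets(line: str):
    """Return (node, count) for a lost-packet warning, or None if the line does not match."""
    m = LOST.search(line)
    return (m.group(2), int(m.group(1))) if m else None


example = "Sep 19 11:34:14 nsxe-0 ha[]: [default]: [1371]: WARN: 1 lost packet(s) for [nsxe-0] [37:39]"
print(lost_packets(example))
```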

Late Heartbeats

– Examine matches on the text “Late heartbeat”. Build an alerting rule based on your results.
– Late heartbeats may indicate infrastructure problems, possibly resource constraints on one or both edges in the HA pair.
– This can also result in a split brain state.

Example match

Jul  3 09:46:48 nsxe-0 heartbeat: [1454]: WARN: Late heartbeat: Node nsxe-1: interval 24921 ms
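
The interval in that message is the interesting part: compare it to your deadtime. The sketch below extracts it and flags anything longer than the deadtime, defaulting to the 15 seconds recommended earlier. The pattern is written against the one example above and the names are mine.

```python
import re

LATE = re.compile(r"Late heartbeat: Node\s+(\S+): interval (\d+) ms")


def late_heartbeat(line: str, deadtime_s: int = 15):
    """Extract node and interval from a late-heartbeat warning; None if no match."""
    m = LATE.search(line)
    if not m:
        return None
    interval_ms = int(m.group(2))
    return {
        "node": m.group(1),
        "interval_ms": interval_ms,
        "exceeds_deadtime": interval_ms > deadtime_s * 1000,
    }


example = ("Jul  3 09:46:48 nsxe-0 heartbeat: [1454]: WARN: "
           "Late heartbeat: Node nsxe-1: interval 24921 ms")
print(late_heartbeat(example))
```

In the example the heartbeat arrived almost 25 seconds late, well past a 15 second deadtime, which is exactly the kind of event that precedes an unplanned switchover.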

Lost and late heartbeats are the early indicators.  Early indicators are your best friends.  Keep a close eye out for these.

Monitor NSX Manager for Switchover Events

– Filter logging based on NSX Manager SystemEvent; you can use the text [SystemEvent] to filter.
– Examine matches for Events 30202 and 30203 (Edge switching to ACTIVE and STANDBY, respectively).
– Any single event source with more than one or two events should raise a red flag. Any unplanned switchover events should be researched. Build an alerting rule based on your findings.

Example match

Sep 20 20:50:05 nsxm-0 [SystemEvent] Time:'Sat Sep 20 20:49:13.000 GMT 2014', Severity:'High', Event Source:'vm-13950', Code:'30203', Event Message:'vShield Edge HighAvailability switch over happened. VM has moved to STANDBY state.', Module:'vShield Edge HighAvailability'
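
To apply the "more than one or two events" rule of thumb, you can count switchover events per event source. This sketch is written against the example line above, so the field layout is an assumption, and the function name is made up.

```python
import re
from collections import Counter

EVENT = re.compile(r"\[SystemEvent\].*?Event Source:'([^']+)', Code:'(\d+)'")
SWITCHOVER_CODES = {"30202", "30203"}  # moved to ACTIVE / moved to STANDBY


def switchovers_by_source(lines) -> Counter:
    """Count HA switchover events per event source (the edge VM)."""
    counts = Counter()
    for line in lines:
        m = EVENT.search(line)
        if m and m.group(2) in SWITCHOVER_CODES:
            counts[m.group(1)] += 1
    return counts


example = (
    "Sep 20 20:50:05 nsxm-0 [SystemEvent] Time:'Sat Sep 20 20:49:13.000 GMT 2014', "
    "Severity:'High', Event Source:'vm-13950', Code:'30203', "
    "Event Message:'vShield Edge HighAvailability switch over happened. "
    "VM has moved to STANDBY state.', Module:'vShield Edge HighAvailability'"
)
counts = switchovers_by_source([example, example, example])
# Flag any source with more than two switchovers in the window you fed in.
print([source for source, n in counts.items() if n > 2])
```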

Split-Brain Indicators 

– Look for the text “returning after partition” and the text “Deadtime value may be too small”.
– Matches on these indicate that HA has most likely entered the split brain state.  Network services will be mostly unavailable until the condition is resolved.
– Hopefully these do not exist in your environment. Build a preventive alerting rule. Matches are immediately actionable.

That is all folks. Hope this helps.

NSX SSL VPN-Plus | Adding Client Configurations in Bulk

Anyone using the NSX SSL VPN-Plus feature for more than one site will quickly find there is no mechanism for importing client configurations.  The native method for accessing additional sites is to browse to the Gateway for each site (then download and run the installer).

That’s pretty tedious as your site count increases.  There is a better, albeit unsupported, way to manage this need.

SSL VPN-Plus naclient on Windows

In Windows, client configuration is stored in the registry.  You can manipulate the Windows registry using .reg files.

Open up a text editor, and prepare a file with all of your sites using the following format.  Replace the GatewayList value with your site’s gateway IP address.

 Windows Registry Editor Version 5.00

 [HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\SSL VPN-Plus Client\Connection #1]

 [HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\SSL VPN-Plus Client\Connection #2]

 [HKEY_LOCAL_MACHINE\SOFTWARE\VMware, Inc.\SSL VPN-Plus Client\Connection #20]

Save the file as a .reg file; the name of the file is arbitrary.

Exit the SSL VPN-Plus naclient application

Import the .reg file

Navigate to HKLM\SOFTWARE\VMware, Inc.\SSL VPN-Plus Client and verify the connections were imported.

Update the ConnectionCount to the total number of sites.  This is important; if the number doesn’t match, naclient will not start.

Start the naclient (C:\Program Files\VMware\SSL VPN-Plus Client\SVPclient.exe)

SSL VPN-Plus naclient on Mac OS X

This one is easier; the client settings are stored in /opt/sslvpn-plus/naclient/naclient.conf

Quit the naclient application.  Add the site configurations to naclient.conf

vi /opt/sslvpn-plus/naclient/naclient.conf
 site1 site1-ip:443 256 
 site2 site2-ip:443 256 
 site20 site20-ip:443 256

Start the naclient.
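
With a long site list you may prefer to generate those lines rather than type them. The sketch below mirrors the three-field layout shown above. The gateway addresses are documentation-range placeholders, the helper name is mine, and the trailing 256 is copied verbatim from the example since I have not seen its meaning documented.

```python
def naclient_lines(sites: dict, port: int = 443, trailer: str = "256") -> list:
    """Render one 'name gateway:port trailer' line per site, as in the example above."""
    return [f"{name} {gateway}:{port} {trailer}" for name, gateway in sites.items()]


sites = {
    "site1": "192.0.2.10",  # placeholder gateway addresses
    "site2": "192.0.2.20",
}
print("\n".join(naclient_lines(sites)))
```

Review the output and paste it into naclient.conf yourself; this is as unsupported as the manual edit.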

That is all peeps.  Have a nice day.

System Messages – NSX Edge Services Gateway

System Events

CRMD – Cluster Resource Management Daemon

Appname:     crmd
Priority:    notice
Message:     run_graph: Transition 6431 (Complete=0, Pending=0, Fired=0, Skipped=0,
             Incomplete=0, Source=/usr/var/lib/pengine/pe-input-6430.bz2): Complete

Appname:     crmd
Priority:    info 
Message:     do_state_transition: Starting PEngine Recheck Timer

Appname:     crmd
Priority:    info
Message:     do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE
             [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Appname:     crmd
Priority:    info
Message:     notify_crmd: Transition 6431 status: done - 

Appname:     crmd
Priority:    info
Message:     te_graph_trigger: Transition 6431 is now complete

Appname:     crmd
Priority:    info
Message:     run_graph: ====================================================

Appname:     crmd
Priority:    info
Message:     do_te_invoke: Processing graph 6431 (ref=pe_calc-dc-1406976970-6473) 
             derived from /usr/var/lib/pengine/pe-input-6430.bz2

Appname:     crmd
Priority:    info
Message:     unpack_graph: Unpacked transition 6431: 0 actions in 0 synapses

Appname:     crmd
Priority:    info
Message:     do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE
             [ input=I_PE_SUCCESS cause=C_IPC_Message origin=handle_response ]

Appname:     crmd
Priority:    info
Message:     do_pe_invoke_callback: Invoking the PE: query=6517,
             ref=pe_calc-dc-1406976970-6473, seq=8, quorate=1

Appname:     crmd
Priority:    info
Message:     do_pe_invoke: Query 6517: Requesting the current CIB: S_POLICY_ENGINE

Appname:     crmd
Priority:    info
Message:     do_state_transition: All 2 cluster nodes are eligible to run resources.

Appname:     crmd
Priority:    info
Message:     do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER

Appname:     crmd
Priority:    info
Message:     do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ 
             input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]

Appname:     crmd
Priority:    info
Message:     crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!

System Messages – NSX Manager

Module vShield Edge Gateway

Code   Severity       Event Message
30024  Informational  Configuration changed for : [$component] on vShield Edge with id : edge-#

Module vShield Edge Health Check

Code   Severity       Event Message
30033  Major          VShield Edge VM not responding to health check.
30034  Informational  None of the VShield Edge VMs found in serving state. There is a possibility of network disruption.
30042  Informational  vShield Edge VM has recovered and now responding to health check.

Module vShield Edge Appliance

Code   Severity       Event Message
30152  Informational  vShield Edge system time sync up happens

Module vShield Edge HighAvailability

Code   Severity       Event Message
30202  High           vShield Edge HighAvailability switch over happened. VM has moved to ACTIVE state.
30203  High           vShield Edge HighAvailability switch over happened. VM has moved to STANDBY state.

Module vShield Edge IPSec

Code   Severity       Event Message
30401  Informational  IPsec Channel from localIp : <local-endpoint-ip> to peerIp : <peer-endpoint-ip> changed the status to up
30402  Informational  IPsec Channel from localIp : <local-subnet-ip> to peerIp : <peer-subnet-ip> changed the status to down
30403  Informational  IPsec Tunnel from localSubnet : <local-subnet-ip> to peerSubnet : <peer-subnet-ip> changed the status to up
30404  Informational  IPsec Tunnel from localSubnet : <local-subnet-ip> to peerSubnet : <peer-subnet-ip> changed the status to down

Best (Public) VMware NSX Learning Resources

Let me qualify the title.. I say “best” with the full authority that my opinion carries.  Just trying to give y’all a place to go to get your NSX learn on…

Digital Literature …

VMware Product Walkthroughs – NSX 

The NSX walkthrough perfectly balances the brevity of a presentation slide deck with involved hands-on demonstrations.  Very well put together (check out some of the other walkthroughs).

VMware NSX Design Guide 

The design guide, a PDF of about 30 pages, is a gentle introduction to NSX topologies.  A fundamental read if you’re still trying to get a handle on NSX concepts.

VMware Network Virtualization Blog

Subject matter content from the experts.  Posts by Martin Casado, Bruce Davie, Brad Hedlund, Roger Fortier.

VMware Hands on Labs (HOL) Focus: Networking

Get acquainted with NSX Dynamic Routing, the Distributed Firewall & Load Balancing.

VMware NSX 6 Documentation Center

Nothing fancy about this one… ’tis the manuals.  NSX Install and Upgrade Guide & NSX Administration Guide.  Although in the public domain, this resource is extremely difficult (if not impossible) to find via search.  But they are in the public domain.  Whatever is public is not private…right?  

Martin Casado’s Blog – Network Heresy

Scott Lowe’s Blog – Learning NVP/NSX 

Brad Hedlund’s Blog – NSX

If videos are the way you learn …

NSX Architecture Webinar by Ivan Pepelnjak on ipspace.net

VMworld 2013 – Introducing the World to VMware NSX (By Sachin Thakkar)

VMware Interview – Bruce Davie on NSX

VMware NSX Demo

This should at the very least provide a fair start for anyone looking to mentally ramp up for the NSX NVP.

– Gabe

vShield/vCNS 5.1x CLI Operations using Expect

Practical use of the vCNS (vShield) CLI is limited from a configuration perspective, but you may need to interact with it from time to time.  Troubleshooting, debugging sessions and log purging come to mind.

The options for getting the job done:

1.  Interact with the vCNS Manager virtual machine console in vCenter (not great for debugging, or reading the long exception output)

2.  SSH (ssh server is enabled from the console: vsm> enable, vsm# ssh start)

Expect works well with the vtysh pseudo-terminal used for the vCNS Manager console.  Other approaches I tried failed (due to errors interacting with the terminal).  If you manage multiple vCNS environments, it makes sense to wrap the interactions into expect scripts.  Here’s a small example expect script to change the CLI password from the default.

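#!/usr/bin/expect -f
# Synopsis: SSH to the vCNS appliance console and authenticate, enter privileged
# mode and authenticate again, enter global config, then change the default password.
# Manual equivalent:
#   ssh admin@<vsm-ip> # enable [enter] # default [enter] # config t [enter]
#   cli password <password> [enter] # end [enter] # wr mem
spawn ssh admin@<vsm-ip>
expect "password: "
send "default\r"
# Enter privileged mode
expect ">"
send "en\r"
expect "Password: "
send "default\r"
# Enter global configuration and set the new CLI password
expect "#"
send "config t\r"
expect "#"
send "cli password mYn3wp@ssw0rd\r"
# Leave config mode, save, and log out (as in the manual steps above)
expect "#"
send "end\r"
expect "#"
send "wr mem\r"
expect "#"
send "exit\r"
expect eof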

If your operational policy is to update your password every few months, you will find yourself revisiting a script like this.  For passing commands to multiple vCNS Managers, you can extend the script to spawn connections based on a list (outside the scope of this post).
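
For what it is worth, one way to drive a list of managers without touching the expect logic is a thin wrapper. This is only a sketch under two assumptions of mine: that the expect script has been saved as change_cli_password.exp (a made-up name), and that it has been changed to read the manager address from its first argument. The addresses are documentation-range placeholders, and the default is a dry run that only shows the commands.

```python
import subprocess


def build_commands(hosts, script="./change_cli_password.exp"):
    """One expect invocation per manager; the script name and host argument are hypothetical."""
    return [["expect", "-f", script, host] for host in hosts]


def run_all(hosts, dry_run=True):
    """Return the command line per host (dry run) or the exit code (real run)."""
    results = {}
    for cmd in build_commands(hosts):
        host = cmd[-1]
        if dry_run:
            results[host] = " ".join(cmd)
        else:
            results[host] = subprocess.run(cmd, timeout=60).returncode
    return results


print(run_all(["192.0.2.11", "192.0.2.12"]))
```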