Hello Fellow NSX Operators!
Before I jump into the HA commands, let me briefly preface with a few words about NSX Edge Services Gateway High Availability (simply HA going forward). You will need to understand the heartbeat path and what type of infrastructure-impacting health events are common to your infrastructure. You may find yourself troubleshooting High Availability many times because of a change or degradation in the underlying Hosts, Storage or Network. Be careful with those red herrings. When HA is implemented with a solid understanding of the underlying infrastructure and its variations, you can enjoy peace of mind in knowing the edge network services are highly available.
This article covers the following HA topics:
– Implementation considerations
– Troubleshooting commands
– Proactively monitoring HA via syslog
A few HA facts/points/considerations/recommendations…
– It uses an Active/Standby topology.
– When HA is enabled, a second VM is deployed. The new VM will only be networked to communicate with the primary.
– When HA is disabled, the 2nd VM is destroyed.
– HA appliances are deployed according to the user-defined mappings (at this time, these settings are not dynamic).
– Edge mappings are most easily managed with the REST API, using /api/4.0/edges/<edgeId>/appliances.
– Changes to appliance settings will trigger an OVF redeployment of the edge.
HA IP Configuration
– Optional. If not configured, NSX will assign a valid /30 IP pair from an RFC 3927 (link-local, 169.254.0.0/16) network.
– If configured manually, valid subnets are system enforced. 10.0.0.0/30 and 10.0.0.1/30 are not valid, because 10.0.0.0 is the network address of that /30. 10.0.0.1/30 and 10.0.0.2/30 are valid.
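If you want to sanity-check a manual HA IP pair before handing it to NSX, here is a minimal sketch using Python's standard ipaddress module. The function name valid_ha_pair is my own, and the rule it applies (two distinct usable host addresses in the same /30) is an assumption inferred from the examples above, not the exact validation NSX performs.

```python
import ipaddress


def valid_ha_pair(a: str, b: str) -> bool:
    """Return True if a and b are distinct usable hosts in the same /30.

    Assumption: this mirrors the examples in the article; NSX may enforce
    additional rules that are not modeled here.
    """
    ia = ipaddress.ip_interface(a)
    ib = ipaddress.ip_interface(b)
    # Both addresses must fall in the same network, and it must be a /30.
    if ia.network != ib.network or ia.network.prefixlen != 30:
        return False
    # hosts() excludes the network and broadcast addresses.
    usable = set(ia.network.hosts())
    return ia.ip != ib.ip and ia.ip in usable and ib.ip in usable


print(valid_ha_pair("10.0.0.0/30", "10.0.0.1/30"))  # False: 10.0.0.0 is the network address
print(valid_ha_pair("10.0.0.1/30", "10.0.0.2/30"))  # True
```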
HA vNic Selection
– Optional, it can be left to ANY.
– A minimum of one edge interface is required before enabling HA.
– For maximum availability, the recommendation is to configure a network dedicated to vNIC heartbeating.
– Sharing a vNIC will work without problems as long as the network is available and not overloaded.
HA Timeouts and Heartbeating
– The default deadtime is 6 seconds
– The current recommended deadtime is 15 seconds (uses a 3 second polling frequency). This trades a longer service failover time for increased resiliency to lost heartbeats.
– Heartbeats are sent using UDP-694 (the IANA registered port for heartbeats)
HA Appliance Anti-affinity
– Host anti-affinity is handled by the system. When HA is enabled, a cluster DRS rule is added automatically with the name anti-affinity-rule-edge-#, where # is the edge ID.
– Storage anti-affinity is not handled by default. For maximum availability of the edge pair, configure the edge appliances to deploy to different physical storage resources. This is especially important in infrastructure that uses centralized storage.
Troubleshooting ESG HA with CLI-based Edge Commands
show service highavailability example output
nsxe-0> show service highavailability
Highavailability Status: running
Highavailability Unit Name: nsxe-0
Highavailability Unit State: active
Highavailability Interface(s): vNic_5
Unit Poll Policy:
    Frequency: 3 seconds
    Deadtime: 15 seconds
Stateful Sync-up Time: 10 seconds
Highavailability Healthcheck Status:
    Peer host [vse-1 ]: good
    This host [vse-0 ]: good
Highavailability Stateful Logical Status:
    File-Sync running
    Connection-Sync running
    xmit          xerr   rcv        rerr
    51219548828   0      42990848   0
show service highavailability connection-sync example output
nsxe-0> show service highavailability connection-sync
connections local:
current active connections: 12693
connections created: 368613263 failed: 0
connections updated: 21695297 failed: 0
connections destroyed: 368600570 failed: 0
connections peer:
current active connections: 0
connections created: 26571 failed: 0
connections updated: 1024 failed: 0
connections destroyed: 26571 failed: 0
traffic processed:
1248602045934 Bytes 6285222215 Pckts
UDP traffic (active device=vNic_5):
51255382200 Bytes sent 43018912 Bytes recv
590146284 Pckts sent 2518471 Pckts recv
0 Error send 0 Error recv
message tracking:
0 Malformed msgs 5863 Lost msgs
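The error and loss counters at the bottom of this output are the ones worth tracking over time. Here is a rough sketch of pulling them out of captured CLI text with a regular expression. The function name sync_counters and the idea of scraping the CLI output are my own assumptions for illustration; the sample text is the tail of the output shown above.

```python
import re

# Tail of the connection-sync output shown in this article.
SAMPLE = """\
UDP traffic (active device=vNic_5):
51255382200 Bytes sent 43018912 Bytes recv
590146284 Pckts sent 2518471 Pckts recv
0 Error send 0 Error recv
message tracking:
0 Malformed msgs 5863 Lost msgs
"""

# Each counter is printed as "<number> <label>".
COUNTER = re.compile(r"(\d+)\s+(Error send|Error recv|Malformed msgs|Lost msgs)")


def sync_counters(text: str) -> dict:
    """Return the error and loss counters as {label: value}."""
    return {label: int(value) for value, label in COUNTER.findall(text)}


print(sync_counters(SAMPLE))
```

Comparing two samples taken some minutes apart tells you whether "Lost msgs" is still climbing or is just a leftover from an old event.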
show service highavailability link example output
vse-0> show service highavailability link
Local IP Address: 220.127.116.11/30
Peer IP Address: 18.104.22.168/30
debug packet display / “sniffing” HA heartbeats
Filter on the High Availability vNIC reported by the root command “show service highavailability”.
nsxe-0> debug packet display interface vNic_# port_694
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_5, link-type EN10MB (Ethernet), capture size 65535 bytes
17:22:50.357722 IP 22.214.171.124.24758 > 126.96.36.199.694: UDP, length 189
17:22:52.709253 IP 188.8.131.52.32165 > 184.108.40.206.694: UDP, length 189
17:22:53.360327 IP 220.127.116.11.24758 > 18.104.22.168.694: UDP, length 190
17:22:55.711667 IP 22.214.171.124.32165 > 126.96.36.199.694: UDP, length 203
17:22:55.711715 IP 188.8.131.52.32165 > 184.108.40.206.694: UDP, length 189
17:22:55.742631 IP 220.127.116.11.24758 > 18.104.22.168.694: UDP, length 203
17:22:56.353520 IP 22.214.171.124.24758 > 126.96.36.199.694: UDP, length 189
17:22:58.716886 IP 188.8.131.52.32165 > 184.108.40.206.694: UDP, length 189
17:22:59.357186 IP 220.127.116.11.24758 > 18.104.22.168.694: UDP, length 189
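When you are staring at a capture like this, what you really want to know is whether each side is sending on schedule. Here is a small sketch that computes the gap between consecutive packets per sender from saved tcpdump text. The function name heartbeat_gaps is hypothetical, and the 169.254.1.x addresses in the sample are placeholders of mine; only the timestamps and ports are taken from the capture above. It does not handle a capture that crosses midnight.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches "HH:MM:SS.ffffff IP <src>.<port> > ..." as printed by tcpdump.
PACKET = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > ")

# Placeholder addresses; timestamps and ports come from the capture above.
SAMPLE_LINES = [
    "17:22:50.357722 IP 169.254.1.1.24758 > 169.254.1.2.694: UDP, length 189",
    "17:22:53.360327 IP 169.254.1.1.24758 > 169.254.1.2.694: UDP, length 190",
    "17:22:56.353520 IP 169.254.1.1.24758 > 169.254.1.2.694: UDP, length 189",
]


def heartbeat_gaps(lines):
    """Return {sender: [seconds between consecutive packets from that sender]}."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in lines:
        match = PACKET.match(line)
        if not match:
            continue  # skip tcpdump banner lines and anything unexpected
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        sender = match.group(2)
        if sender in last_seen:
            gaps[sender].append((stamp - last_seen[sender]).total_seconds())
        last_seen[sender] = stamp
    return dict(gaps)


for sender, seconds in heartbeat_gaps(SAMPLE_LINES).items():
    print(sender, [round(s, 3) for s in seconds])
```

With a 3 second polling frequency, gaps that hover around 3 seconds are what you hope to see; gaps creeping toward the deadtime deserve a closer look.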
Viewing Historical HA System Events for an Edge in the Web Client
- Open the vCenter Web Client
- Open Networking & Security
- In NSX Edges, double-click the Edge
- Select the Monitor tab
- Select System Events
- On the search widget, click the arrow, then click Select Columns…
- Deselect All > Check Module > Type HighAvailability > Click OK
REST API-based Commands
Query HA Configuration Details on an Edge
GET https://nsxm-ip/api/4.0/edges/edge-#/highavailability/config
Example Output
<?xml version="1.0" encoding="UTF-8"?>
<highAvailability>
  <version>6</version>
  <enabled>true</enabled>
  <vnic>2</vnic>
  <ipAddresses>
    <ipAddress>198.18.0.1/30</ipAddress>
    <ipAddress>198.18.0.2/30</ipAddress>
  </ipAddresses>
  <declareDeadTime>15</declareDeadTime>
  <logging>
    <enable>true</enable>
    <logLevel>error</logLevel>
  </logging>
  <security>
    <enabled>false</enabled>
  </security>
</highAvailability>
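If you are auditing many edges, it helps to reduce this response to the handful of fields you care about. Below is a minimal sketch using Python's standard xml.etree.ElementTree on the example output above. The function name summarize_ha and the choice of fields are my own; fetching the XML from NSX Manager (authentication, TLS) is left out on purpose.

```python
import xml.etree.ElementTree as ET

# The example response from this article, as bytes.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<highAvailability>
  <version>6</version>
  <enabled>true</enabled>
  <vnic>2</vnic>
  <ipAddresses>
    <ipAddress>198.18.0.1/30</ipAddress>
    <ipAddress>198.18.0.2/30</ipAddress>
  </ipAddresses>
  <declareDeadTime>15</declareDeadTime>
</highAvailability>
"""


def summarize_ha(xml_bytes: bytes) -> dict:
    """Pull the enabled flag, vNIC index, deadtime and HA IPs out of the response."""
    root = ET.fromstring(xml_bytes)
    return {
        "enabled": root.findtext("enabled") == "true",
        "vnic": root.findtext("vnic"),
        "dead_time": int(root.findtext("declareDeadTime")),
        "ips": [e.text for e in root.findall("ipAddresses/ipAddress")],
    }


summary = summarize_ha(SAMPLE)
print(summary)
if summary["dead_time"] < 15:
    print("deadtime is below the 15 second recommendation")
```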
Delete Edge HA Configuration on an Edge
DELETE https://nsxm-ip/api/4.0/edges/edge-#/highavailability/config
Monitoring High Availability Health Proactively
– Open vCenter Log Insight, Splunk, or your log aggregation solution of choice.
– Build a view of all edge logging (use regex or glob-based matches to filter according to your naming convention).
– Examine matches on the text “lost packet”. Build an alerting rule based on your results.
– When the infrastructure is healthy, there should not be any HA packets lost.
– Examine matches on the text “Late heartbeat”. Build an alerting rule based on your results.
– Late heartbeats may indicate infrastructure problems, possibly resource constraints on one or both edges in the HA pair.
– This can also result in a split brain state.
Jul 3 09:46:48 nsxe-0 heartbeat: : WARN: Late heartbeat: Node nsxe-1: interval 24921 ms
Lost and late heartbeats are the early indicators. Early indicators are your best friends. Keep a close eye out for these.
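To make the “Late heartbeat” rule concrete, here is a rough sketch that parses lines shaped like the sample above and flags any interval at or beyond the deadtime. The function name late_heartbeats is hypothetical, and the 15 second default is an assumption based on the recommended deadtime; your log tool's own query language will do the same job.

```python
import re

# Sample line from this article.
SAMPLE = "Jul 3 09:46:48 nsxe-0 heartbeat: : WARN: Late heartbeat: Node nsxe-1: interval 24921 ms"

LATE = re.compile(
    r"(?P<host>\S+) heartbeat: .*Late heartbeat: "
    r"Node (?P<peer>\S+): interval (?P<ms>\d+) ms"
)


def late_heartbeats(lines, deadtime_s=15):
    """Yield (reporting host, peer, interval in ms, interval >= deadtime)."""
    for line in lines:
        match = LATE.search(line)
        if match:
            ms = int(match.group("ms"))
            yield match.group("host"), match.group("peer"), ms, ms >= deadtime_s * 1000


for host, peer, ms, critical in late_heartbeats([SAMPLE]):
    print(host, peer, ms, "exceeds deadtime" if critical else "late")
```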
Monitor NSX Manager for Switchover Events
– Filter logging based on NSX Manager SystemEvent. You can use the text [SystemEvent] to filter.
– Examine matches for events 30202 and 30203 (edge switching to ACTIVE and STANDBY, respectively).
– Any single event source with more than one or two events should raise a red flag. Any unplanned switchover events should be researched. Build an alerting rule based on your findings.
– Look for the text “returning after partition” and the text “Deadtime value may be too small”.
– Matches on these can indicate that HA has most likely entered the split brain state. Network services will be mostly unavailable until the condition is resolved.
– Hopefully these do not exist in your environment. Build a preventive alerting rule. Matches are immediately actionable.
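The alerting rules above boil down to two checks, sketched here in Python. Everything about the input shape is an assumption: I take switchover events as (source, event code) pairs that your log tool has already extracted, because the exact NSX Manager log layout is not shown in this article. The names noisy_sources and split_brain_hints are hypothetical; the event codes and the two phrases come from the list above.

```python
from collections import Counter

# Event codes from this article: edge switching to ACTIVE / STANDBY.
SWITCHOVER_CODES = {30202: "ACTIVE", 30203: "STANDBY"}

# Phrases from this article that suggest a split brain condition.
SPLIT_BRAIN_PHRASES = ("returning after partition", "Deadtime value may be too small")


def noisy_sources(events, threshold=2):
    """Return {source: count} for sources with more than `threshold` switchovers.

    `events` is an iterable of (source, event_code) pairs; how you extract
    them from the [SystemEvent] log lines depends on your log tooling.
    """
    counts = Counter(src for src, code in events if code in SWITCHOVER_CODES)
    return {src: n for src, n in counts.items() if n > threshold}


def split_brain_hints(lines):
    """Return the lines that contain one of the split brain phrases."""
    return [line for line in lines if any(p in line for p in SPLIT_BRAIN_PHRASES)]


# Placeholder data for illustration only.
events = [("edge-1", 30202), ("edge-1", 30203), ("edge-1", 30202), ("edge-2", 30202)]
print(noisy_sources(events))
```

The threshold of two follows the "more than one or two events" rule of thumb; tune it to what is normal in your environment.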
That is all folks. Hope this helps.