Following is a brain dump of the trials and errors in troubleshooting MS-SQL AlwaysOn High Availability. There are many dependencies that are not clearly articulated in this article, and the intention of this writing is to show some raw efforts in resolving issues with this technology. There shall be a follow up article and a PowerShell script to fully automate the administrative process of Failover Clusters.
Objective: How to Set SQL Always On to Span Multiple Subnets without DNS “stale records” problems.
Resolution: PowerShell does provide us the tools needed to perform failover of SQL databases and servers on-demand. These procedures can be performed by both DBAdmins and SysAdmins; thus, this would enable responsible and available personnel to resolve any MS-SQL “servers down” scenarios.
Method 1: Ask programmers to code their connection strings to with includes of libraries for multi-subnet aware. No details needed here as our group are precluded from Development.
Method 2: Configure SQL using PowerShell
# Adding Prerequisite Microsoft Cluster
if (!(get-module -Name "FailoverClusters") ){
Install-WindowsFeature Failover-Clustering | out-null;
Install-WindowsFeature RSAT-Clustering-MGMT | out-null;
Install-WindowsFeature RSAT-Clustering-PowerShell | out-null;
Import-Module FailoverClusters | out-null;
}
# Find Listener Name
$listener=(Get-ClusterResource | Get-ClusterParameter | Where-Object {$_.Name -eq "name"})[1].Value
$listenerAlias=(Get-ClusterResource | Get-ClusterParameter | Where-Object {$_.Name -eq "name"})[1].ClusterObject.Name
# Set Listeners to register DNS with only its active IP and with shorter TTL
Get-ClusterResource $listenerAlias | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource $listenerAlias | Set-ClusterParameter HostRecordTTL 300
Stop-ClusterResource $listenerAlias
Start-ClusterResource $listenerAlias
# Verify changes
Get-ClusterResource $listenerAlias | Get-ClusterParameter HostRecordTTL, RegisterAllProvidersIP
# Failover: set current SQL server as the Primary Replica after suspending all databases. Resume replication afterward.
Import-Module -Name SqlServer
Import-Module "sqlps" -DisableNameChecking
$cluster=(get-clusterresource).Name[3]
#$databases=(ls SQLSERVER:\Sql\$env:computername\default\AvailabilityGroups\$cluster\AvailabilityDatabases\).Name #This method only obtain the names
$databases= Get-ChildItem "SQLSERVER:\Sql\$env:computername\default\AvailabilityGroups\$cluster\AvailabilityDatabases" #This method obtains the database metadata objects
#$databases | foreach {Suspend-SqlAvailabilityDatabase -Path SQLSERVER:\Sql\$env:computername\default\AvailabilityGroups\$cluster\AvailabilityDatabases\$_}
$databases | Suspend-SqlAvailabilityDatabase
Switch-SqlAvailabilityGroup -Path SQLSERVER:\sql\$env:computername\default\availabilitygroups\$cluster -AllowDataLoss
#$databases | foreach {Resume-SqlAvailabilityDatabase -Path SQLSERVER:\Sql\$env:computername\default\AvailabilityGroups\$cluster\AvailabilityDatabases\$_}
$databases | Resume-SqlAvailabilityDatabase
# SQL Method
ALTER AVAILABILITY GROUP [AvailabilityGroupName] FORCE_FAILOVER_ALLOW_DATA_LOSS;
# Update DNS with the current Live IP of AlwaysOn Listener
Get-ClusterResource $listenerAlias | Update-ClusterNetworkNameResource
# Check DNS Record to verify the correct Active IP
(Resolve-DnsName $listener).IPAddress
# Force DNS Replication (SysAdmin)
repadmin /syncall /Aed
We seem to have fixed the STAGE MS-SQL AlwaysOn Availability Groups fail-over issues. I would propose we apply the findings toward PROD so that we can have proper fail-over in that environment, also. Below is a log of actions to derive at a summary conclusion.
– Configured Availability Replicas of the targeted Availability group name to Select a Secondary Server in the Secondary Zone (192.166.9.0/24 subnet) by enabling Availability mode “Synchronous commit” with Failover “Automatic”. There was a MS-SQL constraint to only allow 1 pair of synchronous automatic replicas. Thus, 3rd replica (SERVER2-Zone1) must be put in Asynchronous Manual mode.
– Ran command on SERVER2-Zone1:
ALTER AVAILABILITY GROUP STG_AG FAILOVER;
Msg 41122, Level 16, State 12, Line 1
Cannot failover availability group ‘STG_AG’ to this instance of SQL Server. The local availability replica is already the primary replica of the availability group. To failover this availability group to another instance of SQL Server, run the failover command on that instance of SQL Server. If local instance of SQL Server is intended to host the primary replica of the availability group, then no action is required.
– Ran command on SERVER1-Zone1:
ALTER AVAILABILITY GROUP STG_AG FAILOVER;
Msg 41142, Level 16, State 34, Line 1
The availability replica for availability group ‘STG_AG’ on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual fail-over (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual fail-over to this secondary replica (without data loss). For more information, see SQL Server Books Online.
– Ran command on SERVER1-Zone1:
ALTER AVAILABILITY GROUP STG_AG FORCE_FAILOVER_ALLOW_DATA_LOSS;
Command(s) completed successfully.
– Connected to SERVER3-Zone2 > viewed Availability Group > noticed that SERVER1-Zone1 (desired primary replica) role was shown as “unknown” > Detached SERVER3 from Availability Group > Rejoined:
ALTER AVAILABILITY GROUP EDMS_STG_AG JOIN;
– Connected to SERVER2-Zone1 > Rejoined:
ALTER AVAILABILITY GROUP STG_AG JOIN;
Summary:
1. Microsoft AlwaysOn High Availability Group should have Primary Replica in Zone1 and Secondary Replica Automatic failover in Zone2 (assuming different subnets) if its Listener (host) record spans both Zone1 & Zone2.
2. Failover Cluster Group Node Owner does not have to match MS-SQL Availability Group Replica’s Primary Replica. Thus, the “Move-ClusterGroup “Cluster Group” -node SERVERX” command does not affect AlwaysOn High Availability Groups.
3. Availability Group Listeners are AD-DNS integrated by default. No manual DNS changes are needed once multiple IPs are set in MS-SQL
# Discover Endpoints
SELECT name,protocol_desc, port, state_desc FROM sys.tcp_endpoints WHERE type_desc = 'DATABASE_MIRRORING'
# Show AlwaysOn Endpoints and Statuses
select r.replica_server_name, r.endpoint_url,
rs.connected_state_desc, rs.last_connect_error_description,
rs.last_connect_error_number, rs.last_connect_error_timestamp
from sys.dm_hadr_availability_replica_states rs join sys.availability_replicas r
on rs.replica_id=r.replica_id
where rs.is_local=1
Error
The Database Mirroring endpoint cannot listen on port 5022 because it is in use by another process. (Microsoft SQL Server, Error: 9692)
# Check listening ports within MSSQL
SELECT * FROM sys.dm_tcp_listener_states WHERE port='XXXX'
# Troubleshooting use only: Delete Database Mirroring connection
Object Explorer > Server Objects > Endpoints > Database Mirroring >
Hadr_endpoint {remove}
# Troubleshooting use only: Manually Create Endpoint
CREATE ENDPOINT [Hadr_endpoint]
AS TCP (LISTENER_PORT = XXXX, LISTENER_IP = (x.x.x.x)
FOR DATA_MIRRORING (ROLE = ALL, ENCRYPTION = REQUIRED ALGORITHM AES)
GO
# Troubleshooting Only: Manually restore database
RESTORE DATABASE TEST
FROM DISK = 'H:\Backup\TEST.bak'
WITH REPLACE,RECOVERY
# Windows CMD method to find the process listening on port XXXX
netstat -aon | findstr :XXXX
# Discover process name of found pid
tasklist /fi "pid eq PID_NUMBER" #result: cvd.exe => CommVault
# Kill the process
taskkill /pid PID_NUMBER /F
# PowerShell Method to kill process (one liner)
Stop-Process -Id (Get-NetTCPConnection -LocalPort XXXX).OwningProcess -Force
# Restart MSSQL
net stop MSSQLSERVER
net start MSSQLSERVER
# Error:
The WSFC cluster could not bring the Network Name resource with DNS name ' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible.
# Resolution:
Prestage VCO… Manually create computer object same OU container as the clustername object (computer account)
Defined Objective: re-create Windows Cluster on SQL01, SQL02, SQL03; Cluster Name CLUSTER02: 192.166.9.24, 192.166.16.22; Quorum: ?; Shared folder: ?
RDP into SQL01 to see the Always On Availability Group was removed and Windows Cluster was empty prior to initiating task. Could not connect to CLUSTER02 cluster.
Ran WBEMTEST > Connected Namespace: Root\MSCluster > Query:
SELECT * from MSCluster_Resource
> Error: Number 0x8007012f Facility: win32, Description: The file cannot be opened because it is in the process of being deletedChecked CLUSTER02: File Share Witness \\FILESERVER01\CLUSTER02, No Shared Storage Disks/Pools
Checked SQLCLUSTER02: no File Share Witness (quorum), No Shared Storage Disks/Pools
Obtained confirmation from Ran Wei before running:
Get-Cluster -Name SQLCLUSTER02| Remove-Cluster -Force -CleanupAD
Contemplated an alternative solution of creating a new cluster using these commands:
$nodes="SQL01","SQL02", "SQL03"
foreach ($node in nodes){Remove-ClusterNode -Name $node -Force}
Noted these commands in the case of Cluster Creation:
Set-ClusterQuorum -NodeAndFileShareMajority "\\FILESERVER01\CLUSTER02"
Postulated root cause of cluster failure, correlating to patching cycles, as due to the simultaneous rebooting of all nodes in such cluster(s), without coordinated graceful shutdowns. Adding file share quorum would potentially mitigate this issue as the quorum would ensure that the clustering statistics remain in-tact even when all nodes are unavailable.
Ran various commands:
Test-Cluster CLUSTER02
Get-Cluster -Name CLUSTER02
Get-Cluster : Check the spelling of the cluster name. Otherwise, there might be a problem with your network. Make sure
the cluster nodes are turned on and connected to the network or contact your network administrator.
The RPC server is unavailable
Start-Cluster CLUSTER02
Stop-Cluster CLUSTER02
Set-ClusterQuorum -NodeAndFileShareMajority "\\FILESERVER01\CLUSTER02"
Start-Service ClusSvc
Start-Cluster CLUSTER02
Get-Cluster
$nodes="SQL01","SQL02", "SQL03"
foreach ($node in $nodes){Remove-ClusterNode -Name $node -Force}
IP: 192.166.16.22
IP: 192.166.9.24
EDMS_STG_AG_192.166.9.25
192.166.16.23
192.166.9.25
File Share Witness \\FILESERVER01\CLUSTER02
HQ Network 1040 192.166.16.0/24
ZONE2 network 9: 192.166.9.0/24
Rebuild cluster:
# Connection to each node to clear cluster settings
Clear-Clusternode -Force
# Build Cluster
$clusterName="CLUSTER02"
$nodes="SQL01","SQL02"
$clusterIPs="192.166.16.22","192.166.9.24"
New-Cluster -Name $clusterName -Node $nodes -NoStorage -StaticAddress $clusterIPs
Error: An attempt to use the specified cluster name failed because an enabled computer object with the given name already exists in the domain
Searched AD to find no record of old clustername - proceeded to cleanup traces of DNS entries for CLUSTER02 in association with 192.166.16.22,192.166.9.24
Reran command without success
Performed
repadmin /syncall on DC2
Successfully re-created cluster.
Noticed cluster performance was faster than before, especially no role/resource had yet been added
Added SQL01,SQL02,SQL03 onto permissions list of File Share “\\FILESERVER01\CLUSTER02”
Ran:
Set-ClusterQuorum -NodeAndFileShareMajority "\\FILESERVER01\CLUSTER02"
Error: “… An error occurred looking up the security ID of the cluster name object for ‘CLUSTER02’…”
Decided to give AD some time to see whether new SSID would automatically be generated for new object. If not by tomorrow, manual SSID generation would ensue.
Rebuild MS Failover Cluster CLUSTER02:
Checked CLUSTER02in AD – entity existed now
Checked DNS to find only 1 entry for sqlc1a as 192.166.9.24
Manually created DNS entry sqlc1a with secondary IP 192.166.16.22
Added CLUSTER02$ into the NTFS Permissions List of: \\FILESERVER01\CLUSTER02
Set-ClusterQuorum -NodeAndFileShareMajority “\\\FILESERVER01\CLUSTER02”
Ran Cluster Validation Wizard
Moved SQL03 to SQL OU
Tested Failover successfully
Move-ClusterGroup “Cluster Group” -node SQL01
EDMS_STG_AG
Listener: CLUSTER02, port 1433, IPs: 192.166.16.23, 192.166.9.26
Added KIMCONNECT\svc_cluster02 into NTFS Permissions of \\FILESERVER01\CLUSTER02\backup
File Share Witness \\FILESERVER01\CLUSTER02
HQ Network: 192.166.16.0/24
ZONE2 network : 192.166.9.0/24
netstat -a | find "8225"
netstat -a -b
[macmnsvc.exe] TCP 0.0.0.0:8225 SQL02:0 LISTENING
[svchost.exe] TCP 192.35.16.21:8225 SQL01:58308 ESTABLISHED
[sqlservr.exe] TCP 192.35.16.21:8225 sql01:58329 ESTABLISHED
[sqlservr.exe] TCP 192.35.16.21:8225 sql01:58381 ESTABLISHED
[sqlservr.exe] TCP 192.35.16.21:8225 sql01:58393 ESTABLISHED
[macmnsvc.exe] TCP [::]:8225 SQL02:0 LISTENING
netstat -p tcp -aon
netstat -aon | find /I "5022"
TCP 0.0.0.0:5022 0.0.0.0:0 LISTENING 1292
Retried to see this error:
TITLE: Microsoft SQL Server Management Studio
------------------------------
Joining database on secondary replica resulted in an error. (Microsoft.SqlServer.Management.HadrTasks)
------------------------------
ADDITIONAL INFORMATION:
Failed to join the database 'TEST' to the availability group 'STG_AG' on the availability replica 'SQL01, 8225'. (Microsoft.SqlServer.Smo)
For help, click: https://www.microsoft.com/?ProdName=Microsoft+SQL+Server&ProdVer=11.0.7001.0+((SQL11_PCU_Main).170815-1011+)&EvtSrc=Microsoft.SqlServer.Management.Smo.ExceptionTemplates.FailedOperationExceptionText
------------------------------
An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
------------------------------
The connection to the primary replica is not active. The command cannot be processed. (Microsoft SQL Server, Error: 35250)
For help, click: https://www.microsoft.com/?ProdName=Microsoft+SQL+Server&ProdVer=11.00.7462&EvtSrc=MSSQLServer&EvtID=35250
Object Explorer > Server Objects > Endpoints > Database Mirroring >
Hadr_endpoint {remove}
RESTORE DATABASE TEST
FROM DISK = 'H:\Backup\TEST.bak'
WITH REPLACE,RECOVERY
Execute:
select r.replica_server_name, r.endpoint_url,
rs.connected_state_desc, rs.last_connect_error_description,
rs.last_connect_error_number, rs.last_connect_error_timestamp
from sys.dm_hadr_availability_replica_states rs join sys.availability_replicas r
on rs.replica_id=r.replica_id
where rs.is_local=1
Result:
SQL01 TCP://SQL01.kimconnect.com:5022 DISCONNECTED The Database Mirroring endpoint cannot listen on port 5022 because it is in use by another process. 0 2019-02-27 00:59:36.917
Categories: