Failover Cluster Troubleshooting
There’s nothing quite like logging in to a customer’s system first thing Monday morning only to be greeted with this:
I discovered this when I wasn’t able to log into the customer’s ILINX Capture implementation. The logged error (failure to locate the SQL Server) led me to check the SQL Server’s configuration, which confirmed that its service was not running on either node of the cluster. The error I got when trying to start the service (a clustered resource could not be activated) led me in turn to check on the clustered resources themselves.
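For anyone who would rather do that check from a prompt than from the GUI, here is a minimal sketch. It assumes the FailoverClusters PowerShell module is installed and that the session is running on one of the cluster nodes; the column selection is just my preference.

```powershell
# Assumes the FailoverClusters module is available on this node.
Import-Module FailoverClusters

# List every clustered resource that is not currently online,
# grouped so that a role's IP address and service sit next to each other.
Get-ClusterResource |
    Where-Object { $_.State -ne 'Online' } |
    Sort-Object OwnerGroup, ResourceType |
    Format-Table Name, State, OwnerGroup, ResourceType -AutoSize
```

If an IP Address resource shows up as Failed alongside the service in the same group, that is a strong hint about where to look first.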
Just FYI, I’m a software developer by trade with just enough network admin experience to be competent at troubleshooting things like this, so forgive me if I butcher this explanation slightly. At a high level, every clustered resource has two components: the resource itself and an IP address tied to it. If either of these components fails during initialization, the entire clustered resource becomes unavailable. I’m still kicking myself a bit for not seeing it sooner (it took a comment from a co-worker noting that, “At least the data [was] still there, if you can ever get to it” to get me thinking on the right track), but the above readout actually tells us exactly what we need to know in this case: the resources themselves were unavailable because their IP addresses failed to activate correctly. This could have been caused by a number of things, but whatever the reason, it had to be fixed quickly. In Windows Server 2012 R2, this can be handled through the Failover Cluster Manager:
In both the Cluster Core Resources and Roles views, we can see an IP Address value assigned. This needs to match the IP address assigned to the respective resource in DNS, and it can be set via the resource’s Properties dialog, found under its context menu:
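The same values can be read from PowerShell. The sketch below is only illustrative: the two resource names and the DNS name are hypothetical placeholders, so substitute whatever Get-ClusterResource reports on your own cluster.

```powershell
Import-Module FailoverClusters

# Hypothetical names for illustration only; replace with your own.
$ipResource  = 'SQL IP Address 1 (SQLCLUSTER)'
$sqlResource = 'SQL Server'
$dnsName     = 'sqlcluster.example.com'

# Show which resources the SQL Server resource depends on.
Get-ClusterResourceDependency -Resource $sqlResource

# Show the IP Address resource's settings (Address, SubnetMask, and so on).
Get-ClusterResource -Name $ipResource | Get-ClusterParameter

# Look up the A record in DNS to compare against the Address value above.
Resolve-DnsName -Name $dnsName -Type A
```

If the Address parameter and the DNS record disagree, that mismatch is the thing to correct in the Properties dialog.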
From here, the IP address and its subnet mask can be configured. Once that is done, the IP Address resource can be brought online, which will cascade to the other components as well. Taking a final look at Get-ClusterResource in PowerShell will confirm the change at this point:
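For completeness, the same fix can be sketched in PowerShell instead of the Properties dialog. Treat this as an outline under assumptions rather than what I actually ran: the resource name is a hypothetical placeholder, and the address and mask are documentation-range values that must be replaced with the ones registered in DNS for your resource.

```powershell
Import-Module FailoverClusters

# Hypothetical resource name and placeholder network values.
$ipResource = 'SQL IP Address 1 (SQLCLUSTER)'
$address    = '192.0.2.10'
$subnetMask = '255.255.255.0'

# Set the address and subnet mask on the IP Address resource in one call.
Get-ClusterResource -Name $ipResource |
    Set-ClusterParameter -Multiple @{ Address = $address; SubnetMask = $subnetMask }

# Bring the IP Address resource online.
Start-ClusterResource -Name $ipResource

# Confirm the resulting state of every clustered resource.
Get-ClusterResource |
    Format-Table Name, State, OwnerGroup, ResourceType -AutoSize
```

If anything in the role is still offline after the IP address comes up, Start-ClusterGroup with the role’s name should bring the rest of the group online.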
This is only one side of the coin, however. The other is a failure of the resource itself (in this case, either the SQL Server services or the logical disk). We’ll save that for another time, though.