SQL Server Health Check – Case Study
SQL Server Health Check – Case Study
Industry vertical: National Automotive Service
Client background:
2000+ employees
$1 Billion+ in revenues
300 locations nationwide
Problem Summary:
-
The client is experiencing severe issues with the Microsoft Windows Failover Clustering stability. Four node Windows Cluster goes down about two-three times per day and takes about 30 minutes to recover each time. While about 2,000 employees can not work and clients can not be serviced.
-
Severe SQL Server performance issues. Multiple queries were taking 1, 5, 10 minutes or longer or timing out. Employees and customers wait… frustrated.
-
Windows Cluster has 4 nodes. Three of them do not work.
After a while client realized that the skill set did not exist internally to resolve stability and SQL performance problems quickly. Therefore CDMBase got the call.
What was done:
-
CDMBase performed “Comprehensive SQL Server Health Check”. This outlined 41 issues in:
-
Reliability (which included maintenance)
-
Performance
-
Security
-
A comprehensive methodology was applied to look at various levels Cluster and SQL depended on: CPU, RAM, networking, BIOS, VMWare setup, Windows Operating System settings, Windows Group Policy, Active Directory, multiple error logs, storage, networking, and few more areas. It may seem like an overkill to look at all the layers, but CDMBase practices prove that comprehensive analysis is effective. Especially when stability and speed matter.
-
-
CDMBase performed “Microsoft Windows Cluster Health Check”.
-
After checking Win Clustering's best practices we found severe problems in the cluster setup. Which caused nodes going down randomly. And on top of that Cluster could not recover. Since the hardware was all different from all four nodes and severely underpowered, and only a single node could realistically handle the SQL Server workload. Therefore we took other 3 useless nodes. Temporarily – until the second node was fixed that could become the second node that can handle the workload.
-
Many cluster options were simply wrong.
-
Storage misconfigured.
-
Networking misconfigured.
-
Priorities and thresholds were misconfigured.
-
-
-
CDMBase did “SAN Performance testing”. This revealed severe slowness.
-
SAN which was just bought and cost $250K was working at USB2 speed or 60 megabytes per second (MBps).
-
SAN had three-level of drives: fast, medium and slow. We worked with SAN/sysadmin and figured out which is which, and moved SQL Server files around by placing data that got “hit” the most on fastest drives​
-
-
CDMBase found a contractor which was periodically bringing SQL down.
-
This external contractor in charge of SQL Server attempted to fix SQL performance problems by running SQL Server Profiler. By using the tool improperly (collecting few wrong events) he was bringing SQL to its knees. Incorrectly concluding that SQL was going to go down anyway, he just caught the moment, when it was about to go down and SQL did go down while he watched. After CDMBase identified the problem, security permissions were tightened.
Situation after CDMBase work was complete:
-
SQL Server Cluster does not go down at random times and is fully stable. Scheduled failovers are now done periodically every 60-90 days.
-
SQL Server Failover Cluster reboot time is now under 5 minutes (was 30min+).
-
About 20-30 SQL queries with performance issues were addressed by significantly cutting resources used or run time duration required.
-
Internal applications that accessed SQL databases in Headquarters are running a lot faster. Employees are happier. Customers don’t have to wait as long as before.
Both the CDMBase analysis reports and the recommendations received praising feedback from the client. We continued the work for this client for 2 years+.