SQL Server Corruption Post Mortem
·
2 min read
Table of Contents
“No matter what method you use to recover from corruption, you should always determine why it happened to avoid future problems.” – Paul Randal
Most Likely Causes
The most likely causes are outlined below, in order of likelihood:
- Run I/O subsystem and server memory diagnostics (almost always the cause)
- Windows OS
- File system filter drivers – e.g. antivirus, defraggers, encryption
- Network cards, switches, cables
- Memory corruption
- Bad memory chips
- Scribblers
- SAN controllers
- RAID controllers
- Disks
Examine the Logs
Check the SQL Server error log and Windows event logs for clues:
- Error 823: A hard I/O error (when Windows can’t return the data to SQL)
- Error 824: A soft I/O error (when SQL detects there is a problem with the data it was given by Windows)
- Error 825: A read-retry error (these show as information alerts however they are critical because they are impending-doom warning signs)
- More information: Microsoft KB 2015757
Also:
- Check that the firmware is up-to-date
- Investigate NTFS filter drivers
What Does NOT Cause Corruption
It is important to remember that corruptions are not caused by the following:
- Anything an application can do
- Anything you can do in SQL Server with supported, documented commands
- Interrupting a database shrink, index rebuild, or long-running batch
- Shutting down SQL Server
Early Detection Strategies
If corruption happened, it is likely that certain steps are not being done inside your organization to protect your data. Here is a list of what can be done to help detect early signs:
- Implement error alerts for Severity 19 and above errors – see Configuring Alerts for High Severity Problems
- Implement Page Verify – Checksum option – see Setting Your Page Verify Database Option to Checksum
- Implement backups with Checksum option
- All databases INCLUDING system databases
- Allows you to use
RESTORE VERIFYONLYwithCHECKSUMto validate your backups
- Implement an integrity check script – either Ola Hallengren’s Maintenance Solution (widely used) or at minimum a
DBCC CHECKDB WITH NO_INFOMSGSjob to check integrity