Check hard drive health with "smartctl" command on Linux - A practical guide

In the world of data storage, hard drives play a crucial role in maintaining the integrity and accessibility of your data. However, hard drives are not immune to failures, and monitoring their health is essential to prevent data loss and ensure smooth operation.

The smartctl command, available on Linux systems, allows users to monitor and manage the "Self-Monitoring, Analysis and Reporting Technology (SMART)" configuration of hard drives.

Most modern storage devices, including hard drives, SATA SSDs and NVMe drives, provide some kind of S.M.A.R.T. implementation that lets software read these values and make an informed judgement about the overall performance and health of the drive.

For large storage setups like data centers, this is an invaluable tool as it can help predict failures in advance and allow system admins to move data safely and avoid data loss.

In this article, we will explore the smartctl command with detailed examples. We will run it on local machines with SSDs and HDDs installed, and also on cloud servers with Amazon Elastic Block Store (EBS) volumes.

Installing smartctl

Before we begin, ensure that the smartctl utility is installed on your Linux system. Some distributions ship it by default; if yours does not, you can install the smartmontools package with the package manager.

For Debian-based systems

sudo apt-get install smartmontools

For CentOS-based systems

sudo yum install smartmontools
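
Scripts that depend on smartctl can first confirm that the binary is on the PATH. The following is a minimal sketch; the helper name require_tool is made up for illustration and is not part of smartmontools.

```shell
# Return success if the named command can be found on the PATH.
# `require_tool` is a hypothetical helper name used only in this sketch.
require_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Example use (commented out so the snippet has no side effects):
# require_tool smartctl || echo "smartctl not found; install smartmontools" >&2
```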

1. List all devices on the system

The --scan option makes smartctl report all the available disk drives on the system along with their device paths and device types.

$ smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d sat # /dev/sdd [SAT], ATA device
$

In the above output, the first three drives are internal SSDs connected to the motherboard via SATA cables. The fourth is a portable Samsung SSD connected via USB.
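
When the scan result feeds another script, it helps to reduce each line to the device path and the device type. The sketch below assumes the line layout shown above (path, "-d", type, then a comment); scan_devices is an illustrative name, not a smartctl feature.

```shell
# Read `smartctl --scan` output on stdin and print "device type" pairs.
# Each scan line is assumed to look like:
#   /dev/sda -d scsi # /dev/sda, SCSI device
scan_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Example use:
# smartctl --scan | scan_devices
```

Each pair can then be passed back to smartctl as "-d TYPE DEVICE" when the automatically detected type needs to be stated explicitly.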

2. Quick health checkup

With the -H option we can do a quick health checkup, and smartctl will tell us how the drive is doing at present.

$ sudo smartctl -H /dev/sda
[sudo] password for enlightened: 
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.0-27-generic] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

$

If the result is PASSED, the drive is probably doing fine, though that is not guaranteed.
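
For scripting, the verdict can be pulled out of the report as a single word. This is a rough sketch that assumes the ATA/NVMe wording shown above; SCSI drives may word the health line differently, and health_verdict is a hypothetical helper name.

```shell
# Read `smartctl -H` output on stdin and print only the verdict
# (for example PASSED). Prints UNKNOWN if the expected line is missing.
# Assumes the "overall-health self-assessment" wording shown in this article.
health_verdict() {
    awk -F': *' '
        /^SMART overall-health self-assessment test result/ {
            sub(/[ \t]+$/, "", $2)   # drop trailing blanks
            print $2
            found = 1
        }
        END { if (!found) print "UNKNOWN" }
    '
}

# Example use:
# sudo smartctl -H /dev/sda | health_verdict
```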

3. Full S.M.A.R.T information of disks

The SMART technology embedded in modern hard drives offers insights into their health, performance, and reliability. smartctl allows you to extract detailed information pertaining to these aspects.

To print all SMART information about a disk, the syntax is as follows:

smartctl -a /dev/sdX

Substitute "/dev/sdX" with the device identifier of your disk. For instance, to view SMART information for the drive at /dev/sdb, you would use:

$ sudo smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.0-27-generic] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 120GB
Serial Number:    S21SNXAGC12532L
LU WWN Device Id: 5 002538 d408f4063
Firmware Version: EMT02B6Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep  1 16:13:48 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       16838
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       4523
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       27
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   048   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       49
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1937655527

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

enlightened@enlightened:~$

Note that there are two sections. The first, "INFORMATION SECTION", reports details about the drive such as the manufacturer, model and size. The second, "SMART DATA", reports SMART parameters with their values, along with a number of other details.

The command can also report details about cloud storage volumes, such as Amazon Elastic Block Store (EBS) on AWS.

linuxworld:~# smartctl -a /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION === 
Model Number: Amazon Elastic Block Store 
Serial Number: vol0bc76da967d23bf84 
Firmware Version: 1.0 
PCI Vendor/Subsystem ID: 0x1d0f 
IEEE OUI Identifier: 0xa002dc 
Controller ID: 0 
Number of Namespaces: 1 
Namespace 1 Size/Capacity: 107,374,182,400 [107 GB] 
Namespace 1 Formatted LBA Size: 512 
Local Time is: Tue Aug 29 07:52:30 2023 CEST 
Firmware Updates (0x03): 1 Slot, Slot 1 R/O 
Maximum Data Transfer Size: 64 Pages 
Warning Comp. Temp. Threshold: 70 Celsius 
Namespace 1 Features (0x12): NA_Fields *Other* 
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     0.01W       -        -    0  0  0  0  1000000 1000000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION === 
SMART overall-health self-assessment test result: PASSED 
SMART/Health Information (NVMe Log 0x02) 
Critical Warning: 0x00 
Temperature: - 
Available Spare: 0% 
Available Spare Threshold: 0% 
Percentage Used: 0% 
Data Units Read: 0 
Data Units Written: 0 
Host Read Commands: 0 
Host Write Commands: 0 
Controller Busy Time: 0 
Power Cycles: 0 
Power On Hours: 0 
Unsafe Shutdowns: 0 
Media and Data Integrity Errors: 0
Error Information Log Entries: 0 
Warning Comp. Temperature Time: 0 
Error Information (NVMe Log 0x01, max 64 entries) 
No Errors Logged

4. Checking drive information

To view general information about your hard drive, such as its model, serial number, and firmware version, use the following command. The "-i" option prints just basic information about the drive.

smartctl -i /dev/sdX

Replace "/dev/sdX" with your hard drive identifier. Here's an example:

The following is a Samsung 850 EVO 120GB SSD connected internally via SATA.

$ sudo smartctl -i /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.0-27-generic] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 120GB
Serial Number:    S21SNXAGC12532L
LU WWN Device Id: 5 002538 d408f4063
Firmware Version: EMT02B6Q
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Sep  1 16:12:00 2023 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

$

The smartctl command can also report information about volumes attached to cloud servers, such as Amazon Elastic Block Store (EBS) on AWS.

linuxworld:~# smartctl -i /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION === 
Model Number: Amazon Elastic Block Store 
Serial Number: vol0bc76da967d23bf84 
Firmware Version: 1.0 
PCI Vendor/Subsystem ID: 0x1d0f 
IEEE OUI Identifier: 0xa002dc 
Controller ID: 0 
Number of Namespaces: 1 
Namespace 1 Size/Capacity: 107,374,182,400 [107 GB] 
Namespace 1 Formatted LBA Size: 512 
Local Time is: Tue Aug 29 08:19:12 2023 CEST
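
Inventory scripts often need just one of these fields, such as the serial number. The following sketch extracts a single value by its label; info_field is an illustrative helper name, and the label must match the text before the colon exactly as smartctl prints it.

```shell
# Read `smartctl -i` output on stdin and print the value of one field.
# $1 is the exact label before the colon, e.g. "Serial Number".
info_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            value = substr($0, length(key) + 2)   # text after "key:"
            gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim surrounding blanks
            print value
            exit
        }
    '
}

# Example use:
# sudo smartctl -i /dev/sdb | info_field "Serial Number"
```

The same helper works on the NVMe output, where the model is reported under "Model Number" rather than "Device Model".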

5. Checking SMART attributes

SMART attributes provide valuable information about the health (hardware condition) and performance parameters of the drive. To list them, use the "-A" or "--attributes" option.

For SATA drives, this option prints a table of attributes alongside their current, worst, and threshold values. You can list these attributes using:

sudo smartctl -A /dev/sdX

Here is an example of the SMART data of an NVMe drive.

linuxworld:~# smartctl -A /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION === 
SMART/Health Information (NVMe Log 0x02) 
Critical Warning: 0x00 
Temperature: - 
Available Spare: 0% 
Available Spare Threshold: 0% 
Percentage Used: 0% 
Data Units Read: 0 
Data Units Written: 0 
Host Read Commands: 0 
Host Write Commands: 0 
Controller Busy Time: 0 
Power Cycles: 0 
Power On Hours: 0 
Unsafe Shutdowns: 0 
Media and Data Integrity Errors: 0 
Error Information Log Entries: 0 
Warning Comp. Temperature Time: 0

Here is the SMART attribute information of another drive, which looks very different from the NVMe output above. This is the internal SATA SSD at /dev/sdb on my Ubuntu desktop machine; the values match the Samsung 850 EVO shown earlier.

$ sudo smartctl -A /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.0-27-generic] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       16837
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       4520
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       27
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   061   048   000    Old_age   Always       -       39
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       49
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1936278631

$

There are many more SMART attributes and indicators for different parameters of a storage drive. A complete list can be found on the Wikipedia page on S.M.A.R.T.
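
The ATA attribute table is easy to process with awk, because every row has the same ten columns. The sketch below shows two hypothetical helpers: attr_raw prints the RAW_VALUE of a named attribute, and failing_attrs lists attributes whose normalized VALUE is at or below THRESH. Treating a THRESH of 000 as "no threshold" is a simplifying assumption of this sketch.

```shell
# Both helpers read `smartctl -A` output (ATA attribute table) on stdin.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

# Print the RAW_VALUE column for the attribute named in $1.
attr_raw() {
    awk -v name="$1" '$1 ~ /^[0-9]+$/ && $2 == name { print $10 }'
}

# Print names of attributes whose VALUE is at or below a non-zero THRESH.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 { print $2 }'
}

# Example use:
# sudo smartctl -A /dev/sdb | attr_raw Power_On_Hours
# sudo smartctl -A /dev/sdb | failing_attrs
```

Neither helper applies to the NVMe output, which is a list of "name: value" lines rather than a table.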

6. Estimating TBW (Terabytes written) for SSDs

For SSDs we can estimate the TBW figure from the values of other attributes and some math. There is a discussion on askubuntu.com about this.

Here is a quick example. The following command reports the total amount of data (in GB) written to the drive, assuming each LBA is 512 bytes. Just make sure to use the correct device path; here it is /dev/sdb.

echo "GB Written: $(echo "scale=3; $(sudo /usr/sbin/smartctl -A /dev/sdb | grep "Total_LBAs_Written" | awk '{print $10}') * 512 / 1073741824" | bc | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta')"

Here is a shorter version of the same command:

sudo /usr/sbin/smartctl -A /dev/sdb | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }'

$ sudo /usr/sbin/smartctl -A /dev/sdb | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }'
TBW 0.9
$
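
The arithmetic behind these one-liners is a single multiplication and division: LBAs written times bytes per LBA, divided by 1024^4. For the raw value 1936278631 shown earlier, that is 1936278631 x 512 = 991,374,659,072 bytes, or about 0.9 TiB. The sketch below wraps this in a function; lbas_to_tib is an illustrative name, and the 512-byte unit is an assumption, since the meaning of Total_LBAs_Written can vary by vendor.

```shell
# Convert an LBA count to tebibytes (TiB).
# $1 = raw Total_LBAs_Written value
# $2 = bytes per LBA (defaults to 512, which is an assumption)
lbas_to_tib() {
    awk -v lbas="$1" -v size="${2:-512}" \
        'BEGIN { printf "%.1f\n", lbas * size / 1024^4 }'
}

# Example use:
# lbas_to_tib 1936278631 512
```

Dividing by 10^12 instead of 1024^4 gives decimal terabytes, which is the unit drive vendors usually quote for endurance ratings.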

7. Initiating tests

SMART-enabled drives offer self-testing capabilities. To initiate a test, use the "-t" option followed by a test type. For instance, the command below starts a short self-test; the terminal output reports how long the test is expected to take.

sudo smartctl -t short /dev/sdX

Sample output

linuxworld:~# smartctl -t short /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
NVMe device successfully opened 
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun. 
Please wait 1 minutes for test to complete. 
Test will complete after Sun Nov 16 12:51:45 2014 
Use smartctl -X to abort test.

After the test completes (a short test typically takes a few minutes), the results can be viewed with the "-l selftest" option. See the command and terminal output below.

sudo smartctl -l selftest /dev/sda

Sample output

linuxworld:~# smartctl -l selftest /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
NVMe device successfully opened 
=== START OF READ SMART DATA SECTION === 
SMART Self-test log structure revision number 1 
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error 
# 1 Short offline Completed: read failure 90% 492 210841222
# 2 Extended offline Completed: read failure 90% 492 210841222
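
In the log above, both tests ended with a read failure at the same LBA, which is a strong hint that the drive has a bad sector. A script can pick out such entries automatically. The sketch below assumes the ATA-style log layout shown here, where entries start with "#" and a clean run is reported as "Completed without error"; failed_selftests is a hypothetical helper name.

```shell
# Read `smartctl -l selftest` output on stdin and print the log entries
# (lines starting with "#") that did not finish cleanly.
failed_selftests() {
    awk '/^#/ && !/Completed without error/'
}

# Example use:
# sudo smartctl -l selftest /dev/sda | failed_selftests
```

Empty output means every logged test completed without error, or that no tests have been logged yet.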

8. Accessing error logs

The "-l error" option grants access to the drive's error log, providing historical insights into past issues. The command and terminal output are as follows:

sudo smartctl -l error /dev/sda

linuxworld:~# smartctl -l error /dev/nvme0n1p1 
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.71.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION === 
Error Information (NVMe Log 0x01, max 64 entries) 
SMART Error Log Version: 1 
ATA Error Count: 5 
CR = Command Register [HEX] 
FR = Features Register [HEX] 
SC = Sector Count Register [HEX] 
SN = Sector Number Register [HEX] 
CL = Cylinder Low Register [HEX] 
CH = Cylinder High Register [HEX] 
DH = Device/Head Register [HEX] 
DC = Device Command Register [HEX] 
ER = Error register [HEX] 
ST = Status register [HEX] 
Powered_Up_Time is measured from power on, and printed as 
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, 
SS=sec, and sss=millisec. It "wraps" after 49.710 days. 
Commands leading to the command that caused the error were: 
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 da 08 e7 e5 a5 4c 00 00:30:44.515 READ DMA EXT 
25 da 08 df e5 a5 4c 00 00:30:44.514 READ DMA EXT 
25 da 80 5f e5 a5 4c 00 00:30:44.502 READ DMA EXT 
25 da f0 5f e6 a5 4c 00 00:30:44.496 READ DMA EXT 
25 da 10 4f e6 a5 4c 00 00:30:44.383 READ DMA EXT
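
For monitoring, the number of logged errors is usually the most useful part of this output. The following sketch extracts it from the "ATA Error Count" line shown above and prints 0 when no such line is present; ata_error_count is an illustrative name.

```shell
# Read `smartctl -l error` output on stdin and print the ATA error count.
# Prints 0 when no "ATA Error Count" line exists (e.g. "No Errors Logged").
ata_error_count() {
    awk -F': *' '/^ATA Error Count/ { n = $2 + 0 } END { print n + 0 }'
}

# Example use:
# sudo smartctl -l error /dev/sda | ata_error_count
```

Comparing this number between runs shows whether new errors have appeared since the last check.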

9. Automating SMART monitoring

Given the demands of large-scale systems, where numerous hard drives need constant monitoring and maintenance, manual intervention becomes not only cumbersome but also impractical. Checking the health status of each drive, running tests, and generating reports manually can consume an enormous amount of time and effort.

This is where the power of automation shines, and the smartctl command comes to the forefront as a valuable tool for seamless integration into scripts and automation tools.

smartctl can be harnessed within scripts to create automated workflows that handle the monitoring and management of hard drive health. By leveraging its capabilities, system administrators can streamline the process of ensuring the integrity and performance of drives, all while minimizing the need for manual oversight.

Here's a practical example of how smartctl can be seamlessly integrated into a Bash script for automated monitoring and reporting:

#!/bin/bash
# Address that receives the report (placeholder; replace with a real one).
EMAIL="admin@example.com"
LOGFILE="/var/log/smartctl.log"

# Start a fresh report with a title, timestamp and divider.
echo "SMARTCTL Report" > "$LOGFILE"
date >> "$LOGFILE"
echo "===============================" >> "$LOGFILE"

# Check every /dev/sdX drive; add NVMe devices such as /dev/nvme0n1 if needed.
for DEV in /dev/sd?
do
    smartctl -H "$DEV" >> "$LOGFILE"
    echo "---------------------------" >> "$LOGFILE"
done

# Send the finished report by email.
mail -s "SMARTCTL Report" "$EMAIL" < "$LOGFILE"

In this script, several key components come together to automate the process of monitoring hard drive health using smartctl:

Email Configuration: The script starts by setting the email address (EMAIL) where the SMART monitoring report will be sent. Replace the placeholder value with the appropriate email address.
Logfile Specification: The script defines a logfile (LOGFILE) where the SMART monitoring results will be recorded. The specified path /var/log/smartctl.log is just an example; you can adjust it to match your desired directory and naming conventions.
Creating the Report: The script starts the SMART monitoring report by writing a title, date, and a divider into the logfile, overwriting any previous report.
Loop Through Drives: The script loops over all drive devices matching /dev/sd?, where the "?" matches a single character such as a, b or c. NVMe devices use different names (for example /dev/nvme0n1) and must be added to the list explicitly.
Run smartctl Command: Inside the loop, the smartctl -H "$DEV" command is executed for each drive device. This command fetches the health status of the drive and appends the result to the logfile.
Log Separators: After each drive's health status is recorded, a separator line is added to improve readability in the logfile.
Email the Report: Once all drives have been processed, the script feeds the logfile to the mail command, whose "-s" flag sets the subject, to send the SMART monitoring report to the specified email address.

By running this script regularly, perhaps as a scheduled task using a tool like cron, system administrators can maintain a watchful eye over the health of their drives without manual intervention. If any issues arise, the automated report will promptly notify them, enabling swift response and resolution.

The smartctl command's seamless integration into automation scripts empowers administrators to tackle the challenges posed by large-scale systems. By automating the monitoring and reporting of hard drive health, time and effort are saved, while the reliability and performance of the system are upheld.

This approach exemplifies the power of technology in easing the burdens of system management and ensuring the stability of complex environments.
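
Automated checks do not have to parse text at all: smartctl also reports problems through its exit status, which is a bitmask. The sketch below decodes that status into readable messages. The bit meanings are paraphrased from the smartctl(8) manual page, and explain_smartctl_status is a hypothetical helper name.

```shell
# Print one line for every bit set in a smartctl exit status ($1).
# Bit meanings are paraphrased from the smartctl(8) manual page.
explain_smartctl_status() {
    status=$1
    bit=0
    while [ "$bit" -le 7 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            case $bit in
                0) echo "command line did not parse" ;;
                1) echo "device open failed" ;;
                2) echo "a SMART command failed or a checksum error occurred" ;;
                3) echo "SMART status reports DISK FAILING" ;;
                4) echo "prefail attributes at or below threshold" ;;
                5) echo "attributes were at or below threshold in the past" ;;
                6) echo "error log contains errors" ;;
                7) echo "self-test log contains errors" ;;
            esac
        fi
        bit=$((bit + 1))
    done
}

# Example use:
# smartctl -H /dev/sda >/dev/null; explain_smartctl_status $?
```

An exit status of 0 produces no output, so a cron job built on this helper stays silent as long as nothing is flagged.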

Virtual machines

When run inside a virtual machine, smartctl will likely not report any SMART parameters, as most of the time these are not available. For instance, I tried running it on Ubuntu running as a guest in VirtualBox, and the output looked like this:

$ sudo smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.0-43-generic] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     VBOX HARDDISK
Serial Number:    VBa61165f2-20eea9f5
Firmware Version: 1.0
User Capacity:    53,687,091,200 bytes [53.6 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA/ATAPI-6 published, ANSI INCITS 361-2002
Local Time is:    Fri Sep  1 16:19:28 2023 IST
SMART support is: Unavailable - device lacks SMART capability.

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
$

As can be seen above, the command could report the disk size and some limited information about the disk itself, but no SMART information is available. This makes sense, since SMART data describes the physical hardware of the drive, and in a virtualised environment the disk is emulated.
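
A monitoring script can skip such disks by checking the "SMART support is:" line before asking for attributes. This is a minimal sketch based on the ATA-style wording shown above; the NVMe information section shown earlier has no such line, so the check does not apply there. The name smart_available is made up for illustration.

```shell
# Read `smartctl -i` output on stdin; succeed only if the drive reports
# "SMART support is: Available ...". Applies to ATA-style output only.
smart_available() {
    grep -q '^SMART support is: *Available'
}

# Example use:
# if sudo smartctl -i /dev/sda | smart_available; then
#     sudo smartctl -A /dev/sda
# fi
```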

Conclusion

The smartctl command, nestled within the smartmontools package, emerges as an indispensable tool in the arsenal of system administrators. Its ability to uncover intricate SMART attributes, execute tests, and enable automation equips administrators to proactively safeguard against potential drive failures.

By harnessing its multifaceted capabilities, administrators can ensure data integrity, minimize downtime, and fortify their systems against the perils of hard drive failures. The practical examples and terminal output provided in this guide serve as a stepping stone towards mastering smartctl on Linux systems.