Friday, October 14, 2016

The S.M.A.R.T. Overflow

Before today, I didn't realize that S.M.A.R.T. has overflow issues. I should say that I spent most of yesterday replacing three disks in my home machine's RAID-10, so I constantly was using smartctl to double-check stuff. So when I got to work today, I poked around the disks in my server. And I noticed this:

...
SMART Self-test log structure revision number 1
Num  Test_Description ... LifeTime(hours) ...
# 1  Short offline    ... 2484            ...

# 2  Short offline    ... 2469            ...
# 3  Short offline    ... 2445            ...
...

This made absolutely no sense, because I knew that most of the disks in there were certainly older than a few months. After some digging, I came to realize what the problem is:

...
SMART Self-test log structure revision number 1
Num  Test_Description ... LifeTime(hours) ...
# 1  Short offline    ... 1276            ...
# 2  Short offline    ... 62136           ...
# 3  Short offline    ... 61776           ...

...

The actual value for Power_On_Hours is 66812 for that disk. Let's subtract 1276 from that and what do we find? 65536 of course. So the life time recorded with each self-test is apparently an unsigned 16-bit value, whereas the total hours recorded for the disk itself is stored as something bigger. Here's to hoping that people who write scripts telling them which disks to swap are aware of this...

No comments:

Post a Comment