I’ve recently been getting weird crashes and slow behavior on my home server. I suspected that a disk failure was to blame and I was right. Luckily I’m running btrfs in RAID1 so nothing is lost. Here is how I went about discovering/fixing the problem.
The most helpful thing is that from day one I set up a monthly scrub of the
volume to detect/correct errors. This is trivial on Arch because it provides a
systemd timer and service in the
[Unit]
Description=Btrfs scrub on %f
[Service]
Nice=19
IOSchedulingClass=idle
ExecStart=/usr/bin/btrfs scrub start -B %f

[Unit]
Description=Monthly Btrfs scrub on %f
[Timer]
OnCalendar=monthly
AccuracySec=1d
Persistent=true
[Install]
WantedBy=multi-user.target
That’s all in the package, so no work for you to do. Just start it with
$ systemctl enable btrfs-scrub@home.timer
$ systemctl start btrfs-scrub@home.timer
to scrub your
/home partition. This monthly scrub has been showing errors for
$ journalctl --unit=btrfs-scrub@home
Mar 01 00:00:00 eleven systemd: Started Btrfs scrub on /home.
Mar 01 02:16:28 eleven btrfs: WARNING: errors detected during scrubbing, corrected
Mar 01 02:16:28 eleven btrfs: scrub done for 4d3ab8e2-acc4-44bf-bf79-f2870dabb705
Mar 01 02:16:28 eleven btrfs: scrub started at Tue Mar 1 00:00:00 2016 and finished after 02:16:28
Mar 01 02:16:28 eleven btrfs: total bytes scrubbed: 1.76TiB with 63 errors
Mar 01 02:16:28 eleven btrfs: error details: read=63
Mar 01 02:16:28 eleven btrfs: corrected errors: 63, uncorrectable errors: 0, unverified errors: 0
But the error numbers were all small and correctable so I ignored them… until a few months ago. I started getting lots of scary “uncorrectable errors” that would pop up one scrub only to disappear the next (I guess they got corrected?). And the total number of errors was steadily climbing.
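Since the uncorrectable count is the number worth watching, here is a minimal sketch of pulling it out of the scrub log. The function name `uncorrectable_count` is made up for illustration, and it assumes the summary line keeps the wording shown in the journal output above.

```shell
# Hypothetical helper: reads scrub log lines on stdin and prints the
# uncorrectable error count from the last summary line it finds.
uncorrectable_count() {
    sed -n 's/.*uncorrectable errors: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Example use (reads the journal of the scrub unit shown above):
#   n=$(journalctl --unit=btrfs-scrub@home | uncorrectable_count)
#   if [ "${n:-0}" -gt 0 ]; then
#       echo "last scrub reported $n uncorrectable errors"
#   fi
```

Taking only the last match means older scrubs in the journal do not mask the most recent result.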
This was worrying, but the server kept chugging along just fine, so I made a mental note and took no further action. When it started acting up with really slow boot times and crashes I finally rolled up my sleeves and looked for the issue:
# btrfs device stats /home
[/dev/sdb].write_io_errs 16
[/dev/sdb].read_io_errs 10768
[/dev/sdb].flush_io_errs 0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sdc].write_io_errs 0
[/dev/sdc].read_io_errs 0
[/dev/sdc].flush_io_errs 0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
/dev/sdb is the culprit.
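With more devices the list gets long, so a rough filter that keeps only the non-zero counters can help. This is just a sketch: `nonzero_stats` is a hypothetical name, and it assumes each line is a counter name followed by a number, as in the output above.

```shell
# Hypothetical helper: reads `btrfs device stats` output on stdin and
# prints only the counters whose value is greater than zero.
nonzero_stats() {
    awk '$2 + 0 > 0 { print $1, $2 }'
}

# Example use (needs root, like the command above):
#   btrfs device stats /home | nonzero_stats
```

On the output above this would leave just the two `/dev/sdb` lines for `write_io_errs` and `read_io_errs`.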
First I need to figure out what disk that is, because I sure as hell don’t remember.
$ cat /sys/block/sdb/device/wwid
t10.ATA WDC WD20EZRX-00D8PB0 WD-WMC4M2261405
I don’t feel like buying that old model of HDD again, so how big of a drive do I need to replace it?
# btrfs filesystem show /home
Label: none uuid: 4d3ab8e2-acc4-44bf-bf79-f2870dabb705
Total devices 2 FS bytes used 573.50GiB
devid 1 size 1.82TiB used 812.06GiB path /dev/sdb
devid 2 size 1.82TiB used 812.04GiB path /dev/sdc
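For an exact figure rather than the rounded 1.82TiB above, one option is to compare raw device sizes in bytes. This is only a sketch: `big_enough` is a hypothetical helper, and `/dev/sdX` is a placeholder for whatever name the candidate drive gets once it is plugged in.

```shell
# Hypothetical helper: succeeds if a candidate of new_bytes is at least as
# large as the device of old_bytes it is meant to stand in for.
big_enough() {
    old_bytes=$1
    new_bytes=$2
    [ "$new_bytes" -ge "$old_bytes" ]
}

# Example use (needs root; /dev/sdX is a placeholder for the new drive):
#   old=$(blockdev --getsize64 /dev/sdb)
#   new=$(blockdev --getsize64 /dev/sdX)
#   big_enough "$old" "$new" || echo "/dev/sdX is too small" >&2
```

`blockdev --getsize64` reports the size in bytes, which sidesteps the TB versus TiB rounding on drive labels.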
btrfs needs it to be as big or bigger, so any old 2TB drive will do. Plug that sucker in and then out with the old, in with the new:
# btrfs replace start -r /dev/sdb /dev/sdd /home
# btrfs replace status -1 /home
/dev/sdd is the new drive, and -r tells btrfs to read from the bad drive only as a last resort while mirroring everything. /dev/sdc is still in good shape, so this should greatly speed up the replacement process. The first command starts a background process, which you can check on with the second command.
Took a bit over an hour:
Started on 17.Dec 12:52:45, finished on 17.Dec 14:18:18, 0 write errs, 0 uncorr. read errs
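If you would rather not rerun the status command by hand, a polling loop is one way to wait for it. The sketch below uses a hypothetical `replace_finished` helper and relies only on the "finished on" wording from the status line above. It does not handle other end states (a canceled replace, for example), so in those cases the loop would need to be interrupted manually.

```shell
# Hypothetical helper: succeeds once a `btrfs replace status -1` line
# reports completion, judged by the "finished on" wording seen above.
replace_finished() {
    case $1 in
        *"finished on"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example polling loop (needs root; the 60 second interval is arbitrary):
#   until replace_finished "$(btrfs replace status -1 /home)"; do
#       sleep 60
#   done
#   echo "replace done"
```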