I’ve recently been getting weird crashes and slow behavior on my home server. I suspected that a disk failure was to blame, and I was right. Luckily I’m running btrfs in RAID1, so nothing was lost. Here’s how I went about finding and fixing the problem.

Find it

The most helpful thing is that from day one I set up a monthly scrub of the volume to detect and correct errors. This is trivial on Arch because the btrfs-progs package ships a systemd timer and service for it.

btrfs-scrub@.service

[Unit]
Description=Btrfs scrub on %f

[Service]
Nice=19
IOSchedulingClass=idle
ExecStart=/usr/bin/btrfs scrub start -B %f

btrfs-scrub@.timer

[Unit]
Description=Monthly Btrfs scrub on %f

[Timer]
OnCalendar=monthly
AccuracySec=1d
Persistent=true

[Install]
WantedBy=multi-user.target

That’s all in the package, so there’s no work for you to do. To scrub your /home partition, just enable and start the timer:

$ systemctl enable btrfs-scrub@home.timer
$ systemctl start  btrfs-scrub@home.timer
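
The instance name is just the mount point with systemd’s path escaping applied, so /home becomes home. For a nested mount point (say, a hypothetical /mnt/data), systemd-escape should give you the right name, and systemctl list-timers confirms the timer is actually scheduled:

$ systemd-escape --path /mnt/data
$ systemctl list-timers 'btrfs-scrub@*'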

This monthly scrub has been showing errors for years.

$ journalctl --unit=btrfs-scrub@home

Mar 01 00:00:00 eleven systemd[1]: Started Btrfs scrub on /home.
Mar 01 02:16:28 eleven btrfs[20021]: WARNING: errors detected during scrubbing, corrected
Mar 01 02:16:28 eleven btrfs[20021]: scrub done for 4d3ab8e2-acc4-44bf-bf79-f2870dabb705
Mar 01 02:16:28 eleven btrfs[20021]:         scrub started at Tue Mar  1 00:00:00 2016 and finished after 02:16:28
Mar 01 02:16:28 eleven btrfs[20021]:         total bytes scrubbed: 1.76TiB with 63 errors
Mar 01 02:16:28 eleven btrfs[20021]:         error details: read=63
Mar 01 02:16:28 eleven btrfs[20021]:         corrected errors: 63, uncorrectable errors: 0, unverified errors: 0
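
You don’t have to dig through the journal for this, either; btrfs scrub status should print the same summary for the most recent (or currently running) scrub on demand:

# btrfs scrub status /home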

But the error numbers were all small and correctable, so I ignored them… until a few months ago. I started getting lots of scary “uncorrectable errors” that would pop up in one scrub only to disappear in the next (I guess they got corrected?). And the total number of errors was steadily climbing.

This was worrying, but the server kept chugging along just fine, so I made a mental note and took no further action. When it started acting up with really slow boot times and crashes, I finally rolled up my sleeves and looked for the issue:

# btrfs device stats /home

[/dev/sdb].write_io_errs    16
[/dev/sdb].read_io_errs     10768
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0

So /dev/sdb is the culprit.
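
The drive’s own SMART data usually corroborates this. It’s not a btrfs tool, but if you have smartmontools installed, a quick look at the attributes (reallocated and pending sector counts are the classic signs of a dying disk) is worth it:

# smartctl -a /dev/sdb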

Fix it

First I need to figure out which disk that actually is, because I sure as hell don’t remember.

$ cat /sys/block/sdb/device/wwid

t10.ATA     WDC WD20EZRX-00D8PB0                         WD-WMC4M2261405
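
If that wwid file doesn’t exist on your machine, lsblk can print the model and serial number for every disk instead (NAME, MODEL and SERIAL are standard lsblk columns):

$ lsblk -d -o NAME,MODEL,SERIAL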

I don’t feel like buying that old model of HDD again, so how big of a drive do I need to replace it?

# btrfs filesystem show /home

Label: none  uuid: 4d3ab8e2-acc4-44bf-bf79-f2870dabb705
        Total devices 2 FS bytes used 573.50GiB
        devid    1 size 1.82TiB used 812.06GiB path /dev/sdb
        devid    2 size 1.82TiB used 812.04GiB path /dev/sdc

btrfs needs the replacement to be at least as big as the old drive, so any old 2TB drive will do. Plug that sucker in, and then it’s out with the old, in with the new:

# btrfs replace start -r /dev/sdb /dev/sdd /home
# btrfs replace status -1 /home

/dev/sdd is the new drive, and -r tells btrfs to read from the bad drive only as a last resort while mirroring everything. Since /dev/sdc is still in good shape, this should greatly speed up the replacement. The first command starts a background process, which you can monitor with the second one.

Took a bit over an hour:

Started on 17.Dec 12:52:45, finished on 17.Dec 14:18:18, 0 write errs, 0 uncorr. read errs
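
A couple of optional cleanup steps afterwards, sketched on the assumption that the replacement took over the old drive’s devid (1 in this case; check with btrfs filesystem show first): if the new drive is bigger than the old one, btrfs keeps using only the old size until you grow that device, and the error counters can be zeroed so future scrubs start from a clean slate:

# btrfs filesystem resize 1:max /home
# btrfs device stats -z /home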