Two years ago, I was trying to use hashdeep to make sure that my files, particularly JPEGs, were not bit rotting. I’ve had dozens of JPEGs die in the past because of bit rot. A bad RAID controller killed a bunch of them 10 years ago, so I’ve been obsessed with having correct copies. I tried using hashdeep to validate at the user level that JPEGs were not corrupted, but it would run for a while on the Mac and then crash because the network connection to the file server was not stable.
So I abandoned this effort and now I’m just trying to get everything copied properly. I have a bunch of different files on different servers now, so hopefully this will be less of a problem, particularly since I’m keeping decades’ worth of snapshots. Even if a block gets corrupted, hopefully an older copy of that block survives somewhere else. I should probably just take complete snapshots and stuff them into AWS Glacier at some point as more insurance, but for now, hopefully the storage stays stable. I do have btrfs bit rot checking as well and am running RAID10 drives, so that will help a little bit too.
Using an rsync Dry Run to Do the Same Thing
But in the course of doing all this, I needed to remember how to verify whether two directory trees are the same, so into the hell that is rsync once again. The main flag that is needed is -c, which means don’t just compare the size and modification time but actually go through each file and compute a checksum to see if the copies are identical. This catches files that are corrupted on the target drive, but doesn’t defend against corruption in the source.
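The second trick is a subtle one: a trailing slash on the source argument (the first one) changes what rsync does. Without the slash, rsync copies the directory itself, so it looks for a child of the same name inside the destination. With the slash, it copies the directory’s contents straight into the destination, so the two trees are compared at the same level. When the destination has a different name, you will almost always want the trailing slash on the source: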
# This command will rsync into ./Backup/Personal so it looks for a child
rsync -vnarcP ./Personal ./Backup
# This command will rsync directly into ./Backup/Personal-2022-03-04
rsync -vnarcP ./Personal/ ./Backup/Personal-2022-03-04
If you squint sideways you can see why this makes sense. The first command is more convenient because it assumes you are copying into something of the same name, while the second lets you change the name of the directory at the same time. It is still very confusing!
Also in terms of the many flags, the most important ones are:
-n
this means a dry run. You pretty much always want to try it first: it gives a list of the files that it would change without making any changes. This is really useful for checking whether two directories are actually the same without doing the copy, so it acts like a poor man’s hashdeep but is more reliable.
-v
this is verbose mode and you nearly always want it. It shows the directories it is traversing, which is a pain, but if you look at the log with vi, then a quick :v/\/$/p will show what files are actually going to be copied. You can delete the ones that are Mac artifacts like .DS_Store or .afp_deleted if you want to see what is really there.
-c
as mentioned above, instead of using the size and date modified, it actually does a checksum, so it is very slow. If you have trouble remembering it, the long form is --checksum.
-a
is archive mode, so the owners and the permissions are copied too.
-r
means recursive, but it isn’t really needed here because -a already implies it.
-P
shows progress while copies are happening, but it is not super meaningful with -n (the dry run).
Comparing on a Server that is SMB Connected
The simplest thing to do is to mount the server, say a Synology, over SMB and run rsync against the mounted share. This is a bit slower than the next trick, which is to set up an rsync server. In that case the machine on the other end does the rsync work itself, so the file data does not have to cross the wire: the checksums are computed on the server and only those are sent to your client.
Setting up an Rsync Server on Synology
Unlike hashdeep, this doesn’t leave a hash file, so you can’t just compare hashes later to see if anything changed, but it seems to work way better and does not have the SMB long-running connection problem. Synology has an rsync server built in, and on a regular machine you can start a local rsync server from the command line. The only problem is that you need to make sure you can see the files: normally when you ssh into a Synology you will only see the home directory.
Finally, you can use the remote rsync syntax to compare files on your NAS with files on another system to see if there is bit rot. The command below will tell you which files are new or different in ./Pictures compared with what you have on the server; substitute your own NAS hostname and user account. If ssh on the server listens on a non-standard port, here 1122, you need the -e option to pass that port to ssh. Note that it will ask for a password if one is set.
You need to make sure that the account you are using, which in the line below is the rsync account, has the proper access to the share you want.
rsync -avncP -e "ssh -p 1122" ./Pictures/ rsync@myserver.local:/Personal/Pictures > pictures.log
See If There Are Differences with grep -v
Once you have run this, you need to figure out whether it found differences. The confusing thing about the -v option is that directories are listed in addition to the files that don’t exist or should be copied. So, at least on the Mac, you can filter out the lines that are just directories, which end with a trailing slash. grep’s -v (invert match) option does this, and you can give it several -e expressions: one to drop lines with a trailing slash and others to drop files like .DS_Store that are part of the Mac file system. What is left is a list of the real differences that you can look at:
grep -v -e '/$' -e "\.DS_Store$" pictures.log