Remove duplicate files
December 3rd, 2008
This is a slightly modified version of the script published here. It lets you scan files for duplicates based on their md5 checksums.
#!/bin/bash
# rd - remove duplicates
# find the files using the specified 'find arguments'
find "$@" -type f -print0 |
# calculate checksum for each file
xargs -0 -n1 md5sum |
# sort on the checksum
sort --key=1,32 |
# show remove command for each duplicate file
awk 'dup[$1]++{print "rm -f " $2}'
exit 0
The script is safe to use: it is not able to delete files itself. Instead, it generates a script that does the risky stuff, so you can review the rm commands before anything is removed.
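One caveat: the awk step prints `$2`, which is only the first whitespace-separated word of the file name, so a name containing spaces would produce an incomplete rm command. Below is a minimal sketch of one way around this, assuming the md5sum output layout of a 32-character checksum, two separator characters, and then the name. It takes the name from column 35 onward and wraps it in single quotes. The function name rd_quoted is made up for illustration and is not part of the original script.

```shell
# rd_quoted - hypothetical variant of rd that single-quotes each file name
rd_quoted() {
    # find the files using the specified 'find arguments'
    find "$@" -type f -print0 |
    # calculate checksum for each file
    xargs -0 -n1 md5sum |
    # sort the lines; among equal checksums the first name in sort order is kept
    sort |
    # assumption: the file name starts at column 35 of each md5sum line.
    # Names containing backslashes or newlines are escaped by md5sum
    # and are not handled by this sketch.
    awk -v q="'" 'dup[$1]++ {
        name = substr($0, 35)
        gsub(q, q "\\" q q, name)   # escape embedded single quotes for the shell
        print "rm -f -- " q name q
    }'
}
```

Called as `rd_quoted .`, it prints the same kind of rm commands as rd, one per duplicate, with each name quoted so the output can still be piped to sh.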
Usage
To see what files are marked as duplicate in the current working directory:
$ rd .
rm -f ./config_backup_2008-11-06_11.30.01.tar.bz2
rm -f ./config_backup_2008-11-07_11.30.01.tar.bz2
rm -f ./config_backup_2008-11-08_11.30.01.tar.bz2
If you like the result, you can execute the generated commands. This can be done by piping the output to the shell:
$ rd . | sh
Running rd can take some time when there are a lot of (big) files, and piping to the shell means scanning everything a second time. In that case you can instead copy the output of the first run and paste it into the terminal.
Since the script passes all of its arguments to the find command, it’s also possible to fine-tune the search. For example, to remove duplicates only in the current directory, without searching subdirectories:
$ rd . -maxdepth 1
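Because the extra arguments end up in front of `-type f`, other find tests can narrow the scan in the same way. Before starting a long checksum pass, it may help to preview which files a given argument list selects. The helper below is a hypothetical sketch (rd_preview is not part of the original script) that repeats only the find step of rd:

```shell
# rd_preview - hypothetical helper: list the files that rd would checksum
# for the given find arguments, without computing any checksums
rd_preview() {
    # same call as in rd, minus -print0, so the names print one per line
    find "$@" -type f
}
```

For example, `rd_preview . -maxdepth 1 -name '*.tar.bz2'` lists only the .tar.bz2 files directly in the current directory. If the list looks right, the same arguments can be passed to rd.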
I’m using the script to remove duplicate backup sets.
