Finding identical files on Linux (same bytes) is a content problem: compare
checksums, not only names. Finding duplicate files by name only catches
clashing basenames, and two different photos both called IMG_0001.jpg are not
necessarily byte-identical. This page focuses on hash-based dedupe, then covers
the name-only case briefly.
Commands below were run with GNU findutils, GNU coreutils (sha256sum, sort),
and Bash 5.2.37 on Ubuntu 25.04 (kernel 6.14.0-37-generic). Adjust for BSD if
you drop -printf or sha256sum.
List duplicate files (same content) under a directory
find emits every regular file; sha256sum prints hash path; awk
groups paths that share a hash:
root=/path/to/scan
find "$root" -type f -exec sha256sum {} + | awk '
{
h=substr($0,1,64)
p=substr($0,67)
paths[h]=paths[h] p ORS
c[h]++
}
END {
for (h in c)
if (c[h] > 1)
printf "---- %s (%d copies)\n%s", h, c[h], paths[h]
}'
GNU sha256sum prints hash␠␠path (two spaces after the 64 hex digits). The
substr($0,67) slice keeps paths with spaces intact.
linux remove duplicate files: keep one copy, delete the rest (interactive)
Never parse ls. Hash first, sort so equal sums are adjacent, split each line
into hash<TAB>path (GNU sha256sum: two spaces after the digest, path starts
at column 67), then walk groups in Bash:
#!/usr/bin/env bash
set -euo pipefail
root=${1:?usage: $0 /path/to/scan}
prev_sum=
group=()
flush_group() {
((${#group[@]} <= 1)) && { group=(); return 0; }
local keep="${group[0]}"
printf '---- %d identical files, keeping:\n%s\n' "${#group[@]}" "$keep" >&2
local f ans
for f in "${group[@]:1}"; do
# Prompt from the terminal: inside the loop, stdin is the hash pipeline.
read -r -p "delete \"$f\"? [y/N] " ans </dev/tty || true
if [[ ${ans:-N} == [yY]* ]]; then
rm -f -- "$f" && printf 'deleted: %s\n' "$f" >&2
fi
done
group=()
}
while IFS=$'\t' read -r sum path; do
[[ -z $sum ]] && continue
if [[ -n ${prev_sum:-} && $sum != "$prev_sum" ]]; then
flush_group
fi
prev_sum=$sum
group+=("$path")
done < <(
find "$root" -type f -exec sha256sum {} + |
awk '{ print substr($0,1,64) "\t" substr($0,67) }' |
sort -t $'\t' -k1,1
)
flush_group
That is a practical remove duplicate files linux flow: you always keep the first
path in each sorted group (alphabetical by path; change sort if you prefer
newest wins).
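If newest wins is the ordering you want, one option is to carry the modification time next to the hash and sort on both. The following is a minimal sketch under stated assumptions: hash_newest_first is a hypothetical helper name, stat -c %Y is the GNU stat spelling for seconds since the epoch, and file names are assumed to contain no newlines. It starts two processes per file, so it is slower than the find -exec … + form.

```shell
#!/usr/bin/env bash
# hash_newest_first (hypothetical name) prints hash<TAB>mtime<TAB>path,
# sorted by hash and then by mtime descending, so the newest file
# in each hash group comes first.
hash_newest_first() {
  local root=$1 f sum mtime
  while IFS= read -r -d '' f; do
    # Reading from stdin keeps the file name out of the sha256sum output.
    sum=$(sha256sum < "$f") || continue
    # GNU stat: %Y is the modification time in seconds since the epoch.
    mtime=$(stat -c %Y -- "$f") || continue
    printf '%s\t%s\t%s\n' "${sum%% *}" "$mtime" "$f"
  done < <(find "$root" -type f -print0) |
    sort -t $'\t' -k1,1 -k2,2nr
}
```

A loop that consumes this output would read three fields, for example while IFS=$'\t' read -r sum mtime path, and keep the first path of each group as before.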
Dry-run: print rm lines for every duplicate except the first
After sort -k1,1, emit rm only when the hash repeats:
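find "$root" -type f -exec sha256sum {} + | sort -k1,1 | awk '
{
h=substr($0,1,64); p=substr($0,67)
if (h==prev && prev!="") {
# \047 is a single quote: wrap the path and escape any quote inside it
gsub("\047", "\047\\\047\047", p)
print "rm -f -- \047" p "\047"
}
prev=h
}'
Review the lines, then re-run through sh or bash only if you accept the
targets. p is everything after the two-space separator, so paths with spaces
are captured whole, and the single quotes keep each path as one argument when
the lines are fed to a shell. File names containing a newline or backslash are
a separate case: GNU sha256sum escapes them and prefixes the line with a
backslash, so the fixed-column slices no longer line up.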
For a packaged alternative, jdupes and fdupes list duplicate groups without
deleting when run without -d (add -r to recurse). Note that fdupes -n means
--noempty, not dry-run. Read man jdupes before running jdupes -d on real data.
Faster checks: size first, then hash
To find identical files faster on huge trees, group candidates by size first,
then run sha256sum only within equal-size buckets. find -printf '%s\t%p\n'
feeds awk cleanly, and a file with a unique size is never opened or read,
because it cannot have a byte-identical twin.
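The size-then-hash idea can be sketched as a single pipeline. This is an illustration under stated assumptions, not a tuned tool: dupes_size_then_hash is a hypothetical helper name, the flags used (find -printf, xargs -d and -r, uniq -w and --all-repeated) are GNU extensions, and file names are assumed to contain no newlines.

```shell
#!/usr/bin/env bash
# dupes_size_then_hash (hypothetical name):
#   1. find lists size<TAB>path for every regular file.
#   2. awk keeps only paths whose size occurs more than once.
#   3. sha256sum hashes that shortlist; sort puts equal digests together.
#   4. uniq compares the first 64 characters (the digest) and prints
#      only repeated digests, one blank line between groups.
dupes_size_then_hash() {
  find "$1" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
      {
        size = $1
        path = substr($0, length(size) + 2)   # everything after the first tab
        n[size]++
        list[size] = list[size] path "\n"
      }
      END {
        for (s in n)
          if (n[s] > 1)
            printf "%s", list[s]
      }' |
    xargs -r -d '\n' sha256sum -- |
    sort -k1,1 |
    uniq -w64 --all-repeated=separate
}
```

Called as dupes_size_then_hash /path/to/scan, it prints hash␠␠path lines in the same format as sha256sum, so the substr($0,67) slice used elsewhere on this page still applies. The awk step holds every path in memory, which is usually acceptable but worth knowing on very large trees.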
linux find duplicate files by name (not content)
Same basename, possibly different bytes:
find "$root" -type f -printf '%f\t%p\n' | sort | awk -F'\t' '
$1==prev { print prevpath ORS $2 }
{ prev=$1; prevpath=$2 }
'
Use this when you care about naming collisions, not byte-identical copies.
linux find duplicates in file (repeated lines inside one text file) is a
different problem: use sort file | uniq -d or awk counting—this page is
about duplicate files on disk.
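For completeness, both approaches to repeated lines fit in a few lines each. The function names dup_lines_sorted and dup_lines_in_order are hypothetical and exist only for this sketch.

```shell
#!/usr/bin/env bash
# Prints each repeated line once, in sorted order.
dup_lines_sorted() {
  sort < "$1" | uniq -d
}

# Prints each repeated line once, at its second appearance, without sorting.
# seen[$0]++ returns the count before incrementing, so it equals 1 exactly
# when the line is being seen for the second time.
dup_lines_in_order() {
  awk 'seen[$0]++ == 1' < "$1"
}
```

The sort form suits large files, since sort can spill to disk, while the awk form preserves the file's own order at the cost of keeping every distinct line in memory.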
Related
- awk tutorial for heavier grouping logic.
- check if script is already running if you wrap long dedupe jobs with a lock.
Summary
linux find duplicate files with confidence means find … -type f -exec sha256sum {} + (or md5sum) and grouping on the hash column. linux remove
duplicate files should default to listing and prompting, or a dry-run
before rm. linux find identical files is content-driven; linux find
duplicate files by name is a basename report and does not prove files match.
Combine size filters with hashes on large trees, and prefer find -exec … +
over parsing ls output in any bash find duplicate files script you ship.
For duplicate lines inside one file, use sort | uniq -d instead of hashing
paths.

