Linux: find duplicate files by hash, list them, and remove duplicates safely

Use find with sha256sum (or md5sum) to list identical files on Linux, group by hash, then delete duplicate copies interactively or after a dry-run—plus how “duplicate by name” differs from same content.

Reviewed by Deepak Prasad

Finding identical files (same bytes) on Linux is a content problem: compare checksums, not just names. Searching for duplicates by name only catches clashing basenames; two different photos both called IMG_0001.jpg are not necessarily byte-identical. This page focuses on hash-based deduplication, then briefly covers the name-only case.

Commands below were run with GNU findutils, GNU coreutils (sha256sum, sort), and Bash 5.2.37 on Ubuntu 25.04 (kernel 6.14.0-37-generic). On BSD or macOS, find may lack -printf and sha256sum may not be installed, so adjust the commands that rely on them.
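Where GNU coreutils is not available, one hedged option is to use whichever SHA-256 tool is installed. The sketch below assumes shasum -a 256 (the Perl-based tool shipped with macOS) as the fallback; the function name sha256_of is invented for this example and is not part of any package.

```shell
#!/usr/bin/env bash
# sha256_of FILE: print only the hex digest of FILE.
# Assumption: `shasum -a 256` is the fallback when sha256sum is missing.
sha256_of() {
  local file=${1:?usage: sha256_of FILE} out
  if command -v sha256sum >/dev/null 2>&1; then
    out=$(sha256sum < "$file") || return
  elif command -v shasum >/dev/null 2>&1; then
    out=$(shasum -a 256 < "$file") || return
  else
    echo "no SHA-256 tool found" >&2
    return 1
  fi
  # Both tools print "digest  -" when reading stdin; keep the first field.
  printf '%s\n' "${out%% *}"
}
```

Reading the file on stdin keeps the path out of the tool's output, so only the digest has to be parsed.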


List duplicate files (same content) under a directory

find emits every regular file; sha256sum prints hash path; awk groups paths that share a hash:

bash
root=/path/to/scan
find "$root" -type f -exec sha256sum {} + | awk '
{
  h=substr($0,1,64)
  p=substr($0,67)
  paths[h]=paths[h] p ORS
  c[h]++
}
END {
  for (h in c)
    if (c[h] > 1)
      printf "---- %s (%d copies)\n%s", h, c[h], paths[h]
}'

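GNU sha256sum prints hash␠␠path (two spaces after the 64 hex digits), so the substr($0,67) slice keeps paths with spaces intact. Two caveats: awk's for (h in c) visits the groups in no particular order, and when a path contains a backslash or newline, sha256sum starts the line with a backslash and escapes the name, which shifts these column offsets.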
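As a variation on the listing above, the following sketch sorts by hash first, so groups come out in a stable order and awk does not need to hold every path in memory. The function name list_dupes is an assumption for this example, and it relies on the same GNU sha256sum output layout.

```shell
#!/usr/bin/env bash
# list_dupes DIR: print groups of byte-identical regular files under DIR.
# Assumptions: GNU sha256sum layout; no newlines or backslashes in paths.
list_dupes() {
  local root=${1:?usage: list_dupes DIR}
  find "$root" -type f -exec sha256sum {} + |
    sort |
    awk '
      {
        h = substr($0, 1, 64)   # digest: 64 hex characters
        p = substr($0, 67)      # path: everything after the two-space separator
        if (h == prev) {
          # Second member of a group: print the header and the first path once.
          if (!shown) { print "---- " h; print first; shown = 1 }
          print p
        } else {
          prev = h; first = p; shown = 0
        }
      }'
}
```

Calling list_dupes /path/to/scan prints one ---- header per hash followed by every path that shares it; files without a twin produce no output.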


Remove duplicate files: keep one copy, delete the rest (interactive)

Never parse ls. Hash first, sort so equal sums are adjacent, split each line into hash<TAB>path (GNU sha256sum: two spaces after the digest, path starts at column 67), then walk groups in Bash:

bash
#!/usr/bin/env bash
set -euo pipefail
root=${1:?usage: $0 /path/to/scan}

prev_sum=
group=()

flush_group() {
  ((${#group[@]} <= 1)) && { group=(); return 0; }
  local keep="${group[0]}"
  printf '---- %d identical files, keeping:\n%s\n' "${#group[@]}" "$keep" >&2
  local f ans
  for f in "${group[@]:1}"; do
    ans=
    # Read the answer from the terminal: the loop's stdin carries the hash list.
    read -r -p "delete \"$f\"? [y/N] " ans </dev/tty || true
    if [[ ${ans:-N} == [yY]* ]]; then
      rm -f -- "$f" && printf 'deleted: %s\n' "$f" >&2
    fi
  done
  group=()
}

while IFS=$'\t' read -r sum path; do
  [[ -z $sum ]] && continue
  if [[ -n ${prev_sum:-} && $sum != "$prev_sum" ]]; then
    flush_group
  fi
  prev_sum=$sum
  group+=("$path")
done < <(
  find "$root" -type f -exec sha256sum {} + |
    awk '{ print substr($0,1,64) "\t" substr($0,67) }' |
    sort -t $'\t' -k1,1
)
flush_group

This flow always keeps the first path in each sorted group. Because sort falls back to comparing whole lines when the hashes tie, that is the alphabetically first path; change the sort if you would rather keep the newest copy.
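If the newest copy should survive instead, one possible approach is to add each file's modification time as a secondary sort key. The sketch below assumes GNU stat (-c %Y prints mtime in epoch seconds); the function name hash_list_newest_first is hypothetical.

```shell
#!/usr/bin/env bash
# hash_list_newest_first DIR: print "hash<TAB>path" lines grouped by hash,
# with the most recently modified file first inside each group.
# Assumptions: GNU stat and sha256sum; no newlines or backslashes in paths.
hash_list_newest_first() {
  local root=${1:?usage: hash_list_newest_first DIR}
  local line path
  # Build hash<TAB>mtime<TAB>path, sort by hash then mtime descending,
  # and finally drop the mtime column again.
  find "$root" -type f -exec sha256sum {} + |
    while IFS= read -r line; do
      path=${line:66}   # skip 64 hex digits plus two spaces
      printf '%s\t%s\t%s\n' "${line:0:64}" "$(stat -c %Y -- "$path")" "$path"
    done |
    sort -t $'\t' -k1,1 -k2,2nr |
    cut -f1,3-
}
```

The output has the same hash<TAB>path shape that the interactive script's read loop consumes, so calling this function inside the process substitution, in place of the find, awk, and sort pipeline, would make the newest file the one that is kept. Running stat once per file is slow on large trees; this is a sketch rather than a tuned implementation.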


Dry-run: print rm lines for every duplicate except the first

After sort -k1,1, emit rm only when the hash repeats:

bash
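find "$root" -type f -exec sha256sum {} + | sort -k1,1 | awk -v q="'" '
{
  h = substr($0, 1, 64); p = substr($0, 67)
  if (h == prev && prev != "") {
    gsub(q, q "\\" q q, p)      # escape embedded single quotes
    print "rm -f -- " q p q     # wrap the whole path in single quotes
  }
  prev = h
}'

Review the lines, then pipe them to sh or bash only if you accept the targets. Each path is wrapped in single quotes, with embedded quotes escaped, so names containing spaces or shell metacharacters survive the second pass through the shell.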
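One way to make the review step explicit is to write the rm lines to a file, read it, and only then execute it. The sketch below does the quoting with Bash's printf %q, so the generated file has to be run with bash rather than sh; the function name emit_rm_script and the output file name are made up for illustration.

```shell
#!/usr/bin/env bash
# emit_rm_script DIR: print one "rm -f -- PATH" line per redundant copy.
# The first path of each hash group (in sort order) is kept.
# Assumptions: GNU sha256sum layout; no newlines or backslashes in paths.
emit_rm_script() {
  local root=${1:?usage: emit_rm_script DIR}
  local prev='' sum path
  while IFS=$'\t' read -r sum path; do
    if [[ $sum == "$prev" ]]; then
      # %q quotes the path so Bash reads it back as a single word.
      printf 'rm -f -- %q\n' "$path"
    fi
    prev=$sum
  done < <(
    find "$root" -type f -exec sha256sum {} + |
      awk '{ print substr($0, 1, 64) "\t" substr($0, 67) }' |
      sort -t $'\t' -k1,1
  )
}
```

A typical session would be emit_rm_script /path/to/scan > remove-dupes.sh, then less remove-dupes.sh, then bash remove-dupes.sh once every target looks right.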

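For a packaged alternative, fdupes -r DIR and jdupes -r DIR list duplicate groups without deleting anything; files are removed only when you add -d. Note that -n is not a dry-run switch (in fdupes it is --noempty, which skips zero-length files), so read man fdupes or man jdupes before running -d on real data.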


Faster checks: size first, then hash

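On huge trees, speed things up by grouping candidates by size first and running sha256sum only within equal-size buckets. A file whose size is unique cannot have a duplicate, so it never needs to be opened and read. find -printf '%s\t%p\n' emits size and path in a form that awk or a shell loop can consume directly.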
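The size-first idea could be sketched as follows. This is a minimal illustration, assuming Bash 4 or later (for associative arrays) and GNU find and xargs; the function name size_then_hash is an assumption of this example.

```shell
#!/usr/bin/env bash
# size_then_hash DIR: hash only files whose byte size is shared with another
# file, and print sorted "hash  path" lines for the grouping steps above.
# Assumptions: Bash 4+, GNU find -printf, no newlines in paths.
size_then_hash() {
  local root=${1:?usage: size_then_hash DIR}
  local -A count=()
  local -a sizes=() paths=() candidates=()
  local size path i

  # Pass 1: collect sizes from metadata only; no file contents are read.
  while IFS=$'\t' read -r size path; do
    sizes+=("$size")
    paths+=("$path")
    count[$size]=$(( ${count[$size]:-0} + 1 ))
  done < <(find "$root" -type f -printf '%s\t%p\n')

  # Pass 2: keep only files that share their size with at least one other.
  for i in "${!paths[@]}"; do
    if [[ ${count[${sizes[i]}]} -gt 1 ]]; then
      candidates+=("${paths[i]}")
    fi
  done

  [[ ${#candidates[@]} -gt 0 ]] || return 0
  # NUL-delimited hand-off keeps paths with spaces intact.
  printf '%s\0' "${candidates[@]}" | xargs -0 sha256sum -- | sort
}
```

Files with equal size but different bytes are still hashed and simply end up with different digests, so the result is the same as hashing everything, only with fewer reads.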


Find duplicate files by name (not content)

Same basename, possibly different bytes:

bash
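find "$root" -type f -printf '%f\t%p\n' | sort | awk -F'\t' '
# Same basename as the previous line: print the first path once, then this one.
$1 == prev { if (!shown) print prevpath; print $2; shown = 1; next }
{ prev = $1; prevpath = $2; shown = 0 }
'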

Use this when you care about naming collisions, not byte-identical copies.
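To check whether two same-named files are also byte-identical, cmp -s compares contents directly and reports only through its exit status. A small sketch, with the helper names same_bytes and describe_pair invented for this example:

```shell
#!/usr/bin/env bash
# same_bytes A B: succeed only when both files have identical contents.
same_bytes() {
  cmp -s -- "$1" "$2"
}

# describe_pair A B: report whether a name collision is also a content match.
describe_pair() {
  if same_bytes "$1" "$2"; then
    printf 'identical\t%s\t%s\n' "$1" "$2"
  else
    printf 'different\t%s\t%s\n' "$1" "$2"
  fi
}
```

For example, describe_pair photos/IMG_0001.jpg backup/IMG_0001.jpg (hypothetical paths) prints a line starting with identical or different, followed by both paths.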

Finding duplicate lines inside a single text file is a different problem: use sort file | uniq -d or an awk counter. This page is about duplicate files on disk.
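For completeness, the awk counting approach mentioned above might look like this. It is a sketch only; the function name dup_lines is hypothetical.

```shell
#!/usr/bin/env bash
# dup_lines FILE: print each line that occurs more than once, one time each,
# in the order the repeats are first detected (no sorting needed).
dup_lines() {
  # seen[$0]++ == 1 is true exactly when a line shows up for the second time.
  awk 'seen[$0]++ == 1' < "${1:?usage: dup_lines FILE}"
}
```

Unlike sort file | uniq -d, this keeps the file's original order, at the cost of holding every distinct line in memory.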



Summary

To find duplicate files with confidence, run find … -type f -exec sha256sum {} + (or md5sum) and group on the hash column. When removing duplicates, default to listing and prompting, or to a dry-run before rm. Identical files are a matter of content; a by-name search is only a basename report and does not prove that files match. Combine size filters with hashes on large trees, and prefer find -exec … + over parsing ls output in any Bash script you ship. For duplicate lines inside one file, use sort | uniq -d instead of hashing paths.
