dupd(1)                     General Commands Manual                    dupd(1)

NAME
       dupd - find duplicate files

SYNOPSIS
       dupd COMMAND [OPTIONS]

DESCRIPTION
       dupd scans all the files in the given path(s) to find files with dupli‐
       cate content.

       The sets of duplicate files are not displayed during a scan.   Instead,
       the  duplicate  info is saved into a database which can be queried with
       subsequent commands without having to scan all files again.

       Even though dupd can be used as a simple duplicate reporting tool simi‐
       lar  to  how  other duplicate finders work (by running dupd scan ; dupd
       report), the real power of dupd comes from interactively exploring  the
       filesystem  for  duplicates after the scan has completed. See the file,
       ls, dups, uniques and refresh commands. Read  also  the  section  STALE
       DATABASE for more context.

       Additional  documentation and examples are available under the docs di‐
       rectory in the source tree. If you don't have the  source  tree  avail‐
       able, see https://github.com/jvirkki/dupd/blob/master/docs/index.md

COMMANDS
       As  noted  in the synopsis, the first argument to dupd must be the com‐
       mand to run.  The command is one of:

       scan - scan files looking for duplicates

       report - show duplicate report from last scan

       file - check for duplicates of one file

       ls - list info about every file

       dups - list all duplicate files

       uniques - list all unique files

       refresh - remove deleted files from the database

       validate - revalidate all duplicates in database

       rmsh - create shell script to delete all duplicates (use with care!)

       help - show brief usage info

       usage - show this documentation

       man - show this documentation

       license - show license info

       version - show version and exit

OPTIONS
       scan - Perform the filesystem scan for duplicates.

       -p, --path PATH
              Recursively scan the directory tree starting at this path.   The
              path  option can be given multiple times to specify multiple di‐
              rectory trees to scan.  If no path option is given, the  default
              is to start scanning from the current directory.

       -m, --minsize SIZE
              Minimum size (in bytes) to include in scan.  By default all
              files of 1 byte or more are scanned.  In practice, duplicates
              among very small files are rarely interesting, so you can speed
              up the scan by ignoring smaller files.

       --buflimit LIMIT
              Limit read buffer size. LIMIT may be an integer in bytes, or in‐
              clude  a  suffix of M for megabytes or G for gigabytes. The scan
              animation shows the percentage of buffer space in use (%b).  Un‐
              less that value goes up to 100% or beyond during a scan there is
              no need to adjust this limit.  Setting this limit to a low value
              will  constrain dupd memory usage but possibly at a cost to per‐
              formance (depends on the data set).

       -X, --one-file-system
              For each path scanned, do not cross over to a different filesys‐
              tem.   This  is  helpful, for example, if you want to scan / but
              want to avoid any other mounted filesystems such as  NFS  mounts
              or external drives.

       --hidden
              Include  hidden  files (and hidden directories) in the scan.  By
              default these are not included.

       --db PATH
              Override the default database file  location.   The  default  is
              $HOME/.dupd_sqlite.   If  you override the path during scan, re‐
              member to provide this argument and the path for subsequent  op‐
              erations so the database can be found.

       -I, --hardlink-is-unique
              Consider  hard links to the same file content as unique.  By de‐
              fault hard links are listed as duplicates.  See HARD LINKS  sec‐
              tion  below.   Note that if this option is given during scan, it
              cannot be given during interactive operations.

       --stats-file FILE
              On completion, create (or append to) FILE and save some stats
              from the run.  These are the same stats that are displayed in
              verbose mode, in a form more suitable for programmatic
              consumption.

       report - Display the list of duplicates.

       --cut PATHSEG
              Remove prefix PATHSEG from the file paths in the report  output.
              This  can  reduce  clutter  in  the output text if all the files
              scanned share a long identical prefix.

       --minsize SIZE
              Report only duplicate sets which consume at least this much disk
              space.   Note  this is the total size occupied by all the dupli‐
              cates in a set, not their individual file size.

       --format NAME
              Produce the report in this output format.  NAME is one of  text,
              csv, json.  The default is text.
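
       As an illustration of the --format option, the following sketch saves
       the report to a file in one of the documented formats.  The function
       name export_report and its arguments are hypothetical and not part of
       dupd; only the report command and the --format option come from this
       page.

```shell
# Sketch only: export_report is a hypothetical helper, not a dupd feature.
export_report() {
    report_format=$1
    output_file=$2

    # Accept only the format names documented for --format.
    case $report_format in
        text|csv|json) ;;
        *) echo "unsupported format: $report_format" >&2; return 2 ;;
    esac

    # The report is written to stdout, so redirect it to the chosen file.
    dupd report --format "$report_format" > "$output_file"
}
```

       For example, export_report json dups.json would write the report to
       the file dups.json in the current directory.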

       Note:  The  database  format  generated by scan is not guaranteed to be
       compatible with future versions. You should run  report  (and  all  the
       other  commands below which access the database) using the same version
       of dupd that was used to generate the database.

       file - Report duplicate status of one file.

       To check whether a given file still has known duplicates, use the file
       operation.  Note that this does not do a new scan, so it will not find
       new duplicates.  It checks whether the duplicates identified during
       the previous scan still exist and verifies (by hash) whether they are
       still duplicates.

       --file PATH
              Required: The file to check

       --cut PATHSEG
              Remove prefix PATHSEG from the file paths in the report output.

       --exclude PATH
              Ignore any duplicates  under  PATH  when  reporting  duplicates.
              This  is  useful  if  you intend to delete the entire tree under
              PATH, to make sure you don't delete all copies of the file.

       --hardlink-is-unique
              Ignore the existence of hard links to the file for  the  purpose
              of considering whether the file is unique.

       ls, uniques, dups - List matching files.

       While  the  file  command checks the duplicate status of a single file,
       these commands do the same for all the files in a given directory tree.
       Please  note  that  files  which are smaller than the minimum file size
       processed during the scan operation will not be shown.

       ls - List all files, show whether they have duplicates or not.

       uniques - List all unique files.

       dups - List all files which have known duplicates.

       --path PATH
              Start from this directory (default is current directory)

       --cut PATHSEG
              Remove prefix PATHSEG from the file paths in the output.

       --exclude PATH
              Ignore any duplicates under PATH when reporting duplicates.

       --hardlink-is-unique
              Ignore the existence of hard links to the file for  the  purpose
              of considering whether the file is unique.

       refresh - Refreshing the database.

       Duplicate files you remove remain listed in the dupd database until it
       is updated.  Ideally you'd run the scan again to rebuild the database.
       Note that re-running the scan after deleting some duplicates can be
       very fast because the files are in the cache, so that is the best
       option.

       However, when dealing with a set of files large enough that they  don't
       fit  in the cache, re-running the scan may take a long time.  For those
       cases the refresh command offers a much faster alternative.

       The refresh command checks whether all the files in the  dupd  database
       still exist and removes those which do not.

       Be sure to consider the limitations of this approach.  The refresh com‐
       mand does not re-verify whether all  files  listed  as  duplicates  are
       still  duplicates.   It also, of course, does not detect any new dupli‐
       cates which may have appeared since the last scan.

       In summary, if you have only been deleting duplicates since the  previ‐
       ous scan, run the refresh command.  It will prune all the deleted files
       from the database and will be much faster than a scan.  However, if you
       have been adding and/or modifying files since the last scan, it is best
       to run a new scan.

       validate - Validating the database.

       The validate operation is primarily for testing but is documented  here
       as it may be useful if you want to reconfirm that all duplicates in the
       database are still truly duplicates.

       In most cases you will be better off re-running the scan operation  in‐
       stead of using validate.

       Validate  is  fairly slow as it will fully hash every file in the data‐
       base.

       rmsh - Create shell script to remove duplicate files.

       As a policy dupd never modifies the filesystem!

       As a convenience for those times when it is desirable to automatically
       remove files, this operation can create a shell script to do so.  The
       output is a shell script (to stdout) which you can run to delete your
       files (if you're feeling lucky).

       Review  the generated script carefully to see if it truly does what you
       want!

       Automated deletion is generally not very useful because it takes  human
       intervention  to decide which of the duplicates is the best one to keep
       in each case.  While the content is the same, one of them  may  have  a
       better file name and/or location.

       Optionally,  the shell script can create either soft or hard links from
       each removed file to the copy being kept.  The options are mutually ex‐
       clusive.

       --link Create symlinks for deleted files.

       --hardlink
              Create hard links for deleted files.
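
       One cautious way to use rmsh is to save the generated script to a file
       and review it before running anything.  The sketch below does only
       the saving step.  The function name write_removal_script is
       hypothetical; the only dupd feature it relies on is that rmsh writes
       the script to stdout.

```shell
# Sketch only: write_removal_script is a hypothetical helper.
# It saves the generated script but deliberately never executes it.
write_removal_script() {
    script_path=$1

    # rmsh prints the shell script on stdout; capture it in a file.
    dupd rmsh > "$script_path" || return 1

    echo "Wrote $script_path; review it carefully before running it." >&2
}
```

       After reviewing the file, it can be run by hand, for example with
       sh followed by the path that was given to the function.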

       Additional global options

       -q     Quiet, suppress all output.

       -v     Verbose  mode.  Can be repeated multiple times for ever increas‐
              ing verbosity.

       -V, --verbose-level N
              Set the logging verbosity level directly to N.

       -h     Show brief help summary.

       --db PATH
              Override the default database file location.

       -F, --hash NAME
              Specify a different hash function.  This applies to any command
              which  uses  content  hashing.   NAME is one of: md5 sha1 sha512
              xxhash
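
       Because an overridden database location must be repeated on every
       later command, it can help to keep the path in one place.  The
       following is a minimal sketch of that idea.  The variable DUPD_DB,
       its example value and the function dupd_db are hypothetical names
       chosen for illustration; dupd itself does not read this variable.

```shell
# Sketch only: DUPD_DB and dupd_db are hypothetical, not dupd features.
# Keep the database path in one variable so scan and later commands agree.
DUPD_DB=${DUPD_DB:-/tmp/dupd_example.sqlite}

dupd_db() {
    subcommand=$1
    shift
    # Pass --db on every invocation, followed by any remaining options.
    dupd "$subcommand" --db "$DUPD_DB" "$@"
}
```

       With this in place, dupd_db scan --path /some/tree followed by
       dupd_db report would both use the same database file.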

HARD LINKS
       Are hard links duplicates or not?  The answer depends on "what  do  you
       mean by duplicates?" and "what are you trying to do?"

       If your primary goal for removing duplicates is to save disk space then
       it makes sense to ignore hard links.  If, on the other hand, your
       primary goal is to reduce filesystem clutter then it makes more sense
       to think of hard links as duplicates.

       By default dupd considers hard links as duplicates.  You can switch
       this around with the --hardlink-is-unique option.  This option can be
       given either during scan or to the interactive reporting commands
       (file, ls, uniques, dups).
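
       The distinction can be seen with standard tools, independently of
       dupd.  In the sketch below, two hard-linked names share one inode, so
       removing one of them frees no disk space, while a copy has its own
       inode and typically consumes additional space.  The file names and
       the helper inode_of are made up for the illustration.

```shell
# Illustration only: uses standard tools, not dupd itself.
workdir=$(mktemp -d)
printf 'same content\n' > "$workdir/original"
ln "$workdir/original" "$workdir/hardlink"   # second name for the same inode
cp "$workdir/original" "$workdir/copy"       # separate file, same content

# Print the inode number of a file (first field of ls -i output).
inode_of() { ls -i "$1" | awk '{print $1}'; }

inode_original=$(inode_of "$workdir/original")
inode_hardlink=$(inode_of "$workdir/hardlink")
inode_copy=$(inode_of "$workdir/copy")

echo "original=$inode_original hardlink=$inode_hardlink copy=$inode_copy"
```

       The original and the hard link report the same inode number, while
       the copy reports a different one even though its content is
       identical.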

STALE DATABASE
       Much of the flexibility of dupd comes from the fact that the duplicate
       sets (and optionally unique files) are saved in the database, which
       makes exploratory usage possible.  However, it is important to be
       aware of the limitations of this database.

       The database is created during a dupd scan operation.  Unless  you  are
       scanning  a  read-only  filesystem, it is possible that files get modi‐
       fied, added and deleted after that point in time by actions outside the
       knowledge of dupd.  This means that over time the database becomes
       increasingly stale compared to the content of the files.  To discourage
       use of an old database, dupd prints a warning if the db is older than 3
       days.  But this is an arbitrary value and files may change moments
       after being scanned.

       If  you  are  only deleting duplicate files from the data set, refer to
       the refresh command to update the database. However, if files are being
       modified  and/or added, then it is best to re-run a dupd scan operation
       whenever the data set has changed sufficiently,  which  is  a  judgment
       call.
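
       A script that wants to decide for itself whether to rescan could check
       the age of the database file before querying it.  The following is a
       rough sketch, separate from the warning dupd prints.  It assumes that
       the file modification time is a reasonable proxy for the time of the
       last scan, which is an assumption and not something stated on this
       page.  The function name db_is_stale is hypothetical, and the -mmin
       test is an extension supported by GNU and BSD find.

```shell
# Sketch only: db_is_stale is a hypothetical helper, not a dupd feature.
# Returns 0 if the file was last modified more than max_days days ago,
# 1 if it is newer, and 2 if the file does not exist.
db_is_stale() {
    db_path=$1
    max_days=$2

    [ -e "$db_path" ] || return 2

    # find prints the path only if its modification time is older than
    # the given number of minutes (1440 minutes per day).
    [ -n "$(find "$db_path" -prune -mmin +"$((max_days * 1440))")" ]
}

# Example using the default database location and the 3 day threshold.
if db_is_stale "$HOME/.dupd_sqlite" 3; then
    echo "database is more than 3 days old; consider a new dupd scan" >&2
fi
```

       The threshold is as arbitrary here as it is in dupd; a data set that
       changes often may call for a much shorter one.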

EXAMPLES
       Scan  all files in your home directory and then show the sets of dupli‐
       cates found:

              % dupd scan --path $HOME

              % dupd report

       Show duplicate status (duplicate or unique) for all files in docs  sub‐
       directory:

              % dupd ls --path docs

       Before deleting docs/old.doc, check one last time that it is a
       duplicate and review where those duplicates are:

              % dupd file --file docs/old.doc -v

       Read the documentation in the dupd 'docs' directory or online  documen‐
       tation for more usage examples.

EXIT
       dupd exits with status code 0 on success, non-zero on error.
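
       In scripts, the exit status can be used to avoid reporting from a
       database whose scan did not complete.  A minimal sketch follows; the
       function name scan_then_report is hypothetical, and only the scan and
       report commands and the --path option come from this page.

```shell
# Sketch only: scan_then_report is a hypothetical helper.
# Runs the report only if the scan exited with status 0.
scan_then_report() {
    scan_root=$1

    if ! dupd scan --path "$scan_root"; then
        echo "dupd scan failed; not running report" >&2
        return 1
    fi

    dupd report
}
```

       The same effect can be had on one line with the shell && operator
       between the two commands.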

SEE ALSO
       sqlite3(1)

       https://github.com/jvirkki/dupd/blob/master/docs/index.md

                                                                       dupd(1)
