File: mergeall-products/unzipped/docetc/miscnotes/pre-symlinks-code-jan2617/diffall.py

#!/usr/bin/python
# Python 3.X is recommended for trees with Unicode filenames
"""
################################################################################
Usage:
    [py[thon]] diffall.py dir1 dir2 [-recent [days=90]] [-skipcruft]
    
Recursive directory tree comparison: report unique files that exist in only
dir1 or dir2, report files of the same name in dir1 and dir2 with differing 
contents, report instances of same name but different type in dir1 and dir2,
and do the same for all subdirectories of the same names in and below dir1 
and dir2.  A summary of diffs appears at end of output, but search this
script's redirected output for "*DIFFER", "*UNIQUE", and "*MISSED" strings
for further difference details (per Sep-2016's update).

In sum, diffall compares full, byte-by-byte file content to verify that files
are truly the same.  It does not compare file modification times, as these
are not relevant to content equivalence.  See mergeall for a quicker but
shallower alternative that checks modification times but not full content
to detect file changes that warrant synchronization.

--------------------------------------------------------------------------------

CHANGE LOG:

New PP3E: limit reads to 1M for large files, catch same name=file/dir mixed
type cases.

New PP4E: avoid extra os.listdir() calls in dirdiff.comparedirs() by passing
results here along.

New March-2015, for mergeall 2.0: add the "-recent [days]" limited-comparison
option to compare just files changed in the last N days (else compares all
files), plus simple stats at end of report.  Also for 2.0, added explicit
file.close calls, for use outside CPython; we don't care about catching
exceptions here, as any will kill the script, and we're just reading anyway.

New Jan-2016: change incorrect "dirdiff" in usage message to "diffall".
Also print total diffall runtime for speed analysis and drive comparisons.
Caveat: may run quicker with os.scandir() instead of os.listdir() in Python
3.5+ (only!), but runtime is likely dominated by the exhaustive file reads
here, not listings; see mergeall.py for os.scandir() alternative in action.

New Sep-2016:

1) Changed difference labels slightly, so users can search the report for
uppercase '*UNIQUE' and '*DIFFER' (and the rare '*MISSED') to jump to
differences quickly.

2) Use mergeall's extended OPEN() to support long file pathnames on Windows.

3) The new '-skipcruft' mode, also added to mergeall, skips system cruft files
in both folders so they do not register as differences (and clutter the report
to the point of near unusability on some platforms that rhyme with "Mac"!).
See mergeall_configs.py for more on cruft metadata; the implementation of
cruft skipping is shared with mergeall and cpall.

4) Recoded the diffall algorithm so that rare mixed-type differences are
detected _before_ recurring into any subdirs.  This way, any "*MISSED" log
messages appear in the subject folder's section - not arbitrarily far ahead
after all subfolders' sections.  As it was, these showed up in the last
subfolder's section of the report, and listed their dirs only in the summary.

Note that the diffall algorithm must still use multiple loops over items, in
order to report file comparisons (and now mixed-type cases) _before_ starting
a new report section for subdirectories' content.  This structure differs
from mergeall's single-loop data builder scheme, but is deliberate.

5) The algorithm was also optimized slightly, to avoid running os.path.join()
on an item more than once (though the gain is likely negligible versus file IO,
and the speed tradeoff for the added list operations was not determined).

6) Further optimized later to replace os.path.join(x, y) with x + os.sep + y;
join() seems complex and slow overkill in simple and known path+file cases,
especially on Windows (see Python's Lib\ntpath.py).

OPTIMIZATION RESULTS:

The prior two points' optimizations had NO significant effect on diffall
runtimes.  See file:
    test/expected-output-3.0/optimizations-3.0/diffall-results.txt
for typical speed test results.  In sum, a diffall for an 87G SSD folder with
59K files and 3.5K folders runs in roughly 4 minutes 20 seconds - in BOTH the
prior and new diffall code.  The optimized version here may shave very low
single-digit seconds in some runs, but this is trivial in a 4 minute task.
Caveat: timing tests were run on Windows; other platforms may or may not agree.

Further optimizations based on different codings or the os.scandir()
alternative to os.listdir() in Python 3.5+ used by mergeall are also likely
to be pointless (and os.scandir() may run _slower_ on Mac OS X).  As expected,
the vast majority of this script's time is spent reading files in full, not in
analysis of structure.  As another metric, the mergeall comparison phase for
this same test folder runs in just 7.2 seconds - versus 4 minutes for a
byte-for-byte diffall.  The latter is clearly too IO-bound to speed further
in code, which is why mergeall was developed in the first place!

Given these results, the cpall script was not optimized; its runtimes are
even _more_ IO-bound by the need to write files (and probably reach hours on
slow drives).  Faster devices seem a better bet for speeding such programs.
################################################################################
"""

from __future__ import print_function     # ADDED: Py 2.X compatibility

import os, time, sys, dirdiff
from sys import argv
from fixlongpaths import OPEN             # [3.0] or 'as open', but too obscure 
from skipcruft import filterCruftNames    # [3.0] filter out metadata files

blocksize = 1024 * 1024                   # up to 1M per read
numdir = numfile = numskip = 0            # [2.0] a few stats

# [jan16] python/platform-specific current time (secs)
gettime = (time.perf_counter if hasattr(time, 'perf_counter') else
          (time.clock if sys.platform.startswith('win') else time.time)) 



def intersect(seq1, seq2):
    """
    Return all items in both seq1 and seq2;
    a set(seq1) & set(seq2) would work too, but sets are arbitrarily
    ordered, so any platform-dependent directory order would be lost
    """
    return [item for item in seq1 if item in seq2]
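
# A minimal sketch, not part of the original script: the list-based test in
# intersect() scans seq2 once per item of seq1.  Assuming names are hashable
# strings, a set can serve for membership tests only, while the result still
# follows seq1's order.  The name ordered_intersect is hypothetical.

```python
def ordered_intersect(seq1, seq2):
    """Return items of seq1 that also appear in seq2, in seq1's order."""
    lookup = set(seq2)                       # used only for membership tests
    return [item for item in seq1 if item in lookup]

names1 = ['b.txt', 'a.txt', 'sub', 'only1']
names2 = ['sub', 'only2', 'a.txt', 'b.txt']
print(ordered_intersect(names1, names2))     # ['b.txt', 'a.txt', 'sub']
```

# Given the timing results in the module docstring, any gain here is likely
# to be negligible next to file reads; the sketch only illustrates that
# order can be kept without a list scan.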



def recentlychanged(path1, path2, numdays=90):
    """
    [mergeall 2.0] return True if either path1 or path2 was modified
    in the last "numdays" days (default 90, if not passed, or not listed in
    the command-line).  This is really days-worth-of-seconds, but close enough.
    In large archives, most files will not have been changed recently, so
    this test can speed limited comparisons.  Library calls used here:
    --------------------------------------------------------------------
    >>> t1 = os.path.getmtime('python')
    >>> t2 = time.time()
    >>> t1, t2
    (1390862766.9136598, 1426117651.752781)
    >>> time.ctime(t1), time.ctime(t2)
    ('Mon Jan 27 14:46:06 2014', 'Wed Mar 11 15:47:31 2015')
    --------------------------------------------------------------------
    """
    modtime1 = os.path.getmtime(path1)      # in seconds since epoch 
    modtime2 = os.path.getmtime(path2)      # float in 3.X, normally in 2.X too
    nowtime  = time.time()
    secsback = numdays * (24 * 60 * 60)
    return (modtime1 > nowtime - secsback) or (modtime2 > nowtime - secsback)
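
# A minimal sketch of the cutoff arithmetic used by recentlychanged() above,
# applied to one path at a time.  It assumes Python 3.2+ (for
# tempfile.TemporaryDirectory); the names changed_within and demo_recent are
# hypothetical, and os.utime is used only to backdate a test file.

```python
import os, tempfile, time

def changed_within(path, numdays, now=None):
    """True if path's modtime falls within the last numdays days."""
    now = time.time() if now is None else now
    cutoff = now - numdays * (24 * 60 * 60)      # days-worth-of-seconds back
    return os.path.getmtime(path) > cutoff

def demo_recent():
    with tempfile.TemporaryDirectory() as tmp:
        oldfile = os.path.join(tmp, 'old.txt')
        newfile = os.path.join(tmp, 'new.txt')
        for path in (oldfile, newfile):
            with open(path, 'w') as f:
                f.write('data')
        longago = time.time() - 200 * (24 * 60 * 60)
        os.utime(oldfile, (longago, longago))    # backdate by 200 days
        return changed_within(oldfile, 90), changed_within(newfile, 90)

print(demo_recent())                             # (False, True)
```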


    
def comparetrees(dir1, dir2, diffs,
                 recent=False, numdays=0,
                 skipcruft=False,
                 verbose=False):
    """
    Compare all subdirectories and files in two directory trees;
    uses binary files to prevent Unicode decoding and endline transforms,
    as trees might contain arbitrary binary files as well as arbitrary text;
    may need bytes listdir arg for undecodable filenames on some platforms;
    [2.0] compare only files changed in last "numdays" days if "recent";
    [3.0] use mergeall's OPEN to support long file pathnames on Windows;
    [3.0] ignore system metadata files in dir1 and dir2 if skipcruft is True;
    [3.0] detect and report mixed-type diffs before processing any subdirs; 
    [3.0] optimized to avoid calling os.path.join() more than once per item;
    [3.0] optimized to use +os.sep+ instead of likely slower os.path.join();
    """
    global numdir, numfile, numskip   # [2.0]
    
    # compare file name lists (new report section)
    numdir += 1
    print('-' * 20)
    names1 = os.listdir(dir1)
    names2 = os.listdir(dir2)
    if skipcruft:
        # [3.0] ignore metadata files
        names1 = filterCruftNames(names1) 
        names2 = filterCruftNames(names2) 

    # detect and report unique items
    if not dirdiff.comparedirs(dir1, dir2, names1, names2):
        diffs.append('items UNIQUE at [%s] - [%s]' % (dir1, dir2))

    print('Comparing contents')
    common = intersect(names1, names2)

    # compare contents of files in common
    # report before any subdirs, and try this most common case first
    notfiles = []
    for name in common:
        path1 = dir1 + os.sep + name  # [3.0]
        path2 = dir2 + os.sep + name
        if os.path.isfile(path1) and os.path.isfile(path2):
            if recent and (not recentlychanged(path1, path2, numdays)):  # [2.0]
                numskip += 1                                             # [2.0]
                if verbose: print(name, 'skipped')
            else:
                numfile += 1
                file1 = OPEN(path1, 'rb')  # [3.0]
                file2 = OPEN(path2, 'rb')
                while True:
                    bytes1 = file1.read(blocksize)
                    bytes2 = file2.read(blocksize)
                    if (not bytes1) and (not bytes2):
                        if verbose: print(name, 'matches')
                        break
                    if bytes1 != bytes2:
                        diffs.append('files DIFFER at [%s] - [%s]' % (path1, path2))
                        print('*DIFFER:', name)
                        break
                file1.close()
                file2.close()  # [2.0]
        else:
            notfiles.append((name, path1, path2))

    # detect same name but not both files or dirs (rare)
    # [3.0] report before subdirs, and use cached paths for speed
    notmixed = []
    for (name, path1, path2) in notfiles:
        if not (os.path.isdir(path1) and os.path.isdir(path2)):
            diffs.append('items MISSED at [%s] - [%s]: [%s]' % (dir1, dir2, name))
            print('*MISSED:', name)
        else:
            notmixed.append((path1, path2))

    # recur to compare directories in common (the rest)
    # each subdir starts a new report section for its own content
    for (path1, path2) in notmixed:
        comparetrees(path1, path2, diffs,
                     recent, numdays, skipcruft, verbose)
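
# A sketch of the blockwise file comparison in comparetrees() above, recoded
# with "with" statements so both files are closed even if a read fails.  This
# is an alternative coding, not the script's own: it uses the built-in open
# instead of mergeall's OPEN (so it has no Windows long-path support), and
# the names same_content and demo_same_content are hypothetical.

```python
import os, tempfile

def same_content(path1, path2, blocksize=1024 * 1024):
    """Compare two files block by block; True if every byte matches."""
    with open(path1, 'rb') as file1, open(path2, 'rb') as file2:
        while True:
            bytes1 = file1.read(blocksize)
            bytes2 = file2.read(blocksize)
            if bytes1 != bytes2:
                return False              # differing content or length
            if not bytes1:
                return True               # both empty: end of both files

def demo_same_content():
    contents = {'a': b'abcdefghij', 'b': b'abcdefghij',
                'c': b'abcdefghiX', 'd': b'abcdefgh'}
    with tempfile.TemporaryDirectory() as tmp:
        paths = {}
        for name, data in contents.items():
            paths[name] = os.path.join(tmp, name)
            with open(paths[name], 'wb') as f:
                f.write(data)
        # tiny blocksize forces multiple reads per file
        return [same_content(paths['a'], paths[other], blocksize=4)
                for other in 'bcd']

print(demo_same_content())                # [True, False, False]
```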



def getargs():
    """
    [2.0] Args for command-line mode
    """
    try:
        extramsg = None
        recent, numdays = False, 90         # defaults
        skipcruft = False
        
        dir1, dir2 = sys.argv[1:3]          # first 2 command-line args
        if not os.path.isdir(dir1):         # exists and is a dir [2.0] [3.0]
            extramsg = 'dir1 is invalid'
            assert False
        if not os.path.isdir(dir2):         # exists and is a dir [2.0] [3.0]
            extramsg = 'dir2 is invalid'    # was: assert os.path.isdir(dir2)
            assert False
        if '-skipcruft' in sys.argv:
            skipcruft = True                # [3.0] skip metadata files
            sys.argv.remove('-skipcruft')
        if len(argv) > 3:
            assert argv[3] == '-recent'     # [2.0] last N days only
            recent = True
            if len(argv) > 4: numdays = int(argv[4])   # listed else 90
    except Exception:                       # not KeyboardInterrupt/SystemExit
        print('Usage: '
            '[py[thon]] diffall.py dir1 dir2 [-recent [days=90]] [-skipcruft]')
        if extramsg: print('Additional details:', extramsg)
        sys.exit(1)
    else:
        return (dir1, dir2, recent, numdays, skipcruft)
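
# A hedged alternative to the manual parsing in getargs() above, using the
# standard library's argparse.  This is a sketch only: unlike getargs(), it
# accepts the options in any position after the two directories, and it does
# not check that dir1 and dir2 are existing directories.  The function name
# parse_diffall_args is hypothetical.

```python
import argparse

def parse_diffall_args(arglist):
    """Return (dir1, dir2, recent, numdays, skipcruft) from an args list."""
    parser = argparse.ArgumentParser(prog='diffall.py')
    parser.add_argument('dir1')
    parser.add_argument('dir2')
    # "-recent" alone means 90 days; "-recent N" means N days
    parser.add_argument('-recent', nargs='?', type=int,
                        const=90, default=None, metavar='days')
    parser.add_argument('-skipcruft', action='store_true')
    args = parser.parse_args(arglist)
    recent  = args.recent is not None
    numdays = args.recent if recent else 90
    return (args.dir1, args.dir2, recent, numdays, args.skipcruft)

print(parse_diffall_args(['d1', 'd2', '-recent', '30', '-skipcruft']))
# ('d1', 'd2', True, 30, True)
```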



if __name__ == '__main__':
    """
    stand-alone/command-line mode;
    diffall isn't very useful otherwise, as it prints instead of returning,
    but its output might be parsed;  see also mergeall's variation on the
    comparisons run here, that builds explicit results data-structures;
    """
    dir1, dir2, recent, numdays, skipcruft = getargs()

    # walk, compare, change diffs in-place
    diffs = []
    starttime = gettime()                                  
    comparetrees(dir1, dir2, diffs, recent, numdays, skipcruft, True) 
    tottime = gettime() - starttime 

    # report time [jan16], stats [2.0]
    hours   = tottime // (60*60); tottime -= hours * (60*60)
    minutes = tottime //  60;     tottime -= minutes * 60
    print('=' * 80)
    print('Runtime hrs:mins:secs = %.0f:%.0f:%.2f'
                      % (hours, minutes, tottime))          
    print('Dirs checked %d, Files checked: %d, Files skipped: %d'
                      % (numdir, numfile, numskip))
    if skipcruft: print('System metadata (cruft) files were skipped')

    # report collected diffs list
    if not diffs:
        print('No diffs found.')
    else:
        print('Diffs found:', len(diffs))
        for diff in diffs: print('-', diff)
    print('End of report.')
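
# A small sketch of the hrs:mins:secs split used in the runtime report above,
# recoded with divmod and wrapped in a function so it can be tested apart
# from a full tree walk.  The name format_runtime is hypothetical; the format
# string is the one the script already prints.

```python
def format_runtime(tottime):
    """Split elapsed seconds into an hrs:mins:secs report string."""
    hours, rest   = divmod(tottime, 60 * 60)
    minutes, secs = divmod(rest, 60)
    return '%.0f:%.0f:%.2f' % (hours, minutes, secs)

print(format_runtime(3725.5))     # 1:2:5.50
```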


