File: mergeall-products/unzipped/docetc/miscnotes/pre-longpaths-code-mar0617/diffall.py

File: mergeall-products/unzipped/docetc/miscnotes/pre-longpaths-code-mar0617/diffall.py
#!/usr/bin/python
# Python 3.X is recommended for trees with Unicode filenames
"""
################################################################################
Usage:
    [py[thon]] diffall.py dir1 dir2 [-recent [days=90]] [-skipcruft]
    
Recursive directory tree comparison: report unique files that exist in only
dir1 or dir2, report files (and symlinks) of the same name in dir1 and dir2
with differing contents, report instances of same name but different type in
dir1 and dir2, and do the same for all subdirectories of the same names in
and below dir1 and dir2.  A summary of diffs appears at end of output, but
search this script's redirected output for "*DIFFER", "*UNIQUE", and "*MISSED"
strings for further difference details (per Sep-2016's update below).

In sum, diffall compares full, byte-by-byte file content to verify that files
are truly the same.  It does not compare file modification times, as these
are not relevant to content equivalence.  See mergeall for a quicker but
shallower alternative that checks modification times but not full content
to detect file changes that warrant synchronization.

--------------------------------------------------------------------------------
CHANGE LOG

New PP3E: limit reads to 1M for large files, catch same name=file/dir mixed
type cases.

New PP4E: avoid extra os.listdir() calls in dirdiff.comparedirs() by passing
results here along.

New March-2015, for mergeall 2.0: add "-recent [days]" limited comparisons
option to compare just files changed in last days (else compares all files),
plus simple stats at end of report.  Also for 2.0, added explicit file.close
calls, for use outside CPython; we don't care about catching exceptions here,
as any kill the script, and we're just reading in any event.

New Jan-2016: change incorrect "dirdiff" in usage message to "diffall".
Also print total diffall runtime for speed analysis and drive comparisons.
Caveat: may run quicker with os.scandir() instead of os.listdir() in Python
3.5+ (only!), but runtime is likely dominated by the exhaustive file reads
here, not listings; see mergeall.py for os.scandir() alternative in action.

New Sep-2016..Jan-2017:

0) Compare, but do not follow symbolic links (symlinks).  Otherwise, may
compare arbitarily-large items referenced by intra-archive links > once.
Coded as a pretest to avoid changing existing code, treats links to both
files and dirs the same, and compares their reference-path strings.

1) Changed difference labels slightly, so users can search the report for
uppercase '*UNIQUE' and '*DIFFER' (and the rare '*MISSED') to jump to
differences quickly.

2) Use mergeall's extended OPEN() to support long file pathnames on Windows.

3) The new '-skipcruft' mode, also added to mergeall, skips system cruft files
in both folders so they do not register as differences (and clutter the report
to the point of near unusability on some platforms that rhyme with "Mac"!).
See mergeall_configs.py for more on cruft metadata; the implementation of
cruft skipping is shared with mergeall and cpall.

4) Recoded the diffall algorithm so that rare mixed-type differences are
detected _before_ recurring into any subdirs.  This way, any "*MISSED" log
messages appear in the subject folder's section - not arbitrarily far ahead
after all subfolders' sections.  As it was, these showed up in the last
subfolder's section of the report, and listed their dirs only in the summary.

Note that the diffall algorithm must still use multiple loops over items, in
order to report file comparisons (and now mixed-type cases) _before_ starting
a new report section for subdirectories' content.  This structure differs
from mergeall's single-loop data builder scheme, but is deliberate.

5) The algorithm was also optimized slightly, to avoid running os.path.join()
on an item more than once (though the gain is likely negligible versus file IO,
and the speed tradeoff for the added list operations was not determined).

6) Further optimized later to replace os.path.join(x, y) with x + os.sep + y;
join() seems complex and slow overkill in simple and known path+file cases,
especially on Windows (see Python's Lib\ntpath.py).

OPTIMIZATION RESULTS:

The prior 2 point's optimizations had NO significant effect on diffall
runtimes.  See file:
    test/expected-output-3.0/optimizations-3.0/diffall-results.txt
for typical speed test results.  In sum, a diffall for an 87G SSD folder with
59K files and 3.5K folders runs in roughly 4 minutes 20 seconds - in BOTH the
prior and new diffall code.  The optimized version here may shave very low
single-digit seconds in some runs, but this is trivial in a 4 minute task.
Caveat: timing tests were run on Windows; other platforms may or may not agree.

Further optimizations based on different codings or the os.scandir()
alternative to os.listdir() in Python 3.5+ used by mergeall are also likely
to be pointless (and os.scandir() may run _slower_ on Mac OS X).  As expected,
the vast majority of this script's time is spent reading files in full, not in
analysis of structure.  As another metric, the mergeall comparison phase for
this same test folder runs in just 7.2 seconds - versus 4 minutes for a
byte-for-byte diffall.  The latter is clearly too IO-bound to speed further
in code, which is why mergeall was developed in the first place!

Given these results, the cpall script was not optimized; its runtimes are
even _more_ IO-bound by the need to write files (and probably reach hours on
slow drives).  Faster devices seem a better bet for speeding such programs.
################################################################################
"""

from __future__ import print_function     # ADDED: Py 2.X compatibility

import os, time, sys, dirdiff
from sys import argv
from fixlongpaths import OPEN             # [3.0] or 'as open', but too obscure 
from skipcruft import filterCruftNames    # [3.0] filter out metadata files

blocksize = 1024 * 1024                   # up to 1M per read
numdir = numfile = numskip = 0            # [2.0] a few sats

# [jan16] python/platform-specific current time (secs)
gettime = (time.perf_counter if hasattr(time, 'perf_counter') else
          (time.clock if sys.platform.startswith('win') else time.time)) 



def intersect(seq1, seq2):
    """
    ---------------------------------------------------------------------------
    Return all items in both seq1 and seq2;
    a set(seq1) & set(seq2) would work too, but sets are randomly 
    ordered, so any platform-dependent directory order would be lost
    ---------------------------------------------------------------------------
    """
    return [item for item in seq1 if item in seq2]



def recentlychanged(path1, path2, numdays=90):
    """
    ---------------------------------------------------------------------------
    [mergeall 2.0] return True if either path1 or path2 was modified
    in last "days" days (default 90, if not passed, or not listed in the
    command-line).  This is really days-worth-of-seconds, but close enough.
    In large achives, most files will not have been changed recently, so
    this test can speed limited comparisons.  Library calls used here:
    ---------------------------------------------------------------------------
    >>> t1 = os.path.getmtime('python')
    >>> t2 = time.time()
    >>> t1, t2
    (1390862766.9136598, 1426117651.752781)
    >>> time.ctime(t1), time.ctime(t2)
    ('Mon Jan 27 14:46:06 2014', 'Wed Mar 11 15:47:31 2015')
    ---------------------------------------------------------------------------
    """
    modtime1 = os.path.getmtime(path1)      # in seconds since epoch 
    modtime2 = os.path.getmtime(path2)      # float in 3.X, int in 2.X?
    nowtime  = time.time()
    secsback = numdays * (24 * 60 * 60)
    return (modtime1 > nowtime - secsback) or (modtime2 > nowtime - secsback)


    
def comparetrees(dir1, dir2, diffs,
                 recent=False, numdays=0,
                 skipcruft=False,
                 verbose=False):
    """
    ---------------------------------------------------------------------------
    Compare all subdirectories and files in two directory trees;
    uses binary files to prevent Unicode decoding and endline transforms,
    as trees might contain arbitrary binary files as well as arbitrary text;
    may need bytes listdir arg for undecodable filenames on some platforms;
    [2.0] compare only files changed in last "numdays" days if "recent";
    [3.0] use mergeall's OPEN to support long file pathnames on Windows;
    [3.0] ignore system metadata files in dir1 and dir2 if skipcruft is True;
    [3.0] detect and report mixed-type diffs before processing any subdirs; 
    [3.0] optimized to avoid calling os.path.join() more than once per item;
    [3.0] optimized to use +os.sep+ instead of likely slower os.path.join();
    ---------------------------------------------------------------------------
    """
    global numdir, numfile, numskip   # [2.0]
    
    # compare file name lists (new report section)
    numdir += 1
    print('-' * 20)
    names1 = os.listdir(dir1)
    names2 = os.listdir(dir2)
    if skipcruft:
        # [3.0] ignore metadata files
        names1 = filterCruftNames(names1) 
        names2 = filterCruftNames(names2) 

    # detect and report unique items
    if not dirdiff.comparedirs(dir1, dir2, names1, names2):
        diffs.append('items UNIQUE at [%s] - [%s]' % (dir1, dir2))

    # get names common to both dirs
    print('Comparing contents')
    common = intersect(names1, names2)

    #----------------------------------------------------------------------
    # compare contents of files (and links) in common
    # report before any subdirs, and try this most-common case first
    #----------------------------------------------------------------------
    
    notfiles = []
    for name in common:
        path1 = dir1 + os.sep + name  # [3.0]
        path2 = dir2 + os.sep + name

        if os.path.islink(path1) or os.path.islink(path2):
            # [3.0] handle symlinks to files and dirs specially here
            if os.path.islink(path1) and os.path.islink(path2):
                # both are links: read
                numfile += 1
                link1 = os.readlink(path1)   # str path name
                link2 = os.readlink(path2)
                if link1 == link2:
                    if verbose: print(name, 'matches')
                else:
                    diffs.append('links DIFFER at [%s] - [%s]' % (path1, path2))
                    print('*DIFFER:', name)
            else:
                # only one link: mixed
                diffs.append('items MISSED at [%s] - [%s]: [%s]' % (dir1, dir2, name))
                print('*MISSED:', name)
        
        elif os.path.isfile(path1) and os.path.isfile(path2):
            # file+file: skip full reads if not recently changed 
            if recent and (not recentlychanged(path1, path2, numdays)):  # [2.0]
                numskip += 1                                             # [2.0]
                if verbose: print(name, 'skipped')
            else:
                numfile += 1
                file1 = OPEN(path1, 'rb')  # [3.0]: long paths
                file2 = OPEN(path2, 'rb')
                while True:
                    bytes1 = file1.read(blocksize)
                    bytes2 = file2.read(blocksize)
                    if (not bytes1) and (not bytes2):
                        if verbose: print(name, 'matches')
                        break
                    if bytes1 != bytes2:
                        diffs.append('files DIFFER at [%s] - [%s]' % (path1, path2))
                        print('*DIFFER:', name)
                        break
                file1.close()
                file2.close()  # [2.0]

        else:
            # pass others to next phase (non-link dirs, mixes, fifos)
            notfiles.append((name, path1, path2))

    #----------------------------------------------------------------------
    # detect same name but not both files or dirs (rare)
    # [3.0] report before subdirs, and use cached paths for speed
    #----------------------------------------------------------------------
    
    notmixed = []
    for (name, path1, path2) in notfiles:
        if not (os.path.isdir(path1) and os.path.isdir(path2)):
            diffs.append('items MISSED at [%s] - [%s]: [%s]' % (dir1, dir2, name))
            print('*MISSED:', name)
        else:
            notmixed.append((path1, path2))

    #----------------------------------------------------------------------
    # recur to compare non-link directories in common (the rest)
    # each subdir starts a new report section for its own content
    #----------------------------------------------------------------------
    
    for (path1, path2) in notmixed:
        comparetrees(path1, path2, diffs,
                     recent, numdays, skipcruft, verbose)



def getargs():
    """
    ---------------------------------------------------------------------------
    [2.0] Args for command-line mode
    ---------------------------------------------------------------------------
    """
    try:
        extramsg = None
        recent, numdays = False, 90         # defaults
        skipcruft = False
        
        dir1, dir2 = sys.argv[1:3]          # first 2 command-line args
        if not os.path.isdir(dir1):         # exists and is a dir [2.0] [3.0]
            extramsg = 'dir1 is invalid'
            assert False
        if not os.path.isdir(dir2):         # exists and is a dir [2.0] [3.0]
            extramsg = 'dir2 is invalid'    # was: assert os.path.isdir(dir2)
            assert False
        if '-skipcruft' in sys.argv:
            skipcruft = True                # [3.0] skip metadata files
            sys.argv.remove('-skipcruft')
        if len(argv) > 3:
            assert argv[3] == '-recent'     # [2.0] last N days only
            recent = True
            if len(argv) > 4: numdays = int(argv[4])   # listed else 90
    except:
        print('Usage: '
            '[py[thon]] diffall.py dir1 dir2 [-recent [days=90]] [-skipcruft]')
        if extramsg: print('Additional details:', extramsg)
        sys.exit(1)
    else:
        return (dir1, dir2, recent, numdays, skipcruft)



if __name__ == '__main__':
    """
    ---------------------------------------------------------------------------
    stand-alone/command-line mode;
    diffall isn't very useful otherwise, as it prints instead of returning,
    but its output might be parsed;  see also mergeall's variation on the
    comparisons run here, that builds explicit results data-structures;
    ---------------------------------------------------------------------------
    """
    dir1, dir2, recent, numdays, skipcruft = getargs()

    # walk, compare, change diffs in-place
    diffs = []
    starttime = gettime()                                  
    comparetrees(dir1, dir2, diffs, recent, numdays, skipcruft, True) 
    tottime = gettime() - starttime 

    # report time [jan6], stats [2.0]
    hours   = tottime // (60*60); tottime -= hours * (60*60)
    minutes = tottime //  60;     tottime -= minutes * 60
    print('=' * 80)
    print('Runtime hrs:mins:secs = %.0f:%.0f:%.2f'
                      % (hours, minutes, tottime))          
    print('Dirs checked %d, Files checked: %d, Files skipped: %d'
                      % (numdir, numfile, numskip))
    if skipcruft: print('System metadata (cruft) files were skipped')

    # report collected diffs list
    if not diffs:
        print('No diffs found.')
    else:
        print('Diffs found:', len(diffs))
        for diff in diffs: print('-', diff)
    print('End of report.')