File: mergeall-products/unzipped/test/test-path-normalization-3.3/prototype-recoding-oct22/py-split-join.py

r"""
=========================================================================
Mergeall demo script: prototype for path normalization.

Scan arbitrary paths one component at a time, on Windows and Unix.
This allows each component's NFC/NFD Unicode representation to be
normalized to match its counterpart in the same path in a target tree.

This coding is convoluted mostly to support wacky Windows paths
and Python's uneven handling of them.  os.path.join(), for example,
doesn't do Windows absolute (C:\xxx) and drive-relative (C:xxx) very
well; both split off a drive (C:) but a post-split join may drop \ in
both (os.path is ntpath on Windows and posixpath on Unix):

>>> import ntpath as nt
>>> nt.join(*r'C:\aaa\bbb'.split(nt.sep))
'C:aaa\\bbb'
>>> 
>>> nt.join(*r'C:aaa\bbb'.split(nt.sep))
'C:aaa\\bbb'

A similar dissonance arises for the leading / in Unix absolute and 
relative paths:

>>> import posixpath as pt
>>> pt.join(*'/aaa/bbb'.split(pt.sep))
'aaa/bbb'
>>> pt.join(*'aaa/bbb'.split(pt.sep))
'aaa/bbb'

For all the gory background details on Windows path syntax, try this:
https://learn.microsoft.com/en-us/dotnet/standard/io/file-path-formats

(Why the mess?  In general, Windows allows network storage to be accessed
by arguably ad-hoc and convoluted syntax rather than Unix-style mount
points in a uniform tree model, though network storage can optionally
be mapped to Windows drive letters too.)

This coding handles almost everything, _except_ exceedingly rare Windows
"\\.\UNC\" and "\\?\UNC\"--which are officially not supported here barring
a user request.  These paths will generate a warning and skip normalizing
to the target device's format.  The only real alternatives are to never
normalize paths, or normalize the entire path (not each component) and 
assume all components' Unicode encoding flavors will be the same.  That
is unsound, given that content may pass through a mix of hosts and apps.

Context: 

This path-walker procedure is used when about to delete a file whose 
content-relative path is listed in the __added__.txt file of a deltas
or backup set.  The listed path is partial, and relative to the content
root folder in both the FROM and TO trees.  Because this path is not 
absolute, it cannot contain drive or network specifiers; hence, we're 
interested only in normalizing its component folder and file names 
to the same in the destination-device tree.  

Importantly, though, the NFC/NFD Unicode variants in path names in TO 
may differ arbitrarily from those in __added__.txt, because the TO and 
FROM trees may reside on different platforms, and may have been processed
by programs with arbitrary Unicode policies over their lifespans.

Also note:

- The __added__.txt partial path has already had its separators 
  changed for the target device, to make it portable.

- This test script doesn't care about too-long Windows paths, but 
  the live code will; it wraps each path in FWP() to test existence.
  Paths here have not yet had FWP() prefixing applied.

- The "TO" root path which is prefixed to the __added__.txt path to 
  yield the destination path we walk may come in here as relative, 
  absolute, or other; it originates in a command line, and hasn't 
  been made absolute prior to this procedure.

Though complex, it's crucial to get this right, because the resulting 
normalized path will be deleted in TO.  That said, most (if not all) 
__added__.txt paths exist without normalization, and will simply 
bypass this procedure entirely.  At least until they do not.
=========================================================================
"""
trace = lambda *a: None    # print or lambda *a: None
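

# A minimal sketch, not the live Mergeall code: the docstring above says
# device-UNC paths are unsupported and should be warned about and skipped.
# One way to spot them is a prefix test.  The names UNSUPPORTED_PREFIXES
# and isunsupported() are hypothetical, and the case-insensitive match on
# "UNC" is an assumption.

UNSUPPORTED_PREFIXES = ('\\\\.\\UNC\\', '\\\\?\\UNC\\')


def isunsupported(path):
    """
    True if path starts with a device-UNC prefix (\\\\.\\UNC\\ or \\\\?\\UNC\\)
    """
    return path.upper().startswith(UNSUPPORTED_PREFIXES)


# Possible use: a caller could test isunsupported(path) first, and on True
# issue its warning and bypass the component walk instead of normalizing.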



def walkparts(path, mod):                       
    """
    mod would be os.path in portable code
    it's ntpath or posixpath here to test both
    """
    print()
    print('==>', '"%s"' % path)
    trace('===', mod.abspath(path))             # to see what it really means
    trace('-->', path.split(mod.sep))           # preview the parts split list
    trace('~~~', mod.splitdrive(path))          # required for Windows shenanigans

    drive, rest = mod.splitdrive(path)
    if drive and rest.startswith(mod.sep):      # drive for abs and drive-relative
        sofar = drive                           # rest starts with sep iff abs
        parts = rest.split(mod.sep)
        trace(':::', parts)
    else:
        sofar = ''                              # no drive, normal components
        parts = path.split(mod.sep)

    if parts[0] == '':                          # empty for abs path/rest, win+ux
        sofar += mod.sep                        # make join() work, skip the empty
        parts = parts[1:]
        
    while parts:
        part, *parts = parts                    # 'part' avoids shadowing next()
        newpath = mod.join(sofar, part)
        # test/mod 'part' extension to 'sofar' here
        print('...', '"%s" =' % newpath, 
              '(%s) + [%s]' % (sofar, part), '<%s>' % mod.exists(newpath))
        sofar = newpath

    return sofar
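

# A minimal sketch of what the test/mod step noted in walkparts() might do,
# not the live Mergeall code: given the path walked so far and the next
# component, look for a name in that folder that is the same text under a
# different NFC/NFD form.  The name matchcomponent() and its 'listdir'
# parameter (added here so the logic can be tried without a real folder)
# are hypothetical; comparing on NFC is one choice, NFD would work as well.

import os
import unicodedata


def matchcomponent(parent, name, listdir=os.listdir):
    """
    Return the entry in folder 'parent' equal to 'name' after Unicode
    normalization, or None if there is none; an exact match wins.
    """
    try:
        entries = listdir(parent or os.curdir)      # '' means the cwd
    except OSError:
        return None                                 # missing or unreadable
    if name in entries:
        return name                                 # no normalization needed
    wanted = unicodedata.normalize('NFC', name)
    for entry in entries:
        if unicodedata.normalize('NFC', entry) == wanted:
            return entry                            # same text, other form
    return None


# Possible use in the walk: replace the component with the match, if any,
# before joining it to the path so far:
#     found = matchcomponent(sofar, part); part = found or part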



def testWindowsPaths():
    print('\n\ntesting windows paths'.upper() + '=' * 40)
    import ntpath as mod
    paths = [
        r'C:\Users\lutz\file.txt',        # absolute
        r'C:Users\lutz\file.txt',         # relative to drive's cwd
        r'\Users\lutz\file.txt',          # relative to current drive's root
        r'Users\lutz\file.txt',           # relative to process cwd

        r'\\Server\Share\folder\file.ext',             # unc network shares, generally
        r'\\readyshare\USB_Storage\Temp\file.txt',     # a live samba server path

        r'\\.\C:\Users\lutz\file.txt',                 # device paths
        r'\\?\C:\Users\lutz\file.txt',                 # enable long paths

        r'\\127.0.0.1\c$\Users\lutz\file.txt',         # don't ask
        r'\\LOCALHOST\c$\Users\lutz\file.txt',

        r'\\.\UNC\127.0.0.1\c$\Users\lutz\file.txt',   # FAILS
        r'\\.\UNC\LOCALHOST\c$\Users\lutz\file.txt',   # FAILS
        r'\\?\UNC\LOCALHOST\c$\Users\lutz\file.txt',   # FAILS
 
        r'C:\Users\lutz\Desktop\..\file.txt',     # parent reference
        'C:\\',                                   # drive root
        'C:',                                     # drive relative
        '.',                                      # process cwd
        r'.\Users\lutz',                          # process-cwd relative
        r'c:\users\Lutz\fILE.txt',                # case insensitive
        ''                                        # empty - cannot happen, but okay
    ]
    for path in paths:
        result = walkparts(path, mod)
        if path == '':
            print('~~~', result)                # '' yields '\' but impossible
        else:
            assert result == path               # all others should be as passed



def testUnixPaths():
    print('\n\ntesting unix paths'.upper() + '=' * 40)
    import posixpath as mod
    paths = [
        '/Users/lutz/file.txt',      # absolute
        'Users/lutz/file.txt',       # relative to process cwd

        '/Users/lutz/Desktop/../file.txt',      # parent reference
        '/',                                    # filesystem root
        '.',                                    # process cwd
        './Users/lutz',                         # process-cwd relative
        '/users/Lutz/fILE.txt',                 # case sensitive (when run on Unix)
        ''                                      # empty - cannot happen, but okay
    ]
    for path in paths:
        result = walkparts(path, mod)
        if path == '':
            print('~~~', result)                # '' yields '/' but impossible
        else:
            assert result == path               # all others should be as passed
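

# A small illustration, under stated assumptions, of why the docstring calls
# whole-path normalization unsound: if a path's components were saved in
# different Unicode forms, neither the NFC nor the NFD form of the entire
# path reproduces the names actually stored.  The function name
# wholepathmatches() and the sample paths are hypothetical.

import unicodedata


def wholepathmatches(path):
    """
    True if normalizing the whole path to NFC or NFD yields path itself
    """
    forms = [unicodedata.normalize(form, path) for form in ('NFC', 'NFD')]
    return path in forms


def demoMixedForms():
    uniform = 'cafe\u0301/re\u0301sume\u0301.txt'    # all components NFD
    mixed   = 'cafe\u0301/r\u00e9sum\u00e9.txt'      # NFD folder, NFC file
    return wholepathmatches(uniform), wholepathmatches(mixed)


# demoMixedForms() reports that the uniform path is recovered by whole-path
# normalization but the mixed one is not, hence the per-component walk.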



if __name__ == '__main__':
    # Go to the relative-paths root (or not)
    import os, sys
    if sys.platform.startswith('win'):
        os.chdir('C:\\')
    else:
        os.chdir('/')

    """
    ---------------------------------------------------------------------
    Test Windows and Posix paths on either Windows or Unix, by directly
    using ntpath and posixpath modules (os imports one as path, by host).
    Both modules do path parsing on either platform, but file existence
    will vary: most Windows-paths won't exist on Unix, Unix paths will 
    be treated as drive- or cwd-relative on Windows (where "/" == "\"),
    and case may matter on Unix but not Windows (subject to filesystems).
    User name may vary too; on Unix, did "su; ln -s me lutz" to equate.
    ---------------------------------------------------------------------
    """

    testWindowsPaths()
    testUnixPaths()


