mergeall — Folder Synchronization for Manual "Clouds"

Latest substantial update: June 2017 (f/k/a Usage-Overview.html)

Preface

This is the original whitepaper on mergeall usage. It has been subsumed by the newer and more user-focused User Guide as of version 3.0, which you should consult first for up-to-date coverage. This older document has been retained here with only minor edits, for the additional background context it provides on mergeall's features and roles. For version-specific development history, see also the Revisions document.

Please note: all of the older screenshots referenced in this document have been removed to minimize mergeall package size, and their links here were removed. See the newer screenshots collection for up-to-date GUI examples, and please pardon the dearth of GUI shots here; this document has been retained for its alternative perspective and project-history value only.

Introduction

This page describes and gives usage pointers for mergeall—an open source Python 3.X/2.X script and tkinter GUI useful for managing backups and changes in multiple copies of large data sets (a.k.a. archives) stored in directory trees (a.k.a. folders). mergeall is specifically targeted at quickly synchronizing changes in content mirrored across multiple devices such as laptops, tablets and USB flashdrives, and in some contexts can provide a manual alternative to cloud-based storage.


The Short Story

This system makes a destination folder the same as a source folder quickly.

It first detects differences by walking the two directory trees in full, comparing their structure, and checking their files' timestamps and sizes, with an optional limited content test. It then resolves variances by automatically or selectively running in-place changes for differing items only—copying changed and unique items in the source to the destination, and pruning unique items in the destination. These changes are applied to both files and folders, as described in more detail ahead.
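
The timestamp-and-size test described above can be sketched in a few lines of Python. This is a minimal illustration, not mergeall's own code: the function name differs is hypothetical, the whole-second comparison is an assumption, and the optional content test is omitted.

```python
import os

def differs(from_path, to_path):
    """Return True if two same-named files look different by size or mtime."""
    a, b = os.stat(from_path), os.stat(to_path)
    if a.st_size != b.st_size:
        return True
    # Assumption: compare whole seconds only, since timestamp
    # precision varies from one filesystem to another.
    return int(a.st_mtime) != int(b.st_mtime)
```

Because only file metadata is read, a check like this costs about the same for a large file as for a small one, which is where most of the speed advantage over byte-for-byte compares comes from.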

The net result is a fast one-way synchronization that makes the entire destination tree identical to the source, without requiring full-tree copies. This may be used to synchronize multiple folder copies to each other directly, or to and from a common base (e.g., a USB stick or local network drive). In the latter mode, the base device serves the same intermediary role as cloud storage, and program runs achieve the same effect as cloud transfers.

Though broader in scope, the most common role for this system is mirroring large archives across multiple devices: on changes, run mergeall once to synchronize to a USB flashdrive (or other), run again to propagate to other computers as desired, and run diffall occasionally to verify archive integrity, as covered ahead.

A Few Details

The system spans 3,518 source-code lines. It consists of a 1,318-line main script (roughly half of which is docs); a 680-line threaded tkinter/Tkinter GUI launcher; a 303-line interactive console launcher; and new and modified utility scripts and modules that span 1,217 lines. It's related to, and reuses parts of, PP4E's diffall and cpall examples, but is designed to merge trees quickly, without byte-for-byte compares or exhaustive copies. To protect data, the system can also automatically make backup copies of items changed, and restore a tree's prior version altogether if needed (recent upgrades described ahead).

For its author's perhaps pathological use case (a currently 73G archive with 45K files and 2.6K folders), full copies and compares can run for hours, regardless of the volume of changes made. Because the mergeall in-place merge system updates only for actual differences, it typically finishes in just 1 minute when changes are moderate. Running twice to merge changes both to and from an intermediary storage device serves to synchronize two computers, and usually takes just 5 minutes or less. These times reflect USB 2.0 ports on some devices and might be better for all-USB 3.0 usage, but merges are strikingly quicker than copies either way (see the timing details ahead).

Update, version 3.1: as of Mergeall 3.1, the system has grown to 10,201 source lines among 34 files (up from 9,962 lines in 3.0), and its development and use have spanned 4 years of occasional efforts. The line count includes build and zip-utility scripts, but the factor-of-3 growth largely reflects new functionality, multiple platform support, and multiple distribution formats. The example archive also grew to 102K files, 10K folders, and 128G space (and larger USB flashdrives...).

Sync versus Merge

It may be useful to note that mergeall is not the same as a Windows file explorer folder merge. Unlike mergeall, Windows 7's merge does not automatically skip unchanged items (or even process them distinctly); handle mixed-mode names; or prune unique items in the destination. It also doesn't report on its plans, back up changes outside the archive, or support true rollbacks. Windows 8's merge improves on this by allowing users to manually skip all unchanged files whose timestamps and sizes match; this requires complex and error-prone user interaction, though, and still does not remove unique items in the destination, which means that renames and deletions in the source are not propagated.

In other words, Windows explorer merge does not synchronize two trees, it simply combines them; depending on user choices, it forms only their sum or union. By contrast, mergeall does not combine trees: it quickly makes a destination the same as a source. This makes its role overlap much more closely with cloud storage, which seeks to unify a single archive across multiple devices. With mergeall, archive copies are unified to and from local storage devices instead of a remote server, but the goal and effect are similar. The key difference is that cloud storage generally resides on devices owned by a third party that fully controls access and price; with mergeall, your media is your own.

Why mergeall?

In short: privacy and control. The mergeall system supports just one approach to archive copy synchronization, based on manual whole-archive merges (see the usage notes ahead); may require changes on some platforms with more exotic file types (see its limitations ahead); and does not address the more difficult problem of multiple differing copies of the same file. On the other hand, it might just help you avoid giving away your personal property to cloud providers (and/or advertisers and intelligence agencies!). There's more on the tradeoffs of cloud storage in the conclusion, after we explore mergeall usage details.

 

Code, Docs, and Screenshots

This section summarizes available resources for readers who prefer to jump into a program right away (and others who might more shrewdly return here after perusing the usage modes guide and other recommendations ahead). Subsections here:

 

Code and Docs

Fetch the mergeall distribution package at the following listed link, and please mind its usage warnings—this system changes a directory permanently by design, though 2.0's backups option also saves prior versions of items replaced or deleted:

Some of the important bits in the package:

You can view the entire contents of the zipfile, including all its source code, either on your own computer after unpacking, or online at this site.

 

Screenshots and Examples

Please note: this section grew fully defunct and has been removed as of version 3.0. It documented now-dated screenshots and runlogs from prior releases which are no longer included in the mergeall package. For the most recent GUI screen captures, see the new screenshots folder. For text-based examples, see the new test and docetc/miscnotes folders, especially test's expected output HTML files.

 

Recent Upgrades

This section documents the most noteworthy enhancements made to the system in recent releases. It currently covers:

 

Release 2.0: Automatic Backups for Changes

Version 2.0 adds an automatic backups option for changes. When enabled, this option makes backup copies of all files and directories in the destination directory that will be destructively replaced or removed in-place during a mergeall synchronization run, and notes new items added. This makes mergeall runs generally safer, as unwanted or failed changes can be later undone by restoring backup copies—for both individual items and entire runs.

How Backups Work

Specifically, in both the automatic and selective updates modes described ahead, the prior versions of items about to be changed or deleted in the destination (TO) tree are saved in an automatically created __bkp__ folder. This folder resides at the top of the destination archive; has one date/time-stamped subfolder for each mergeall run with backups; and recreates the full original directory paths of items stored within it.

Backup folders are local to an archive copy (each TO target has its own) and not synchronized across trees by mergeall. To minimize the space they require, their per-run subfolders are automatically pruned by age when their number exceeds a changeable limit. As of version 2.1, new additions are also listed in __added__.txt files in __bkp__ run subfolders; see the 2.1 section ahead.

For instance, an archive copy rooted at D:\MY-STUFF will have backup data of the following form after serving as the TO folder for a mergeall run with backups enabled (specific date/time values in folder names allow for by-name sorts):

D:\MY-STUFF\__bkp__                                        # all backups for this copy
D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss                  # this run's subfolder
D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss\__added__.txt    # list of items added, by pathname (2.1)
D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss\items            # items removed and replaced, top level 
D:\MY-STUFF\__bkp__\dateyymmdd-timehhmmss\subfolders       # items removed and replaced, at original paths 
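
As a rough sketch of how such paths could be computed (the function names here are hypothetical, and backup.py is the authority on what mergeall really does), the per-run subfolder name can be built with time.strftime, and an item's backup location by re-rooting its relative path:

```python
import os
import time

def run_stamp(when=None):
    """Name for one run's backup subfolder, e.g. date150325-time165817."""
    return time.strftime('date%y%m%d-time%H%M%S', when or time.localtime())

def backup_path(archive_root, item_path, stamp):
    """Where the prior version of item_path would be saved for this run."""
    relative = os.path.relpath(item_path, archive_root)
    return os.path.join(archive_root, '__bkp__', stamp, relative)
```

Putting the year first and zero-padding every field is what makes a plain by-name sort of these subfolders match their order in time.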

How Backups Help

Beyond their obvious data-safety benefit, backups are particularly useful when using mergeall with a common base device to synchronize changes between multiple computers, as backups for changes are maintained independently on both the base device and each target machine. For example, if you configure mergeall to save 15 backups, each archive copy's __bkp__ can contain the 15 most-recent backup copies of frequently changed files—one for each backups-enabled mergeall run in which the archive copy was the TO destination.
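
Pruning to a fixed number of backups can be pictured with the following sketch. It assumes that age is judged by the sortable folder names alone; prune_backups and max_backups are illustrative names, not mergeall's own.

```python
import os
import shutil

def prune_backups(bkp_root, max_backups):
    """Delete the oldest per-run subfolders beyond max_backups; return names removed."""
    runs = sorted(name for name in os.listdir(bkp_root)
                  if os.path.isdir(os.path.join(bkp_root, name)))
    # Names sort oldest-first, so everything before the last max_backups is excess.
    excess = runs[:-max_backups] if max_backups > 0 else runs
    for name in excess:
        shutil.rmtree(os.path.join(bkp_root, name))
    return excess
```

With a limit of 15, for example, a sixteenth backups-enabled run would cause only the single oldest subfolder to be dropped.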

Backups also serve as a record of changes made that is an alternative to the logfile and perhaps more easily inspected, and are required for version 2.1 restores (per the section ahead). The only downsides to change backups are that they take up extra space, and may slow the merge's resolution phase for extra copies; these penalties are incurred only for items recently changed, though, and are generally far outweighed by the extra data safety that backups provide.

How to Use Backups

To enable backups for changes, simply use either the new -backup command-line argument in mergeall itself, or the corresponding widgets and replies in its GUI and console launchers. In the GUI, the backups switch is on by default when updates are selected, so you normally don't need to do anything to save backups during a mergeall run.

When backups are enabled, if you ever need to restore prior versions of files, you can choose from the __bkp__ folders of any of the latest backup-enabled mergeall runs, on any of your archive copies. If ever needed, you can also roll back an entire run's changes from its backups folder after a catastrophic mistake (human or machine), per the next section's restore options.

You can also change backup folders arbitrarily (e.g., deleting if too large), and can generally ignore any diffall.py differences generated by their per-run subfolders (they'll register as unique or differing items, normally). See backup.py for implementation details, and Revisions.html for more on this and other 2.0 changes.

Postscripts

Usage update: as of version 2.4, a new "-quiet" script flag (and corresponding toggles in the GUI and replies in the console launchers) suppresses per-file backup log messages to minimize clutter in displays and logs. See version 2.4 release notes for details.

 

Release 2.1: Automatic Restores from Backups

Version 2.0's automatic change backups described in the preceding section were intended to allow one or a small group of files or folders to be restored by manual copies or nested subfolder merges, in the unlikely event that some mergeall changes went askew or were unwanted.

This suffices for most cases, but doesn't help much if there are very many changes to back out. If, for example, a user inadvertently swaps the FROM and TO folders and winds up using an old archive copy for FROM and the current copy for TO—a worst case user error—there may be hundreds or thousands of changes to undo, and the entire run should be backed out.

mergeall could formerly almost handle this itself by merging from an archive's backup folder to the archive's root, because the backup folder saves all items replaced and deleted. The merge would simply put all these items back. Unfortunately, merges also normally delete items unique to the TO tree—when merging from backup folder to archive root, this would erase all content not changed and hence not recorded in the backup. Moreover, such a merge would do nothing about backing out new items added to the TO archive root, because they are not recorded in backup folders.

The 2.1 Solution

Version 2.1 addresses this very rare case with full restores (a.k.a. rollbacks) that take the form of a merge from an archive's backup folder (FROM) to the archive's root folder (TO). To make this work, it enhances the 2.0 -backup command-line option available in all launch modes, and adds a new -restore option available in the main mergeall.py script only:

In automatic updates mode (described ahead), the combined net effect is a complete rollback of all changes made in a preceding run—restoring all items replaced or removed, and deleting all items added. For a more concrete look at how this works, browse backups in tests folder to see __bkp__ folders and __added__.txt files in action.
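
To make that net effect concrete, here is a standalone sketch of it. It is not how mergeall implements restores (the real work is done by a merge run with -restore), and it rests on two loudly stated assumptions: that __added__.txt holds one path per line relative to the archive root, and that no item changed between file and folder. The function name rollback is hypothetical.

```python
import os
import shutil

def rollback(archive_root, run_dir):
    """Undo one run: delete items it added, then put back items it replaced or removed."""
    added_list = os.path.join(run_dir, '__added__.txt')
    if os.path.exists(added_list):
        with open(added_list) as listing:
            for line in listing:
                rel = line.strip()          # assumed: one root-relative path per line
                target = os.path.join(archive_root, rel)
                if rel and os.path.lexists(target):
                    if os.path.isdir(target) and not os.path.islink(target):
                        shutil.rmtree(target)
                    else:
                        os.remove(target)
    for folder, subdirs, files in os.walk(run_dir):
        rel_folder = os.path.relpath(folder, run_dir)
        dest_folder = os.path.normpath(os.path.join(archive_root, rel_folder))
        os.makedirs(dest_folder, exist_ok=True)
        for name in files:
            if folder == run_dir and name == '__added__.txt':
                continue                    # the list itself is not archive content
            shutil.copy2(os.path.join(folder, name), os.path.join(dest_folder, name))
```

Note that items untouched by the prior run are never visited here, which is the property a plain backup-to-root merge lacks.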

How to Run a Rollback

To back out all the changes made by a prior run with backups enabled, simply locate your archive's most recent backup by its date/time name in the archive's __bkp__ folder, and run mergeall with the archive root as TO, and the latest backup's subfolder as FROM, with a command line of one of the following forms:

mergeall.py archiveroot\__bkp__\dateyymmdd-timehhmmss archiveroot -auto -restore    # automatic rollback (see ahead)
mergeall.py archiveroot\__bkp__\dateyymmdd-timehhmmss archiveroot -restore          # selective rollback (see ahead)
Run this in mergeall's source directory, or give the script's full path. You can also delete the __added__.txt file in __bkp__ first if you wish to back out only replacements or removals, and may use older backups (though they are best used with the selective updates mode described in the next section, as they may be arbitrarily out of synch with the current tree). Here's a more concrete example for reference and cut-paste-edit—backing out a run from a USB drive and saving the log:
mergeall.py D:\MY-STUFF\__bkp__\date150325-time165817 D:\MY-STUFF -auto -restore > C:\...\Desktop\logfile.txt

If you're not a fan of complicated command lines, version 2.1 also includes a convenience script—rollback.py—that builds and runs an automatic updates mode restore command line, by globbing and sorting to find the archive's latest backups folder automatically. This script also verifies the run and its inputs for safety. Run it with just the root path, or with no arguments to be asked for the root interactively:

rollback.py archiveroot        # convenience script, one argument or input
rollback.py                    # input root path interactively: command line or click 
rollback.py > logfile.txt      # save mergeall output (only) to a logfile 

You can also click this script's filename or icon to run it on Windows and skip the command line altogether. However, you may still want to use command lines to save mergeall's output to a logfile with ">" (the script's interactive prompts go to the console only), or to use selective updates mode during the restore (this requires a manual command line).
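
The globbing-and-sorting step that rollback.py is described as performing might look roughly like this. The function name latest_backup is hypothetical, and the real script also verifies its inputs, which this sketch does not.

```python
import glob
import os

def latest_backup(archive_root):
    """Return the newest run subfolder in archive_root/__bkp__, or None."""
    pattern = os.path.join(archive_root, '__bkp__', 'date*-time*')
    runs = sorted(path for path in glob.glob(pattern) if os.path.isdir(path))
    # Date/time names sort oldest-first, so the last one is the latest run.
    return runs[-1] if runs else None
```

The result would then serve as the FROM folder, with the archive root as TO, in a restore command line of the forms shown earlier.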

A Few Assumptions

However they are invoked, restores generally assume that:

  1. You used -backup in the prior run. This is a requirement for restores in all usage modes; without backups, there is nothing to restore.

  2. You have not made additional changes to the tree since the run you're rolling back. Restoring after any additional changes are made in TO won't fully reset the tree's prior state—and may erase more recent work in automatic updates mode. Devices used only for backups or transfers, however, may retain their restorability indefinitely.

If your tree meets those criteria, there are three additional usage notes to be aware of:

Finally, keep in mind that -restore is just a failsafe, designed primarily to be used immediately after a run you wish to undo. Luckily, you will probably never need it if you use mergeall with a normal amount of care. As a guideline, you should generally be cautious with FROM/TO selection to avoid having to restore, and should ideally run with -report before -auto to see what will be changed. Backups of changes in __bkp__ are still intended mainly for manual piecemeal restores, though its new __added__.txt also serves as additional run documentation.

For an example of the restore option at work that demonstrates its general usage, see this folder's HTML example sessions. For an example __bkp__ folder with its new __added__.txt file, see this backups folder in the shipped test folders (if it's still present in your copy). For backup and restore implementation-level details, see Revisions.html's change notes, and backup.py where most of the code resides.

Postscripts

Usage update (defunct): restores should generally be run on the same platform as the prior mergeall, for reasons detailed in the Revisions.html note.

Usage update update: as of version 3.0, the prior note is no longer true—the backup folder's __added__.txt file's paths are now made portable, so backups created on Unix can be rolled back on Windows, and vice versa.

A manual option: a clever end user pointed out that if—and only if—the FROM and TO trees have absolutely no files in common, a full rollback is simply a matter of moving the backup folder to the archive root. Because the backup records all items replaced and removed in their original folder paths, it will in this rare context be the same as the full TO tree from the prior run. This only works, however, if you can be completely sure that the FROM and TO trees were entirely disjoint (i.e., had no files in common) during the prior run. Otherwise, common files that were unchanged were not saved to the backups folder, and will not be restored by this shortcut. When in doubt, run the rollback.py script described above for a general-purpose restore.

 

Release 2.2: Faster Execution via Python 3.5's os.scandir()

Update: as of version 3.0, the scandir() optimization described below is not used on Mac OS X, because it was found to slow mergeall's comparisons phase by a factor of 3 on that platform in Python 3.5. Runtime for a large archive on a fast machine increases from 2 to 6 seconds when this call is used. It's unclear whether the slowdown on Mac OS X is due to the implementation of scandir() itself or the coding patterns its use implies, but the effect is clearly a negative on this platform. The call's speed gains on Windows and Linux were also eventually negated in 3.0 by an os.lstat() recoding, and os.scandir() is no longer used—a complete reversal of the note here.

Version 2.2 is a performance optimization. As of this version, mergeall uses Python 3.5's os.scandir(), if available, to speed up tree comparisons radically. This new function eliminates system calls for some attributes of files, and is available both as a standard library tool in 3.5, and a PyPI package install. When this function is present, the mergeall comparison phase uses a custom implementation that leverages the new call instead of os.listdir(), which is retained to support older Pythons.

Timing Results

So how much faster is mergeall with the new call? Timing results on Windows show that the new function speeds tree comparisons—a major time component in most mergeall runs—by a factor of 5 to 10 depending on devices. This can shave dozens of seconds off total mergeall runtime for larger trees, and perhaps more. For an example use-case archive that's now 78G and has 50k files in 3k folders:

With a different archive, comparisons clocked in at 10x faster on a Windows 10 tablet. Run with command lines of the form "py -3.X launch-mergeall-GUI.pyw" to test specific 3.X's on your own.

How to Enable the Speedup

This speed gain is fully automatic, but requires that a scandir() be present. You can satisfy this requirement and take advantage of the mergeall 2.2 optimization by either:

  1. Running mergeall and its GUI with Python 3.5 or newer, where scandir() is standard in the os module
  2. Installing the PyPI package version of scandir() that supplies the call for older Pythons, including 2.7 and older 3.X
Because the current PyPI package version of scandir() requires a C code compile for most Pythons, using Python 3.5+ (option #1) may be the simplest way to speed your merges, but watch the PyPI package for new developments on this front. You can use Python 3.5 for mergeall without breaking programs that rely on older Pythons; see this note for pointers.

If you do neither of the options listed above, mergeall falls back on the original os.listdir() scheme so that it still runs on older Pythons, albeit with the original and slower speed. Given its potentially major speed boost, though, a scandir() is now recommended for most mergeall users who manage non-trivial trees.
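
The use-it-if-present pattern described here can be sketched as follows. This is an illustration of the general technique only, with a hypothetical function name, and it checks just the standard library's os.scandir() rather than the PyPI package.

```python
import os

def list_entries(dirpath):
    """Return {name: kind} for a folder, kind being 'dir', 'file', or 'other'."""
    result = {}
    if hasattr(os, 'scandir'):
        # Python 3.5+: the entry's type usually arrives with the listing itself.
        for entry in os.scandir(dirpath):
            if entry.is_dir(follow_symlinks=False):
                result[entry.name] = 'dir'
            elif entry.is_file(follow_symlinks=False):
                result[entry.name] = 'file'
            else:
                result[entry.name] = 'other'
    else:
        # Older Pythons: one listing call, plus type checks per name.
        for name in os.listdir(dirpath):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                result[name] = 'other'
            elif os.path.isdir(path):
                result[name] = 'dir'
            elif os.path.isfile(path):
                result[name] = 'file'
            else:
                result[name] = 'other'
    return result
```

Both branches return the same mapping, so the rest of a comparison can stay unaware of which call was available.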

For more about this change, see Revisions.html's version 2.2 notes, including its list of related links.

Postscripts

Timing update: After posting the timing results above, the unoptimized version of mergeall's comparison phase for Python 3.4 and earlier was recoded to use os.lstat() and the stat module to check file types, instead of os.path.is*() tests (this was motivated by new support for symbolic links, which added new type tests). While full timing results were not run after this change, the optimized Python 3.5+ os.scandir() variant is still slower on the Mac than the non-optimized 3.4- os.listdir() version, by a factor of 2, for a test comparison which takes 9 and 4.5 seconds, respectively. os.scandir() is still a win on Windows for some types of code, though manual os.stat() or os.lstat() use may yield similar gains; in mergeall, it made the os.scandir() variant obsolete (and it was removed altogether in 3.0).
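
For readers unfamiliar with the technique, type checks via os.lstat() and the stat module take roughly the following shape. This is a sketch with a hypothetical function name, not an excerpt from mergeall.

```python
import os
import stat

def kind_of(path):
    """Classify a path with a single os.lstat() call."""
    mode = os.lstat(path).st_mode   # lstat does not follow symbolic links
    if stat.S_ISLNK(mode):
        return 'link'
    if stat.S_ISDIR(mode):
        return 'dir'
    if stat.S_ISREG(mode):
        return 'file'
    return 'other'
```

A chain of os.path.islink(), os.path.isdir(), and os.path.isfile() tests may make a separate system call for each test, while this form makes one call and then inspects the result in memory.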

 

Release 3.0: Excluding cruft (metadata) files

Version 3.0 adds a "-skipcruft" mode, and a corresponding toggle in mergeall's GUI, which allows users to skip platform-specific system cruft (a.k.a. metadata) files and folders in both the FROM and TO trees during a merge. In mergeall, the net effect is that these items won't be listed as differences in report mode, and won't be copied to, replaced in, or deleted from the TO tree in update modes. This both allows cruft files to remain on the creating platform, and avoids propagating them to other copies and computers.

A Cruft by Any Other Name

This option was added primarily as a way to deal with the many hidden metadata files generated on the Mac OS X platform. With "-skipcruft", Mac metadata files remain on the Mac, but are not transferred to other archive copies or non-Mac computers. This option does the same for less-common cruft files generated on Windows and Linux, as well as Python bytecode files: these files remain on the creating machine only. For users of single platforms, merges without the "-skipcruft" option still treat cruft files like any other, copying them to and from archive copies whenever they differ.

The "-skipcruft" option was also added to both diffall, where it eliminates cruft files from difference reports, and cpall, where it prevents cruft files from being copied to the destination tree. Cruft filename patterns are defined and documented in the user-editable mergeall_configs.py, which also gives more comprehensive mergeall cruft-skipping examples.

In addition, a new utility script, nuke-cruft-files.py, supports removal of cruft files in trees on demand. This may be used for trees that are never managed by mergeall; Windows-filesystem drives accessed from a Mac via networks or USB, for instance, may be decrufted with this utility.
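
Pattern-based cruft skipping can be approximated with the standard fnmatch module. The patterns below are examples chosen for illustration only; the actual, user-editable patterns are those in mergeall_configs.py, and the function names here are hypothetical.

```python
import fnmatch

# Illustrative patterns only; see mergeall_configs.py for the real definitions.
CRUFT_PATTERNS = ['.DS_Store', '._*', 'Thumbs.db', 'desktop.ini', '*.pyc', '__pycache__']

def is_cruft(name, patterns=CRUFT_PATTERNS):
    """True if a file or folder name matches any cruft pattern."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)

def without_cruft(names, patterns=CRUFT_PATTERNS):
    """Filter a directory listing before comparing or copying."""
    return [name for name in names if not is_cruft(name, patterns)]
```

Applying a filter like this to the listings of both FROM and TO keeps cruft out of reports, copies, and deletions alike.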

For More Details

This extension has been thoroughly documented elsewhere, so we'll cut this overview short here. For more on the "-skipcruft" option, see:

And additional version 3.0 changes described in Revisions and marked as "[3.0]" in other source-code files.

 

Usage Modes Guide

Feel free to use and modify this system as you wish, but this section provides some pointers on its intended roles. If you're looking for quick advice, skip ahead to the recommended usage modes below. This section's contents:

 

What mergeall Does

In general, this system is designed to make an entire destination (TO) directory tree the same as a source (FROM) directory tree much more quickly than brute-force copies. It achieves this by first scanning both trees to detect differences (using structural inspection for folders and primarily modification times for files), and then running the following updates in the following order:

  1. Differing same-named files are copied from FROM to TO.
  2. Unique items (files and folders) in TO are removed from TO.
  3. Unique items (files and folders) in FROM are copied to TO.
  4. Mixed-mode same-named items (file or folder) are replaced in TO by their FROM version.
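
The four kinds of updates listed above can be sketched as a classification of a single folder level. The names plan_folder and differs are hypothetical; differs is a caller-supplied file comparison, and a full tool would also recurse into same-named subfolders, which this sketch leaves out.

```python
import os

def plan_folder(from_dir, to_dir, differs):
    """Classify one folder level into the four update kinds listed above."""
    def kind(path):
        return 'dir' if os.path.isdir(path) else 'file'

    from_names = set(os.listdir(from_dir))
    to_names = set(os.listdir(to_dir))
    plan = {
        'replace_files': [],                            # 1. differing same-named files
        'delete_in_to': sorted(to_names - from_names),  # 2. unique items in TO
        'copy_to_to': sorted(from_names - to_names),    # 3. unique items in FROM
        'mixed_mode': [],                               # 4. file in one tree, folder in other
    }
    for name in sorted(from_names & to_names):
        from_path = os.path.join(from_dir, name)
        to_path = os.path.join(to_dir, name)
        if kind(from_path) != kind(to_path):
            plan['mixed_mode'].append(name)
        elif kind(from_path) == 'file' and differs(from_path, to_path):
            plan['replace_files'].append(name)
    return plan
```

Separating the plan from its execution is also what makes a report-only run possible: the same classification can be printed instead of applied.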

Along the way, changes are backed up as described earlier. The net result mirrors the FROM tree to TO. How you utilize this tool, though, involves choices between automatic and selective update modes, and common base devices or direct transfers. The following sections define these terms and explore intended patterns of usage.

 

A Few Definitions

To describe usage patterns, we need to first define some terms:

Common base device

This refers to a storage device (a USB flashdrive, local network drive, or other) that will be required in most cases to serve as an intermediary between different computers. Data is uploaded to the common base, and from there downloaded to other devices to synchronize them. Besides supporting such indirect transfers, a common base device also serves as a backup copy.

Direct transfer

This means a data transfer between trees performed without a common base—possible for synchronizing folders on the same machine, and for some types of device interfaces. A device that appears as a drive when connected by USB, for instance, allows for direct transfers.

Automatic updates mode

This is a mergeall option which automatically resolves tree differences without user intervention. This mode automatically applies all data-set changes from one tree or device to another, making a destination folder the same as a source. This can be useful both for quick backups, and for synchronizing multiple trees. Because this mode mirrors one whole tree to another, it generally requires that you work in only one archive or subfolder copy at a time, to avoid erasing another copy's changes on later full tree merges. In exchange, this mode provides the simplest and least error-prone option.

Automatic updates mode can be invoked in manual mergeall command lines, the console launcher, or the GUI launcher—which by design supports this updates mode only, along with reports. Choose this mode by using -auto (and omitting -report) in command lines, or by using inputs and widgets in the launchers.

Selective updates mode

This is a mergeall option which asks the interactive console user to approve or skip each individual file or folder update. Though more user-intensive and error-prone, this mode allows you to work in multiple trees simultaneously, and reconcile their changes in a more ad-hoc file-by-file fashion. It still requires that change sets be disjoint, to avoid changing files already changed in other trees but not yet synchronized. Unlike automatic updates mode, though, it can incorporate multiple and arbitrary change sets without having to treat entire archives or subfolders as locked while one copy is modified, albeit at a substantial cost in user interaction requirements.

Selective updates mode can be invoked in manual mergeall command lines, or the console launcher. Choose this mode by omitting both -auto and -report in command lines, or by using launcher inputs.
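
Taken together, the flag rules for the two update modes and for reports amount to a three-way choice, sketched below. The function name is hypothetical, and treating -report as taking precedence when both flags appear is an inference from the wording above, not a confirmed detail of mergeall.py.

```python
def choose_mode(args):
    """Map command-line flags to 'report', 'auto', or 'selective'."""
    if '-report' in args:
        return 'report'      # assumed to win if -auto is also given
    if '-auto' in args:
        return 'auto'
    return 'selective'       # neither flag: ask about each update
```

For example, choose_mode(['from', 'to', '-auto']) selects automatic updates, while choose_mode(['from', 'to']) falls through to selective updates.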

 

How to Use the System

With the preceding definitions in mind, this section describes usage patterns: approaches to using the mergeall system to manage your data. Some use automatic updates, some selective, and some may use both. All may be applied with or without a common base. One note up front: in the first two sections that follow, the notion of "tree" generally refers to a whole archive copy, but this need not always be so; the third section on subfolders will tighten up this concept.

 

Working in One Archive Copy at a Time

This system was designed in part for automatic merging of all data-set changes from one tree or device to another—the automatic updates mode described earlier. This mode can be useful both for general backups, and for synchronizing multiple trees. As a backup tool, for example, merging changes to a device in automatic updates mode will change only items modified since the prior backup.

In its synchronization role, because automatic updates mode makes the destination tree a mirror copy of the source, it works best if you're careful to make changes in only one copy at a time. When you want to work in another tree copy or device, first synchronize to propagate changes as follows:

  1. When using a common base, run mergeall to automatically upload changes from the changed tree to the base; then run again to automatically download the changes from the base to other trees to synchronize.

  2. When no common base is used, the upload step effectively goes away, as you'll run mergeall to automatically transfer the changed tree's updates to other trees directly.
This might become a frequent task if you work in multiple trees or devices often, but the merge steps run quickly (see the example run times above), and can use a simple USB stick or shared network drive as a common base. Moreover, you need to perform synchronization runs only when an entire batch of changes is ready for transfer, not on changes to each individual file. For example, in mode #1, when you're ready to make changes on a different device, simply run mergeall to upload from the last active device and download to the next active device; in between these transfers, no synchronization tasks are required.
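
For those who prefer to script the upload-then-download sequence of mode #1, the command lines can be generated as in the sketch below. The helper name and its arguments are hypothetical, the paths are placeholders, and the script path assumes a run from mergeall's source directory; only the -auto and -backup flags come from this document.

```python
import sys

def sync_commands(changed_tree, base, other_trees, script='mergeall.py'):
    """Build the upload command, then one download command per other tree."""
    def merge(from_dir, to_dir):
        # FROM comes first and TO second, as in the command lines shown earlier.
        return [sys.executable, script, from_dir, to_dir, '-auto', '-backup']
    return [merge(changed_tree, base)] + [merge(base, tree) for tree in other_trees]
```

Each resulting list can be passed to subprocess.call() in order. Printing the lists first and checking every FROM/TO pair is a cheap guard against the swapped-folders mistake described earlier.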

The chief downside of this approach is that it requires some discipline to follow properly. Automatic updates mode assumes the destination tree should mirror the source tree exactly and in full, no matter how the destination may have changed. If you change multiple copies without synchronizing, automatically uploading or downloading from one tree may overwrite and thus erase the changes made in other trees. This can occur even if the trees' change sets are disjoint, because automatic merges work on a whole-tree basis. All changes, additions, deletions, and renames made in one tree are propagated to other copies, regardless of other trees' states.

That is, automatic updates mode works well and is simple to use, but requires some procedural diligence to avoid losing prior changes if multiple trees are modified but not synchronized between uploads. Specifically: You must treat the entire tree or device you're working in as the effectively "locked" copy until its changes are propagated; other copies must be treated as "read-only" until they incorporate the locked tree's updates by mergeall runs. Because mergeall makes synchronizations relatively quick and easy, though, this isn't necessarily more difficult than interfacing with cloud services on changes, and need not be run at all until switching active devices or propagating data for viewing (there's more on clouds in the wrap-up).

 

Working in Multiple Archive Copies at the Same Time

This system also has a selective updates mode described earlier, which allows you to choose updates to be applied. This mode supports working in multiple trees or devices simultaneously, and combining the changes made in them since the latest synchronization step on a file-by-file basis. Unlike automatic updates mode, it can incorporate multiple and arbitrary disjoint change sets without having to treat an entire tree as locked while any one copy is modified. At the same time, it also requires much more user interaction, and is much more prone to user error.

Like its automatic relative, selective updates mode can be used with or without a common base, and supports a variety of usage patterns. To reconcile two changed trees, do the following (and generalize these procedures for more than two trees):

  3. When using a common base, run once to selectively upload just the first tree's changes to the base; run again to selectively do the same for the second tree; and then run again to download the resulting combined base to each tree (perhaps in automatic updates mode, as the base should have both change sets).

  4. With no common base, simply run mergeall twice: once to selectively merge just the first tree's changes to the second; and once more with swapped from/to roles to merge just the second tree's changes to the first (perhaps via automatic updates mode, as only the second tree's changes should remain as differences).

To broadcast just one tree's changes to multiple possibly changed trees:

  5. When using a common base, run once to selectively upload just the changed tree's changes to the base; then run again to selectively download just those changes to each other tree.

  6. With no common base, run mergeall to selectively transfer just the changed tree's changes into each other tree.
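The no-base reconciliation of two changed trees can be sketched with the same kind of toy model. Everything here is hypothetical: trees are {path: content} dictionaries, and the chosen list plays the role of the user's manual selections.

```python
def diff(source, destination):
    """List paths that differ: changed, or unique to either tree."""
    paths = set(source) | set(destination)
    return sorted(p for p in paths if source.get(p) != destination.get(p))

def selective_merge(source, destination, chosen):
    """Apply only the chosen differences from source to destination."""
    result = dict(destination)
    for path in chosen:
        if path in source:
            result[path] = source[path]     # copy a changed or new item
        else:
            result.pop(path, None)          # prune an item absent from source
    return result

first = {"site.html": "v2", "budget.xls": "v1"}    # web page edited here
second = {"site.html": "v1", "budget.xls": "v2"}   # spreadsheet edited here

# Run 1: the user picks only the first tree's real change.
second = selective_merge(first, second, ["site.html"])
# Run 2: roles swapped; every remaining difference is the second tree's change.
first = selective_merge(second, first, diff(second, first))
print(first == second)
```

The first run depends entirely on the user choosing the right items; picking budget.xls instead would have discarded the second tree's edit.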

With a common base, you can also defer downloading changes, but this seems a recipe for disaster. Merges will grow more complex over time, as the base will grow more and more different from individual copies. Synchronizing from the base immediately when changes are integrated will minimize the risk of accidentally losing its changes in later mergeall runs, or changing a file already changed in another tree but not yet synchronized—a worst case scenario for shared data sets, and a state this more piecemeal mode seems likely to foster.

In other words, selective usage patterns require some diligence too, to integrate changes before trees grow too out of synch to reconcile. In fact, you still must treat the entire set of all modified files in any tree as "locked" until they are transferred to other trees or devices; other copies of these files should be "read-only" till synchronized. This constraint doesn't apply to an entire tree (as it does in automatic updates mode), but it's an inherent consequence of working in multiple copies simultaneously. Selective updates ultimately trade procedure for reliance on user memory—you don't have to restrict edits to one tree copy at a time, but you do have to keep track of which files in which tree are current.

Selective updates mode also requires careful choice of updates to apply, and is manual to be sure, but reconciling two arbitrarily disparate trees by nature requires some sort of manual human intervention. This may be useful in limited contexts, but seems too manual to be a primary synchronization technique.

Sidebar: Selective Updates Alternatives

A future variant of this script could support the preceding's peer merges more directly instead of requiring multiple runs—by asking which version of changed files to use, and whether unique items in either tree should be copied over or pruned—but awaits some end-user experience. It's not clear whether this would be less or more confusing than separate one-way runs, and the merit of selective-mode usage in general remains to be shown. On the other hand, a direct peer merge would avoid analyzing differences twice. To try this extension as an exercise yourself, see mergeall.py's reusable comparetrees(), which already does half the work.

An automatic peer-to-peer merge, however, is impossible; without user input, it could not choose from differing same-named files, and could produce only the union of two trees' unique items in response to deletions or renames. A merge could, perhaps as an option, use the newest version whenever two same-named files differ, regardless of which tree it belongs to. This would pick up the latest changes, but was not pursued as it seems highly prone to error—it makes the extreme assumption that any change in any copy should invalidate all others, regardless of divergence since the last merge. It's also unclear in this scheme which tree to prefer for unique items (are they deletions or additions?). A more manual selective approach that asks the user about each difference seems more rational and safe.
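The newest-version option described above, and its blind spots, can be shown in a few lines. This is a sketch of the idea that was not pursued, not code from mergeall; newest_wins and the {path: (modtime, content)} tree model are assumptions for illustration.

```python
def newest_wins(tree_a, tree_b):
    """Hypothetical peer merge: keep the newer version of each shared file.

    Trees are {path: (modtime, content)}. Items unique to one tree are
    returned separately, because the merge cannot classify them.
    """
    merged, ambiguous = {}, []
    for path in sorted(set(tree_a) | set(tree_b)):
        if path in tree_a and path in tree_b:
            merged[path] = max(tree_a[path], tree_b[path])   # larger modtime wins
        else:
            ambiguous.append(path)      # an addition here, or a deletion there?
    return merged, ambiguous

tree_a = {"notes.txt": (200, "edited on A"), "old.txt": (50, "stale")}
tree_b = {"notes.txt": (300, "edited on B")}

merged, ambiguous = newest_wins(tree_a, tree_b)
print(merged)
print(ambiguous)
```

The edit made on A is dropped without notice, and old.txt cannot be resolved without asking the user whether it was added in A or deleted in B.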

 

Working in Multiple Subfolders at the Same Time

So far, we've seen how to apply automatic updates to work in one archive tree copy at a time, and selective updates to work in multiple trees simultaneously, but the automatic/selective dichotomy isn't quite as orthogonal as this may imply, and other schemes are possible.

For example, although mergeall makes an entire tree the same as another, this doesn't necessarily have to include every piece of data you've accumulated since the dawn of digital time. It's always possible to use automatic updates to synchronize just selected subfolders nested within an archive—rather than the whole archive tree—to and from a common base (or to another copy directly). This has some advantages, but they come with cautions:

In fact, this scheme is essentially the same as the preceding section's topic—with subfolders representing change sets, and automatic updates on subfolders replacing selective updates on a broader tree. To synchronize, simply use the prior section's modes #3 through #6 with these translations, and be sure to treat the currently active subfolder copy as "locked" and all others as "read-only" just as for the prior section's more arbitrary change sets.

Subfolders have an advantage over the prior section's approach: parallel changes are easier to manage when limited to specific tree locales, and automatic updates mode is much easier than selective updates from a user's perspective. However, subfolder synchronization also comes with most of the same burdens and dangers: unlike full-tree approaches, keeping track of tree changes still becomes more your task than the system's.

Hence, selected subfolder synching is not generally recommended, except in limited cases. You're more likely to keep trees in synch if your automatic updates are made on a whole-archive basis, and you restrict your edits to one full-tree copy at a time. That said, your merges will run faster if you organize your data wisely, with rarely changed files in archive trees that need rarely be merged. If you do wish to make changes on different machines in parallel, though, you'll have to exercise some caution to avoid losing changes.

 

Recommended Usage Modes

Because of the complexities (and perils) of both selective updates mode and changing multiple trees simultaneously, mergeall's automatic updates mode and usage patterns #1 or #2 listed earlier are generally recommended for most users (and frankly, have been the only techniques used in practice by the system's original developer). To summarize the model:

For data sets shared by multiple devices:

When changes are made on one device, run mergeall's automatic updates mode to upload them to a common base device, and run mergeall's automatic updates mode again to download them from the base to other devices when needed. You can make changes on just one device at a time, but need to synchronize this way only when switching to another device for edits, or propagating current data for viewing on other devices.

For data sets shared by multiple folders on the same device:

When changes are made in one folder, run mergeall's automatic updates mode to transfer them to other folders directly when needed. You can make changes in just one folder copy at a time, but need to synchronize this way only when switching to another folder for edits, or propagating current data for viewing in other folders.

Although you can apply these procedures to any subfolder nested in an archive's directory tree, it's generally simpler and recommended to run them on a whole archive. That way, mergeall is responsible for locating changes anywhere in the tree; for most real-world usage, this is much easier than keeping track of them yourself.

For multiple devices, this model is essentially a manual emulation of some cloud storage interfaces, where mergeall runs replace network transactions to and from a cloud server, and a local device used as the common base replaces remote cloud storage. Especially when augmented by version 2.0's automatic backup of items changed on each device, the common base's role becomes functionally very similar to many cloud services (again, more on clouds ahead).

The recommended automatic usage modes listed above offer the simplest and least error-prone solution, where their procedural requirements can be met. If you really must work on disjoint file sets in multiple trees or devices at the same time, though, be sure to synchronize regularly to avoid version skew—transfer your changes to other tree copies as soon as possible (if not immediately), per modes #3 through #6 above. You can perform these transfers with automatic updates mode if your changes are isolated in disjoint subfolders, but must use selective updates mode if they are more haphazard.

Put more strongly, version 2.0's automatic backup of changes helps protect your data in both automatic and selective updates modes, and 2.1's rollbacks provide a failsafe for catastrophic mistakes, but there's nothing mergeall can do if the same file is changed in two trees without synchronizing—a case that seems more likely when using a more sporadic simultaneous changes model. Following the recommendations above is the simplest way to avoid this situation.

Because it's generally easier, automatic updates mode is the only updates mode supported by the mergeall GUI launcher—the recommended way to use this system for most users. Selective updates mode is available in both the console launcher and manual mergeall command lines, which are more powerful alternatives for more advanced use cases.

 

Other Usage Recommendations

Apart from the preceding section's usage pattern suggestions, a variety of general techniques can help make mergeall more effective for your data. Here's a quick rundown of additional usage suggestions:

Experiment with the GUI live:

There is no formal usage guide for mergeall's GUI, because it is simple enough to qualify as self-explanatory. The newer screenshots collection gives a static picture of the GUI, but your best bet may be to experiment with it live: open the GUI launcher script, launch-mergeall-GUI.pyw; select your FROM and TO folders; choose a report-only or auto-updates run; make your logfile and backups choices as appropriate for your run; and press the "GO" button at the GUI screen's bottom to start the mergeall process. mergeall output appears in the GUI's text area, and the GUI changes its structure to present only items relevant to the selections you make. Be sure to start out with report-only mode, and use a TO folder you don't mind changing in auto-updates mode.

Design your archives wisely:

As a rule, all files and media that you wish to be managed by mergeall should be saved in your mergeall archive tree (or trees), not in any platform-specific default folders; this requires some discipline, but allows for quick copies and backups. To make merges faster, store infrequently changed data in a different archive tree than data you typically change; that way, you can run mergeall on just the regular-changes tree and skip the rest. Decade-old photo collections, for example, are unlikely to change often enough to warrant regular mergeall inspection. On the other hand, any data that may change should be in a folder mergeall visits so that updates are propagated, and version 2.2's speed optimization can make comparisons much faster for larger trees. Also note that your archive trees must be no larger than the storage space of devices to which they will be propagated; split up your tree if it's too big for your external drives.
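One way to check the size constraint before a first merge is to total the archive's files and compare against the target device. This is a minimal sketch using only the standard library; tree_size and fits_on are hypothetical helper names, and the demo folder stands in for a real archive.

```python
import os
import shutil
import tempfile

def tree_size(root):
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def fits_on(root, device_path):
    """True if the tree is no larger than the device's total capacity."""
    return tree_size(root) <= shutil.disk_usage(device_path).total

# Demo on a throwaway folder; substitute your archive root and drive path.
with tempfile.TemporaryDirectory() as demo:
    os.makedirs(os.path.join(demo, "sub"))
    with open(os.path.join(demo, "sub", "a.bin"), "wb") as f:
        f.write(b"x" * 1000)
    demo_size = tree_size(demo)
    demo_fits = fits_on(demo, demo)
print(demo_size, demo_fits)
```

In practice you would also want headroom for change backups, so treat a result near the device's capacity as a reason to split the tree.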

View reports before updating:

Especially when first using the system, it's a good idea to run it in report-only mode before running it to perform updates—automatic updates in particular. The report shows differences found and describes the changes that automatic updates would make, allowing you to preview and verify the plan. In command-line usage, this means run with -report before -auto; in the launchers, use inputs and widgets to report first.
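The report-then-update habit can be sketched as a pair of command lines. The -report and -auto flags come from the text above; the argument layout (script, FROM folder, TO folder, mode flag) and the folder paths are assumptions here, so check manual-commands-cheat.txt for the exact form. The helper only builds argument lists and does not run anything.

```python
import sys

def mergeall_command(from_dir, to_dir, mode):
    """Build an argument list for one mergeall.py run (layout is assumed)."""
    if mode not in ("-report", "-auto"):
        raise ValueError("mode must be -report or -auto")
    return [sys.executable, "mergeall.py", from_dir, to_dir, mode]

# Hypothetical folders; preview first, update only after reading the report.
preview = mergeall_command("/archive", "/media/usb/archive", "-report")
update = mergeall_command("/archive", "/media/usb/archive", "-auto")
print(" ".join(preview[1:]))
print(" ".join(update[1:]))
```

Each list could be passed to subprocess.run once the report has been reviewed.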

Use automatic backups:

As of version 2.0, for data safety it is recommended to always use mergeall's automatic backups option for changes described earlier, in both automatic and selective updates modes. While not foolproof, this option allows unwanted or erroneous mergeall run changes to be backed out if needed. Because this helps protect your archives (which are your digital property), it's enabled by default in the GUI; don't disable it unless backup copies would be too large or slow for your devices. Backups are also required for the next bullet's restores.

Use automatic restores if needed:

Though primarily intended for piecemeal restores, the prior bullet's backups also allow for complete rollbacks of immediately preceding runs as of version 2.1, in unlikely but catastrophic scenarios (if you mix up FROM and TO folders, for example). For details, see the new restore option described earlier. You should generally be cautious with folder selection to avoid restores altogether, and full rollbacks should be very rarely required; restores provide a failsafe recovery option if ever needed.

Keep multiple copies:

To protect your data further, keep multiple archive copies, and rotate their mergeall updates by age (always merge to the oldest copy). This way, you'll have additional backups to fall back on in case of rare but catastrophic device failures.
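Rotating by age amounts to picking the copy with the oldest merge date. A minimal sketch, assuming you keep your own record of when each copy was last merged (the names and dates below are made up):

```python
def next_target(copies):
    """Pick the copy merged longest ago. copies maps name -> ISO date string."""
    return min(copies, key=copies.get)

# Hypothetical record of last merge dates; ISO dates sort chronologically.
last_merged = {
    "usb-A": "2017-05-01",
    "usb-B": "2017-03-15",
    "drive-C": "2017-06-02",
}
target = next_target(last_merged)
print(target)
```

After merging to the chosen copy, update its date so the next run rotates to a different device.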

Run diffall.py:

For further archive fidelity, run the accompanying diffall.py script occasionally, to verify the integrity of archive copies by byte-for-byte comparisons. Unlike mergeall, diffall compares full file content instead of just file modification times and sizes, and so gives a slower but more complete proof of data equality. To run diffall, use mergeall's -verify command-line option or direct command lines; see manual-commands-cheat.txt for examples. For more on diffall, see its script's docstring, and its -recent option documented in version 2.0 change notes in Revisions.html. Also note that some differences are normal, including __bkp__ per-run subfolders used for change backups, and files changed trivially by Excel as discussed in Lessons-Learned.html and a version 1.4 usage note in Revisions.html.
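The difference between a metadata check and a byte-for-byte check can be demonstrated with Python's standard filecmp module. This is not diffall's code, only an illustration of why a full content comparison is the stronger test: two files are given the same size and modtime but different bytes.

```python
import filecmp
import os
import tempfile

with tempfile.TemporaryDirectory() as demo:
    one = os.path.join(demo, "one.txt")
    two = os.path.join(demo, "two.txt")
    with open(one, "wb") as f:
        f.write(b"spam")
    with open(two, "wb") as f:
        f.write(b"eggs")                         # same size, different bytes
    os.utime(one, (1000000000, 1000000000))
    os.utime(two, (1000000000, 1000000000))      # force identical modtimes

    same_by_stat = filecmp.cmp(one, two, shallow=True)     # type, size, modtime
    same_by_bytes = filecmp.cmp(one, two, shallow=False)   # reads file content
print(same_by_stat, same_by_bytes)
```

The shallow test reports the files as equal while the full comparison does not, which is the kind of case an occasional byte-level verification is meant to catch.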

Fix file permissions:

Some file permissions preclude mergeall updates. This includes read-only and hidden/system files; some may be copied over to a destination, but cannot be updated there on changes. As the system does not modify your files' permissions automatically (your files are your property), you may want to change these yourself if they register as errors and skips in the mergeall log. In-use file errors can be addressed by rerunning. See the related usage note in Revisions.html for more details.

Handle DST rollover:

If you use FAT devices (e.g., most flashdrives) on Windows, you'll probably want to adopt a policy for dealing with the 1-hour modtime skew that occurs at Daylight Saving Time (DST) rollovers. See the version 1.4 usage note in Revisions.html for options (including the new script workaround in 2.0), and Lessons-Learned.html for additional context. This is easy to handle, but the default policy means that your FAT archive copies will be rewritten in full twice a year.

Update: as of version 3.0, users are advised to format external drives with the exFAT filesystem to work around the DST rollover issue. This fixes the problem completely, though Linux users (only) may have to install exFAT support. See the new coverage of this topic in the User Guide.
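To see what the DST skew looks like in a comparison, consider the following sketch. It is one possible policy, not necessarily any of the options described in Revisions.html: classify_modtimes is a hypothetical helper that treats a difference of almost exactly one hour as a likely rollover artifact. The 2-second allowance reflects FAT's modtime resolution.

```python
FAT_RESOLUTION = 2      # FAT records modtimes in 2-second steps
DST_SKEW = 3600         # one hour, in seconds

def classify_modtimes(mtime_from, mtime_to):
    """Return 'same', 'dst-skew', or 'differs' for two modtimes in seconds."""
    delta = abs(mtime_from - mtime_to)
    if delta <= FAT_RESOLUTION:
        return "same"
    if abs(delta - DST_SKEW) <= FAT_RESOLUTION:
        return "dst-skew"       # likely a rollover artifact, not a real edit
    return "differs"

print(classify_modtimes(1000, 1001))
print(classify_modtimes(1000, 4600))
print(classify_modtimes(1000, 2000))
```

A policy like this trades a small risk (a genuine edit made exactly one hour apart is skipped) for avoiding a full rewrite at each rollover.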

Use shorter names and paths:

On most systems (including Windows), there is a limit on both filename and directory path length, above which mergeall updates may fail. To avoid this, try to avoid excessively long filenames and excessively deep directory trees. If items fail due to length, you may need to manually shorten or prune them, or move them closer to the archive root; saved web pages are notorious in this department. On the upside, mergeall can handle any filename that works on your platform (including those containing spaces and other odd characters on Windows), even though some may be difficult to use in other platforms' shells; and properly propagates mixed-case file renames, even on platforms whose filenames are case-insensitive (including Windows).

Update: version 3.0 lifts the Windows pathname length limit, by formatting too-long pathnames in such a way as to invoke different API tools that support much longer paths. Thus, you probably won't run into pathname limits anymore in mergeall (even in website save folders), though pathologically long filenames or pathnames may still fail on any platform.
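A common way to reach Windows' long-path APIs is the extended-length prefix, sketched below. This illustrates the general technique only; mergeall's own code may differ, and extend_windows_path is a hypothetical name. The function expects absolute paths, and uses ntpath so the example also runs on non-Windows hosts.

```python
import ntpath

MAX_PATH = 260              # classic Windows limit, counting the terminator
EXTENDED = "\\\\?\\"        # extended-length path prefix

def extend_windows_path(path):
    """Add the extended-length prefix to an absolute path that is too long."""
    if path.startswith(EXTENDED) or len(path) < MAX_PATH:
        return path
    path = ntpath.normpath(path)            # prefixed paths are not normalized
    if path.startswith("\\\\"):             # UNC form: \\server\share\...
        return EXTENDED + "UNC\\" + path[2:]
    return EXTENDED + path

short = "C:\\temp\\a.txt"
long_local = "C:\\archive\\" + "x" * 300 + ".html"
long_unc = "\\\\server\\share\\" + "y" * 300

print(extend_windows_path(short))
print(extend_windows_path(long_local)[:14])
print(extend_windows_path(long_unc)[:22])
```

Short paths pass through unchanged, so the prefix is applied only where it is needed.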

Log to different drives:

Routing mergeall's output to a logfile that is located in the TO destination folder may cause mergeall to run substantially slower due to the extra writes, especially on flashdrives. Make sure your logfiles are routed to a different drive (e.g., on Windows, use C: for the log if D: is the TO destination tree). Also note that this may not be a significant issue on some drives (e.g., SSDs); try it on yours to be sure.
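A quick way to check whether a log folder shares a device with the TO tree is to compare device identifiers from os.stat. This is a minimal sketch with a hypothetical helper name; the demo uses a temporary folder and its own subfolder, which share a device.

```python
import os
import tempfile

def on_same_device(path_a, path_b):
    """True if both existing paths live on the same device or volume."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

with tempfile.TemporaryDirectory() as demo:
    logs = os.path.join(demo, "logs")
    os.mkdir(logs)
    shared = on_same_device(demo, logs)
print(shared)
```

If the check is true for your log folder and TO folder, consider moving the log elsewhere and timing a run both ways.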

Set your shell Unicode type:

Be sure to set Python's Unicode environment variable PYTHONIOENCODING to UTF8 (or other) in your shell or Control Panel if you receive Unicode errors when scripts like mergeall.py attempt to print non-ASCII filenames on your platform. This manual setting is not required for the GUI launcher—it automatically sets and propagates this variable to its mergeall.py subprocess, and does not route text to a console (only to a GUI and bytes-mode logfile). However, this setting may be required for both the console launcher, and mergeall.py when run directly from a command line—because both print filenames to the console, visiting any file with a non-ASCII name may otherwise abort these scripts, especially in 3.X. For more on this variable, see PP4E or LP5E.
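Setting the variable for a child process, as the GUI launcher is described as doing, can be sketched as follows. This is not the launcher's code: the child here is a one-line Python command that prints a non-ASCII name, standing in for a script that prints filenames.

```python
import os
import subprocess
import sys

env = dict(os.environ, PYTHONIOENCODING="utf8")      # copy, then override
child = subprocess.run(
    [sys.executable, "-c", "print(chr(228) + '.txt')"],   # prints a non-ASCII name
    env=env, stdout=subprocess.PIPE, check=True)

printed = child.stdout.decode("utf8").strip()
print(printed == chr(228) + ".txt")
```

Because the child encodes its output as UTF-8 regardless of the console's default, the parent can decode the bytes reliably.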

Consider using Python 3.X and 3.5+:

This note pertains to the source-code mergeall package only. As described in the Revisions' release notes, it's generally recommended that mergeall and its launchers be run under Python 3.X instead of 2.X for trees having many non-ASCII filenames, and under Python 3.5 or later for larger trees. Using 3.X avoids some minor display issues, and using 3.5+ allows mergeall to run much quicker. Note that you should be able to install these Pythons without breaking programs that rely on prior releases; on Windows, install without filename associations, and run mergeall and its launchers from command lines instead of clicks (e.g., "py -3.5 launch-mergeall-GUI.pyw").

Update: the speed gains of the Python 3.5+ optimization are realized only on Windows and Linux. On Mac OS X, this optimization is irrelevant and unused, as it actually slows comparisons by a factor of 3 in Python 3.5 (also see the note above: the 3.5+ scandir() optimization was subsumed by an alternative coding). Moreover, it now appears that some of Python 2.X's Unicode filename issues may be limited to Windows (this is to be further explored). Still, the latest Python is generally recommended on all platforms.
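For readers unfamiliar with scandir(), the sketch below shows the general technique of collecting modtimes and sizes in one directory pass. It is an illustration of the standard library call only, not mergeall's current code (which, as noted, uses an alternative coding); folder_signatures is a hypothetical name.

```python
import os
import tempfile

def folder_signatures(folder):
    """Map each file name in folder to its (modtime, size) pair."""
    signatures = {}
    for entry in os.scandir(folder):               # available in Python 3.5+
        if entry.is_file(follow_symlinks=False):
            info = entry.stat(follow_symlinks=False)   # may reuse cached data
            signatures[entry.name] = (int(info.st_mtime), info.st_size)
    return signatures

with tempfile.TemporaryDirectory() as demo:
    path = os.path.join(demo, "a.txt")
    with open(path, "wb") as f:
        f.write(b"hello")
    os.utime(path, (1500000000, 1500000000))
    sigs = folder_signatures(demo)
print(sigs)
```

Where the platform supplies file details along with the directory listing, this can avoid a separate stat call per file, which is the source of the speedup on some systems.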

 

Limitations and Cautions

Though mergeall works as intended and continues to see regular action, it's not without the usual dark corners inherent in system-related tools. None of these are unique to mergeall—in fact, cloud providers and other backup systems must deal with many of the same issues. This section's subsections summarize the set, though, so you can be fully aware of issues that may crop up:

 

Same-File Differences

Keep in mind that change sets must be disjoint to reconcile two trees at all (e.g., working on only website files in one tree, and spreadsheet files in another). The recommended usage modes described earlier avoid this issue altogether by limiting changes to one copy at a time. If you change the same file in two or more trees without synchronizing, though, you'll have to select a single version, and may have to manually reconcile in-file changes.

This is a dilemma that source control systems aim to address, and for which some products may attempt to apply proprietary solutions for a limited number of file types, but it remains an unavoidable potential pitfall in the general case. Whether you merge local copies with this system or resort to a network cloud, you must still be careful to avoid changing the same file in multiple trees or devices without synchronizing the file to all copies after each set of changes, by either manual or automated transfers.

 

Other Limitations

Beyond basic usage models, this system also comes with some open issues and caveats, described in the docstrings at the top of its code files; search for TBD and CAVEAT for details. Among them:

For notes on unusual file types, see version 1.5. For more on device failures, see Lessons-Learned.html in the docs folder. For more on timestamp and filesystem issues, see Lessons-Learned.html; the CAVEATS section in the docstring of the main mergeall.py script; and the usage notes in the version history of the top-level Revisions.html file. The latter of these also includes specific usage notes and workarounds, some of which are omitted here; among its coverage and solutions:

The good news is that update failures are not generally harmful—they produce error messages in the log, but simply leave a difference to be resolved either manually or on the next mergeall run. See also other recommendations earlier, for pointers on dealing with some of these limitations.

 

Use With Care

However you use this system, also keep in mind that it may change its destination tree in-place. Moreover, by default it does so without making backups of files or directories added, replaced, or deleted (see the next paragraph). This is all by design, to optimize speed and space requirements; after all, the goal of this system is to synchronize large trees faster than brute-force copies. If in doubt, though, please try it on a temporary copy first, and make manual backups as needed. There is also a more detailed warning in the UserGuide.html file that you should read before use.

Update: for added data safety, see the 2.0 automatic backup for changes option described earlier. When enabled, this option mitigates some data loss risk by automatically saving all files and directories replaced or deleted in-place, and noting all files added. This allows you to back out changes if needed—either by manual piecemeal copies, or by version 2.1's complete rollbacks—and should generally always be used. However, this should still not be considered foolproof, given the many ways that storage devices can fail. See the preceding recommended usage and other recommendations if you haven't already, for more pointers on promoting archive integrity.

 

Manual Merges versus Cloud Storage

In closing, here are a few words on this system's purpose. As noted earlier, mergeall's recommended usage mode corresponds closely to cloud services, where program runs replace cloud server transfers, a local base device replaces remote cloud storage, and backed-up changes on each destination device can help protect your data.

Compared to some of the current claims of cloud storage providers, though, the recommended mergeall usage model may require extra manual steps to synchronize; its on-demand whole-archive resolution must be run only when needed, but it must be run. On the other hand, some cloud services come with interface tasks of their own, and may not be quite as automatic as their marketing may imply. More crucially, cloud servers are controlled, in most cases, by financially interested third parties, on which your digital property becomes wholly dependent—a massive downside, and a primary motivation for starting this project.

To be clear, data stored on commercial and/or public clouds—including Google Drive, Dropbox, Microsoft's OneDrive (formerly SkyDrive), Apple's iCloud, and Amazon's Cloud Drive:

If that's not enough to raise a red flag or two, also keep in mind that clouds are not a panacea for all the issues inherent in data storage (despite the Orwellian language on some of their web sites). Multiple copies or devices raise difficult problems that require careful resolution under any regime. The more a cloud promises simple solutions, the less likely it is to deliver them.

This issue has grown more acute as the ongoing computer revolution has coaxed more of us to move important personal property to digital storage. This is convenient to be sure, but comes with substantial tradeoffs and risks. Uploading your photo libraries to a public cloud is no different than giving a shoebox full of them for safe keeping to a complete stranger you met on the bus. Some such strangers may not only pass along your shoebox to others without getting your okay, they might just hold it for ransom in the future. If you wouldn't do this in the "real" world, why would you do so on the web?

Regardless of how you proceed, please be careful out there. Trusting your personal digital property to a third party is inherently perilous, especially when that party is laden with agendas. For better or worse, the computer industry at present seems to have no shortage of companies jockeying to establish points of control that can be used to squeeze nickels out of people with fewer nickels left to be squeezed. Random example: a company that abruptly adopts an advertising-or-subscription model for a game that had been freely available for over two decades may not have your best interest at heart (see Windows 8 Solitaire!).

Postscripts

May-17-14: Adobe's Creative Cloud goes offline for a day leaving subscribers in the dark, as reported here, here, and here.

Mar-13-15: Per the web, Apple, Amazon, Google, Microsoft, Dropbox, and Facebook aren't immune to service outages either.

Mar-18-15: For an example of how sensitive an issue cloud storage can be, see this controversy regarding a cloud provider.

Mar-28-15: Speaking of changing the rules after you've become dependent, see Amazon's change here and here.

Later: And from the hate-to-say-I-told-you-so department, Samsung axed most of its cloud-storage service in 2021.



© M. Lutz