mergeall — Post-Implementation Notes

Please note: this programmer-oriented page is a supplement to the original mergeall Whitepaper, and is similarly dated. In fact, this page is older still, and something of a relic: it reflects mergeall's very-early development and has not been significantly updated in years (despite more-recent, hard-won lessons), but it still provides extra project history and context. For space, former GUI screenshots and their links here have been removed. See the newer User Guide for the latest documentation and screen captures.

The mergeall system today is a general tool aimed at a wide variety of users. It was initially coded both for personal use and as an extra instructional example for PP4E book readers. For the latter role, and for developers of similar tools, this page discusses some of the issues and tradeoffs confronted along the way during mergeall's early development. For notes from more recent mergeall development, see both the Revisions history, and the latest User Guide added in version 3.0.


Timestamp and Filesystem Issues

In the somewhat limited usage that this system has seen to date, a slew of issues arose regarding file modification timestamp fidelity, which call the reliability of timestamps (and their usability in general) into question. These issues are not unique to mergeall, but are shared by all timestamp-based tools (including some cloud providers). Among these were issues with:

Python itself has an issue regarding timestamp precision in 2.X (only) that also required a workaround here; this is an inherent part of library dependence in general, and of Python's "batteries included" motif in particular.

Platforms whose filenames are case-insensitive impose additional and subtle constraints on some file operations. Windows, for example, requires that deletions happen before additions when implementing file case renames (and when backing them out), because a deletion will happily remove a file of different name case (see mergeall.py's mergetrees() docstring for more details).
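The ordering constraint above can be illustrated with a minimal sketch. The function name and arguments here are hypothetical and are not taken from mergeall.py; the code only shows why the deletion is run before the addition when a file's name changes in case alone.

```python
import os
import shutil

def apply_case_rename(from_dir, to_dir, old_name, new_name):
    """Replace to_dir/old_name with a copy of from_dir/new_name.

    Hypothetical helper, not mergeall's own code. On a case-insensitive
    filesystem both names refer to the same file, so copying first and
    deleting second would remove the file that was just copied.
    """
    old_path = os.path.join(to_dir, old_name)
    if os.path.exists(old_path):
        os.remove(old_path)                        # deletion first
    shutil.copy2(os.path.join(from_dir, new_name),
                 os.path.join(to_dir, new_name))   # addition second
```

With this order, the result is the same on case-sensitive and case-insensitive platforms: only the newly cased file remains in the destination folder.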

If we also factor in exotic file types such as links and fifos (a.k.a. symlinks and named pipes)—not yet tested or fully addressed here—it makes for a fairly large functionality matrix. In the end, supporting the timestamp and other idiosyncrasies of all file systems and all file types on all platforms may be a daunting task, suggesting that other techniques (e.g., checksums?) might be worth exploring.
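As a rough illustration of the checksum alternative floated above, the following sketch compares two files by content rather than by modification time. It is not part of mergeall; the function names are invented for this example, and SHA-256 is just one reasonable choice of hash. The obvious cost is that every byte of both files must be read, which is much slower than a timestamp check.

```python
import hashlib
import os

def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def same_content(path_a, path_b):
    """Compare two files by size first (cheap), then by digest (full read)."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_digest(path_a) == file_digest(path_b)
```

Because the comparison ignores timestamps entirely, it is unaffected by filesystem timestamp granularity or timezone differences.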

File systems also vary in other usage details such as maximum path lengths, permissions, and more; these are impossible for any file-processing tool to repair, but could become customer support issues in a widely used synchronization system.

3.0 update: actually, Windows pathname-length limits did prove possible to repair; symbolic links are now supported and fifos skipped; and exFAT formatting is a workable solution to the DST rollover issue. But this point is still valid—there are many other variables in the content-archiving domain.

Workarounds and Resources

For workarounds to some timestamp and filesystem issues not addressed by code, see the usage notes (and other changes) in the version history sections of this project's Revisions.html file, as well as the higher-level usage pointers and limitations sections in Whitepaper.html. In sum:

In all cases, file update failures are simply left as differences for a next run, but may produce log messages.

For more details on implementation issues, see the CAVEATS section in mergeall.py's docstring, and Python demos of some timestamp issues in the (now defunct) examples/issues folder of this system's tree. Further technical details are also available on the Web. FAT (and other) file timestamp issues are well known in the Windows world at large; for more details, try a web search or see:

Maximum path length limits vary by platform and filesystem. On Windows, there is a general 256-character limit, as mentioned in the Revisions.html file, and described in the following:

3.0 update: Windows pathname-length limits have been removed as of mergeall 3.0. See the details.


Python 3.X/2.X Compatibility

This proved more elusive than expected. Issues related to supporting both Pythons' flavors of stream buffering and encoding, as well as their models for file processing in general, made this program more complex than it might have been. It's possible to support both Pythons, but doing so requires substantial extra effort; especially for system-related work, supporting a single Python line still makes for simpler code.

See the code files' internal documentation, as well as the version changes in the Revisions.html file, for more on the specific compatibility issues confronted.

One specific note: even Python 3.X seems to still be struggling to come to grips with the full ramifications of Unicode, in some of its standard libraries. Its subprocess.Popen object, for example, always uses a Unicode encoding setting in the locale module to automatically decode text-mode output streams of spawned programs. This is clearly insufficient in the presence of potentially wide encoding variability.

For instance, printing the names of files in a tree—as done by this and other systems—may require broad and arbitrary encodings if the tree includes files obtained from the Internet. At the least, encoding should be controllable with an optional argument. Binary-mode streams help in 3.X itself, and were adopted in the end, but at the same time aggravate 3.X/2.X compatibility issues.
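The binary-mode approach described above might look roughly like the following in Python 3.X. This is a sketch rather than mergeall's launcher code: the function name and its defaults are assumptions made for illustration. The spawned program's output is read as raw bytes and decoded explicitly, with undecodable bytes replaced instead of raising an exception. (Python 3.6 and later also accept encoding and errors arguments on subprocess.Popen, which postdates the text above.)

```python
import subprocess

def run_and_decode(command, encoding="utf-8", errors="replace"):
    """Spawn command, read its stdout as bytes, and decode each line explicitly."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)   # binary-mode pipe
    lines = []
    for raw in proc.stdout:                  # each item is a bytes line
        lines.append(raw.decode(encoding, errors).rstrip("\r\n"))
    proc.stdout.close()
    proc.wait()
    return lines
```

The caller chooses the encoding, so a filename that is invalid in that encoding shows up with a replacement character instead of aborting the read.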

Update, Jul-14:

See also the Revisions.html file's notes for version 1.6: this version works around a further Unicode filename decoding issue, which reflects a fundamental difference in process text stream content in Python 2.X. In the end, Python 2.X/3.X compatibility for system-level programs like mergeall may be a pipe dream, and is certainly substantial extra work.


Storage Device Reliability and Speed

During this project's development and testing, there were some spectacular backup device failures, including:

In other words, personal peripheral storage devices are not entirely reliable today. If you add to this the long but limited write/connect cycle lifespans of USB flashdrives, your data can be on somewhat shaky ground.

That being said, this system has gone on to be used regularly for over a year to date without any device problems; both the program and the devices it has run against proved to be robust enough to trust with valued content (and certainly no less reliable than network cloud storage would be; see, for instance, the recent outages at the end of the preceding link's section).

Still, because all backup and synchronization tools can be only as reliable as the devices on which they write files, the best policy seems to be a defensive one—make as many copies of your data as possible, so that you will have functional versions when needed. Storage devices are much cheaper than loss of important data.

Update: New Faster Devices (and a Word on Clouds)

It turns out that not all flashdrives are created equal, and one does seem to get what one pays for. After writing the above, the target use case adopted two new 128G SanDisk storage cards: a USB 3.0 "Extreme Pro", and a MicroSD SDXC UHS-1 "Ultra". Especially on a USB 3.0 port, the USB card was radically faster—5 to 10 times as quick as some former USB drives, even for the many-small-files mix of typical archives. Full comparisons that formerly took 1.5 hours now finished in just 8 minutes, and full copies that ran 1.5 to 2 hours (or more) on other devices came in at just 20 minutes. This is for a combined data set that is now 73G, with 45K files and 2600 directories as this update is being written.

On a USB 3.0 port, the USB card's 260/240 MB per second read and write rated speeds would be faster than harddrives in some cases, and roughly 80 and 320 times quicker for reads and writes, respectively, than typical broadband Internet—and hence Internet-based cloud storage transfer speed. At just 28 and 6 Mb/sec (my current Wifi broadband's average speeds for downloads and uploads), an Internet-based cloud could seem painfully slow by comparison, especially for large data sets or files (see calculations below). The MicroSD card wasn't quite as fast, but was still an improvement in both speed and space.

These new devices are quick enough for occasional brute-force full copies, but it's still important to minimize writes on flashdrives to extend their lifespan—exactly what mergeall's changes-only updates achieve. Moreover, thanks to its selective updates model, a typical mergeall run for this same archive takes just 1 minute, which is still some 20 times quicker than a full copy even at USB 3.0 and premium card speeds. Given that a complete copy also generally merits a complete verification compare, a mergeall run is really some 30 times quicker (1 minute versus 20 + 8).

The newer drives also are not cheap: at $200 each when first released, their combined cost is roughly the same as that of the secondary device they are used to support (a Dell 8" Windows 8.1 tablet with a decent processor and screen, but crippled by a single data/charge port and what's left of 64G storage; calculated or not, this limitation seems destined to encourage cloud storage). As always, though, data matters more than device.

For more on the comparison of flashdrives to cloud storage, see the Whitepaper write-up. For pointers on using drives shared on a local network, see the top-level Revisions.html file's usage note (spoiler: they also seem very slow by comparison to USB drives, checking in at 35 to 50 times slower on tests run). Direct Wifi results may or may not vary, but physically connected devices will likely always be quicker.

Corrected, Oct-14:

The preceding was originally off by a factor of 8 due to measurement unit differences, but has been corrected. On closer analysis, the USB 3.0 card on a USB 3.0 port is rated at 260/240 MB (bytes) per second for read/write transfers, but my Wifi broadband checks in at 28/6 Mb (bits) per second for download/upload transfers, the equivalent of read/write for Internet-based cloud usage. Given 8 bits per byte, this means the USB card is actually 8 times quicker than originally stated above: USB is some 80 and 320 times quicker for reading and writing data, respectively, than typical broadband Internet. For instance, 320 = (240 MB * 8 bits/byte) / 6 Mb for writes. Some such statistics may reflect ideal conditions, of course, and some cloud services may gain from synchronizing on a frequent file-by-file (not occasional whole archive) basis; but the difference is clearly striking.
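The unit conversion behind these ratios can be checked with a few lines of Python 3.X. The function name is invented for this example; the inputs are the rated and measured speeds quoted above. Note that the read ratio computes to about 74, which the text rounds up to "some 80".

```python
def speedup(device_megabytes_per_sec, network_megabits_per_sec):
    """Return how many times faster the device is, after converting to bits."""
    device_megabits_per_sec = device_megabytes_per_sec * 8    # 8 bits per byte
    return device_megabits_per_sec / network_megabits_per_sec

read_ratio = speedup(260, 28)     # USB read rate versus broadband download
write_ratio = speedup(240, 6)     # USB write rate versus broadband upload

print("reads:  %.1f times quicker" % read_ratio)     # about 74.3
print("writes: %.1f times quicker" % write_ratio)    # 320.0
```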


Decoupled versus Single-Process Architectures

Both launchers (GUI and console) use a decoupled model, which spawns mergeall in a separate process and intercepts its output streams. Both also duplicate intercepted output to logfiles on request, and the console launcher and spawned mergeall share the console input stream in interactive mode.
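In outline, the decoupled model can be sketched as follows in Python 3.X. This is an illustration under stated assumptions, not the launchers' actual code: the function name, its arguments, and the choice to merge stderr into stdout are all hypothetical.

```python
import subprocess

def run_decoupled(command, logfile_path=None, encoding="utf-8"):
    """Spawn command, echo each output line, and optionally copy it to a logfile."""
    logfile = open(logfile_path, "w", encoding=encoding) if logfile_path else None
    proc = subprocess.Popen(command,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)    # merge stderr into stdout
    shown = []
    try:
        for raw in proc.stdout:                          # bytes, one line at a time
            line = raw.decode(encoding, "replace").rstrip("\r\n")
            print(line)                                  # console display
            shown.append(line)
            if logfile:
                logfile.write(line + "\n")               # duplicate to the logfile
    finally:
        proc.stdout.close()
        if logfile:
            logfile.close()
    return proc.wait(), shown
```

When the spawned program is itself a Python script, passing the interpreter's -u flag in the command (for example, [sys.executable, "-u", "script.py"]) makes its output unbuffered, so lines reach the launcher as they are printed instead of in large delayed blocks.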

This is a standard technique, and is required for running non-Python programs. As the change log attests, though, it led to a handful of implementation issues related to stream decoding and buffering, especially for dual Python 3.X/2.X usage. In retrospect, there might be advantages to a single-process model, which imports mergeall as a module and calls its functions directly, instead of spawning it as a separate program.

In a single process model, the input and output streams may still be intercepted, by assigning their sys module attributes to class instance objects with read/write methods that route data to and from console or GUI components, and possibly echo it to a logfile. The GUI launcher would still need to thread calls to mergeall and route its intercepted lines to a queue to be polled and processed in the main GUI thread as it is now. The console launcher would simply print and possibly log intercepted text. Stream reads, however, would be replaced with write methods that receive printed text from mergeall by direct calls, and decode as appropriate.
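A minimal Python 3.X sketch of this single-process interception follows. None of it comes from mergeall or the book: QueueWriter, fake_merge, and run_in_process are hypothetical names, and fake_merge merely stands in for an imported mergeall function that prints its progress. A real GUI launcher would poll the queue from a timer callback instead of joining the thread.

```python
import queue
import sys
import threading

class QueueWriter:
    """File-like object whose write() queues text instead of displaying it."""
    def __init__(self, line_queue):
        self.line_queue = line_queue
    def write(self, text):
        self.line_queue.put(text)
        return len(text)
    def flush(self):
        pass                                  # nothing is buffered here

def fake_merge():
    """Hypothetical stand-in for an imported mergeall function that prints."""
    print("comparing trees")
    print("updating files")

def run_in_process(func):
    """Run func in a thread with sys.stdout intercepted; return its printed text."""
    line_queue = queue.Queue()
    saved_stdout = sys.stdout
    sys.stdout = QueueWriter(line_queue)      # process-wide: affects all threads
    try:
        worker = threading.Thread(target=func)
        worker.start()
        worker.join()                         # a GUI would poll on a timer instead
    finally:
        sys.stdout = saved_stdout             # always restore the real stream
    chunks = []
    while not line_queue.empty():
        chunks.append(line_queue.get())
    return "".join(chunks)
```

Because the text arrives as already-decoded strings through direct write() calls, there is no pipe to read and no byte stream to decode, which is the main simplification this model offers.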

For details and examples of intercepting streams with objects in a GUI, see the chapter GUI Coding Techniques in the book (PP4E), especially its sections GuiStreams: Redirecting Streams to Widgets and Adding a GUI As a Separate Program: Command Pipes. A single process model implementation would use nearly identical techniques, but the book's GuiOutput.write() method would simply queue mergeall's output lines to be displayed by a timer loop in the main GUI thread.

Tradeoffs and Conclusions

The single-process model may hold promise, but it's unclear if it would be simpler than the current decoupled scheme. Moreover, among its generic disadvantages, a single process cannot be distributed to run on multiple CPU cores, and problems in the spawned program become problems in the launcher—in a decoupled model, mergeall exceptions or crashes don't impact the GUI, though this might be mitigated in the single-process approach by thread exception handlers and stderr routing.

The current decoupled mergeall system works both well and as intended for its initial target use case, so any further refactoring here is left as a suggested exercise. Replacing decoupled streams with in-process stream interception may or may not prove simpler; given the nature of software development, though, it seems more likely to come with an entirely new set of implementation issues.


© M. Lutz