
Git Your MS Office Docs

Commit docs to Git, Use Diff to "Patch" Text

I’d love Git to grok MS Office docs, but it doesn’t really. It came up again today at work, and coincidentally in the Disqus comments for an older blog entry of mine, The rise of version control.

Anyway, I thought I’d spike what Git would do, were it reworked to silently unzip (for commit) and rezip (as it makes the working copy). Here’s a repo – git-word-diff-test. Here’s a commit of a simulated storage of a Word doc (Mac Office) – Mary.docx – which just contains, one word per line, “Mary had a little lamb,” and a single commit that changed that to “Karl had a little iPad.”
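The unzip-on-commit, rezip-on-checkout idea can be approximated by hand today. What follows is only a rough sketch, not something Git does and not necessarily how the test repo was built: the function names and the ".unpacked" directory suffix are made up for illustration, and it assumes the common `unzip` and `zip` command-line tools are installed.

```shell
#!/bin/sh
# Sketch only: function names and the ".unpacked" suffix are hypothetical.

# Explode an Office document (a zip archive) into a sibling directory,
# so that Git can track the XML parts inside it as ordinary files.
unpack_office_doc() {
    doc=$1
    dir="$doc.unpacked"
    rm -rf "$dir"
    mkdir -p "$dir"
    unzip -q -o "$doc" -d "$dir"
}

# Rebuild the document from the exploded directory, as a checkout step might.
# -r recurses into subdirectories, -X leaves out extra file attributes.
repack_office_doc() {
    doc=$1
    dir="$doc.unpacked"
    rm -f "$doc"
    (cd "$dir" && zip -q -r -X "../$(basename "$doc")" .)
}
```

With these defined, something like `unpack_office_doc Mary.docx && git add Mary.docx.unpacked` would stage the exploded parts, and `repack_office_doc Mary.docx` would rebuild the file afterwards. Whether Office accepts every rebuilt archive without complaint is not something this sketch guarantees.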

Turns out the commit is pretty noisy – search in page for “iPad.” It was the only intended change. It’s a shame that a whole bunch of random and temporal stuff changes at the same time. Microsoft: try to make idempotent things please.

Never mind, the second commit was only a 170-byte addition to the .git/ blobs, when the .docx file is ordinarily 23 KB in size.

Byte diff calc for the HEAD commit:

# Resolve the commit at HEAD
COMMITSHA=$(git rev-parse HEAD)
# Sum the sizes (4th column of ls-tree -l) of every blob in that commit's tree
CURRENTSIZE=$(git ls-tree -lr "$COMMITSHA" | awk '$2 == "blob" { s += $4 } END { print s + 0 }')
# Same sum for the parent commit
PREVSIZE=$(git ls-tree -lr "$COMMITSHA^" | awk '$2 == "blob" { s += $4 } END { print s + 0 }')
# Net change in bytes between the two commits
echo $((CURRENTSIZE - PREVSIZE))

(modified from stackoverflow)

As I sorta said in the Disqus comments, I’d pay for Git to be changed to silently unzip .docx, .xlsx and .pptx documents and only reconstitute them in the working copy as I check out. I’m only interested in the carriage-return-delimited text diffs (incl. XML), if I’m diffing at all. Diffs on the binary aspects of zips are as wrong as an operating system not being a Unix derivative, despite what Cosmin says.
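Stock Git does offer a partial workaround, on the diff side only. A `textconv` diff driver leaves the stored blob as the binary zip but lets `git diff` compare text extracted from it. That is a different technique from the storage change wished for above, and the sketch below rests on a few assumptions: the driver name `docx` and the function name are arbitrary, `unzip` must be on the PATH, and only `word/document.xml` (the main body part of a Word file) is extracted.

```shell
#!/bin/sh
# Sketch only: run inside a Git working copy. The driver name "docx" and
# this function name are arbitrary choices, not anything Git prescribes.
setup_docx_textconv() {
    # Route .docx files through the "docx" diff driver.
    echo '*.docx diff=docx' >> .gitattributes
    # Git appends the path of a temporary copy of the blob to the command,
    # so it arrives as $0 of the inline script; unzip -p prints that one
    # archive member to stdout.
    git config diff.docx.textconv "sh -c 'unzip -p \"\$0\" word/document.xml'"
}
```

After calling `setup_docx_textconv` once in a repository, `git diff` and `git log -p` should show changes to the extracted XML rather than "Binary files differ". Word often writes that XML with very few line breaks, so the diff may still be coarse unless the output is also passed through a pretty-printer.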

Update (Sept 1st): Inside the zip, there’s vbaProject.bin for Word/Excel/PowerPoint docs that have VBA. This raises the bar on the idea – as it is binary. Luckily there is open-source know-how that will allow this otherwise less-than-open-standard piece to be unpacked too: Philippe Lagadec’s oletools.

Published at DZone with permission of Paul Hammant, DZone MVB. See the original article here.
