Modern Single Page Applications (SPAs) often embed terms and conditions pages into the app itself for a slick and modern feel. While this makes for a great user experience, it can be tedious and time-consuming for developers to convert long Microsoft Word documents of legal copy into HTML/JSX that can be embedded into a terms and conditions component or modal. But fret not, fellow developer! With some macOS and shell utilities, you can let the computer handle the drudgery and focus on something more important.
Simply using some built-in command line programs in macOS will do the trick, converting the Word document into clean HTML you can paste into your terms and conditions component! And it even works on other document formats such as .txt and .rtf files. Let’s see how — and why.
Laziness Is a Virtue
Larry Wall, the creator of the Perl programming language, argued that laziness is one of the primary virtues of a good programmer. It’s that virtue that makes coders “write labor-saving programs that other people will find useful.” And indeed, a good engineer will “go to great effort to reduce overall energy expenditure” by finding opportunities to save time and effort and make processes more efficient.
Along these lines, I’d say that a primary coding virtue is the ability to identify which tasks are rote, repetitive, and best delegated to a computer, and which tasks instead require human creativity, problem solving, and ingenuity. We’ll only ever have time and energy for the latter category if we find a way to let the computers handle the boring, repetitive stuff.
The Inevitable Terms & Conditions Ticket
It’s inevitable. When developing a new SPA, there will come a day that you or your teammate will be assigned a ticket to create an embedded terms and conditions page or modal. (That’s our litigious modern society. Sigh…) Typically, a developer is handed a Word document from the legal department and some fancy designs, and left to figure out the rest.
The most painstaking approach would be to manually copy each paragraph, add any bold and italic formatting, and wrap it in appropriate HTML tags. This can take a while if it’s a long Word document! And it won’t be pleasant. Our laziness instincts should be kicking in about now.
We can make this process a little less manual with a nifty VS Code extension that lets us wrap each paragraph or sentence of text in the appropriate tags, such as <i>. But it's still a pretty manual process of copying, pasting, and formatting. How can we fully automate this?
There’s a CLI for That
Good news! macOS ships with a command line tool called textutil that excels at converting documents between formats. It can convert a Word document into HTML in a single terminal command:
textutil -convert html -strip terms.docx
This will take your Word document, strip out all the metadata, and convert it into basic HTML markup. By default the result is written next to the source file as terms.html. Paragraphs will be wrapped in <p> tags, and bold and italic formatting tags will be added as well. No more need to go through the document paragraph by paragraph yourself. Joy!
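As a rough sketch of how that conversion step might be wrapped for reuse, the helper below checks that textutil is available and that the input exists before converting. The function name convert_to_html is hypothetical, and the -output flag is used here only to make the destination explicit; the file name terms.docx is the article's running example.

```shell
#!/bin/sh
# convert_to_html: hypothetical helper wrapping the textutil call described above.
# Writes the HTML next to the source file, e.g. terms.docx -> terms.html.
convert_to_html() {
  src=$1
  # textutil ships with macOS only, so bail out early elsewhere.
  if ! command -v textutil >/dev/null 2>&1; then
    echo "textutil not found (it ships with macOS only)" >&2
    return 1
  fi
  if [ ! -f "$src" ]; then
    echo "no such file: $src" >&2
    return 1
  fi
  # -strip drops document metadata; -output names the destination explicitly.
  textutil -convert html -strip -output "${src%.*}.html" "$src"
}
```

Calling convert_to_html terms.docx on a Mac should leave terms.html in the same folder; on other systems the function reports that textutil is missing and returns a non-zero status.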
Much Too Classy
textutil creates some basic CSS styles for you based on the source Word document and attaches very generic class names such as Apple-converted-space to seemingly every tag it creates. But you probably don't want these generated class names polluting your markup. Not only are they ugly and hard to read, but such generic class names could clash with other classes in your app, leading to unintended consequences.
textutil lacks any built-in option to suppress these class names. Sure, we could manually remove all the classes from the generated markup, but we don't want to do that either.
Right Sed Fred
Fear not: we can clean up the HTML that textutil gives us using sed, a standard Unix tool for text manipulation that ships with macOS and can be run from Bash or Zsh. We'll pipe the HTML that textutil generates into sed, strip out all the class names, and save the result to a file.
The sed command we'll use to delete the class names is
sed 's/ class="[^"]*"//g'
Let's break that down. The leading s in the argument means we'll substitute text matching the pattern between the first and second / characters with the text between the second and third. The regex pattern we'll match is class="[^"]*" (explained below), preceded by a single space so that the space separating the attribute from the tag name is removed too. The replacement, between the last two slashes, is an empty string. And the trailing g flag applies the substitution to every occurrence on a line rather than only the first. That is, we'll simply delete the text matching the pattern throughout the document.
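To see the substitution at work without a Word document on hand, here is a minimal sketch that feeds one made-up line of HTML through the same sed expression, with the leading space in the pattern as in the final command. The sample markup and the class names p1 and s1 are illustrative assumptions, not captured textutil output.

```shell
#!/bin/sh
# Hypothetical sample line; real textutil output will differ.
sample='<p class="p1">Terms <b class="s1">apply</b>.</p>'

# s/PATTERN/REPLACEMENT/g with an empty replacement:
# delete every class attribute together with the space in front of it.
cleaned=$(printf '%s\n' "$sample" | sed 's/ class="[^"]*"//g')

printf '%s\n' "$cleaned"
```

Both attributes disappear and the tags themselves are left intact, so the line comes out as <p>Terms <b>apply</b>.</p>.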
About that funky-looking regex…
A pattern such as class=".*" looks like the obvious choice, but sed has no lazy quantifier, so it will greedily match far more text than you intended, well beyond the end of the HTML tag.
Instead, we can mock lazy matching in sed with this technique: we match the opening ", followed by any number of characters except a ", then the closing ". The pattern class="[^"]*" gets us the lazy matching we need. Once we pipe textutil's output through this sed command, we'll have nice, clean markup without all the random class names.
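The difference between the greedy pattern and the negated character class is easiest to see side by side. The sketch below runs both against one made-up line containing two class attributes; the sample markup is an assumption for illustration only.

```shell
#!/bin/sh
# Hypothetical sample line with two class attributes; not real textutil output.
sample='<p class="p1">See <i class="s1">Schedule A</i>.</p>'

# Greedy: .* runs to the LAST double quote on the line,
# swallowing the markup between the two attributes.
greedy=$(printf '%s\n' "$sample" | sed 's/ class=".*"//g')

# Negated class: [^"]* stops at the FIRST closing quote,
# so each attribute is removed on its own.
bounded=$(printf '%s\n' "$sample" | sed 's/ class="[^"]*"//g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The greedy version loses the text "See" and the opening <i> tag, leaving <p>Schedule A</i>.</p>, while the bounded version keeps the content and yields <p>See <i>Schedule A</i>.</p>.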
Putting It All Together
Last, we'll save the cleaned HTML to a file. The final command is
textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html
which (1) converts the Word document to HTML with textutil, writing it to standard output via -stdout, (2) strips out the class names that textutil adds to each tag with sed, and (3) saves the cleaned HTML to a file. From there, we can simply paste the HTML into our terms and conditions component in our SPA, style it, and call it a day. Building on this command, we could even take it a step further and strip out unnecessary <span> tags, and anything else we wanted to get rid of.
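One way the span-stripping idea might look is sketched below: the cleanup lives in a small function that takes HTML on standard input, and two extra sed expressions remove opening and closing <span> tags while keeping the text inside them. The function name clean_html and the script name in the usage comment are hypothetical, and the patterns assume that no > character appears inside a span's attributes.

```shell
#!/bin/sh
# clean_html: read HTML on stdin, write it to stdout
# without class attributes or <span> tags.
clean_html() {
  sed -e 's/ class="[^"]*"//g' \
      -e 's/<span[^>]*>//g' \
      -e 's/<\/span>//g'
}

# Convert a document only when a path is given and textutil exists (macOS only).
# Hypothetical usage: sh clean-terms.sh terms.docx > output.html
if [ -n "${1:-}" ] && command -v textutil >/dev/null 2>&1; then
  textutil -convert html -strip -stdout "$1" | clean_html
fi
```

Keeping the sed expressions in one function makes it easy to add further cleanup rules later, and lets you try the cleanup on any HTML snippet without running textutil at all.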
If a development task is manual, repetitive, time-consuming, and boring, that’s a sign. As developers, we should hone a keen awareness of this feeling, which is usually a clear indication that it’s time to automate the task and move on to more creative, higher-value problem solving. It’s a unique privilege of being a software engineer that we can (and should!) automate these annoying parts of our jobs. So, embrace your laziness, fellow devs! It’s the virtuous thing to do.
Convert your Word document to clean HTML on macOS by running this command in your shell:
textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html