Embrace Your Laziness: Automatically Convert Word Documents into Terms & Conditions Pages

Josh Stillman

January 7, 2022


Modern Single Page Applications (SPAs) often embed terms and conditions pages into the app itself for a slick and modern feel. While this makes for a great user experience, it can be tedious and time-consuming for developers to convert long Microsoft Word documents of legal copy into HTML/JSX that can be embedded into a terms and conditions component or modal. But fret not, fellow developer! With some macOS and shell utilities, you can let the computer handle the drudgery, so you can focus on something more important.

Simply using some built-in command line programs in macOS will do the trick, converting the Word document into clean HTML you can paste into your terms and conditions component! And it even works on other document formats such as .txt and .rtf files. Let’s see how, and why.

Laziness Is a Virtue

Larry Wall, the creator of the Perl programming language, argued that laziness is one of the primary virtues of a good programmer. It’s that virtue that makes coders “write labor-saving programs that other people will find useful.” And indeed, a good engineer will “go to great effort to reduce overall energy expenditure” by finding opportunities to save time and effort and make processes more efficient.

Along these lines, I’d say that a primary coding virtue is the ability to identify which tasks are rote, repetitive, and best delegated to a computer, and which tasks instead require human creativity, problem solving, and ingenuity. We’ll only ever have time and energy for the latter category if we find a way to let the computers handle the boring, repetitive stuff.

The Inevitable Terms & Conditions Ticket

It’s inevitable. When developing a new SPA, there will come a day that you or your teammate will be assigned a ticket to create an embedded terms and conditions page or modal. (That’s our litigious modern society. Sigh…) Typically, a developer is handed a Word document from the legal department and some fancy designs, and left to figure out the rest.

The most painstaking approach would be to manually copy each paragraph, add any bold and italic formatting, and wrap it in appropriate HTML tags. This can take a while if it’s a long Word document! And it won’t be pleasant. Our laziness instincts should be kicking in about now.

We can make this process a little less manual with a nifty VS Code extension that lets us wrap each paragraph or sentence of text in the appropriate <p>, <b>, or <i> tags. But it’s still a pretty manual process of copying, pasting, and formatting. How can we fully automate this?

There’s a CLI for That

Good news! macOS ships with a command line tool called textutil that excels at converting documents into different formats. It can convert a Word document into HTML in a single terminal command: textutil -convert html -strip terms.docx. This will take your Word document, strip out all the metadata, and convert it into basic HTML markup. Paragraphs will be wrapped in <p> tags, and bold and italic formatting tags will be added as well. No more need to go through the document paragraph by paragraph yourself. Joy!
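As a rough sketch of that first step, the command can be wrapped in a small shell function. The function name convert_to_html and the file name terms.docx are placeholders, not part of textutil. Without further options, textutil writes the result next to the source file with the new extension (terms.html here).

```shell
# Hypothetical helper: convert a document to HTML with textutil (macOS only).
convert_to_html() {
  # Bail out early on systems without textutil (it ships with macOS).
  if ! command -v textutil >/dev/null 2>&1; then
    echo "textutil not found (it ships with macOS)" >&2
    return 1
  fi
  # -convert html : the target format
  # -strip        : leave the source document's metadata out of the output
  textutil -convert html -strip "$1"
}

# Usage (terms.docx is a placeholder; the result lands in terms.html):
# convert_to_html terms.docx
```

The same function should work for the .rtf and .txt inputs mentioned earlier, since textutil picks the input format from the file itself.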

Much Too Classy

One problem! textutil creates some basic CSS styles for you based on the source Word document and attaches very generic class names such as p2 and Apple-converted-space to seemingly every tag it creates. But you probably don’t want these generated class names polluting your markup. Not only do they look ugly and make the markup hard to read, but these highly generic class names could clash with other classes in your app, leading to unintended consequences.

Sadly, textutil lacks any built-in option to suppress these class names. Sure, we could manually remove all the classes from the generated markup, but we don’t want to do that either.

Right Sed Fred

Fear not: we can clean up the HTML that textutil gives us using sed, a command line tool for text manipulation that ships with macOS and other Unix-like systems. We’ll pipe the HTML that textutil generates into sed, strip out all the class names, and save the result to a file.

The sed command we’ll use to delete the class names is sed 's/class="[^"]*"//g'. Let’s break that down. The leading s in the argument means we’ll substitute text matching the pattern between the first and second / characters with the text between the second and third / characters. The regex pattern we’ll match is class="[^"]*" (explained below). The replacement, the text between the last two slashes, is an empty string here. And the trailing g flag applies the substitution to every occurrence on a line rather than only the first. That is, we’ll simply delete the text matching the pattern throughout the document.
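To see the substitution at work, here is a small sketch run against a made-up line of markup (the sample HTML and variable names are illustrative, not real textutil output). It also shows a variant of the pattern with a leading space, which avoids leaving a stray space inside each tag.

```shell
# Made-up sample resembling a line of generated markup.
html='<p class="p1"><b class="b1">Terms of Use</b></p>'

# Pattern exactly as discussed: the attribute is removed,
# but the space that preceded it stays behind.
without_space=$(printf '%s\n' "$html" | sed 's/class="[^"]*"//g')

# Same pattern with a leading space, so the separator is removed too.
with_space=$(printf '%s\n' "$html" | sed 's/ class="[^"]*"//g')

printf '%s\n' "$without_space"   # prints: <p ><b >Terms of Use</b></p>
printf '%s\n' "$with_space"      # prints: <p><b>Terms of Use</b></p>
```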

About that funky-looking regex… sed doesn’t have the same regex capabilities you’re familiar with in modern languages such as JavaScript. It doesn’t have lazy matching, meaning that if you try to match class=".*", sed will greedily match far more text than you intended, well beyond the end of the HTML tag.

Instead, we can mimic lazy matching in sed with this technique: we can match the opening ", followed by any number of characters other than a ", then the closing ". So /class="[^"]*"/ will get us the lazy matching we need, effectively /class=".*?"/ in JavaScript’s regex dialect. Lazy matching for lazy programmers!
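The difference is easiest to see side by side. In this sketch (again using a made-up line with two class attributes), the greedy pattern swallows everything from the first class=" to the last double quote on the line, while the negated character class stops at the end of each attribute.

```shell
# Made-up sample with two class attributes on one line.
line='<p class="p1">Hello <span class="s1">world</span></p>'

# Greedy: .* runs to the LAST double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/class=".*"//g')

# Bounded: [^"]* cannot cross a double quote, so each match ends
# at the attribute's own closing quote.
bounded=$(printf '%s\n' "$line" | sed 's/class="[^"]*"//g')

printf '%s\n' "$greedy"    # prints: <p >world</span></p>
printf '%s\n' "$bounded"   # prints: <p >Hello <span >world</span></p>
```

Note that the greedy version has deleted the word Hello and the opening <span> tag along with the attributes.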

After running textutil’s output through this sed command, we’ll have nice, clean markup without all the random class names.

Putting It All Together

Last, we’ll save the cleaned HTML to a file. The final command line script is textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html, which (1) converts the Word document to HTML with textutil, sending it to standard output via -stdout so it can be piped, (2) strips out the class names that textutil adds to each tag with sed (the pattern here starts with a space so that no stray space is left behind inside each tag), and (3) saves the cleaned HTML to a file. From there, we can simply paste the HTML into our terms and conditions component in our SPA, style it, and call it a day. Building on this command, we could even take it a step further and strip out unnecessary <span> tags, and anything else we wanted to get rid of.
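One way the pipeline and the suggested <span> cleanup might be packaged is sketched below. The function names clean_html and docx_to_clean_html are invented for illustration, and the <span> handling is an assumption about what "unnecessary" means here: it drops every opening and closing span tag while keeping the text inside.

```shell
# Hypothetical helper: reads HTML on stdin, writes cleaned HTML to stdout.
clean_html() {
  # 1) remove class attributes (with their leading space)
  # 2) remove opening <span ...> tags
  # 3) remove closing </span> tags
  sed -e 's/ class="[^"]*"//g' \
      -e 's|<span[^>]*>||g' \
      -e 's|</span>||g'
}

# Hypothetical wrapper for the full pipeline (macOS only, needs textutil).
# $1 = source document, $2 = output HTML file.
docx_to_clean_html() {
  textutil -convert html -strip -stdout "$1" | clean_html > "$2"
}

# Usage (file names are placeholders):
# docx_to_clean_html terms.docx output.html
```

Keeping the sed step in its own function means it can be tried on any HTML snippet, even on a machine without textutil.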


If a development task is manual, repetitive, time-consuming, and boring, that’s a sign. As developers, we should hone a keen awareness of this feeling, which is usually a clear indication that it’s time to automate the task and move on to more creative, higher-value problem solving. It’s a unique privilege of being a software engineer that we can (and should!) automate these annoying parts of our jobs. So, embrace your laziness, fellow devs! It’s the virtuous thing to do.


Convert your Word document to clean HTML on macOS by running this command in your shell: textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html


