What we learned while building AI-powered tools in 2023.
THE CURRENT STATE OF AI
If you lead any sort of digital team, your news algorithm has spent the last few months serving you an unlimited buffet of stories about what tools powered by LLMs (Large Language Models) like ChatGPT might mean for the future of your business. Whether it’s expressed with exuberance or with terror, the thesis is almost always the same: “ChatGPT is faster and cheaper than your employees.”
As I read these articles, I can’t shake an ominous parallel to Jurassic Park. The team at OpenAI has brought to life something as massive and powerful as a brachiosaurus, and I’m listening intently to make out which John Williams song is playing in the background. One minute it feels like this moment in history deserves the sweeping title theme, and the next it feels like the eerie “Hatching a Baby Raptor” might be more fitting. It’s a landmark moment either way, and one thing is viscerally clear—this technological shift is going to change everything.
If that’s true, what does it mean for your organization? Should GPT-3 or GPT-4 materially change your headcount projections? Could AI help your team work more effectively and efficiently? Most importantly, is the AI good enough? Should you trust it?
That’s what Giant Machines aimed to figure out when we launched our own AI/ML project in early 2023. We wanted to know how much of our work could be performed by Generative AI in its current state, and whether immersing ourselves in this space could set us up to deliver more value to our clients in both what we make and how we make it.
THE VISION AND THE OPTIONS
With a few test calls to the GPT-3 and GPT-4 APIs under our belt, and with excitement about what might be possible, we assembled a small engineering team to understand the landscape and start prototyping. If you manage a team that relies on an ever-growing body of knowledge, our problem statement probably resonates.
The problem: When someone joins an in-progress project, orienting them is a mammoth undertaking. The more time engineering teams spend explaining the anatomy of the product, the less time they have to make new contributions.
The vision: Create a Slackbot to answer onboarding questions that might normally require either arduous GitHub spelunking or face time with a project’s technical lead. We could work so much more efficiently if questions like “who wrote most of the forms in this app?” could be answered quickly and independently.
With a Slackbot prototype as our goal, we were immediately faced with the classic question—build or buy? Four clear options emerged:
1. Build our own LLM to compete with OpenAI. This would take a long time and cost a great deal of money; it isn't a wise or realistic option for newcomers to this space.
2. Buy the service from an org that has already built it. If the task is a common one, like a customer service chatbot, it may be better to pay for the service than to reinvent the wheel.
3. Pay for the GPT-3 or GPT-4 API and add our own layer(s). Out of the box, ChatGPT won't know which tech stacks we tried before settling on the current one. With a little extra context, it could learn.
4. Privately host and refine an open-source model. Open-source models are somewhat less robust, and refining them requires more Machine Learning (ML) expertise than options 2 and 3.
We’re a nimble team that loves to learn, and one that can’t throw millions at an experimental initiative, so options 1 and 2 were out. We bookmarked option 4 for later and set out to explore option 3.
We focused on providing context with LlamaIndex, and created a sample project called Planterbox—so named for our excitement about how fertile this soil is, but also for the boundaries we knew we’d need in order to maintain a responsible and focused approach.
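To make the "extra layer" of option 3 concrete, here is a minimal sketch of context injection: wrapping a question in project-specific context before it ever reaches the model. This is illustrative only, not Planterbox internals; all names and strings are hypothetical, and in practice a library like LlamaIndex handles this assembly for you.

```python
# Sketch of a context-injection layer around an LLM API.
# Everything here is illustrative, not Planterbox code.

def build_prompt(question: str, context: str) -> str:
    """Assemble a grounded prompt: project context first, then the question,
    with an explicit instruction to admit when the answer isn't present."""
    return (
        "Answer using only the project context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"--- CONTEXT ---\n{context}\n--- END CONTEXT ---\n\n"
        f"Question: {question}"
    )

# A stand-in for a README pulled from GitHub.
readme = "Planterbox uses a TypeScript front end and a Python API layer."
prompt = build_prompt("What language is the API written in?", readme)
# `prompt` would then be sent to a completion endpoint (e.g. the OpenAI API).
```

The key design choice is the explicit "say you don't know" instruction, which nudges the model toward short, honest answers instead of confident guesses.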
HANDS-ON FIRST IMPRESSIONS
We started by connecting the PlanterBot to a README file on GitHub, and the first few tests with that context blew us away. We found that the GPT-3 endpoint was able to answer our questions about that document with stunning accuracy. More importantly, when there wasn’t enough information to give a verbose answer, it told us so instead of hallucinating. With an AI whose prime directive is to produce language, less is more. We’d rather get a short, correct response than a longer, more dubious one.
Five years ago, I could easily have spent a month setting up Zapier integrations to mirror vital information across different tools. Now it seems like a frictionless, almost instantaneous option is emerging.
But before you try to switch out your project manager for an AI, you should know that the longer we investigated, the more we saw the fog of new technology coalesce into two specific obstacles. Any responsible and effective implementation of AI would need two things: a very clear plan for securing data, and a clear understanding of the stakes when (not if) the AI gets it wrong.
PITFALL 1: PRIVACY NOT GUARANTEED
For our purposes, we decided to guard against oversharing by creating mock data for Planterbox—we even used ChatGPT to generate some of it. Security for our own data is important, and for our clients’ data, it’s critical. Everything we sent in an API call was either already publicly available online, or generated for this project specifically.
Longer term, if we were ever going to create a tool secure enough to manage client data, we had to move away from the OpenAI services altogether. It made the most sense to take an open-source model, provide it with project-specific context, and host it ourselves. The mechanics of that are beyond the scope of this post, but the upshot of that decision was that we had to work with a less advanced AI than the one that powers ChatGPT. In the name of protecting client privacy, we find that to be a necessary safeguard.
PITFALL 2: OVERCONFIDENCE
As we refined and tested our privacy strategy, we had another obstacle to tackle in parallel.
After our early successes with Planterbox, progress had begun to slow. The more sources we added to the prompt context, the more likely the AI was to give incorrect answers, even when we knew the most recent and correct information was in our Notion workspace. We were able to cut down on the hallucinations by switching to a more powerful OpenAI endpoint, but there’s an inherent tradeoff, as the more powerful AIs come with noticeable latency and higher costs. Coupled with our stance on privacy, we had to change course, and that started with diagnosing the problem.
The real issue was that we had changed the nature of the task we were giving the LLM. We were no longer asking it to summarize a single source; we needed it to synthesize information from multiple sources. We had switched from a task the LLM could perform excellently to one it could perform only with intermittent fidelity. GPT-4 shows some promising gains here, but when privacy is a first-order concern, using any OpenAI product may be a nonstarter.
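Part of why "more context" can hurt is that every irrelevant source dilutes the prompt. A common mitigation, and roughly what context-loading libraries like LlamaIndex do under the hood with embeddings, is to rank chunks by relevance and send only the best matches. The sketch below is a deliberately crude stand-in: plain word overlap instead of embedding similarity, with made-up sample sources.

```python
# Rank candidate context chunks by relevance to the question and keep only
# the top matches, rather than stuffing every source into the prompt.
# Word overlap stands in for embedding similarity here.

def score(question: str, chunk: str) -> int:
    """Count shared lowercase words between question and chunk."""
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split()))

def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest relevance score."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)[:k]

sources = [
    "The forms in this app were mostly written by the front-end team.",
    "Deployment runs through GitHub Actions.",
    "Notion holds the latest onboarding checklist.",
]
best = top_chunks("Who wrote most of the forms in this app?", sources, k=1)
# Only the most relevant source reaches the prompt.
```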
It’s easy to see how we got here. ChatGPT feels like a human interface for the corpus of knowledge it was trained on. As the models increase in size, they gain more emergent capabilities—as “life finds a way,” so too does AI. We ask ChatGPT a question it wasn’t designed to answer, and it responds so coherently that we believe it by default. But introducing human-style reasoning with a neural network has effectively introduced human error into an arena where we aren’t used to seeing it. That has led to an ever-growing compendium of evidence that the real trouble arises when we confuse what a GPT product can do with what it can almost do.
HOW TO WIN: DECOMPOSITION & GUARDRAILS
That realization ultimately led us to some organizational guidelines that we’d like to share. If you’re ready to leverage AI to make your team more efficient, this is how you walk the middle path and win:
- Understand what your LLM can and cannot do. Leverage your LLM for what it does best. In our case, that meant summarizing and interrogating a single document, and synthesizing broad takeaways across multiple documents. We’re de-emphasizing search and more pointed factual questions until we see stronger results.
- Break up the task for easier debugging. At first brush, LLMs might seem to eliminate the need to decompose problems into smaller, more manageable parts. But we’re finding that without decomposition, an LLM can feel like a black box, and debugging a black box is really challenging. Tackle a small problem, and experiment with different prompts and context-loading methods to learn what works best.
- Use your LLM selectively for the right use cases. An LLM falls under the category of Generative AI, so if you need text generated, it might very well be the best tool. Decomposition is your friend here again. It will help you avoid using the LLM in situations where more straightforward, cost-effective, and task-appropriate tools already exist.
- Prepare for the inevitable failures. At their core, Large Language Models are designed to do one thing excellently: produce a situationally-appropriate string of words. For any more specific tasks, you’ll encounter some errors and unexpected behaviors. You and your team must think through the consequences of getting it wrong, the safeguards (human or machine) that you’ll put in place to prevent that, and the level of transparency you’ll share with your users about those policies.
For the foreseeable future, you and your team are going to be asked to build LLM-powered everything. After these internal builds, we’re pretty certain that’s the wrong way to go. We’re ready and excited to partner with client engineering teams to build the right things.
GIANT MACHINES: READY TO BUILD
At Giant Machines, we’re seeing some potential clients approach us to build with an LLM because it feels revolutionary. ChatGPT seems to be able to do almost anything, so leaders are justifiably eager to deploy it somewhere. We love that eagerness in our potential partners, but we also know that building a product for its own sake almost always leaves a client unhappy. We pride ourselves on building scalable solutions, which means it’s more important that our client has clarity around their problem—not their product.
When we can work with a potential client with that problem-first approach in place, we’re energized and ready to jump in. With generative AI, there’s a whole new set of solutions coming into focus. For the right problem, an LLM-powered tool can be a game changer. We’re building those tools quickly and securely.
FINAL THOUGHTS: FERTILE SOIL; BOUNDARIES REQUIRED
When newsfeeds are overrun with discussion of a single piece of technology that you haven’t mastered, it’s easy to get swept up in the frenzy. Understandably so—there’s a huge opportunity here. But we’ve already seen that recklessly racing to be among the earliest adopters is risky, both ethically and financially. In fact, that’s the villain’s fatal flaw in every single one of the Jurassic Park movies. As Dr. Ian Malcolm put it, “Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should.”
Even if you aren’t swept up in the excitement, it’s tempting to operate from a place of fear. The Fear Of Missing Out can drive you to implement new ideas without understanding their potential pitfalls, and at the opposite extreme, fear of the unknown can cause you to fall completely behind.
The studio team that’s working on this is learning more each day, and we’re seeing firsthand that clients who hesitate are falling behind. We’re ready to work with clients who want to move quickly without moving recklessly.
If you want to leverage new advances in AI to unlock new efficiency and productivity within your engineering team, you’ll need to find the happy medium between these two fear-based extremes. Responsible AI requires that we’re honest with ourselves, our teams, and our users about what we’re doing, how sure we are about it, and what’s still on the horizon. Pragmatic excitement is how you win.