Neural Notes Episode #5: Solving API Code Generation with Gorilla

Chase Roberts
Sep 28, 2023

GitHub Copilot has captured the imagination of developers everywhere, and justifiably so. With a 35% code acceptance rate, the productivity gains are meaningful. At the same time, AI pair programmers like Copilot can’t take you entirely from zero to one. Software is both written and assembled, as most applications depend on decoupled services that talk to each other via application programming interfaces (APIs). API calls are notoriously tricky for experienced developers, let alone a tool based on a large language model (LLM). This disconnect means there’s still a gap between the potential of code generator tools to democratize coding and the reality that they’re not quite as good as humans.

AI-Generated using Midjourney

Enter Gorilla, an LLM that addresses this gap by serving correct API calls without additional coding. We interviewed Shishir Patil, one of Gorilla’s creators, on Neural Notes — our podcast diving deep with some of the most influential researchers and practitioners in AI and infrastructure — about the seminal research work, which you can watch here:

The problem 😬

LLMs are well-suited for code generation because the task is forgiving. For example, there are multiple ways to add two numbers together using JavaScript:

// A standard function declaration:
function add(num1, num2) {
  return num1 + num2;
}
console.log(add(5, 3)); // Outputs: 8

// An equivalent arrow function:
const addArrow = (num1, num2) => num1 + num2;
console.log(addArrow(5, 3)); // Outputs: 8

On the other hand, APIs require precision. A minor syntax error or typo will cause the API call to fail, which is why code generation tools haven’t excelled at this sort of task. Hallucination compounds this problem since LLMs will confidently generate an API call with no regard for the correctness of the code or even an API’s existence. Consider what this deficiency implies about software development: a developer can leverage code generation tools to accelerate their work but still must study API documentation and carefully craft an API call. Code generation tools can get amateur and experienced developers close to shippable code, but API calls currently act as a gatekeeper that hampers the potential productivity gains LLMs can bring. Given the ubiquity of APIs in software development, developers need a solution.
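To make that fragility concrete, here's a hypothetical sketch (the endpoint and parameter names are invented for illustration): unlike freeform code, a generated call is only valid if it exactly matches the API's contract, so a one-letter typo is fatal.

```javascript
// Hypothetical spec: one endpoint and the parameter names it accepts.
const spec = {
  "POST /v1/charges": ["amount", "currency", "source"],
};

// Validate a generated call against the spec. A single typo in the
// endpoint or in a parameter name is enough to make the call invalid.
function validateCall(endpoint, params) {
  const allowed = spec[endpoint];
  if (!allowed) return `unknown endpoint: ${endpoint}`;
  for (const name of Object.keys(params)) {
    if (!allowed.includes(name)) return `unknown parameter: ${name}`;
  }
  return "ok";
}

console.log(validateCall("POST /v1/charges", { amount: 500, currency: "usd" })); // -> "ok"
console.log(validateCall("POST /v1/charges", { ammount: 500 })); // -> "unknown parameter: ammount"
```

An LLM that hallucinates a plausible-but-wrong parameter name fails this check every time, whereas the `add` snippets above would run fine under many stylistic variations.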

In a perfect world, I could declare a task like "update the IAM role for this service in this Kubernetes cluster," and an AI pair programmer would generate the code for me. So, what are my options today? Fine-tuning a model on an internal code base might be feasible, but retraining this model with each subsequent API change would be cost-prohibitive, especially for a web-scale collection of millions of changing APIs.

Another approach is retrieval-augmented generation (RAG), where a retriever model, combined with a few labeled sample responses, fetches the relevant API documentation for a foundation model like GPT-4 to reason about. Not only is the accuracy of this approach too low for APIs (~60%), but it also doesn't align with the developer workflow. Imagine having to provide well-crafted examples of the target API call; at that point, would you still need a copilot? We should also note that LLMs tend to forget the middle sections of long texts, introducing another hurdle.
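As a rough illustration of the RAG pipeline described above, here's a minimal sketch in which word-overlap scoring stands in for a real retriever model (the documentation entries and prompt format are invented for illustration):

```javascript
// A tiny, stand-in "retriever": scores each API doc by how many words
// it shares with the user's query. Real systems use embedding models.
const apiDocs = [
  { name: "ImageClassifier", text: "classify an image into labels" },
  { name: "Translator", text: "translate text between languages" },
];

function retrieve(query, docs) {
  const words = new Set(query.toLowerCase().split(/\s+/));
  let best = docs[0];
  let bestScore = -1;
  for (const doc of docs) {
    const score = doc.text.split(/\s+/).filter((w) => words.has(w)).length;
    if (score > bestScore) { best = doc; bestScore = score; }
  }
  return best;
}

// Augment the user prompt with the retrieved documentation before
// handing the combined text to a foundation model.
function buildPrompt(query, docs) {
  const doc = retrieve(query, docs);
  return `Use this API reference:\n${doc.name}: ${doc.text}\n\nTask: ${query}`;
}

console.log(buildPrompt("translate this text to French", apiDocs));
```

The weak link is the last step: even with the right document in context, the foundation model must still emit a syntactically exact call, which is where the ~60% accuracy ceiling shows up.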

An AI pair programmer for APIs 🦍

APIs are the dominant mode of communication with applications on the web. User interfaces (UIs) represent an abstraction for people to interact with these APIs, but despite attempts, they aren’t the optimal abstraction for autonomous machines. If you — a human, and yes, we’re making this distinction now — wanted to book a vacation, you would interact directly with the UIs of various travel websites (flight, car rental, hotel, etc.). This same task wouldn’t be so easy for an autonomous [travel] agent working on your behalf. An agent is a machine, and machines communicate via APIs.

Gorilla makes this communication possible, bringing us closer to a world of autonomous agents working on our behalf. This quote from the paper summarizes Gorilla’s significance:

This transition from a small set of hand-coded tools, to the ability to invoke a vast space of changing cloud APIs could transform LLMs into the primary interface to computing infrastructure and the web.¹

Consider the impact of a tool like Gorilla in an enterprise software context. API documentation is meant to be distributed to ease the adoption of the corresponding APIs. But learning new documentation takes time and effort, whether you're a new engineer onboarding onto internal tools or a veteran working with an unfamiliar external API. Keep in mind: the objective of companies like Stripe, Twilio, Google Cloud, Microsoft Azure, and AWS isn't for developers to study API docs; it's for developers to use the APIs. Gorilla promises a future where developers don't need to learn the API documentation but instead ask Gorilla to generate an API call for them.

How Gorilla works 🍌

I’ll highlight two novel capabilities of Gorilla mentioned in the paper that explain how this is possible:

When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible API updates and version changes.¹

AWS updates its APIs multiple times per day. Given the velocity of changes to these and other APIs, it's not feasible to retrain an LLM as the changes happen. Gorilla adapts to API changes automatically using a technique called Retriever-Aware Training (RAT). In retrieval mode, Gorilla recognizes when a retriever has augmented the prompt and treats the retrieved data differently from the user's instruction. Upon fetching up-to-date API documentation, the model reasons about whether a retrieved document is relevant to the prompt and incorporates the corresponding documentation changes. This technique copes with inevitable API changes.
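A minimal sketch of what a retriever-aware prompt might look like (the marker text and the document format here are assumptions for illustration; see the paper for the actual fine-tuning format). The key idea is that the retrieved documentation sits behind an explicit marker, so a RAT-trained model can distinguish it from the user's instruction and judge its relevance:

```javascript
// Append retrieved documentation behind an explicit marker so the model
// can tell it apart from the user's instruction (format assumed).
function buildRetrieverAwarePrompt(userPrompt, retrievedDoc) {
  return `${userPrompt}\nUse this API documentation for reference: ${retrievedDoc}`;
}

const prompt = buildRetrieverAwarePrompt(
  "Generate an image classification API call",
  '{"api_name": "torchvision.models.resnet50", "version": "0.15"}' // hypothetical doc entry
);
console.log(prompt);
```

Because the documentation is injected at inference time, an API version bump only requires the retriever's index to be refreshed, not a retraining run.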

Gorilla also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly.¹

To detect hallucinations, the researchers adopted an AST sub-tree matching technique to evaluate the functional correctness of generated code and to identify which API in the dataset the LLM is calling. First, they parse the generated code into an abstract syntax tree (AST), then find the sub-tree whose root node is the API call of interest and use it to index the dataset. If the generated call has a matching sub-tree in the dataset, we can be confident the generated code isn't a hallucination.
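A toy version of the idea, using hand-built trees instead of a real Python parser (the node shapes and API names are invented for illustration): a generated call counts as grounded only if it appears as a sub-tree of some tree in the reference dataset.

```javascript
// A hand-built "AST" for one known API call in the reference dataset.
const datasetTree = {
  type: "Module",
  children: [
    {
      type: "Call",
      value: "torchvision.models.resnet50",
      children: [{ type: "keyword", value: "pretrained=True", children: [] }],
    },
  ],
};

// Recursively check whether `target` appears as a sub-tree of `tree`.
// Structural equality via JSON serialization is enough for this toy case.
function containsSubtree(tree, target) {
  if (JSON.stringify(tree) === JSON.stringify(target)) return true;
  return (tree.children || []).some((child) => containsSubtree(child, target));
}

function isHallucination(generatedCall, referenceTree) {
  return !containsSubtree(referenceTree, generatedCall);
}

// A call that matches the dataset passes; a fabricated API is flagged.
const knownCall = {
  type: "Call",
  value: "torchvision.models.resnet50",
  children: [{ type: "keyword", value: "pretrained=True", children: [] }],
};
const fakeCall = { type: "Call", value: "torchvision.models.resnet999", children: [] };

console.log(isHallucination(knownCall, datasetTree)); // false
console.log(isHallucination(fakeCall, datasetTree));  // true
```

The real evaluation parses actual Python and must also handle optional arguments, but the core check is the same: anchor on the call node and look it up in the dataset.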


Gorilla supports 5k+ APIs today, including Kubernetes, the hyperscalers, Linux commands, and GitHub. For those of you inspired by Gorilla’s potential, here’s how to teach Gorilla your API. Today, there are two methods to use Gorilla: a chat completion API and a command line interface.
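For the chat completion route, here's a sketch of what the request body might look like, assuming an OpenAI-style schema (the model name below is a placeholder, not necessarily Gorilla's actual identifier; consult the project's repository for the real endpoint and model names):

```javascript
// Build an OpenAI-style chat completion request body for a Gorilla-hosted
// model. The model identifier here is a placeholder/assumption.
function buildGorillaRequest(task) {
  return {
    model: "gorilla-7b-hf-v1", // hypothetical model identifier
    messages: [{ role: "user", content: task }],
  };
}

const req = buildGorillaRequest("I want to translate English to French");
console.log(JSON.stringify(req, null, 2));
// This JSON body would then be POSTed to the service's chat completions
// route with fetch() or any HTTP client.
```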

For more posts like these, subscribe here and check out our other interview on our YouTube page.

[1] Gorilla: Large Language Model Connected with Massive APIs