Junior Developer GPT 1 – What happens if a team of LLMs programs an application?

With this article, I am starting a small blog series called JuniorDev-GPT, where I explore ideas for turning LLMs into autonomous junior developers.

Articles: Part 1 | Part 2 | Part 3

A Python calculator with a GUI that was programmed fully autonomously

This Python calculator was programmed by a team of LLMs. It can do basic calculations and also includes a nice store-and-recall function that lets the user keep a history of past calculations. However, the layout of the interface and the interaction design are a little dubious, a problem I dive deeper into in Part 2 of this series.

Introduction

Recently, I had many interesting conversations about how LLM-based assistants like GitHub Copilot and OpenAI’s ChatGPT impact developers. The bottom line of these conversations was that the assistants are impressive, but they are just tools and heavily rely on skilled developers.

Yet I could not stop wondering what would happen if we removed human developers from the equation and let LLMs program autonomously. I therefore combined approaches from Park et al. 2023, who used LLMs to simulate characters that interact with each other in a virtual world, and Mozannar et al. 2022, who established a taxonomy of common programmer activities.

Similar work

The concept of using large language models to build agents is fascinating. During my research, I came across various proofs of concept that showcase the capabilities of LLMs as agents.

ChemCrow is a domain-specific example where an LLM is enhanced with 13 expert-designed tools for tasks in organic synthesis, drug discovery, and materials design (Bran et al. 2023). Similarly, Boiko et al. used LLM agents to carry out complex experiments by allowing the LLMs to interact with physical lab equipment to synthesize drug samples. As already mentioned, Park et al. 2023 used LLMs to control 25 virtual characters. The characters “lived” in a sandbox environment reminiscent of The Sims. These generative agents created realistic simulations of human behavior and even cooperated to plan a party. You can view a replay of the simulation here!

Emerging tools for experimenting with LLM agents include BabyAGI, AutoGPT, and GPT-Engineer. Among them, GPT-Engineer is closest to my own experiment. GPT-Engineer is a command-line tool that generates applications autonomously from prompt files that you place in the skeleton of your application. What I find most fascinating about this tool is that it starts with a conversation with the user, where it asks questions about the application and uses its reasoning skills to check whether all the information needed to program the app is present. Two downsides I found: generation is one-shot, so there is little opportunity to steer the development process, and the produced code is not tested, so it sometimes does not start at all.

Introducing: JuniorDev-GPT

To contribute to the existing work and better understand the topic of LLM agents, I created a small PoC that I called JuniorDev-GPT. The idea is that the tool works like an autonomous junior developer: it splits the developer activities identified by Mozannar et al. 2022 across multiple LLMs that interact similarly to the agents of Park et al. 2023.

A user can provide a starting prompt for an application, and the team of LLMs in JuniorDev-GPT will try to autonomously program, test, and debug a fitting program over multiple iterations.
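To make this concrete, here is a minimal sketch of what such an iteration loop can look like. All function names and prompts below are my own illustration of the structure, not the actual JuniorDev-GPT code; the ask_llm stand-in represents one single chat call and is filled in further down.

```python
# Illustrative develop-test-debug loop; names and prompts are not the real JuniorDev-GPT code.

def ask_llm(system_prompt: str, user_message: str) -> str:
    """Stand-in for a single chat-completion call (one possible implementation is shown further below)."""
    raise NotImplementedError("connect this to an LLM provider")

def develop(user_request: str, max_iterations: int = 10) -> str:
    code = ""
    for _ in range(max_iterations):
        # A product-manager role picks the next small feature to add.
        feature = ask_llm(
            "You are a product manager.",
            f"Request: {user_request}\nCurrent code:\n{code}\nName the next small feature to add.",
        )
        # A programmer role implements only that one feature.
        code = ask_llm(
            "You are a programmer.",
            f"Existing code:\n{code}\nAdd exactly this feature and return the full file:\n{feature}",
        )
        # A tester role runs the application and reports problems.
        report = ask_llm("You are an application tester.", f"Test this code and report errors:\n{code}")
        if "error" in report.lower():
            # A debugger role repairs the reported problem before the next iteration.
            code = ask_llm(
                "You are a debugger.",
                f"Code:\n{code}\nError report:\n{report}\nReturn the fixed code.",
            )
    return code
```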

Key Learnings

Using LLM agents with ReAct reasoning patterns was difficult because the agents selected tools, filled in tool inputs, and interpreted observations unpredictably. I therefore switched to a fixed communication structure based on plain conversational calls to the models gpt-4, gpt-3.5-turbo-16k, and text-davinci-003.
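As an illustration of such a fixed, single-turn conversation, the sketch below uses the legacy OpenAI Python SDK (pre-1.0, matching the chat models named above). The function and its defaults are my own example, not the exact JuniorDev-GPT code.

```python
import openai  # legacy SDK (openai<1.0); assumes OPENAI_API_KEY is set in the environment

def ask_llm(system_prompt: str, user_message: str, model: str = "gpt-3.5-turbo-16k") -> str:
    """One fixed, single-turn conversation: a role-specific system prompt plus one user message."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,  # low temperature keeps the code output more predictable
    )
    return response["choices"][0]["message"]["content"]
```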

The scope of a single LLM conversation is limited, but imitating human communication structures works well for achieving bigger goals. I prompted the LLMs to work in five different roles: product manager, programmer, quality inspector, application tester, and application debugger.
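The system prompts below show how the five roles can be phrased; the exact wording in JuniorDev-GPT differs, this only illustrates the structure.

```python
# Illustrative role prompts; the real prompts in JuniorDev-GPT are worded differently.
ROLE_PROMPTS = {
    "product_manager": (
        "You are a product manager. Given the user's request and the current state of the "
        "application, decide which single small feature should be built next."
    ),
    "programmer": (
        "You are a programmer. You receive the existing code and exactly one feature to add "
        "or change. Return the complete, runnable Python file."
    ),
    "quality_inspector": (
        "You are a quality inspector. Compare the code against the requested features and "
        "list which features are completed and which are still open."
    ),
    "tester": (
        "You are an application tester. Execute the code, then report any errors, tracebacks, "
        "or obviously broken behavior."
    ),
    "debugger": (
        "You are a debugger. You receive code and an error report. Return a fixed version of "
        "the code that resolves the reported problem."
    ),
}
```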

Iterations with increased complexity worked better than complex requests upfront. For example, the programming LLM has access to the previous version of the code and additionally gets an overview of open and completed features from the quality inspector. I found that the LLM responsible for programming is much more likely to produce working and complete code if it only has to adapt or add one small feature instead of many at a time.
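A hypothetical prompt builder for the programmer role shows how each iteration can stay small: the previous code and the quality inspector's feature overview provide context, but only one feature is requested. JuniorDev-GPT assembles its prompts differently; this is just an illustration.

```python
# Hypothetical helper, not the actual JuniorDev-GPT prompt assembly.
def build_programmer_prompt(previous_code: str, feature_overview: str, next_feature: str) -> str:
    """Combine the last code version, the open/completed feature list, and exactly one new feature."""
    return (
        "Here is the current version of the application:\n"
        f"{previous_code}\n\n"
        "Feature overview from the quality inspector (completed / open):\n"
        f"{feature_overview}\n\n"
        "Implement ONLY the following feature and return the full updated file:\n"
        f"{next_feature}\n"
    )
```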

Limitations

Reasoning barrier: The development works smoothly up to 3000 tokens or roughly 250 lines of code. Beyond that, the LLMs often make at least one mistake per answer, leading to an endless debugging loop.

Context barrier: GPT-3.5’s 16k-token context window hampers communication; for example, a single debugging request that contains the full code plus the complete error report can already exceed it.
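A rough pre-flight check of the kind sketched below could catch such oversized requests before they are sent. It uses the tiktoken tokenizer and ignores the small per-message overhead, so treat the numbers as an approximation; the helper is my own illustration.

```python
import tiktoken  # OpenAI's tokenizer library

def fits_in_context(messages: list[str], limit: int = 16_000) -> bool:
    """Approximate check that all message texts together stay under the model's context window."""
    encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by the gpt-3.5/gpt-4 family
    total_tokens = sum(len(encoding.encode(message)) for message in messages)
    return total_tokens < limit
```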

Improvement Ideas

To achieve better overall performance, each of the LLMs could be developed further. For example, the LLM responsible for programming could split abstract programming tasks into small sub-conversations, generating high-level structures first and adding details later. The product-manager LLM could also get access to visual data of the UI to better judge the quality of the application and steer future development in a useful direction (I did something similar by introducing a human into the loop here).

Furthermore, the communication between the LLMs could be filtered to avoid excessive token usage. The LLM that acts as a debugger, for instance, could receive only the relevant parts of the code where errors occurred (see the sketch below).
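A simple version of such a filter could pull out only the lines around the line numbers mentioned in a Python traceback; the helper below is an illustration, not part of JuniorDev-GPT.

```python
import re

def relevant_snippet(code: str, error_report: str, context: int = 5) -> str:
    """Keep only the code lines near line numbers mentioned in the traceback, instead of the whole file."""
    lines = code.splitlines()
    error_lines = {int(n) for n in re.findall(r"line (\d+)", error_report)}
    keep: set[int] = set()
    for n in error_lines:
        keep.update(range(max(1, n - context), min(len(lines), n + context) + 1))
    return "\n".join(f"{i}: {lines[i - 1]}" for i in sorted(keep))
```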

One could also introduce a new form of code representation so the agents work with an abstraction of the code instead of the raw source as context. I have read about using vector embeddings, for example, to summarize long PDFs.

Finally, working with long-term storage would be interesting to extend the memory of the LLMs beyond the quite limited context window. Long-term memory could store all the code, while the LLM loads only what is relevant to the current task into the context window. I imagine this to be similar to how a computer shifts memory between RAM and a hard drive.
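The sketch below illustrates that idea with a deliberately simple stand-in: instead of the vector embeddings mentioned above, it scores stored code chunks by keyword overlap with the current task, so the example runs without any model. Everything here is my own illustration, not part of JuniorDev-GPT.

```python
# "RAM vs. hard drive" memory sketch: all chunks stay in long-term storage,
# and only the most relevant ones are loaded into the context window.

def overlap_score(task: str, chunk: str) -> int:
    """Count how many words of the task description appear in a code chunk (embedding stand-in)."""
    return sum(1 for word in set(task.lower().split()) if word in chunk.lower())

def load_relevant_chunks(task: str, long_term_store: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k stored chunks to place into the context window for this task."""
    return sorted(long_term_store, key=lambda chunk: overlap_score(task, chunk), reverse=True)[:top_k]

# Example: only the calculator's history-related functions are loaded for a history task.
store = [
    "def add(a, b): return a + b",
    "def store_result(history, value): history.append(value)",
    "def recall_history(history): return list(history)",
]
print(load_relevant_chunks("extend the history recall feature", store, top_k=2))
```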