We explore a new paradigm for programming computers directly in English, without any formal coding language at all. We do this by prompting a Large Language Model (LLM) to act as the program. Note that this approach involves no training or fine-tuning of model weights.
Several LLMs, notably GPT-4 and GitHub Copilot, have proven quite good at writing formal code in various programming languages, and many non-programmers have published videos and articles showing how they implemented an application by asking ChatGPT to write the code and give them instructions. We are talking about something different. We are not asking the LLM to implement the application by writing code in a formal programming language; instead, we prompt the LLM to become the application. The “program” is written entirely in English. You could draw an analogy to the difference between a compiler and an interpreter: compilation converts an English prompt into a formal programming language, whereas programming directly in English “runs” the prompt as-is.
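To make the analogy concrete, here is a minimal sketch of the two modes. We use OpenAI's mid-2023 Python client (openai v0.x) for illustration only; the `llm()` helper and the abbreviated prompts are our own scaffolding, not part of any prescribed API.

```python
import openai  # mid-2023 client (v0.x); assumes openai.api_key is set

def llm(prompt: str) -> str:
    """Send one prompt to a chat model and return its reply."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# "Compilation": the LLM translates an English spec into formal code,
# which a conventional interpreter then runs. The LLM's job ends here.
python_source = llm("Write a Python program that tracks calories: ...")
exec(python_source)  # illustration only; don't exec untrusted code

# "Interpretation" (programming in English): the English spec IS the
# program, and the LLM executes it directly.
print(llm("You are a calorie and nutrition tracker app. "
          "Begin by asking: 'What did you eat today?' ..."))
```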
We have conducted experiments using OpenAI’s GPT-4 and GPT-3.5 (via ChatGPT), Google’s Bard, and Anthropic’s Claude to explore the extent to which this paradigm is feasible today. In this post, we demonstrate the idea with one application example. In a sequel post (part 2), which we will link at the bottom of this article next week, we’ll explore other findings, present results from systematic experiments, and highlight lessons from our experimentation.
Calorie and nutrition tracker
We illustrate the paradigm in this post by implementing a fully functional Calorie and Nutrition Tracker application. The example not only provides a proof of concept that the approach is already viable (at least in some cases), but also shows that it can yield a gain in capability that would be difficult to match when implementing the same application in a formal programming language.
A sample run
We start by showing a trace of a run of the completed program. To keep this article short and focused on the essential elements, we omit parts of the output but provide a link to the full transcript of this run. To differentiate the program’s output from the user’s input, the user’s input is shown in bold.
What did you eat today?
**Four packs of instant oatmeal, plus a cup of black tea with coffee creamer.**
(omitted here: detailed breakdown of the estimates and assumptions)
Here’s the itemized table for your intake so far:
| Food Item | Calories | Fat (g) | Saturated Fat (g) | Unsaturated Fat (g) | Sodium (mg) | Calcium (mg) | Protein (g) |
|---|---|---|---|---|---|---|---|
| 4x Instant Oatmeal | 600 | 8 | 1.2 | 2 | 840 | 400 | 16 |
| 1x Black Tea | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1x Coffee Creamer | 20 | 1 | 0.5 | 0 | 5 | 10 | 0 |
| Total | 620 | 9 | 1.7 | 2 | 845 | 410 | 16 |
What else did you eat today?
**Madras Lentils (a package from Tasty Bite) atop 7.4 oz. of cooked white sticky rice.**
(omitted here: detailed assessment and updated table)
What else did you eat today?
**A very large salad (filling a 12 inch diameter bowl), consisting of mixed lettuce varieties, carrots, celery, broccoli sprouts, black beans, and avocado. I topped it with a generous serving of Thousand Island dressing, pepper and smoked paprika.**
(omitted here: detailed assessment and updated table)
What else did you eat today?
**One CLIF Builders protein bar.**
(omitted here: detailed assessment of bar’s nutritional content)
Here’s your updated intake for the day:
| Food Item | Calories | Fat (g) | Saturated Fat (g) | Unsaturated Fat (g) | Sodium (mg) | Calcium (mg) | Protein (g) |
|---|---|---|---|---|---|---|---|
| 4x Instant Oatmeal | 600 | 8 | 1.2 | 2 | 840 | 400 | 16 |
| 1x Black Tea | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1x Coffee Creamer | 20 | 1 | 0.5 | 0 | 5 | 10 | 0 |
| 1x Tasty Bite Lentils | 300 | 13 | 1.5 | 11.5 | 720 | 80 | 11 |
| 7.4 oz White Rice | 200 | 0.4 | 0.1 | 0.3 | 2 | 3 | 4 |
| Large Mixed Salad | 600 | 36.7 | 5.7 | 22.5 | 493 | 185 | 17.6 |
| 1x CLIF Builders Bar | 270 | 10 | 6 | 4 | 370 | 260 | 20 |
| Total | 1990 | 69.1 | 14.5 | 40.3 | 2430 | 938 | 68.6 |
What else did you eat today?
**How do my daily totals compare to the recommended daily allowances for me, a 57 year old, 180 lb. muscular male?**
Recommended daily allowances can vary based on activity level and personal health goals, but general guidelines from the U.S. Food and Drug Administration for a 2,000-calorie diet suggest the following:
- Calories: 2,000
- Total fat: Less than 78g
- Saturated fat: Less than 20g
- Sodium: Less than 2,300mg
- Calcium: At least 1,000mg (this can increase to 1,200mg for men over 70)
- Protein: 56g (this can be higher for individuals with higher muscle mass or more active lifestyles)
Here’s how your current intake compares:
| Nutrient | Intake | Recommended |
|---|---|---|
| Calories | 1,990 | 2,000 |
| Total Fat (g) | 69.1 | <78 |
| Saturated Fat (g) | 14.5 | <20 |
| Sodium (mg) | 2,430 | <2,300 |
| Calcium (mg) | 938 | >1,000 |
| Protein (g) | 68.6 | 56+ |
Your calorie and fat intakes are within the recommended ranges. However, your sodium intake is slightly higher than the recommended limit, which could be an area for improvement. Calcium intake could be slightly higher to meet the recommended level. Your protein intake is fine, but considering your muscular build, you may benefit from even higher protein intake, depending on your activity level.
Commentary
The Calorie and Nutrition Tracker has several notable features. You can specify information about what you eat at any level of detail. You can give an exact serving size (e.g., “7.4 oz. of cooked white sticky rice”), but you don’t have to: it automatically fills in estimates for anything you don’t specify, such as a typical serving size. After seeing its assumptions, you have the option of adding more detail. At any point you can ask it to add another nutrient of interest to the table. And you can even ask questions beyond the scope of tracking calories, as seen in the final question of the run (“How do my daily totals compare…?”). It does all this without anyone having to populate a database of specific foods and their nutritional content. There is a flexibility here that would be unusual in a typical conventionally programmed version.
The implementation
The Calorie and Nutrition Tracker app is implemented entirely in English. In fact, here is the program:
You are a calorie and nutrition tracker app. You will ask “What did you eat today?”. The user will enter a description in whatever form they find convenient. If they omit relevant information such as serving size, you will do your best to estimate a typical serving size as appropriate while minimizing additional questions. (An occasional clarification question is okay, but you should try very hard to minimize these). You will do your best to estimate the Calories and nutritional content for each entry. After each input, you will print a table itemizing the items consumed so far today, with their Calories, fats (saturated, unsaturated), sodium, calcium and protein, with totals at the bottom. Then you will repeat, asking “What else did you eat today?”.
We also find it interesting that this implementation worked on our first attempt. Although the programming in English approach may still require “debugging” iterations, no such iterations were required in this case.
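For readers who want to run the English program outside a chat UI, here is a minimal sketch of executing it against an LLM API. It is a sketch under assumptions: we use OpenAI's mid-2023 Python client (openai v0.x), and the model name, the loop structure, and the `> ` input prompt are our choices, not part of the app itself.

```python
import openai  # mid-2023 client (v0.x); assumes openai.api_key is set

# The entire "implementation" is the English program from above,
# passed to the model as its system message.
PROGRAM = """You are a calorie and nutrition tracker app. You will ask
"What did you eat today?" ... (the full English program shown above)"""

messages = [{"role": "system", "content": PROGRAM}]

while True:
    # The model produces the app's next output: the question, the
    # estimates and assumptions, and the running table.
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    print(reply)
    messages.append({"role": "assistant", "content": reply})

    # Each food description the user types becomes the next chat turn.
    messages.append({"role": "user", "content": input("> ")})
```

Note that the loop contains no application logic at all; it merely shuttles text between the user and the model. Everything that makes this a calorie tracker lives in the system message.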
Comparing the big foundation models
The English-language program shown in the previous section runs successfully on OpenAI’s GPT-4 and GPT-3.5 (via ChatGPT) and on Anthropic’s Claude. It failed to run correctly on Google’s Bard.
These comparisons were conducted on 15-June-2023, using the same sequence of inputs. Each of these models is updated frequently, so these findings could change at any time.
Of the three success cases, GPT-4 provided the best user experience. Its detailed assessments were more concise and to the point than those of the other two, and it committed to a single number, whereas GPT-3.5 expressed every assessment as a range (e.g., Total calories: 2645-3130). (See the full GPT-3.5 transcript.) Claude’s chat UI lacked the ability (at the time of testing) to present the running totals as a table, but it did a good job presenting the information that belongs in one.
Bard failed to run this program correctly. It exhibited a failure mode in which it simulates the user’s input rather than waiting for the user to type something: it invents a sequence of meals that somebody might eat, and then provides calorie and nutritional assessments for those imaginary meals. Its assessments and tables would have been reasonable if a user had actually entered those meals. Across other application areas, we have found this to be the most common failure mode among the smaller LLMs (i.e., GPT-3.5, Bard, and Claude) when programming in English. We explore the failure rates and failure modes systematically in our next post.
Related work
The programming in English paradigm is not an especially novel concept; it overlaps many ideas involving natural-language interaction with computers. We have simply introduced the term “programming in English” to emphasize the task of controlling a computer that operates on inputs provided later, and to emphasize the relationship to conventional computer programming.
Low-code/no-code programming paradigms strive to empower non-programmers to create functioning applications. The Analytica visual modeling environment falls into this camp: analysts in fields other than computer science can “program” decision models, and user interfaces to those models, by drawing influence diagrams, filling in basic relationships, creating input and output nodes from influence-diagram variables with a button click, and dragging those nodes to lay out the UI.
A core motivation of natural language understanding in AI has always been to follow instructions given in English. Although running a program written in English is essentially a subset of the more general task of following instructions in English, there is a subtle difference in connotation: a program assimilates inputs while running that aren’t included in the instructions themselves. It is that ability to assimilate inputs at run time to compute outputs that distinguishes the “programming in English” paradigm from instruction following.
Conversational AI and question-answering systems are also similar in spirit. Again, the distinction is the same subtle difference in connotation as for instruction-following systems.
In-context learning instructs LLMs how to perform a task by specifying a context and providing task examples. The emphasis is usually on the examples (e.g., few-shot learning), but the interpretation of “in-context learning” varies between authors, and hence it lacks a precise definition. The programming in English paradigm largely overlaps with various notions of in-context learning, but with emphasis on different aspects.
Summary
LLM foundation models like GPT-4 open the door to a new computer programming paradigm in which the “program” is specified in English (or Spanish, Japanese, Tagalog, or any other natural language), without any use of a formal programming language. The English-language “program” is executed directly by the LLM.
Writing programs in English could have several advantages:
- Programs are simpler and faster to write.
- Less formal programming expertise required.
- Some capabilities are possible that would be infeasible (or less feasible) when implemented in a formal programming language.
- The implementation is likely to be shorter and easier to interpret.
There are also disadvantages:
- Variation (randomness) in the output may complicate QA and the ability to rely on results (a partial mitigation via sampling settings is sketched after this list).
- Substantially more compute (and hence energy consumption, carbon emissions) is required to run an English-language program compared to a conventional program.
- Limits on which tasks can be successfully implemented in this way (at least today).
- No systematic debugging techniques (yet).
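On the first disadvantage: when the program is run through an API rather than a chat UI, output variation can be reduced (though, in our understanding, not fully eliminated for these models) by pinning the sampling temperature. A minimal sketch, again assuming the mid-2023 openai client; the abbreviated prompts are placeholders:

```python
import openai  # mid-2023 client (v0.x); assumes openai.api_key is set

# temperature=0 requests greedy decoding, the most deterministic setting
# the API offers; repeated runs still aren't guaranteed to be identical.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a calorie and nutrition tracker app. ..."},
        {"role": "user", "content": "Four packs of instant oatmeal."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```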
Here we’ve illustrated one successful example that we were able to program in English, the Calorie and Nutrition Tracker app. This serves as a proof of concept that the approach works in at least one case. In the sequel post (part 2), we look at success rates for English-language programs across a collection of applications and across the four major foundation models.