Does GPT-4 pass the Turing test?

The imitation game

In 1950, Alan Turing proposed “The Imitation Game”, today known as the Turing test, as a hypothetical way of measuring whether a computer can think [1]. It stakes out the basic philosophical position that “intelligence” should be judged as a behavioral attribute: if everything we can observe about a system’s behavior is consistent with it being intelligent, then we should conclude that it is intelligent. This is in contrast with the alternative philosophical position that a system requires the right type of internal mechanisms to be considered intelligent, as argued by Searle [2]. To pin down this idea precisely, “The Imitation Game” outlines a specific experimental setup in which a human interrogator converses over a terminal with two witnesses, one human and one artificial, and has to identify which is which. If the interrogator is unable to do better than random guessing, then the computer passes the Turing test.

Today many people believe that GPT-4 would pass or already has passed the Turing test [3][4]. This week two researchers from UCSD, Cameron Jones and Benjamin Bergen, posted a paper [5] to arXiv reporting the results of an online Turing test with 652 human participants, run in a manner highly consistent with Turing’s “Imitation Game”. Each interrogator gets 5 minutes of conversation before making a determination [6]. They ran seven different GPT-4 “personas” (each with different prompting instructions) as well as two GPT-3.5 personas and Joseph Weizenbaum’s ELIZA [7].

The highest-performing model (a GPT-4 persona) fooled human interrogators into judging it human 41% of the time. Human witnesses convinced the interrogator that they were human 63% of the time. Hence, the researchers conclude that GPT-4 does not (yet) pass the Turing test. Another surprising result was that ELIZA, devised in 1966, came in 5th out of the 11 artificial personas, successfully deceiving the interrogator 27% of the time.
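To make the pass/fail logic concrete, here is a rough sketch of the kind of check involved. The counts below are hypothetical, invented only to match the reported 41% rate; the paper’s actual analysis and per-condition sample sizes differ.

```python
# Hypothetical back-of-the-envelope check of Turing's criterion: a passing witness
# should be judged human at a rate indistinguishable from the 50% an interrogator
# would achieve by guessing at random.
from scipy.stats import binomtest

n_games = 100        # assumed number of games against the best GPT-4 persona (made up)
k_judged_human = 41  # 41% of those games ended with a "human" verdict

result = binomtest(k_judged_human, n_games, p=0.5, alternative="less")
print(f"p-value that the success rate is below chance: {result.pvalue:.3f}")
```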

Today most people don’t consider the Turing test to be a literal criterion for intelligence. For example, failing a Turing test because the witness is judged to be too intelligent to be a mere human is no reason to conclude that the witness is not intelligent. But with the Turing test now part of our shared vernacular, it is interesting to see the results of an earnest attempt to carry it out.

References

[1] Alan M. Turing (1950) “Computing Machinery and Intelligence”, Mind, 59(236):433-460.

[2] John R. Searle (1980), “Minds, brains, and programs,” Behavioral and Brain Sciences, 3(3):417–424.

[3] Celeste Biever (2023), “ChatGPT broke the Turing test – the race is on for new ways to assess AI,” Nature.

[4] Alyssa James (2023), “ChatGPT has passed the Turing test and if you’re freaked out, you’re not alone,” TechRadar. 

[5] Cameron Jones and Benjamin Bergen (2023), “Does GPT-4 pass the Turing test?” arXiv:2310.20216v1.

[6] Try it yourself at https://turingtest.live.

[7] Joseph Weizenbaum (1966), “ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine,” Communications of the ACM, 9(1):36–45.

Computer programming in English (Part 1)

We explore a new paradigm for programming computers directly in English, without any formal coding language at all. We do this by prompting a Large Language Model (LLM) to act as the program. Please note that this approach does not involve any training or fine-tuning of model weights.

Several LLMs, especially GPT-4 and GitHub Copilot, have proven to be quite good at writing formal code in various programming languages, and many non-programmers have published videos and articles on the web showing how they implemented an application by asking ChatGPT to write the code and give them instructions. We are talking about something different. We are not asking the LLM to implement the application by writing code in a formal programming language; instead, we prompt the LLM to become the application. The “program” is written entirely in English. You could draw an analogy to the difference between a compiler and an interpreter: the first approach “compiles” an English prompt into a formal programming language, whereas programming directly in English “runs” the prompt as-is.
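To make that analogy concrete, here is a minimal sketch of the two routes, written with OpenAI’s Python SDK. The spec text, model name, and variable names are ours, purely for illustration.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are configured

client = OpenAI()
english_spec = "You are a calorie and nutrition tracker app. ..."  # the English "program"

# Route 1, the "compiler" analogy: translate the English description into formal code,
# which would then be reviewed and run separately.
compiled = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Write a Python program that implements this:\n" + english_spec}],
)
generated_source = compiled.choices[0].message.content

# Route 2, the "interpreter" analogy: the English text itself is the program, and the
# LLM executes it directly on the user's input.
interpreted = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": english_spec},
              {"role": "user", "content": "Four packs of instant oatmeal."}],
)
print(interpreted.choices[0].message.content)
```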

We have conducted experiments using OpenAI’s GPT-4 and GPT-3.5 (via ChatGPT), Google’s Bard, and Anthropic’s Claude to explore the extent to which this paradigm is feasible today. In this post, we demonstrate the idea with one application example. In a sequel post (part 2), which we will link at the bottom of this article when it appears next week, we’ll explore other findings, present results from systematic experiments, and highlight lessons from our experimentation.

Calorie and nutrition tracker

Diagram depicting nutritional information.
Image credit: MidJourney

We illustrate the paradigm in this post by implementing a fully functional Calorie and Nutrition Tracker application. The example not only provides a proof of concept that the approach is already viable (at least in some cases), but also shows that it can yield capabilities that would be difficult to match if you implemented the same application in a formal programming language.

A sample run

We start by showing a trace of a run of the completed program. To keep this article short and focused on the essential elements, we omit parts of the output but provide a link to the full transcript of this run.  To differentiate between the program’s output and the user’s input, the user’s input is in purple.

What did you eat today?

Four packs of instant oatmeal, plus a cup of black tea with coffee creamer.

(omitted here: detailed breakdown of the estimates and assumptions)

Here’s the itemized table for your intake so far:

Food Item          | Calories | Fat (g) | Saturated Fat (g) | Unsaturated Fat (g) | Sodium (mg) | Calcium (mg) | Protein (g)
4x Instant Oatmeal | 600      | 8       | 1.2               | 2                   | 840         | 400          | 16
1x Black Tea       | 0        | 0       | 0                 | 0                   | 0           | 0            | 0
1x Coffee Creamer  | 20       | 1       | 0.5               | 0                   | 5           | 10           | 0
Total              | 620      | 9       | 1.7               | 2                   | 845         | 410          | 16

What else did you eat today?

Madras Lentils (a package from Tasty Bite) atop 7.4 oz. of cooked white sticky rice.

(omitted here: detailed assessment and updated table)

What else did you eat today?

A very large salad (filling a 12 inch diameter bowl), consisting of mixed lettuce varieties, carrots, celery, broccoli sprouts, black beans, and avocado. I topped it with a generous serving of Thousand Island dressing, pepper and smoked paprika.

(omitted here: detailed assessment and updated table)

What else did you eat today?

One CLIF Builders protein bar. 

(omitted here: detailed assessment of bar’s nutritional content)

Here’s your updated intake for the day:

Food Item             | Calories | Fat (g) | Saturated Fat (g) | Unsaturated Fat (g) | Sodium (mg) | Calcium (mg) | Protein (g)
4x Instant Oatmeal    | 600      | 8       | 1.2               | 2                   | 840         | 400          | 16
1x Black Tea          | 0        | 0       | 0                 | 0                   | 0           | 0            | 0
1x Coffee Creamer     | 20       | 1       | 0.5               | 0                   | 5           | 10           | 0
1x Tasty Bite Lentils | 300      | 13      | 1.5               | 11.5                | 720         | 80           | 11
7.4 oz White Rice     | 200      | 0.4     | 0.1               | 0.3                 | 2           | 3            | 4
Large Mixed Salad     | 600      | 36.7    | 5.7               | 22.5                | 493         | 185          | 17.6
1x CLIF Builders Bar  | 270      | 10      | 6                 | 4                   | 370         | 260          | 20
Total                 | 1990     | 69.1    | 14.5              | 40.3                | 2430        | 938          | 68.6

What else did you eat today?

How do my daily totals compare to the recommended daily allowances for me, a 57 year old, 180 lb. muscular male?

Recommended daily allowances can vary based on activity level and personal health goals, but general guidelines from the U.S. Food and Drug Administration for a 2,000-calorie diet suggest the following:

  • Calories: 2,000
  • Total fat: Less than 78g
  • Saturated fat: Less than 20g
  • Sodium: Less than 2,300mg
  • Calcium: At least 1,000mg (this can increase to 1,200mg for men over 70)
  • Protein: 56g (this can be higher for individuals with higher muscle mass or more active lifestyles)

Here’s how your current intake compares:

Nutrient          | Intake | Recommended
Calories          | 1,990  | 2,000
Total Fat (g)     | 69.1   | <78
Saturated Fat (g) | 14.5   | <20
Sodium (mg)       | 2,430  | <2,300
Calcium (mg)      | 938    | >1,000
Protein (g)       | 68.6   | 56+

Your calorie and fat intakes are within the recommended ranges. However, your sodium intake is slightly higher than the recommended limit, which could be an area for improvement. Calcium intake could be slightly higher to meet the recommended level. Your protein intake is fine, but considering your muscular build, you may benefit from even higher protein intake, depending on your activity level.

Commentary

The Calorie and Nutrition Tracker has several really nice features. You can specify information about what you eat at any level of detail. You can specify an exact serving size (e.g., “7.4 oz. of cooked white sticky rice”), but you don’t have to. It’ll automatically fill in estimates for things you don’t specify, such as a (typical) serving size. After seeing its assumptions, you have the option of adding more detail. At any point you can ask it to add another nutrient of interest to the table. And you can even ask questions beyond the scope of tracking the calories, as seen in the final question from the run (“how do my totals compare…?”). It is also able to do this without having to populate a database with specific foods and their nutritional content. There is a flexibility to it that is unusual for a typical conventionally-programmed version.
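For contrast, here is a rough, purely illustrative sketch of what the core of a conventionally-programmed tracker needs before it can handle even one meal; the food entries and nutrient numbers below are made-up placeholders.

```python
# A conventional tracker needs every food enumerated up front, with fixed serving
# sizes and nutrient values (the numbers here are illustrative placeholders).
FOOD_DB = {
    "instant oatmeal (1 packet)": {"calories": 150, "fat_g": 2.0, "sodium_mg": 210, "protein_g": 4},
    "black tea (1 cup)":          {"calories": 0,   "fat_g": 0.0, "sodium_mg": 0,   "protein_g": 0},
    # ... thousands more entries would be required
}

def log_food(log: list, food: str, servings: float) -> None:
    """Append a logged item; anything not in the database simply fails."""
    nutrients = FOOD_DB[food]  # KeyError for free-form input like "a very large salad"
    log.append({name: amount * servings for name, amount in nutrients.items()})

# Mapping "Four packs of instant oatmeal, plus a cup of black tea" onto these keys,
# estimating unspecified serving sizes, and answering open-ended follow-up questions
# all require additional machinery here, but come for free from the LLM.
```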

The implementation

The Calorie and Nutrition Tracker app is implemented entirely in English. In fact, here is the program:

You are a calorie and nutrition tracker app. You will ask “What did you eat today?”. The user will enter a description in whatever form they find convenient. If they omit relevant information such as serving size, you will do your best to estimate a typical serving size as appropriate while minimizing additional questions. (An occasional clarification question is okay, but you should try very hard to minimize these). You will do your best to estimate the Calories and nutritional content for each entry. After each input, you will print a table itemizing the items consumed so far today, with their Calories, fats (saturated, unsaturated), sodium, calcium and protein, with totals at the bottom. Then you will repeat, asking “What else did you eat today?”.

We also find it interesting that this implementation worked on our first attempt. Although the programming in English approach may still require “debugging” iterations, no such iterations were required in this case. 
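If you want to run the program outside of a chat UI, here is one possible harness, sketched with OpenAI’s Python SDK; the model name and the loop structure are our choices and are not part of the English program itself.

```python
from openai import OpenAI  # assumes an API key is configured in the environment

client = OpenAI()

# The English "program" from above, passed verbatim as the system message
# (elided here for brevity).
PROGRAM = """You are a calorie and nutrition tracker app. You will ask "What did you
eat today?". ... (the full English program from the previous section) ..."""

messages = [{"role": "system", "content": PROGRAM}]

while True:
    # The LLM acts as the interpreter: it produces the app's next turn...
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    app_turn = reply.choices[0].message.content
    print(app_turn)
    messages.append({"role": "assistant", "content": app_turn})
    # ...and the user's answer becomes the next input to the running program.
    messages.append({"role": "user", "content": input("> ")})
```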

Comparing the big foundation models

The English-language program shown in the previous section runs successfully in OpenAI’s ChatGPT (with both GPT-4 and GPT-3.5) and in Anthropic’s Claude. It failed to run correctly on Google’s Bard.

These comparisons were conducted on 15-June-2023, using the same sequence of inputs. Each of these models is being updated frequently, so this finding for this example could change at a future date.

Of the three successful cases, GPT-4 provided the best user experience. Its detailed assessments were more concise and to the point than those of the other two, and it committed to a single number for each estimate, whereas GPT-3.5 expressed all assessments as ranges (e.g., Total calories: 2645-3130). (See the full GPT-3.5 transcript.) Claude’s chat UI lacked the ability (at the time of testing) to present the running totals as a table, but it did a good job presenting the information that belongs in the table.

Bard failed to run this program correctly. It exhibited a failure mode in which it simulates the user’s input rather than waiting for the user to type something: it invents a sequence of meals that somebody might eat, and then provides calorie and nutritional assessments for those imaginary meals. Its assessments and tables would have been reasonable had a user actually entered those meals. In our experiments across other application areas, we have found this to be the most common failure mode among the smaller LLMs (i.e., GPT-3.5, Bard, and Claude) when programming in English. We explore the failure rates and failure modes systematically in our next post.

Related work

The programming-in-English paradigm is not a particularly novel concept; it overlaps with many existing ideas involving natural language interaction with computers. We have simply introduced the term “programming in English” to emphasize the problem of controlling a computer that operates on inputs provided later, at run time, and to emphasize the relationship to conventional computer programming.

Low-code/no-code programming paradigms strive to empower non-programmers to create functioning applications. The Analytica visual modeling environment falls into this camp: analysts in fields other than computer science can “program” decision models, and user interfaces to those models, by drawing influence diagrams, filling in basic relationships, creating input and output nodes from influence-diagram variables with a button click, and dragging nodes around to lay out the UI.

A core motivation of natural language understanding in AI has always been to follow instructions given in English. Although running a program written in English is basically a subset of the more general task of following instructions in English, there is a subtle difference in connotation: a program assimilates inputs while running that aren’t included in the instructions themselves. It is that ability to assimilate inputs while running, and to compute outputs from them, that distinguishes the “programming in English” paradigm from instruction following.

Conversational AI and Question answering systems also seem to be pretty similar in spirit. Again, the distinction is the same subtle difference in connotation as for instruction following systems.

In-context learning instructs an LLM how to perform a task by specifying a context and providing task examples. The emphasis is usually on providing examples (e.g., few-shot learning), but the interpretation of “in-context learning” varies somewhat between authors, so it lacks a precise definition. The programming-in-English paradigm largely overlaps with the various notions of in-context learning, but the emphasis falls on different aspects.
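As a purely illustrative contrast (the prompt text and calorie numbers below are made up), a few-shot in-context prompt drives behavior primarily through examples, while an English-language program specifies behavior as instructions and then consumes fresh inputs turn by turn while it runs.

```python
# Few-shot in-context learning: the examples do most of the work.
FEW_SHOT_PROMPT = """Estimate the calories in each meal.

Meal: two scrambled eggs and toast
Calories: 290

Meal: a bowl of chicken ramen
Calories: 550

Meal: {meal}
Calories:"""

# Programming in English: no examples; behavior is specified as instructions,
# and the inputs arrive later, while the "program" is running.
PROGRAM = ("You are a calorie and nutrition tracker app. You will ask "
           '"What did you eat today?" and, after each answer, print an updated '
           "itemized table with totals, then ask again.")
```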

Summary

LLM foundation models like GPT-4 open the door to a new computer programming paradigm in which the “program” is specified in English (or Spanish, Japanese, Tagalog, or any other natural language), without any use of a formal programming language. The English-language “program” is executed directly by the LLM.

Writing programs in English could have several advantages:

  • Programs are simpler and faster to write.
  • Less formal programming expertise required.
  • Some capabilities are possible that would be infeasible (or less feasible) when implemented in a formal programming language.
  • The implementation is likely to be shorter and easier to interpret.

There are also disadvantages:

  • Variation (randomness) in the output may complicate QA and the ability to rely on results (see the sketch after this list).
  • Substantially more compute (and hence energy consumption, carbon emissions) is required to run an English-language program compared to a conventional program.
  • Limits on which tasks can be successfully implemented in this way (at least today).
  • No systematic debugging techniques (yet).
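On the first point, one partial mitigation when running through an API rather than a chat UI is to request decoding that is as deterministic as possible and to regression-test the transcript. A minimal sketch, assuming OpenAI’s Python SDK:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK and an API key are configured

client = OpenAI()

# Temperature 0 reduces (but does not eliminate) run-to-run variation, which makes
# transcript-level regression tests somewhat more meaningful.
reply = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[{"role": "system", "content": "You are a calorie and nutrition tracker app. ..."},
              {"role": "user", "content": "Four packs of instant oatmeal."}],
)
print(reply.choices[0].message.content)
```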

Here we’ve illustrated one successful example that we were able to program in English: the Calorie and Nutrition Tracker app. This serves merely as a proof of concept that the approach is possible in at least one case. In the sequel post (part 2), we look at success rates when running English-language programs across a collection of applications and across the four major foundation models.