
Cognitive Interference


GPT-4 does amazing things, but it sometimes screws up on trivial things. Some researchers (e.g., Dziri et al. 2023) have proposed that this reflects limits in LLMs' abilities to handle compositional tasks.

Lonnie Chrisman's experiments suggest that these weaknesses are not due to the compositionality of tasks; he believes Dziri et al. (2023) have misdiagnosed the problem. GPT-4 handles compositionality well, including in arbitrary-digit multiplication: it generalizes to different numbers of digits (even without any fine-tuning) and masters the algorithmic aspects. Instead, it is extremely unreliable at some genuinely trivial skills, such as simple counting (even when N is very small).

Preliminary experiments indicate that it performs these same skills well in isolation but poorly inside larger problems. The trivial skills in question must be used multiple times within the full problem, which leads to a conjecture: LLMs can suffer from cognitive interference (aka dual-task interference). Humans tend to make mistakes when talking and writing at the same time, listening to a podcast while reading, or naming the colors of ink used to print the names of other colors (the Stroop effect). Lonnie's conjecture (with some preliminary evidence) is that LLMs experience the same effect, and that this is responsible for many of their mistakes.

Dziri et al. studied the multi-digit multiplication problem and found that the per-step error rate increased as the number of required steps grew, making GPT-4 essentially incapable of reliably multiplying numbers with many digits (roughly 5 or more). They found especially abysmal performance on out-of-distribution examples (i.e., when the number of digits differs from the examples it was trained on). They interpreted these findings to mean that LLMs' ability to handle compositional tasks is intrinsically limited, and implied this may be a fundamental limitation of LLMs.
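To see why per-step errors are so damaging, a toy model helps: if each step succeeds independently with probability p, a task needing n steps succeeds with probability p^n. This is an illustrative sketch (the independence assumption and per-step accuracy value are mine, not Dziri et al.'s):

```python
# Toy model of compounding error: a task of n independent steps, each
# succeeding with probability p, succeeds overall with probability p**n.
def task_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    return per_step_accuracy ** num_steps

# Long multiplication of two k-digit numbers needs roughly k*k single-digit
# steps, so even a high per-step accuracy decays quickly as digits grow.
for digits in (2, 3, 5):
    steps = digits * digits
    rate = task_success_rate(0.98, steps)
    print(f"{digits}-digit: ~{steps} steps, success rate {rate:.2f}")
```

Even at 98% per-step accuracy, the 5-digit case drops to roughly a coin flip plus a bit, which is consistent with reliability collapsing around 5 or more digits.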

Lonnie's experiments with the same task – multiplication – found something very different. GPT-4 handles the compositionality of this task perfectly and generalizes perfectly to out-of-distribution problems. Lonnie's prompts differed from Dziri et al.'s, of course – he studied the task as an instance of “programming in English”. Like Dziri et al., Lonnie did find abysmal performance on the task, but the problems were due to mistakes on trivial steps, usually involving simple counting or indexing (which is just an instance of counting – e.g., “find the 5th item”). GPT-4 performs each of these tasks well in isolation but becomes unreliable in the context of the full task. Lonnie's conjecture (which needs more systematic exploration) is that this occurs because the full task requires using the same skill in multiple places. It may be that millions of regions in the neural network correspond to millions of different skills, with only a small region dedicated to this one skill; once that skill is required in multiple places, an interference effect sets in, and the model becomes really bad at that simple skill in the context of the larger problem. The interesting thing is that it follows the program (i.e., handles the compositionality) really well – Lonnie hasn't yet seen it make a mistake at that level.
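For readers unfamiliar with the decomposition being discussed, here is a sketch of long multiplication broken into the same kinds of trivial sub-skills the prompts exercise in English: indexing digits, shifting by N places, summing columns, and propagating carries. (This is an illustrative decomposition, not Lonnie's actual prompt.)

```python
def digits_of(n: int) -> list[int]:
    """Index into the number: digits least-significant first."""
    return [int(d) for d in str(n)][::-1]

def shift(partial: list[int], places: int) -> list[int]:
    """Shift by N places, i.e. prepend N zeros at the low end."""
    return [0] * places + partial

def column_sums(rows: list[list[int]]) -> list[int]:
    """Sum each column across the (ragged) partial-product rows."""
    width = max(len(r) for r in rows)
    return [sum(r[i] if i < len(r) else 0 for r in rows) for i in range(width)]

def propagate_carries(cols: list[int]) -> list[int]:
    """Given column sums, compute the final digits by carrying."""
    out, carry = [], 0
    for c in cols:
        total = c + carry
        out.append(total % 10)
        carry = total // 10
    while carry:
        out.append(carry % 10)
        carry //= 10
    return out

def long_multiply(a: int, b: int) -> int:
    """Compose the trivial steps into full long multiplication."""
    rows = []
    for i, d in enumerate(digits_of(b)):
        rows.append(shift([d * x for x in digits_of(a)], i))
    result = propagate_carries(column_sums(rows))
    return int("".join(map(str, result[::-1])))
```

Each helper is exactly the kind of “trivial step” in question: the composition (`long_multiply`) is straightforward, while the counting-like operations (indexing, shifting) are where GPT-4 reportedly stumbles in context.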

Further experimentation could test it on individual steps from the full multiplication problem – the simple steps it flubs in the larger problem. Lonnie tested it on indexing into a collection and on reversing a collection, each in isolation, and it passed with flying colors, even though these are the very tasks it often flubs in context.

Some other skills worth trying in isolation: shifting by N places (i.e., appending N zeros to the end of a number), computing the carries given the column sums, and computing the digits given the column sums and the carries.
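These three sub-skills are easy to pin down precisely, which matters when writing isolation probes. A hedged sketch of each as a tiny self-contained function (the function names and little-endian column convention are my own choices for illustration):

```python
def shift_by(n: int, places: int) -> int:
    """Shifting by N places: append N zeros to the end of a number."""
    return n * 10 ** places

def carries_from_column_sums(cols: list[int]) -> list[int]:
    """The carry flowing INTO each column, given raw column sums
    (columns ordered least-significant first)."""
    carries, carry = [], 0
    for c in cols:
        carries.append(carry)
        carry = (c + carry) // 10
    return carries

def digits_from_sums_and_carries(cols: list[int], carries: list[int]) -> list[int]:
    """The final digit of each column, given column sums and incoming carries."""
    return [(c + k) % 10 for c, k in zip(cols, carries)]
```

For example, the column sums [15, 22, 13, 4] (little-endian) yield incoming carries [0, 1, 2, 1] and digits [5, 3, 5, 5], i.e. the number 5535. Probing a model with exactly these inputs, in isolation versus embedded in the full task, is the comparison the conjecture calls for.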

Lonnie thinks it would be interesting to find a series of compositional problems that involve many steps (even a variable number of steps) but where each step needs a distinctly different cognitive ability, and then compare this to compositional problems where the cognitive abilities in each step overlap heavily (as in multiplication). This type of experiment would measure more directly whether cognitive interference is real.

A competing theory is that GPT-4 is simply bad at counting (and indexing) whenever it is doing anything else (this still conflicts with Dziri's compositionality hypothesis). If true, reliable counting could emerge as a capability at any point, causing these weaknesses to disappear with scaling.