Benchmarks for AGI

Patrick Reany

13 December 2023

• Introduction:

Everybody's talking about the coming of AGI (Artificial General Intelligence), but do we
know how to recognize it when it's here? To do that, we have to invent some minimal set
of benchmarks that the candidate AI must pass. I asked both BingChat and ChatGPT 3.5 for
help defining these benchmarks. What follows is from the conversations I had with each
of them on 12-13 December 2023.

Patrick:
What are ten AGI benchmarks that will help us decide when an AI machine has reached AGI?

BingChat:
Certainly! The journey toward Artificial General Intelligence (AGI) is a fascinating one, and
defining benchmarks to measure progress is crucial. Here are five notable AGI benchmarks
that researchers and practitioners consider: [Okay, I'll settle for five.]

1. AGIEval: This human-centric benchmark evaluates foundation models using standardized
exams such as college entrance tests, law school admission exams, and math competitions.
It assesses abilities relevant to human cognition and decision-making [1].

2. The Tong Test: Proposed by Tong et al., this test focuses on five critical characteristics:

• Infinite Tasks: AGI should handle an unlimited variety of tasks.
• Self-Driven Task Generation: AGI should autonomously create new tasks.
• Value Alignment: AGI’s goals align with human values.
• Causal Understanding: AGI comprehends cause-and-effect relationships.
• Embodiment: AGI interacts with the physical world [2].

3. Levels of AGI: This framework classifies AGI capabilities and behavior, introducing levels
of performance, generality, and autonomy. It helps track progress toward AGI [3].

4. DeepMind’s Perspective: DeepMind suggests a broad suite of cognitive and metacognitive
tasks as an AGI benchmark. These tasks cover linguistic intelligence, mathematical reasoning,
spatial understanding, social intelligence, learning new skills, and creativity [4].

5. Fitness, Breadth, and Specificity: According to a 2010 paper, AGI benchmarks should satisfy
seven criteria, including fitness, breadth, specificity, low cost, simplicity, range, and task focus [5].

Remember that these benchmarks provide valuable insights, but AGI remains a complex and
multifaceted goal. Researchers continue to explore new ways to evaluate progress and define
AGI milestones.

Patrick:
What are ten AGI benchmarks that will help us decide when an AI machine has reached AGI?

ChatGPT:
Defining specific benchmarks for achieving Artificial General Intelligence (AGI) is challenging
because AGI implies a level of intelligence comparable to human capabilities across a broad
range of tasks. However, researchers and experts have proposed several key indicators or
benchmarks that could help assess progress toward AGI. Here are ten potential benchmarks:

1. Problem Solving: An AGI system should be able to effectively solve a wide range of problems
that require reasoning, planning, and adaptation, similar to how humans approach problem-solving.

2. Commonsense Reasoning: AGI should exhibit a deep understanding of everyday situations,
possessing commonsense knowledge that enables it to make inferences and decisions in a manner
consistent with human intuition.

3. Transfer Learning: AGI should be capable of learning from one domain and applying that
knowledge to perform effectively in a different, unfamiliar domain, showcasing the ability to
transfer learning across diverse tasks.

4. Natural Language Understanding and Generation: Achieving human-level proficiency in
understanding and generating natural language across various contexts is a crucial benchmark
for AGI.

5. Social Intelligence: AGI should demonstrate an understanding of social dynamics, emotions,
and empathy, enabling it to interact and collaborate effectively with humans in various social
settings.

6. Continuous Learning: The ability to acquire new knowledge and skills over time, adapting to
a changing environment and incorporating new information into its existing knowledge base.

7. Creativity and Innovation: AGI should be capable of generating novel ideas, solutions, and
creative outputs in a manner that reflects human-like creativity.

8. Ethical Decision-Making: Demonstrating a capacity for ethical reasoning and decision-making,
considering the implications and consequences of its actions in a manner consistent with human
ethical principles.

9. Robustness and Adversarial Resistance: AGI should be resilient to various forms of adversarial
attacks and demonstrate a robust performance across different conditions and environments.

10. Self-Awareness and Self-Improvement: AGI should exhibit a level of self-awareness,
recognizing its own strengths and limitations, and actively seek to improve its capabilities over
time through self-directed learning and optimization.

It's important to note that these benchmarks are broad and somewhat subjective, and achieving
AGI involves a holistic approach that encompasses various aspects of cognition, perception, and
interaction. Additionally, the development of AGI raises ethical concerns and considerations,
including issues related to control, transparency, and accountability. The field of AGI development
is dynamic, and the benchmarks may evolve as our understanding of intelligence and machine
learning advances.

Conclusion: To me, a part of intelligence is to be able to recognize subtle differences and subtle
errors, and then to deal with them correctly. For example, BingChat told us that a criterion of the
Tong Test is that of Value Alignment: AGI’s goals align with human values. However, this is
a value judgment. Value judgments are not really what's on trial for determining intelligence.
On the other hand, to have AI align with positive human values is a good thing, which will
require the AI to use intelligence to be able to accomplish this goal. So, yes, there is definitely
an issue of intelligence on this point. These mismatches between what you're expecting and what
you are given can cause confusion. Part of being intelligent is to find a way to get oneself
unconfused, if possible.

Another aspect of intelligence is to be able to "think on one's feet." This means to have the
ability to think under duress or under conditions not prepared for or anticipated, or under new
and evolving constraints.

Part of intelligence is the ability to find useful abstractions within a single domain of problem
solving that make a variety of problems, which look very different at face value, look the "same"
once the different names are abstracted away. For example, in solving algebra word problems, one can regard
the following two scenarios as very different: 1) Two printers working together at different rates
and at different times to complete a print job, and 2) two house painters working together at different
rates and at different times to paint a house. However, when we view these problems from a more
abstract point of view, they are really the "same" problem, because they are both problems in which
two "machines" work together at different rates and at different times to complete a task or job.
However, this form of abstraction is not 'abstraction for its own sake', but for a practical purpose.
It takes intelligence to make this form of abstraction without confusing oneself that there is some
novel ontological interpretation of "machine" going on here. There isn't. It's just a generalized label.
It takes intelligence to take an abstraction for what it is, even when the abstraction has not been
characterized as such. We often refer to this as "reading between the lines."
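
The shared structure of the two word problems can be made concrete with a small sketch. The
function name time_to_finish, the (rate, start_time) representation of a worker, and the numbers
used below are all illustrative assumptions, not part of any particular textbook problem. The
point is only that a single routine serves both the printers and the painters once each is
treated as a generic "machine."

```python
from fractions import Fraction


def time_to_finish(workers, job=Fraction(1)):
    """Return the time at which the job is complete.

    workers: list of (rate, start_time) pairs, where rate is the fraction of
    the job done per unit time and start_time is when that worker begins.
    Both names are hypothetical, chosen for this illustration.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    events = sorted(workers, key=lambda w: w[1])  # order by start time
    done = Fraction(0)   # portion of the job completed so far
    rate = Fraction(0)   # combined rate of the workers currently active
    now = events[0][1]
    for r, start in events:
        if rate > 0:
            # Would the active workers finish before this one joins?
            finish = now + (job - done) / rate
            if finish <= start:
                return finish
        done += rate * (start - now)
        now = start
        rate += r
    if rate <= 0:
        raise ValueError("combined rate must be positive")
    return now + (job - done) / rate


# Printers: A alone needs 6 hours and starts at hour 0;
# B alone needs 3 hours and starts at hour 2.
printers = [(Fraction(1, 6), 0), (Fraction(1, 3), 2)]

# Painters: the same numbers, read as days instead of hours.
painters = [(Fraction(1, 6), 0), (Fraction(1, 3), 2)]

print(time_to_finish(printers))  # 10/3
print(time_to_finish(painters))  # 10/3
```

Nothing in the function knows whether it is handling printers or painters. The labels live only
in the variable names, which is exactly the sense in which "machine" is just a generalized label.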

Part of intelligence is to realize that we live within a universe of nuance and subtlety, and thus
often any proposed solution has to include the words: "it depends."

Lastly, intelligence is required to rank possible solutions by their likelihood of success
weighed against their cost, a cost that is paid whether the solution works or fails. The variables
that can enter into this ranking include urgency, practicality, morality, consequences of each
choice, permanence of the possible solution, etc. Intelligence is the ability to troubleshoot a
complex problem by prioritizing possible faults and their best possible solutions.

The three areas of intelligence that I think will be most difficult for AI to master will be music,
creative writing, and humor. Perhaps these should not be included in the qualifications for AGI, but
left as qualifications for super-AGI (SAGI).

Afterword: I asked BingChat to give me a list of ten criteria, but it gave me a list of only five,
whereas ChatGPT gave me a list of ten. Was BingChat unintelligent in this?

My Answer: ChatGPT did not give references (which is fine), but BingChat did. BingChat gave
me more than enough items to make ten bullet points if it had wanted to, but then it would have
scrambled up the references, which would come at a cost. Maybe it did the right thing.