
1960s chatbot ELIZA beat OpenAI’s GPT-3.5 in a recent Turing test study

Dec 02, 2023 - arstechnica.com
Researchers at UC San Diego conducted a study evaluating AI language models in the Turing test, the classic method for judging a machine's ability to imitate human conversation. The study involved 652 participants who held text conversations with witnesses including GPT-4, GPT-3.5, ELIZA (a conversational program from the 1960s), and other humans, then judged whether each witness was human. Participants correctly identified other humans only 63% of the time. Surprisingly, ELIZA outperformed GPT-3.5: participants judged it human in 27% of its conversations, while GPT-4 achieved a 41% success rate, second only to actual humans.

The study found that participants based their judgments primarily on linguistic style and socio-emotional traits rather than on perceived intelligence alone. The researchers acknowledged the study's limitations, including potential sample bias and the lack of incentives for participants, and noted that their results may support criticisms of the Turing test as an inaccurate way to measure machine intelligence. They argued, however, that the test remains relevant as a framework for measuring fluent social interaction and deception, and for understanding how humans adapt to these systems.

Key takeaways:

  • A recent study by UC San Diego researchers tested OpenAI's GPT-4 AI language model against human participants, GPT-3.5, and ELIZA in a Turing test setup. The study found that human participants correctly identified other humans in only 63 percent of the interactions.
  • The 1960s computer program ELIZA outperformed the AI model that powers the free version of ChatGPT, scoring a 27 percent success rate. GPT-4 achieved a success rate of 41 percent, second only to actual humans.
  • The study found that participants based their decisions primarily on linguistic style and socio-emotional traits, rather than the perception of intelligence alone. Participants' education and familiarity with large language models (LLMs) did not significantly predict their success in detecting AI.
  • The authors acknowledge the study's limitations, including potential sample bias and the lack of incentives for participants. They argue that the Turing test has ongoing relevance as a framework for measuring fluent social interaction and deception, and for understanding how humans adapt to these systems.