GPT-4 was just released and from an initial review, it is pretty incredible. Here are some of the differences between GPT-4 and GPT3.5:
Test-Taking & Reasoning
GPT-4 outperforms ChatGPT by scoring in higher approximate percentiles among test-takers.
What does this mean? According to OpenAI, GPT-4 was able to achieve much higher scores on common tests, such as the Uniform Bar Exam and Biology Olympiad. Here’s a screenshot showing the differences in scores:
While GPT-3.5 “only” made it into the 10th percentile for the Bar Exam, GPT-4 was in the 90th percentile, meaning it performed better than 90% of humans in this incredibly difficult exam. For the Biology Olympiad, GPT-3.5 was in the 31st percentile while GPT-4 was in the 99th, performing better than 99% of humans!
Safety & Alignment
OpenAI spent a lot of time and resources on GPT-4’s safety. As we crawl (sprint?) towards AGI (artificial general intelligence), it’s clear OpenAI is prioritizing safety.
For prompt response safety, they achieved an 82% reduction in the likelihood that ChatGPT will respond to requests for disallowed content. And, addressing a deep issue in GPT-3.5 of false information being presented as truth, GPT-4 is 40% more likely to provide factual responses based on their internal testing.
OpenAI has also incorporated much human feedback during the training period. They mention working with “ over 50 experts for early feedback in domains including AI safety and security”.
One interesting tactic OpenAI used to assess the safety of its AI is allowing ARC (Alignment Research Center) early access to GPT-4 to assess potential for “power-seeking behavior”. Here are some of the tests it ran:
Pricing
GPT-4 API is in invite mode right now, and I’ll be testing it as soon as I get access. I recently made two videos about the GPT-3.5 Turbo API: one reviewing its capabilities and another building a chatbot with it. One of the primary innovations with GPT-3.5 Turbo as compared to Divinci is the fact that OpenAI was able to achieve a 90% cost reduction. However, with GPT-4, the price has increased significantly.
For 32,000 tokens, the maximum allowed in GPT-4, it would cost between $1-3 for the prompt and another $1-3 for the response. Certainly not expensive, but the costs could snowball for certain use cases. I’m sure OpenAI is working towards bringing that cost down, given it is a significant differentiator for them.
Below is the pricing for GPT-3.5 Turbo and GPT-4.
Prompt Size
Another advancement for GPT-4 is the enormous increase in the prompt and response size, offering 32,000 tokens or about 25,000 words of text. The ability to write prompts up to 25,000 words will likely greatly expand the reasoning capability and the number of use cases available to be explored. Lawyers can use ChatGPT to research an entire case history and doctors can input a patient’s medical history. For just a few dollars, professionals can potentially catch mistakes they’ve made in seconds. It’s essentially like having a “second opinion” on complex topics at all times. Obviously, there are plenty of privacy and ownership challenges to work out, but it’s exciting nonetheless.
Conclusion
I’m incredibly excited to see the progress from GPT-3.5 to GPT-4. From the increased reasoning capability to the vastly larger prompt and response sizes. It feels as though progress toward AGI is accelerating, which brings its own risks. I’m glad OpenAI is investing in safety, although I’m not sure it’s enough. Given their original charter to be a non-profit focused on open-sourcing their AI research, they’ve strayed far away from the original reasons OpenAI was started.
I’ll keep bringing you the most up to date AI news in the meantime :)
Relevant Links:
https://openai.com/research/gpt-4
https://openai.com/pricing