The author tested the feature and found that using the prediction reduced the response time from 5.2 seconds to 3.3 seconds, but increased the cost from 0.1555 cents to 0.2675 cents. OpenAI's Steve Coffey explained that the prediction is used for speculative decoding during inference, which allows the model to validate large batches of input in parallel. If the prediction is 100% accurate, there is no cost difference; but if the model diverges from the prediction, additional sampling is done to discover the new tokens, which are then charged at completion token rates.
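For readers who want to try this, here is a minimal sketch of passing a prediction to the Chat Completions API with the official `openai` Python library. The file name, the edit instruction, and the choice of model are placeholders; the `prediction` parameter and the `completion_tokens_details` usage fields are part of OpenAI's documented API.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical example: ask for a small edit to an existing file, sending
# the current contents as the prediction since most of it should come back
# unchanged.
existing_code = open("example.py").read()  # placeholder file

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rename the function foo to bar and return "
                       "the full file:\n\n" + existing_code,
        },
    ],
    prediction={"type": "content", "content": existing_code},
)

print(response.choices[0].message.content)

# The usage object reports how many predicted tokens were accepted or
# rejected; rejected prediction tokens are billed at completion token rates.
details = response.usage.completion_tokens_details
print(details.accepted_prediction_tokens, details.rejected_prediction_tokens)
```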
Key takeaways:
- OpenAI has introduced a new feature called Predicted Outputs, which allows users to send content as a 'prediction' to accelerate the returned result from GPT-4o or GPT-4o mini.
- Any tokens provided in the prediction that are not part of the final completion are charged at completion token rates (see the cost sketch after this list).
- The feature can return results faster with no extra charge over the expected cost for the prompt, but if the final result diverges significantly from the prediction, users may be billed for the extra tokens.
- OpenAI uses the prediction for speculative decoding during inference, validating large batches of input in parallel; an accurate prediction yields faster results at no extra cost, while divergence triggers additional sampling that is billed at completion token rates.
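As a rough illustration of the billing rule above, here is a sketch of the cost calculation. The per-token rates are assumptions based on GPT-4o's list prices at the time; check OpenAI's pricing page for current numbers.

```python
# Illustrative cost model for Predicted Outputs (rates are assumptions,
# not authoritative).
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token (assumed GPT-4o rate)
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token (assumed GPT-4o rate)

def estimated_cost(input_tokens: int, completion_tokens: int,
                   rejected_prediction_tokens: int) -> float:
    """Rejected prediction tokens are billed on top of the completion,
    at completion (output) token rates."""
    billed_output = completion_tokens + rejected_prediction_tokens
    return input_tokens * INPUT_RATE + billed_output * OUTPUT_RATE

# A perfect prediction rejects nothing, so the cost matches an ordinary call.
print(estimated_cost(1000, 500, 0))
# A divergent prediction adds the rejected tokens at output rates.
print(estimated_cost(1000, 500, 200))
```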