However, the actual throughput they measure is significantly lower, at around 32 tokens per second, and the author is unsure what could explain such a large gap between the expected and observed figures, asking for clarification or likely causes.
Key takeaways:
- The user is trying to estimate how many tokens per second the "llama 7b" model should produce when deployed on an A10G GPU.
- The formula used is: tokens per second = peak FLOPs per second / (2 × number of model parameters), reflecting the rule of thumb that generating one token costs roughly 2 FLOPs per parameter (see the sketch after this list).
- By this calculation, they expect roughly 2,251 tokens per second.
- However, the actual output they are getting from the model is approximately 32 tokens per second, leading them to ask whether they are missing something in their calculation.
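The arithmetic in the post can be reproduced with a minimal sketch. Assumptions: the 31.52 TFLOPS figure is not stated explicitly but is the A10G's peak FP32 rate and is the only value that yields the quoted 2251.43 result; the parameter count and observed throughput come from the post. This is an illustration, not the poster's actual code.

```python
# A minimal sketch reproducing the poster's arithmetic. The 31.52 TFLOPS value
# is an assumption (it is the A10G's peak FP32 rate and reproduces the number
# quoted in the post); the parameter count and observed throughput come from
# the post itself.

peak_flops = 31.52e12   # assumed A10G peak compute, in FLOPs per second
num_params = 7e9        # "llama 7b" parameter count

# Rule of thumb: generating one token costs roughly 2 FLOPs per parameter,
# so the compute-bound ceiling is FLOPs/s divided by 2 * N.
expected_tok_per_s = peak_flops / (2 * num_params)
print(f"expected: {expected_tok_per_s:.1f} tokens/s")  # -> 2251.4

observed_tok_per_s = 32  # throughput reported in the post
print(f"observed: {observed_tok_per_s} tokens/s "
      f"(~{expected_tok_per_s / observed_tok_per_s:.0f}x below the estimate)")
```

One common explanation for a gap of this size, though not settled in the post itself, is that single-batch token generation tends to be bound by memory bandwidth rather than by peak compute, so the compute-only formula overstates achievable throughput.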