I want to get the token usage for the Google Gemini multimodal streaming endpoint when I pass an image as input.
Note that for non-streaming endpoints the Gemini models return token information, but for streaming endpoints it is not returned.
For OpenAI, I found how to calculate image tokens here (https://platform.openai.com/docs/guides/vision/calculating-costs and https://community.openai.com/t/how-do-i-calculate-image-tokens-in-gpt4-vision/492318). For text there are also encodings for the GPT models, such as o200k_base, which I can use through a library like tiktoken or SharpToken.
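To make clear what kind of logic I am looking for, here is a rough sketch of the OpenAI image calculation as I understand it from the pages linked above. The function name `openai_image_tokens` is my own, and the constants (85 base tokens, 170 tokens per 512px tile, the 2048px and 768px resize limits) are my reading of that guide for GPT-4 vision models; other models may use different numbers.

```python
import math
from fractions import Fraction


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens per my reading of OpenAI's vision cost guide.

    Hypothetical helper, not part of any SDK. The constants below are
    assumptions taken from the linked documentation.
    """
    if detail == "low":
        # Low-detail images are billed at a flat rate regardless of size.
        return 85

    # Use exact fractions so tile counts are not affected by float rounding.
    w, h = Fraction(width), Fraction(height)

    # Step 1: shrink (never enlarge) to fit inside a 2048 x 2048 square.
    scale = min(Fraction(1), Fraction(2048) / max(w, h))
    w, h = w * scale, h * scale

    # Step 2: shrink (never enlarge) so the shortest side is at most 768px.
    scale = min(Fraction(1), Fraction(768) / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512px tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)

    # 170 tokens per tile plus a fixed 85-token base.
    return 85 + 170 * tiles


if __name__ == "__main__":
    print(openai_image_tokens(1024, 1024))         # square image, high detail
    print(openai_image_tokens(2048, 4096))         # tall image, high detail
    print(openai_image_tokens(4096, 8192, "low"))  # flat low-detail rate
```

I am looking for the equivalent rule (or an official way to obtain the count) for Gemini.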
How can I get the token calculation logic for Gemini multimodal endpoints (where an image is passed in the prompt as input)?