I’m using the Google Vision API’s TEXT_DETECTION feature to extract text from images. The API returns JSON containing various bounding boxes with fragmented text, especially when the source is a photo, resulting in numerous scattered text boxes. I’m currently invoking the API with the following command:
gcloud ml vision detect-text ./path/to/local/file.jpg
More details on the API can be found here: Google Cloud Vision OCR Documentation.
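For what it's worth, the JSON that command emits also carries a `fullTextAnnotation` field with Vision's own consolidated page text; pulling it out is trivial (a sketch with a hand-made sample payload, since my real response files are long — but on photos its line breaks are exactly the scattered fragments I'm describing):

```python
# Trimmed, hand-made sample in the shape of the file that
# `gcloud ml vision detect-text` writes (field names from the
# published response schema; real files are much larger).
sample = {
    "responses": [
        {"fullTextAnnotation": {"text": "Invoice 42\nTotal: $10\n"}}
    ]
}

def full_text(response_json: dict) -> str:
    """Return Vision's own consolidated whole-page text."""
    return response_json["responses"][0]["fullTextAnnotation"]["text"]

print(full_text(sample))
```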
Despite exploring Google’s Vision API and Document AI documentation, I haven’t found a built-in feature, or any guidance, for compiling these fragments into a cohesive text or markdown file.
I wondered whether it might be possible to leverage open-source tools designed for generating text files from image data, adapting them to consume the output of Google Vision’s TEXT_DETECTION instead of relying on solutions like Tesseract.
Question: Is there an existing tool, script, or library—preferably one that runs on Linux and can be utilized via Bash, Rust, or Python—that can convert the JSON output from Google Vision API TEXT_DETECTION into a readable text or markdown file?
Alternatives involving other APIs or tools are also welcome.
What I’ve Tried:
- Executing the `gcloud ml vision detect-text ./path/to/local/file.jpg` command to detect text.
- Reviewing the Google Vision and Document AI documentation for potential solutions.
I’m looking for a more streamlined way to convert these JSON outputs into a consolidated markdown or text document. Any guidance or suggestions would be greatly appreciated.