Great question — I’m using traditional OCR engines for the initial text extraction (e.g., MathPix, Google Vision), and then applying generative AI models in a second stage to refine the output. This includes removing noisy or irrelevant elements, normalizing formatting inconsistencies, and improving alignment across multi-modal inputs.
In addition, for figures and diagrams, I use Gemini Pro Vision not just to extract the content, but to generate context-aware, structured descriptions that are better suited as ML training input — rather than just dumping raw image text.
So in short, generative AI is used here more as a smart post-processing layer to enhance the usability and semantic clarity of the OCR outputs.
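To make the post-processing idea concrete, here is a minimal sketch of that second-stage cleanup. The function name `refine_ocr_output` is hypothetical, and the deterministic rules below (collapsing whitespace, dropping noise-only lines, removing verbatim duplicates) are a stand-in for where a generative model call would go in the actual pipeline:

```python
import re

def refine_ocr_output(raw_text: str) -> str:
    """Post-process raw OCR text: normalize whitespace, drop noise-only
    lines, and remove verbatim duplicate lines. In the real pipeline this
    stage would be handled by a generative model; the rules here are a
    simple deterministic placeholder."""
    seen = set()
    cleaned = []
    for line in raw_text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()   # collapse runs of whitespace
        if not re.search(r"[A-Za-z0-9]", line):
            continue                               # skip empty or symbol-only noise
        if line in seen:
            continue                               # drop exact duplicate lines
        seen.add(line)
        cleaned.append(line)
    return "\n".join(cleaned)

raw = "Total  Energy:  42 J\n###\nTotal Energy: 42 J\nFig.   3  caption"
print(refine_ocr_output(raw))
# → Total Energy: 42 J
#   Fig. 3 caption
```

In practice the rule-based body would be replaced by a prompt to the refinement model, but the interface stays the same: raw OCR text in, cleaned and deduplicated text out.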