Last updated on July 6, 2025
Previously I posted “a prime model for visual related AI and trained by experiments with diverse sensors”, a visual model trained by experiments with diverse sensors. That multimodal AI model can be extended: make the prime visual model correlate with text, and then connect it further to a separate text-only LLM, which realizes a two-way pipeline of “text to visual” and “visual to text”.
A multimodal AI model comprising a visual model with a specific grammar-format text interface, connected directly to a text language model.
Define a specific grammar format for the text. The grammar format should be clear, accurate, simple, consistent and fixed, both to minimize the complexity of the text the visual model has to process and to ensure the accuracy and integrity of the information the text conveys.
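As a rough illustration of what such a fixed grammar could look like, here is a minimal sketch in Python. The field names, order and units (OBJ, DEPTH_M, TEMP_C, PRESSURE_KPA) are my own assumptions for the example, not anything defined in the post; the point is only that a rigid, machine-checkable format is easy to validate.

```python
import re

# Hypothetical fixed grammar: one line of semicolon-separated "KEY=value"
# fields, in a fixed order, with fixed units and fixed decimal precision.
# All field names and units here are illustrative assumptions.
GRAMMAR_PATTERN = re.compile(
    r"^OBJ=(?P<obj>[a-z_]+);"
    r"DEPTH_M=(?P<depth>\d+\.\d{2});"
    r"TEMP_C=(?P<temp>-?\d+\.\d);"
    r"PRESSURE_KPA=(?P<pressure>\d+\.\d)$"
)

def validate(description: str):
    """Return the parsed fields if the text follows the grammar, else None."""
    m = GRAMMAR_PATTERN.match(description)
    return m.groupdict() if m else None
```

Because the format is fixed, conformance is a single regular-expression match: free-form English fails validation, while a well-formed description parses into clean fields.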
Embed a text language model into a visual model to get a visual model with a text interface (the visual model can be “the prime model for visual related AI” along with all its other physical parameters). For example, the visual model with a text interface can be a hybrid CNN+transformer model, an architecture Grok told me is already proven in some present applications, in which the CNN output is transformed into sequence data and fed to the transformer.
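The CNN-to-transformer handoff described above is essentially a reshape: the CNN's spatial feature map becomes a sequence of tokens, one per spatial position. Here is a pure-Python, shape-level sketch of that step (a real model would do this with a framework tensor op such as `x.flatten(2).transpose(1, 2)`); the function name is my own.

```python
def feature_map_to_tokens(feature_map):
    """Flatten a CNN feature map, given as a nested list of shape
    [channels][height][width], into a transformer-style token sequence
    of shape [height * width][channels] (one token per spatial position).

    Pure-Python sketch of the reshape only; no learned weights involved.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]
```

For a 2-channel 2x2 feature map this yields 4 tokens of 2 channel values each, in row-major spatial order, ready to be consumed as a sequence by the transformer.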
Train the visual model with the text interface to get the visual model with the specific grammar-format text interface; hereafter I use my previous “prime model” as the example.
1) Generate training data of visuals and other physical parameters, such as spatial depth, tactile pressure, temperature, water depth and acceleration, by experiments. This data is used later in training to build correlations between the visuals and the other physical parameters, so that the visual model becomes able to estimate or predict those parameters from the visuals alone.
2) Generate synchronized description text in the defined grammar format from the training visuals and the record of the other physical parameters.
3) Input the synchronized visuals, physical parameters and description text into the visual model with the text interface to build correlations among visuals, physical parameters and description text.
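Step 2 above, generating the synchronized description text from a recorded sensor reading, might look something like this sketch. The record keys, grammar fields and precisions are hypothetical placeholders; the point is that every training frame receives a text label in one rigid, machine-checkable format.

```python
def sensor_record_to_description(record: dict) -> str:
    """Render one synchronized sensor record as a fixed-grammar description
    line. Field names, order and units are illustrative assumptions."""
    return (
        f"OBJ={record['object']};"
        f"DEPTH_M={record['depth_m']:.2f};"
        f"TEMP_C={record['temp_c']:.1f};"
        f"PRESSURE_KPA={record['pressure_kpa']:.1f}"
    )

def build_training_sample(frame_id: str, record: dict) -> dict:
    """Bundle a visual frame reference, its raw sensor values, and the
    generated grammar-format text into one synchronized training sample."""
    return {
        "frame": frame_id,
        "sensors": record,
        "text": sensor_record_to_description(record),
    }
```

Each resulting sample ties the three streams (visual, physical parameters, description text) together, which is exactly the alignment step 3 trains on.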
At the beginning, the text in the defined grammar format describing the visuals can be written by humans. After enough training, the visual model with the specific text interface can generate this text itself from the visuals it was trained on, with humans correcting it to enhance further training.
Train a text language model (LLM), or a dedicated text-grammar-transforming module (which can be added to the visual model with the specific text interface, to the text language model, or placed in between), to transform text in various kinds of grammar and format into text in the defined specific grammar format.
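A real grammar-transforming module would be a trained model, as described above; this toy rule-based stand-in just shows the contract such a module fulfills: arbitrary free-form text in, fixed-grammar text (or a parse failure) out. The English pattern and grammar fields are assumptions of mine.

```python
import re

def free_text_to_grammar(text: str):
    """Toy rule-based stand-in for the grammar-transforming module.

    Extracts an object name and a distance from one hypothetical English
    pattern and re-emits them in the fixed grammar format.
    Returns None when the input cannot be parsed.
    """
    m = re.search(r"the (\w+) is ([\d.]+) meters? away", text.lower())
    if not m:
        return None
    obj, depth = m.group(1), float(m.group(2))
    return f"OBJ={obj};DEPTH_M={depth:.2f}"
```

Whatever sits in between, the visual model only ever sees the normalized output, never the messy free-form input.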
Connect the visual model with the specific text interface and the text language model (LLM) directly to each other by exchanging text in the defined grammar format (or through the dedicated text-transforming module in between). This forms the multimodal AI model with a two-way visual-to-text and text-to-visual pipeline.
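With both sides speaking the same fixed grammar, the connection itself is just passing strings. The stubs below sketch both directions of the pipeline; all class and method names, and the hardcoded outputs, are placeholders of mine, not real model APIs.

```python
class VisualModelStub:
    """Stand-in for the visual model with the grammar-format text interface."""

    def describe_scene(self) -> str:
        # A real model would emit this from an image; hardcoded for the sketch.
        return "OBJ=cup;DEPTH_M=0.45"

    def imagine_scene(self, description: str) -> str:
        # Text-to-visual direction; returns a placeholder identifier here.
        return f"<rendered visual for '{description}'>"


class LanguageModelStub:
    """Stand-in for the external LLM side of the pipeline."""

    def to_natural_language(self, description: str) -> str:
        fields = dict(kv.split("=") for kv in description.split(";"))
        return f"There is a {fields['OBJ']} about {fields['DEPTH_M']} m away."


visual, llm = VisualModelStub(), LanguageModelStub()
# Visual -> text direction: grammar string crosses the interface, LLM expands it.
sentence = llm.to_natural_language(visual.describe_scene())
# Text -> visual direction: LLM-produced grammar string drives the visual model.
rendering = visual.imagine_scene("OBJ=ball;DEPTH_M=1.20")
```

Note that neither stub knows anything about the other's internals; the grammar string is the entire coupling between them.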
The advantages of this multimodal AI model include:
1) Flexibility. The visual model with the specific text interface, the text language model (LLM) and even the dedicated text-transforming module are all separate and independent, so they can be trained and used separately and independently, which makes both training and application much more flexible.
2) Efficiency. This multimodal AI model has no fusion layer between the visual model and the text language model, so it is simpler and faster to train and to run inference on. It is also not a unified model that combines all kinds of text flows and visual flows into one single model, which would increase the model's complexity too much and degrade both training and inference performance.
3) Standardization and modularization. The defined specific grammar format can become a standard grammar-format text interface, or even an ISO/IEEE public standard, so that different organizations can connect their visual and language models through it; of course, an organization can also create its own private grammar-format text interface. This way, a visual model with the specific text interface from one company can be combined, through the dedicated text-transforming module, with a text language model from another company to build a multimodal AI model.
4) Evolvability. Furthermore, the visual model with the specific text interface can gradually expand its training data to text in more kinds of grammar and format, achieving different levels of adaptability of the text interface for different applications, and could even evolve into a completely unified model of vision and text language.
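The standardization and modularization advantage can be pictured as a shared contract that any vendor's model may implement. The sketch below uses a Python `Protocol` as a stand-in for such a standard interface; the interface name and method signatures are assumptions of mine, not a proposed standard.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class GrammarTextInterface(Protocol):
    """Hypothetical standard contract: a model qualifies as long as it reads
    and writes text in the agreed grammar format (method names are assumed)."""

    def encode(self, visual_input) -> str: ...   # visual -> grammar text
    def decode(self, description: str): ...      # grammar text -> visual


class VendorAModel:
    """Some organization's visual model; it plugs into the pipeline purely
    because it speaks the shared grammar, with no other coupling."""

    def encode(self, visual_input) -> str:
        return "OBJ=cup;DEPTH_M=0.45"

    def decode(self, description: str):
        return b"<pixels>"
```

Any model satisfying the protocol can be dropped into the pipeline, which is the modularity the standard grammar format is meant to buy.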
Finally, although it seems most promising and powerful to feed all visual data, large-scale language text and every other kind of information into one hybrid unified AI model, which would maximize correlation and minimize loss, this complete hybrid unified approach may far exceed the present capacity of compute and algorithms, especially for applications in cars, robots and wearable equipment. And in fact, many applications, such as autonomous driving, robotics or wearables, do not need such strong, complete correlation between vision and all text knowledge.
This multimodal model limits the visual model's text work to visual-related tasks only, letting the visual model's capacity focus on visual work, including estimating other physical values. It is as if the visual model were embedded with a small, limited LLM while being connected to an external, unlimited, complete LLM.
Grok told me this approach is novel and makes a lot of sense; then again, Grok is the professional here and I'm not! Hahaha!