Last updated on July 4, 2025
Previously I posted “a prime model for visual related AI and trained by experiments with diverse sensors”. That prime model for visual can be correlated with text and further connected to a separate text-only LLM, which realizes a two-way pipeline of “text to visual” and “visual to text”.
A multimodal AI model comprising a visual model with a text dataset in a standard grammar/format and a text language model:
Defining a text grammar/format for the text dataset. The grammar/format should be clear, accurate, simple, and fixed, to minimize the complexity of the text the visual model has to process.
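As a rough sketch of what such a fixed grammar/format could look like, here is a minimal line-oriented format with a validator. The field names and vocabulary (object, color, position, size) are purely illustrative assumptions, not an existing standard:

```python
# A minimal sketch of a fixed, machine-friendly grammar/format for describing
# a visual scene. The field names and vocabulary below are illustrative
# assumptions, not part of any existing standard.
import re

# One line per detected object, in a rigid "key=value; ..." form, e.g.:
#   object=car; color=red; position=left; size=large
LINE_PATTERN = re.compile(
    r"^object=[a-z_]+; color=[a-z_]+; "
    r"position=(left|center|right); size=(small|medium|large)$"
)

def is_valid_description(text: str) -> bool:
    """Check that every line of a scene description follows the fixed grammar."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    return bool(lines) and all(LINE_PATTERN.match(line) for line in lines)

example = """\
object=car; color=red; position=left; size=large
object=person; color=unknown; position=center; size=medium
"""
print(is_valid_description(example))  # True
```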
Adding a text dataset to a visual model (the prime model for visual, along with all other physical parameters).
Training the visual model with the text dataset to generate the correlation between the visual and the text (and the other physical parameters in the prime model). At the beginning, the text in the grammar/format that describes the visual can be written by humans; after enough training, the text in the grammar/format can be generated by the visual model itself from the visuals it was trained on and then corrected by humans to enhance training.
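A minimal sketch of how visuals and grammar-formatted text could be paired for this training, written here in PyTorch; the dataset class and the fake samples are hypothetical stand-ins for the prime model's actual training data:

```python
# Hypothetical sketch: pairing visuals with text written in the fixed
# grammar/format, so the visual model can learn the correlation between them.
import torch
from torch.utils.data import Dataset

class VisualWithTextDataset(Dataset):
    def __init__(self, samples):
        # samples: list of (image_tensor, description_text) pairs, where each
        # description follows the fixed grammar/format (human-written at first,
        # later model-generated and human-corrected).
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image, text = self.samples[idx]
        return image, text

# Illustrative usage: two fake RGB images with grammar-conformant descriptions.
fake_samples = [
    (torch.rand(3, 224, 224), "object=car; color=red; position=left; size=large"),
    (torch.rand(3, 224, 224), "object=dog; color=brown; position=right; size=small"),
]
dataset = VisualWithTextDataset(fake_samples)
image, text = dataset[0]
print(image.shape, text)
```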
Training a text language model (such as an already well-trained LLM) to transform text in any grammar/format into text in the standard grammar/format.
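One simple way to do this is to prompt an existing LLM with the grammar specification; the prompt template and the call_llm() placeholder below are hypothetical, and fine-tuning on such input/output pairs would serve the same purpose:

```python
# Hypothetical sketch: asking an already-trained LLM to rewrite free-form text
# into the fixed grammar/format. call_llm() is a placeholder for whatever LLM
# interface is actually used.
GRAMMAR_SPEC = (
    "Each line must have the exact form:\n"
    "object=<noun>; color=<color>; position=<left|center|right>; "
    "size=<small|medium|large>"
)

def build_translation_prompt(free_form_text: str) -> str:
    """Build a prompt asking the LLM to restate text in the standard grammar."""
    return (
        "Rewrite the following description using only this grammar:\n"
        f"{GRAMMAR_SPEC}\n\n"
        f"Description: {free_form_text}\n"
        "Rewritten:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any LLM client here.
    raise NotImplementedError

prompt = build_translation_prompt("A big red car is parked on the left side of the street.")
print(prompt)
# Expected output from the LLM, once connected:
#   object=car; color=red; position=left; size=large
```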
Connecting the visual model and the text language model to each other directly by exchanging text in the standard grammar/format, which gives the multimodal AI model its two-way visual-to-text and text-to-visual pipeline.
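Putting the pieces together, the two-way pipeline amounts to composing the two models through the grammar-formatted text; all four inner functions below are hypothetical stand-ins for the trained models:

```python
# Hypothetical sketch of the two-way pipeline: the visual model and the text
# language model only ever exchange text in the fixed grammar/format.

def visual_to_grammar(image) -> str:
    """Visual model: describe an image as text in the fixed grammar/format."""
    ...

def grammar_to_visual(description: str):
    """Visual model: generate a visual from grammar-formatted text."""
    ...

def free_text_to_grammar(text: str) -> str:
    """Text language model: rewrite arbitrary text into the fixed grammar."""
    ...

def grammar_to_free_text(description: str) -> str:
    """Text language model: expand grammar-formatted text into natural language."""
    ...

def text_to_visual(user_text: str):
    # "Text to visual": the LLM normalizes the text, the visual model renders it.
    return grammar_to_visual(free_text_to_grammar(user_text))

def visual_to_text(image) -> str:
    # "Visual to text": the visual model describes the image, the LLM verbalizes it.
    return grammar_to_free_text(visual_to_grammar(image))
```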
The advantages of this multimodal AI model include:
1) Flexibility. The visual model with its text dataset and the text language model are two separate, independent models, so they can be trained separately and independently, which makes training much more efficient and flexible.
2) Efficiency. This multimodal AI model has no fusion layer between the visual model and the text language model, so it is simpler and faster to train and to run inference with. It is also not a unified model that combines all kinds of text flows and visual flows into one single model, which would increase the model's complexity too much and degrade training and inference performance.
3) Standardization. The standard grammar/format could be published as something like an ISO or IEEE public standard, and different organizations could build their visual and language models on this universal international public standard or on their own private grammar/format. Once such a public standard for the grammar/format is established, you could even combine any visual model with any text language model trained on that standard; for example, you could buy a visual model with a text dataset from one company and combine it with a text language model bought from another company.