Last updated on October 20, 2025
Now I think that using just one transformer/diffusion model as the prime/physics pixel model may be the best approach. I was not sure of this previously because, after all, I am not a professional and lack some basic knowledge of AI models.
In the physics pixel model, all data, including the 2D visual information, the other physics parameters, and the corresponding text, live together, so all the data in a physics pixel frame should be turned into a sequence of tokens (or some other common format) and sent directly to one transformer/diffusion model to process. There is no need for something like a CNN to preprocess the visuals into tokens.
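To make this concrete, here is a minimal sketch of what that unified tokenization could look like, assuming a PyTorch setup. All class and field names (PhysicsPixelTokenizer, the patch size, the number of physics parameters) are my own illustrative assumptions, not a definitive design: visual patches are embedded with a plain linear layer instead of a CNN, physics parameters and text tokens get their own embeddings, and everything is concatenated into one sequence for a single transformer.

```python
import torch
import torch.nn as nn

class PhysicsPixelTokenizer(nn.Module):
    """Hypothetical sketch: embed visual patches, physics parameters,
    and text tokens into one shared sequence for a single transformer.
    All sizes and names here are illustrative assumptions."""
    def __init__(self, d_model=512, patch=16, n_params=8, vocab=32000):
        super().__init__()
        # A linear patch embedding replaces any CNN preprocessing stage.
        self.patch_embed = nn.Linear(patch * patch * 3, d_model)
        self.param_embed = nn.Linear(n_params, d_model)   # physics parameters per frame
        self.text_embed = nn.Embedding(vocab, d_model)    # text token ids
        self.patch = patch

    def forward(self, frame, params, text_ids):
        # frame: (B, 3, H, W) -> cut into non-overlapping patches -> embed
        B, C, H, W = frame.shape
        p = self.patch
        patches = frame.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        vis = self.patch_embed(patches)                           # (B, N_vis, d)
        phys = self.param_embed(params).unsqueeze(1)              # (B, 1, d)
        txt = self.text_embed(text_ids)                           # (B, N_txt, d)
        # One mixed sequence: the single model sees all modalities at once.
        return torch.cat([vis, phys, txt], dim=1)

tokenizer = PhysicsPixelTokenizer()
seq = tokenizer(torch.randn(1, 3, 64, 64), torch.randn(1, 8),
                torch.randint(0, 32000, (1, 12)))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)
out = encoder(seq)  # (1, 16 + 1 + 12, 512): 16 visual patches, 1 physics, 12 text
```

The point of the sketch is only the data flow: every modality becomes tokens in the same sequence before the transformer, so no separate vision backbone is needed.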
The text capability of this one transformer/diffusion model could be a comparatively limited text model with a specific text-format interface for communicating with an outside full LLM, or it could be a small built-in LLM; either option is more suitable for autonomous driving and robotics.
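One way such a text-format interface might look, purely as an assumption on my part since the post does not specify a format, is a small structured message that the onboard physics pixel model sends to the outside full LLM. The field names below (SceneReport, frame_id, physics_summary, and so on) are hypothetical:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical message format for the text interface between an onboard
# physics pixel model and an outside full LLM; all field names are assumptions.
@dataclass
class SceneReport:
    frame_id: int
    objects: list          # short text labels produced by the onboard model
    physics_summary: str   # e.g. velocities, contacts, predicted trajectories
    question: str          # what the onboard model asks the full LLM to decide

report = SceneReport(
    frame_id=1042,
    objects=["pedestrian, left curb", "cyclist, ahead 12m"],
    physics_summary="ego speed 8.3 m/s; cyclist closing at 2.1 m/s",
    question="Should the vehicle yield or proceed?",
)
payload = json.dumps(asdict(report))                          # sent to the full LLM
reply = json.loads('{"frame_id": 1042, "action": "yield"}')   # parsed reply
assert reply["frame_id"] == report.frame_id
```

A narrow interface like this keeps the onboard model small while still letting a large server-side LLM handle open-ended reasoning.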
Alternatively, the text capability of this one transformer/diffusion model could be a full LLM itself, which would demand higher compute capacity and may be better suited to applications running on larger servers.
Yes, combining visual data, physics parameters, and text may require a much larger parameter count than a text-only model, a vision-only model, or a previous multimodal model, and thus much more compute capacity, but the accuracy and performance of a single transformer/diffusion model as the physics pixel model should also be much more powerful and promising.