Last updated on October 3, 2025
This is a revision of “physics pixel based data structure for physics ai” with a summary of “the prime model”, both of which were posted here previously.
1. The “prime model” in summary is:
train an ai model on training data of visual flow labeled with the corresponding physics parameters of and between the objects in the visual, to let the model learn the correlation between the visual and the corresponding physics parameters, so that it can estimate and predict the corresponding physics parameters of and between the objects, at present and in the future, from visual (flow); the corresponding physics parameters may include spatial depth, tactile pressure, temperature, water depth, mass of object, velocity(+rotation), etc; the training data can be generated by sensors in experiments or by synthetic methods.
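Here is a minimal sketch of that training setup, assuming plain supervised regression from frames to per-pixel physics parameters; the tiny network, the channel count and all names (PrimeModelSketch, PHYSICS_CHANNELS) are my illustration, not a fixed design:

```python
# Minimal sketch of the "prime model" idea: visual frames in,
# per-pixel physics parameters out, trained with plain regression.
import torch
import torch.nn as nn

PHYSICS_CHANNELS = 5  # e.g. depth, temperature, pressure, vx, vy (assumed)

class PrimeModelSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Fully convolutional: one physics value per channel per pixel.
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, PHYSICS_CHANNELS, 1),
        )

    def forward(self, rgb):              # rgb: (B, 3, H, W)
        return self.net(rgb)             # -> (B, PHYSICS_CHANNELS, H, W)

model = PrimeModelSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in batch: in practice, frames come from sensors or a synthetic
# 3d world, and labels are the measured/computed physics parameters.
frames = torch.rand(4, 3, 64, 64)
labels = torch.rand(4, PHYSICS_CHANNELS, 64, 64)

optimizer.zero_grad()
loss = loss_fn(model(frames), labels)
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.4f}")
```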
2. The physics pixel data structure for the prime model/physics ai includes at least:
1) physics pixels: a physics pixel includes physics parameters of the pixel point on the interface, like: visual info (RGB), XY coordinates of the pixel in the frame, spatial depth (Z-depth), tactile pressure, water depth, and, for the near object (and far object) at the pixel point of the interface: temperature, material, velocity and rotation vectors, and object id; plus the interface id, etc;
2) physics objects: a physics object's info includes physics parameters like: object id, mass/weight, velocity and rotation vectors, pressure/collision force with another object, object class (car, tree, dog, cat, etc), material if applicable, inner pressure if applicable, interface ids, etc; physics objects include near objects and far objects;
3) physics materials: a physics material includes physics parameters like: state of matter, density, hardness, etc.
The physics pixel based training data (for each frame of the visual or video) includes the 3 data parts above: physics pixels, physics objects and physics materials. In it, the original visual (video) is labeled, for each pixel, object and material, with the corresponding physics parameters, and it is used to train an ai model to estimate and predict the corresponding physics parameters of pixels, objects and materials from visual or visual-similar signal.
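As one concrete (and purely illustrative) way to hold these three data parts, here is a minimal sketch in Python dataclasses; the field names follow the lists above, while the exact types and the choice of optional fields are my assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class PhysicsMaterial:
    material_id: str
    state_of_matter: str                    # "solid", "liquid", "gas", ...
    density: float                          # kg/m^3
    hardness: float

@dataclass
class PhysicsObject:
    object_id: str
    object_class: str                       # "car", "tree", "dog", "cat", ...
    mass: float                             # kg
    velocity: Tuple[float, float, float]    # m/s
    rotation: Tuple[float, float, float]    # rad/s
    interface_ids: List[str] = field(default_factory=list)

@dataclass
class PhysicsPixel:
    x: int                                  # XY coordinates in the frame
    y: int
    rgb: Tuple[int, int, int]               # visual info
    z_depth: float                          # spatial depth
    interface_id: str
    near_object_id: str                     # object nearer to the camera
    far_object_id: str
    near_material_id: str
    far_material_id: str
    temperature_near: Optional[float] = None
    tactile_pressure: Optional[float] = None
    water_depth: Optional[float] = None

@dataclass
class PhysicsFrame:                         # one labeled frame of training data
    pixels: List[PhysicsPixel]              # possibly several per XY (see section 5)
    objects: Dict[str, PhysicsObject]
    materials: Dict[str, PhysicsMaterial]
```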
Feed the physics pixel based training data above directly into the AI model to let the AI learn or approximate the physics nature/laws of the real world directly, like shape in space, depth, mass, tactile force, temperature, inertia, velocity+rotation, etc. No need to create an additional 3d image; just feed the physics pixel based training data directly.
3. physics objects and physics materials
In fact the prime+physics pixel model can work without the datasets of physics objects and physics materials, but adding them helps the model understand/correlate/approximate the physics world better, and also lets the model interact better with text.
4. physics pixel, interface and near/far object
every physics pixel can be regarded as lying on an interface between 2 objects (or layers); the near object of a physics pixel is the object nearer to the camera, the other object is the far object, and both objects are labeled for each physics pixel in every frame. different points of one object may have different material or different velocity and rotation vectors.
for example, label all outdoor space (air) in all frames as one object and all outer space as another object, and regard any physics pixel on bare sky as a point between outdoor space and outer space; so for a physics pixel on bare sky, the near object is outdoor space and the far object is outer space.
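A minimal sketch of this near/far labeling, assuming we already know which objects a camera ray crosses at a pixel and at what depth it enters each; the function and the depth numbers are illustrative:

```python
def label_interfaces(objects_along_ray):
    """objects_along_ray: list of (object_id, entry_depth), sorted near -> far;
    entry_depth is where the camera ray enters that object. Each consecutive
    pair of objects forms one interface; the nearer member is the near object."""
    interfaces = []
    for (near_id, _), (far_id, far_entry) in zip(objects_along_ray,
                                                 objects_along_ray[1:]):
        interfaces.append({"z_depth": far_entry,
                           "near_object": near_id,
                           "far_object": far_id})
    return interfaces

# A bare-sky pixel: the ray passes through outdoor space into outer space
# (the boundary depth here is a made-up stand-in for "top of atmosphere").
print(label_interfaces([("outdoor space", 0.0), ("outer space", 1e5)]))
```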
5. multiple layers/interfaces overlapped at different depths
in the case of transparent media like some fluids, of penetrating signal flows like mmWave radar, and also of virtual synthetic training data with invisible layers/interfaces added, the same point of a 2d visual/signal frame may carry multiple overlapped layers/interfaces at different depths, i.e. physics pixels with the same XY coordinates but different Z coordinates (depth); so one frame of physics pixels can include multiple overlapped layers/interfaces inside.
for example, suppose the frame shows the glass of a window of a house, through which the camera sees a wall; this forms 3 overlapped interfaces: the first is the closer surface of the glass, the second is the farther surface of the glass, and the third is the wall. for a physics pixel on the closer surface of the glass, the object id and material at the pixel point of the near object are “000” and “air”, and those of the far object are “123” and “glass”. for a physics pixel on the farther surface of the glass, the near object’s id and material are “123” and “glass”, and the far object’s are “456” and “air”. for a physics pixel on the wall, the near object’s id and material are “456” and “air”, and the far object’s are “789” and “solid wall”. here object “000” is all outdoor space, object “123” is the window, object “456” is the indoor space of the house, and object “789” is the wall.
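This window example can be encoded as one pixel column carrying three overlapped interfaces at increasing depth; the object ids and materials follow the text, while the depth values are made up for illustration:

```python
interfaces_at_xy = [  # one XY point, interfaces sorted near -> far along the ray
    {"interface": "closer glass surface", "z_depth": 5.00,
     "near_object": "000", "near_material": "air",
     "far_object": "123",  "far_material": "glass"},
    {"interface": "farther glass surface", "z_depth": 5.01,
     "near_object": "123", "near_material": "glass",
     "far_object": "456",  "far_material": "air"},
    {"interface": "wall surface", "z_depth": 7.50,
     "near_object": "456", "near_material": "air",
     "far_object": "789",  "far_material": "solid wall"},
]

# e.g. the glass thickness falls out of the two glass interfaces:
thickness = interfaces_at_xy[1]["z_depth"] - interfaces_at_xy[0]["z_depth"]
print(f"glass thickness: {thickness:.2f} m")  # 0.01 m
```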
in actual visuals, the farther surface of a glass is invisible in most cases, but by adding the farther surface of the glass into the training data, the model will learn to estimate or predict the farther surface of a glass and, accordingly, parameters like the thickness of the glass.
As in the example above, invisible layers/interfaces and their corresponding physics parameters can be added to virtual synthetic training data, to train the AI to estimate or predict the invisible layers/interfaces and corresponding physics parameters behind the visual or other signal frame.
6. visual text physics hybrid model:
add a text description to each frame in the training data to make a hybrid multimodal model, a visual text physics ai model: for example a visual physics model with an embedded specialized small text capability, or a hybrid transformer combining visual physics and an LLM.
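A minimal sketch of one possible fusion, assuming a tiny text encoder whose caption embedding is broadcast and concatenated with the visual features before the physics head; the architecture and all sizes are my assumptions, not a specified design:

```python
import torch
import torch.nn as nn

class HybridSketch(nn.Module):
    def __init__(self, vocab_size=1000, text_dim=16, phys_channels=5):
        super().__init__()
        # "specialized small text capability": mean-pooled token embeddings
        self.text_embed = nn.EmbeddingBag(vocab_size, text_dim)
        self.visual = nn.Conv2d(3, 32, 3, padding=1)
        self.head = nn.Conv2d(32 + text_dim, phys_channels, 1)

    def forward(self, rgb, token_ids):     # rgb: (B,3,H,W), token_ids: (B,T)
        t = self.text_embed(token_ids)     # (B, text_dim) caption embedding
        v = torch.relu(self.visual(rgb))   # (B, 32, H, W) visual features
        B, _, H, W = v.shape
        t_map = t[:, :, None, None].expand(B, t.shape[1], H, W)
        return self.head(torch.cat([v, t_map], dim=1))

model = HybridSketch()
out = model(torch.rand(2, 3, 32, 32), torch.randint(0, 1000, (2, 8)))
print(out.shape)  # torch.Size([2, 5, 32, 32])
```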
7. fluid mechanics
the physics object can be a fluid, to let the ai learn or approximate fluid mechanics. for example, add air velocity+rotation to a point of the near object or far object on a physics pixel to represent the air flow on the surface at that pixel, and add air velocity+rotation for an object to represent the total air movement of the object in the frame; yes, the ai model must consider and approximate air.
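A minimal sketch of what such labels could look like, treating air as a physics object (reusing the id "000" from the window example for the outdoor air) and storing its flow as extra per-pixel channels; all values are placeholders:

```python
import numpy as np

H, W = 4, 4
frame_labels = {
    "z_depth":      np.zeros((H, W), dtype=np.float32),
    # air flow at the near-object surface of each pixel: (vx, vy, vz)
    "air_velocity": np.zeros((H, W, 3), dtype=np.float32),
    "air_rotation": np.zeros((H, W, 3), dtype=np.float32),  # (wx, wy, wz)
}

# object-level record for the air itself: total air movement in the frame
air_object = {"object_id": "000", "object_class": "air",
              "velocity": (1.5, 0.0, 0.0),   # steady wind along x (assumed)
              "rotation": (0.0, 0.0, 0.0)}

# here the per-pixel flow is just the uniform wind; in general each
# pixel point of the air object could carry a different local flow
frame_labels["air_velocity"][:, :] = air_object["velocity"]
print(frame_labels["air_velocity"][0, 0])  # [1.5 0.  0. ]
```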
8. add camera velocity+rotation for each frame.
firstly, when the camera itself is the reference object, the velocity+rotation of the camera are all 0; but in some cases labeling the velocity+rotation of the camera with a non-0 value is much simpler and more accurate than labeling all physics pixels in a frame. for example, when the camera is rotating, labeling one rotation of the camera is equivalent to adding the same rotation value and different velocities to all physics pixels in the frame.
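A minimal sketch of that equivalence: in a frame rotating with angular velocity w, a static world point at position r appears to move with velocity v = -w × r, so one camera rotation label implies a different apparent velocity at every pixel's 3d point (the numbers are illustrative):

```python
import numpy as np

def apparent_velocity(point_xyz, camera_angular_velocity):
    """Velocity of a static world point as seen from a rotating camera."""
    return -np.cross(camera_angular_velocity, point_xyz)

w = np.array([0.0, 0.1, 0.0])            # camera panning about y, rad/s
for r in [np.array([0.0, 0.0, 2.0]),     # pixel point straight ahead, 2 m away
          np.array([1.0, 0.0, 5.0])]:    # pixel point off to the side, farther
    print(r, "->", apparent_velocity(r, w))
# Same single rotation label, different velocity at each pixel point:
# one camera value replaces labeling every pixel separately.
```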
secondly, in some cases another object can be the reference object too, for example to simplify the labeling of training data. for the training data of a flying drone, it may be better to use the ground as the reference object.
9. multiple cameras facing one same direction
for some applications like autodrive, robotics and drones, the training data may come from multiple cameras facing one same direction, which may increase the accuracy of spatial depth, especially at short range; of course in such applications the automobile, robot or drone will also be equipped with the same multiple cameras facing one same direction.
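A minimal sketch of the short-range depth gain, using the classic stereo triangulation relation depth = focal_length * baseline / disparity; all numbers are illustrative:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth of a point seen by two parallel cameras a baseline apart."""
    return focal_px * baseline_m / disparity_px

f, b = 800.0, 0.12          # 800 px focal length, 12 cm camera spacing (assumed)
for d in [40.0, 4.0]:       # disparity in pixels
    print(f"disparity {d:5.1f}px -> depth {stereo_depth(f, b, d):6.2f} m")
# Near objects produce large disparity, so a 1 px measurement error
# barely moves the depth estimate: that is the short-range accuracy gain.
```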
10. extending application scenarios (from visual-only or visual-similar flow to a combined flow of multiple signals)
These physics pixels can be applied to a 2d or 3d frame generated by any signal, by labeling selected physics parameters for each pixel of the 2d frame, or for each pixel of each layer of the 3d frame.
the training data could include visual flow or other visual-similar flows (e.g., mmWave radar flow, thermal flow, sound wave flow, etc), or multiple such flows together, from which the model can learn or approximate the correlation of the corresponding physics parameters with one such flow or with a combined flow of multiple such flows. accordingly, in application or inference, the model can estimate or predict the corresponding physics parameters from visual flow only, from another similar flow, or from a combined flow of multiple such flows, and the combination may be dynamic according to circumstances, like driving or piloting in bad weather.
Physics ai (prime+physics pixel) can not only train and infer on visual signal/frames labeled with other physics parameters to estimate and predict those parameters, but can also train and infer on a combination of frames based on different signals, like visual and mmWave radar together.
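A minimal sketch of such combined-flow handling, assuming per-signal encoders whose features are fused, with simple per-frame availability weights so inference can run on visual only, radar only, or both; the architecture and channel counts are my assumptions:

```python
import torch
import torch.nn as nn

class MultiSignalSketch(nn.Module):
    def __init__(self, phys_channels=5):
        super().__init__()
        self.visual_enc = nn.Conv2d(3, 16, 3, padding=1)   # RGB flow
        self.radar_enc = nn.Conv2d(1, 16, 3, padding=1)    # mmWave radar flow
        self.head = nn.Conv2d(32, phys_channels, 1)

    def forward(self, rgb, radar, rgb_ok=1.0, radar_ok=1.0):
        # availability weights zero out a signal that is missing/degraded
        v = torch.relu(self.visual_enc(rgb)) * rgb_ok
        r = torch.relu(self.radar_enc(radar)) * radar_ok
        return self.head(torch.cat([v, r], dim=1))

model = MultiSignalSketch()
rgb, radar = torch.rand(1, 3, 32, 32), torch.rand(1, 1, 32, 32)
full = model(rgb, radar)                       # both signals available
radar_only = model(rgb, radar, rgb_ok=0.0)     # e.g. camera blinded by weather
print(full.shape, radar_only.shape)
```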
11. synthetic 3d virtual training data
As for how to get training data (Grok asked me this), I think the best way, or even the only feasible way, is to create a synthetic 3d virtual world to generate training data, designing specific programs in it to calculate the physics parameters accurately to label the visual.
a synthetic 3d virtual world is in its very nature a universal, almighty virtual experiment, and maybe only by such an experiment can physics ai be trained.
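A toy sketch of the labeling principle only: simulate a known physics process (here, 1d free fall) and attach exactly-computed physics parameters to each stub frame. A real pipeline would use a full 3d engine and renderer; everything here is illustrative:

```python
import numpy as np

G = 9.81                      # known physics law: constant gravity, m/s^2
dt, steps = 0.1, 5
z, v = 10.0, 0.0              # initial height and velocity of a ball

frames = []
for t in range(steps):
    # labels are computed, not measured, so they are exact by construction
    frames.append({"t": t * dt, "ball_height": z, "ball_velocity": v,
                   "frame_rgb": np.zeros((8, 8, 3), dtype=np.uint8)})  # stub render
    v -= G * dt
    z += v * dt

for f in frames:
    print(f"t={f['t']:.1f}s  height={f['ball_height']:6.2f}m  "
          f"v={f['ball_velocity']:6.2f}m/s")
```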
the ai is trained on synthetic data calculated from known physics laws/formulas; then, applied to reality, the ai will “see” a lot of anomalies, and so the ai may help find new laws. Grok agrees with this, haha.
PS(28/Sep): possibility of simplifying calculation
I have been thinking about possibilities of simplification to reduce the computation workload of this “prime model”+”physics pixel” model, for example in inference: although all pixels and their physics parameters are correlated together, it may be possible to select some pixels from an interface, such that each selected pixel represents a part of the interface and all selected pixels together represent all pixels of the interface; in some calculations the model can then compute only the selected pixels’ values and use the results for all pixels.
I could not figure out the details, so I asked Grok: can we adapt a model trained on all pixels to use selected pixels during inference and apply the results to all pixels, if the selected pixels differ from frame to frame? Grok replied that it is feasible and suggested pre-training on all pixels and fine-tuning for selected pixels, among other methods. after all, Grok knows more and is more professional than me, haha.
This “prime model”+”physics pixel” model includes many more pixels (for overlapping interfaces), many more parameters per pixel and much more complicated correlations inside, so the computational capacity required is much higher than for present visual or multimodal models; some simplification method may be needed to accelerate the computation, especially for inference on terminal chips.
PS(29/Sep): Below is an example of Grok’s suggestion and my idea.
in training, Grok suggested: pre-train on all pixels, then fine-tune for selected pixels, among many other methods.
in inference, I have a simple idea: 1) calculate the nonvisual physics parameters of the selected pixels and of all objects of the present frame, based on the values of all physics pixels and all objects of previous frames and on the visual info (like RGB) of all pixels of the present frame; 2) assign values to the nonvisual physics parameters of all other pixels of the present frame according to the calculated results of the selected pixels, so that all physics pixels and all objects of the present frame have all values; then move to the next frame and repeat from step 1). Grok thinks my idea is good.
by the way, the selected pixels may specifically include typical pixels on the border of an object’s interface, since those are more representative.
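A minimal sketch of the select-then-propagate step, with the expensive model stubbed out; here the selected pixels form a sparse grid for simplicity, though per the note above a real selection could favor pixels on interface borders:

```python
import numpy as np

H, W, stride = 16, 16, 4
# step 0: choose the selected pixels (a sparse grid here, purely illustrative)
ys, xs = np.meshgrid(np.arange(0, H, stride), np.arange(0, W, stride),
                     indexing="ij")
selected = np.stack([ys.ravel(), xs.ravel()], axis=1)

def expensive_depth_at(y, x):
    """Stand-in for the full model's per-pixel nonvisual output (e.g. depth)."""
    return 1.0 + 0.1 * y + 0.05 * x

# step 1: calculate values only for the selected pixels
sparse_vals = np.array([expensive_depth_at(y, x) for y, x in selected])

# step 2: assign every other pixel the value of its nearest selected pixel
depth = np.empty((H, W))
for y in range(H):
    for x in range(W):
        d2 = (selected[:, 0] - y) ** 2 + (selected[:, 1] - x) ** 2
        depth[y, x] = sparse_vals[np.argmin(d2)]

print(f"computed {len(selected)} of {H * W} pixels, filled the rest by propagation")
```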
in fact, the physics parameters of all physics pixels of a visual flow represent only a very small part of the real physics world in the visual; but ai based on the neural network tech stack is specialized in, and excellent at, estimating or predicting the unknown part of a complicated, correlated whole, such as the real physics world, from the known part of it, because both parts share the very same nature/patterns/laws.