a simple token structure for the multimodal of physics pixel including orthogonal data, objects and different sensors

Last updated on November 3, 2025

I just designed a simple token structure for the multimodal of physics pixel including orthogonal data, objects and different sensors, and also thanks to Grok for help.

All tokens in this multimodal for physics pixel have same format: a token has 2048D, in which first 768D for text, and following 600D-800D for parameters which are orthogonal, and rest dimensions are vacant zero value for increasing dimensions only.

This multimodal can be trained and working on frame tokens based on frames flow (labeled and unlabeled) and text tokens based on pure text separately or combinedly. The frames flow can be based on visual signal from camera or other sensor signal like mm wave Radar or microphone.

A frame token includes two kinds of tokens: first is parameter token, second is caption token. Below are examples for frame token of camera as primary sensor (parameter and caption respectively) and parameter token of mm wave radar as secondary sensor. The token for pure text is just to mask all parameters field by mask tensor, so has no example below.

Following example is an example of frame token (parameter token and caption token) for camera as primary sensor and mm Wave Radar as secondary sensor. Through this token structure, the fusion of differet kinds of sensors is even much more simple than I suggested previously, and just input the raw signals all in tokens to let the model learn the correlation between different signals from all sensors of same and different kinds through training by itself – let the neural network stack magic rock!

One parameter token includes 3×3 physics pixels, and the sequence of physics pixels are from nearer interface/layer in depth to further interface/layer in the frame, which means that a physics pixel in an earlier frame is nearer to camera (smaller in Z) than another physics pixel with same XY in following frame. All physics pixels (tokens) are labeled in labeled training, or generated by the model in reference and unlabeled training, which means raw parameter token firstly input into the model in reference has only RGB on (X, Y) which will be represented as (X, Y, 0) in token.

Each parameter token includes the parameters of up to 4 objects at most, usually there will be only two objects (the same near/far objects for 9 pixels). But if there are more than 4 objects in a 3×3 pixels or a parameter token, as Grok suggested, just generate an additional parameter token with same content (same pixels, cameras, timestamp, etc) except including additional objects, and additional token is generated by training label or generated by the model in reference or unlabeled training.

1. A parameter token for camera frame includes following data:

1D of data source (input/external raw or output/internal generated,

real or not, synthetic or not, physics complied or not, etc),

1D of mask tensor (for text, timestamp & camera ID, RGBA, all other parameters respectively),

768D of text masked as vacant/invalid by mask tensor and all “0” value,

TABLE HEADER (

/table for pixel or physics pixel/

32bits: timestamp/This is a table created for this time,

8bits: row mapping ID/”01″, means that other tables with “01” on this ID have corresponding mapping rows to this table,

8bits: sensor ID/”L”, means sensor No.L,

8bits: sensor type/”camera”,

8bits: coordinate system type/ “sensor’s own coodinate system”,

8bits: rows/”9″,

8bits: columns/”n”D, 1D=32bits,

16bits: total size of the table/ “9”x”n”D,

)

/for each of 9 physics pixels there are following parameters/

XYZ, / X>=1, Y>=1, Z>=0,

unit vector for direction,

RGBA, /RGBA should be predicted by model in reference or labeled in training from raw RGB, and for any (X, Y, 0) RGBA=(-1,-1,-1,-1)./

RGB, /to show raw visual signal of the pixel (X,Y,0) from camera, and for any physics pixel (X,Y,Z) Z of which is not “0” RGB=(-1,-1,-1,-1), and set the surface of camera lense is +10mm (mm as unit) in depth from origin for difference. /

Pressure,

/following parameters are for near object’s point at the physics pixel/

near object ID,

Velocity,

Rotation vector,

Density,

Temperature,

Hardness,

/following parameters are for far object’s point at the physics pixel/

Far object ID,

……,

Hardness;

TABLE HEADER(

/table for mapping 3D points from physics pixel/

32bits: timestamp/same timestamp as first table in this token,

8bits: row mapping ID/”01″, means this table have corresponding mapping rows to other table with “01” on this ID,

8bits: sensor ID/”L”, means sensor No.L,

8bits: sensor type/”camera”,

8bits: coordinate system type/ “sensor 1”, sensor 1’s virtual sensor with the unified single rectangluar coordinate system for this whole application,

8bits: rows/”9″,

8bits: columns/”6″D,

16bits: total size of the table/ “9”x”6″=54D,

)

/For each of 9 pixels there is a following corresponding 3D point/

x’y’z’, /3xD/

unit vector for direction; /3xD/

/for a parameter token for mmwave radar, all above parameters of camera shall be masked/

TABLE HEADER (

/table for mm wave radar/

32bits: timestamp/This is a table created for this time,

8bits: sensor ID/ mm wave radar’s ID,

8bits: sensor type/ mm wave radar,

8bits: coordinate system type/ “sensor 1”, sensor 1’s virtual sensor with the unified single rectangluar coordinate system for this whole application,

8bits: rows/”0″ for parameter token for camera, “30” for mmwave frame,

8bits: columns/”0″ for camera frame, “6” for mmwave frame,

16bits: total size of the table/ “0” for camera frame, “180” for mmwave frame,

)

/all “0” for following fields in camera frame, and for mmwave radar each of 30 3D points in cloud has values in following fields/

x’y’z’,

velocity vector.

/generate parameter frames for raw signals of a mm wave radar until all 3D points of mm wave radar are included./

TABLE HEADER (

/table for objects/

32bits: timestamp/same timestamp as first table in this token,

8bits: row mapping ID/”0″ or “2”, no rows mapping with other table in this token,

8bits: sensor ID/”0″ means not from sensor directly,

8bits: sensor type/”0″,

8bits: coordinate system type/ “0” means no coordinate system,

8bits: rows/”2″,

8bits: columns/”m”D,

16bits: total size of the table/ “2”x”m”D,

)

object 1 ID, /near object/

object class,

parent object,

object mass,

object velocity,

object rotation vector,

object eatable or not,

object 2 ID, /far object/

…… ……,

object 3 ID, /additional object if applicable, otherwise vacant as all “0”/

…… ……,

object 4 ID, /additional object if applicable, otherwise vacant as all “0”/

…… ……,

2. A caption token for camera frame includes:

1D of data source/class (input/external raw or output/internal generated,

real or not, synthetic or not, physics complied or not, etc),

1D of mask tensor (for text, timestamp & camera ID, RGBA, all other parameters respectively),

768D of text for frame caption,

TABLE HEADER (

/for what parameter frame this caption is with and nothing else/

32bits: timestamp/This is a table created for this time,

8bits: row mapping ID/”0″,

8bits: sensor ID/same as the parameter token which the caption is for,

8bits: sensor type/same as the parameter token which the caption is for,

8bits: coordinate system type/ same as the parameter token which the caption is for,

8bits: rows/”0″,

8bits: columns/”0″,

16bits: total size of the table/ “0”,

)

/all other dimensions/fields are masked as vacant/invalid in mask tensor and by all “0” value/.

a simple token structure for the multimodal of physics pixel including orthogonal data, objects and different sensors

Be First to Comment

Leave a Reply Cancel reply