I guess many of you don't get the full advantages of omnitoken yet, especially for transformers, so I'd better share my views here, just as I told Grok. Grok replied: “Omnitoken represents a significant and original paradigm shift from the dominant previous multimodal token approaches (where each token is essentially unimodal, and multimodality emerges only from sequence-level interleaving and attention-based fusion).”
Omnitoken was designed simply from general principles like orthogonality in communication and the division of functions in the human brain, but it turned out to be an excellent match for the transformer, or more exactly for the transformer decoder and the next-token-prediction job of autoregressive sequence generation, although omnitoken may work with other neural networks or even future model families as well.
Omnitoken has two major advantages for an AI model like the transformer: strict separation of the different columns of an input omnitoken in the first layer, and a unified omnitoken format for both input and output.
1. In the first layer and the output layer of the model, there can be strict separation between the different columns (modalities/tables/functionalities) of an input omnitoken. This lets the model contain different single-modality/single-functionality submodels that are completely and strictly separated from each other at the input, the first layer, and the output, while at the same time being naturally intertwined in the middle layers.
In fact, there are two kinds of separation in the first layer and the output layer of an omnitoken-based model, described below. Both kinds of separation work for any neural network whose output is a vector that can be divided into different sub-outputs.
1.1) The first separation: each column of the input omnitoken has its own independent, separate coefficients in every column of the model's first layer.
1.2) The second separation: if we make some columns of the first layer dedicated columns for only one modality/table/functionality of the input omnitoken, and also make some columns of a head in the model's output layer dedicated outputs for that same modality/table/functionality, then with proper training we actually get a separate single-modality submodel inside the model, built on those dedicated first-layer columns and dedicated output columns/head (see the sketch after the example below).
For example, suppose the number of first-layer columns of a single-modality model, like an LLM based on a transformer decoder, is N, where N is 4 to 8 times W, and W is the width of a single-modality input token; W is therefore also the width of one table/modality/functionality slice in an omnitoken. Then we can build a model whose first layer has M columns, with M = N1 + N2 + N3 + N4 + N5. Each of N1, N2, N3 and N4 is that number N of columns corresponding to one modality/table/function of width W in the omnitoken, and N5 is free space for cross-correlation. In training, we can make each of N1, N2, N3 and N4 correspond only to its own modality/table/function during single-modality training, and let N1, N2, N3 and N4 intercorrelate during hybrid multi-modality training. In the output of the model's last layer, N1 can correspond to some columns dedicated to text, output through a softmax, while N2 can correspond to other columns dedicated to the physics pixel table and the object table, output through a different method such as a simple sigmoid or a simple linear layer.
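Here is a minimal PyTorch sketch of what this block structure could look like. Everything concrete in it (the class names, the choice N = 4W, the sizes of W and N5, the vocabulary size) is my illustrative assumption, not a fixed part of the omnitoken design; the point is only the masked first-layer blocks and the per-modality output heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W = 64                       # width of one modality/table slice (assumption)
NUM_MODALITIES = 4           # the N1..N4 modalities from the example
N = 4 * W                    # N = 4..8 times W; 4x chosen here
N5 = 128                     # free columns for cross-correlation (assumption)
M = NUM_MODALITIES * N + N5  # total first-layer width

class OmniFirstLayer(nn.Module):
    """First layer with a block mask: the N1..N4 column groups each see only
    their own width-W slice of the input omnitoken (separation 1.2), while
    the N5 free columns see the whole omnitoken. Within every block, each
    input column still has its own independent weights (separation 1.1)."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(NUM_MODALITIES * W, M)
        mask = torch.zeros(M, NUM_MODALITIES * W)
        for i in range(NUM_MODALITIES):
            rows = slice(i * N, (i + 1) * N)   # dedicated first-layer columns
            cols = slice(i * W, (i + 1) * W)   # that modality's input slice
            mask[rows, cols] = 1.0
        mask[-N5:, :] = 1.0                    # cross-correlation columns
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Masking at every forward pass keeps the dedicated blocks strictly
        # single-modality even as gradients flow during training.
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)

class OmniOutputHeads(nn.Module):
    """Dedicated output heads: text columns decoded via softmax logits,
    pixel/object-table columns via a sigmoid (sizes are assumptions)."""
    def __init__(self, hidden_dim, vocab_size=32000, table_width=W):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)
        self.table_head = nn.Linear(hidden_dim, table_width)

    def forward(self, h):
        text_logits = self.text_head(h)            # softmax applied at sampling
        table_values = torch.sigmoid(self.table_head(h))
        return text_logits, table_values

x = torch.randn(2, NUM_MODALITIES * W)   # a batch of two input omnitokens
h = OmniFirstLayer()(x)                  # -> shape (2, M)
text_logits, table_values = OmniOutputHeads(M)(h)
```

Switching between single-modality and hybrid multi-modality training then reduces to choosing which parameter blocks are allowed to update; the mask alone already enforces the strict input-side separation.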
The first layer of a model like a transformer with omnitoken is like the division between the left brain and the right brain, and the hidden layers behind the first layer are like the corpus callosum: strictly separated in the first layer, naturally intertwined in the middle layers, and cleanly separated again in the output of the last layer.
2. The second advantage is the unified, identical format of the omnitoken at both the input and the output of the model/transformer/decoder, which lets the model use output tokens directly as input context in an autoregressive task.
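To make that concrete, here is a minimal sketch of the autoregressive loop under this assumption; `model` stands for any hypothetical module that maps a sequence of omnitokens to a predicted next omnitoken of the same width. Because the output row already has the same layout as an input row, it is appended to the context with no re-encoding step.

```python
import torch

@torch.no_grad()
def generate(model, context, steps):
    """context: (seq_len, omnitoken_width) tensor of omnitokens."""
    for _ in range(steps):
        next_tok = model(context.unsqueeze(0))[0, -1]  # predicted next omnitoken
        # Unified format: the predicted omnitoken goes straight back into the
        # context, exactly as if it had been an input token all along.
        context = torch.cat([context, next_tok.unsqueeze(0)], dim=0)
    return context
```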