Last updated on October 26, 2025
In many cases, different kinds of sensors are necessary for an application. For example, in autonomous driving, robotics, and drones you may need visual sensors, sound sensors, and even mmRadar, so fusing the signals of different kinds of sensors is mandatory for the AI model in these applications.
Any sensor signal is either a perspective 2D or 3D frame (like mmRadar) or is received by a point sensor (like a microphone). A physics pixel frame is inherently a 3D perspective, so the frame, the source, or the sensor location of another kind of sensor signal can be matched to the XYZ coordinates of a physics pixel frame that overlaps the same direction.
Below are two examples of fusing different kinds of sensor signals based on the physics pixel frame, one for a sound sensor (microphone) and one for mmRadar. In both, the physics pixel frame is built from camera vision as in most applications, but a physics pixel frame can also be built from the signal of another sensor, such as ultrasonic waves or electromagnetic waves of other wavelengths.
The first example is a sound signal received by multiple microphones. A single microphone as a sound sensor cannot generate a frame and is not directional, but multiple microphones can mimic some frame capability and be directional, like human hearing. So below, multiple separated microphones are used as sound sensors, and there are the following three methods to combine the sound signals they receive into the physics pixel frame.
1) Receiver mode: match the position of each of the multiple microphones to a virtual physics pixel with XYZ coordinates on the camera interface, and append the synced sound data received by that microphone, together with its XYZ coordinates, to the start or end of the synced physics pixel frame.
This receiver mode lets the model learn the direction and distance of a sound source (origin) roughly and IMPLICITLY from the differences between the multiple microphones during training, much like a human does. This receiver mode is the simplest and most efficient, and may be enough for most common autonomous driving and robot applications.
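A minimal sketch of receiver mode, assuming a hypothetical layout where the physics pixel frame is a flat array of pixel parameters and each microphone contributes one synced audio chunk prefixed by the XYZ coordinates of its virtual pixel; all names, shapes, and the token format are illustrative, not from any existing library.

```python
import numpy as np

def append_receiver_mode_audio(frame_pixels, mic_positions, audio_chunks):
    """Receiver mode: tag each microphone's synced audio with the XYZ coordinates
    of its virtual physics pixel and append it to the synced frame."""
    sound_tokens = []
    for (x, y, z), chunk in zip(mic_positions, audio_chunks):
        # One sound token per microphone: coordinates first, raw samples after.
        sound_tokens.append(np.concatenate([[x, y, z], chunk]))
    # The model sees the visual physics pixels followed by the sound tokens and
    # learns source direction/distance implicitly from inter-microphone differences.
    return frame_pixels, sound_tokens

# Hypothetical usage: 4 microphones around the observer, 160 audio samples per frame.
mics = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, -0.1, 0.0)]
chunks = [np.random.randn(160) for _ in mics]
pixels = np.zeros((1024, 8))   # placeholder physics pixel frame: 1024 pixels x 8 params
frame, tokens = append_receiver_mode_audio(pixels, mics, chunks)
```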
2) Source mode: during training, separate the sound data of each identifiable (roughly or accurately) sound source (origin) and label the XYZ coordinates of that source in the physics pixel frame, using a reserved XYZ coordinate (like (0,0,0)) for unidentifiable sound sources; then either append the sound data with the XYZ coordinates of its source to the start or end of the synced physics pixel frame, or add it to the parameters of a new or existing physics pixel at those XYZ coordinates in the frame.
This source mode carries a heavy workload and is expensive in training and inference because of its EXPLICIT coordinates of sound sources, and may be suitable for special applications like robots/drones for firefighting, first aid, or the military. This source mode may also apply to video generation applications.
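A sketch of how a source-mode training sample might be assembled, assuming each separated sound event is labeled with the XYZ coordinates of its source in the physics pixel frame and unidentifiable sources get a reserved coordinate like (0, 0, 0); the record format below is only illustrative.

```python
import numpy as np

UNKNOWN_XYZ = (0.0, 0.0, 0.0)   # reserved coordinate for unidentifiable sound sources

def make_source_mode_tokens(sound_events):
    """Source mode: one token per separated sound source, prefixed with the
    XYZ coordinates of the SOURCE (not of the microphone)."""
    tokens = []
    for event in sound_events:
        xyz = event["xyz"] if event["xyz"] is not None else UNKNOWN_XYZ
        tokens.append(np.concatenate([xyz, event["samples"]]))
    return tokens

# Hypothetical usage: a localized siren plus an unidentifiable background noise,
# both synced to one physics pixel frame and appended to its start or end.
events = [
    {"samples": np.random.randn(160), "xyz": (4.2, 1.0, 30.5)},   # labeled siren
    {"samples": np.random.randn(160), "xyz": None},               # unlocalized noise
]
tokens = make_source_mode_tokens(events)
```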
3) Frame mode: match the sound received by all microphones to all physics pixels of the synced frame, as if each physics pixel of the frame were a virtual microphone on the camera interface. In training, let the model learn the correlation between the input sound of the multiple microphones and the sound data of the virtual microphone at each physics pixel on the camera interface; in inference, let the model estimate/predict the sound data of the virtual microphone at each physics pixel. There could even be a virtual microphone for every virtual physics pixel on an omnidirectional virtual camera interface around the whole observer.
This frame mode for microphone sound is interesting, but I haven't thought of any application scene for it; maybe it's suitable for some other sensor signal (like pressure on the surface of the observer) or for rare cases.
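If the frame mode were tried anyway, the training objective could look roughly like this sketch: the model takes the real multi-microphone audio plus the visual physics pixel frame and regresses a "virtual microphone" signal for every physics pixel; the model interface and the stand-in used for it are purely hypothetical.

```python
import numpy as np

def frame_mode_loss(model, mic_audio, pixel_frame, target_per_pixel_audio):
    """Frame mode: regress one virtual-microphone signal per physics pixel.

    mic_audio              : (M, T) audio from M real microphones, synced to the frame
    pixel_frame            : (N, D) physics pixel parameters of the synced frame
    target_per_pixel_audio : (N, T) ground-truth sound at each virtual microphone
                             (likely only obtainable from simulation)"""
    predicted = model(mic_audio, pixel_frame)   # (N, T) predicted per-pixel sound
    return np.mean((predicted - target_per_pixel_audio) ** 2)

# Hypothetical usage with a trivial stand-in "model" that copies the mean microphone
# signal to every pixel; a real model would be a trained network.
M, T, N, D = 4, 160, 1024, 8
stand_in = lambda audio, pixels: np.tile(audio.mean(axis=0), (pixels.shape[0], 1))
loss = frame_mode_loss(stand_in, np.random.randn(M, T), np.zeros((N, D)), np.zeros((N, T)))
```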
The second example is a frame mode that combines mmRadar into the physics pixel frame. Suppose a car/drone/robot has 4 mmRadars, each of which overlaps part of the view of 8 cameras. Unify the resolution of the mmRadar frame and the physics pixel frame by interpolating the frame of lower resolution (usually the cameras have the highest resolution). In the training data, match the signal frame of an mmRadar to the synced frame of a camera, and either add the mmRadar data directly to the parameters of new or existing physics pixels in the physics pixel frame according to their XYZ coordinates, or append the mmRadar data with XYZ coordinates to the start or end of the synced physics pixel frame.
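A sketch of this frame mode for mmRadar under simple assumptions: the lower-resolution radar frame is upsampled to the camera resolution with a plain nearest-neighbor mapping and its channels (e.g. range/velocity) are appended to the parameters of the matching physics pixels; the shapes and the interpolation choice are illustrative only.

```python
import numpy as np

def fuse_radar_into_pixels(pixel_frame, radar_frame):
    """Frame mode for mmRadar: interpolate the lower-resolution radar frame up to
    the camera resolution, then append its channels to the matching physics pixels.

    pixel_frame : (H, W, D) physics pixel parameters from the camera view
    radar_frame : (h, w, C) radar measurements overlapping the same view, h<=H, w<=W"""
    H, W, _ = pixel_frame.shape
    h, w, _ = radar_frame.shape
    # Nearest-neighbor upsampling of radar cells onto the camera grid
    # (bilinear or a learned upsampler could be used instead).
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    radar_up = radar_frame[rows][:, cols]   # (H, W, C)
    # Each physics pixel now carries its visual parameters plus the radar channels.
    return np.concatenate([pixel_frame, radar_up], axis=-1)

# Hypothetical usage: a 640x480 camera view overlapped by an 80x60 radar frame
# with 2 channels per cell.
fused = fuse_radar_into_pixels(np.zeros((480, 640, 8)), np.random.randn(60, 80, 2))
```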
The mmRadar may be necessary not only for firefighting, first-aid, and military robots, but also for autonomous driving cars and home robots. Most other signals, like infrared and ultrasonic, can be combined into the physics pixel frame in the same mode as mmRadar.
Impossible to label the training data? Generating synthetic data virtually may help greatly; go ask Nvidia to create a game world and label the output data as described here for you, haha.