Apr. 08, 2025 – Swedish memory-optimization IP company ZeroPoint Technologies today announced a strategic alliance with Rebellions to develop what they said will be the next generation of memory-optimized AI accelerators for AI inference. The companies plan to unveil new products in 2026, claiming "unprecedented tokens-per-second-per-watt performance."
As part of the collaboration, the two companies aim to increase effective memory bandwidth and capacity for foundation-model inference workloads, using ZeroPoint Technologies' memory compression, compaction and memory management technologies. This hardware-based memory optimization can help increase addressable memory capacity in data center environments while operating nearly 1,000× faster than software-based compression, according to ZeroPoint Technologies' CEO Klas Moreau.
As a result, the companies hope to improve tokens-per-second-per-watt without sacrificing accuracy, using lossless model compression to reduce both model size and the energy required to move model components.
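The effect is easy to illustrate in miniature. The short Python sketch below is purely illustrative and is not the companies' technology: zlib and NumPy stand in for whatever codec an accelerator would use, and the INT8 tensor is synthetic. It shows how a lossless codec cuts the bytes that must cross the memory bus while reproducing every value exactly.

```python
import zlib
import numpy as np

# Synthetic low-entropy INT8 "weights" stand in for real model data (illustrative only).
weights = np.random.default_rng(0).integers(-8, 8, size=1 << 20, dtype=np.int8)

blob = zlib.compress(weights.tobytes(), level=6)
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)

assert np.array_equal(weights, restored)  # lossless: every value is reproduced exactly
print(f"bytes moved: {len(blob):,} instead of {weights.nbytes:,} "
      f"({weights.nbytes / len(blob):.1f}x compression)")
```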
“At Rebellions, we’re pushing the boundaries of state-of-the-art AI acceleration with an unwavering focus on efficiency,” said Rebellions CEO Sunghyun Park in the companies’ joint announcement. “Our partnership with ZeroPoint enables us to redefine what’s possible in inference performance per watt—delivering smarter, leaner and more sustainable AI infrastructure for the generative AI era.”
“We are convinced that memory acceleration will rapidly evolve from a competitive edge to an indispensable component of every advanced inference accelerator solution, and we’re proud that Rebellions shares our commitment to making AI datacenters far more efficient,” Moreau added in the statement.
In a briefing with EE Times earlier this year, Moreau noted that more than 70% of the data stored in memory is redundant. “This means you can get rid of it entirely and still provide lossless compression. However, for this to work seamlessly, the technology has to do three very specific things within a nanosecond-scale window (which corresponds to just a few system clock cycles).
“First, it needs to handle compression and decompression. Second, it must compact the resulting data [packing small chunks of compressed data into a single cacheline to dramatically improve apparent memory bandwidth], and finally it must seamlessly manage the data to keep track of where all the combined pieces are located. To minimize latency, this kind of hardware-accelerated memory optimization typically must function at cacheline granularity—compressing, compacting and managing data in 64-byte chunks [in contrast to the much larger 4–128 kB block sizes used by more traditional compression methods, such as ZSTD and LZ4].”
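Conceptually, that three-step flow can be modeled in a few lines of software. The sketch below is only an illustration of the idea, not ZeroPoint's design: zlib stands in for the dedicated cacheline codec, the packing loop plays the role of compaction, and a small table mimics the management step that records where each compressed piece ends up. Real hardware does all of this in a few clock cycles, not with a software library.

```python
import zlib

CACHELINE = 64  # 64-byte granularity, as described in the article


def compress_line(line: bytes) -> bytes:
    """Losslessly compress one 64-byte cacheline (zlib is only a stand-in codec)."""
    return zlib.compress(line, level=1)


def compact(lines):
    """Pack compressed cachelines into 64-byte physical lines.

    Returns the packed memory image plus a metadata table mapping each logical
    line to (physical_line, offset, length, stored_raw) -- a toy version of the
    'management' step that tracks where each combined piece is located.
    """
    memory, table = bytearray(), []
    for line in lines:
        blob = compress_line(line)
        raw = len(blob) >= CACHELINE        # incompressible: keep the original bytes
        if raw:
            blob = line
        used = len(memory) % CACHELINE
        if used + len(blob) > CACHELINE:    # chunk would straddle a boundary: start a new physical line
            memory.extend(b"\x00" * (CACHELINE - used))
        table.append((len(memory) // CACHELINE, len(memory) % CACHELINE, len(blob), raw))
        memory.extend(blob)
    return bytes(memory), table


if __name__ == "__main__":
    # Highly redundant data packs many logical lines into few physical ones.
    logical = [bytes([i % 4]) * CACHELINE for i in range(16)]
    packed, meta = compact(logical)
    physical = (len(packed) + CACHELINE - 1) // CACHELINE
    print(f"{len(logical)} logical cachelines stored in {physical} physical cachelines")
```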