Up to 84% of GPU power wasted in growing multimodal AI sector
New data from NeuReality has revealed that as much as 84% of GPU computing power is being wasted in multimodal AI environments.
The findings underscore a growing inefficiency, with significant economic and operational consequences, as enterprises ramp up their use of artificial intelligence that processes images, video, text and voice.
Resource underuse
The surge in multimodal AI workloads is visible in platforms like Google Lens, which processes over 20 billion visual queries each month, and Alibaba, with more than 50 million daily image-based requests. However, legacy infrastructure was not engineered for these workloads, leaving a vast proportion of computing capability idle.
"We're at an inflection point where the infrastructure needs to be optimised for running models as efficiently as possible," said Gaurav Shah, VP of Business Development, NeuReality. "Companies are investing millions in GPU capacity, but our research shows only 16% of that compute power is being properly utilised in multimodal inference workloads. That's not just inefficient, it's economically unsustainable."
Bottleneck challenges
Traditional x86 server architectures remain a core source of the inefficiency. Inference pipelines that process video frames and images require frequent, asynchronous communication between the services responsible for vision processing, embedding, vector search, and language inference, yet the central processing unit (CPU) orchestrates all data flow in these systems.
This setup forces every data packet - from decoded video frames to language embeddings - through the CPU, introducing serial delays. GPUs, designed for parallel workloads, are frequently left idle while they await new tasks, undermining performance and return on infrastructure investment. A toy simulation of this hand-off pattern is sketched below.
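To make the bottleneck concrete, here is a minimal Python sketch of a serial, CPU-orchestrated pipeline. All stage names and latencies are illustrative assumptions, not figures from NeuReality's research; the point is only that when the CPU sits on the critical path of every hand-off, the GPU's busy fraction collapses.

```python
# Hypothetical per-frame stage latencies in seconds (assumed, for illustration).
CPU_DECODE = 0.008       # CPU: decode and preprocess one video frame
CPU_ORCHESTRATE = 0.004  # CPU: routing, serialisation, vector-search dispatch
GPU_INFER = 0.003        # GPU: model forward pass

def gpu_idle_fraction(n_frames: int) -> float:
    """In a serial pipeline the CPU handles every hand-off, so the GPU
    waits between frames. Returns the fraction of wall time the GPU is idle."""
    gpu_busy = wall = 0.0
    for _ in range(n_frames):
        wall += CPU_DECODE + CPU_ORCHESTRATE  # GPU idles while the CPU works
        wall += GPU_INFER                     # GPU finally does useful work
        gpu_busy += GPU_INFER
    return 1.0 - gpu_busy / wall

print(f"GPU idle fraction: {gpu_idle_fraction(1000):.0%}")  # 80% at these numbers
```

At these assumed latencies the GPU sits idle 80% of the time - in the same ballpark as, though not a derivation of, the 84% figure reported above.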
Financial burden
The infrastructure inefficiency translates into considerable financial loss. Large AI-driven platforms such as those above incur hundreds of millions of US dollars a year in excess capital and operational expenditure, acquiring and maintaining GPU resources that remain largely unproductive. Idle GPU capacity not only wastes the upfront investment but also drives unnecessary energy consumption, pushing up power and cooling costs.
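The order of magnitude is easy to sanity-check with a back-of-envelope calculation. The fleet size and hourly cost below are hypothetical assumptions, not figures from the article; only the 16% utilisation rate comes from NeuReality's research.

```python
# Back-of-envelope estimate with assumed numbers (only the 16% is from the research).
gpu_count = 10_000                # hypothetical fleet size
cost_per_gpu_hour = 2.50          # assumed blended capex, power and cooling, USD
hours_per_year = 24 * 365
utilisation = 0.16                # NeuReality's reported utilisation rate

annual_spend = gpu_count * cost_per_gpu_hour * hours_per_year
idle_spend = annual_spend * (1 - utilisation)
print(f"Annual GPU spend:       ${annual_spend:,.0f}")  # $219,000,000
print(f"Spend on idle capacity: ${idle_spend:,.0f}")    # $183,960,000
```

Even at this modest fleet size, the waste lands in the hundreds of millions, consistent with the scale the research describes.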
Sectors beyond big tech are also feeling the pressure. Healthcare organisations relying on AI for medical imaging, security firms deploying scene analysis, and media companies automating content indexing all experience similar bottlenecks and inefficiencies.
Architectural overhaul
NeuReality proposes an architectural rework of AI inference systems. Its NR1 AI-CPU moves orchestration, preprocessing, and vector tasks off the general-purpose CPU and onto dedicated hardware engines built for these operations, while the NR AI Hypervisor distributes workload and data management across many GPUs, aiming to close the utilisation gap.
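The underlying idea - take orchestration off the GPU's critical path so the accelerator is never starved - can be sketched in plain Python with a producer-consumer queue. This is a conceptual stand-in only, not NeuReality's API or hardware behaviour, and all timings are assumptions.

```python
import queue
import threading
import time

READY = queue.Queue(maxsize=8)   # bounded buffer of preprocessed batches
N_BATCHES = 32
N_WORKERS = 4                    # stand-ins for dedicated offload engines

def preprocess_worker(batch_ids):
    """Decode, embed and dispatch batches off the GPU's critical path."""
    for b in batch_ids:
        time.sleep(0.012)        # assumed decode + embed + dispatch cost
        READY.put(b)

def gpu_consumer():
    """Pull ready batches; blocks only if the buffer ever runs dry."""
    busy, start = 0.0, time.perf_counter()
    for _ in range(N_BATCHES):
        READY.get()
        t0 = time.perf_counter()
        time.sleep(0.003)        # assumed GPU inference time per batch
        busy += time.perf_counter() - t0
    print(f"GPU busy fraction: {busy / (time.perf_counter() - start):.0%}")

ids = list(range(N_BATCHES))
workers = [threading.Thread(target=preprocess_worker, args=(ids[i::N_WORKERS],))
           for i in range(N_WORKERS)]
for w in workers:
    w.start()
gpu_consumer()
for w in workers:
    w.join()
```

Because preprocessing now overlaps with inference instead of preceding it serially, the consumer spends most of its wall time busy rather than waiting on a hand-off.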
"We're seeing up to 85% performance improvement and near-linear scaling across multi-GPU configurations in our benchmarks," said Shah. "In production environments, this means achieving 100% GPU utilisation - getting the full value from infrastructure investments while dramatically reducing power consumption."
Competitive pressures
Industry analysts predict that infrastructure efficiency will be a central factor determining which organisations maintain a profitable edge as multimodal AI adoption widens. The cost of inefficient scaling may become unsustainable for firms unable to overhaul their underlying systems.
"The next competitive battleground isn't who has the most GPUs," Shah added. "It's who can extract the most value from every GPU they own."