NVIDIA GTC 2025: Key AI Breakthroughs, New GPUs, and Hands-On Insights
By Swapnil Bhatkar (@swapnilbhatkar7)
NVIDIA GTC—the Disneyland of GPU geeks, the Coachella of CUDA enthusiasts, and quite possibly the only place you'll hear folks casually debating TensorRT optimization over coffee. This was my second time attending GTC, held again in sunny San Jose, California, and as always, it didn't disappoint.
This year, with the event's growing popularity, sessions overflowed from the San Jose Convention Center into neighboring venues like the SAP Center (home of the San Jose Sharks), Montgomery Theatre, San Jose Civic Center, and hotel ballrooms at the Hilton and Marriott. Crowds were buzzing; lines were snaking; parking was, thankfully, found—and so began another whirlwind week of GPU-fueled excitement.
My mission this year was laser-focused: dive deeper into inference optimization, multi-GPU training techniques, VLMs, on-device multimodal agents, and scaling laws, and better understand the compute requirements for reasoning models. Spoiler alert: NVIDIA delivered on all fronts.
Sunday Workshop: Multimodal Agents Bootcamp
I kicked things off a day early with an additional workshop on multimodal AI agents. NVIDIA didn't skimp on the hardware—each attendee received access to a 4-node H100 Jupyter notebook instance on Azure, preloaded with NVIDIA's full AI toolkit, NIM models, and a proxy server for inference.
Practical Learnings and Tools
OCR with Context-Aware RAG
We leveraged the Unstructured.io and pdf2image packages to turn messy OCR output into structured insights—perfect for building smarter multimodal agents that handle everything from text-heavy documents to charts, graphs, and images.
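For a rough sense of what that preprocessing looks like, here's a minimal sketch using the standard pdf2image and unstructured APIs; the file path and settings are placeholders, not the workshop's exact notebook code.

```python
# Minimal preprocessing sketch (assumed file path and settings).
from pdf2image import convert_from_path               # renders PDF pages as PIL images
from unstructured.partition.pdf import partition_pdf  # extracts structured elements

pdf_path = "quarterly_report.pdf"  # hypothetical input document

# Render pages to images so charts and figures can later be passed to a VLM.
page_images = convert_from_path(pdf_path, dpi=200)

# Partition the PDF into typed elements (Title, NarrativeText, Table, ...),
# which is far easier to chunk and embed than raw OCR text.
elements = partition_pdf(filename=pdf_path, strategy="hi_res", infer_table_structure=True)

for el in elements[:5]:
    print(el.category, "->", str(el)[:80])
```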
NVIDIA AI Blueprint for Video Search and Summarization
Next up, I got my hands on the NVIDIA AI Blueprint for Video Search and Summarization, and it was an eye-opener. This blueprint is like a Swiss Army knife for video analytics, powered by NVIDIA NIM microservices. The goal? Build AI agents that can watch videos, summarize them, and answer questions about them—all through natural language.
Here's how it works: The blueprint's pipeline starts by breaking down a video into bite-sized chunks using a stream handler. These chunks are then processed in parallel by a VLM (the preview used OpenAI's GPT-4o, but Cosmos Nemotron 34B is the star in production) to generate dense captions—detailed descriptions of what's happening in each segment. These captions are stored in vector and graph databases, which act like the AI's long-term memory.
The magic lies in the Context-Aware RAG (CA-RAG) module, which combines Vector RAG and Graph RAG to pull relevant context from the databases. This setup shines for tasks like temporal reasoning (understanding the sequence of events in a video) and anomaly detection (spotting unusual moments). Whether you're querying a security system ("Show me every instance someone entered after 11 pm") or need quick game highlights ("Summarize all goals Ronaldo scored last season at Al Nassr"), CA-RAG ensures precise, relevant answers.
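To make the flow concrete, here's a heavily simplified, self-contained sketch of the chunk-caption-retrieve loop. The VLM call is stubbed out and plain Python lists stand in for the vector and graph databases, so treat it as an illustration of the idea rather than the blueprint's actual API.

```python
# Conceptual sketch of the video search pipeline; the VLM and databases are stand-ins.
from dataclasses import dataclass

@dataclass
class Caption:
    start_s: float
    text: str

def vlm_dense_caption(chunk_id: int, start_s: float) -> Caption:
    # Placeholder for the VLM (e.g. Cosmos Nemotron 34B) describing a video chunk.
    return Caption(start_s, f"dense caption for chunk {chunk_id}")

def build_index(video_length_s: float, chunk_s: float = 30.0) -> list[Caption]:
    # Stream handler step: split the video into chunks and caption each one.
    n_chunks = int(video_length_s // chunk_s) + 1
    return [vlm_dense_caption(i, i * chunk_s) for i in range(n_chunks)]

def answer(query: str, index: list[Caption]) -> list[Caption]:
    # Stand-in for CA-RAG retrieval: a real system fuses vector and graph search here.
    return [c for c in index if query.lower() in c.text.lower()]

index = build_index(video_length_s=120.0)
print(answer("chunk 2", index))
```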
Aha Moment: Seamless Model-Switching Awesomeness with NIM
The portability of NVIDIA NIM microservices across devices seemed like magic—switching seamlessly from beefy GPUs in the cloud to nimble edge deployments felt like child's play. NVIDIA definitely scored major convenience points with NIM, giving developers significant freedom in model deployment choices.
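Because NIM exposes OpenAI-compatible endpoints, the switch can be as small as changing a base URL. Here's a minimal sketch with the standard openai client; the endpoint URLs and model name are illustrative placeholders.

```python
# Same client code, different deployment target: just swap the base URL.
from openai import OpenAI

CLOUD_NIM = "https://nim.example.com/v1"   # hypothetical hosted NIM endpoint
EDGE_NIM = "http://jetson.local:8000/v1"   # hypothetical NIM running on an edge box

client = OpenAI(base_url=EDGE_NIM, api_key="not-needed-for-local")

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",    # example NIM model name
    messages=[{"role": "user", "content": "Summarize today's keynote in one line."}],
)
print(resp.choices[0].message.content)
```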
Jensen's Keynote
Arriving early at SAP Center was key. Luckily, I found parking and made it inside before the doors closed. Jensen's keynote was predictably epic, unveiling new toys that had everyone reaching for their phones and tweeting at hyperspeed.

Hardware Reveals
DGX GB300
The GB300 NVL72 wasn't just another GPU announcement—it's NVIDIA's declaration that AI reasoning is now industrial-scale. Picture this: 72 Blackwell Ultra GPUs and 36 Arm-based Grace CPUs in one liquid-cooled system that delivers 50x more AI output than the previous Hopper generation.
What exactly does 50x mean? It means models that needed a whole data center to run last year now fit in a corner rack. Add in the magic sauce: upgraded Tensor Cores optimized for reasoning tasks, 288 GB of lightning-quick HBM3e memory for deeper context windows, NVIDIA's 5th-gen NVLink for fast GPU-to-GPU communication, and the ConnectX-8 SuperNIC's blazing 800 Gbps of network bandwidth per GPU.

DGX Spark: Desktop AI Supercomputer
AI supercomputers usually conjure images of room-sized racks humming ominously in chilly server basements—not anymore. DGX Spark is the world's smallest AI supercomputer: it fits under your desk but punches way above its weight. Powered by the GB10 Grace Blackwell Superchip, it delivers 1,000 TOPS (trillion operations per second) of AI performance—enough to fine-tune models with up to 200 billion parameters, like NVIDIA's Cosmos Reason or GR00T N1 for robotics. It's got 128 GB of unified memory and NVLink-C2C for blazing-fast CPU-GPU teamwork. Nerds started imagining it as a personal Tony Stark lab in their basement office—prototyping AI models at their desk, then deploying them to the cloud without breaking a sweat. Perfect for researchers, students, or anyone dreaming up the next big thing.
Vera Rubin GPUs & Spectrum Connect: NVIDIA's Big Bet on Ultra-Scale AI
Think your current GPUs are hardcore? NVIDIA raised the bar even higher, teasing its next-gen Vera Rubin GPUs, due in late 2026. Rubin doubles down on performance for AI inference and training, tackling monstrous workloads with ease—hitting 50 petaflops per GPU module (over twice today's top-performing GPU architecture). And if that's not mind-blowing enough, just wait until Rubin Ultra arrives in 2027, promising up to 100 petaflops in one tidy GPU package. Mind officially blown.
But raw GPU power without blazing-fast connectivity is like owning a supercar stuck in rush-hour traffic. Cue Spectrum Connect, NVIDIA's new silicon photonics networking tech, fully integrated into Spectrum-X switches operating at a staggering 1.6 terabits per second per port. This futuristic networking gear means data moves between GPUs effortlessly fast, cutting operational costs, energy consumption, and network bottlenecks.
Platform Innovations
Dynamo: AI Inference Meets Warp-Speed Scalability
Remember when deploying massive LLMs across multiple GPUs was like herding caffeinated cats—complicated, exhausting, and seemingly impossible to get right? Enter NVIDIA Dynamo: an open-source, ultra-low-latency inference framework purpose-built for the fast-growing complexity of AI workloads.
The secret to Dynamo's power lies in its modular approach to distributed inference. It's like having a world-class orchestra conductor coordinating your GPU resources. It has a GPU Resource Planner that juggles compute capacity, a Smart Router that minimizes costly recomputations, a Low Latency Communication Library that accelerates data transfer between GPUs, and a KV Cache Manager that frees up precious GPU memory.
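As a rough illustration of what the Smart Router's KV-cache awareness buys you, here's a toy routing function (my own sketch, not Dynamo's actual API): requests go to the worker that already holds the longest matching token prefix, so cached KV blocks get reused instead of recomputed.

```python
# Conceptual KV-cache-aware routing; illustrative only, not Dynamo's implementation.
def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[int], workers: dict[str, list[list[int]]]) -> str:
    # workers maps a worker id to the token prefixes it currently has cached.
    best_worker, best_overlap = None, -1
    for worker_id, cached_prefixes in workers.items():
        overlap = max((common_prefix_len(request_tokens, p) for p in cached_prefixes), default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker_id, overlap
    return best_worker

workers = {"gpu-0": [[1, 2, 3, 4]], "gpu-1": [[1, 2, 9]]}
print(route([1, 2, 3, 5, 6], workers))  # -> "gpu-0" (3-token prefix overlap)
```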
All that neat tech results in mind-blowing real-world speed: NVIDIA showed Dynamo pushing throughputs of popular models like Llama 70B to more than twice previous levels, and a massive 30x acceleration in tokens per second per GPU with the gigantic DeepSeek-R1 (671B). In other words, Dynamo will save developers from sleepless nights and dramatically ballooning cloud bills.
Cosmos World Models
The Cosmos World Foundation Models (WFMs) represent NVIDIA's most ambitious bet yet on embodied AI and robotics. These aren't just models that can generate pretty images—they're comprehensive systems that understand how the physical world actually works.
What makes Cosmos remarkable is its trio of specialized capabilities: Cosmos Predict generates virtual world states that respect the laws of physics; Cosmos Transfer accelerates synthetic data generation for robot training; and Cosmos Reason provides spatiotemporal reasoning to help physical systems anticipate real-world outcomes—like predicting when a person will step into a crosswalk or a box might fall from a shelf.
CUDA
Twenty years in the making, NVIDIA's CUDA ecosystem remains the company's most powerful competitive advantage. Jensen didn't mince words, calling CUDA not just a library but an entire civilization built around accelerated computing. He made a compelling case for why the "CUDA moat" remains unbreached: decades of libraries built for specific use cases that competitors simply can't replicate overnight.
At GTC, the popular CUDA Python library ecosystem took center stage, thanks to new integrations and enhancements (a quick sketch follows the list below):
- Libraries like CuPy (think NumPy on adrenaline) get even faster execution and simpler setups, effortlessly bringing parallel GPU performance to Python's data-processing routines.
- Numba, Python's just-in-time compiler, lets you write ultra-fast GPU kernels without wrestling the complexity dragon. It compiles vanilla Python to CUDA quickly, which means Python developers can leverage GPU horsepower almost instantly—no PhD in tensor wizardry needed.
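Here's a quick taste of both, using the standard public CuPy and Numba APIs (nothing GTC-specific):

```python
# CuPy as drop-in GPU NumPy, and a hand-written Numba CUDA kernel.
import cupy as cp
import numpy as np
from numba import cuda

# CuPy: the array lives on the GPU; sin and multiply run as GPU kernels.
x = cp.linspace(0, 1, 1_000_000, dtype=cp.float32)
y = cp.asnumpy(cp.sin(x) * 2.0)  # copy the result back to the host as a NumPy array

# Numba: write the kernel in plain Python and launch it on a CUDA grid.
@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)
threads = 256
blocks = (n + threads - 1) // threads
add_kernel[blocks, threads](a, b, out)  # Numba transfers the NumPy arrays to the device
print(out[:3], y[:3])
```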
Pain Points: NVIDIA, Take Notes from AWS, Please!
Alright, a bit of a rant: NVIDIA's registration and scheduling system this year was, politely speaking, messy. Reserving seats didn't guarantee entry to popular workshops—much to attendees' chagrin. NVIDIA should seriously consider borrowing the AWS re:Invent reservation playbook to ensure attendees who planned ahead can fully engage, not stew grumpily in line.
Sessions
Lambda Labs: Serverless inference and reference architectures
Lambda Labs showcased how neoclouds are reshaping the inference landscape with OpenAI-compatible endpoints for open-source models at a fraction of the cost of traditional hyperscalers. DIY serving offers complete control but demands significant in-house expertise. Public API endpoints offer quick integration but sacrifice customization. Private API endpoints on serverless infrastructure strike the perfect balance—a managed service with dedicated capacity for organizations requiring customization and compliance.
The session delivered practical insights on optimizing different model architectures, from traditional dense models to the emerging Mixture-of-Experts (MoE) approach. I was geeking out over their model optimization tricks like quantization (shrinking models to FP8 or FP4) and pruning (trimming redundant weights). The most fascinating revelation? DeepSeek's massive 671B parameter model effectively runs with just 37B parameters activated per token—demonstrating how clever architecture design can deliver massive model capabilities without proportional compute requirements.
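To ground those two tricks, here's a toy NumPy simulation of magnitude pruning and per-tensor quantization (int8 as a stand-in for FP8/FP4; my own illustration, not Lambda's or TensorRT's pipeline):

```python
# Simulated pruning + quantization on a tiny weight matrix.
import numpy as np

w = np.random.randn(4, 4).astype(np.float32)

# Pruning: zero out the 50% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.5)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0)

# Quantization: map floats into the int8 range with a single per-tensor scale.
scale = np.abs(w_pruned).max() / 127.0
w_q = np.round(w_pruned / scale).astype(np.int8)   # stored at low precision
w_dq = w_q.astype(np.float32) * scale              # dequantized for compute

print("max reconstruction error:", np.abs(w_pruned - w_dq).max())
```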
With NVIDIA NIM microservices in the mix, Lambda's stack is a no-brainer for anyone wanting to deploy AI fast and cheap—perfect for startups or enterprises needing custom, secure inference. It's not just about having the biggest model anymore; it's about having the smartest approach to deployment and optimization.
ABCs of Synthetic Data
NVIDIA and Gretel teamed up for a deep dive on synthetic data generation—perhaps the most underappreciated component of modern AI systems. Their session revealed how synthetic data has evolved from a stopgap measure to a strategic advantage for creating robust AI systems.
The presentation outlined a three-step data generation strategy: generating evaluation data by extracting synthetic QA pairs from documents, synthesizing embedding training data with auto-generated positive and negative examples, and implementing continuous improvement through failure mode analysis.
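Here's a minimal sketch of the first two steps with the generator model stubbed out; the helper names are hypothetical, and a real pipeline would call an actual LLM and mine much harder negatives.

```python
# Toy synthetic-data sketch: QA pairs for evaluation, triples for embedding training.
import random

def generate_qa(chunk: str) -> dict:
    # Step 1: synthesize an evaluation QA pair grounded in a document chunk
    # (a real pipeline would prompt an LLM here instead of templating).
    question = f"What does this passage say about {chunk.split()[0]}?"
    return {"question": question, "answer": chunk, "source": chunk}

def embedding_triplets(chunks: list[str]) -> list[tuple[str, str, str]]:
    # Step 2: (query, positive, negative) triples for embedding fine-tuning;
    # the negative is just a different chunk here.
    triples = []
    for chunk in chunks:
        qa = generate_qa(chunk)
        negative = random.choice([c for c in chunks if c != chunk])
        triples.append((qa["question"], chunk, negative))
    return triples

docs = ["GPUs accelerate matrix math.", "NVLink connects GPUs at high bandwidth."]
print(embedding_triplets(docs))
```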
What stood out was the revelation that more than 75% of RAG (Retrieval Augmented Generation) failures occur on edge cases, while manual data curation typically covers less than 20% of real-world query patterns. Synthetic data bridges this gap beautifully.
The reinforcement learning renaissance section painted a compelling picture of how synthetic data is powering the next wave of AI advancement. As the presenter noted, "We're at the very beginning of a new paradigm where RL with verifiable rewards lets models improve through exploration—essentially teaching AI to teach itself."
Hands-On
LLM inference strategies: Monitoring key metrics
This practical session drilled into a crucial AI Ops skill set: tracking key inference metrics for Large Language Models (LLMs). NVIDIA NIM (NVIDIA Inference Microservices) delivers OpenAI-compatible endpoints optimized for speedy deployment across Azure, AWS, GKE, or Oracle, with commercial-grade management features.
This workshop broke down the essential inference stats every developer or ops person must master (a quick back-of-the-envelope calculation follows the list):
- TTFT (Time-to-First-Token): How soon your chatbot's first reply shows up (low latency = user happiness).
- E2E Latency: Measures your model's responsiveness end-to-end for smoother user experience.
- Inter-token latency (ITL): A peek beneath the hood into generative inference speed and efficiency.
- Tokens per Second (TPS): How efficiently GPUs serve model tokens to end-users—key for resource optimization.
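Here's the kind of back-of-the-envelope math those definitions imply, with made-up timestamps just to show how the metrics relate to each other:

```python
# Computing the four metrics from recorded timestamps (illustrative values only).
request_sent = 0.000       # seconds
first_token_at = 0.180
last_token_at = 2.430
tokens_generated = 150
concurrent_requests = 8

ttft = first_token_at - request_sent                              # Time-to-First-Token
e2e_latency = last_token_at - request_sent                        # end-to-end latency
itl = (last_token_at - first_token_at) / (tokens_generated - 1)   # avg inter-token latency
tps_per_request = tokens_generated / e2e_latency
system_tps = tps_per_request * concurrent_requests                # rough aggregate throughput

print(f"TTFT={ttft:.3f}s  E2E={e2e_latency:.2f}s  ITL={itl*1000:.1f}ms  TPS={system_tps:.0f}")
```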
My biggest takeaway: comparing inference performance across infrastructure becomes effortless with tools like NVIDIA GenAI-Perf. Getting visibility into these metrics clarified exactly where GPU bottlenecks occur, significantly simplifying cost optimization and ensuring ultra-responsive generative apps. The workshop distinguished between performance benchmarking (measuring actual model efficiency) and load testing (simulating real-world traffic)—emphasizing that both are essential for understanding true production readiness.
AWS IoT Greengrass with the Jetson Platform
Perhaps the most hands-on session I attended featured pre-configured NVIDIA Jetson Orin Nano developer kits with blinking LED lights lined up at the front of the room. Each attendee received SSH credentials to their own dedicated device. We deployed Jetson platform microservices using AWS IoT Greengrass, exploring how easy it can be to orchestrate multimodal agents and intelligent computer vision systems across entire device fleets.
What made this workshop particularly valuable was experiencing the entire workflow—from local development to fleet deployment management—showing how edge AI can move from prototype to production. As we worked through each deployment step, it became increasingly clear how these technologies are bridging the gap between cloud-scale AI and real-world environments that require local, low-latency intelligence.
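For a flavor of the fleet-deployment step, here's a hedged sketch using boto3's greengrassv2 client; the target ARN and component name are placeholders, and the workshop itself used preconfigured kits rather than this exact code.

```python
# Trigger a Greengrass v2 deployment of a (hypothetical) vision-agent component
# to a single Jetson device registered as an AWS IoT thing.
import boto3

gg = boto3.client("greengrassv2", region_name="us-west-2")

response = gg.create_deployment(
    targetArn="arn:aws:iot:us-west-2:123456789012:thing/jetson-orin-nano-01",  # placeholder
    deploymentName="jetson-vision-agent",
    components={
        "com.example.VideoAnalyticsAgent": {"componentVersion": "1.0.0"},  # hypothetical component
    },
)
print(response["deploymentId"])
```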
Certification Glory: A Surprise Win
Did someone say free NVIDIA certification? Yup—free this year for all GTC attendees.
With just four days to prepare, I leveraged NVIDIA's free 7-hour course and supplemental reading materials. The exam covered networking, job orchestration, Multi-Instance GPU (MIG) configurations, MLOps best practices, and hardware specifications.
Some concepts I hadn't previously encountered included the nuances between NVSwitch, NVLink, and InfiniBand technologies, when to use each, Remote Direct Memory Access over Converged Ethernet (RoCE), and the inner workings of DGX systems.
My proud certification moment: scoring 80%, not bad for a few days of prep! The certification process with Certiverse was smooth from registration to completion.

The Sweetest Perks and the RTX Surprise
GTC wasn't all work, workshops, keynote-crowds, and shaky schedules. NVIDIA surprised everyone by launching a lively night market, complete with live music and, perhaps most charmingly, hosted "Dinner With Strangers" events, sparking spirited tech-centric conversations all around downtown San Jose.
Oh, and the cherry on top? Picking up the impossible-to-get-your-hands-on RTX 5090 GPU for $1,999 straight from NVIDIA's on-site gear store—major jealousy-inducing bragging rights secured.
Key Takeaways for the Technorati: Why It Matters
Synthesizing an entire epic GTC experience into a few bullets seemed herculean, but here are some observations:
GPU Dominance Continued: H100 GPUs remain the undisputed workhorses of enterprise AI, even as B200 GPUs begin entering production environments. Meanwhile, competition is heating up from cloud providers like AWS (Trainium) and Google (TPUs), along with startups like Groq and Cerebras focusing on inference optimization.
Rise of NeoCloud Providers: Perhaps most interestingly, we're witnessing the rise of "NeoClouds" (as Dylan Patel aptly names them in his blog) - specialized GPU cloud providers like Lambda Cloud, Crusoe, Together AI, and CoreWeave offering GPU access at prices significantly below those of hyperscalers.
Cosmos World Models: A platform for synthetic data and 'world models' like Cosmos Predict and Cosmos Reason to train AI for physical tasks—think self-driving cars or robotic arms that don't fumble.
Looking Ahead: What's Next From Here?
Armed with new knowledge and connections, I left GTC 2025 bubbling with inspiration, and my development roadmap's now packed with:
- Deploying a Dynamo + vLLM production stack on B200 and H100 GPU clusters in the cloud (looking at you, GKE!).
- Building an enhanced pipeline for context-aware RAG with synthetic data generation.
- Exploring multimodal agents optimized for edge devices (Jetson, anyone?).
NVIDIA GTC 2025 was a wild ride, leaving attendees inspired, informed, and slightly richer in GPUs. And with NeoClouds lowering the cost of entry, anyone can join the party.
I got a front row seat to AI's future, with NVIDIA steering the ship. If you missed it this year, start planning for 2026 - just remember to pack comfortable shoes, a backup battery, and perhaps a sleeping bag for those 7 AM workshop lines. Can't wait till next year to see how much further Jensen and gang push the envelope. Until then—happy CUDA-ing, my friends! 🚀