Vik Korrapati is building vision AI the slow way. It's working.
Most AI companies are chasing the next transformer. Vik Korrapati is chasing the next 2%.
For those of you who don’t know me, my name is Pete Soderling. I started AI Council (formerly Data Council) with one purpose: bring together the sharpest builders in the world to talk about what they’re actually experiencing in the field. No BS. No hype.
Each piece tees up a question we’re planning to go deep on. If something resonates — or you disagree — comment below!
The 2% gains nobody claps for
There’s a version of the AI industry where everyone’s racing to publish the next state-of-the-art benchmark result. And then there’s the version where you actually ship something now — where the hard work is connecting an existing model to a real application and making it fast, cheap, and reliable enough to deploy.
That second version doesn’t get the headlines. But it’s where most of the value is still waiting to be captured.
Vik Korrapati and the Moondream team are building an open-source vision-language model (VLM), but they’re not chasing benchmark records. They’re focused on engineering gains like custom tokenizers, dedicated grounding tokens, and image decoding written from scratch. It’s not glamorous, but when you stack enough of those compounding optimizations, you end up with a product that actually makes a difference for customers — whether it’s a rancher flying a drone to spot escaped cattle or a factory floor camera catching a worker’s missing safety vest in real time.
I love this approach because the models we already have are incredibly powerful, and we’ve barely scratched the surface in applying them. There’s an enormous amount of unclaimed ground in real-world use cases — and the work of getting there is engineering work, not research work. In my view, that’s the most important unsolved problem right now.
In this Q&A, Vik and I dig into what it actually takes to make vision AI work in the real world — from designing around latency budgets to building custom inference engines to tackling hallucination in ways that give customers enough trust to actually deploy.
Q&A with Vik Korrapati, Moondream
Pete: VLMs have unleashed visual reasoning, but people sometimes struggle to imagine what can be built on top of it. What are the most exciting applications you're seeing on Moondream?
Vik: Being an open-source project means you become a magnet for use cases you’ve never heard of. Customers educate us constantly — someone will say, “I prompted the model in this weird way and it actually worked,” and I’m like, “I didn’t think that could happen.”
To zoom out for a second: there are industries that have traditionally invested heavily in computer vision, like retail and manufacturing, where the need was so acute that companies were willing to spend tens of millions of dollars deploying solutions. You had to hire a team of PhD ML researchers, annotate a bunch of data, and train a model. It was expensive. And if it failed, it was a very expensive failure to explain to your CEO.
VLMs have really democratized this. Any developer with an idea can now build vision AI apps through prompting. We’re seeing interesting things in broadcasting that weren’t happening before, computer use (which wasn’t possible with traditional vision systems because you need more reasoning), and a long tail of use cases nobody anticipated. We once heard from a rancher who wanted to fly a drone over his property and detect when something unexpected happened, like cows that had escaped. That’s a new market that simply couldn’t use automated vision understanding before.
Pete: I’ve been investing in CV (computer vision) for a long time — the lift required to get a model trained and deployed was enormous, and that was just a few years ago.
Vik: I think most people don’t appreciate how complicated it is to solve seemingly simple vision tasks. Say you have a camera feed in a factory and you want to know if a worker has been without a safety vest for more than 30 seconds. That requires a person detection model, a PPE detection model, stateful infrastructure for tracking… it’s many steps. It sounds easy because we’re so good at vision ourselves — as humans. We just don’t appreciate how hard it is for a machine to simulate what we do naturally.
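To make the stateful piece concrete, here’s a minimal sketch of just that last step: tracking how long each person has gone without a vest. The detector and tracker calls are hypothetical stubs, not Moondream code; each one is a real model or system in its own right.

```python
import time

def detect_people(frame):
    """Hypothetical stub: a person detector plus a tracker that
    assigns stable IDs across frames. Returns [(track_id, box), ...]."""
    return []

def has_safety_vest(frame, box):
    """Hypothetical stub: a PPE-detection model scoped to one person."""
    return True

VEST_TIMEOUT_SECONDS = 30
first_missing = {}  # track_id -> timestamp when the vest first went missing

def process_frame(frame, now=None):
    """Return track IDs that have been vest-less for over 30 seconds."""
    if now is None:
        now = time.time()
    alerts = []
    for track_id, box in detect_people(frame):
        if has_safety_vest(frame, box):
            first_missing.pop(track_id, None)  # vest is back; reset the timer
            continue
        started = first_missing.setdefault(track_id, now)
        if now - started > VEST_TIMEOUT_SECONDS:
            alerts.append(track_id)
    return alerts
```

Even this toy version hides two full models and a tracker behind two function calls — which is exactly the point.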
Pete: You’ve talked about working backwards from real use cases. How does designing around a latency budget shape your architecture decisions?
Vik: People care a lot about benchmarks when they’re trying to decide what model to use — and it’s usually around accuracy. But when you talk to customers who are actually trying to deploy in the vision space, the thing that comes up consistently is performance.
You’re usually processing large amounts of video or massive image datasets, so cost and speed matter a lot. And it’s usually not given the same treatment as accuracy on a benchmark chart.
So performance has been top of mind for us. We’ve been co-designing our inference infrastructure alongside the model architecture — and that lets us make decisions that lead to faster inference.
Take object detection. Given a scene, you want to describe in English what you’re looking for and get grounding coordinates back. Typically, a model does this by emitting JSON — the top-left X,Y coordinates are this, the bottom-right X,Y coordinates are that — and that ends up being 40, 50, 60 tokens sometimes.
We built dedicated grounding tokens so each box is represented by three tokens instead of tens of tokens. That’s one of the ways we’ve been able to drive inference speed forward — controlling both the inference layer and the architecture layer, and treating performance as a first-class benchmark alongside accuracy.
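For a rough feel of why that matters, here’s a back-of-the-envelope comparison. The ~4-characters-per-token rule and the per-token latency are my assumptions for illustration; the interview doesn’t specify what Moondream’s three grounding tokens actually encode.

```python
import json

# A JSON detection payload like the one a general-purpose VLM might emit.
box_json = json.dumps({
    "label": "cow",
    "x_min": 0.12, "y_min": 0.34,
    "x_max": 0.58, "y_max": 0.91,
})

json_tokens = len(box_json) / 4   # rough rule of thumb: ~4 chars per token
grounding_tokens = 3              # dedicated tokens per box (per the interview)
per_token_ms = 10                 # hypothetical autoregressive decode cost

print(f"JSON box:      ~{json_tokens:.0f} tokens, ~{json_tokens * per_token_ms:.0f} ms")
print(f"grounding box:  {grounding_tokens} tokens, ~{grounding_tokens * per_token_ms} ms")
```

With dozens of boxes per frame, a 10–15x cut in decoded tokens per box translates directly into frames per second.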
Pete: Before you started Moondream, you were at AWS. How has your distributed-systems background shaped how you think about inference and deployment?
Vik: Yes, beyond deployment, it’s shaped us as a company — we really think of ourselves as an engineering shop.
I think a lot of people in the industry are trying to chase the next transformer — the big step-function research leaps. And those are great, but they’re few and far between.
We miss out on opportunities that exist in between. At AWS, the culture was that you chase a 2% improvement here, a 1% improvement there, a 3% improvement in another system, and over time, you’d end up with a system that is 10x better than anything anyone else offers. It’s not as glamorous. It’s not a fancy research idea that your peers are going to praise you for, but at the end of the day, it results in a system that is far better for customers to deploy.
“It’s not a fancy research idea that your peers are going to praise you for, but at the end of the day, it results in a system that is far better for customers to deploy.”
Pete: And what are some of the specific improvements that you’ve made with this lens in mind?
Vik: We recently released a custom inference engine specifically for Moondream, because existing inference engines are really geared towards LLM inference — and VLMs have different performance characteristics. They tend to be more prefill-heavy than decode-heavy. We wanted it to work on pretty much any inference hardware, anywhere from cheap edge devices and CPUs to high-end server chips.
We got to a point where our kernels weren’t even the bottleneck anymore — it was image decoding. So we had to sit down and write custom image decoding and resizing libraries in native code to tackle that. Whatever is the bottleneck today is what we’re happy to tackle.
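The way this kind of bottleneck shows up is mundane: you time the stages. A minimal sketch, assuming a PIL-based decode path and a stand-in `model_step`; the resolution is a placeholder, not Moondream’s actual input size.

```python
import io
import time
from PIL import Image  # pip install pillow

def decode_and_resize(jpeg_bytes, size=(384, 384)):
    """Pure-PIL decode + resize -- the kind of path that can end up
    dominating once the GPU kernels themselves are fast enough."""
    img = Image.open(io.BytesIO(jpeg_bytes)).convert("RGB")
    return img.resize(size, Image.BILINEAR)

def profile(jpeg_frames, model_step):
    """Compare time spent decoding images vs. running the model."""
    decode_s = model_s = 0.0
    for jpeg in jpeg_frames:
        t0 = time.perf_counter()
        img = decode_and_resize(jpeg)
        t1 = time.perf_counter()
        model_step(img)  # stand-in for vision encode + prefill + decode
        t2 = time.perf_counter()
        decode_s += t1 - t0
        model_s += t2 - t1
    print(f"image decode/resize: {decode_s:.2f}s  model: {model_s:.2f}s")
```

When the left number starts beating the right one, faster kernels stop helping — and rewriting the decode path in native code becomes the highest-leverage move.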
Pete: Hallucination in vision models doesn’t get nearly the attention it does in text. What does it actually look like in the wild, and how do you design around it?
Vik: Hallucinations are one of the big things preventing customers from really taking advantage of AI in the vision space. If you can’t trust your system, if you don’t understand how it’s making decisions or what its failure modes are, it’s really hard to deploy it in production.
Our approach has been to focus on grounding. We built a reasoning mode where, rather than answering directly, the model shows its work first. If you ask whether a car is parked within the lines, instead of just saying yes or no, it’ll say: I see a car here, these are its boundaries, I see the lines here — and it actually generates grounded X,Y coordinates in the reasoning trace that you can inspect.
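To picture what that looks like, here’s a hypothetical trace for the parking question. The format is my illustration, not Moondream’s exact syntax; the point is that every claim in the chain carries coordinates you can check against the image.

```
Q: Is the car parked within the lines?

Reasoning: There is a car with bounding box (0.31, 0.42)-(0.68, 0.88).
The left parking line is at x ≈ 0.29; the right line is at x ≈ 0.71.
The car's box lies entirely between the two lines.

Answer: Yes.
```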
That lets you see exactly where the model is going wrong and what you need to correct. We also spend extensive compute on reinforcement learning on these grounding-related reasoning chains — we train on over 200 tasks specifically to force the model to be accurate in generating grounding decisions.
The result is some explainability built into the system. It helps our customers understand why the model made a decision. And it helps us as model developers understand where hallucinations happen and guide customers on where they should and shouldn’t deploy.
Pete: Thanks, Vik! Looking forward to continuing the conversation at AI Council.
Vik: Appreciate it! Thanks, Pete!
Watch the full conversation here:
There’s a lot more to dig into on applied vision AI. A few questions still on my mind for AI Council:
Where’s the line between “good enough to demo” and “reliable enough to deploy” for vision models — and what closes that gap?
How do you design around latency and cost constraints without sacrificing the accuracy that customers actually need?
What does hallucination look like in vision-specific contexts, and how should teams think about trust and explainability?
When does it make sense to build your own inference stack versus using what’s already out there?
I’m excited to hear more on multi-modal AI from Elie Bakouch of Prime Intellect, Chang She of LanceDB, and others at AI Council, May 12–14, 2026. Grab your ticket here.
If you have another meaty question about VLMs and multi-modal AI — drop it in the comments.

