For 20 years, you stored a pointer to the file. Chang She thinks those days are over.
Chang She (LanceDB) on why the data infrastructure under every AI product is about to get rebuilt — whether teams are ready or not.
For those of you who don’t know me, my name is Pete Soderling. I started AI Council (formerly Data Council) with one purpose: bring together the sharpest builders in the world to talk about what they’re actually experiencing in the field. No BS. No hype.
Each piece tees up a question we’re planning to go deep on. If something resonates, or you disagree, comment below! And check out my other interviews with Scott Breitenother of Kilo Code, Eno Reyes of Factory, and Vik Korrapati of Moondream.
The fork in the road
For 20 years, data engineers have done this the same way: store the metadata in the database, store a pointer to the file — the video, the image, the big binary blob — somewhere else on disk. It worked. Most engineers are still doing it that way.
Chang She thinks that era is over.
Chang was the second major contributor to pandas and spent his career building the tools underneath modern data engineering. Now, as co-founder and CEO of LanceDB, he’s arguing that the next generation of AI systems — agentic workloads, multimodal data, production tables at hundreds of billions of rows — can’t be built on that old pattern anymore. The files themselves need to be managed by the database, not orphaned in storage that the database can’t see.
That’s a fork in the road. If you accept his premise, a lot of interesting architectural things follow. If you don’t, you’re probably going to keep doing what you’ve always done. Either way, it’s a choice worth making consciously instead of defaulting into.
I sat down with Chang ahead of AI Council SF 2026 to talk about why he thinks the old stack is breaking, what agents are doing to database throughput, and why anyone with a serious background in database performance “starts to shake in their boots a little” when they think about what agentic data access is going to look like at scale. Here’s the conversation.
My Q&A with Chang She, Co-founder & CEO of LanceDB
Pete: What’s the material difference between storing blobs in the database versus keeping them as external assets in S3?
Chang: The main difference is you get a lot more optimizations when you can store blobs inline. If you need to access a chunk of data — say, a bunch of images — it can be one request instead of one-per-image.
A lot of folks working with multimodal data get throttled by the object store. If you have tons of images or videos and you’re accessing them through pointers, you run into request limits or you get charged a lot for the sheer number of requests. Storing them inline makes it faster when you access them in blocks — you can coalesce requests and it’s easier to manage for synchronization.
But multimodal data can span from a few kilobytes to multiple gigabytes. You want options. That’s why we recently released the Blob V2 API — same interface, but under the hood it picks between three or four storage strategies based on the size and type of the data. From the user’s side, it looks like one API. You get the performance without having to think about it.
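To make the coalescing point concrete, here’s a minimal sketch of the difference. It’s my illustration, not LanceDB’s Blob API: the bucket names, keys, and helper functions are made up, and it assumes boto3 talking to S3-style object storage.

```python
# A minimal sketch (not LanceDB's implementation) of why inline blobs help:
# pointer-style storage pays one object-store request per image, while a file
# that stores blobs inline can coalesce neighboring rows into one ranged read.
# Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Pointer-style: one GET per image -> N requests, N request charges.
def fetch_by_pointers(keys):
    return [s3.get_object(Bucket="my-images", Key=k)["Body"].read() for k in keys]

# Inline-style: blobs for adjacent rows sit next to each other in one file,
# so a batch of rows becomes a single ranged GET that we slice locally.
def fetch_inline(offsets):  # offsets: list of (start, length) within one file
    start = min(o for o, _ in offsets)
    end = max(o + n for o, n in offsets) - 1
    body = s3.get_object(
        Bucket="my-table", Key="data/blobs.lance", Range=f"bytes={start}-{end}"
    )["Body"].read()
    return [body[o - start : o - start + n] for o, n in offsets]
```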
Pete: Are we beyond copying this data around between nodes? Is it getting to where you're physically pulling disks out and carrying them into another room? 😅
Chang: We’ve already blown past the bandwidth limit for object storage in a lot of places. That’s actually one of the reasons we shipped a recent feature we call multi-base: you can use multiple storage accounts and split a single table across multiple buckets. So you set up your table across three or four object store accounts and get three or four times the bandwidth.
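For readers who want the mental model, here’s a toy sketch of what splitting one table across several buckets looks like. It’s not the multi-base implementation, just an illustration of round-robin placement over hypothetical bucket names.

```python
# Toy illustration (not the multi-base implementation) of spreading a table's
# data files across several buckets so reads fan out over more object-store
# bandwidth. The bucket names are made up.
BUCKETS = ["s3://acct-a/table", "s3://acct-b/table", "s3://acct-c/table"]

def place_fragment(fragment_id: int) -> str:
    """Round-robin a data fragment onto one of the configured buckets."""
    base = BUCKETS[fragment_id % len(BUCKETS)]
    return f"{base}/fragment-{fragment_id:05d}.lance"

print([place_fragment(i) for i in range(5)])
```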
Pete: Take me back to the insights that gave birth to LanceDB. You were probably trying to use Parquet at first — at what point did you realize it wasn’t going to work for AI workloads?
Chang: We spent at least six months trying to make it work with Spark on Parquet. The workload we tried it on was large-scale data mining for autonomous vehicles — and physical AI today basically has the same problems. It boiled down to two big challenges.
Number one was random access. The analytical parts of the workload worked great on Spark and Parquet, but we wanted to retrieve and display individual rows with the metadata. We found we always had to make a copy in a different format, otherwise it would take tens of seconds just to fetch 10 to 100 rows and show them.
Number two was multimodal data storage. The raw data had to be somewhere, the random access feature data was in some key-value store or just JSON files, and Parquet was still used for the analytical data. It was just way too much work keeping these three in sync. It worked great in demos where you can hide stuff and “Martha Stewart” things, but we realized it wasn’t going to work in production.
We didn’t do this lightly — I’ve been in open source for a long time and we value consensus. But I interviewed over a hundred machine learning and computer vision engineers, and every one of them had gone through the same failed experiments with Parquet. My co-founder and I knew the Parquet internals well enough to know that making it work the way we needed would require a fundamental redesign.
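A rough sketch of the random-access gap Chang is describing: with Parquet you generally end up decoding far more data than the handful of rows you asked for, while the Lance format is built for point lookups. The paths are hypothetical, and this assumes the pyarrow and pylance (imported as lance) Python packages.

```python
# Rough sketch of the random-access problem. Parquet has no row-level index,
# so fetching a few rows means decoding whole row groups or files; Lance
# exposes point lookups directly. Paths are hypothetical.
import lance
import pyarrow.parquet as pq

row_ids = [17, 40_231, 9_874_002]

# Parquet: reads and decodes far more data than the three rows we want.
table = pq.read_table("data/events.parquet")
rows_parquet = table.take(row_ids)

# Lance: the format is designed for fast point lookups by row id.
ds = lance.dataset("data/events.lance")
rows_lance = ds.take(row_ids)
```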
Pete: Let’s talk about the elephant in the room. Agents can fire hundreds of queries in parallel — nothing like a human running an ad hoc search. How does that change the architecture?
Chang: This is the biggest reason I’m excited this year. Two things are happening.
First, data access is becoming primarily agentic. Throughput, performance, and scale are all up.
Throughput: A couple of years back with vanilla RAG, customers were asking for 10 to maybe 100 QPS. This year we’re looking at tens of thousands, even a hundred thousand queries per second. That’s multiple orders of magnitude.
Latency: With one-shot RAG, a second or two was acceptable. Agentic workflows need retrieval under 100 milliseconds because they’re chaining many steps on long paths.
Scale: In 2023, the prevailing wisdom was RAG ran on small tables of hundreds of thousands to a few million vectors. Now, forget production: even in prototypes, customers want evaluations at billions of rows. Production workloads are in the hundreds of billions on a single table.
Second, data pipelines are being written by agents, not humans. Agents can run many more experiments in parallel. Previously you’d manually code up feature ideas one at a time. Now you can tell Claude Code or Codex to try a hundred variants of each idea and run ablation studies. Making those experiments reproducible and manageable is going to be a big theme.
Pete: Engineers today cobble together lakehouses, vector DBs, search APIs. Were you worried about throwing another tool into an already crowded ecosystem?
Chang: The way we thought about it wasn’t “let’s throw yet another tool on top of this.” We wanted to simplify things — remove that hodgepodge of tooling and replace it with a single foundation.
A couple of big problems pushed us there. First, the hodgepodge makes things really slow. One of our earliest design partners was a car company with a similar setup to what you described, and processing their data ended up being slower than real time — it took more than a day to process a single day’s worth of data coming off their cars.
And second, the infrastructure and maintenance cost of copying data around and keeping sync pipelines between all these different systems adds up fast. You lose a ton of productivity. Engineers end up spending most of their time dealing with low-level details: Did I partition my data correctly for this system and also for that system? If I got a bad query result, is it because the answer wasn’t there, or because the two pieces of data were out of sync?
The last consideration was that, because of the advent of Apache Arrow and its popularity, all the existing tooling could integrate with Lance without us building yet another set of pairwise integrations.
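Here’s a small sketch of what that Arrow point means in practice: anything that produces or consumes Arrow tables (pandas, DuckDB, Polars, and so on) can read and write Lance data without a bespoke connector. Paths are hypothetical, and this assumes the pyarrow and pylance packages.

```python
# Small sketch of the Arrow integration point: Lance speaks Apache Arrow, so
# Arrow-native tools can exchange data with it directly. Paths are made up.
import lance
import pyarrow as pa

arrow_table = pa.table({"id": [1, 2, 3], "caption": ["cat", "dog", "bird"]})
lance.write_dataset(arrow_table, "data/captions.lance")

# Any Arrow-native tool can pick the data back up from here.
ds = lance.dataset("data/captions.lance")
print(ds.to_table().to_pandas())
```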
Pete: Previously you said that by 2025, 90% of all data generated would be video. Now that we’re on the other side, how has it played out?
Chang: The latest stat is something like 400 million terabytes of multimodal data generated per day now — that’s 0.4 zettabytes. The projection is in three to five years, we’ll hit one zettabyte per day.
I fully expect multimodal to become a bigger and bigger portion of data engineering teams’ time and effort. The data volume was already large before, but the big difference now is how much more value we can get out of multimodal data because of AI. And because the value is higher, it’s worth investing a lot more in managing that data and the infrastructure to process it.
Maybe three to five years down the road, we won’t even think about this extra term “multimodal data processing.” We’ll just think about data engineering, and it’ll be multimodal by default.
Watch the full conversation here:
A few questions still on my mind for AI Council
There’s a lot more to dig into on the data infrastructure that’s quietly getting rebuilt underneath every AI product. A few questions I keep coming back to:
How much of the old data stack survives the transition to multimodal, and what gets thrown out entirely?
Where’s the breaking point on object store bandwidth — and what do teams do when they actually hit it?
What does “production-ready” mean for a table with hundreds of billions of rows and agents hammering it in parallel?
When does it make sense to rebuild your data layer from scratch versus patching what you have?
I’m excited to hear from others working on the data infrastructure under modern AI: Nikhil Benesch (CTO at Turbopuffer), Glauber Costa (CEO at Turso), and Hannes Mühleisen (Co-Founder and CEO of DuckDB), all at AI Council, May 12–14, 2026. Grab your ticket here.
If you have other questions about multimodal data infrastructure, drop them in the comments!


