A Data Lake for Five Dollars a Month

5 min read
Data Architecture · Data Lake · Small Business · AI · Google Sheets · Enterprise Data Management
[Image: Laptop on a small wooden desk showing a data dashboard, with a faint server room visible in the background.] Small business tools on the surface, enterprise-style data architecture underneath.

I've become the data guy in my family. And in my friend group. And apparently for anyone who knows someone who knows me and has a spreadsheet problem.

It starts the same way every time. Someone running a small business (a produce distributor, a cleaning crew, a landscaping operation) hits a wall. They can't track sales properly. They can't connect their invoices to their customer list. Their team needs access to the same data but there's no controlled way to share it. They're managing everything in Google Sheets or Excel, and it works until it doesn't.

I do this for a living at one of the big four consulting firms. I build enterprise data platforms: data lakes, governance frameworks, multi-layer architectures with raw zones and serving layers and lineage tracking. I know what good looks like. It costs tens of thousands of dollars a year, and nobody in my family has that budget.

But these businesses are already on Google. Their files are in Drive. Their data is in Sheets. Their team logs in with Google accounts. The storage, the collaboration, the authentication. It's already there. They're just not using it as infrastructure.

So the question: can I build enterprise data controls on top of what small businesses are already using, without asking them to buy anything new?

This is an early proof of concept. It's incomplete, and it's partly an excuse to experiment and develop my skills. But it's far enough along to prove the point.

The problem isn't the tools. It's the controls.

Small businesses generate more data than they realize. Invoices arrive as PDFs and XML files. Master data (customers, suppliers, products, pricing) lives in spreadsheets or in someone's head. Reporting means opening three tabs and eyeballing numbers.

The pain isn't volume. It's fragmentation. The invoice PDF is disconnected from the customer list, which is disconnected from the commission structure, which is disconnected from the financial summary someone needs at the end of the month. When multiple people need access, it gets worse. No structure governing who enters what, no way to track how data flows, no lineage when a number looks wrong.

Google Sheets and Drive are fine for the data itself at this scale. These businesses have been running on them for years. What's missing is the control layer: structured entry, processing pipelines, a catalog of what exists, and the ability to trace any number back to its source.

What I built

I built a proof of concept for my brother-in-law's asparagus packing and distribution business in Latin America. Farmers, clients, invoices, commissions, product catalogs. It's a process I know well from working alongside him during summers, so it made a good starting point.

The architecture follows the same pattern I'd use for an enterprise client: raw zone, ETL layer, structured serving layer, catalog, and lineage. Every component maps to something in the Google ecosystem.

Drive is the raw zone. Invoices land in date-partitioned folders and stay immutable after upload. Sheets is the serving layer. Two spreadsheets hold structured master data and invoice records. A catalog sheet tracks every file in the system with its status and metadata. The Drive file ID follows the data from raw storage through the catalog into the structured rows. Any number on the dashboard traces back to the original PDF.
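The catalog-and-lineage idea is simple enough to sketch in a few lines. This is a minimal illustration, not the POC's actual schema: the field names, statuses, and folder layout here are assumptions, but the mechanism is the same — the Drive file ID is the key that follows a document from the raw zone into every downstream record.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CatalogEntry:
    """One row in the catalog sheet (field names are illustrative)."""
    file_id: str          # Drive file ID -- the lineage key
    filename: str
    uploaded: date
    status: str = "raw"   # e.g. raw -> extracted -> reviewed

def raw_zone_path(uploaded: date) -> str:
    """Date-partitioned folder path in the Drive raw zone."""
    return f"raw/{uploaded.year}/{uploaded.month:02d}/{uploaded.day:02d}"

def register(catalog: dict, entry: CatalogEntry) -> str:
    """Add a file to the catalog, keyed by its Drive file ID.

    Raw files are immutable after upload, so re-registering an ID is an error.
    """
    if entry.file_id in catalog:
        raise ValueError(f"{entry.file_id} already cataloged; raw files are immutable")
    catalog[entry.file_id] = entry
    return raw_zone_path(entry.uploaded)
```

Because every structured row carries the same `file_id`, tracing a dashboard number back to its source PDF is a dictionary lookup, not an investigation.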

For ETL, something needs to turn unstructured documents into structured data. You could write parsers, use OCR, or build templates per vendor. I used an LLM that receives the document with context and returns structured JSON. That's one approach; the architecture doesn't depend on it. What matters more is that a human reviews every extraction before anything is saved.
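The review gate can be sketched as two small steps: validate the model's JSON against an expected shape, then refuse to write anything without explicit approval. The required fields below are a hypothetical invoice schema, not the one the POC uses:

```python
import json

REQUIRED_FIELDS = {"invoice_number", "date", "supplier", "total"}  # illustrative schema

def parse_extraction(raw_json: str) -> dict:
    """Validate the LLM's structured output before it reaches a human reviewer."""
    record = json.loads(raw_json)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"extraction missing fields: {sorted(missing)}")
    return record

def save_if_approved(record: dict, approved: bool, sink: list) -> bool:
    """Nothing is written to the serving layer without human sign-off."""
    if not approved:
        return False
    sink.append(record)
    return True
```

The point of the design is that the extractor is replaceable (LLM, OCR, per-vendor templates), but the approval step is not.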

DuckDB runs SQL queries directly in the browser. After one data load, every filter, aggregation, and join runs locally. No database server, no query API, no cost per query.
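The pattern is worth seeing in miniature. The POC uses DuckDB-WASM in the browser; here stdlib `sqlite3` stands in so the sketch runs anywhere, and the table and columns are made up for illustration. What carries over is the shape: load the data once, then every filter, aggregation, and join is a local SQL query with no server behind it.

```python
import sqlite3

# sqlite3 stands in for DuckDB-WASM here; same load-once, query-locally pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (file_id TEXT, client TEXT, total REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("a1", "Client A", 120.0), ("a2", "Client A", 80.0), ("b1", "Client B", 50.0)],
)

def totals_by_client(conn) -> list:
    """Aggregate locally, the way a dashboard widget would after one data load."""
    cur = conn.execute(
        "SELECT client, SUM(total) FROM invoices GROUP BY client ORDER BY client"
    )
    return cur.fetchall()
```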

Authentication and sharing use Google's existing permission model. Users log in with their Google account, and folders are shared with the team the same way they already are. No separate auth system to build or maintain.

Total monthly cost for a few hundred invoices: about five dollars. The same workload on AWS runs $50 to $200 a month. A Snowflake-plus-Tableau setup starts at around $400.

What this proves

The architecture patterns I use for enterprise clients (raw zones, serving layers, catalogs, lineage, quality gates) work on infrastructure that costs nothing. Data fragmentation isn't a big-company problem. It happens at five employees just as much as at five thousand. The difference was always the cost of the tooling, not the relevance of the patterns.

Google's ecosystem does more than most small businesses realize. Drive is blob storage with sharing controls, authentication, and webhooks. Sheets is a structured data store with an API. And when Sheets hits its limits, you swap in PostgreSQL and nothing else changes. The architecture outgrows its cheapest components without a rewrite.
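The "swap in PostgreSQL and nothing else changes" claim rests on a standard move: code against an interface, not against Sheets. A minimal sketch, with hypothetical names — the real POC's abstraction will differ, but the idea is that only the backend class changes:

```python
from typing import Protocol

class ServingLayer(Protocol):
    """The rest of the app codes against this interface, not against Sheets."""
    def append(self, row: dict) -> None: ...
    def rows(self) -> list: ...

class InMemoryStore:
    """Stand-in backend; a SheetsStore or PostgresStore would satisfy the same Protocol."""
    def __init__(self) -> None:
        self._rows: list = []
    def append(self, row: dict) -> None:
        self._rows.append(row)
    def rows(self) -> list:
        return list(self._rows)

def record_invoice(store: ServingLayer, file_id: str, total: float) -> None:
    """Business logic stays identical no matter which backend is plugged in."""
    store.append({"file_id": file_id, "total": total})
```

Outgrowing Sheets then means writing one new class, not rewriting the pipeline.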

Where this is going

This POC is built around one use case. The idea behind it is broader. I'm thinking about how to generalize this into a data platform that uses Google's ecosystem as its backbone. Controlled data entry, document ingestion, structured storage, browser-side analytics. None of that is specific to produce distribution.

For now, this was about proving the concept and scratching my own itch.

The architecture doesn't have to be expensive. It just has to be designed.

If you want to see how it's built, including the full architecture, data lifecycle, and cost model, the repo is on GitHub.
