This is a personal project, and these are my own thoughts. The data comes from HHS Open Data as-is and hasn't been independently verified; it has known quality variations by state, and rows with fewer than 12 claims are suppressed. Provider flags are not accusations; they are signals for further review.
Data Sources
- HHS Open Data — Medicaid Provider Spending dataset (2018–2024)
- California Ineligible Provider List
- HCPCS Code Reference — Medicaid.gov (January 2026 Alpha-Numeric file)
- U.S. Census Bureau — Geocoding API + ACS population data
On February 8, 2026, HHS dropped the largest Medicaid claims dataset in department history: provider-level spending for all 50 states in one big CSV, nearly 300 million rows. I'm not an HHS policy expert; I'm an IT professional who saw an interesting dataset and decided to do something with it. I needed a break from my main project (an agentic bot called OpenClaw), so I vibe-coded a California explorer in Codex over a weekend. Pulled California out of the national file, broke it down by county, stood up an S3 static site, and called it done. No backend, no over-engineering. The goal was simple: take the dataset, cross-reference it with other public data, and make it readable.
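Pulling one state out of a file that size is a streaming job, not a load-into-memory job. Here's a minimal sketch using the standard library's `csv` module; the `state` column name is an assumption on my part, so substitute whatever header the actual HHS file uses:

```python
import csv

def extract_state(src, dst, state="CA", state_col="state"):
    """Stream the national claims CSV once, writing only one state's rows.

    Reads row by row, so the ~300M-row input never has to fit in memory.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[state_col] == state:  # keep only the target state
                writer.writerow(row)
```

A single sequential pass like this is slower than loading into a dataframe but works on any machine, which is the right trade-off for a one-time split.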
Why California
The full dataset covers all 50 states. I picked California because I was born and raised here, and it happens to be the largest Medicaid program in the country — 58 counties, massive provider footprint, billions in annual spend. Big enough to be interesting, still scoped enough to build something useful in a weekend.
What Got Built
A static web app on S3. Statewide KPIs and insights on the landing page so you start with context. Pick any county and you get claims totals, a trend chart, top procedure codes, top billing and servicing providers, and a geocoded provider map. Drill into a provider and you get service mix, claim volume, and ineligibility flag context if applicable.
The pipeline is a Python script that splits the data by county, pulls in Census geocoding and population data, normalizes HCPCS codes, and outputs JSON the frontend reads directly. Two things worth knowing if you do something similar: the Census geocoder is free and accurate for this kind of work — just cache your results or you'll hit rate limits. And clean your procedure codes first or your tables will be a mess.
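The rate-limit caveat comes down to memoizing the geocoder. A minimal sketch of the caching pattern I mean: `fetch` is a stand-in for whatever call you make against the Census geocoder (its one-line-address endpoint, for example), and results land in a JSON file so re-runs of the pipeline never repeat a lookup:

```python
import json
from pathlib import Path

def make_cached_geocoder(fetch, cache_path="geocode_cache.json"):
    """Wrap a geocoding function with a JSON-file cache.

    `fetch(address)` is hypothetical: plug in your own call to the
    Census geocoding API. Each address is fetched at most once.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    def geocode(address):
        if address not in cache:
            cache[address] = fetch(address)      # only on a cache miss
            path.write_text(json.dumps(cache))   # persist across runs
        return cache[address]

    return geocode
```

Writing the whole cache after every miss is crude but fine at provider-list scale; the point is that a crashed or re-run pipeline resumes from disk instead of hammering the endpoint again.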
The Identity Gap
Here's the most interesting thing I found — and it has nothing to do with code.
California's ineligibility list has a Provider Number field. The field mixes state-issued license numbers and NPIs together in a single comma-separated column. Some providers have both, some have one or the other, and some have nothing at all.
This isn't unique to California — it's a common data structure challenge across government datasets. The NPI is the unique key every provider uses to bill Medicaid. When it's mixed into a combined field, you can't do a clean join against claims data. You end up parsing strings and making probabilistic matches instead of deterministic ones — which means some ineligible providers may not get flagged simply because the identifier couldn't be reliably extracted.
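One thing that makes the string parsing less fragile: an NPI is exactly ten digits and carries a Luhn check digit (computed as if prefixed with the card-issuer code 80840), so candidates pulled out of the mixed field can at least be validated before any join. A sketch of the idea, not necessarily how any production matcher does it:

```python
import re

def luhn_valid(digits):
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def extract_npis(provider_number):
    """Pull checksum-valid NPIs out of a mixed comma-separated field."""
    candidates = re.findall(r"\b\d{10}\b", provider_number or "")
    # A real NPI passes Luhn when prefixed with 80840.
    return [c for c in candidates if luhn_valid("80840" + c)]
```

This doesn't recover providers whose NPI simply isn't in the field, but it does stop a ten-digit license number from masquerading as one.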
If this were an enterprise system, the solution would be straightforward: separate columns, consistent formatting, NPI as a required field. It's the same problem as storing a username, employee ID, and badge number in one unformatted text field. The identifiers exist — they just need better structure to be actionable at scale.
Why It Matters
California spends billions on Medi-Cal. The providers, procedure volumes, and geography in this data represent real dollars and real patients. Making that visible and accessible is worth the weekend it took.
More public data is being released. Most of it sits there as raw files nobody outside a data team can use. That gap is closeable with basic tooling and a willingness to just build something.
Check It Out: https://datapresenterabc.s3.us-west-1.amazonaws.com/index.html
