The Challenges and Quirks of Hospital Price Transparency Files

Piping Up and Exploring the Options

Posted October 11, 2024

10-11 2024

We’re now three months past the July 1st deadline for hospitals to post their price transparency data in CMS’s new version 2.0 file format. So far the rollout has been a bit rocky.

Lime Tree’s latest survey shows that the majority of hospitals (57%) still do not have new versions of their pricing data live, and many of those who have posted new files are still working through formatting and content issues.

There’s good reason for that. The CMS file formats (especially the comma-separated value flavors) are rather eccentric in structure and don’t lend themselves to being generated through a simple export from a hospital’s EHR or billing systems. CMS has published a copious amount of schema documentation, complete with sample files, generator utilities, and validators, but it’s not always clear exactly how to make real-word data conform to the layouts.

That said, not all hospitals have made full use of the resources that are out there. Some file creators did a little freelancing and invented unique variations on the format. Others seem to have made some basic if understandable mistakes when creating the first cut of their Version 2 files.

The end result? Machine readable files that aren’t easily readable by a machine.

This post will focus on the CMS side of the equation, digging into the nuances and eccentricities of the Version 2 file formats. In a follow-on post, we’ll look at common ways hospitals have stumbled when trying to generate files that meet those specifications.

The Tyranny of Choice

In Version 1.0 of the price transparency regulations, hospitals could post their prices in pretty much any format they wanted as long as it was “machine readable.” It could be Excel, XML, CSV, or any number of other formats, and it could contain whatever column or field names the hospital chose.

This flexibility made it easier for any one hospital to produce its files, but it wasn’t particularly helpful for someone trying to import files from many different hospitals and use them to compare prices across hospitals. As they say, if you’ve seen one Version 1 file, you’ve seen one.

To address those issues, CMS came out with a standardized “Version 2.0” format that would be the same for all hospitals. OK, technically it came out with two Version 2 formats—a comma-separated value (CSV) format and a JavaScript Object Notation (JSON format)—but that’s a big improvement over the thousands of variants that were out there for Version 1.

Many hospitals have decided to adopt the CSV format for their first cut at creating Version 2 files, which makes sense. For non-programmers, the format is much easier to work with than JSON. It’s easy to read, and you can open it in Excel and any number of other tools and see it nicely formatted into rows and columns.

The hospital pricing data sets, however, can be very large. Each entity must report prices for tens of thousands of procedures negotiated with potentially dozens of different insurance plans. Even small data sets can easily contain 500,000 distinct prices, while large files can have many millions.

To accommodate such variations in pricing, CMS ended up defining two flavors for the CSV file, dubbed “wide” and “tall.” (So, technically, there are three Version 2 formats.) In the wide format, each procedure gets its own row, and the standard charge and the prices negotiated with each insurance plan are listed in the columns for that row. In the tall format, each row contains the price negotiated with a single insurance plan for a single procedure.

Conceptually this is simple. In the “wide” format, each row would have multiple columns with negotiated prices, one for each insurance plan. Something like this:

In the tall format, you would have multiple lines for each procedure, and each line would contain the price negotiated with a specific payer. Something like this:

Pretty simple, right?

It’s Not That Simple

Hospital billing, it turns out, is a little more complicated than, say, a restaurant menu. For starters, an organization may charge one price for inpatient services and another for outpatient. It might require several different billing codes to determine the price—for instance, a combination of a HCPCS code identifying the procedure and a revenue code identifying the department in which it was performed. It might have negotiated different ways of paying with different insurers—a flat rate per procedure with one insurer, a percentage of the charges billed for another.

It gets complicated very quickly, and there are lots of different data elements that need to find a home in the various rows and column. Neither the tall nor the wide format handles all those elements perfectly, and the design decisions made by the CMS technical team didn’t help matters, either.

For example, some hospitals may need just a single billing code to identify a charge, while others might require four. A standard way to handle this in a CSV file would be to have multiple billing code type and code columns and let each hospital fills in as many as they need and leave the rest blank. Like this:

But that’s not how it’s handled in the CMS format. Instead, the number of billing columns can vary for each hospital—just use however many you need!

So, Hospital 1’s file might look like this:

And Hospital 2’s like this:

Things are getting complicated already.

Piping Up

You may have noticed the rather odd use of the pipe symbol in the column names above: “code|1”, “code|1|type,” etc. Why not just use “Code1” and “CodeType1”? We’re not sure, but the CMS standard is littered with pipes, and often they are used to encode data in very creative ways.

For instance, each hospital must include its license number and the issuing state in its price file. You might expect there to be two clearly-identified columns like this:

But that’s pretty boring. The CMS standard spices it up by cramming the data into a single column and sticking one piece of it in the header in the form of “license_number | [state]”:

We’re sure there’s a reason for this. We just don’t know what it is. We do know, though, that it means hospitals in Minnesota will have different column names than hospitals in South Carolina.

The encoding of data in headers is even more awkward when it identifies which insurance plan a particular price was negotiated with. This is pretty easy to do in the tall format, since there’s a row for each plan with consistent column names to identify it:

But how do you represent the plan in the “wide” format, where there is just one row per procedure and you need a column for each price that the hospital has negotiated with an insurance plan? You encode the plan identifiers in the column headers, of course, using easy-to-read formats like standard_charge | [payer_name] | [plan_name] | negotiated_dollar and standard_charge | [payer_name] | [plan_name] | negotiated_percentage.

That means if you have negotiated prices for 70 insurance plans, you’ll have 70 “negotiated_dollar” columns that look something like this:

You’ll also have 70 “negotiated_percent” columns, and 70 “negotiated_algorithm” columns, and . . . well, you get the picture.

Choose Your Poison

Taking all of this into consideration, which format is better, wide or tall? That’s hard to say. Because a hospital may not negotiate a price for every procedure with every insurance plan, those that go with the “wide” flavor end up with a CSV file with lots of empty columns and long column names that are hard to read.

The “tall” format, on the other hand, repeats the exact same data in the first dozen or so columns for every single insurance plan (the billing codes, the setting, the list and discounted prices). This makes for files that are hundreds of megabytes or even multiple gigabytes in size, making them impossible to open and edit on an ordinary laptop or desktop.

There are any number of tools and utilities out there that an analyst can use to create a data extract with rows and columns and export it to a CSV file. They can all make the first row of the file contain the headers for the columns. But encoding data within the headers themselves? That’s not, as they say, a common use case.

If you think the files are a joy to produce, try creating a program to read them. The comingling of labels and data within the header rows means you can’t just suck the file into a database table using the headers for column names, and you can’t depend on the position of each column either. For instance, the “setting” element (i.e. inpatient or outpatient) might be in column 9 for one hospital’s file and column 7 for another’s depending upon how many billing codes they use in their chargemasters.

The Version 2 specification, it seems, has managed to define a “standard” file layout that is different for each hospital producing it!

Oh, and did we mention the optional columns, like “financial_aid_policy” and “billing_class,” which can be included but don’t have to be? Or that there isn’t a single header row in the file but actually two separate header rows, one on line one and the other on line three? That’s something we’ve not seen before.

The CSV format, in short, is kind of a mess.

So Who’s Jason?

CSV files, with their structured rows and columns, simply aren’t well suited for hierarchical, multi-dimensional data like hospital pricing. The JSON format is by far the more sensible option for reporting such data.

Here’s the same set of price data used in the examples above represented in the JSON format:

Simple, clean, and relatively easy to read, plus the schema is the same for every hospital.

But JSON is far from perfect. The data labels (“payer_name”, “standard_charge_dollar”, etc.) are repeated for each and every price record, so the file sizes can really balloon. The larger a JSON file becomes, the more difficult it is to read it into memory and process it, at least on an ordinary computer.

Beyond that, generating JSON formatted files using typical reporting or data analysis tools isn’t exactly easy. As is suggested by the name, JavaScript Object Notation is a format created by and for software developers, and you really need to write object-oriented code to produce it.

Because of all the problems enumerated in the CSV sections above, Lime Tree recommends its clients choose the JSON format for their price transparency files. We realize, though, that it’s easier said than done.

All hospitals have IT resources that manage their networking and internal systems, and most have data-savvy analysts who can create reports and write SQL queries to extract data from databases. Most hospitals, though, aren’t writing custom software, so they don’t have a team of application developers on staff.

As the team at Lime Tree Health will attest, even experienced software developers will find writing and reading the CMS files to be a complex if quite amusing challenge.

Perhaps there will be a Version 3.0 for us to tackle soon?

Need help creating compliant Version 2 price transparency files or defining a repeatable process for posting them year after year? Want to get a peek at the prices other hospitals in your area are posting? Give the team at Lime Tree Health a shout.