What does the following paragraph mean to you, and how did you come to understand it?

I’ve been seeing my current dentist (Joseph Smith) for almost three years. When I saw him in January, I noticed that he put that my visit was 30 minutes longer than it actually was while also adding two X-rays (which I never received). Additionally, he claimed to have extracted my kid’s baby teeth even though they lost them naturally. He’s also added extra visit time to their claims too.

Most healthcare fraud units receive complaints like this about providers every day, and their staff must decide on an appropriate plan of action for each one. This one looks like a fairly blatant case of fraud: Joseph Smith is billing for services not rendered, so the fraud unit should likely open an investigation. As a data scientist, I see this manual routing as a potential bottleneck in their workflow. Could we teach a computer to do the same thing?

Literacy quickly becomes second nature after we first learn to read. We often fail to appreciate how complex comprehension really is (at least until we try to learn a second language). Many sentences have similar meanings despite differences in structure and word choice. Take a rephrasing of the example complaint:

My dentist, Joseph Smith, has been adding additional charges to my routine check-ups. He’s been billing X-rays, which I never received. I think he's been adding more time of care than I actually received as well; I can't remember ever seeing him for more than 30 minutes. I’ve been seeing him for a while and only noticed this after my most recent appointment this January. My children that see him have claims with the same extra procedures that I don’t remember happening and have been billed for primary teeth extractions which never happened as the teeth fell out naturally.

We immediately see commas, rather than parentheses, setting off the provider’s name, and the extra care and the added X-rays split into separate sentences (one of which uses a semicolon). Both changes substantially alter the sentence structure while retaining most of the meaning. Aside from the structural differences, the word “while” is used as a conjunction (“while also adding”) in the first paragraph and as a noun (“for a while”) in the second. These two usages differ in meaning as well as part of speech. Despite this complexity, most native English speakers would understand both paragraphs easily and would likely recognize them as describing the same scenario. This is not a given for most natural language processing (NLP) models.

Understanding unstructured text data is still a largely unsolved problem due to the inherent ambiguity in natural language, but we’ve come a long way in the last 20 years. This particular case (deciding to open an investigation based on a complaint) is largely solvable with Alivia’s current approaches, which leverage contemporary deep learning methods. Deep learning with word vectors first translates each word into a list of numbers, such that words with similar meanings have similar numbers at the same positions in the list. Machine learning is all math at the end of the day, so this translation is what allows the algorithm to work with the semantic content of each word. We then treat the word vectors as a sequence and process them one after another, because word order matters. “The human walked their dog,” for example, means something completely different than “The dog walked their human.” After processing the entire complaint, the algorithm predicts whether the complaint should be followed up with an investigation.
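To make the two ideas above concrete, here is a minimal Python sketch. It is not Alivia’s model: the three-number word vectors are hand-written for illustration (real vectors are learned from data and are much longer), and the `encode` function is a deliberately simplified recurrent update with hypothetical, untrained weights. It only shows that similar words can sit close together as numbers, and that reading vectors in sequence makes word order matter.

```python
import math

# Hypothetical, hand-written word vectors (assumption: illustration only).
# Real word vectors are learned from large amounts of text.
VECTORS = {
    "the":     [0.1, 0.0, 0.0],
    "their":   [0.1, 0.1, 0.0],
    "human":   [0.9, 0.1, 0.0],
    "dog":     [0.1, 0.9, 0.0],
    "walked":  [0.0, 0.0, 1.0],
    "baby":    [0.7, 0.2, 0.1],
    "primary": [0.6, 0.3, 0.1],
}


def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def encode(sentence):
    """Read the words in order, folding each vector into a running state.

    Simplified recurrent step: new_state = tanh(0.5 * old_state + word_vector).
    Earlier words are progressively discounted, so the final state depends
    on the order in which the words arrived.
    """
    state = [0.0, 0.0, 0.0]
    for word in sentence.lower().split():
        vec = VECTORS[word]
        state = [math.tanh(0.5 * s + x) for s, x in zip(state, vec)]
    return state


# Words used in similar ways get similar vectors in this toy vocabulary.
sim_close = cosine(VECTORS["baby"], VECTORS["primary"])
sim_far = cosine(VECTORS["baby"], VECTORS["walked"])
print("baby vs primary:", round(sim_close, 3))
print("baby vs walked: ", round(sim_far, 3))

# Same words, different order, different final state.
state_a = encode("the human walked their dog")
state_b = encode("the dog walked their human")
print("human walked dog:", [round(v, 3) for v in state_a])
print("dog walked human:", [round(v, 3) for v in state_b])
```

In a trained model, a final layer would map the last state to a probability that the complaint warrants an investigation; that step is left out here because it only makes sense with learned weights.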

More naïve solutions would look at word frequencies (referred to within the data science community as “Bag of Words”). Word frequencies are simply each word’s share of the words in a given text, so words like “the” and “they” are often very common, while words like “fraud” and “upcoding” are generally uncommon. This approach ignores word order entirely and is a less robust measure of meaning. While “primary teeth” and “baby teeth” mean exactly the same thing in the two examples, word frequencies would completely disregard this association. An ideal complaint that mentions “upcoding,” “fraud,” and “unbundling” might be caught with this technique, but with many patients struggling to even understand how their coinsurance works, the likelihood that they are familiar with terminology like upcoding and unbundling is slim to none.
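The limits of word frequencies are easy to see in a short Python sketch. The tokenizer, the `word_frequencies` helper, and the `FRAUD_TERMS` keyword list below are assumptions made for illustration, not part of any production system; the two snippets are shortened from the example complaints.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return each word's share of the total word count; order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Shortened snippets from the two example complaints.
first = "he claimed to have extracted my kid's baby teeth"
second = "billed for primary teeth extractions which never happened"

freq_first = word_frequencies(first)
freq_second = word_frequencies(second)

# The only overlap is the literal word "teeth"; "baby" and "primary"
# are treated as unrelated, as are "extracted" and "extractions".
shared = set(freq_first) & set(freq_second)
print("shared words:", shared)

# Hypothetical keyword list: neither complaint uses billing jargon,
# so a keyword match flags neither of them.
FRAUD_TERMS = {"upcoding", "fraud", "unbundling"}
flagged_first = bool(FRAUD_TERMS & set(freq_first))
flagged_second = bool(FRAUD_TERMS & set(freq_second))
print("flagged:", flagged_first, flagged_second)

# Word order is invisible: both sentences produce identical frequencies.
same_bag = (word_frequencies("the human walked their dog")
            == word_frequencies("the dog walked their human"))
print("identical bags:", same_bag)
```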

The healthcare system and text analysis are both complicated; you need a model that can handle both and make the right prediction. As our model continues to train on your data, it’ll get even better over time, freeing up your employees to dive into the medical records and conduct audits instead of reading complaints. Ask for a demo today!