
Engineering Prodigal's redaction models

Heads up: This is a post from our awesome engineers, so it gets a bit technical. 
Just looking for a quick overview of how our redaction model works? Hop right over here. Otherwise, read on!

Collection agencies record thousands of customer interactions - or more - every single day for quality assurance and audit purposes. And every single call requires verification.

Agents need to verify that the person they’re speaking with is the right party before they disclose any debt details — even the fact that a debt exists.

Once verification occurs, if the debtor agrees to pay, they’ll need to share important bank account or card details. No big deal, except that this information is then stored in the call recordings. 

Handling and storing personally identifiable information (PII) and payment card information (PCI) securely and safely is a top priority for any agency. Any data leak or loss opens the agency and its clients up to legal risk. Perhaps most importantly, a leak or oversight violates consumer trust.

So, what’s the solution? A reliable and accurate redaction service to protect data security. Prodigal has built a state-of-the-art Redaction AI model to solve this problem.


What is redaction?

Redaction is the process of editing any piece of data, such as text, audio, video, etc., to conceal or remove information deemed confidential. In the context of Prodigal’s task, it means removing PCI and/or PII information from call recordings and transcriptions.

So, what do PCI and PII consist of? You can find the broad definitions here.

For our modeling purposes, we support the following entities:

Solving a Redaction Problem

Redaction is essentially a task that falls under the umbrella of the Named Entity Recognition (NER) problem in machine learning.

This is a problem where we model the probability of a word belonging to one of N labels. Each distinct label is called an entity. We have as many entities as the types of distinct information we are trying to extract/redact. 
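To make that concrete, here is a toy sketch of the idea, with a made-up label set and made-up numbers (not Prodigal's actual entities): each token gets one logit per label, softmax turns those into a probability distribution over the N labels, and the argmax tells us whether the token should be redacted.

```python
import torch
import torch.nn.functional as F

# Hypothetical label set and logits, purely for illustration.
labels = ["O", "SSN", "CARD_NUMBER", "PHONE_NUMBER"]        # N = 4 labels
tokens = ["my", "123-45-6789"]
logits = torch.tensor([[4.1, 0.2, 0.3, 0.1],                # "my"          -> O
                       [0.3, 5.2, 0.9, 0.4]])               # "123-45-6789" -> SSN

probs = F.softmax(logits, dim=-1)                           # probability over the N labels
for token, p in zip(tokens, probs):
    print(token, labels[int(p.argmax())], round(float(p.max()), 2))
```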

Even before NER became popular, people were solving the redaction problem using regular expressions. Solving the problem with regular expressions presents its own problem: it is fairly hard to constrain the regex boundary to only the entity you want to capture, so you often end up redacting more than is needed. Regexes were also limited to entities that had a clear keyword marking the start or end of the entity. Such a method could not properly leverage the context of the conversation to redact well.
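Here is a toy example of that over-redaction problem (a naive pattern written just for this post, not anything used in production): a regex that hunts for long digit runs swallows case IDs right along with card numbers.

```python
import re

# Naive "card number" pattern: any run of 13-16 digits, optionally separated
# by spaces or dashes. It cannot tell payment data apart from other long numbers.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

text = "my card number is 4111 1111 1111 1111 and my case id is 2024 0001 2345 678"
print(CARD_LIKE.sub("[REDACTED]", text))
# Both numbers get redacted, even though only the first one is payment data.
```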

After regexes, bidirectional long short-term memory models (Bi-LSTMs) became the popular choice for NER tasks. They are effective at reducing over-redaction and at using conversational context to generate high-quality labels. But computation in Bi-LSTMs is sequential in nature, which makes them hard to accelerate on GPUs; that makes training lengthy and inference slow. Additionally, Bi-LSTMs cannot use long-range context as effectively, due to signal degradation.
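For reference, a tagger in this family looks roughly like the minimal PyTorch sketch below (illustrative only, not Prodigal's model); the step-by-step recurrence inside the LSTM is exactly what makes it hard to parallelize.

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bidirectional LSTM tagger sketch for per-token entity labels."""

    def __init__(self, vocab_size, num_labels, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The LSTM walks the sequence one step at a time in each direction,
        # so the computation is inherently sequential.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, emb_dim)
        hidden, _ = self.lstm(x)         # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(hidden)   # per-token logits over entity labels
```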

Transformers have sprung up more recently, giving us state-of-the-art results in many NLP applications. They are highly parallelizable and can attend to the entire input context at once, solving the major problems with Bi-LSTMs.

How Does Prodigal Solve a Redaction Problem?

Prodigal uses BERT, a particular flavor of Transformer, for its NER modeling task.

Modeling Approach: BERT

The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. It’s a bidirectional Transformer pre-trained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia. Pre-training a BERT model is a costly and lengthy affair — we thank Hugging Face for their easy-to-use API, which helped us access a pre-trained BERT model easily.
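With that API, grabbing a pre-trained BERT checkpoint and putting a token classification head on top takes only a few lines. The label set below is a hypothetical stand-in, since the full entity list isn't reproduced in this post.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set, for illustration only.
labels = ["O", "B-SSN", "I-SSN", "B-CARD_NUMBER", "I-CARD_NUMBER"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```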

Data

Once we had established which model we'd use, our in-house Prodigal annotation team got to work completing the annotations.

  • Our data consists of transcriptions of agent and borrower conversations. 
  • We used annotation best practices to weed out any inefficiencies in our data annotation process. 
  • Our starting set of examples was annotated by every annotator, we automated the process of highlighting examples where annotators disagreed with each other, and we made sure those examples were properly explained and covered in the annotation guidelines. 
  • Annotations were created on a transformed form of the transcription, which ensured no PCI/PII data was exposed to the annotator — and yet we get the annotation output we require for the modeling task. 
  • We further chunked the transcriptions into 400-word, contiguous segments so that the model doesn’t hit the maximum token length limit during the modeling run. 
  • Post-processing of received annotations was done to transform back annotated entities to the original transcript. 
  • Finally, data was split into train/eval/test sets, tokenization was performed, and token-label length misalignment issues (an effect of using a BERT-style subword tokenizer) were resolved (a sketch of the chunking and alignment steps follows this list). 
  • The data loader was prepared and we were set for training.
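The chunking and token-label alignment steps might look roughly like the sketch below (an illustration of the idea using the Hugging Face tokenizer, not Prodigal's exact pipeline):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk(words, size=400):
    """Split a transcript into contiguous <=400-word segments so that no chunk
    exceeds BERT's maximum sequence length after tokenization."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def encode_with_labels(words, word_labels):
    """WordPiece can split one word into several tokens, so word-level labels
    no longer line up with token-level inputs; re-expand them via word_ids()."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned = []
    for word_id in enc.word_ids():
        # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them.
        aligned.append(-100 if word_id is None else word_labels[word_id])
    return enc, aligned
```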

Training involved a set of steps, which are depicted in the diagram below.

Modeling

We fine-tuned the pre-trained BERT model on the token classification task with a log loss minimization objective. Masked language modeling pre-training is a good fit for the downstream NER task, where tokens can go missing due to transcription engine errors and the model needs to make the best use of the available context to make the right prediction. Other modeling approaches we tried (spaCy, DistilBERT) were not as accurate as BERT.
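As a sketch of what that fine-tuning step could look like with the Hugging Face Trainer, assuming the model and tokenizer from the earlier snippet and hypothetical train_ds / eval_ds datasets built from the annotated chunks:

```python
from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification

# Hypothetical training configuration, for illustration only.
args = TrainingArguments(
    output_dir="redaction-bert",
    learning_rate=3e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",   # evaluate on the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the lowest eval loss
)

trainer = Trainer(
    model=model,                   # token classification model from earlier
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```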

Metrics and Performance

The validation set loss was used as the objective criterion to pick the best-performing checkpoint of our model. We performed token-level evaluation for this modeling exercise.
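Token-level evaluation just compares predicted and true label IDs position by position; a minimal sketch (skipping the -100 positions assigned to special tokens) might look like this:

```python
import numpy as np

def token_level_metrics(pred_label_ids, true_label_ids):
    """Per-entity precision and recall computed over individual tokens."""
    preds = np.concatenate(pred_label_ids)
    trues = np.concatenate(true_label_ids)
    mask = trues != -100                       # drop special-token positions
    preds, trues = preds[mask], trues[mask]

    metrics = {}
    for label in np.unique(trues):
        tp = np.sum((preds == label) & (trues == label))
        fp = np.sum((preds == label) & (trues != label))
        fn = np.sum((preds != label) & (trues == label))
        metrics[int(label)] = {
            "precision": float(tp / (tp + fp)) if tp + fp else 0.0,
            "recall": float(tp / (tp + fn)) if tp + fn else 0.0,
        }
    return metrics
```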

Given Prodigal’s focus on the consumer finance vertical and our Prodigal AI Intent Engine design, we handily beat out Amazon’s redaction models in accuracy.

Prediction Examples

Here's a (fictional) sample call so you can see how the redaction model works.

We've highlighted the pieces of information our AI redacts and tagged each one with the category that requires redaction. The numbers you see after those tags are our model's confidence in its accuracy - 1.0 represents total confidence.
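Under the hood, that confidence is simply the per-token softmax probability of the predicted label. A rough sketch of an inference pass, reusing the hypothetical model, tokenizer, and label set from the snippets above:

```python
import torch

def redact_spans(text, model, tokenizer, id2label, threshold=0.5):
    """Return (token, entity, confidence) for every token predicted as PII/PCI."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits                 # (1, seq_len, num_labels)
    probs = logits.softmax(dim=-1)[0]
    confidences, preds = probs.max(dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])

    spans = []
    for token, label_id, conf in zip(tokens, preds, confidences):
        label = id2label[int(label_id)]
        if label != "O" and float(conf) >= threshold:
            spans.append((token, label, round(float(conf), 2)))
    return spans
```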

But what if there are errors in the transcription?

Changes made to introduce noise:

  • social → vocal
  • phone number missed
  • card → car
  • CVV code missed

As expected, you will observe the confidence scores for those NER labels dropping as important context is removed or altered. This also emphasizes how important context is for the model to make the right predictions.

  • SSN (0.99 → 0.39)
  • phone number (1.0 → 0.96)
  • card number (1.0 → 0.91)
  • CVV code (1.0 → 0.99)

What’s Next for Prodigal Redaction?

While we are outperforming some of the market leaders in redaction capabilities today, there is room to grow.

Want to join us as we build the intelligence layer of consumer finance? We're hiring!

"We found Prodigal while looking for solutions to reinforce our call recording safeguards and further protect our customers’ personal information.
We looked at multiple vendors but ultimately trusted Prodigal and their industry-trained AI models to get the job done. They were great to work with and further customized their outputs to meet our specific needs. We’d recommend them to any team looking to effectively protect their consumer data and strengthen compliance." -VP of InfoSec, Policygenius