Tech Team Review: Prodigal and Redaction
Collection agencies record thousands, even millions, of calls every day for quality assurance and audit purposes. And every one of those calls requires verification.
Agents need to verify that the person they’re speaking with is the right party before they disclose any debt details — even the fact that a debt exists. Once verification occurs, if the debtor agrees to pay, they’ll need to share sensitive bank account or card details. No big deal, except that this information is then stored in the call recordings.
Handling and storing personally identifiable information (PII) and payment card information (PCI) securely and safely is a top priority for an agency. Any data leak or loss opens the agency and its clients up to legal risk. Perhaps most importantly, a leak or oversight violates consumer trust.
So, what’s the solution? A reliable and accurate redaction service to protect data security. Prodigal has built a state-of-the-art Redaction AI model to solve this problem.
What is Redaction?
Redaction is the process of editing a piece of data, such as text, audio, or video, to conceal or remove information deemed confidential. In the context of Prodigal’s task, it means removing PII and/or PCI from call recordings and transcriptions.
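As a toy illustration of what redaction does (the function, spans, and label names here are invented for this post, not Prodigal’s actual API), redacting a transcript given predicted entity spans amounts to replacing each span with a placeholder:

```python
# Toy sketch: replace predicted PII/PCI spans with placeholders.
# Spans are (start, end, label) character offsets into the transcript.

def redact(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])   # keep text up to the span
        out.append(f"[{label}]")         # substitute the placeholder
        cursor = end
    out.append(text[cursor:])            # keep the tail
    return "".join(out)

transcript = "My card number is 4111 1111 1111 1111, thanks."
spans = [(18, 37, "CARD_NUMBER")]
print(redact(transcript, spans))
# -> My card number is [CARD_NUMBER], thanks.
```

The hard part, of course, is not the substitution but producing the spans accurately — which is where the modeling discussion below comes in.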
So, what do PII and PCI consist of? You can find the broad definitions here.
For our modeling purposes, we have limited our coverage to the entities listed below.
Solving a Redaction Problem
Redaction is essentially a task that falls under the umbrella of the Named Entity Recognition (NER) problem in machine learning.
This is a problem where we model the probability of each word belonging to one of N labels. Each distinct label is called an entity; we have as many entities as there are distinct types of information we are trying to extract or redact.
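Concretely, in the common BIO scheme for NER, each token receives one of the N labels: `B-` marks the beginning of an entity, `I-` its continuation, and `O` a token outside any entity. A small illustration (the label names are invented for this example):

```python
# Illustrative BIO tagging for NER. The model assigns each token a
# probability distribution over the N labels; taking the argmax per
# token yields a sequence like the one below.
tokens = ["my", "account", "number", "is", "four", "five", "six", "seven"]
labels = ["O",  "O",       "O",      "O",  "B-ACC", "I-ACC", "I-ACC", "I-ACC"]

# Everything not labeled "O" is part of an entity to redact:
entities = [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]
print(entities)
```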
Even before NER became popular, people were solving the redaction problem with regular expressions. Regular expressions present their own difficulties: it is hard to constrain the regex boundary to only the entity you want to capture, so you often redact more than needed. Regexes were also limited to entities with a distinctive keyword marking their start or end. Such a method cannot properly leverage the context of the conversation to redact well.
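A small demonstration of the boundary problem (the pattern and sentences are invented for illustration): a regex that looks for long digit runs cannot tell which kind of number it is looking at, because it has no access to context.

```python
import re

# Naive pattern: any run of 9-19 digits, optionally separated by spaces
# or dashes, is treated as a card/bank number.
digits = re.compile(r"\b\d(?:[ -]?\d){8,18}\b")

ok = "my card is 4111 1111 1111 1111"
print(digits.sub("[REDACTED]", ok))   # redacts the card number - good

bad = "your collections account 1002003004 shows a balance"
print(digits.sub("[REDACTED]", bad))  # over-redacts: the regex cannot use
# context to tell a collections account number (not PCI) from a bank
# account number
```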
After regexes, bidirectional long short-term memory models (Bi-LSTMs) became the popular choice for NER tasks. They are effective at reducing over-redaction and at using conversational context to generate high-quality labels. But computation in Bi-LSTMs is sequential in nature, which makes them hard to accelerate on GPUs; training is lengthy and inference slow. Additionally, Bi-LSTMs cannot use long-range context as effectively, due to signal degradation over long sequences.
Transformers have since given us state-of-the-art results in many NLP applications. They are highly parallelizable and can attend to the entire input context at once, addressing both major problems with Bi-LSTMs.
How Does Prodigal Solve a Redaction Problem?
Prodigal uses BERT, a bidirectional Transformer encoder, for its NER modeling task.
Modeling Approach: BERT
The BERT model was proposed in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It’s a bidirectional Transformer pre-trained using a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia. Pre-training a BERT model is a costly and lengthy affair — we thank Hugging Face for their easy-to-use API, which gave us easy access to a pre-trained BERT model.
Having established which model we’d use, we had our in-house Prodigal annotation team complete the annotations.
- Our data consists of transcriptions of agent and borrower conversations.
- We used annotation best practices to weed out any inefficiencies in our data annotation process.
- Our starting set of examples was annotated by all annotators. We automated the process of highlighting examples where annotators disagreed, and made sure those examples were properly explained and covered in the annotation guidelines.
- Annotations were created on a transformed form of the transcription, which ensured no PCI/PII data was exposed to the annotator while still producing the annotation output we require for the modeling task.
- We further chunked the transcriptions into 400-word, contiguous segments so that the model doesn’t hit the maximum token length limit during the modeling run.
- Post-processing of the received annotations mapped the annotated entities back onto the original transcript.
- Finally, data was split into train/eval/test sets, tokenization was performed, and token-label length misalignment issues (an effect of using a BERT-based subword tokenizer) were resolved.
- The data loader was prepared and we were set for training.
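The token-label misalignment mentioned above arises because a WordPiece-style tokenizer can split one word (and hence one label) into several subword tokens. A minimal sketch of the usual fix — label only the first piece and mask continuations with the conventional ignore index `-100` — where `split_word` is a toy stand-in for a real WordPiece tokenizer:

```python
# Sketch of label alignment after subword tokenization. split_word is a
# toy stand-in for a real WordPiece tokenizer; -100 is the conventional
# "ignore" index so the loss skips subword continuations.

def split_word(word):
    # toy rule: pretend any word longer than 4 chars splits into two pieces
    if len(word) > 4:
        return [word[:4], "##" + word[4:]]
    return [word]

def align_labels(words, labels, ignore_index=-100):
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = split_word(word)
        tokens.extend(pieces)
        # label only the first piece; mask the continuations
        aligned.extend([label] + [ignore_index] * (len(pieces) - 1))
    return tokens, aligned

words = ["routing", "number", "is", "021000021"]
labels = [0, 0, 0, 5]  # 5 = B-BANK_ACCOUNT in a toy label map
tokens, aligned = align_labels(words, labels)
print(tokens)   # ['rout', '##ing', 'numb', '##er', 'is', '0210', '##00021']
print(aligned)  # [0, -100, 0, -100, 0, 5, -100]
```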
Training involved a set of steps which are depicted in the below diagram.
We fine-tuned the pre-trained BERT model on a token classification task with a log loss minimization objective. Masked language modeling pre-training suits the downstream NER task particularly well, since tokens can go missing due to faults in the transcription engine; the model needs to make the best use of the available context to make the right prediction. Other modeling approaches we tried (spaCy, DistilBERT) were not as accurate as BERT.
Metrics and Performance
The validation set loss was used as the objective criterion to pick the best-performing checkpoint of our model. We performed token-level evaluation for our modeling exercise. Since AWS Comprehend also offers a redaction service, we benchmarked its performance on our test set as well.
Both models reached around 98% token-level accuracy. We do specifically better than Comprehend on the PCI set of entities. Comprehend shows slightly better recall on bank account numbers, but at low precision for this entity: it redacted many collections account numbers as bank account numbers — something we definitely do not want. Our model, on the other hand, gives better precision for bank account numbers.
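Token-level precision and recall per entity, as used in this comparison, can be computed directly from predicted and gold label sequences. A minimal sketch (label names and sequences are illustrative) of the bank-account failure mode described above — tagging collections account numbers as bank accounts keeps recall high but drops precision:

```python
# Token-level precision/recall for one entity label.

def precision_recall(pred, gold, label):
    tp = sum(p == label and g == label for p, g in zip(pred, gold))
    fp = sum(p == label and g != label for p, g in zip(pred, gold))
    fn = sum(p != label and g == label for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Gold: tokens 1-2 are a real bank account number; tokens 4-5 are a
# collections account number (labeled "O" for this entity).
gold = ["O", "BANK_ACCT", "BANK_ACCT", "O", "O", "O"]
# An over-eager model also tags the collections account number:
pred = ["O", "BANK_ACCT", "BANK_ACCT", "O", "BANK_ACCT", "BANK_ACCT"]
p, r = precision_recall(pred, gold, "BANK_ACCT")
print(p, r)  # -> 0.5 1.0 : perfect recall, poor precision
```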
Note: example is completely artificial and has been used just to demonstrate the output from our model.
Oftentimes, a transcription service makes mistakes in transcribing a call. Our model proves robust here as well.
Changes made to introduce noise:
- social → vocal
- phone number missed
- card → car
As expected, you will observe the confidence scores for those NER labels drop as important context is removed or altered. This also emphasizes how important context is for the model to make the right predictions.
- phone number (1 → 0.96)
- card number (1 → 0.91)
What’s Next for Prodigal Redaction?
While we are outperforming some of the market leaders in redaction capabilities today, there is room to grow. Below are the directions we’re actively pursuing.
- Audio-only redaction — performing information extraction directly on audio signals without transcribing them to text. This helps circumvent errors caused by transcription inaccuracies.
- Noise-aware dialog reasoning — models specifically tuned to understand the noise caused by transcription errors, using a joint reasoning task to disambiguate context and tag edge cases efficiently.
Are you interested in learning more about our redaction process? Do you want to help solve complex and relevant problems like these? We want to hear from you!