My Experience as a Summer Data Scientist at Prodigal
None of us had anticipated a summer like this. While the world continues its battle with the pandemic, I try to look at the silver linings in these uncertain times. Not only did I have the opportunity to spend time at home with my family, but I also earned some brownie points for building my expertise in household chores.
I am Sumanyu Ghoshal, a Junior at the Department of Computer Science and Engineering at IIT Bombay. I am a Summer Data Scientist at Prodigal Tech. The past few months have been enriching as the lessons learned in these times are ones that I’ll carry with me for years to come.
Here are a few thoughts from my experience:
A solid internship offering strong exposure to software engineering and/or data science in the industry was my top priority. A well-established startup with an exciting team of engineers seemed like the perfect place to meet my summer objectives, and with this in mind, I started poring over everything at my disposal to find the right fit. In the process, I came across Shantanu's LinkedIn profile. That's how I was introduced to Prodigal.
Prodigal is a Y-Combinator backed startup, headquartered in Silicon Valley, which provides software intelligence to lenders and collection agencies to maximize their revenue yield and increase business productivity. Prodigal essentially leverages AI to bring efficiency in the debt collections industry.
The opportunity to work with an exceptional team of engineers and data scientists led by IITB CS alumni co-founders, Shantanu and Sangram, was incredibly alluring to me. I felt no hesitation in getting in touch with them to seek an internship, as I strongly felt that joining Prodigal would end up being a great learning experience.
Within a few weeks, I had an interview with Sangram, where I was apprised of the available opportunities and where I could fit in. By the end of the day, I received confirmation of my selection. An internship at the end of my sophomore year that could contribute to a product used in the real world was incredibly appealing. I happily accepted the offer.
By mid-March, the world had almost come to a halt with the onset of Covid-19. There definitely were questions in my mind about what the summer would look like; the human instinct to hope for the best didn't account for the possibility that we would be living with this health crisis for a long time. IITB's decision to prepone the summer vacations then stoked fears of outright losing this opportunity. I wrote an email to Neil, the chief of staff, asking whether the internship could be advanced, and within a day, they happily brought me on board!
I work on Prodigal's speech analytics R&D team, and occasionally participate in conversations with the data analysis team.
The project I primarily work on is Voice Biometrics, which aims to verify borrower identities during debt collection. As the name suggests, the idea is to verify, within 3-5 seconds of any sensitive phone call, whether the correct person is speaking, using only the speaker's voice. From the very beginning, I was allowed to take responsibility for getting the various subtasks done within reasonable deadlines.
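To make the idea concrete, here is a minimal sketch of similarity-based speaker verification, assuming we already have fixed-length speaker embeddings; the vectors and the 0.75 threshold below are made up for illustration, not taken from the actual model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, live: np.ndarray, threshold: float = 0.75) -> bool:
    """Accept the caller only if the live embedding is close enough to the enrolled one."""
    return cosine_similarity(enrolled, live) >= threshold

# Toy example with hypothetical 4-dimensional embeddings.
enrolled = np.array([0.9, 0.1, 0.3, 0.5])
same_speaker = np.array([0.85, 0.15, 0.25, 0.55])   # close to the enrolled voice
different = np.array([-0.4, 0.9, -0.1, 0.2])        # far from the enrolled voice

print(verify_speaker(enrolled, same_speaker))  # True
print(verify_speaker(enrolled, different))     # False
```

In practice the embeddings would come from the trained network rather than being hand-written, and the threshold would be tuned against real calls.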
The project involved the following phases:
- Research: The project began with reading research papers and shortlisting state-of-the-art deep learning models that could solve the problem at hand. Given the available speech data and the features we needed, we gravitated towards Siamese Convolutional Neural Networks, which generate a similarity score between a pair of images. The audio clips would be split into shorter samples, and we would create the 'images' using various feature extractors. I settled on Mel-Frequency Cepstral Coefficients (MFCCs) based on some research papers and my own experimentation.
- Preparation of the Dataset: This phase required going through the available data and figuring out ways to clean it. The training dataset was a proprietary set of audio samples from the LDC corpus. It also involved investigating whether filtering out cross-talk on phone recordings, i.e., one channel's recording containing speech and noise bleeding in from another channel, is a trivial task or not. Having been unaware of such issues before going through the dataset, this phase made me realise the constraints a real-world data science project faces and the kind of work that goes into the pre-training phase.
- Training the Model: This is the process of making the model 'learn' how to make the correct decisions. I trained several variants of the model and tried out adjustments and techniques commonly used to improve the accuracy of a CNN, including regularization, batch normalization, and dropout. Training was carried out on a GPU-based AWS virtual machine, and the process itself took plenty of time: I had to try out many hyperparameter combinations, different optimizers, and both exponentially decaying and cyclically varying learning rates to squeeze the best out of the model. With cross-entropy as the loss function to be optimized, I used stochastic optimizers such as stochastic gradient descent, Nesterov momentum, and Adam at different times.
- Deployment: This was a slight detour from the data-science side of the project. Deploying the model was a very different experience, as I got to explore Amazon Web Services products and try out both serverless and server-based product development, something a CS undergrad wouldn't typically be exposed to. Server-based deployments are common, since they only require an actively running computer connected to the internet. Serverless deployments, however, were new to me: they are a newer offering from various cloud-services companies and are becoming popular thanks to how inexpensive a serverless deployment is to maintain. I explored AWS Lambda, API Gateway, and the use of S3 buckets and DynamoDB for serverless deployments.
- Testing: After deploying the model, we tested the interim product on client data and identified the improvements needed, such as the kinds of data missing from the training process and a few more pre-processing features to add before feeding data to the model. This gave a direction for the work still needed to take the project into production.
- Client-Wise Fine-Tuning: I am currently fine-tuning the model for each client on that client's data using transfer-learning methods, to make the model as accurate as possible for them.
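The decaying and cyclical learning-rate schedules mentioned in the training phase can be sketched as follows; the hyperparameters here are illustrative defaults, not the values used in the project:

```python
def exponential_decay(step, base_lr=0.1, decay_rate=0.96, decay_steps=100):
    """Shrink the learning rate smoothly as training progresses."""
    return base_lr * decay_rate ** (step / decay_steps)

def cyclical_lr(step, min_lr=1e-4, max_lr=0.1, cycle_steps=200):
    """Triangular cyclical schedule: rise from min_lr to max_lr, then fall back."""
    cycle_pos = (step % cycle_steps) / cycle_steps   # position within the current cycle, in [0, 1)
    triangle = 1.0 - abs(2.0 * cycle_pos - 1.0)      # goes 0 -> 1 -> 0 over one cycle
    return min_lr + (max_lr - min_lr) * triangle
```

The schedule function is simply evaluated at every optimizer step; exponential decay helps the model settle into a minimum, while the cyclical variant periodically raises the rate to escape poor local optima.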
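As an illustration of the serverless pattern from the deployment phase, here is a minimal AWS Lambda-style handler sitting behind API Gateway. The field names, the threshold, and the stubbed scoring function are all hypothetical; a real deployment would load the model artifacts from S3 and log results to DynamoDB:

```python
import json

def score_call(call_id: str) -> float:
    """Placeholder for actual model inference (hypothetical fixed score)."""
    return 0.92

def lambda_handler(event, context):
    """Entry point invoked by API Gateway with a JSON request body."""
    body = json.loads(event.get("body", "{}"))
    call_id = body.get("call_id")
    if call_id is None:
        return {"statusCode": 400, "body": json.dumps({"error": "call_id required"})}
    score = score_call(call_id)
    return {
        "statusCode": 200,
        "body": json.dumps({"call_id": call_id, "verified": score >= 0.75, "score": score}),
    }

# Local invocation with a fake API Gateway proxy event:
response = lambda_handler({"body": json.dumps({"call_id": "abc-123"})}, None)
print(response["statusCode"])  # 200
```

The appeal of this setup is that the function only runs (and only bills) while a request is in flight, so there is no always-on server to maintain.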
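The client-wise fine-tuning step relies on the core idea of transfer learning: keep the pretrained layers frozen and update only a small head on each client's data. A toy numpy sketch of that idea follows; the network, dimensions, and data are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this two-layer network was pretrained on the base dataset:
W1 = rng.normal(size=(8, 4))      # feature extractor -- frozen during fine-tuning
W2 = rng.normal(size=(4, 1))      # small head -- the only part we fine-tune

def forward(X):
    return np.tanh(X @ W1) @ W2   # frozen features feeding a trainable head

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

# Toy "client" data for the fine-tuning step.
X = rng.normal(size=(32, 8))
y = (X[:, :1] > 0).astype(float)

loss_before = mse(forward(X), y)
for _ in range(200):
    h = np.tanh(X @ W1)                      # features are recomputed but W1 never changes
    grad_W2 = h.T @ (h @ W2 - y) / len(X)    # MSE gradient w.r.t. the head only
    W2 -= 0.1 * grad_W2
loss_after = mse(forward(X), y)
```

Because only the head is updated, each client's model stays cheap to train and cannot drift far from the shared pretrained representation.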
I give presentations on a fortnightly basis, which anyone from the company can join to learn more about the project, ask questions about the progress, and brainstorm different cases and issues. These presentations have kept me working on the project with conviction, and they have been a great source of direction on how to proceed.
Working From Home
While this was my first internship, I could immediately see the differences that come with working from home: the hours were extremely flexible, conversations with the R&D team took place primarily on Slack, and all presentations were virtual. The R&D team and Sangram were very responsive and would answer my queries at whatever time of day I posted them.
We have monthly all-hands meetings with the entire US and India team, and a lot of fun surveys too, just to gauge team morale. Working from home has definitely been strenuous, but I really appreciate that the organization is trying its best to keep morale high.
My end goals have been met while working at Prodigal: exposure to the industry and a substantial project at hand. I would definitely have enjoyed working at an office, but given the situation we are all in, I feel blessed with this opportunity. From working on the technical aspects of a data science project to presenting my work in a way everyone can understand, this has been a great experience!
This blog was originally written by Sumanyu Ghoshal who interned as a Data Scientist with Prodigal over the summer of 2020.