Text Summarization
Monitoring a text summarization model with UpTrain
Overview: In this example, we will see how to use UpTrain to monitor the performance of a text summarization task in NLP. Summarization creates a shorter version of a document or an article that captures all the important information. For this, we will use a pretrained text summarization model (with the T5 architecture) from Hugging Face. This model was trained on the billsum dataset.
Why is monitoring needed: Monitoring NLP tasks with traditional metrics (such as accuracy) in production is hard, as ground truth is unavailable (or extremely delayed when there is a human in the loop). Hence, it becomes very important to develop real-time monitoring techniques for tasks such as text summarization before important business metrics (such as customer satisfaction and revenue) are affected.
Problem: In this example, the model was trained on the billsum dataset. This dataset contains US Congressional and California state bills along with their summaries. However, in production, we append some samples from the wikihow dataset. WikiHow is a large-scale dataset built from the online WikiHow knowledge base. As you can imagine, the two datasets are quite different. It would be interesting to see how the text summarization task performs in production.
Solution: We will be using the UpTrain framework, which provides an easy-to-configure way to log training data, production data and the model's predictions. We apply several techniques on this logged data, such as clustering, data drift detection and customized signals, to monitor performance and raise alerts in case of any dip in the model's performance.
Install Required packages
PyTorch: Deep learning framework.
Hugging Face Transformers: To use pretrained state-of-the-art models.
Hugging Face Datasets: To load public Hugging Face datasets.
NLTK: For sentiment analysis.
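A minimal setup, assuming the standard PyPI package names (plus uptrain itself), looks like:
pip install torch transformers datasets nltk uptrain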
Step 1: Setup - Defining model and datasets
Define model and tokenizer for the summarization task
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# T5 tokenizer and model for summarization; T5 expects a task prefix on the input
tokenizer_t5 = AutoTokenizer.from_pretrained("t5-small")
model_t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
prefix = "summarize: "
Load the Billsum dataset from Hugging Face, which was used to train our model
from datasets import load_dataset

# Load the California test split and drop any rows with missing article text
billsum_dataset = load_dataset("billsum", split="ca_test").filter(lambda x: x['text'] is not None)
billsum = billsum_dataset.train_test_split(test_size=0.2)
billsum
Download the wikihow dataset
Create a small test dataset from the Wikihow dataset to test our summarization model. Download the wikihow dataset from https://ucsb.app.box.com/s/ap23l8gafpezf4tq3wapr6u8241zz358 and save it as 'wikihowAll.csv' in the current directory.
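As a sketch (the column names used below are assumptions about wikihowAll.csv and may need adjusting), the file can be loaded and lightly cleaned with pandas:
import pandas as pd

# Load the downloaded WikiHow CSV; the 'text' column name is an assumption
wikihow_df = pd.read_csv("wikihowAll.csv")
wikihow_df = wikihow_df.dropna(subset=["text"])
# Keep a small sample to act as out-of-distribution production data
wikihow_sample = wikihow_df.sample(n=100, random_state=0)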
Create a test dataset by combining billsum and wikihow datasets
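One way to build such a combined test set (a minimal sketch; the WikiHow field names follow the assumptions above) is to pool the raw article texts from both sources:
# Pool article texts from the billsum test split and the WikiHow sample
test_texts = list(billsum["test"]["text"]) + list(wikihow_sample["text"])
# Keep track of the source of each sample for later analysis
test_sources = ["billsum"] * len(billsum["test"]) + ["wikihow"] * len(wikihow_sample)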
Let's try out our model on one of the samples
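A minimal sketch of generating a summary for a single article with the T5 model defined above (the generation parameters are illustrative):
# Summarize one sample article with the pretrained T5 model
sample_text = test_texts[0]
inputs = tokenizer_t5(prefix + sample_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model_t5.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer_t5.decode(summary_ids[0], skip_special_tokens=True))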
Using embeddings for model monitoring
To compare the two datasets, we will utilize text embeddings (generated by a BERT-based encoder). As we will see below, the two datasets separate clearly in the embedding space, which makes embeddings an important metric for tracking drift.
Save BERT embeddings for the training data
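One way to compute such embeddings is with sentence-transformers (an extra dependency assumed here; the model "all-MiniLM-L6-v2", which produces 384-dimensional embeddings, is also an assumption consistent with the dimensions used later):
from sentence_transformers import SentenceTransformer

# Encode texts into 384-dim sentence embeddings (model choice is an assumption)
bert_model = SentenceTransformer("all-MiniLM-L6-v2")
train_embeddings = bert_model.encode(list(billsum["train"]["text"]), show_progress_bar=True)
test_embeddings = bert_model.encode(test_texts, show_progress_bar=True)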
Step 2: Visualizing embeddings using UpTrain
Let's first visualize how the embeddings of the training dataset compare against those of our real-world testing dataset. We use two dimensionality reduction techniques, UMAP and t-SNE, for embedding visualization.
The UpTrain package includes two dimensionality reduction techniques: UMAP and t-SNE.
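For intuition, here is a standalone sketch of the same two projections using umap-learn and scikit-learn directly (this is not UpTrain's API, just an illustration of what the dashboard computes):
import numpy as np
import umap
from sklearn.manifold import TSNE

all_embeddings = np.vstack([train_embeddings, test_embeddings])

# 2-D UMAP projection of the combined embeddings
umap_2d = umap.UMAP(n_components=2).fit_transform(all_embeddings)

# 2-D t-SNE projection of the same embeddings
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(all_embeddings)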
As we can clearly see, samples from the wikihow dataset form a separate cluster from the training clusters of the billsum dataset. UpTrain gives a real-time dashboard of the embeddings of the inputs/outputs of your language models, helping you visualize these drifts before they start impacting your models.
1. UMAP compression

2. t-SNE dimensionality reduction

Step 3: Quantifying Data Drift via embeddings
Now that we have seen that the embeddings belong to different clusters, let's see how to quantify this drift (which could enable us to add Slack or PagerDuty alerts) using the data drift anomaly defined in UpTrain.
Downsampling BERT embeddings
For the sake of simplicity, we downsample the BERT embeddings from 384 dimensions to 16 by average pooling across features.
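A sketch of this pooling step with NumPy (384 dimensions become 16 groups of 24 consecutive features):
import numpy as np

def downsample(embeddings, out_dim=16):
    # Average-pool groups of consecutive features: (N, 384) -> (N, 16)
    embeddings = np.asarray(embeddings)
    n, d = embeddings.shape
    return embeddings.reshape(n, out_dim, d // out_dim).mean(axis=2)

train_emb_small = downsample(train_embeddings)
test_emb_small = downsample(test_embeddings)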
UpTrain over-clusters the reference dataset, assigns each real-world data point to its nearest cluster, and compares the two distributions using an earth mover's cost. As seen below, the cluster assignment for the production dataset is significantly different from that of the reference dataset, i.e. we are observing a significant drift in our data.

Now that we can visually make sense of the drift, UpTrain also provides a quantitative measure (the earth mover's distance between the production and reference distributions) which can be used to raise an alert whenever a significant drift is observed.
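The idea can be sketched outside UpTrain as well (an illustration, not UpTrain's implementation): over-cluster the reference embeddings, assign each production point to its nearest cluster, and compare the two cluster-frequency distributions with an earth mover's (Wasserstein) distance.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import wasserstein_distance

# Over-cluster the reference (training) embeddings
n_clusters = 20
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(train_emb_small)

# Assign production points to the nearest reference cluster
ref_assignments = kmeans.labels_
prod_assignments = kmeans.predict(test_emb_small)

# Compare the two cluster-frequency distributions
ref_freq = np.bincount(ref_assignments, minlength=n_clusters) / len(ref_assignments)
prod_freq = np.bincount(prod_assignments, minlength=n_clusters) / len(prod_assignments)

# Treating cluster indices as a 1-D support is a simplification for illustration
drift_score = wasserstein_distance(np.arange(n_clusters), np.arange(n_clusters), ref_freq, prod_freq)
print(f"Estimated drift (earth mover's distance): {drift_score:.4f}")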

In addition to embeddings, UpTrain allows you to monitor drift across any custom measure you might care about. For example, in this case, we can monitor drift on metrics such as text language, user emotion, intent, occurrence of a certain keyword, text topic, etc.
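For instance, a simple keyword-occurrence signal could be tracked alongside the embeddings (a hypothetical helper written for illustration, not part of UpTrain's API):
def keyword_occurrence_rate(texts, keyword="Act"):
    # Fraction of texts containing the keyword; legislative text should score
    # high on words like "Act", while WikiHow-style text should score low
    return sum(keyword.lower() in t.lower() for t in texts) / max(len(texts), 1)

print("Reference:", keyword_occurrence_rate(billsum["train"]["text"]))
print("Production:", keyword_occurrence_rate(test_texts))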
Step 4: Identifying edge cases
Now that we have identified issues with our model, let's also see how we can use UpTrain to identify model failure cases. Since we expect the model outputs to be wrong for out-of-distribution samples, we can define rules that help us catch those failure cases.
We will define two rules - the output is grammatically incorrect, and the sentiment of the output is negative (we don't expect negative sentiment outputs on the wikihow dataset).
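The sentiment rule can be sketched with NLTK's VADER analyzer (the grammar check would need an additional tool and is omitted here; the threshold below is illustrative):
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

def is_negative_sentiment(summary, threshold=-0.5):
    # Flag summaries whose VADER compound score is strongly negative
    return sia.polarity_scores(summary)["compound"] < threshold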
In this example, we saw how to identify distribution shifts in natural language tasks by taking advantage of text embeddings.