Words with Context: Extra Information for Language Models


Machine learning models are only as good as the data they receive, and they almost always perform better when supplied with more relevant information. Because of this, NLP systems are at a distinct disadvantage to humans on certain text-based tasks: they simply lack access to information that humans routinely use when analyzing documents. For example, humans can draw on features such as font, bolding, italics, and underlining to aid their understanding, while models such as BERT and GPT-2 can only leverage tokens and their pretrained embeddings, taken out of the context of the page. This auxiliary information can provide important clues in a number of tasks, so we were motivated to give the base models in Finetune, indico's Python library for NLP transfer learning, the ability to leverage features beyond tokens.

Architecture change to train on auxiliary information – conveniently model-independent!

Adding auxiliary information requires only a relatively simple change to the model architecture, described in the diagram below. For each feature, we train a low-dimensional (~32) embedding. For numeric values, such as font size, we normalize to mean 0 and standard deviation 1, then scale the feature's embedding by the normalized value. For categorical information, we convert each label into a binary vector using scikit-learn's LabelBinarizer, normalize each entry, and learn a separate embedding for each bit in the encoding. We then take a weighted average of these embeddings, with learned weights, to produce a general 'context vector'. After activations exit the base model, we concatenate this context vector to the much higher-dimensional feature vector and pass the result to a small target model for training. For classification tasks, the mean of all context vectors is concatenated to each document feature vector; for sequence labeling tasks, each byte pair encoding has its context information concatenated with its text features.
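As a rough illustration, the context-vector construction above can be sketched in a few lines of numpy. The dimensions, the softmax normalization of the learned weights, and all variable names here are illustrative assumptions, not Finetune's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32  # low-dimensional embedding per feature, as described above

# "Learned" parameters, randomly initialized here for illustration:
# one embedding per categorical bit, one per numeric feature, and a
# scalar weight per feature used in the weighted average.
cat_embeddings = rng.normal(size=(3, EMBED_DIM))   # e.g. 3 bits from a LabelBinarizer
num_embedding = rng.normal(size=(1, EMBED_DIM))    # e.g. font size
feature_weights = rng.normal(size=(4,))            # 3 categorical bits + 1 numeric

def context_vector(cat_bits, font_size, mean, std):
    # Numeric features are normalized to mean 0 / std 1, then scale their embedding.
    scaled_numeric = ((font_size - mean) / std) * num_embedding     # (1, 32)
    # Each categorical bit scales its own embedding.
    scaled_cats = cat_bits[:, None] * cat_embeddings                # (3, 32)
    embeddings = np.vstack([scaled_cats, scaled_numeric])           # (4, 32)
    # Weighted average with learned weights -> general 'context vector'.
    weights = np.exp(feature_weights) / np.exp(feature_weights).sum()
    return weights @ embeddings                                     # (32,)

base_features = rng.normal(size=(768,))  # activations exiting the base model
ctx = context_vector(np.array([0.0, 1.0, 0.0]), font_size=14.0, mean=12.0, std=2.0)
# The context vector is concatenated to the higher-dimensional feature
# vector before being passed to the small target model.
combined = np.concatenate([base_features, ctx])
```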

Auxiliary information seems to have especially large potential in tasks with low data volumes, particularly when strong signal can be extracted from heuristics. Hand-picking extra information enforces a strict rule that helps cut through the noise, rather than requiring more labeled data for the model to learn important rules on its own. As an experiment in this low-data regime, we test whether auxiliary information can improve performance on the Reuters Named Entity Recognition task as described in the Finetune library. This task consists of 128 examples with an average of 137 spaCy tokens per example. We use TextCNN as the base model because it is randomly initialized, so linguistic features make useful auxiliary information that more sophisticated models such as GPT-2 would have already mastered.

As auxiliary information, we choose two categorical features that could be useful and assign them at the token level, though Finetune can handle auxiliary features that correspond to any span of text. 

  • The first feature is binary and indicates whether or not a token contains capital letters, since most named entities are capitalized. This is a fairly noisy feature, since many non-entity tokens are capitalized as well. However, it provides useful signal because it allows the model to tell that a byte pair encoding comes from a capitalized word without having to learn that rule.
  • The second feature encodes spaCy's part-of-speech tag for each token. We replace each PROPN assignment with NOUN, since using spaCy's proper-noun designations is nearly equivalent to feeding in the target label, which would defeat the purpose of this experiment.
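The two features above could be generated along these lines. This is a hedged sketch: `make_context` is a hypothetical helper, and in practice the character offsets and POS tags would come from spaCy (`token.idx`, `token.pos_`) rather than hand-supplied tuples:

```python
def make_context(tokens):
    """Build token-level auxiliary features of the kind described above.

    `tokens` is a list of (text, start_offset, pos_tag) tuples; with spaCy
    these would be (token.text, token.idx, token.pos_).
    """
    context = []
    for text, start, pos in tokens:
        context.append({
            'text': text,
            'start': start,
            'end': start + len(text),
            # Binary capitalization feature: noisy, but useful signal.
            'capitalized': str(any(c.isupper() for c in text)),
            # Collapse PROPN into NOUN so the feature does not leak the label.
            'pos': 'NOUN' if pos == 'PROPN' else pos,
        })
    return context

sample = make_context([('Intelligent', 0, 'ADJ'),
                       ('process', 12, 'NOUN'),
                       ('automation', 20, 'NOUN')])
```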

The syntax for passing in auxiliary information is very similar to the syntax for labels in the Finetune SequenceLabeler model; note that the 'start', 'end', and 'text' keys must be included. After processing the text 'Intelligent process automation', the input looks like:

sample = [{'text': 'Intelligent', 'capitalized': 'True', 'end': 11,
           'start': 0, 'pos': 'ADJ'},
          {'text': 'process', 'capitalized': 'False', 'end': 19,
           'start': 12, 'pos': 'NOUN'},
          {'text': 'automation', 'capitalized': 'False', 'end': 30,
           'start': 20, 'pos': 'NOUN'}]

We also define default features for characters that do not fall within spaCy tokens, since we found that model convergence is much more stable if all byte-pair encodings are assigned context vectors. Finetune automatically generates spans using this default setting for any text sequences that do not correspond to any spans of auxiliary features. It also automatically assigns each byte-pair encoding the feature vector corresponding to the span or token from which it came:

default = {"pos": "undefined", "capitalized": False}
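One way to picture the assignment Finetune performs automatically is the lookup below. This is a simplified sketch, not the library's actual code: each character position takes the features of the span that covers it, falling back to the default when no span does:

```python
default = {"pos": "undefined", "capitalized": False}  # the default features from above

spans = [{'text': 'Intelligent', 'start': 0, 'end': 11,
          'capitalized': 'True', 'pos': 'ADJ'},
         {'text': 'process', 'start': 12, 'end': 19,
          'capitalized': 'False', 'pos': 'NOUN'}]

def context_at(position, spans, default):
    # Return the auxiliary features for one character position, using the
    # covering span if any, otherwise the default features. Finetune then
    # assigns each byte-pair encoding the features of the positions it spans.
    for span in spans:
        if span['start'] <= position < span['end']:
            return {'pos': span['pos'], 'capitalized': span['capitalized']}
    return default
```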

We train both a baseline model and an auxiliary information model with batch_size 2 and the default configuration as defined in Finetune, including a learning rate of 6.25e-5. Note that we must pass the default values under the keyword default_context and include the auxiliary information in a list alongside the text; both are required for the model to run.

from finetune import SequenceLabeler
from finetune.base_models import TextCNN

trainX, testX, trainContext, testContext, trainY, testY = …

baseline_model = SequenceLabeler(base_model=TextCNN, batch_size=2)
baseline_model.fit(trainX, trainY)

trainX = [trainX, trainContext]
testX = [testX, testContext]
default = {"pos": "undefined", "capitalized": False}

# passing in kwarg default_context automatically switches the model to auxiliary mode
auxiliary_model = SequenceLabeler(base_model=TextCNN, batch_size=2,
                                  default_context=default)
auxiliary_model.fit(trainX, trainY)

Listed below are the results from both the baseline and the auxiliary model, taken from the run with the highest F1 score after searching num_epochs over {2, 3, 4, 5}. Both were trained and tested on the same data, with all other hyperparameters equal to Finetune's default configuration.

                           Precision (by token)  Recall (by token)  F1 Score (by token)
Baseline (2 epochs)        0.75                  0.82               0.78
Auxiliary Info (3 epochs)  0.83                  0.85               0.84

We see that allowing the model to view part of speech and capitalization provides a decent boost to F1 score. It makes sense that precision increases substantially, since it is easy to learn that tokens that are not nouns or not capitalized are highly unlikely to be named entities, which helps reduce false positives.
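For reference, the token-level metrics reported above can be computed as follows. This is a generic sketch, not Finetune's own metric code; it assumes tokens labeled 'O' are non-entities and counts a true positive only when the predicted entity label matches exactly:

```python
def token_prf(true_labels, pred_labels):
    # Token-level precision, recall, and F1 over entity (non-'O') tokens.
    tp = sum(t == p != 'O' for t, p in zip(true_labels, pred_labels))
    fp = sum(p != 'O' and t != p for t, p in zip(true_labels, pred_labels))
    fn = sum(t != 'O' and t != p for t, p in zip(true_labels, pred_labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```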

Of course, results from one task do not guarantee that auxiliary information can provide benefits to every use case. The effects of including extra features are highly dependent on the task at hand, as well as the predictive ability of the features provided. However, it seems that hand-selected auxiliary features have the potential to assist in a wide variety of tasks, particularly those with low or noisy data, since they provide a strong inductive bias that helps the model learn rules that would otherwise be lost among spurious correlations.
