Learning Similarities - Part 4 (Final)
Similarity Learning using Neural Networks
Index
- Problem Statement
- Similarity Learning Tasks
- Evaluation Metrics
- Establishing Baselines
- About Datasets and Models
- My Journey
- Benchmarked Models
- Notes on Finetuning Models
- Thoughts on Deep Learning Models
- Conclusion
- Future Work
1. Problem Statement
Similarity Learning is the task of learning similarities (or dissimilarities) between sentences, paragraphs or documents. The core idea is to find some sort of vector representation for a sentence/document/paragraph, which can then be used to compare a given sentence with other sentences. For example, let's say we want to compare the sentences "Hello there", "Hi there friend" and "Bye bye".
We have to develop a function, similarity_fn, such that
similarity_fn("Hello there", "Hi there friend") > similarity_fn("Hello there", "Bye bye")
e.g.
similarity_fn("Hello there", "Hi there friend") = 0.9
similarity_fn("Hello there", "Bye bye") = 0.1
As such, Similarity Learning does not require Neural Networks (tf-idf and other such sentence representations can work as well). However, with the recent domination of Deep Learning in NLP, newer models have been performing the best on Similarity Learning tasks.
The idea of finding similarities can be used in the following ways:
1. Regression Similarity Learning: given document1 and document2, we need a similarity measure as a float value.
2. Classification Similarity Learning: given document1 and document2, we need to classify them as similar or dissimilar.
3. Ranking Similarity Learning: given document, similar-document and dissimilar-document, we want to learn a similarity_function such that similarity_function(document, similar-document) > similarity_function(document, dissimilar-document)
Usually, the similarity is computed in one of two ways:
- A vector representation is calculated for the two documents and the cosine similarity between them is their similarity score
- The two documents are made to interact in several different ways (neural attention, matrix multiplication, LSTMs) and a softmax probability of them being similar or dissimilar is calculated. Example:
[0.2, 0.8] -> [probability of being dissimilar, probability of being similar]
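As a rough illustration of the second style, here is a minimal Keras sketch (a generic toy model with arbitrary sizes, not one of the models benchmarked later). It simply concatenates the two encoded sentences before the softmax; real interaction-based models use richer mechanisms such as attention or matching matrices:

```python
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

query_in = Input(shape=(20,))                       # 20 word indices per (padded) sentence
doc_in = Input(shape=(20,))
embed = Embedding(input_dim=10000, output_dim=100)  # shared word-embedding lookup
encode = LSTM(64)                                   # shared sentence encoder
query_vec = encode(embed(query_in))
doc_vec = encode(embed(doc_in))
interaction = concatenate([query_vec, doc_vec])     # let the two representations interact
probs = Dense(2, activation='softmax')(interaction) # [P(dissimilar), P(similar)]

model = Model(inputs=[query_in, doc_in], outputs=probs)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```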
2. Similarity Learning Tasks
1. Question Answering/Information Retrieval
Given a question and a number of candidate answers, we want to rank the answers based on their likelihood of being the answer to the question.
Example:
Q: When was Gandhi born?
A1: Gandhi was a freedom fighter
A2: He was born 2 October, 1869
A3: He was a freedom fighter
The answer selected by the model is
argmax_on_A(similarity_fn(Q, A1), similarity_fn(Q, A2), similarity_fn(Q, A3)) = A2
Conversation/Response Selection is similar to Question Answering: for a given question Q, we want to select the most appropriate response from a pool of responses (A1, A2, A3, ...)
Example datasets: WikiQA
2. Textual Entailment
Given a text and a hypothesis, the model must judge the relationship between them as entailment, contradiction or neutral.
Example from SICK:

Text | Hypothesis | Label |
---|---|---|
The young boys are playing outdoors and the man is smiling nearby | There is no boy playing outdoors and there is no man smiling | CONTRADICTION |
A skilled person is riding a bicycle on one wheel | A person is riding the bicycle on one wheel | ENTAILMENT |
The kids are playing outdoors near a man with a smile | A group of kids is playing in a yard and an old man is standing in the background | NEUTRAL |
3. Duplicate Document Detection/Paraphrase Detection
Given 2 docs/sentences/questions, we want to predict the probability of the two being duplicates.
Example:
- "What is the full form of AWS" -- "What does AWS stand for?" --> Duplicate
- "What is one plus one?" -- "Who is president?" --> Not Duplicate Example Dataset: [Quora Duplicate Question Dataset](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)
4. Span Selection:
Given a context passage and a question, we have to find the span of words in the passage which answers the question.
Example:
passage: "Bob went to the doctor. He wasn't feeling too good"
question: "Who went to the doctor"
span: "*Bob* went to the doctor. He wasn't feeling too good"
Example Dataset: SQUAD
3. Evaluation Metrics
While some similarity tasks have simple metrics (Accuracy, Precision), it can get a bit more complicated for others.
Duplicate Question/Paraphrase Detection and Textual Entailment use Accuracy, since a prediction is simply either correct or incorrect. Accuracy is the ratio of correct predictions to the total number of examples.
Span Selection uses Exact Match and F1: how often did the model predict exactly the correct span, and how much does the predicted span overlap with the correct one?
(A full list of datasets and metrics can be found here)
It gets a bit tricky when you come to evaluating Question Answering systems. The reason is that the result given by these systems is usually an ordering (ranking) of the candidate documents.
For a query Q1 and a set of candidate answers (very-relevant-answer, slightly-relevant-answer, irrelevant-answer), the evaluation metric should score the orderings such that
(1: very-relevant-answer, 2: slightly-relevant-answer, 3: irrelevant-answer) > (1: slightly-relevant-answer, 2: very-relevant-answer, 3: irrelevant-answer) > (1: irrelevant-answer, 2: slightly-relevant-answer, 3: very-relevant-answer)
Basically, the metric should give a higher score to an ordering where the more relevant documents come first and the less relevant documents come later.
The metrics commonly used for this are Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG).
Mean Average Precision (MAP)
You can read more here. MAP is a measure calculated over a full dataset. A dataset like WikiQA will have a setup like:
q_1, (d1, d2, d3)
q_2, (d4, d5)
q_3, (d6, d7, d8, d9, d10)
...
We find the Average Precision for every q_i and then take the mean over all the queries: the Mean (over all the queries) of the Average Precision (over all the docs for a query).
Some pseudo code to describe the idea:
num_queries = len(queries)
precision = 0
for q, candidate_docs in zip(queries, docs):
    precision += average_precision(q, candidate_docs)
mean_average_precision = precision / num_queries
I will solve an example here:
Model predictions: (q1-d1, 0.2), (q1-d2, 0.7) , (q1-d3, 0.3)
Ground Truth : (q1-d1, 1), (q1-d2, 1) , (q1-d3, 0)
We sort our predictions by the similarity score (since that score defines the ordering/ranking):
(d2, 0.7, 1), (d3, 0.3, 0), (d1, 0.2, 1)
Average Precision = (Sum (Precision@k * relevance@k) for k=1 to n) / Number of Relevant documents
where
n : the number of candidate docs
relevance@k : 1 if the kth doc is relevant, 0 if not
Precision@k : precision till the kth cut off where precision is : (number of retrieved relevant docs / total retrieved docs)
So, Average Precision = (1/1 + 0 + 2/3)/2 = 0.83
Here’s how:
At this point, we can discard our score (since we’ve already used it for ordering)
(d2, 1) (d3, 0) (d1, 1)
Number of Relevant Documents is (1 + 0 + 1) = 2
Precision at 1: We only look at (d2, 1)
p@1 = (number of retrieved relevant@1/total number of retrieved docs@1) * relevance@1
= (1 / 1) * 1
= 1
Precision at 2: We only look at (d2, 1) (d3, 0)
p@2 = (number of retrieved relevant@2/total number of retrieved docs@2) * relevance@2
= (1 / 2) * 0
= 0
Precision at 3: We only look at (d2, 1) (d3, 0) (d1, 1)
p@3 = (number of retrieved relevant@3/total number of retrieved docs@3) * relevance@3
= (2 / 3) * 1
= 2/3
Therefore, Average Precision for q1
= (Precision@1 + Precision@2 + Precision@3) / Number of Relevant Documents
= (1 + 0 + 2/3) / 2
= 0.83
To get Mean Average Precision, we calculate the Average Precision for every query in the dataset and take the mean.
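To make the arithmetic concrete, here is a tiny self-contained helper (written just for this post, not taken from any library) that reproduces the 0.83 above:

```python
def average_precision(relevances_in_ranked_order):
    """AP for one query, given the relevance labels in ranked order."""
    n_relevant = sum(relevances_in_ranked_order)
    total = 0.0
    relevant_so_far = 0
    for k, rel in enumerate(relevances_in_ranked_order, start=1):
        if rel:
            relevant_so_far += 1
            total += relevant_so_far / k   # Precision@k, counted only at relevant positions
    return total / n_relevant

print(average_precision([1, 0, 1]))        # 0.8333... (the worked example above)
```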
Normalized Discounted Cumulative Gain
This Wikipedia Article is a great source for understanding NDCG. First, let’s understand Discounted Cumulative Gain.
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result.
DCG@P = Sum for i=1 to P (relevance_score(i) / log2(i + 1))
Let’s take the previous example:
(d2, 0.7, 1), (d3, 0.3, 0), (d1, 0.2, 1)
After ranking by score, the relevance labels in order are (1, 0, 1), so
DCG@3 = 1/log2(1 + 1) + 0/log2(2 + 1) + 1/log2(3 + 1) = 1 + 0 + 0.5 = 1.5
As can be seen, a relevant document that appears lower in the ranking contributes less to the score (and an irrelevant document contributes nothing, no matter where it appears).
To get the Normalized Discounted Cumulative Gain:
Since result sets may vary in size among different queries or systems, to compare performances the normalised version of DCG uses an ideal DCG. To this end, it sorts the documents of a result list by relevance, producing an ideal DCG at position p (IDCG_p), which normalizes the score:
nDCG = DCG / IDCG
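Continuing the same example, here is a short sketch (illustrative, not library code) of DCG and nDCG:

```python
import math

def dcg(relevances):
    # relevances are the labels in ranked order, e.g. [1, 0, 1] from the example above
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)   # the best possible ordering
    return dcg(relevances) / dcg(ideal)

print(dcg([1, 0, 1]))    # 1.5
print(ndcg([1, 0, 1]))   # ~0.92, since the ideal ordering [1, 1, 0] has DCG ~1.63
```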
Should I use MAP or nDCG?
Use MAP when a candidate document is either relevant or not relevant (1 or 0).
Use nDCG when you have graded relevance values, for example when documents have relevances like (d1: 1, d2: 2, d3: 3).
For a more detailed idea, refer to this issue on the MatchZoo repo.
As per @bwanglzu
@aneesh-joshi I’d say it can be evaluated with nDCG, but not meaningful or representative. For binary retrieval problem (say the labels are 0 or 1 indicates relevant or not), we use P, AP, MAP etc. For non-binary retrieval problem (say the labels are real numbers greater equal than 0 indicates the “relatedness”]), we use nDCG. In this situation, clearly, mAP@k is a better metric than nDCG@k. In some cases for non-binary retrieval problem, we use both nDCG & mAP such as microsoft learning to rank challenge link.
About implementing MAP
While the idea of the metric seems easy, there are varying implementations of it, some of which can be found below in the section "MatchZoo Baselines". Ideally, the most trusted metric implementation is the one provided by TREC, and I would imagine that's what people use when publishing results.
About TREC and trec_eval
The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. They provide a standard method of evaluating QA systems called trec_eval. It's a C binary whose code can be found here.
Please refer to this blog to get a complete understanding of trec_eval.
Here’s the short version:
As the above tutorial will tell you, you have to download the TREC evaluation code from here, run make in the folder, and it will produce the trec_eval binary.
The binary requires 2 inputs:
- qrels: the query relevance file, which holds the correct answers.
- pred file: the prediction file, which holds the predictions made by your model.
It can be run like:
$ ./trec_eval qrels pred
The above will provide some standard metrics. You can also specify metrics on your own like:
$ ./trec_eval -m map -m ndcg_cut.1,3,5,10,20 qrels pred
The above will provide the MAP value and nDCG at cutoffs of 1, 3, 5, 10 and 20.
The qrels format is:
Format
------
<query_id>\t<0>\t<document_id>\t<relevance>
Note: parameter <0> is ignored by trec_eval
Example
-------
Q1 0 D1-0 0
Q1 0 D1-1 0
Q1 0 D1-2 0
Q1 0 D1-3 1
Q1 0 D1-4 0
Q16 0 D16-0 1
Q16 0 D16-1 0
Q16 0 D16-2 0
Q16 0 D16-3 0
Q16 0 D16-4 0
The model's prediction file, pred, should look like:
Format
------
<query_id>\t<Q0>\t<document_id>\t<rank>\t<model_score>\t<STANDARD>
Note: parameters <Q0>, <rank> and <STANDARD> are ignored by trec_eval and can be set to anything.
I have chosen 99 as the rank; it has no meaning.
Example
-------
Q1 Q0 D1-0 99 0.64426434 STANDARD
Q1 Q0 D1-1 99 0.26972288 STANDARD
Q1 Q0 D1-2 99 0.6259719 STANDARD
Q1 Q0 D1-3 99 0.8891963 STANDARD
Q1 Q0 D1-4 99 1.7347554 STANDARD
Q16 Q0 D16-0 99 1.1078827 STANDARD
Q16 Q0 D16-1 99 0.22940424 STANDARD
Q16 Q0 D16-2 99 1.7198141 STANDARD
Q16 Q0 D16-3 99 1.7576259 STANDARD
Q16 Q0 D16-4 99 1.548423 STANDARD
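For reference, here is a small sketch of how one might dump a model's predictions into these two formats (the function and its argument names are made up for illustration; only the column layout matches the formats above):

```python
def write_trec_files(query_ids, doc_ids, relevances, scores,
                     qrels_path='qrels', pred_path='pred'):
    """Illustrative helper (not part of trec_eval) that writes the qrels and pred files."""
    with open(qrels_path, 'w') as qrels, open(pred_path, 'w') as pred:
        for q_id, docs, rels, preds in zip(query_ids, doc_ids, relevances, scores):
            for d_id, rel, score in zip(docs, rels, preds):
                qrels.write('%s\t0\t%s\t%d\n' % (q_id, d_id, rel))
                # rank is hardcoded to 99, as in the example above; it has no meaning
                pred.write('%s\tQ0\t%s\t99\t%f\tSTANDARD\n' % (q_id, d_id, score))

# one query, three candidate documents
write_trec_files(['Q1'], [['D1-0', 'D1-1', 'D1-2']], [[0, 1, 0]], [[0.64, 0.89, 0.27]])
```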
4. Establishing Baselines
In our journey of developing a similarity learning model, we first decided to use the WikiQA dataset since it is a well studied and common baseline used in many papers like SeqMatchSeq, BiMPM and QA-Transfer.
Word2Vec Baseline
To understand the level of improvement we could make, we set up a baseline using word2vec. We use the average of word vectors in a sentence to get the vector for a sentence/document.
"Hello World" -> (vec("Hello") + vec("World"))/2
When 2 documents are to be compared for similarity/relevance, we take the Cosine Similarity between their vectors as their similarity. (300 dimensional vectors were seen to perform the best, so we chose them.)
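A minimal sketch of this baseline, assuming gensim and its downloader are installed (the exact pretrained vectors are interchangeable):

```python
import numpy as np
import gensim.downloader as api

word_vectors = api.load('word2vec-google-news-300')   # 300-dim word2vec vectors

def sentence_vector(sentence):
    # average of the word vectors, ignoring out-of-vocabulary words
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def similarity_fn(sent1, sent2):
    v1, v2 = sentence_vector(sent1), sentence_vector(sent2)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))   # cosine similarity

print(similarity_fn("Hello there", "Hi there friend"))
```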
The w2v 300-dim MAP scores on WikiQA:
- full set (100%): 0.59
- train split (80%): 0.57
- test split (20%): 0.62
- dev split (10%): 0.62
MatchZoo Baselines
I had found a repo called MatchZoo which had almost 10 different Similarity Learning models developed and benchmarked. According to their benchmarks, their best model performed around 0.65 MAP on WikiQA.
I tried to reproduce their results using their scripts but found that the scores didn't match what they advertised. I raised an issue (there was a similar issue before mine as well), and they went about fixing it. I was later able to reproduce their results (by reverting to an earlier point in the commit history), but by then there was a lack of trust in their numbers. I set out to write my own script in order to make my evaluation independent of theirs. The problem was that they had their own way of representing words and other such internal functions. Luckily, they were dumping their results in the TREC format, so I was able to scrape that and evaluate it using this script.
I developed a table with my baselines as below:
Note: the values are different from the established baseline because there is some discrepancy in how MAP should be calculated. I initially used my own implementation of it for the table; later, I moved to the one provided by trec_eval. This is how I implement it:
import numpy as np

def mapk(Y_true, Y_pred):
    """Calculates Mean Average Precision (MAP) for a given set of Y_true, Y_pred

    Note: Currently doesn't support mapping at k. Couldn't use only map as it's a
    reserved word

    Parameters
    ----------
    Y_true : numpy array or list of ints, either 1 or 0
        Contains the true, ground truth values of the relevance between a query and document
    Y_pred : numpy array or list of floats
        Contains the predicted similarity score between a query and document

    Examples
    --------
    >>> Y_true = [[0, 1, 0, 1], [0, 0, 0, 0, 1, 0], [0, 1, 0]]
    >>> Y_pred = [[0.1, 0.2, -0.01, 0.4], [0.12, -0.43, 0.2, 0.1, 0.99, 0.7], [0.5, 0.63, 0.92]]
    >>> print(mapk(Y_true, Y_pred))
    0.75
    """
    aps = []
    n_skipped = 0
    for y_true, y_pred in zip(Y_true, Y_pred):
        # skip datapoints where there is no relevant document
        if np.sum(y_true) < 1:
            n_skipped += 1
            continue
        # rank the candidate docs by the predicted score, highest first
        pred_sorted = sorted(zip(y_true, y_pred), key=lambda x: x[1], reverse=True)
        avg = 0
        n_relevant = 0
        for i, val in enumerate(pred_sorted):
            if val[0] == 1:
                # note: this adds 1/rank for each relevant doc; standard Average
                # Precision adds Precision@k (relevant-so-far / rank) instead, which
                # is one reason this value can differ slightly from trec_eval's MAP
                avg += 1. / (i + 1.)
                n_relevant += 1
        if n_relevant != 0:
            ap = avg / n_relevant
            aps.append(ap)
    return np.mean(np.array(aps))
This is how it’s implemented in MatchZoo:
def map(y_true, y_pred, rel_threshold=0):
    s = 0.
    y_true = _to_list(np.squeeze(y_true).tolist())
    y_pred = _to_list(np.squeeze(y_pred).tolist())
    c = list(zip(y_true, y_pred))
    random.shuffle(c)
    c = sorted(c, key=lambda x: x[1], reverse=True)
    ipos = 0
    for j, (g, p) in enumerate(c):
        if g > rel_threshold:
            ipos += 1.
            s += ipos / (j + 1.)
    if ipos == 0:
        s = 0.
    else:
        s /= ipos
    return s
The above has a random shuffle to break ties between documents with equal predicted scores; without it, a model that outputs constant scores could inherit the dataset's ordering and get an artificially good (or bad) score. Refer to this issue.
After studying these values, we decided to implement the DRMM_TKS model since it had the best score with the fewest tradeoffs (like training time and memory).
Retrospection Point: It should have hit us at this point that the best MAP score of 0.65, compared to the word2vec baseline of 0.58, isn't much of an improvement (only 0.07). In the later weeks it became clearer that we wanted to develop a model which could perform at least 0.75 (i.e., 0.15 more than the word2vec baseline). On changing the way we evaluated (shifting from my script to trec_eval), the w2v baseline went up to 0.62, which meant we needed a MAP of at least 0.78 to be practically useful. Moreover, the very paper which introduced WikiQA claimed a MAP of 0.65. We should have realized at this point that investigating these models further was probably pointless (since the best they had so far was merely 0.65). Maybe we hoped that finetuning these models would lead to a higher score.
Non-standard things about Similarity Learning Models
There are some things which are a bit non-standard when it comes to Similarity Learning; these are things one wouldn't encounter while working on standard Deep Learning tasks.
The first one is evaluating with MAP, nDCG, etc which I have already explained above.
The second is the loss function.
Ideally, we would like to use a loss function which directly improves our metric (MAP). Unfortunately, MAP isn't a differentiable function; it's like saying "I want to increase Accuracy, so let's make Accuracy the loss function". In the Similarity Learning context, we use the Rank Hinge Loss instead.
This loss function is especially good for Ranking Problems. More details can be found here
loss = maximum(0., margin + y_neg - y_pos)
where margin is set to 1 or 0.5
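A minimal Keras-backend sketch of the rank hinge loss above, assuming the model produces separate scores for the relevant and irrelevant document of each triple (names are illustrative):

```python
from keras import backend as K

def rank_hinge_loss(y_pos, y_neg, margin=1.0):
    # y_pos: model scores for the relevant documents of a batch of triples
    # y_neg: model scores for the corresponding irrelevant documents
    # the loss is zero once the relevant doc outscores the irrelevant one by `margin`
    return K.mean(K.maximum(0., margin + y_neg - y_pos))
```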
Let's look at the Keras definitions to get a better idea of the intuition:
def hinge(y_true, y_pred):
    return K.mean(K.maximum(1. - y_true * y_pred, 0.), axis=-1)
In this case, y_true can be {1, -1}. For example, y_true = [1, -1, 1], y_pred = [0.6, 0.1, -0.9]
K.mean(K.maximum(1 - [1, -1, 1] * [0.6, 0.1, -0.9], 0))
= K.mean(K.maximum(1 - [0.6, -0.1, -0.9], 0))
= K.mean(K.maximum([0.4, 1.1, 1.9], 0))
= K.mean([0.4, 1.1, 1.9])
Effectively, the more wrong a prediction is, the more it contributes to the loss.
def categorical_hinge(y_true, y_pred):
    pos = K.sum(y_true * y_pred, axis=-1)
    neg = K.max((1.0 - y_true) * y_pred, axis=-1)
    return K.mean(K.maximum(0.0, neg - pos + 1), axis=-1)
In this case, there is a categorization, so the true values are one-hot vectors.
For example, y_true = [[1, 0, 0], [0, 0, 1]], y_pred = [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]]
pos = K.sum([[1, 0, 0], [0, 0, 1]] * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
= K.sum([[0.9, 0, 0], [0, 0, 0.1]], axis=-1) # check the predictions for the correct answer
= [0.9, 0.1]
neg = K.max((1.0 - [[1, 0, 0], [0, 0, 1]]) * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
= K.max([[0, 1, 1], [1, 1, 0]] * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
= K.max([[0, 0.1, 0.1], [0.7, 0.2, 0]], axis=-1)
= [0.1, 0.7]
loss = K.mean(K.maximum(0.0, neg - pos + 1), axis=-1)
= K.mean(K.maximum(0.0, [0.1, 0.7] - [0.9, 0.1] + 1), axis=-1)
= K.mean(K.maximum(0.0, [0.2, 1.6]), axis=-1)
= K.mean([0.2, 1.6])
Again, effectively, the high score for a wrong class (0.7) contributes more to the loss, whereas a high score for the correct class (0.9) reduces it greatly.
- K is the Keras backend (effectively K = tf when using the TensorFlow backend)
- K.maximum is the elementwise maximum
- K.max takes the maximum value along an axis of a tensor
The third "non-standard" idea is that of Pairwise, Listwise and Pointwise training.
Here, here and here are some good resources.
Here’s my take on it:
These 3 represent different ways of training a ranking model by virtue of the loss function.
Pointwise
Here we take a (query, document) pair and try to score it directly. For example, we take the sentence "I like chicken wings" and try to give it a numerical score (say 0.34) for the question "What do I like to eat?".
Pairwise
We take two docs and try to rank them as compared to each other. “x1 is more relevant than x2”
Given a pair of documents, they try and come up with the optimal ordering for that pair and compare it to the ground truth. The goal for the ranker is to minimize the number of inversions in ranking i.e. cases where the pair of results are in the wrong order relative to the ground truth.
ListWise
Listwise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it
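To make the difference concrete, here is what a single training example might look like under each setup (toy data, purely illustrative):

```python
# (query, document, relevance label)
pointwise_example = ("What do I like to eat?", "I like chicken wings", 1)

# (query, more relevant document, less relevant document)
pairwise_example = ("What do I like to eat?", "I like chicken wings", "The sky is blue")

# (query, full candidate list, relevance of each candidate)
listwise_example = ("What do I like to eat?",
                    ["I like chicken wings", "The sky is blue", "Pizza is great"],
                    [1, 0, 1])
```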
5. About Datasets and Models
Here, I will give examples of different datasets.
Dataset Name | Link | Suggested Metrics | Some Papers that use the dataset | Brief Description |
---|---|---|---|---|
WikiQA | | | | Question-Candidate_Answer1_to_N-Relevance1_to_N |
SQUAD 2.0 | Website | | QA-Transfer (for pretraining) | Question-Context-Answer_Range_in_context |
Quora Duplicate Question Pairs | | Accuracy, F1 | | Q1-Q2-DuplicateProbability |
SemEval 2016 Task 3A | gensim-data (semeval-2016-2017-task3-subtaskA-unannotated) | | QA-Transfer (SQUAD* MAP=80.2, MRR=86.4, P@1=89.1) | Question-Comment-SimilarityProbability |
MovieQA | | Accuracy | SeqMatchSeq (SUBMULT+NN test=72.9%, dev=72.1%) | Plot-Question-Candidate_Answers |
InsuranceQA | Website | Accuracy | SeqMatchSeq (SUBMULT+NN test1=75.6%, test2=72.3%, dev=77%) | Question-Ground_Truth_Answer-Candidate_answer |
SNLI | | Accuracy | | Text-Hypothesis-Judgement |
TRECQA | | | BiMPM (MAP:0.802, MRR:0.875) | Question-Candidate_Answer1_to_N-relevance1_to_N |
SICK | Website | Accuracy | QA-Transfer (Acc=88.2) | sent1-sent2-entailment_label-relatedness_score |
More dataset info can be found at the SentEval repo.
Special note about InsuranceQA
This dataset was originally released here
It has several different formats but at its base there is (for one data point):
- a question
- one or more correct answers
- a pool of all answers (correct and incorrect)
It doesn't have a simple question - document - relevance format, so we have to convert it to the QA format. That basically involves taking a question and its correct answer and marking the pair as relevant. Then, to fill the rest of the batch (however big you want it to be), we pick the remaining (batch_size - current_size) answers from the pool; these act as the negative examples.
The original repo has several files and is very confusing. Luckily, there is a converted version of it here
Since there is a pool of candidate answers from which a batch is sampled, the dataset becomes inherently stochastic.
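Here is an illustrative sketch of that sampling (toy code, not the actual InsuranceQA reader):

```python
import random

def make_qa_batch(question, correct_answers, answer_pool, batch_size=20):
    # mark the question's correct answers as relevant
    batch = [(question, ans, 1) for ans in correct_answers]
    # fill the rest of the batch with answers sampled from the pool as negatives
    negatives = [a for a in answer_pool if a not in correct_answers]
    for ans in random.sample(negatives, batch_size - len(batch)):
        batch.append((question, ans, 0))
    return batch
```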
Links to Papers (some useful examples):
- SQUAD
- SQUAD, WikiQA, SemEval, SICK
- MovieQA and InsuranceQA
About Models
While several models were cursorily checked, this study mainly covers the models benchmarked below: DRMM_TKS, MatchPyramid and BiDAF (QA-Transfer).
6. My Journey
After going through the initial evaluation, we decided to implement the DRMM_TKS model, since it gave the best score (0.65 MAP on WikiQA) with acceptable memory usage and training time. The first model was considerably important since it set the standard for all other models to follow. I had to learn and apply a lot of Software Engineering practices like doctests, docstrings, etc. This model also helped me set up the skeleton structure needed for any Similarity Learning model. The code can be found here.
This is generally what needs to be done:
- Iterate through the train set's sentences and develop your model's word2index dictionary (one index for each word). This is especially useful when it comes to predicting on new data and we need to translate new sentences.
- Translate all the words into indices using the above dict.
- Load a pretrained Embedding Matrix (like Glove) and reorder it so that its indices match your word2index dictionary.
- There will be some words in the embedding matrix which aren't in the train set. Although these extra embeddings aren't useful for training, they will be useful while testing and validating. So, append the word2index dict with these new words and reorder the embedding matrix accordingly.
- Make 2 additional entries in the embedding matrix for the pad word and the unknown word (see the sketch after this list).
- Since all sentences won't be of the same length, we will inevitably have to pad them to the same length. You should decide at this point whether the pad vector is a zero vector or a random vector.
- While testing and validating, you will inevitably encounter an unknown word (a word not in your word2index dictionary). All unknown words should be mapped to the same vector. Again, you can choose between a zero vector and a random vector.
- Once all the data is in the right format, we need to create a generator which will stream out batches of training examples (with padding added, of course). In the Question Answering context, there is a special consideration here. Usually, a training set will be of the form [(q1, (d1, d2, d3)), (q2, (d4, d5))] with relevance labels like [(1, 0, 0), (0, 1)]. While setting up training examples, it's a good idea to make training pairs like [(q, d_relevant), (q, d_not_relevant), (q, d_relevant), ...]. This "should" help the model learn the correct classification by contrasting relevant and non-relevant answers.
- Now that your batch generator is spitting out batches to train on, create a model, fix the loss, optimizer and other such parameters, and train.
- The metrics usually reported by a library like Keras won't be very useful for evaluating your model as it trains. So, at the end of each epoch, you should set up your own callback as I have done here.
- Once the model training is done, evaluate it on the test set. You can obviously use your own implementation, but as a standard, it would be better to write your results in the TREC format as I have done here and then use the trec_eval binary described before for evaluating. This standardizes evaluation and makes it easier for others to trust the result.
- Saving models can get a bit tricky, especially if you're using Keras, which prefers its own way of saving over the standard gensim utils.SaveLoad. You will have to save the Keras model separately from the non-Keras parts and combine them while loading. Moreover, if you have any Lambda layer, you will have to replace it with a custom layer like I did here for the TopK layer.

Most of the above has been done in my script and you can mostly just use that skeleton while changing just the _get_keras_model function.
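Here is a rough sketch of the vocabulary and embedding-matrix bookkeeping described in the list above (illustrative names and a gensim-3-style KeyedVectors object, not the actual gensim code):

```python
import numpy as np

def build_vocab_and_embeddings(train_sentences, pretrained_kv, dim=300):
    # one index per word seen in the train set
    word2index = {}
    for sentence in train_sentences:
        for word in sentence.lower().split():
            word2index.setdefault(word, len(word2index))
    # words known to the pretrained vectors but unseen in train: not useful for
    # training, but useful at test/validation time
    for word in pretrained_kv.index2word:
        word2index.setdefault(word, len(word2index))

    # two extra rows: one for the pad word, one for the unknown word
    pad_index, unk_index = len(word2index), len(word2index) + 1
    embedding_matrix = np.random.uniform(-0.1, 0.1, (len(word2index) + 2, dim))
    for word, index in word2index.items():
        if word in pretrained_kv:
            embedding_matrix[index] = pretrained_kv[word]
    embedding_matrix[pad_index] = 0.  # here the pad vector is chosen to be all zeros
    return word2index, embedding_matrix, pad_index, unk_index

def translate_and_pad(sentence, word2index, pad_index, unk_index, max_len=20):
    # unknown words all map to the same index; sentences are cut/padded to max_len
    indices = [word2index.get(w, unk_index) for w in sentence.lower().split()]
    return (indices + [pad_index] * max_len)[:max_len]
```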
Once my DRMM_TKS model was set up, it initially performed very poorly (0.54 MAP). After trying every idea I could imagine for fine-tuning it, I ended up with a model which scored 0.63 MAP. Beyond this point, however, I didn't feel like any changes I made would make it perform better. You are welcome to try.
Seeing that DRMM_TKS wasn't doing so well, I moved my attention to the next best performing model in my list, MatchPyramid. This model, after some tuning, managed to score 0.64-0.65 MAP.
It was around this point that we came to the realization that this score wasn't much more than the baseline, and that a score of at least the unsupervised baseline plus 0.15 would be needed to make it a worthwhile model to collect supervised data for.
String of events leading to a change in models
I started looking for other datasets and happened to stumble upon the BiMPM paper, which claimed a score of 0.71 on WikiQA. In its score comparison table, it cited another paper, SeqMatchSeq, which claimed a further score of 0.74. I went about implementing BiMPM here and here, which didn't perform as expected. I then ran the author's original code, where it unfortunately turned out that the author had reported the score on the dev set, not the test set (as per my run at least, with the author's random seeds). (On asking the author in this issue, the author claims the split was the test set.) The actual test score was 0.69, not much more than the baseline for such a complicated model. I also found a pytorch implementation of SeqMatchSeq here, in which there was an issue about MAP coming to 0.62 instead of 0.72 on the dev set. The author commented:
I am afraid you are right. I used to reach ~72% via the given random seed on an old version of pytorch, but now with the new version of pytorch, I wasn't able to reproduce the result. My personal opinion is that the model is neither deep or sophisticated, and usually for such kind of model, tuning hyper parameters will change the results a lot (although I don't think it's worthy to invest time tweaking an unstable model structure). If you want guaranteed decent accuracy on answer selection task, I suggest you take a look at those transfer learning methods from reading comprehension. One of them is here https://github.com/pcgreat/qa-transfer
About QA-Transfer and BiDAF
And thus, my hunt led me to the paper Question Answering through Transfer Learning from Large Fine-grained Supervision Data, which makes an even bolder claim on MAP: 0.83 (with an ensemble of 12).
The paper’s author provides the implementation here in tensorflow.
The authors make some notable claims in the abstract:
We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.
So, here is how this model works: it takes an existing QA model called Bi-Directional Attention Flow (BiDAF), which takes in a query and a context and predicts the range/span of words in the context which is relevant to the query. It was built for the SQUAD dataset.
The QA-Transfer paper takes the BiDAF net and chops off the last layer to make it more QA-like, i.e., something more like WikiQA:
q1 - d1 - 0
q1 - d2 - 0
q1 - d3 - 1
q1 - d4 - 0
They call this modified network BiDAF-T
Then, they take the SQUAD dataset, break each context into sentences and label each sentence as relevant or irrelevant. This new dataset is called SQUAD-T.
How it works: the SQUAD dataset is for span-level QA
passage: "Bob went to the doctor. He wasn't feeling too good"
question: "Who went to the doctor"
span: "*Bob* went to the doctor. He wasn't feeling too good"
This can be remodelled such that:
question : "Who went to the doctor"
doc1 : "Bob went to the doctor."
relevance : 1 (True)
question : "Who went to the doctor"
doc2 : "He wasn't feeling too good"
relevance : 0 (False)
Effectively, we get a really big, good dataset in the QA domain. The converted file is almost 110 MB.
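Here is an illustrative sketch (not the paper's actual conversion script) of how a span-level SQUAD example can be remodelled into sentence-level QA pairs:

```python
def squad_to_sentence_qa(passage, question, answer_start, answer_end):
    """answer_start/answer_end are character offsets of the answer span in the passage."""
    examples = []
    offset = 0
    for sentence in passage.split('. '):          # naive sentence splitting
        sent_start, sent_end = offset, offset + len(sentence)
        # a sentence is relevant if the answer span falls inside it
        relevant = int(sent_start <= answer_start and answer_end <= sent_end + 1)
        examples.append((question, sentence, relevant))
        offset = sent_end + 2                      # account for the '. ' separator
    return examples

passage = "Bob went to the doctor. He wasn't feeling too good"
print(squad_to_sentence_qa(passage, "Who went to the doctor", 0, 3))
# [("Who went to the doctor", "Bob went to the doctor", 1),
#  ("Who went to the doctor", "He wasn't feeling too good", 0)]
```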
The new model, BiDAF-T is then trained on SQUAD-T. It gets MAP : 0.75 on WikiQA test set.
When BiDAF-T is finetuned on WikiQA, it gets MAP : 0.76
When BiDAF is trained on SQUAD, then the weights are transferred to SQUAD-T and it is further finetuned on WikiQA, it gets the MAP : 0.79
When it's done as an ensemble, it gets MAP : 0.83
They call it Transfer Learning.
So, as such, it's not exactly a new model. It's just an old model (BiDAF), trained on a modified dataset and then used on WikiQA and SemEval. However, the authors claim that the model does surprisingly well on both of them. The authors have provided their own repo.
Since 0.79 feels like a good score, significantly (0.17) above the w2v baseline, I went about implementing it. You can find my implementation here; it is called using this eval script. Unfortunately, the scores aren't as good.
Steps involved:
I had to first convert the SQUAD dataset, a span-level dataset, into a QA dataset. I have uploaded the new dataset and the script for conversion on my repo. I then trained the BiDAF-T model on the SQUAD-T dataset to check if it did in fact reach the elusive score reported in the paper.
I got 0.64 on average (reaching 0.67 at one point), but this isn't better than MatchPyramid, and it has several drawbacks:
Things to consider regarding QA Transfer/BiDAF:
- The original BiDAF code on SQUAD takes 20 hours to converge [reference]
- It is a very heavy model
- Since the transfer is from SQUAD, it won't work on non-English text. In that case, it's better to try an older method like BiMPM or SeqMatchSeq
List of existing implementations
- BiDAF code:
- BiDAF-T code:
7. Benchmarked Models
In the tables below,
w2v/Glove Vec Averaging Baseline : refers to the score between 2 sentences calculated by averaging the word vectors of their words (taken from the GloVe vectors of Pennington et al.) and then taking the Cosine Similarity.
FT : refers to the same as above but using FastText vectors.
Glove + Regression NN : refers to getting sentence representations in the same way as above, then feeding them to a single-layer neural network with softmax activation.
Glove + Multilayer NN : same as above but using a multilayer neural network instead of a single layer.
WikiQA
WikiQA test set | w2v 200 dim | FT 300 dim | MatchPyramid | DRMM_TKS | BiDAF only pretrain | BiDAF pretrain + finetune | MatchPyramid Untrained Model | DRMM_TKS Untrained Model | BiDAF Untrained Model |
---|---|---|---|---|---|---|---|---|---|
map | 0.6285 | 0.6199 | 0.6463 | 0.6354 | 0.6042 | 0.6257 | 0.5107 | 0.5394 | 0.3291 |
gm_map | 0.4972 | 0.4763 | 0.5071 | 0.4989 | 0.4784 | 0.4986 | 0.3753 | 0.4111 | 0.2455 |
Rprec | 0.4709 | 0.4715 | 0.5007 | 0.4801 | 0.416 | 0.4616 | 0.3471 | 0.3512 | 0.1156 |
bpref | 0.4613 | 0.4642 | 0.4977 | 0.4795 | 0.4145 | 0.4569 | 0.3344 | 0.3469 | 0.1101 |
recip_rank | 0.6419 | 0.6336 | 0.6546 | 0.6437 | 0.6179 | 0.6405 | 0.519 | 0.5473 | 0.3312 |
iprec_at_recall_0.00 | 0.6469 | 0.6375 | 0.6602 | 0.648 | 0.6224 | 0.6441 | 0.5242 | 0.5534 | 0.3396 |
iprec_at_recall_0.10 | 0.6469 | 0.6375 | 0.6602 | 0.648 | 0.6224 | 0.6441 | 0.5242 | 0.5534 | 0.3396 |
iprec_at_recall_0.20 | 0.6469 | 0.6375 | 0.6602 | 0.648 | 0.6224 | 0.6441 | 0.5242 | 0.5534 | 0.3396 |
iprec_at_recall_0.30 | 0.6431 | 0.6314 | 0.6572 | 0.648 | 0.6177 | 0.6393 | 0.5213 | 0.5515 | 0.3382 |
iprec_at_recall_0.40 | 0.6404 | 0.6293 | 0.6537 | 0.6458 | 0.614 | 0.6353 | 0.5189 | 0.5488 | 0.3382 |
iprec_at_recall_0.50 | 0.6404 | 0.6293 | 0.6537 | 0.6458 | 0.614 | 0.6353 | 0.5189 | 0.5488 | 0.3382 |
iprec_at_recall_0.60 | 0.6196 | 0.6115 | 0.6425 | 0.6296 | 0.5968 | 0.6167 | 0.5073 | 0.5348 | 0.3289 |
iprec_at_recall_0.70 | 0.6196 | 0.6115 | 0.6425 | 0.6296 | 0.5968 | 0.6167 | 0.5073 | 0.5348 | 0.3289 |
iprec_at_recall_0.80 | 0.6175 | 0.6094 | 0.6401 | 0.627 | 0.594 | 0.6143 | 0.5049 | 0.5333 | 0.3263 |
iprec_at_recall_0.90 | 0.6175 | 0.6094 | 0.6401 | 0.627 | 0.594 | 0.6143 | 0.5049 | 0.5333 | 0.3263 |
iprec_at_recall_1.00 | 0.6175 | 0.6094 | 0.6401 | 0.627 | 0.594 | 0.6143 | 0.5049 | 0.5333 | 0.3263 |
P_5 | 0.1967 | 0.1926 | 0.1967 | 0.1934 | 0.1984 | 0.2008 | 0.1704 | 0.1835 | 0.1473 |
P_10 | 0.1119 | 0.1119 | 0.1156 | 0.1152 | 0.1128 | 0.1136 | 0.1095 | 0.1144 | 0.1033 |
P_15 | 0.0787 | 0.0774 | 0.079 | 0.0785 | 0.0771 | 0.0779 | 0.0787 | 0.0787 | 0.0749 |
P_20 | 0.0597 | 0.0591 | 0.0599 | 0.0591 | 0.0599 | 0.0599 | 0.0603 | 0.0601 | 0.0591 |
Quora Duplicate Questions
Model | Accuracy |
---|---|
MatchPyramid | 69.20% |
DRMM TKS | 68.49% |
Glove Vec Averaging Baseline | 37.02% |
Glove + Regression NN | 69.02% |
Glove + Multilayer NN | 78% |
SICK
Model | Accuracy |
---|---|
MatchPyramid | 56.82% |
DTKS | 57.00% |
Glove + Regression NN | 59.68% |
Glove + Multilayer NN | 66.18% |
MatchPyramid Untrained Model | 23% |
DTKS Untrained Model | 29% |
SNLI
Model | Accuracy |
---|---|
MatchPyramid | 53.57% |
DRMM_TKS | 43.15% |
MatchPyramid Untrained Model | 33% |
DRMM_TKS Untrained Model | 33% |
Glove + Regression NN | 58.60% |
Glove + Multilayer NN | 73.06% |
InsuranceQA
Note: InsuranceQA is a "different" dataset. It has a question with one or two correct answers, along with a pool of almost 500 incorrect answers. For each question, the InsuranceQA reader randomly samples some incorrect answers to include in a batch. The dataset comes with 2 test sets (Test1 and Test2).
Test1 set
IQA Test1 | w2v(300 dim) | MatchPyramid | DRMM_TKS |
---|---|---|---|
map | 0.6975 | 0.8194 | 0.5539 |
gm_map | 0.5793 | 0.7415 | 0.4295 |
Rprec | 0.5677 | 0.7246 | 0.3902 |
bpref | 0.569 | 0.7296 | 0.3908 |
recip_rank | 0.7272 | 0.8445 | 0.5901 |
iprec_at_recall_0.00 | 0.7329 | 0.8498 | 0.5978 |
iprec_at_recall_0.10 | 0.7329 | 0.8498 | 0.5978 |
iprec_at_recall_0.20 | 0.7329 | 0.8498 | 0.5977 |
iprec_at_recall_0.30 | 0.7316 | 0.8485 | 0.5944 |
iprec_at_recall_0.40 | 0.7241 | 0.8416 | 0.5841 |
iprec_at_recall_0.50 | 0.7238 | 0.8407 | 0.5838 |
iprec_at_recall_0.60 | 0.6813 | 0.8055 | 0.5349 |
iprec_at_recall_0.70 | 0.6812 | 0.8049 | 0.534 |
iprec_at_recall_0.80 | 0.6662 | 0.7938 | 0.5185 |
iprec_at_recall_0.90 | 0.6652 | 0.793 | 0.5171 |
iprec_at_recall_1.00 | 0.6652 | 0.793 | 0.5171 |
P_5 | 0.243 | 0.2663 | 0.2154 |
P_10 | 0.1453 | 0.1453 | 0.1453 |
P_15 | 0.0969 | 0.0969 | 0.0969 |
P_20 | 0.0727 | 0.0727 | 0.0727 |
P_30 | 0.0484 | 0.0484 | 0.0484 |
P_100 | 0.0145 | 0.0145 | 0.0145 |
P_200 | 0.0073 | 0.0073 | 0.0073 |
P_500 | 0.0029 | 0.0029 | 0.0029 |
P_1000 | 0.0015 | 0.0015 | 0.0015 |
Test2 set
IQA Test2 | Word2Vec (300 dim) | MatchPyramid | DRMM_TKS |
---|---|---|---|
map | 0.7055 | 0.8022 | 0.5354 |
gm_map | 0.589 | 0.714 | 0.4137 |
Rprec | 0.5773 | 0.698 | 0.3725 |
bpref | 0.58 | 0.7048 | 0.3704 |
recip_rank | 0.7362 | 0.826 | 0.5698 |
iprec_at_recall_0.00 | 0.7413 | 0.8318 | 0.5783 |
iprec_at_recall_0.10 | 0.7413 | 0.8318 | 0.5783 |
iprec_at_recall_0.20 | 0.7413 | 0.8316 | 0.5783 |
iprec_at_recall_0.30 | 0.7402 | 0.8304 | 0.5757 |
iprec_at_recall_0.40 | 0.7319 | 0.8243 | 0.5657 |
iprec_at_recall_0.50 | 0.7317 | 0.8243 | 0.5652 |
iprec_at_recall_0.60 | 0.6869 | 0.7903 | 0.5152 |
iprec_at_recall_0.70 | 0.6866 | 0.7898 | 0.5147 |
iprec_at_recall_0.80 | 0.6734 | 0.7771 | 0.5016 |
iprec_at_recall_0.90 | 0.6731 | 0.7763 | 0.5011 |
iprec_at_recall_1.00 | 0.6731 | 0.7763 | 0.5011 |
P_5 | 0.2426 | 0.2602 | 0.2087 |
P_10 | 0.1441 | 0.1441 | 0.1441 |
P_15 | 0.096 | 0.096 | 0.096 |
P_20 | 0.072 | 0.072 | 0.072 |
P_30 | 0.048 | 0.048 | 0.048 |
P_100 | 0.0144 | 0.0144 | 0.0144 |
P_200 | 0.0072 | 0.0072 | 0.0072 |
P_500 | 0.0029 | 0.0029 | 0.0029 |
P_1000 | 0.0014 | 0.0014 | 0.0014 |
8. Notes on Finetuning Models
Fine tuning deep learning models can be tough, considering the model runtimes and the number of parameters.
Here, I will list some parameters to consider while tuning:
- The word vector dimensionality (50, 100, 200, 300) and source (Glove, GoogleNews)
- Dropout on different layers
- Making the Fully Connected/Dense layer deeper or wider
- Running fewer vs. more epochs
- Bigger or smaller batch sizes
9. Thoughts on Deep Learning Models
While deep models currently seem to perform the best, they are difficult to reproduce. There are several reasons I can imagine for this:
- The Deep Learning libraries and versions used in implementations vary and may be outdated. Different libraries differ in small implementation details (like random initialization) which can lead to huge differences later on and are difficult to track.
- The large number of parameters, coupled with the stochastic nature of training, means that a model's performance can vary greatly. Add the time it takes to get one result, and tuning becomes very, very difficult.
- “When a model doesn’t perform well, is it because the model is wrong or is my implementation wrong?” This is a difficult question to answer.
10. Conclusion
My time spent understanding, developing and testing Neural Networks for Similarity Learning has led me to conclude that:
- the current methods are not significantly better than a simple unsupervised word2vec baseline.
- the current methods are unreproducible/difficult to reproduce.
This work and effort has provided:
- a method to evaluate models
- implementations of some of the current SOTA models
- a list of datasets to evaluate models on and their readers
11. Future Work
If anybody were to carry on with this work, I would recommend looking at the QA-Transfer model, which claims the current "State Of The Art". Although my implementation couldn't reproduce the result, I feel there is some merit in the model. The reason for this belief is not QA-Transfer itself but the underlying BiDAF model, which provides two demos that do really well. Since there are so many discrepancies between implementations, it would be better to use the ported tf1.8 BiDAF.
If you are looking to do "newer research", I would recommend doing what the authors of QA-Transfer did: they looked at the best performing model on the SQUAD 1.1 dataset, pretrained on it, removed the last layer and evaluated. Things have changed since: there is a newer, better SQUAD (2.0), and that leaderboard has a new "best" model. In fact, the BiDAF model is no longer the best on SQUAD 1.1 or 2.0. Just take the current best model and do the pretraining!
If there's anything to take away from QA-Transfer, it's that span supervision and pretraining can be good ideas. Pretraining, especially, seems to be a new trend with Language Models.