COMPSCI5002, Lab Work Part 2
Dr. Debasis Ganguly and Dr. Yehia Elkhatib
July 5, 2021
Modeling Word Semantics
In this exercise, you’ll work on a bigger project that will test your skills in code design, working with
objects, and file handling. You will develop a Java project for a well-defined task from scratch, and
also test the code yourself for correctness and efficiency.
The problem that you will solve is to figure out how similar a word u is to another word v by computing a
word similarity measure. For example, this similarity value is expected to be high for the pair (u = sun, v = solar),
and low for the pair (u = sun, v = milk).
Representing words as Multisets of context words
Given a collection of text, e.g. the one attached to this lab exercise (the complete works of Shakespeare),
you can represent each word in the vocabulary as a sparse vector (multiset) of its context words. For instance,
the word Venice occurs at various places in the text as shown below.
... Magnificoes of Venice, Officers of ...
SCENE I. Venice. A street...
...trades to Venice. Waste no...
We now want to apply the idea that a word is known by its context. For our task, this
means that we construct a multiset by taking the union over the contexts of every occurrence of a word. In
the example, this means that the context multiset of the word Venice will be
{magnificoes, of, officers, of} ∪ {scene, i, a, street} ∪ {trades, to, waste, no} =
{ (a, 1), (i, 1), (magnificoes, 1), (no, 1), (of, 2), (officers, 1), (scene, 1), (street, 1), (to, 1), (trades, 1),
(waste, 1) }
Note that this multiset is a sparse vector representation of a word (in our example, ‘venice’).
The first part of the pair indicates the string value of the word itself, whereas the second part is the
word’s frequency. In the example, the frequency of ‘of’ is 2 because it occurs twice across the contexts (make
sure you thoroughly understand this example before commencing with the exercise).
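The union of contexts above can be reproduced with a Java Map, as covered in the lectures. The following is a minimal sketch; the class and method names are illustrative, not part of the required solution.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: merging context words into one frequency multiset.
public class ContextMultisetDemo {
    // Adds each token of a context to the multiset, incrementing its count.
    public static void addContext(Map<String, Integer> multiset, String... tokens) {
        for (String t : tokens) {
            multiset.merge(t, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> venice = new TreeMap<>();
        // The three occurrences of 'venice' from the example (k = 2, lowercased).
        addContext(venice, "magnificoes", "of", "officers", "of");
        addContext(venice, "scene", "i", "a", "street");
        addContext(venice, "trades", "to", "waste", "no");
        System.out.println(venice); // 'of' has frequency 2, all other words 1
    }
}
```

A TreeMap is used here only so the printed multiset comes out in alphabetical order, matching the listing above.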
To define the contexts precisely, you need to decide on a few things –
1. The set of punctuation symbols, i.e. during the process of tokenization, the token ‘Venice,’ should be
converted into ‘Venice’ (a simple way to do this is to remove the punctuation symbols from the text
file as a pre-processing step or invoke Java’s inbuilt StringTokenizer class with the appropriate set
of delimiters).
2. Apply case normalization, e.g. treat Scene and scene in an identical manner.
3. Decide on the length of the context (on both the left and right sides) of a word. Let this number (a
parameter) be k. Our example used k = 2. Too small a k may not be able to capture the informative
words, while too large a k could add noise to the contexts.
For your exercise, use k = 5 (i.e. 5 words on both the left and the right of a word), and use the following
set of symbols as delimiters – `'"{}[].;,! plus the whitespace characters (i.e. carriage return, newline, tab
and space). You should also apply case normalization, i.e. convert every token to its lowercase form, so that
Scene and SCENE are both converted to scene.
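One way to implement this tokenization is via Java’s inbuilt StringTokenizer, as suggested above. The sketch below uses the delimiter set just specified; the class name is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative tokenizer: splits on the specified punctuation symbols plus
// whitespace, and applies case normalization.
public class TokenizerDemo {
    // Delimiters: `'"{}[].;,! plus carriage return, newline, tab and space.
    private static final String DELIMS = "`'\"{}[].;,!\r\n\t ";

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase()); // case normalization
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SCENE I. Venice. A street..."));
        // → [scene, i, venice, a, street]
    }
}
```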
Note on Implementing Multisets
The lectures have already described how to implement multisets with Java Map (HashMap or TreeMap). For
this exercise, you first need to represent every unique word in the text (commonly known as the vocabulary)
as a multiset (sparse vector of the contexts).
Note that the sparse vector implementation is particularly useful for this exercise because the vocabulary
size (say V ) is expected to be large, whereas the number of unique words in the context
vector of a word is expected to be much smaller than V .
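Putting the pieces together, the vocabulary can be held as a map from each word to its context multiset. The sketch below builds these sparse vectors over a token list with window size k; the names are illustrative and the windowing choices (e.g. excluding the pivot word itself) are assumptions you are free to revisit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one sparse context vector (multiset) per vocabulary word.
public class VectorizerDemo {
    // Builds, for each word, a multiset of the words within k positions of it.
    public static Map<String, Map<String, Integer>> vectorize(List<String> tokens, int k) {
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            Map<String, Integer> vec =
                vectors.computeIfAbsent(tokens.get(i), w -> new HashMap<>());
            int from = Math.max(0, i - k);
            int to = Math.min(tokens.size() - 1, i + k);
            for (int j = from; j <= to; j++) {
                if (j != i) { // the pivot word itself is not part of its context
                    vec.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> toks = List.of("trades", "to", "venice", "waste", "no");
        Map<String, Map<String, Integer>> v = vectorize(toks, 2);
        System.out.println(v.get("venice")); // contains trades, to, waste, no (each once)
    }
}
```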
Specific Tasks
The driver program
Q 10 (7 marks). Write a main class Word2Vec which when invoked on a properties file (filename passed
as an argument) vectorizes each word by processing a given corpus (i.e. computes the multiset of contexts
over each occurrence of it, as shown in the two frame boxes above). You then need to save the vectors on
secondary storage. You are free to use a file format of your choice (text or random-access; refer to the lecture
slides). For marking purposes, we want to see the output of the top 3 most frequent words in
the context of the word ‘venice’.
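For the marking requirement above you will need to rank a context multiset by frequency. One possible sketch, using the stream API (the class and method names are illustrative):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: extracting the k most frequent entries of a multiset.
public class TopKDemo {
    public static List<String> topK(Map<String, Integer> multiset, int k) {
        return multiset.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> vec = Map.of("of", 2, "scene", 1, "street", 1, "magnificoes", 1);
        System.out.println(topK(vec, 3)); // 'of' comes first; ties are broken arbitrarily
    }
}
```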
Compute the cosine similarity between a pair of words
The similarity between two given words can be calculated in different ways. We have provided one
implementation for you to use, which is available in the git repository. You will need to invoke the method
sim(), and build around it to solve the next question.
Q 11 (7 marks). Write a main class FindWordPairSim that first reads the file saved as a part of the previous
exercise, takes a pair of words as two command line arguments, and prints out the similarity between the
two. For marking purposes, we would like you to print the similarity value between the words
‘shylock’ and ‘bassanio’.
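You must use the provided sim() method, but as a reference point, cosine similarity over two sparse vectors could be sketched as below. This is an assumption about how such a measure is typically computed, not the provided implementation.

```java
import java.util.Map;

// Illustrative sketch of cosine similarity over two sparse vectors; the actual
// sim() method to use in your submission is provided in the git repository.
public class CosineDemo {
    public static double cosine(Map<String, Integer> u, Map<String, Integer> v) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : u.entrySet()) {
            Integer w = v.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // only shared dimensions contribute
        }
        return dot / (norm(u) * norm(v));
    }

    private static double norm(Map<String, Integer> vec) {
        double sum = 0.0;
        for (int f : vec.values()) sum += (double) f * f;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> u = Map.of("jew", 2, "venice", 1);
        Map<String, Integer> v = Map.of("jew", 1, "venice", 1, "of", 3);
        System.out.printf("%.4f%n", cosine(u, v));
    }
}
```

Note that only the dimensions present in both sparse vectors contribute to the dot product, which is exactly why the sparse representation pays off here.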
Hint: For the implementation, you might find the MultiSet code useful, which also forms a part of your
lecture practice problems.
You do not need to include the answers to the following questions in your lab work submission. Rather,
these are meant to prompt you to reflect back on your understanding of the problem, and some will be useful
in preparing you for the end-of-semester exam.
1. About loading of the dataset.
(a) Did you load the entire corpus (Shakespeare’s play collection) in memory from disk?
(b) What happens if the text to work on is the entire Wikipedia dataset? Is it still likely to fit in memory?
(c) Is loading the entire file in memory the most effective way to construct the sparse vectors?
(d) What other solution could you think of?
2. About the data structure to store the sparse vectors.
(a) Would you prefer a HashMap or a TreeMap to store the vectors for each word?
(b) What if, in addition to vectorizing words, you’re asked to vectorize bigrams (phrases of length 2,
e.g. ‘turn out’) or even 6-grams (e.g. ‘to be or not to be’)? What problems could you then
have with a hashmap?
(c) When in your program did you need to find a word vector given its string representation?
(d) Did you use a treemap or a hashmap to organize the words?
3. Food for thought
(a) What is likely to happen if the context size is increased to 100?
(b) Are all words around the context of a pivot word actually useful to represent the pivot word?
Consider for example the word shylock (remember that we use lowercase in our vocabulary). Do
you think that words such as of or to are as important as, say, jew or venice?
(c) If not all words are equally important, is there a way to modify the weights (of the terms in the
multiset representation of a pivot word) to better reflect this aboutness mechanism?
(d) How easy (or difficult) will it be to incorporate this modification in your implementation? Will
the run-time effectiveness of the solution change as a result of this modification?