Günter Neumann, DFKI, 12th April 2018:

This short readme describes how we created the word embedding files that we used for our participation in SemEval-2018 Task 7, "Semantic relation extraction and classification in scientific papers".

Background:
___________

A more detailed description of our system LightRel, which we used in that competition, can be found in our SemEval-2018 paper:
Tyler Renslow and Günter Neumann: LightRel at SemEval-2018 Task 7: Lightweight and Fast Relation Classification. In Proceedings of SemEval-2018, 2018. See also: http://www.dfki.de/~neumann/publications-selected.html
If you use the word embeddings described below, please cite our paper.

Getting the corpus:
___________________
We used two source corpora:

1. Data available from https://aminer.org/citation; specifically, we used the ACM-Citation-network V9 corpus.

2. A somewhat older collection of abstracts from DBLP, which we combined with abstracts from the SemEval-2018 Task 7 training data. The file is in this folder and is called: abstracts-dblp-semeval2018.txt.gz


Extracting abstracts:
_____________________
To use the ACM corpus for creating word embeddings, we extracted all abstracts from it (abstract lines in acm.txt start with "#!") as follows, after downloading it to a local folder:

1. unzip acm.v9.zip
2. time grep '^#!' acm.txt | sed 's/^#!//' > acm_abstracts.txt

(That takes about 14 seconds on a MacBook Pro with a 4-core i7 processor and 16GB RAM)

The resulting file contains one abstract per line.
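
For readers without grep and sed at hand, the extraction step above can be sketched in Python. This is only an illustration, not the procedure we ran: the function names are hypothetical, and reading the file as UTF-8 with errors="replace" is an assumption (undecodable bytes would be replaced, unlike with grep).

```python
from typing import Iterable, Iterator

ABSTRACT_PREFIX = "#!"  # marker that starts an abstract line in acm.txt


def extract_abstracts(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that starts with '#!', with the marker removed."""
    for line in lines:
        if line.startswith(ABSTRACT_PREFIX):
            yield line[len(ABSTRACT_PREFIX):].rstrip("\n")


def extract_file(src_path: str, dst_path: str) -> int:
    """Write one abstract per line to dst_path and return how many were written."""
    count = 0
    # Assumption: the corpus can be read as UTF-8; bad bytes are replaced.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for abstract in extract_abstracts(src):
            dst.write(abstract + "\n")
            count += 1
    return count
```

Under these assumptions, extract_file("acm.txt", "acm_abstracts.txt") should produce the same kind of one-abstract-per-line file as the shell pipeline.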

Total number of abstracts: 1,672,248, with 230,032,985 tokens.
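
Counts like these can be re-checked with a few lines of Python. The sketch below is a rough equivalent of wc -l and wc -w; it assumes that a token is a whitespace-separated string, which may not be exactly how the figures above were obtained, so small differences are possible. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def count_abstracts_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lines, number of whitespace-separated tokens)."""
    n_abstracts = 0
    n_tokens = 0
    for line in lines:
        n_abstracts += 1
        # Assumption: tokens are split on whitespace only.
        n_tokens += len(line.split())
    return n_abstracts, n_tokens
```

It can be applied to an open file, for example: with open("acm_abstracts.txt", encoding="utf-8") as f: print(count_abstracts_and_tokens(f))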

Creating word vectors:
______________________

We used word2vec version 0.1c to create the word vectors, cf. https://code.google.com/archive/p/word2vec/

Calling word2vec for acm_abstracts.txt:

time word2vec -train acm_abstracts.txt -output acm_abstracts.wcs.txt -size 300 -min-count 5 -binary 0

(That takes about 28 minutes on a MacBook Pro with a 4-core i7 processor and 16GB RAM)

Calling word2vec for abstracts-dblp-semeval2018.txt:

1. gunzip abstracts-dblp-semeval2018.txt.gz

2. time word2vec -train abstracts-dblp-semeval2018.txt -output abstracts-dblp-semeval2018.wcs.txt -size 300 -min-count 5 -binary 0
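
With -binary 0, word2vec writes a plain-text file: a header line with the vocabulary size and the vector dimension (300 for -size 300), followed by one line per word containing the word and its vector components separated by spaces. A minimal Python sketch of a reader for this format is given below. The function name is hypothetical, and this is not the loader used inside LightRel.

```python
from typing import Dict, Iterable, List


def read_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text output (-binary 0) into a word -> vector mapping."""
    it = iter(lines)
    header = next(it).split()
    # Header: "<vocab_size> <dim>"; only the dimension is needed for checking.
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")  # rstrip drops the trailing space
        if parts == [""]:
            continue  # ignore blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values for word {parts[0]!r}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

For the full files, a dictionary of Python lists uses a lot of memory; an array library would be the more practical choice, and the sketch is mainly meant to document the file layout.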


Available files in the folder:
______________________________

0. README.txt: this readme file

1. abstracts-dblp-semeval2018.txt.gz: compressed source file of abstracts from DBLP and SemEval 2018

2. abstracts-dblp-semeval2018.wcs.txt.gz: the word embeddings computed from the file abstracts-dblp-semeval2018.txt (see comments above)

3. acm_abstracts.wcs.txt.gz: the word embeddings computed from the file acm_abstracts.txt (see comments above)

Note: before using the word embeddings with our LightRel system, decompress them with gunzip.
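
If gunzip is not available, the decompression can also be done with Python's standard library. The following is a small sketch; the function name is hypothetical.

```python
import gzip
import shutil


def gunzip_file(gz_path: str, out_path: str) -> None:
    """Decompress gz_path into out_path, leaving the .gz file in place."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # Copy in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(src, dst)
```

For example, gunzip_file("acm_abstracts.wcs.txt.gz", "acm_abstracts.wcs.txt"). Unlike a plain gunzip call, this keeps the compressed file.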