Monday, April 3, 2017

Language Modelling Datasets and Tools

An introduction to language modelling (LM) is here.
For RNN LMs, see Mikolov's slides here.
Mikolov's RNN LM toolkit is here.
SRI International's language modelling toolkit (SRILM) is here.
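To make concrete what these toolkits estimate, here is a minimal sketch of a count-based bigram language model with add-one smoothing, scoring a sentence by perplexity. This is a toy illustration, not how any of the toolkits above are implemented (SRILM, for instance, uses more sophisticated smoothing such as modified Kneser-Ney); all function names here are invented for the example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over sentences padded with
    start (<s>) and end (</s>) markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams, vocab_size):
    """Add-one-smoothed bigram perplexity of one sentence:
    exp of the negative mean per-token log-probability."""
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob, n = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        # P(word | prev) with add-one (Laplace) smoothing
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram_lm(train)
vocab_size = len(unigrams)
print(perplexity(["the", "cat", "sat"], unigrams, bigrams, vocab_size))
```

Lower perplexity means the model finds the sentence less surprising; a sentence drawn from the training data scores close to 1, while unseen word pairs push the score up.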

Training language models with RNNs can be slow. This is a faster implementation that uses the Eigen linear-algebra library.

There is a 1B-token benchmark dataset released by Google for evaluating language models; it can be obtained here. It comes tokenised and split into held-out and training portions. Shard 0000 of the held-out portion is used for reporting test results.
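When reporting perplexity on a shard like this, the convention is to aggregate log-probabilities over all tokens in the shard rather than averaging per-sentence perplexities. A minimal sketch of that aggregation (the function name and the example numbers are illustrative, not from the benchmark):

```python
import math

def corpus_perplexity(sentence_logprobs, token_counts):
    """Corpus-level perplexity: exponentiate the negative average
    per-token log-probability (natural log) over the whole shard."""
    total_logprob = sum(sentence_logprobs)
    total_tokens = sum(token_counts)
    return math.exp(-total_logprob / total_tokens)

# Two hypothetical sentences: total log-prob -20 nats over 8 tokens.
print(corpus_perplexity([-12.0, -8.0], [5, 3]))  # exp(20/8) ≈ 12.18
```

Averaging the two per-sentence perplexities instead would weight short sentences too heavily and give a different (incorrect) number.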


Continuously monitor GPU usage

For NVIDIA GPUs, run the following, which refreshes the report every second: nvidia-smi -l 1. Alternatively, watch -n 1 nvidia-smi gives the same continuous view.