https://emanjavacas.github.com/slides-content/antwerp-nlpmeetup-18
Write a story about robots, using a robot, in relation to “I, Robot”
Co-creatively write a science-fiction story using a system trained on similar literature
… a collaborative process between multiple agents, where in this context, one agent is a computational system. […] where crucially, the result of the output is greater than “the sum of its parts” (Davis 2013)
| | Sentences | Words | Characters |
|---|---|---|---|
| Number | 24.6m | 425.5m (NLWiki: >300m) | 2,001m |
| Novel average | 3k | 59k | 309,531 |
\(P(The, cat, sat, on, the, mat, .)\) =
\(P(The | \text{<}bos\text{>})\)
* \(P(cat | \text{<}bos\text{>} , The)\)
* \(\ldots\)
* \(P(. | \text{<}bos\text{>} , \ldots , mat)\)
More formally…
\(P(w_1, w_2, \ldots, w_n) = P(w_1 \mid \text{<}bos\text{>}) \cdot \prod_{i=2}^n P(w_{i} \mid w_1, \ldots, w_{i-1})\)
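The chain-rule factorization above can be sketched with a toy model. The conditional probabilities below are made up for illustration (and condition only on the previous token, not the full history):

```python
import math

# Hypothetical next-token probabilities, P(next | previous).
# These numbers are illustrative, not taken from any trained model.
COND_PROBS = {
    ("<bos>", "The"): 0.20,
    ("The", "cat"): 0.05,
    ("cat", "sat"): 0.10,
    ("sat", "on"): 0.30,
    ("on", "the"): 0.40,
    ("the", "mat"): 0.02,
    ("mat", "."): 0.50,
}

def sentence_log_prob(tokens):
    """Chain rule: log P(w_1..w_n) = sum_i log P(w_i | history)."""
    history = "<bos>"
    total = 0.0
    for tok in tokens:
        total += math.log(COND_PROBS[(history, tok)])
        history = tok
    return total

log_p = sentence_log_prob(["The", "cat", "sat", "on", "the", "mat", "."])
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.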
RNNLM Implementation (Embedding + RNN Layer + Output Softmax)
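A minimal numpy sketch of that forward pass (embedding lookup, recurrent update, output softmax). It uses a plain tanh cell instead of the GRU/LSTM from the slides, toy sizes, and random untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_size, hidden_size = 50, 24, 32   # toy sizes

E = rng.normal(scale=0.1, size=(vocab_size, emb_size))         # embedding table
W_xh = rng.normal(scale=0.1, size=(emb_size, hidden_size))     # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # hidden -> logits

def softmax(x):
    x = x - x.max()            # numerical stability
    e = np.exp(x)
    return e / e.sum()

def forward(char_ids):
    """Return one next-character distribution per input position."""
    h = np.zeros(hidden_size)
    probs = []
    for i in char_ids:
        x = E[i]                          # embedding lookup
        h = np.tanh(x @ W_xh + h @ W_hh)  # recurrent state update
        probs.append(softmax(h @ W_hy))   # output distribution over vocab
    return np.stack(probs)

out = forward([3, 17, 42])    # one row of probabilities per character
```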
Sample “n” characters from the Language Model
Multinomial sampling with temperature
We run the model over characters
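Temperature sampling divides the logits by a temperature \(T\) before the softmax: \(T < 1\) sharpens the distribution (safer, more repetitive text), \(T > 1\) flattens it (more surprising text). A small sketch with illustrative logits:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_temperature(logits, temperature, rng):
    """Draw one index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()               # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.5, 0.1]            # made-up model outputs
_, sharp = sample_with_temperature(logits, 0.5, rng)  # low T: peaked
_, flat = sample_with_temperature(logits, 2.0, rng)   # high T: flat
```

At low temperature the probability mass concentrates on the argmax; as \(T \to \infty\) the distribution approaches uniform.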
Parameter | Range |
---|---|
Embedding sizes | 24, 46 |
RNN Cell | GRU, LSTM |
Hidden size | 1024, 2048 |
Hidden Layers | 1 |
Stochastic Gradient Descent (SGD) + bells and whistles
Parameter | Value |
---|---|
Optimizer | Adam (default params) |
Learning rate | 0.001 |
Gradient norm clipping | 5.0 |
Dropout | 0.3 (RNN output) |
BPTT | 200 |
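Gradient norm clipping (the 5.0 in the table) rescales all gradients when their global L2 norm exceeds the threshold, which keeps RNN training stable when gradients explode. A numpy sketch of the same operation PyTorch performs in `clip_grad_norm_`:

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy "exploding" gradients for two parameter tensors.
grads = [np.full(4, 10.0), np.full(3, -10.0)]
clipped, norm_before = clip_grad_norm(grads, 5.0)
```

The gradient *direction* is preserved; only its magnitude is capped.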
| | Character-level | Word-level |
|---|---|---|
| Vocabulary | Smaller (<1,000) | Larger (≃3m) |
| Dataset size | Larger (<2,000m) | Smaller (>425m) |
| Preprocessing | None | Tokenization |
| Overfitting | Not a problem | Quite a problem |
| Dependency span | Smaller (BPTT 250 ≃ 50 words) | Larger |
| Output distribution | Not too interesting | Quite interesting |
Non-Negative Matrix Factorization
Word co-occurrence matrix
Latent topics (unnormalized distributions over words)
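NMF factorizes a non-negative matrix \(V\) (here: word co-occurrence counts) into non-negative factors \(W\) (words × topics) and \(H\) (topics × words), where the rows of \(H\) are the unnormalized topic-word distributions. A sketch using Lee & Seung's multiplicative updates on a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics = 20, 3
V = rng.random((n_words, n_words))   # stand-in co-occurrence counts

W = rng.random((n_words, n_topics))  # words x latent topics
H = rng.random((n_topics, n_words))  # latent topics x words
eps = 1e-9                           # avoid division by zero

def frobenius_error(V, W, H):
    return np.linalg.norm(V - W @ H)

err_before = frobenius_error(V, W, H)
for _ in range(100):
    # Multiplicative updates: preserve non-negativity and
    # monotonically decrease the Frobenius reconstruction error.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = frobenius_error(V, W, H)
```

Because the updates only ever multiply by non-negative ratios, \(W\) and \(H\) stay non-negative throughout, which is what makes the learned topics interpretable as (unnormalized) distributions over words.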
Latent Topic Language Model
<Unk>
Process input at the character level, producing word-level embeddings (CNN, RNN)
Since reducing the vocabulary size is not advisable, speed up the softmax computation during training instead (e.g. with a sampled or hierarchical softmax)
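One such speed-up is sampled softmax: during training, normalize only over the target word plus a small random sample of negatives instead of the full vocabulary. The sketch below is simplified (uniform sampling, no logQ correction, random untrained weights), not the exact objective from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, n_samples = 10_000, 64, 20

# Output projection: one weight vector per vocabulary word.
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def sampled_softmax_loss(h, target, rng):
    """Cross-entropy over target + n_samples random negatives only."""
    negatives = rng.choice(vocab_size, size=n_samples, replace=False)
    candidates = np.concatenate(([target], negatives))
    logits = W_out[candidates] @ h      # |candidates| dot products, not |V|
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])            # target sits at index 0

h = rng.normal(size=hidden_size)        # stand-in RNN hidden state
loss = sampled_softmax_loss(h, target=42, rng=rng)
```

The cost per training step drops from \(O(|V|)\) to \(O(k)\) dot products for \(k\) sampled candidates; the full softmax is still used at test time.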