Assessing the Stylistic Properties of Neurally Generated Text in Authorship Attribution

Enrique Manjavacas & Jeroen De Gussem & Walter Daelemans & Mike Kestemont

08/09/2017

EMNLP17 (Workshop on Stylistic Variation) - Copenhagen

https://emanjavacas.github.com/slides-content/copenhagen-emnlp-17

1 Introduction

RNNs are a powerful tool for text generation

  • Neurally generated text makes an authentic impression on readers
  • Basis for renewed interest in NLG
  • Recent interest in the information encoded by RNNs

Evaluation of neural synthetic text through Authorship Attribution

  • To what extent does the generative model retain author-related stylistic properties of the source?
  • How does it compare to traditional n-gram models?

Experiment idea

  • Fit Language Models (one per author)
  • Generate a collection of docs per author through sampling
  • Apply Authorship Attribution (AA) to the resulting synthetic dataset

Assumption

“Algorithmic” definition of style

Style is whatever textual properties an algorithm might use in order to successfully attribute a text to its actual author, without considering whether the properties discovered by the algorithm are stylistic in a narrower sense.

2 Summary of the presentation

  • Describe the models used for text generation
  • Describe the experimental setup
  • Discuss the results

3 Text Generation

Character-level Text Generation with Language Models

\(P(w_1, w_2, \ldots, w_n) = P(w_1 \mid \langle bos \rangle) \prod_{i=1}^{n-1} P(w_{i+1} \mid w_1, \ldots, w_i)\)
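As a concrete illustration (not from the slides), the factorization of the three-character string "cat" reads: \(P(c, a, t) = P(c \mid \langle bos \rangle)\, P(a \mid c)\, P(t \mid c, a)\). Text is generated by sampling one character at a time from these conditionals, feeding each sampled character back in as context.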

Models

NGLM

\(P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1})\)

RNNLM

[Figure: RNNLM architecture diagram]

4 Experiment

Experimental Setup

Difficulty

Maximize comparability of authentic and generated text

  • Unequal training size per author for LMs
  • Unequal training and test size per author for the attributor (important in AA)
  • Authentic text has doc-level structure, LM-generated text does not

Proposed method

  • Random even doc-level split (referred to as \(\alpha\) and \(\omega\) for simplicity)
  • Create 20 fixed-size (5,000-word) docs per split by sampling sentences (sketched below)
  • Sample a third set (\(\bar{\alpha}\)) from each author's LM trained on \(\alpha\)
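A minimal sketch of the document-construction step, assuming sentences are drawn with replacement (the slides do not specify the exact sampling scheme; function and variable names are illustrative):

    import random

    def make_fixed_size_docs(sentences, n_docs=20, doc_size=5000, seed=1):
        """Assemble fixed-size documents by sampling sentences until each
        document reaches roughly `doc_size` words."""
        rng = random.Random(seed)
        docs = []
        for _ in range(n_docs):
            words, doc = 0, []
            while words < doc_size:
                sentence = rng.choice(sentences)
                doc.append(sentence)
                words += len(sentence.split())
            docs.append(" ".join(doc))
        return docs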

Attribution Experiments

Dataset

  • Patrologia Latina (≈ 113M words)
  • Ecclesiastical Latin texts spanning 1,000 years
  • Homogeneous institutionalized literary language (L2 language)

Language Model Fitting

NGLM

  • N-gram order is set to 6
  • Parameters estimated through MLE (no smoothing, since we are only interested in generation)
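A minimal sketch of such a character-level MLE model with order 6, as an illustrative reimplementation rather than the authors' code:

    import random
    from collections import Counter, defaultdict

    def fit_nglm(text, order=6):
        """Count-based character n-gram LM (MLE, no smoothing)."""
        counts = defaultdict(Counter)
        padded = "~" * (order - 1) + text            # '~' marks the left boundary
        for i in range(order - 1, len(padded)):
            counts[padded[i - order + 1:i]][padded[i]] += 1
        return counts

    def sample_nglm(counts, n_chars=200, order=6, seed=0):
        """Generate text by sampling from the conditional MLE distributions."""
        rng = random.Random(seed)
        out = "~" * (order - 1)
        for _ in range(n_chars):
            dist = counts.get(out[-(order - 1):])
            if not dist:                             # unseen history: stop early
                break
            chars, freqs = zip(*dist.items())
            out += rng.choices(chars, weights=freqs, k=1)[0]
        return out[order - 1:]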

RNNLM

Model definition

Parameter Value
Embedding size 24
RNN Cell LSTM
Hidden size 200
Hidden Layers 2

Training

Parameter Value
Batch size 50
Optimizer Adam (default params)
Learning rate 0.001
Gradient norm clipping 5.0
Dropout 0.3 (RNN output)
Epochs 50

Validation perplexity: 4.015 (± 0.183)
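For illustration, a comparable character-level LSTM LM and training step could look as follows, e.g. in PyTorch (the slides do not name the framework; this is a sketch under that assumption):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        """Character-level LSTM LM with the hyperparameters listed above."""
        def __init__(self, vocab_size, emb_dim=24, hidden_dim=200, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True)
            self.drop = nn.Dropout(0.3)              # dropout on the RNN output
            self.proj = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, hidden=None):
            output, hidden = self.lstm(self.embed(x), hidden)
            return self.proj(self.drop(output)), hidden

    def train_step(model, x, y, optimizer, clip=5.0):
        """One optimization step with gradient-norm clipping at 5.0."""
        logits, _ = model(x)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        return loss.item()

    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)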

Attributor

  • Linear SVM (scikit-learn) on tf-idf character {2,3,4}-grams
  • No word-level features, since the RNNLM can produce unseen words
  • Grid-search over parameters
Parameter Grid values
Max-features 5,000; 10,000; 15,000; 30,000
C 1; 10; 100; 1,000
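A sketch of the attributor in scikit-learn; whether the character n-grams cross word boundaries and which scoring metric drives the grid search are assumptions here:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # tf-idf over character {2,3,4}-grams feeding a linear SVM
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("svm", LinearSVC()),
    ])
    param_grid = {
        "tfidf__max_features": [5000, 10000, 15000, 30000],
        "svm__C": [1, 10, 100, 1000],
    }
    grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
    # grid.fit(train_docs, train_authors)   # docs: list of str, authors: list of str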

5 Results

5-fold cross-validated scores per experiment

Numbers


Source  Experiment   F1     P      R
Real    <α, ω>       0.833  0.818  0.869
        <ω, α>       0.811  0.795  0.853
NGLM    <ᾱ, ω>       0.706  0.744  0.750
        <ω, ᾱ>       0.837  0.811  0.881
RNNLM   <ᾱ, ω>       0.635  0.701  0.658
        <ω, ᾱ>       0.724  0.778  0.775

Discussion

Why does the NGLM outperform the RNNLM in both setups (\(<\bar{\alpha},\omega>\) and \(<\omega,\bar{\alpha}>\))?

The SVM relies on very local features (character n-grams), and the NGLM reproduces the local character distribution very faithfully.

[Figure] Lexical overlap: mean-normalized character n-gram Jaccard similarity across authors.
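For reference, a minimal version of the pairwise overlap measure (how the scores are mean-normalized and averaged across authors is not spelled out here):

    def char_ngram_jaccard(text_a, text_b, n=3):
        """Jaccard similarity between the character n-gram sets of two texts."""
        grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
        grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
        return len(grams_a & grams_b) / len(grams_a | grams_b)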

Why does the NGLM setup \(<\omega, \bar{\alpha}>\) outperform the Real setup \(<\omega, \alpha>\)?

  • Pruning effect? Generation may eliminate "distracting" features and enhance those that are more relevant
  • This might prove beneficial for actual AA

6 Self-learning (Data-augmentation) Experiments

  • Is there still some authorial signal in the RNNLM-generated data?
  • Is there an effect of the long-term dependencies learned by the RNNLM on the stylistic properties of the generated data?
  • If so, augmenting the authentic training data with RNNLM-generated data could yield attribution improvements

Experiment

\(<\alpha+\bar{\alpha}, \omega>\)
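In terms of the attributor sketched earlier, this experiment amounts to concatenating authentic and RNNLM-generated training documents before fitting; the variable names below are hypothetical:

    # α + ᾱ: authentic training docs plus LM-generated docs for the same authors
    train_docs = alpha_docs + alpha_bar_docs
    train_labels = alpha_authors + alpha_bar_authors
    grid.fit(train_docs, train_labels)        # reuse the GridSearchCV from above
    predictions = grid.predict(omega_docs)    # evaluate on the held-out ω split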

Numbers


Source  Experiment   F1     P      R
Real    <α, ω>       0.833  0.818  0.869
        <ω, α>       0.811  0.795  0.853
NGLM    <α+ᾱ, ω>     0.814  0.809  0.850
        <ᾱ, ω>       0.706  0.744  0.750
        <ω, ᾱ>       0.837  0.811  0.881
RNNLM   <α+ᾱ, ω>     0.872  0.878  0.892
        <ᾱ, ω>       0.635  0.701  0.658
        <ω, ᾱ>       0.724  0.778  0.775

Discussion

-> The long-term dependencies learned by the RNNLM prove beneficial (not redundant with the authentic data)

-> (Evidence that) the RNNLM better models the stylistic variation in the original distribution

7 Conclusion

  • LMs seem to capture stylistic properties to a certain extent
  • Arguably RNNLMs are richer models (based on the augmentation experiments)
  • More global attributors are still needed; the stylistic evaluation is still too local
  • Further exploration of data pruning (NGLM) & data augmentation (RNNLM) for AA

8 Thank you for your attention!

9 Plots

[Figure] Doc-level PCA: NGLM (documents represented by their 150 most frequent character n-grams)

[Figure] Doc-level PCA: RNNLM (documents represented by their 150 most frequent character n-grams)