Model          Goal Achievement Rate   Average Perplexity   Average Story Length
Seq2Seq        35.52%                  48.06                7.11
Unrestricted   15.82%                  5.73                 7.32
Clustered      94.29%                  7.61                 4.90
Test Corpus    24.64%                  n/a                  7.37

Table 1: Goal achievement rate, average perplexity, and average story length for the Seq2Seq baseline, the unrestricted and clustered DRL models, and the eventified testing corpus.
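As a rough illustration of how the automated metrics in Table 1 can be computed, the following sketch scores generated plots by perplexity and goal achievement rate. This is a minimal example, not the paper's evaluation code: the representation of stories as lists of (subject, verb, object, modifier) tuples and the per-token log-probability input are assumptions.

```python
import math

def perplexity(token_log_probs):
    # Standard perplexity: exp of the negative mean log-probability
    # (natural log) that the language model assigns to each token.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def goal_achievement_rate(stories, goal_verb):
    # A story is assumed to be a list of (subject, verb, object, modifier)
    # tuples; it achieves the goal if any event uses the goal verb class.
    achieved = sum(1 for story in stories
                   if any(verb == goal_verb for (_, verb, _, _) in story))
    return achieved / len(stories)

# Toy story ending in the goal verb admire-31.2 (cf. the last row of Table 2).
story = [("relative.n.01", "disappearance-48.2", "empty", "empty"),
         ("physical entity.n.01", "admire-31.2", "social gathering.n.01", "empty")]
print(goal_achievement_rate([story], "admire-31.2"))  # 1.0
```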
We recruited and taught individuals to convert event sequences into natural language before giving generated plots to human judges. By having concise, grammatically and semantically guaranteed human translations of generated plot events, we know that the human judges are evaluating the raw events and not the creative aspects of the way sentences are written.

Corpus Creation
We randomly selected 5 stories generated by our DRL-clustered system, 5 stories generated by the Seq2Seq baseline system, and 3 from the eventified testing corpus. We trained 26 people unassociated with the research team to "translate" events into short natural language sentences. The testing corpus was mainly used to verify that the translation process was accurate, since we do not expect our models to reach this upper bound; thus only three stories were selected.

Each translator was instructed that their "primary goal is to translate stories from an abstract 'event' representation into a natural language sentence." The instructions then continued with: (1) a refresher on parts of speech, (2) the format of the event representation, (3) examples of events and their corresponding sentences, (4) resources on WordNet and VerbNet with details on how to use both, and (5) additional general guidelines and unusual cases they might encounter (e.g., how to handle empty parameters in events). The translators were further instructed to produce sentences that were as faithful to the event as possible and to not add extraneous details, swap the order of words in the event, or choose a better verb, even if the plot would be improved.

Pairs of people translated plots individually and then came together to reach a consensus on a final version of the plot. That is, human translators reversed the eventification process shown in Figure 1 to create a human-readable sentence from an event. Table 2 shows an example of an entire eventified story and the corresponding human translations.

Experimental Setup
We recruited 175 participants on Amazon Mechanical Turk. Each participant was compensated $10 for completing the questionnaire. Participants were given one of the translated plots at a time, rating each of the following statements on a 5-point Likert scale for how much they agreed (Strongly Disagree to Strongly Agree):

1. This story exhibits CORRECT GRAMMAR.
2. This story's events occur in a PLAUSIBLE ORDER.
3. This story's sentences MAKE SENSE given sentences before and after them.
4. This story AVOIDS REPETITION.
5. This story uses INTERESTING LANGUAGE.
6. This story is of HIGH QUALITY.
7. This story is ENJOYABLE.
8. This story REMINDS ME OF A SOAP OPERA.
9. This story FOLLOWS A SINGLE PLOT.

By making the equal-interval assumption, we can turn the Likert values into numerals from 1 for Strongly Disagree to 5 for Strongly Agree.

The first seven questions are taken from a tool designed specifically for the evaluation of computer-generated stories that has been validated against human judgements [Purdy et al., 2018]. Each participant answered the questions for all three story conditions. The question about the story being a soap opera was added to determine how the performance of the DRL story generator affects reader perceptions of the theme, since the system was trained on soap-opera-like plots. The single-plot question was added to determine whether our DRL model was maintaining the plot better than the Seq2Seq model. The questions about correct grammar, interesting language, and avoiding repetition are irrelevant to our evaluation, since the natural language was produced by the human translators, but were kept for consistency with Purdy et al. [2018].

Finally, participants answered two additional questions that required short answers to be typed:

• Please give a summary of the story above in your own words.
• For THIS STORY, please select which of the previous attributes (e.g. enjoyable, plausible, coherent) you found to be the MOST IMPORTANT and explain WHY.

The answers to these questions were not evaluated. Instead, if any participants failed to answer the short-answer questions, their data was removed from the results. We removed 25 participants' data in total.

Results and Discussion
We performed a one-way repeated-measures ANOVA on the collected data, since each participant rated a story from each category. Average scores and their significance across conditions can be seen in Figure 2.

Questions on interesting language and avoiding repetition were not found to be significant across the three conditions. Since these are not related to event generation model performance, this provides an indication that the translations are fair across all conditions. Grammar was significantly different between testing corpus stories and DRL-generated stories (p < 0.05), which was unanticipated.
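As an illustration of this analysis, the sketch below converts Likert responses to the 1–5 scale under the equal-interval assumption and runs a one-way repeated-measures ANOVA with statsmodels. The column names and the toy data are assumptions; this is not the study's actual analysis script.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Map Likert labels onto numerals under the equal-interval assumption.
LIKERT = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly Agree": 5}

# Toy long-format data: every participant rates one story per condition,
# which is what makes a repeated-measures design appropriate.
df = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "condition": ["Seq2Seq", "DRL", "TestCorpus"] * 3,
    "response": ["Disagree", "Agree", "Agree",
                 "Neutral", "Agree", "Strongly Agree",
                 "Disagree", "Neutral", "Agree"],
})
df["rating"] = df["response"].map(LIKERT)

# One-way repeated-measures ANOVA: condition is the within-subject factor.
result = AnovaRM(df, depvar="rating", subject="participant",
                 within=["condition"]).fit()
print(result)
```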
DRL Event Output (subject, verb, object, modifier) → Translated Sentence
relative.n.01, disappearance-48.2, empty, empty → My cousin died.
NE1, say-37.7-1, visit, empty → Alexander insisted on a visit.
NE1, meet-36.3-1, female.n.02, empty → Alexander met her.
NE0, correspond-36.1, empty, NE1 → Barbara commiserated with Alexander.
physical entity.n.01, marry-36.2, empty, empty → They hugged.
group.n.01, contribute-13.2-2, empty, LOCATION → The gathering dispersed to Hawaii.
gathering.n.01, characterize-29.2-1-1, time interval.n.01, empty → The community remembered their trip.
physical entity.n.01, cheat-10.6, pack, empty → They robbed the pack.
physical entity.n.01, admire-31.2, social gathering.n.01, empty → They adored the party.
Table 2: An example of an eventified story and the translation written by a pair of participants. NE refers to a named entity.
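The events in Table 2 are essentially 4-tuples whose slots hold WordNet synsets, VerbNet classes, named-entity tags, or an "empty" placeholder. A minimal sketch of such a structure, assuming the field names and the placeholder convention shown in the table; the naive template renderer is only meant to show why trained human translators were used instead of automatic rendering:

```python
from typing import NamedTuple

class Event(NamedTuple):
    subject: str   # WordNet synset or named-entity tag (NE0, NE1, ...)
    verb: str      # VerbNet class
    obj: str       # object slot, or "empty"
    modifier: str  # modifier slot, or "empty"

def naive_translation(e: Event) -> str:
    # Crude template rendering: concatenate the non-empty slots.
    # Compare with the human translation "My cousin died." to see
    # why the study relied on human translators.
    words = [w for w in (e.subject, e.verb, e.obj, e.modifier) if w != "empty"]
    return " ".join(words) + "."

event = Event("relative.n.01", "disappearance-48.2", "empty", "empty")
print(naive_translation(event))  # relative.n.01 disappearance-48.2.
```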
Figure 2: Amazon Mechanical Turk participants rated a story from each category across nine different dimensions on a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). Single asterisks indicate p < 0.05; double asterisks indicate p < 0.01.
Upon further analysis, both the baseline Seq2Seq model and the DRL model generated empty values for the object and modifier at higher rates than found in the corpus. It is harder to make complete, grammatical sentences with only two tokens in an event, particularly when the verb is transitive and requires at least one object. Beyond more expected results, such as having a more plausible order, the testing corpus stories were also significantly more likely to be perceived as soap operas (p < 0.01), the genre from which the corpus stories were drawn. It is unclear why this would be the case, except that both the Seq2Seq and DRL models could be failing to learn some aspect of the genre despite being trained on the same corpus.

Stories in the DRL condition were perceived to have significantly more plausible orderings than those in the baseline Seq2Seq condition (p < 0.05) and were significantly more likely to be perceived as following a single plot (p < 0.05). Since stories generated by the baseline Seq2Seq model begin to lose coherence as the story progresses, these results confirm our hypothesis that the DRL's use of reward shaping keeps the plot on track. The DRL was also perceived as generating stories with more local causality than the Seq2Seq, although this result was not statistically significant.

For all other dimensions, the DRL stories were not found to be significantly different from the baseline stories. When continuing the training of a pre-trained language model using a reward function instead of the standard cross-entropy loss, there is a non-trivial chance that model updates will degrade any aspect of the model that is not related to goal achievement. Thus, a positive result is one in which DRL-condition stories are never rated significantly lower than Seq2Seq-condition stories. This shows that we are able to reach the goal state without any significant degradation in the other aspects of story generation compared to a baseline.

Conclusions
Language model–based story and plot generation systems produce stories that lack direction. Our reward-shaping technique learns a policy that generates stories that are probabilistically comparable with the training corpus but also achieve the given goal ∼93% of the time. Furthermore, the reward-shaping technique improves perplexity when generated plots are compared back to the testing corpus. In plot generation, however, comparison to an existing corpus is not the most significant metric, because novel plots may also be good. A human subject study shows that the reward-shaping technique significantly improves the plausible ordering of events and the likelihood of producing a sequence of events that is perceived to be a single, coherent plot. We thus demonstrated for the first time that control over neural plot generation can be achieved by providing a goal that indicates how a plot should end.
References
[Bamman et al., 2013] David Bamman, Brendan O'Connor, and Noah A. Smith. Learning Latent Personas of Film Characters. In ACL, pages 352–361, 2013.
[Cavazza et al., 2002] M. Cavazza, F. Charles, and S. Mead. Planning characters' behaviour in interactive storytelling. Journal of Visualization and Computer Animation, 13:121–131, 2002.
[Fan et al., 2018] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation. In ACL, pages 889–898, 2018.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional Sequence to Sequence Learning. In ICML, 2017.
[Gervás et al., 2005] Pablo Gervás, Belen Díaz-Agudo, Federico Peinado, and Raquel Hervás. Story plot generation based on CBR. Journal of Knowledge-Based Systems, 18(4–5):235–242, 2005.
[Jenks and Caspall, 1971] George F. Jenks and Fred C. Caspall. Error on choroplethic maps: definition, measurement, reduction. Annals of the Association of American Geographers, 61(2):217–244, 1971.
[Khalifa et al., 2017] Ahmed Khalifa, Gabriella A. B. Barros, and Julian Togelius. DeepTingle. arXiv preprint arXiv:1705.03557, 2017.
[Khandelwal et al., 2018] U. Khandelwal, H. He, P. Qi, and D. Jurafsky. Sharp nearby, fuzzy far away: How neural language models use context. In ACL, 2018.
[Lebowitz, 1987] Michael Lebowitz. Planning stories. In CogSci, pages 234–242, 1987.
[Li et al., 2013] Boyang Li, Stephen Lee-Urban, George Johnston, and Mark O. Riedl. Story generation with crowdsourced plot graphs. In AAAI, July 2013.
[Li et al., 2016] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. In EMNLP, pages 1192–1202, 2016.
[Martin et al., 2018] Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. Event Representations for Automated Story Generation with Deep Neural Nets. In AAAI, pages 868–875, 2018.
[Meehan, 1977] James R. Meehan. TALE-SPIN: An interactive program that writes stories. In IJCAI, pages 91–98, 1977.
[Miller, 1995] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[Mostafazadeh et al., 2016] Nasrin Mostafazadeh, Lucy Vanderwende, Wen-tau Yih, Pushmeet Kohli, and James Allen. Story Cloze Evaluator: Vector Space Representation Evaluation by Predicting What Happens Next. In 1st Workshop on Evaluating Vector Space Representations for NLP, pages 24–29, 2016.
[Ng et al., 1999] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
[Pérez y Pérez and Sharples, 2001] Rafael Pérez y Pérez and Mike Sharples. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental and Theoretical Artificial Intelligence, 13:119–139, 2001.
[Porteous and Cavazza, 2009] Julie Porteous and Marc Cavazza. Controlling narrative generation with planning trajectories: the role of constraints. In ICIDS, pages 234–245. Springer, 2009.
[Purdy et al., 2018] Christopher Purdy, Xinyu Wang, Larry He, and Mark Riedl. Predicting story quality with quantitative measures. In AIIDE, 2018.
[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI, 2019.
[Riedl and Young, 2010] Mark O. Riedl and R. Michael Young. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research, 39:217–268, 2010.
[Roemmele and Gordon, 2015] Melissa Roemmele and Andrew S. Gordon. Creative Help: A story writing assistant. In ICIDS, pages 81–92, 2015.
[Schuler, 2005] Karin Kipper Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD thesis, University of Pennsylvania, 2005.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NeurIPS, pages 3104–3112, 2014.
[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[Swanson and Gordon, 2012] Reid Swanson and Andrew Gordon. Say Anything: Using textual case-based reasoning to enable open-domain interactive storytelling. ACM TiiS, 2(3):16:1–16:35, 2012.
[Ware and Young, 2011] Stephen Ware and R. Michael Young. CPOCL: A Narrative Planner Supporting Conflict. In AIIDE, pages 97–102, 2011.
[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.
[Yao et al., 2019] Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. Plan-and-write: Towards better automatic storytelling. In AAAI, 2019.