Survey · Open Access

A Survey on Automatic Generation of Figurative Language: From Rule-based Systems to Large Language Models

Published: 14 May 2024


Abstract

Figurative language generation (FLG) is the task of reformulating a given text to include a desired figure of speech, such as hyperbole, simile, and several others, while still being faithful to the original context. This is a fundamental yet challenging task in Natural Language Processing (NLP), which has recently received increased attention due to the promising performance brought by pre-trained language models. Our survey provides a systematic overview of the development of FLG, mostly in English, starting with a description of some common figures of speech, their corresponding generation tasks, and datasets. We then focus on various modelling approaches and assessment strategies, leading us to discuss some challenges in this field and to suggest potential directions for future research. To the best of our knowledge, this is the first survey that summarizes the progress of FLG including the most recent developments in NLP. We also organize corresponding resources, e.g., article lists and datasets, and make them accessible in an open repository. We hope this survey can help researchers in NLP and related fields to easily track the academic frontier, providing them with a landscape and a roadmap of this area.


1 INTRODUCTION

Figurative language is a cover term in language studies that includes a variety of figures of speech, such as hyperbole and metaphor, each of which can be used to accomplish a constellation of communicative goals [107]. Figurative expressions can be sentences or even single words; they can make writing or speech more interesting and captivating, or help convey abstract concepts that are otherwise difficult to visualize. For instance, in the sentence pair “He is very happy” and “He is floating on cloud nine,” the former simply expresses a state of happiness, while the latter paints a vivid picture, helping people to better understand the intensity of the emotion being conveyed. Understanding and generation of figurative language are particularly challenging due to its implicit linguistic nature and are considered a “bottleneck” for automatic language processing [18, 114].

Natural Language Generation (NLG) is a fundamental yet challenging branch of Natural Language Processing (NLP) [36], which aims at automatically generating high-quality, coherent, and understandable human language from various forms of data, such as text [120], images [5], and structured data [9, 25]. Figurative language generation (FLG), also known as creative text generation, is a text-to-text generation task whose goal is to reformulate a given text into a different one containing the desired figure of speech, while still being faithful to the original context (see Figure 1, where the literal snippet “I was very nervous when I learned the result was about to be announced” gets rewritten into a sentence containing an idiom: “My heart skipped a beat when I learned the result was about to be announced.”) This usually requires adding some additional information associated with the original context to trigger and realise the desired figure of speech, while at the same time preserving a large portion of the input sentence [19, 105, 146].

Fig. 1.

Fig. 1. An illustration for figurative language generation from literal text, along with the required data resources and the corresponding applications.

As with other NLG tasks, especially before the advent of large language models (LLMs, e.g., GPT-3 [16]), researchers typically relied on knowledge-based approaches for FLG. These approaches involve a complex process, such as using prior linguistic knowledge to design rules that can capture textual patterns or to construct relevant knowledge resources. However, the obtained results are often unsatisfactory, for example lacking the necessary flexibility and linguistic subtlety to instantiate figurative language [115, 140]. With the development of deep learning [41], neural networks have achieved impressive performance on various NLP tasks and have become mainstream methods for NLG [67, 89], especially the sequence-to-sequence (seq2seq) framework [120] in the context of rewriting. However, the majority of NLP research has concentrated on literal language, despite the ubiquity of figurative language in human language [18]. Consequently, the advancements in FLG may not have been as substantial as those in other text-to-text generation tasks, such as machine translation [55], question answering [131], and text summarization [24]. The main reason is that most of the earlier works, including the workshop on figurative language processing [11, 39, 54], mainly focused on figurative language detection rather than generation. Moreover, the NLP community lacks standard benchmark datasets for FLG, which are a crucial point to advance a research field.

Recent years have witnessed the emergence and rapid advancement of pre-trained language models (PLMs), in particular the commonly used Transformer-based [128] models [16, 27, 65, 99, 101], which have become the most popular method in NLP and can achieve state-of-the-art performance in virtually all tasks. In this paradigm, neural models are pre-trained on large-scale unlabeled text collections in a self-supervised fashion to learn fundamental language representations. Then, the model can be fine-tuned for downstream tasks with task-specific training objectives, thus avoiding the need to train a new model from scratch. Recently, researchers have applied PLMs to FLG, including the generation of metaphor [1, 117], hyperbole [123, 145], simile [21, 143], irony [19, 148], idiom [146], pun [77, 141], analogy [12], and personification [76] (see Section 2.2 for more details on these figures of speech). Given the rapid development of FLG, a comprehensive literature review can help more NLP researchers and practitioners to participate and easily track the academic frontier in this area, especially making it more accessible for interested beginners to enter this field. Additionally, achieving the goal of enabling machines to write in different figures of speech has a wide range of applications, which are mainly divided into two categories: aiding in various downstream NLP tasks [31] and supporting the development of application products such as educational systems [21].

As far as we know, this is the first survey that presents a comprehensive review of FLG. It first introduces the background of different figures of speech, and surveys related generation tasks as well as their corresponding benchmark datasets. We find that existing works mainly focus on modelling single figurative forms, i.e., rewriting a literal sentence into one containing a specific figure of speech [21, 72, 95, 115, 124]. Then, we review the modelling approaches, from traditional to state-of-the-art, and divide them into two categories: knowledge-based and neural-based approaches. Regarding knowledge-based approaches, they usually require linguistic knowledge to design rules and templates to fit figurative patterns [1, 121], or to construct relevant resources for modelling [45, 91]. Neural approaches, instead, require less linguistic knowledge and complex linguistic processing but need large-scale labelled data (e.g., parallel literal-hyperbole pairs) and computational resources for training neural models. We also review the various assessment strategies for FLG, which include both human-based and automatic evaluation, discussing their corresponding pros and cons. Finally, based on our review and the analysis of current trends, we suggest several challenging problems and prospective research directions.

As shown in Figure 2, the main contributions of this survey can be summarized as follows:

Fig. 2.

Fig. 2. Overview of the survey.

Systematic Survey We offer the first comprehensive review of FLG, which includes background information for common figures of speech, benchmark datasets, modelling approaches, and evaluation methods.

Available Resources Based on the literature review, we collect and make accessible all available resources described in this survey (including corpora and paper lists), so as to foster new work in this field.1

Future Directions We discuss current progress and the corresponding limitations, thereby offering an outline of possible future research directions in FLG. We also suggest some potential solutions for a number of open issues.

The rest of the survey is organized as follows: In Section 2, we introduce background information for several figures of speech, along with some related FLG tasks and datasets. In Sections 3 and 4, we discuss knowledge-based and neural-based approaches. In Section 5, we describe evaluation strategies, including both automatic and human-based methods. In Section 6, we bring forward the main challenges and limitations of FLG, and suggest improvements for future work in this area. We conclude the article with a short reflection in Section 7.


2 BACKGROUND

We first provide an overview of FLG from a conditional distribution perspective, then introduce some background information for several figures of speech commonly studied in NLP, as well as their corresponding datasets in the deep learning era. We also discuss two closely related tasks, namely figurative language detection and text style transfer.

2.1 Figurative Language Generation

Figurative language generation refers to the task of reformulating a given text into one containing the desired figure of speech while still being faithful to the original content. Table 1 shows examples of rewriting given literal texts into the target figurative form.

Table 1.

| Figure of speech | Sentences |
|---|---|
| Literal | From the day you were born, you have been invincible. |
| Simile | From the day you were born, you have been like a well-seasoned superhero. |
| Literal | Life has good and bad moments, often in fast succession. |
| Metaphor | Life is a roller coaster with lots of ups and downs. |
| Literal | My new laptop is very thin and light compared to the bulky old one. |
| Hyperbole | My new laptop looks like a piece of paper compared to the bulky old one. |
| Literal | We were all anxious as we waited for the surgeon’s report. |
| Idiom | We were all on pins and needles as we waited for the surgeon’s report. |
| Literal | I hate when people don’t think the rules apply to them. |
| Irony (Sarcasm) | I love when people think the rules don’t apply to them. |
| Literal | Everyone should learn to be satisfied with what they have. |
| Pun | Everyone should learn to be satisfied with the state they are in. |
| Literal | The birds chirping in spring. |
| Personification | The birds welcome the spring by singing melodious tunes. |

Table 1. Examples of Figurative Language Generation from Literal Texts

For a source sentence \({\bf x}=\lbrace x_{1}, \cdots , x_{n}\rbrace\) of length \(n\), the goal is to generate a sentence \({\bf y}=\lbrace y_{1}, \cdots , y_{m}\rbrace\) of length \(m\) with a specific figurative form \(f\). Formally, the conditional probability of the target \(\hat{{\bf y}}\), given the observable \({\bf x}\) and the form \(f\), can be factorized as (1) \(\begin{equation} p_{\theta }(\hat{{\bf y}} | {\bf x}, f) = \prod _{i=1}^{m} p_{\theta }(\hat{y}_{i}|\hat{y}_{1:i-1}, {\bf x}, f) . \end{equation}\) The target form \(f\) may vary according to the target figure of speech. To be considered successful, the generated sentence \(\hat{{\bf y}}\) must satisfy three criteria: (i) it must contain the target figurative form \(f\); (ii) it must preserve the original context; and (iii) it must exhibit appropriate natural language characteristics such as fluency, readability, and coherence. It is worth noting that there are two possible ways of generation: (i) the generated sentence only contains the target figurative form while the original form is removed; or (ii) the generated sentence contains both the source and target figurative forms. For example, the hyperbolic sentence “I am not happy that he urged me to finish all the hardest tasks in the world” can be rewritten as “Glad he urged me to finish all the hardest tasks in the world,” which contains both sarcasm and hyperbole.
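The autoregressive factorization in Equation (1) can be made concrete with a toy example. The sketch below computes the probability of a short sequence as the product of per-token conditionals; in a real system these conditionals would come from a neural model conditioned on the source sentence \({\bf x}\) and a control signal for the target form \(f\), whereas the probability table and token names here are purely illustrative.

```python
import math

# Toy conditional distributions p(y_i | y_{1:i-1}); a real model would also
# condition on the source sentence x and the target figurative form f.
# All tokens and probabilities below are invented for illustration.
cond_probs = {
    ("<s>",): {"my": 0.6, "the": 0.4},
    ("<s>", "my"): {"heart": 0.7, "hands": 0.3},
    ("<s>", "my", "heart"): {"skipped": 0.5, "raced": 0.5},
}

def sequence_log_prob(tokens):
    """Sum of log p(y_i | y_{1:i-1}) -- the log of the product in Eq. (1)."""
    logp = 0.0
    prefix = ("<s>",)
    for tok in tokens:
        logp += math.log(cond_probs[prefix][tok])
        prefix = prefix + (tok,)
    return logp

p = math.exp(sequence_log_prob(["my", "heart", "skipped"]))
print(round(p, 3))  # 0.6 * 0.7 * 0.5 = 0.21
```

Working in log space, as above, avoids numerical underflow when the product runs over many tokens.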

2.2 Figures of Speech and Datasets

Table 2 shows the related FLG tasks and their corresponding datasets. The FLG tasks mainly focus on rewriting literal sentences into sentences with specific figures of speech, plus a few other tasks, such as generating a follow-up sentence that is coherent with the narrative and consistent with the figurative expression. As far as language is concerned, most datasets are in English, with a few in Chinese and German. Overall, each figure of speech contains at least one related task and one dataset, which are detailed below.

Table 2.

| Figure of speech | Task | Dataset | Train | Valid | Test | Lang | Para |
|---|---|---|---|---|---|---|---|
| Simile | Literal↔Simile | Chakrabarty et al. [21] | 82,687 | 5,145 | 150 | en | ✓ |
| | Simile↔Context with simile | Zhang et al. [143] | 5.4M | 2,500 | 2,500 | zh | ✓ |
| | Narrative+Simile→Text | Chakrabarty et al. [18] | 3,100 | 376 | 1,520 | en | ✓ |
| | Concept→Analogy + Explanation | InstructGPT [12] | - | - | 148 | en | ✓ |
| | Simile + Literal→Explanation | FLUTE [22] | 1,500 | - | - | en | ✓ |
| Metaphor | Literal↔Metaphor | Stowe et al. [115] | 260k | 15,833 | 250 | en | ✓ |
| | | Chakrabarty et al. [23] | 90k | 3,498 | 150 | en | ✓ |
| | | Stowe et al. [117] | 248k | - | 150 | en | ✓ |
| | | Mohammad et al. [85] | - | - | 171 | en | ✓ |
| | | CMC [69] | 3,554/2,703 | - | - | en | ✗ |
| | Metaphor + Literal→Explanation | FLUTE [22] | 1,500 | - | - | en | ✓ |
| Hyperbole | Literal↔Hyperbole | HYPO [124] | 709 | - | - | en | ✓ |
| | | HYPO-cn [56] | 2,082/2,680 | - | - | zh | ✗ |
| | | HYPO-red [123] | 2,163/1,167 | - | - | en | ✗ |
| | | HYPO-XL [145] | -/17,862 | - | - | en | ✗ |
| Idiom | Idiom→Literal | Liu and Hwa [72] | 88 | - | 84 | en | ✓ |
| | Idiom (en)↔Literal (de) | Fadaee et al. [31] | 1,998 | - | 1,500 | en/de | ✓ |
| | Idiom (de)↔Literal (en) | | 1,848 | - | 1,500 | de/en | ✓ |
| | Literal↔Idiom | PIE [146] | 3,784 | 876 | 876 | en | ✓ |
| | Narrative+Idiom→Text | Chakrabarty et al. [18] | 3,204 | 355 | 1,542 | en | ✓ |
| Irony (Sarcasm) | Literal↔Irony (Sarcasm) | Peled and Reichart [95] | 2,400 | 300 | 300 | en | ✓ |
| | | Mishra et al. [83] | - | - | 203 | en | ✓ |
| | | Zhu et al. [148] | 112k/262k | - | - | en | ✗ |
| | | Ghosh et al. [40] | 4,762 | - | - | en | ✓ |
| | Sarcasm + Literal→Explanation | FLUTE [22] | 3,356 | - | - | en | ✓ |
| | Image→Sarcastic Description | SentiCap [79] | - | - | 503 | en | ✓ |
| Pun | Word Senses→Pun | SemEval task 7 [82] | 1,274 | - | - | en | ✓ |
| | Context→Pun | Sun et al. [119] | 2,753 | - | - | en | ✓ |
| Personification | Topic→Personification | Liu et al. [76] | 67,441 | 3,747 | 3,747 | zh | ✓ |

Notes: (i) Lang = Language, Para = Parallel training data; (ii) all datasets reported here are mainly used to train neural network based models.

Table 2. List of Common Figures of Speech and their Related Datasets


Simile It is a figure of speech that compares two different things by saying that one thing is like another, so it often uses comparison expressions such as like, as, and than. For instance, the sentence “From the day you were born, you have been invincible” could be rephrased as “From the day you were born, you have been like a well-seasoned superhero.” Here, using the simile “like a well-seasoned superhero” makes the expression and description more emphatic (similes are often used in literature and poetry to spark the reader’s imagination [94]).

To create a simile dataset, Chakrabarty et al. [21] collected sentences from the web containing the phrase like a. Then, the authors employed the generative model of the knowledge graph COMET [13] to transform English similes into their corresponding literal sentences, thereby automatically creating literal-simile pairs for supervised training. Zhang et al. [143] created a large-scale simile dataset in Chinese, including similes and corresponding contexts containing similes, from online free-access fiction. Chakrabarty et al. [18] released a simile-related dataset, which aims at generating a plausible next sentence that is coherent with the context and consistent with the meaning of the simile that follows the given narratives. More recently, Bhavya et al. [12] proposed the task of generating (i) a source concept analogous to a given target concept, and (ii) an explanation of the similarity between the target and the source concept. Given that this dataset contains a large number of “like” expressions, we consider it in the simile category. Similarly, Chakrabarty et al. [22] introduced FLUTE, a dataset containing samples of literal and figurative (simile, metaphor, and sarcasm) sentences along with their natural language explanations.

Metaphor This is the most common figure of speech [58], which refers to one concept, usually more abstract, by means of another one, usually more concrete. Unlike similes that explicitly say that one thing is like another, a metaphor conveys that one thing is the same as another at some level. The literal sentence “Life has good and bad moments, often in fast succession,” for instance, can be reformulated with a metaphor as “Life is a roller coaster.” Here, the verb be (i.e., is) is used instead of like, and offers a vivid and concrete way (i.e., a roller coaster) to visualise life. Metaphor is not only a form used to add a creative flavour to the text, but can also be utilized to achieve textual goals that may be difficult to obtain through literal expressions [50].

Veale [129] states that computational metaphor generation can be used for many applications, such as entertainment, education, and even pure whimsy. Of all the figures of speech, metaphor receives the most attention from researchers. However, existing metaphor datasets for automatic generation are almost all in English. For instance, Mohammad et al. [85] released a metaphor evaluation dataset in English. They first extracted metaphorical and literal verb senses from WordNet [81], then created datasets of pairs of their usages by asking crowd workers to annotate the usage of verb occurrences as metaphorical or literal. Stowe et al. [117] and Chakrabarty et al. [23] built two large-scale literal-metaphor datasets by exploiting the Gutenberg Poetry corpus [49]. Specifically, they first trained a word-level metaphor detection classifier to identify and mask the metaphoric verbs in the sentences, then leveraged a pre-trained language model to predict the masked word from the unmasked context. Similarly, Stowe et al. [115] used a mask-then-fill method to create a literal-metaphor dataset. They built a literal vocabulary and a metaphoric vocabulary using lexical resources (MetaNet [28] and FrameNet [6]), which were then used to mask the words of given sentences and select the best-fit words in the filling process.
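The mask-then-fill recipe used in these works can be sketched in a few lines. The real systems derive their vocabularies from MetaNet/FrameNet and pick the best-fit filler with a pre-trained language model; the word lists below are invented stand-ins for illustration only.

```python
# Toy sketch of mask-then-fill: mask a verb flagged as literal, then fill
# the slot with a metaphoric candidate. Real systems score candidates with
# a pre-trained language model; this mapping is a hypothetical stand-in.
LITERAL_TO_METAPHORIC = {
    "increased": "soared",
    "fell": "plunged",
}

def mask_then_fill(sentence):
    tokens = sentence.split()
    masked = ["[MASK]" if t in LITERAL_TO_METAPHORIC else t for t in tokens]
    filled = [LITERAL_TO_METAPHORIC.get(t, t) for t in tokens]
    return " ".join(masked), " ".join(filled)

masked, filled = mask_then_fill("prices increased sharply")
print(masked)  # prices [MASK] sharply
print(filled)  # prices soared sharply
```

The masked/filled pair is exactly the kind of literal-metaphor training pair these dataset-construction pipelines produce at scale.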

Hyperbole It is a figurative form in which an expression is deliberately exaggerated and exceeds the credible limits of fact in the given context. It is the second most common form, after metaphor [58, 87], and is generally used for two main expressive purposes: (i) emphasizing an argument, e.g., the hyperbole “My son takes years to finish his homework,” where the time taken to finish homework is not years: it is just a way to make the expression more impactful; and (ii) comparing or highlighting the differences between two things, e.g., the hyperbole “My new laptop looks like a piece of paper compared to the bulky old one,” where the aim is to highlight the difference in weight between the two laptops by comparing one with a piece of paper in an exaggerated way.

There are four datasets related to the hyperbole generation task. HYPO [124] is an English parallel dataset, which contains 709 hyperbolic sentences, each with a manually rewritten non-hyperbolic counterpart. HYPO-red [123] and HYPO-XL [145] are both English non-parallel datasets, the former is created by asking annotators to label the exaggerated attributes of sentences, while the latter is created in a semi-supervised way that uses a binary classifier to predict possible hyperbolic sentences. HYPO-cn [56] is a manually created Chinese parallel dataset.

Idiom An idiom is a figure of speech in which a group of words has an established meaning, acquired over a long period of usage, that in most cases cannot be deduced directly from the individual words in the expression, i.e., it is not compositional [87]. For example, the sentence “We were all on pins and needles as we waited for the surgeon’s report” indicates people were all tense rather than literally sitting on pins and needles. Research has found that idioms are usually problematic for (second) language learners [72, 136], because they are semantically opaque and very much language- and culture-dependent. Similarly, idiom understanding and generation is also a challenging problem in NLP tasks such as sentiment analysis and machine translation [57, 132, 136]. More concretely, Salton et al. [110] found that a machine translation system may only perform half as well on texts containing idioms compared to those without idiomatic expressions. Volk and Weber [132] even argued that translating idioms is one of the most difficult tasks for both human and automatic translation. Therefore, systems for idiom translation and generation could be used to assist in understanding and creative writing, especially for beginners and second language learners.

For the task of idiom generation, Fadaee et al. [31] constructed a dataset for idiom translation based on WMT’s German-English pairs from 2008 to 2016. Specifically, they used dict.cc, an online dictionary containing idiomatic and literal phrases, to select sentence pairs from the dataset whose source sentences contain idioms. Liu and Hwa [72] proposed a task of replacing idioms with literal English. To build a dataset, they randomly selected Tweets that contain idioms and usage examples from TheFreeDictionary.com. Finally, they presented these sentences along with each idiom’s definition to native speakers of English and asked them to manually shorten the definition. Zhou et al. [146] released a parallel dataset containing their manually created textual counterparts for idiomatic texts from the existing corpus MAG-PIE [42]. Chakrabarty et al. [18] released a manually created parallel English dataset for generating a target sentence that is coherent with the context and consistent with the meaning of the idiom. Recently, Stowe et al. [118] introduced a new natural language inference task for figurative language and created an English dataset consisting of paired sentences spanning idioms and metaphors, which could be a valuable resource for FLG.

Irony (Sarcasm) This is a figure of speech that can make the literal sentiment of the text different from the implied intent, often with an element of hostility, irritation, or just fun. Eric Partridge writes in Usage and Abusage: “Irony consists in stating the contrary of what is meant.” [93] Therefore, it usually takes longer for people to understand irony than to understand literal expressions [111]. For instance, the sentence “I hate when people don’t think the rules apply to them” can also be expressed in an ironic way: “I love when people think the rules don’t apply to them.” It is worth briefly unpacking the relationship between irony and sarcasm, the latter often considered a “type” of irony [14, 59, 64]. While irony is neither necessarily negative (though the intended sentiment mostly is) nor necessarily intentional, sarcasm is generally an intentional and negative form of verbal communication used as an alternative to direct criticism.

In terms of datasets used in generation, Peled and Reichart [95] built a parallel dataset consisting of 3,000 sarcastic tweets, each augmented with five corresponding non-sarcastic ones rewritten by crowd workers. Mishra et al. [83] created a dataset for evaluation that contains 203 sentence pairs in which sarcastic utterances were manually translated into literal versions by linguists. Ghosh et al. [40] used crowdsourcing to create a parallel dataset containing explicit interpretations of verbal irony. Zhu et al. [148] trained an irony classifier to partition sentences into ironic and non-ironic, automatically creating a large-scale non-parallel tweet dataset. Ruan et al. [109] recently introduced a multimodal sarcasm generation task, i.e., generating a sarcastic description for a given image, using the testing subset of 503 images in the dataset SentiCap [79].

Pun A pun, also known as paronomasia, is a form of wordplay in which a word suggests multiple meanings by exploiting polysemy, homonymy, or phonological similarity to other words, with the aim of yielding a humorous expression. In the sentence “Everyone should learn to be satisfied with the state they are in,” for example, “state” can refer to an organized political community forming part of a country, or to the mode or condition of someone, triggering an ambiguity that in specific contexts could be perceived as appropriately amusing or clever.

Miller et al. [82] released a SemEval-2017 dataset for the detection of English puns. It contains human-written puns annotated with pun words and alternative words, and has been used as a benchmark in the pun generation task to test models’ performance [77, 139]. However, most pun pairs only occur once in the dataset, while one given context could have been compatible with many other pun pairs. Based on the SemEval-2017 dataset, Sun et al. [119] constructed a new dataset containing 4,551 tuples of context keywords and an associated pun pair, each labelled with whether they are compatible for composing a pun or not, along with 2,753 human-written puns for the compatible pairs. For example, the tuple context: construction workers; pun pair: stair/stare is compatible, allowing for the creation of the pun “Two construction workers had a staring contest.”

Personification It is a figure of speech similar to the anthropomorphic metaphor in literature or art, which involves attributing human characteristics to non-human entities. This allows writers to create life and motion within inanimate objects, animals, and even abstractions by assigning them recognizable human behaviours and emotions. In other words, personification can provide readers with vivid human-like characteristics, ultimately making the expression more concrete and empathic. For example, the sentence “The birds welcomed the spring by singing melodious tunes” gives the birds the human behaviour of welcoming spring.

For the task of personification generation, Liu et al. [76] created a parallel personification dataset, which is used for generating modern Chinese poetry for the given topics, while controlling for the use of metaphor and personification.

2.3 Related Tasks

Figurative language generation, as a subfield of figurative language processing, involves the creation or transformation of figures of speech when text is rewritten, where the generated text should contain specific figurative forms. A good number of NLP tasks could be conceived as related to FLG, namely anything which involves automatic language generation and non-literal meaning, thus including machine translation, summarisation, the automatic treatment of creativity, and so on. However, for a closer peek into the phenomenon at hand, here we mainly discuss the background of two tasks that we deem as most related to and relevant for FLG: one is figurative language detection which is related to figurative language itself, and the other is style transfer which is related to stylistic text rewriting.

Figurative Language Detection This task typically involves two levels of detection: word-level detection, which identifies the exact words within a sentence that trigger the figurative attribute, and sentence-level detection, which determines whether a sentence is literal or non-literal. It is important to note that word-level detection is a crucial component in retrieval-based FLG models, as these methods typically require first identifying trigger words in sentences, followed by other operations such as replacement and generation. Over the past two decades, numerous approaches, mostly based on machine learning algorithms, have been proposed to address the problem of figurative language detection. With traditional machine learning methods, researchers must define and identify linguistic features, which are then fed into the model to learn task patterns, such as the recognition of metonymy [86], hyperbole [124], metaphor [10, 78, 125], idiom [68], and irony [108]. Neural networks, instead, can achieve impressive results in figurative language detection without the need for feature engineering, using, for example, convolutional neural networks [29, 102], LSTMs [33, 138], and the pre-trained models BERT [26, 127] and mT5 [63].
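Word-level detection can be illustrated with the simplest possible baseline: spotting the comparison cues that often trigger a simile. The cue list below is a hypothetical toy; real systems use trained taggers precisely because such cues both miss figurative uses and falsely match literal ones (e.g., "like apples").

```python
import re

# Minimal cue-based simile spotter. A crude baseline only: learned
# detectors (LSTM, BERT) handle the cases these surface cues miss or
# falsely match. The cue inventory here is illustrative, not exhaustive.
SIMILE_CUES = re.compile(r"\b(like an?|as \w+ as)\b", re.IGNORECASE)

def find_simile_cues(sentence):
    """Return the comparison markers found in the sentence, in order."""
    return [m.group(0) for m in SIMILE_CUES.finditer(sentence)]

print(find_simile_cues("You have been like a well-seasoned superhero."))
# ['like a']
print(find_simile_cues("He is as brave as a lion."))
# ['as brave as', 'a lion'] is NOT returned -- only the cue span matches.
```

In a retrieval-based FLG pipeline, the spans returned by such a detector mark where replacement or rewriting should occur.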

Text Style Transfer Text style transfer is the task of transforming a text of one style into another while preserving the original content. For example, given the informal sentence “different from what I’ve seen,” we can turn it into the formal counterpart “It is different from what I have seen” for the task defined as formality transfer [104]. Text style transfer is very similar to FLG as they both aim at achieving the generation of text with desired attributes, the former being a specific style while the latter a figurative form. However, there are also several differences between the two tasks in terms of the types of text changes involved. For instance, text style transfer may require modifying multiple parts of a sentence simultaneously, such as capitalization at the beginning, punctuation at the end, and some phrasing in the middle. In contrast, FLG often involves rewriting only certain expressions to trigger the figure of speech, while retaining other large parts of the sentence unchanged [146]. Most importantly, in text style transfer the original style must be completely changed, whereas in FLG the original figurative form could be still present in the generated sentence (see the example in Section 2.1).


3 KNOWLEDGE-BASED APPROACHES

In earlier works, as with other NLG tasks, FLG research mainly focused on knowledge-based approaches. Generally, NLP researchers need to master linguistic knowledge to design rules and templates to fit figurative language patterns, or to construct the corresponding knowledge resources. We roughly divide knowledge-based approaches into two sub-categories: (i) rule (template) and (ii) knowledge resources. Their corresponding classic works, advantages (pros) and disadvantages (cons) are presented in Table 3 (first block).

Table 3.

| Type | Subcategory | Pros | Cons | References |
|---|---|---|---|---|
| Knowledge-based | Rule and template | Intuitive and simple | Tailored to a specific form; poor flexibility and diversity | [1, 51, 121, 129] |
| | Knowledge resource | Exploiting knowledge resources; high interpretability | Prior linguistic knowledge; desired resources must be constructed | [47, 96, 97, 130], [38, 72, 113, 126], [43, 45, 91, 117] |
| Neural-based | Training from scratch | Straightforward; can combine retrieval approaches | Large-scale training data; large computational resources | [31, 76, 95], [44, 69, 139, 140], [77, 141, 146, 148], [83] |
| | Fine-tuning PLMs | Straightforward; pre-trained knowledge; state-of-the-art results | Large computational resources | [21, 143, 145, 147], [23, 115, 117], [18, 19, 84, 123], [61] |
| | Prompt learning | Straightforward; a few/no labelled samples | Prompt engineering; large computational resources | [12, 18, 84, 106] |

We list their subcategories, along with corresponding pros and cons, and some classic references.

Table 3. A Summary of the Two Main Approaches to Figurative Language Generation

  • We list their subcategories, along with corresponding pros and cons, and some classic references.

3.1 Rule and Template

Rule- and template-based generation is a simple yet efficient method consisting of a set of pre-defined modules that represent cues and indicators of figurative forms. These types of systems often do not require a training process and can generate the desired figurative language quickly and efficiently once the templates are created.

A representative method is the template “A (e.g., vehicle) is like B (e.g., topic)”, used to acquire relations between A and B that can then serve for metaphor generation [1, 121]. These computational systems generate simple metaphors based on the probabilistic relationships between words in textual data. Similarly, Veale [129] used XYZ comparisons with the template form “X is the Y of Z” to generate creative and metaphoric tweets. In this template, Y refers to a proper-named individual that represents an entire class of people in a figurative sense; for example, in the sentence “the potato is the Tom Hanks of the vegetable world,” Hanks (Y) stands for the class of versatile performers. The authors collected a vast number of XYZ tuples from the Internet, using the capitalization of the Y field as a cue to find those that pivot around people. Instead of using a single template, Joshi et al. [51] designed a system consisting of eight rule-based modules, each targeting a certain type of sarcastic expression, to generate sarcastic responses for given user inputs.
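To make the mechanics of this family of methods concrete, the sketch below fills the XYZ template from a tiny knowledge table mapping a topic to a paradigm individual and a domain. The template string and the knowledge table are invented placeholders for illustration, not the actual resources of the cited systems.

```python
# Toy sketch of template-based figurative generation in the spirit of the
# "X is the Y of Z" pattern. The knowledge table is an invented placeholder.
TEMPLATE = "{x} is the {y} of {z}"

# toy mapping: topic X -> (paradigm individual Y, domain Z)
KNOWLEDGE = {
    "the potato": ("Tom Hanks", "the vegetable world"),
    "the cello": ("Morgan Freeman", "the orchestra"),
}

def generate_xyz(x: str) -> str:
    """Fill the XYZ template for a known topic."""
    y, z = KNOWLEDGE[x]
    return TEMPLATE.format(x=x, y=y, z=z)
```

For instance, `generate_xyz("the potato")` reproduces the Tom Hanks example above; the transparency of the method is evident, but so is its rigidity, since every new pattern needs a new template and table entries.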

In summary, rule- and template-based FLG can be customized to specific figures of speech, making the whole process highly transparent. However, this method is extremely inflexible, as humans must manually design rules and templates for each figure of speech separately. Additionally, it is primarily geared toward the most common figurative patterns, thus lacking diversity, and certainly does not take full advantage of empirical knowledge of the language.

3.2 Knowledge Resources

This type of approach often involves the operations of search, identification, replacement, and mapping done by utilizing existing lexical resources or self-built knowledge resources.

WordNet One widely used lexical resource for FLG is WordNet [81]. Based on two specific templates, “X is the Y of Z” and “X is as Y as Z”, Pereira et al. [96] employed WordNet and structure-mapping algorithms to enhance texts with various figures of speech. With WordNet, the lexical realization process is enriched by lexical selection, where a decision is made between lexical alternatives representing the same content. Petrović and Matthews [97] proposed an unsupervised system to generate “I like my X like I like my Y, Z” jokes. The X, Y, and Z here are variables to be filled in using WordNet and Google n-gram data [2]. Veale and Hao [130] presented a WordNet-based framework, which allows for a concise and adaptable representation of concepts for both metaphor interpretation and generation. Hong and Ong [47] developed a system that utilizes WordNet and ConceptNet [74] to automatically extract word relationships in puns, such as synonyms, hypernyms, sounds-like, and semantic relationships, and stored them in templates. Using these templates and linguistic resources, the system can generate puns starting from a keyword input by the user. Valitutti et al. [126] proposed a lexical replacement approach for generating humour that incorporates a simple form of punning: they created word substitution blocks using WordNet to introduce incongruity and taboo words, thereby attempting to elicit a sense of humour. Shutova [113] proposed an approach for metaphor interpretation, which generates literal paraphrases for given metaphorical expressions. 
Concretely, for a metaphorical expression, the system (i) generates a list of possible paraphrases that might appear in the same context based on a large corpus; (ii) ranks these generated paraphrases according to their likelihood derived from the corpus; (iii) discriminates between literal and figurative paraphrases by detecting selectional preference violations; and (iv) disambiguates the sense of the literal paraphrases using WordNet’s inventory of senses.
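The core of steps (i) and (ii), generating candidate paraphrases and ranking them by corpus likelihood, can be sketched as below. The candidate table and frequency scores are invented stand-ins for WordNet senses and real corpus statistics, used only to show the generate-then-rank control flow.

```python
# Toy sketch of generate-then-rank metaphor interpretation: candidate literal
# paraphrases for a metaphoric verb are ranked by (invented) corpus likelihood
# in the context "~ his novels". Real systems derive both from large corpora.
CANDIDATES = {"devoured": ["enjoyed", "ate", "consumed"]}
CORPUS_FREQ = {"enjoyed": 0.42, "ate": 0.05, "consumed": 0.08}

def literal_paraphrase(metaphoric_verb: str) -> str:
    """Return the candidate paraphrase ranked highest by toy corpus likelihood."""
    return max(CANDIDATES[metaphoric_verb],
               key=lambda w: CORPUS_FREQ.get(w, 0.0))
```

Here `literal_paraphrase("devoured")` selects “enjoyed”, mirroring the “she devoured (enjoyed) his novels” example discussed later in this survey; the subsequent filtering and sense-disambiguation steps would then prune non-literal candidates.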

Other Resources Researchers have also employed other linguistic resources to develop FLG systems. Liu and Hwa [72] proposed a retrieval-then-replacement approach to translate idioms into literal paraphrases: they first used TheFreeDictionary.com to retrieve a replacement phrase related to the idiom, and then replaced the idiom with the retrieved phrase through appropriate grammatical and referential transformations. Gero and Chilton [38] used the open-source knowledge graph ConceptNet and a modified Word Mover’s Distance algorithm to develop a collaborative metaphorical writing system. This system provides users with a large, ranked list of suggested metaphorical connections based on their input, and generates content coherent with the given context, thereby helping users to produce more diverse texts. Recently, Stowe et al. [117] proposed an unsupervised model for metaphor generation based on frame embeddings [6]. The authors employed FrameNet to learn joint embedded representations for domains and lexical entities, which are used to represent conceptual domain mappings and can be applied to generate a word in the target domain for given input words in the source domain, a typical mechanism for metaphor creation.

Creating Resources NLP researchers have also created dedicated resources for FLG, in addition to leveraging existing resources. For instance, Hervás et al. [45] built the mappings between the concepts of source and target domains in metaphors, which are used to generate a set of possible metaphorical references for each concept. They then assessed the clarity and suitability of these metaphorical references to filter the inappropriate ones. The remaining metaphors are used to refer to the concept in a given context only if there is no loss of meaning or unnecessary ambiguity. Since some common sense knowledge can be approximated by parsing a sentence and abstracting from its syntactic structure, Ovchinnikova et al. [91] adopted this approach to design proposition databases, which are used for generating conceptual metaphors and automatically finding the corresponding linguistic expressions in corpora. To reduce the manual workload, Harmon [43] developed a web-driven approach to form a preliminary knowledge base of nouns and their characteristics, and employed it to generate metaphors and similes with various properties like clarity and novelty.

In summary, these kinds of approaches can leverage available language knowledge resources such as WordNet and FrameNet to build FLG systems. The language generation process of such systems is usually interpretable, but they still lack flexibility since new resources, or lexical mappings and constraint conditions, must be designed by experts, usually ad-hoc.


4 NEURAL-BASED APPROACHES

In recent years, neural networks have become the main methods for language generation tasks thanks to the growth of computational power and the availability of large-scale data. As depicted in Figure 3, neural language generation models can be broadly classified into two categories: decoder-only and sequence-to-sequence (seq2seq) architectures. In the following sections, we will provide a brief overview of each.


Fig. 3. Overview for decoder-only and seq2seq architectures.

The GPT family [16, 98, 99] is representative of decoder-only models, which consist of the decoder component only. These models predict future behaviour based on past observations, i.e., the next generated token is based on what has been seen so far in the sequence. In short, this can be seen as iteratively predicting the probability of the next token, formally described as (2) \(\begin{equation} p(x_{1}, \cdots , x_{m}) = \prod _{i=1}^{m} p(x_{i} | x_{1}, \cdots , x_{i-1}) , \end{equation}\) where \(x_{1:i-1}\) is the sequence of tokens preceding the \(i\)th time step, which is used to predict the token for that step. Since tokens are predicted autoregressively, these kinds of models are naturally suitable for text generation tasks.
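The factorization in Equation (2) can be illustrated numerically with a toy bigram model, where each conditional is truncated to the immediately preceding token; a real decoder-only model conditions on the full prefix with a neural network, and the probabilities below are invented.

```python
import math

# Toy bigram illustration of the autoregressive chain rule in Eq. (2).
# All probabilities are invented for illustration.
BIGRAM = {("<s>", "the"): 0.5, ("the", "war"): 0.2, ("war", "ended"): 0.4}

def sequence_logprob(tokens):
    """log p(x_1, ..., x_m) = sum_i log p(x_i | prefix), truncated to bigrams."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return logp
```

Sampling from such a model token by token, each draw conditioned on the tokens emitted so far, is exactly the autoregressive generation loop used at inference time.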

Unlike decoder-only models, seq2seq models contain both an encoder and a decoder, where the former encodes the source text and the latter performs the autoregressive generation, thus unifying language understanding and generation in a single framework. Seq2seq models are usually based on architectures such as recurrent neural networks (RNNs) [120], convolutional neural networks (CNNs) [37], and Transformers [128]. Given a source sentence \({\bf x}=\lbrace x_{1}, \cdots , x_{m}\rbrace\) and a target sentence \({\bf y}=\lbrace y_{1}, \cdots , y_{n}\rbrace\), the encoder first learns to encode the variable-length source sequence into contextualized vector representations, which can be formulated as (3) \(\begin{equation} ({\bf e}_{1}, \cdots , {\bf e}_{m}) = \textrm {Encoder}({\bf w}_1, \cdots , {\bf w}_{m}) , \end{equation}\) where \({\bf w}_{i}\) is the fixed-length embedding of token \(x_{i}\), and \({\bf e}_{i}\) is its corresponding contextualized hidden representation. After that, the decoder decodes these representations into a variable-length sequence. Specifically, the decoder generates a token at each time step like the decoder-only model, consuming the previously generated sequence as additional input. This process can be represented as (4) \(\begin{equation} {\bf d}_{i} = \textrm {Decoder}({\bf e}_{1, \cdots , m}, \hat{{\bf w}}_{1, \cdots , i-1}) , \end{equation}\) (5) \(\begin{equation} p_{\theta }(\hat{y}_{i} | \hat{y}_{1}, \cdots , \hat{y}_{i-1}) = \textrm {Logit}({\bf d}_{i}) , \end{equation}\) where \(\hat{{\bf w}}_{i}\) is the fixed-length embedding of generated token \(\hat{y}_{i}\), and Logit(\(\cdot\)) is a nonlinear multi-layered function that predicts the probability of output \(\hat{y}_{i}\). The seq2seq architecture is commonly used in sequence learning and has shown great success in various text-to-text generation tasks [24, 65, 101, 131].
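The data flow of Equations (3)-(5), encode once, then decode autoregressively, can be sketched with trivial stand-ins: the “encoder” is a one-hot embedding lookup and the “decoder” a dot-product argmax over a three-word vocabulary. Nothing here is learned; the sketch only shows the control flow.

```python
# Toy sketch of the seq2seq loop in Eqs. (3)-(5). EMB and the copy-like
# decoding policy are invented; real encoders/decoders are neural networks.
EMB = {"he": [1, 0, 0], "is": [0, 1, 0], "happy": [0, 0, 1]}

def encoder(tokens):
    return [EMB[t] for t in tokens]  # e_1 .. e_m

def decoder_step(enc_states, generated):
    i = len(generated)               # step i, conditioned on w_1 .. w_{i-1}
    if i >= len(enc_states):
        return None                  # stand-in for the end-of-sequence decision
    # stand-in for Logit(d_i): pick the vocabulary item best matching e_i
    return max(EMB, key=lambda t: sum(a * b for a, b in zip(EMB[t], enc_states[i])))

def greedy_decode(src):
    enc, out = encoder(src), []
    while (tok := decoder_step(enc, out)) is not None:
        out.append(tok)
    return out
```

With attention-based models, `decoder_step` would additionally compute a weighted combination of all encoder states rather than attending to a single position.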

In the field of FLG, almost all neural-based models are based on these two structures. We classify all the models into three categories based on their training methods: (i) training from scratch; (ii) fine-tuning PLMs; and (iii) prompt learning. In the following, we will review the related FLG literature and discuss their advantages and disadvantages in detail.

4.1 Training From Scratch

Training a neural model from scratch involves randomly initializing a new neural network (possibly based on an existing architecture) and then training it for a target task using the corresponding training objective and dataset. In this context, the long short-term memory (LSTM [46]) network with the attention mechanism [4, 53] is widely used and very effective for language generation tasks. In the seq2seq framework, the attention mechanism allows the decoder to flexibly use the most relevant parts of the input sequence through a weighted combination of the encoder outputs, with the most relevant parts being assigned the highest weights.

Basic seq2seq training Generally, a seq2seq framework needs to be trained with parallel pairs containing source and target texts. For instance, Peled and Reichart [95] used literal-sarcastic pairs to train an LSTM-based seq2seq model for sarcasm utterance interpretation, which aims at translating sarcastic texts into non-sarcastic ones. Fadaee et al. [31] trained an attention-based LSTM seq2seq model using parallel data to perform the task of idiom translation between English and German. Liu et al. [76] presented a seq2seq model for metaphor and personification generation based on poetry topics. In order to control the rhetoric modes, they introduced additional strategies to capture various rhetorical patterns and regulate the generation process. Stowe et al. [116] trained a seq2seq model with synthetic parallel data, employing a metaphor masking framework for metaphor generation. They replaced metaphoric words in the input texts with unique metaphor masks, resulting in parallel training data consisting of the source sentence with masked words and the target sentence with the original words (e.g., “The war <MET> many people” \(\rightarrow\) “The war uprooted many people”). At inference time, they fed the verb-masked literal texts into the model trained with metaphor masking to generate the target outputs.
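The masking step of such synthetic-parallel-data construction can be sketched as follows; the hard-coded verb set is a toy placeholder for a real metaphor identification step.

```python
# Sketch of metaphor-masking data construction: metaphorically used verbs are
# replaced by a <MET> mask to form the source side of a synthetic parallel
# pair, with the original sentence as the target. The verb set is invented.
METAPHORIC_VERBS = {"uprooted", "devoured"}

def make_masked_pair(sentence: str):
    tokens = sentence.split()
    masked = ["<MET>" if t in METAPHORIC_VERBS else t for t in tokens]
    return " ".join(masked), sentence  # (masked source, original target)
```

A seq2seq model trained on such pairs learns to infill the mask with metaphoric words; at inference, masking a literal verb invites a metaphoric replacement.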

Training without parallel data Supervised learning usually requires large-scale parallel training data, but the creation procedure is very time-consuming and costly, especially across different figures of speech. In light of this, unsupervised approaches that need no manually labelled parallel data are commonly used. For instance, Yu et al. [139] proposed a model for generating homographic puns without using any pun training data. Specifically, they first trained a conditional neural language model on a general text corpus, which can generate a sentence containing a given word with specific senses. Then, based on two senses of a target word as inputs, they designed a novel joint beam search algorithm to generate two pun sentences in parallel. The two pun sentences should be identical except for the input words, and suitable for the two specified senses of a homographic word (e.g., for the pun “Math teachers have lots of problems,” “problems” can refer either to (i) a source of difficulty, or (ii) a question proposed for consideration or solution). Yu and Wan [140] presented a framework for verb-oriented metaphor generation, which is trained on an English Wikipedia corpus without manually labelled metaphor data. They designed an unsupervised approach to automatically extract metaphorically used verbs and their fit words. For example, in “she devoured (enjoyed) his novels,” the literal sense of the fit word “enjoyed” represents the sense of “devoured” in this context. Then, they employed a POS-constrained language model to generate a sentence containing the given verb, while considering its fit word during decoding using a dedicated algorithm. Li et al. [69] presented a GPT-2 based multitask framework for metaphor generation. 
Specifically, they used a small amount of metaphor-labelled data to train a metaphor identification module on top of the GPT-2 contextualized embedding, which is then used to obtain potential metaphors from a large-scale unlabelled corpus. Based on the identification module that can compute the metaphorical probability, a metaphor weighting mechanism is designed to encourage the generation model to pay more attention to metaphor-related parts of the input, thereby improving the metaphoricity of the generated text.

Neural retrieval Neural-based retrieval approaches have also been explored in FLG; these generally do not require parallel training data and often involve a series of steps, such as retrieval, replacement, and generation. He et al. [44] introduced a local-global surprisal principle in pun generation, where they hypothesized that there is a strong association not only between the pun word and the distant context but also between the corresponding alternative word and the immediate context. For example, in the text “Yesterday I accidentally swallowed some food colouring. The doctor says I’m OK, but I feel like I’ve dyed (died) a little inside,” the pun word “dyed” indicates that the person is coloured inside by the food colouring, while the alternative word “died” implies another interpretation in context: the person could be dying due to the accident. This contrast creates surprise, which in turn creates a sense of humour. To instantiate the local-global surprisal principle, they introduced an unsupervised approach based on a retrieve-and-edit framework, developed to replace the alternative word with the pun word, thereby generating puns starting from an unhumorous corpus. However, such direct replacement usually leads to grammatical errors in the generated sentences, since the part-of-speech tags of puns and their alternative words are often different. With candidate sentences containing the alternative word at hand, Yu et al. [141] proposed a seq2seq framework with lexical constraints to address this issue. For a given pair of homophones, they first retrieved sentences containing the pun word and the alternative word. 
Then, they designed a selection algorithm to extract positive constraints between the pun word and the support word in the corresponding sentence (e.g., the pun word “tuna” and the support word “fisherman” from the sentence “the fisherman catch tuna with several methods,”) as well as the negative ones between the alternative word and the weak word in the corresponding sentence (e.g., the alternative word “tune” and the weak word “boy” from the sentence “the whistling boy was always out of tune.”) Finally, based on selected constraints, they used a large amount of parallel paraphrase data to train a seq2seq generator for rewriting the candidate sentence into a homophonic pun (e.g., “the whistling fisherman was always out of tuna.”)

A supervised retrieval-based pipeline was introduced by Zhou et al. [146], which aims at transforming a given literal sentence into its idiomatic counterpart. The authors designed three different modules in their system: (i) a retrieval module that is used to retrieve the appropriate idiom for a given literal sentence from a pool of available idioms and their definitions; (ii) a span extraction module that identifies the span of the literal input to be replaced with the retrieved idiom; (iii) the generation module which is used to generate the idiomatic text based on the retrieved idiom and the literal input without the identified span.
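The retrieval module of such a pipeline can be sketched as a simple definition-matching step: each idiom in the pool is scored by word overlap between its definition and the literal input, and the best match is returned. The mini idiom pool below is invented, and a real system would use learned dense retrieval rather than raw word overlap.

```python
# Toy sketch of an idiom-retrieval module: score idioms by word overlap
# between their definitions and the literal input. The pool is invented.
IDIOM_POOL = {
    "spill the beans": "reveal a secret",
    "break the ice": "make people feel more comfortable",
}

def retrieve_idiom(literal_sentence: str) -> str:
    words = set(literal_sentence.lower().split())
    return max(IDIOM_POOL,
               key=lambda idiom: len(words & set(IDIOM_POOL[idiom].split())))
```

The span-extraction and generation modules would then decide which part of the literal input the retrieved idiom replaces, and rewrite the sentence around it.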

RL for FLG Researchers have also employed reinforcement learning (RL) techniques to optimize specific objectives in FLG. The main idea is to design a reward strategy that targets the desired figure of speech and employ RL algorithms (e.g., policy gradient [137]) to maximize the expected reward during training. Zhu et al. [148] presented an unsupervised style transfer model to explore irony generation. They designed an irony reward strategy to control irony accuracy, and used a denoising auto-encoder and back-translation to preserve content. Additionally, they introduced a sentiment reward strategy to preserve sentiment polarity in the transformation from non-ironic to ironic. Luo et al. [77] presented an unsupervised framework for pun generation, which consists of a generator producing pun texts, and a pun classifier distinguishing between the generated texts and literal texts with specific word senses. Here, the probability output of the classifier is used as a reward signal to train the generator through reinforcement learning, which encourages the generator to produce texts that support two word senses simultaneously. Mishra et al. [83] developed an unsupervised framework to generate sarcastic texts given literal negative opinions. Given that sarcasm comes from context-incongruity, the authors designed three modules to introduce incongruity into the literal input: (i) filtering factual content; (ii) retrieving incongruous phrases; and (iii) synthesising sarcastic text. The framework is trained with non-parallel sarcastic and literal texts through reinforced neural seq2seq learning, which takes the confidence score of the discriminator (i.e., literal vs. sarcastic) as the reward for the figurative form.
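The reward idea can be illustrated with a minimal REINFORCE sketch: a classifier's figurative-confidence score is the reward, and the policy gradient pushes a one-parameter "generator" (a Bernoulli choice between two canned outputs) toward the candidate the classifier scores highly. Everything here, candidates, scores, and the single-parameter policy, is invented for illustration; real systems reward entire generated sequences.

```python
import math
import random

# Toy REINFORCE sketch: classifier confidence as reward for a Bernoulli policy.
CANDS = ["he is very happy", "he is floating on cloud nine"]
REWARD = {CANDS[0]: 0.1, CANDS[1]: 0.9}  # toy classifier p(figurative)

def train(steps=3000, lr=0.1, seed=0):
    rng, theta = random.Random(seed), 0.0   # theta: logit of picking CANDS[1]
    for _ in range(steps):
        p1 = 1 / (1 + math.exp(-theta))
        a = 1 if rng.random() < p1 else 0
        # REINFORCE update: gradient of log pi(a) times the observed reward
        theta += lr * (a - p1) * REWARD[CANDS[a]]
    return 1 / (1 + math.exp(-theta))       # final p(figurative candidate)
```

With the toy reward above, training drives the policy toward the figurative candidate; in practice the reward is combined with content- and fluency-preserving objectives, since a figurativeness reward alone gives the generator licence to drift from the input.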

In summary: on the one hand, training a neural model usually requires large-scale training data (parallel or non-parallel), to improve the model’s generalization ability; also, the training process usually needs large computational resources. On the other hand, neural models can be trained based on the users’ needs without too much extra-linguistic knowledge or complex linguistic processing, although integrating linguistic knowledge and retrieval approaches into neural networks can enhance the models’ performance and interpretability.

4.2 Fine-Tuning Pre-trained Models

The development of PLMs is considered a revolutionary breakthrough in NLP, and in neural network research more broadly. PLMs are trained by self-supervised learning on large amounts of raw textual data, learning language knowledge and representations in a universal space. The models are then fine-tuned, usually with limited amounts of labelled data, for downstream NLP tasks using task-specific objective functions. This paradigm therefore avoids training models from scratch and usually achieves state-of-the-art performance.

Context training Similar to [116], Zhang and Wan [145] proposed a masking framework for hyperbole generation without literal-hyperbole pairs. They first fine-tuned the seq2seq PLM BART [65] to infill masked hyperbolic spans of source sentences during training, then masked part of the input literal sentence and fed it into the model to generate multiple candidate hyperbolic sentences during inference. Finally, a ranker was used to select the best candidate based on the hyperbolicity score and paraphrase quality of the sentences. Similarly, Zhou et al. [147] fine-tuned BART to generate a literal paraphrase given an idiomatic sentence. Specifically, they trained a masked conditional sentence generation model to fill the masked word using the definition of the word and its part-of-speech tag. Based on parallel data of similes and simile-free segment contexts, Zhang et al. [143] trained a framework consisting of BERT-based [27] encoders and Transformer-based [128] decoders for the task of writing polishment with similes. The system first locates places in a given text where similes can be inserted and then generates location-specific similes for them (e.g., “He appeared there [INSERT], holding the door frame with one hand, blocking her retreat” \(\rightarrow\) “He appeared there like a ghost, holding the door frame with one hand, blocking her retreat”).
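The final rank-and-select step of such mask-and-infill pipelines can be sketched as below. The candidates, scores, and product-of-scores combination are illustrative choices, not the cited papers' actual scoring functions.

```python
# Toy sketch of candidate ranking after mask infilling: each rewrite gets a
# figurativeness (here, hyperbolicity) score and a paraphrase-quality score,
# and the top candidate under a simple product combination is kept.
def best_candidate(candidates):
    return max(candidates, key=lambda c: c["hyperbole"] * c["paraphrase"])

CANDIDATES = [
    {"text": "I am hungry",                     "hyperbole": 0.10, "paraphrase": 0.90},
    {"text": "I could eat a horse",             "hyperbole": 0.90, "paraphrase": 0.80},
    {"text": "I could eat the entire universe", "hyperbole": 0.95, "paraphrase": 0.30},
]
```

The product rewards candidates that are both strongly figurative and faithful to the input: the most extreme infill loses here because its paraphrase quality is low.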

Pseudo-parallel data construction and training PLMs have been explored for automatically compiling parallel training data in FLG. Chakrabarty et al. [21] first collected comments containing the phrase like a from the social media site Reddit, and then leveraged the pre-trained generative model of the knowledge graph COMET [13] to transform simile into literal text utilizing the PROPERTY relation. This process creates literal-simile pairs, which are then used to fine-tune BART to generate novel similes for a given literal sentence. Stowe et al. [117] identified and masked metaphoric verbs of sentences from Gutenberg Poetry corpus [49], then employed BERT to infill the mask tokens. Based on the resulting large-scale parallel data, the authors proposed to control the generation process by encoding conceptual mappings between cognitive domains. Accordingly, they incorporated both target and source domains using FrameNet [6] into the source literal sentence as control codes. The pairs are then used to fine-tune BART to generate meaningful metaphoric texts. Similarly, Chakrabarty et al. [23] and Stowe et al. [115] employed masked language modelling combined with commonsense inferences to automatically create a large number of literal-metaphor pairs. Chakrabarty et al. [23] designed a metaphor detection model as a discriminator to guide the model’s decoding during generation, addressing the problem that the model tends to generate literal tokens rather than metaphorical ones. In order to compare the free and controlled generation methods, Stowe et al. [115] fine-tuned the PLM T5 [101] for metaphor generation. For the free generation, the model is trained with original literal-metaphoric pairs while the controlled generation involves adding additional constraints to the generation objective, encouraging the model to generate a specific metaphor. 
They found that controlled generation can improve the metaphoricity of the generated sentences, while free generation tends to generate more fluent and coherent outputs.

Combining knowledge resource Unlike Chakrabarty et al. [21] who used COMET to construct a parallel dataset, Tian et al. [123] leveraged it to develop a framework for hyperbole generation directly. They employed COMET and its reverse models to perform commonsense and counterfactual inference, and then generate multiple hyperbole candidates. Finally, they trained a generic hyperbole classifier and a specific pattern classifier to rank and select high-quality hyperboles. Chakrabarty et al. [19] also explored a retrieve-and-edit method using COMET. They developed an unsupervised framework for sarcasm generation given literal input sentences, which models two main characteristics of sarcasm: reversal of sentiment valence and semantic incongruity with the context. To implement the reversal of valence, they used WordNet [81] and SentiWordNet [30] to identify the evaluative word and select antonyms to generate the sarcastic utterance. Then, they employed COMET to retrieve relevant commonsense context to be added to the generated sarcastic texts. Mittal et al. [84] first used a reverse dictionary to generate a list of related concepts that are monosemous for both senses given two sense definitions of a target pun word. Based on the related concepts, they explored three methods: extractive-based, similarity-based, and generative-based, to generate context words. Finally, they fine-tuned T5 to generate humorous puns given the pun word and generated context words. Chakrabarty et al. [18] proposed the task of generating a plausible continuation given fictional narratives containing a figurative expression such as an idiom or a simile. To do so, they introduced a knowledge-enhanced strategy (PARA-COMET [35] and COMET-ConceptNET [48]) to fine-tune PLMs such as BART, T5, and GPT-2. This helps the models to infer the meaning from the context and rely on the literal meanings of constituent words, possibly following more closely human strategies in interpreting figurative language. 
Based on COMET, Ruan et al. [109] proposed an extraction-generation-ranking approach for image-sarcasm generation. Specifically, they first extracted diverse information from an image and used COMET to infer the consequence of the sentimental descriptive caption to generate candidate sarcastic texts. Then, they designed a ranking algorithm that considers image-text relation, sarcasticness, and grammaticality to select a final text from the candidate texts. Tian et al. [122] presented a framework for the task of generating puns. For a given pun word pair, the authors first retrieved a context word and a phrase from a large corpus based on Wikipedia and Gutenberg BookCorpus. Then, they fine-tuned the PLM GPT-2 model to learn the task of generating a sentence containing the input, including a keyword and a phrase. Finally, they trained a word-level label predictor based on linguistic attributes of puns, which is used to steer the model to generate puns.

More recently, Lai and Nissim [61] presented the task of multi-figurative language generation, addressing the limitations of existing works that focus on modelling single figurative forms separately, thereby possibly missing out on shared characteristics. Specifically, they provided a benchmark for the automatic generation of five common figurative forms by combining together and reusing existing resources, and proposed a BART-based framework with a mechanism for injecting information about the target figure of speech (or literal expression) into the encoder. This approach enables the transformation of sentences between different forms, including literal and figurative, without relying on parallel figurative-figurative sentence pairs.

In summary, this paradigm of fine-tuning general PLMs for the target downstream tasks, and more specifically in the case of FLG discussed here, usually achieves state-of-the-art results. Many works have also employed PLMs to construct synthetic parallel data, which can be used for fine-tuning PLMs and thereby mitigate the need for large-scale labelled data. However, fine-tuning PLMs with specific training objectives often requires a large computational capacity due to their large-scale sizes.

4.3 Prompt Learning

While fine-tuning PLMs often yields state-of-the-art results, this strategy does not work well if only a handful of examples are available for the target task, which is a very common situation in NLP. A new paradigm, prompt learning, has been proposed for scenarios where the number of available examples is limited. This approach can utilize limited data as supervision to rapidly generalize to the target task [135]. In particular, with the development of LLMs such as GPT-3, prompt learning has shown impressive performance when prompting the model with a small number of labelled examples, or even with no examples at all (i.e., zero-shot learning). Generally, the prompting method manipulates the model behaviour by prepending a task instruction, allowing LLMs to generate the desired output [16, 75, 99, 101]. For example, a source sentence from an English-to-Italian translation task can be reformulated as “Translate English to Italian: [source English sentence].”
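A minimal few-shot prompt for FLG can be assembled as below: a task instruction, followed by k labelled examples, followed by the test input. The instruction wording and the example pair are illustrative, not taken from any cited paper.

```python
# Sketch of few-shot prompt construction for FLG with an LLM. All strings
# passed in are illustrative placeholders.
def build_prompt(instruction, examples, test_input):
    lines = [instruction, ""]
    for literal, figurative in examples:
        lines += [f"Literal: {literal}", f"Figurative: {figurative}", ""]
    lines += [f"Literal: {test_input}", "Figurative:"]
    return "\n".join(lines)
```

For instance, `build_prompt("Rewrite the sentence as a hyperbole.", [("I am tired", "I am dead on my feet")], "I am hungry")` ends with an open `Figurative:` slot for the model to complete; passing an empty example list reduces this to the zero-shot setting.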

In the field of FLG, Reif et al. [106] presented a prompting and augmented zero-shot learning method for metaphor generation, which frames the generation process as a sentence rewriting task. Concretely, they leveraged large PLMs to perform zero-shot learning, which requires only a natural language instruction and an example from a relevant task (e.g., style transfer), without the need for model fine-tuning or examples specific to metaphor generation. Chakrabarty et al. [18] conducted zero-shot experiments on GPT-2 XL and GPT-3 to generate a coherent and contextually consistent next sentence for a given narrative containing a figurative expression. They let the models generate up to 20 tokens, stopping when an end-of-sentence token was generated. In addition, they performed a few-shot experiment in which they prompted GPT-3 with four task examples followed by a test input, to guide the model to generate corresponding outputs. However, the experimental results indicate that, regardless of model size, language models struggle to perform well in zero-shot and few-shot settings when compared to fine-tuned PLMs such as BART. Additionally, all methods still lag significantly behind human performance. Recently, Mittal et al. [84] investigated GPT-3’s few-shot learning to generate puns with context. Specifically, they prompted GPT-3 with two examples to generate context words, and then to generate puns incorporating context words from two different senses. Similar to [18], they found that fine-tuning T5 outperforms few-shot GPT-3 in terms of generating funny pun sentences. Bhavya et al. [12] studied the task of generating analogies by prompting the LLM InstructGPT [90] in the zero-shot setting. They found that InstructGPT is effective at generating meaningful analogies, and that the largest model can achieve human-level performance in generating analogies for given target concepts. However, the model is quite sensitive to different prompts across temperature settings.

In summary, these prompting methods require only a few labelled samples, or none at all, leveraging the prior knowledge of LLMs. However, the inference process for LLMs still demands significant computational resources compared to fine-tuning general PLMs, which can be much smaller than LLMs. Moreover, LLMs are sensitive to prompts across tasks, and it is not yet clear how a prompt influences a model’s performance or how to design an effective prompt for a given task.


5 EVALUATION

Evaluation is one of the most important parts of natural language generation, since it not only involves the assessment of the system but also influences the modelling approach. For instance, while automatic metrics can be used to quickly evaluate different models, giving researchers the performance feedback they need during development, most works also include human evaluation at the final stage, as it can provide more reliable assessments. However, evaluation in FLG, whether carried out by humans or through automated methods, continues to pose a challenge due to the subtle structures and subjective interpretations of figurative language. Generally, evaluating a generated text is a multifaceted task requiring the assessment of different criteria, such as context preservation,2 style strength, coherency, fluency, and so on.

5.1 Automatic Evaluation

Automatic metrics are very popular due to the ease of running automated assessments as systems are being developed, quickly providing a reproducible and scalable evaluation. Table 4 (left block) shows some commonly used metrics in the automatic evaluation of FLG, though it should be borne in mind that most of them were not designed specifically for FLG. These metrics can be divided along three different dimensions: (i) context preservation: whether the context of the generated sentence is the same as that of the original sentence; (ii) form strength: to what extent the generated text fits the target figurative form; and (iii) fluency: how fluent the language in the generated text is.

Table 4.
Automatic Evaluation:

| Metric | Total | Metric | Total |
| --- | --- | --- | --- |
| BLEU | 12 | Average Length | 3 |
| BERT(Score) | 8 | Novelty | 2 |
| Perplexity | 7 | Embedding Similarity | 2 |
| Distinct | 7 | BLEURT | 2 |
| ROUGE | 6 | Log-Likelihood | 2 |
| Rhetoric Classifier | 5 | Word Number | 2 |
| METEOR | 4 | HM/GM | 2 |
| Others | 18 | - | - |

Human Evaluation:

| Criterion | Total | Criterion | Total |
| --- | --- | --- | --- |
| Fluency | 13 | Success Rate | 4 |
| Creativity | 8 | Adequacy | 3 |
| Grammaticality | 6 | Funniness | 3 |
| Coherence | 5 | Sarcasticness | 2 |
| Metaphoricity | 4 | Semantic Similarity | 2 |
| Meaning | 4 | Readability | 2 |
| Overall | 4 | Relevance | 2 |
| Others | 32 | - | - |

Table 4. Automatic Metrics used for the Automatic Evaluation and Criteria Set for the Human Evaluation in FLG from 34 Papers

Context Preservation The generated texts are compared to human-written references (or source texts) in terms of their semantic similarity using automatic metrics. As with other NLG tasks, there are many semantic metrics that can be applied to measure sentence similarity in FLG. These metrics can be categorized into surface-based and neural-based methods:

BLEU [92] is the most popular metric used in many natural language generation tasks, including machine translation, question answering, text summarization, and style transfer. It compares generated texts to source or reference texts using a precision-oriented approach based on \(n\)-gram overlap. A BLEU-\(n\) score can be roughly computed as follows: (6) \(\begin{equation} \textrm {BLEU-}n = \frac{\sum _{C\in \left\lbrace Candidates \right\rbrace }\sum _{n-gram \in C} Count_{match}(n-gram)}{\sum _{C\in \left\lbrace Candidates \right\rbrace }\sum _{n-gram \in C} Count(n-gram)} , \end{equation}\) where \(C\) represents a generated text in the candidate set, and \(match\) means that a \(n\)-gram appears in both the generated and reference texts.
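Equation (6) can be sketched in a few lines of Python. This minimal version computes only the clipped n-gram precision, omitting the brevity penalty and the geometric mean over n-gram orders used in the full BLEU metric:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(candidates, references, n=2):
    """Corpus-level BLEU-n as in Equation (6): clipped n-gram matches
    over the total number of candidate n-grams (precision-oriented)."""
    matched, total = 0, 0
    for cand, ref in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clip each n-gram's count by its count in the reference
        matched += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total if total else 0.0

print(bleu_n([["the", "cat", "sat"]], [["the", "cat", "is", "sat"]], n=2))  # 0.5
```

In the example, one of the two candidate bigrams, ("the", "cat"), also occurs in the reference, giving a BLEU-2 precision of 1/2.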

ROUGE [70] is another metric based on overlap counting, but with a recall-oriented approach commonly used in automatic summarization evaluation. The formula of ROUGE-\(n\) is as follows: (7) \(\begin{equation} \textrm {ROUGE-}n = \frac{\sum _{C\in \left\lbrace Candidates \right\rbrace }\sum _{n-gram \in C} Count_{match}(n-gram)}{\sum _{R\in \left\lbrace References \right\rbrace }\sum _{n-gram \in R} Count(n-gram)} , \end{equation}\) where \(R\) represents a reference text in the reference set. Compared to the BLEU formula in Equation (6), the only difference is the denominator, which counts the total number of n-grams in the reference set rather than in the candidate set.
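The recall-oriented counterpart in Equation (7) differs only in its denominator; a self-contained sketch (with its own n-gram helper) might look as follows:

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidates, references, n=2):
    """Corpus-level ROUGE-n as in Equation (7): clipped n-gram matches
    over the total number of reference n-grams (recall-oriented)."""
    matched, total = 0, 0
    for cand, ref in zip(candidates, references):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # count matches relative to the reference n-grams
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

On the same toy pair used for BLEU above, one of the three reference bigrams is matched, giving ROUGE-2 = 1/3 where BLEU-2 gives 1/2.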

METEOR [7] is an automatic metric based on a generalized concept of unigram matching between the generated texts and their corresponding references. It computes the similarity score of two texts using a combination of unigram precision, unigram recall, and some additional measures such as stem and synonym matching. Specifically, WordNet is used to expand the synonym set, and word stems are also taken into account in matching. A candidate can therefore be assessed even without an exact match with the references.
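Leaving aside stem and synonym matching, the core of METEOR reduces to a recall-weighted harmonic mean of unigram precision and recall. A simplified sketch (exact matches only, no fragmentation penalty, using the weighting \(F = P \cdot R / (\alpha P + (1-\alpha) R)\) with \(\alpha = 0.9\)):

```python
from collections import Counter

def meteor_fmean(candidate, reference, alpha=0.9):
    """Recall-weighted harmonic mean of unigram precision and recall,
    computed from exact unigram matches only; the full METEOR metric
    additionally uses stemming, WordNet synonyms, and a chunk penalty."""
    c, r = Counter(candidate), Counter(reference)
    matches = sum(min(c[w], r[w]) for w in c)
    if matches == 0:
        return 0.0
    p = matches / sum(c.values())     # unigram precision
    rec = matches / sum(r.values())   # unigram recall
    return p * rec / (alpha * p + (1 - alpha) * rec)
```

With alpha = 0.9, recall is weighted much more heavily than precision, reflecting METEOR's design choice.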

Embedding similarity [73] measures the semantic distance between texts using pre-trained word embeddings, such as Word2Vec [80]. This kind of metric usually calculates the cosine distance between output and reference using sentence-level embeddings. Embedding Average computes a sentence's embedding by averaging the word embeddings of the tokens in the sentence. Vector Extrema is another way to calculate sentence-level embeddings: for each dimension, it takes the most extreme value (maximum or minimum) amongst all word vectors, and uses that value in the sentence-level embedding. Greedy matching, instead, greedily matches tokens based on the cosine similarity of their word embeddings, and the overall score is then averaged across all words: \(\begin{equation*} \textrm {G}(C, R) = \frac{\sum _{w_{1}\in C}\textrm {max}_{w_{2} \in R}\textrm {cosine}(e_{w_{1}}, e_{w_{2}})}{|C|} \nonumber \nonumber \;\;, \quad \textrm {GM}(C,R) = \frac{G(C, R) + G(R, C)}{2}, \end{equation*}\) where \(C\) and \(R\) represent the candidate output and reference, respectively. Since \(G(\cdot)\) is asymmetric, the greedy matching scores in both directions are averaged to obtain the final value.
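The greedy matching formulas above can be sketched with NumPy; the toy two-dimensional "word embeddings" used here are illustrative only:

```python
import numpy as np

def greedy_match(cand_emb, ref_emb):
    """G(C, R): best cosine similarity of each candidate word vector
    against all reference word vectors, averaged over candidate words."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = c @ r.T  # pairwise cosine similarity matrix
    return sims.max(axis=1).mean()

def greedy_matching_score(cand_emb, ref_emb):
    """GM(C, R): since G is asymmetric, average the two directions."""
    return (greedy_match(cand_emb, ref_emb) + greedy_match(ref_emb, cand_emb)) / 2

# A 1-token candidate against a 2-token reference
cand = np.array([[1.0, 0.0]])
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
print(greedy_matching_score(cand, ref))  # 0.75
```

Here G(C, R) = 1.0 (the single candidate vector matches the first reference vector exactly) while G(R, C) = 0.5, so the averaged score is 0.75.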

BERTScore [71] is a BERT-based automatic metric for text generation. Instead of exact matching, it uses greedy matching to compute a similarity score between each token in the candidate text and each token in the reference text, using contextual embeddings. The metric combines precision and recall into an F measure, which can be formulated as \(\begin{equation*} \textrm {R} =\frac{1}{|R|} \sum _{r \in R} \max _{c \in C} \mathbf {c}^\top \mathbf {r} \nonumber \nonumber \;\;, \quad P =\frac{1}{|C|} \sum _{c \in C} \max _{r \in R} \mathbf {c}^\top \mathbf {r} \nonumber \nonumber \;\;, \quad F = 2\frac{R \cdot P }{ R + P }\;\;, \end{equation*}\) where \(C\) and \(R\) represent the candidate and reference, respectively: recall averages the best matches over reference tokens, and precision over candidate tokens.
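The matching step of BERTScore can be sketched in the same fashion over contextual token embeddings, with precision averaging over candidate tokens and recall over reference tokens; this sketch omits the metric's optional idf weighting and baseline rescaling:

```python
import numpy as np

def bertscore_prf(cand_emb, ref_emb):
    """BERTScore-style greedy matching over contextual token embeddings:
    precision averages the best match for each candidate token, recall
    for each reference token; the two are combined into an F measure."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = c @ r.T
    p = sims.max(axis=1).mean()    # each candidate token vs. best reference token
    rec = sims.max(axis=0).mean()  # each reference token vs. best candidate token
    f = 2 * p * rec / (p + rec)
    return p, rec, f
```

In the full metric, the embeddings come from a pre-trained BERT-family model rather than the toy vectors used in the tests here.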

BLEURT [112] is a learnable metric based on BERT and trained on human judgements. It employs a pre-training scheme that uses millions of synthetic examples to help the model generalize: it has shown a high correlation with human judgement in the evaluation of machine translation [112] and formality transfer [60]. (8) \(\begin{equation} \textrm {BLEURT}(C, R) = \mathbf {W}\mathbf {v}_{[CLS]} + \mathbf {b} , \end{equation}\) where \(\mathbf {W}\) and \(\mathbf {b}\) are the weight matrix and bias vector, respectively, and \(\mathbf {v}_{[CLS]}\) is the representation of the special [CLS] token prepended to the concatenation of the candidate output \(C\) and reference \(R\).

In our review, we observe that, as in most NLG tasks, BLEU is the most popular metric in FLG due to its light weight and fast computation. However, \(n\)-gram based metrics usually fail to capture information beyond the lexical level, and BLEU has shown poor correlation with human judgement in the evaluation of machine translation [8], question answering [17, 73], and formality transfer [15, 60]. In recent years, many neural network based metrics (e.g., BERTScore and BLEURT) have been developed as alternatives to \(n\)-gram based metrics. These metrics generally correlate better with human judgement, as they can capture semantic or syntactic variations of the given reference beyond the surface level.

Form Strength The strength of the desired target form in the generated text can be assessed by a figurative language classifier, as commonly done with style classifiers in style transfer [15, 60]. This can be treated as a binary classification problem (i.e., literal vs. figurative), where a classifier is used to automatically evaluate the form strength of the generated texts [76, 115]. A binary classifier, trained beforehand, is employed to assess whether each generated text adheres to the target figurative form; recall, precision, and F score are then used to assess the model's performance. Research on text style transfer has shown that such classifiers correlate well with human judgement [60, 66].

Using a regressor in place of a classifier is another way to assess the strength of the figurative form, modelling the problem on a continuous scale of figurativeness rather than with discrete categories.
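Once a classifier's binary predictions on the generated texts are available, the resulting form-strength scores reduce to standard precision, recall, and F computations:

```python
def binary_prf(gold, pred):
    """Precision, recall and F1 of a binary figurative (1) vs. literal (0)
    classifier over generated texts."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In the simplest FLG setting, where every generated text is supposed to realise the target form, the gold labels are all 1 and the form strength is just the fraction of outputs the classifier labels figurative.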

Fluency This is a fundamental characteristic of human language that frequently plays an important role in the evaluation of NLG. To automate the assessment of this aspect, the perplexity (PPL) of generated texts is calculated against a language model that has been pre-trained on in-domain training data. This can be formulated as (9) \(\begin{equation} \textrm {PPL} = \sqrt [n]{ {\textstyle \prod _{i=1}^{n}} \frac{1}{p(w_{i}|w_{1:i-1})} } , \end{equation}\) where \(n\) is the number of tokens in the generated text, and \(p(w_{i}|w_{1:i-1})\) is the probability the language model assigns to the \(i\)th token \(w_i\) given the preceding tokens. A low perplexity indicates high quality of the evaluated texts, since perplexity measures how probable the token sequence is under the language model.
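Equation (9) is equivalent to exponentiating the average negative log-probability of the tokens, which is how perplexity is usually computed in practice for numerical stability:

```python
import math

def perplexity(token_probs):
    """Equation (9): the n-th root of the product of inverse token
    probabilities, computed in log space to avoid underflow."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model assigning uniform probability 1/4 to each of 4 tokens: PPL ~ 4.0
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The token probabilities would come from the pre-trained language model's conditional distributions over the generated sequence.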

From Table 4, we see that various automatic methods have been employed in FLG, including more than 10 metrics that are used only once. This suggests that the NLP community lacks shared standards in automatic evaluation, and it is yet unclear if these metrics are adequate for evaluating generated creative text. Extended research is needed specifically on automatic FLG evaluation.

5.2 Human Evaluation

Human evaluation can often provide a more reliable assessment of FLG than automatic metrics, although it is obviously more costly and cannot easily be used during iterative development. To conduct a human evaluation, specific criteria, such as context preservation, creativity, fluency, and overall judgement, can be assessed. Annotators can rate the generated texts along these criteria, or even compare and rank the outputs generated by different models at the same time.

Table 4 (right block) presents human evaluation criteria commonly used in FLG. Similar to automatic evaluation, we see that (i) the most commonly assessed criteria are fluency, creativity, meaning, and overall quality; (ii) there is a wide variety of measures, with more than 30 criteria used no more than twice, again suggesting a lack of agreement on strategies for human evaluation; and (iii) more importantly, due to the lack of relevant research, there is no standard human evaluation framework that can be used to avoid biased assessment, such as reference bias in machine translation evaluation [34], which can result in works that are neither directly comparable nor easy to reproduce.

In Table 5, we summarize the pros and cons of the above-mentioned evaluation methods. Overall, there are not as yet fully shared standards in the practices of human-based and automatic evaluations.

Table 5.
| Dimension | Method | Pros | Cons |
| --- | --- | --- | --- |
| Context preservation | BLEU | Intuitive; commonly used; easy to implement | Lexical-level evaluation |
| Context preservation | ROUGE | Intuitive; commonly used; easy to implement | Lexical-level evaluation |
| Context preservation | METEOR | Intuitive; commonly used; easy to implement; considers non-exact matching | Lexical-level evaluation; unigram matching only |
| Context preservation | Embedding similarity | Semantic measuring | Lexical-level evaluation |
| Context preservation | BERTScore | Semantic measuring; contextual learning | Lexical-level evaluation; large computational capacity |
| Context preservation | BLEURT | Semantic measuring; mimicking human judgment | Large computational capacity |
| Form strength | Classifier | Commonly used | Requires training specific classifiers |
| Fluency | Perplexity | Intuitive; common | Requires training specific LMs |
| Human evaluation | Human evaluation | More reliable | Costly; time-consuming; requires high-quality annotators |

Table 5. A Summary of the Automatic Evaluation and Human Evaluation

5.3 Benchmark Results

Table 6 shows benchmark results for each figure of speech. It is worth noting that these results vary across tasks and evaluation methods and are therefore not comparable to each other. Here we aim to provide a glimpse of the current status of this field, including the state-of-the-art models (see references) and their corresponding evaluation. For automatic evaluation, we see that different models are evaluated with different metrics, making it difficult to derive a consistent overview. Looking at human evaluation, we observe that in no case can performance be described as particularly good. These observations suggest that FLG (i) still needs plenty of work towards more performant models, and (ii) is in dire need of a shared evaluation framework.

Table 6.
| Figure of Speech | Task | Dataset | Reference | Automatic Metric | Result | Human Metric | Result |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Simile | Simile → Context with simile | [143] | [143] | Distinct | 87.8 | Creativity (0,1) | 0.7 |
| Metaphor | Literal → Metaphor | [32] | [106] | - | - | Style Strength (1–100) | >75 |
| Hyperbole | Literal → Hyperbole | [124] | [61] | Form Strength | 84.4 | - | - |
| Idiom | Literal → Idiom | [146] | [61] | Form Strength | 76.4 | - | - |
| Irony (Sarcasm) | Image → Sarcasm | [79] | [109] | Cosine Similarity | 25.3 | Sarcasticness (unknown) | 2.9 |
| Pun | Word Senses → Pun | [82] | [84] | Distinct-1 | 96.3 | Fun (1–5) | 3.0 |
| Personification | Topic → Personification | [76] | [76] | BLEU | 47.0 | Rhetorical Aesthetics (1–5) | 3.2 |

Table 6. Benchmark Results for each FLG Task


6 CHALLENGES AND FUTURE DIRECTIONS

Despite the recent interest in NLP, figurative language generation is still relatively understudied compared to other NLG tasks, even though progress in this field would benefit a variety of practical applications, such as education. In this section, we outline several challenging problems and prospective research directions that we believe are critical and valuable, as well as some potential impacts of FLG.

6.1 FLG in Multi-Figurative/-Lingual Settings

Multi-Figurative Language Generation At present, most existing work mainly focuses on modelling a single figurative form, i.e., reformulating a literal text into one with a specific target figure of speech. This strategy has the disadvantage of having to train separate models, one for each figure of speech, and of not exploiting potential knowledge transfer across figurative forms. Multi-figurative language generation is thus an interesting research direction. Recently, Lai and Nissim [61] took the first step towards multi-figurative language modelling and provided a benchmark for the automatic generation of five common figurative forms. This direction could potentially be further explored in the future, including (i) expanding and modelling more figures of speech; and (ii) exploring knowledge transfer across figurative forms, especially leveraging high-resource forms to model low-resource ones, since datasets are not equally available for the different figurative forms.

Multi-Lingual Figurative Language Generation As discussed in Section 2.3, almost all existing FLG research focuses on English, with a small amount of work on Chinese and German. Apart from extending FLG work to other languages, looking at this task from a multilingual viewpoint might shed more light on cross-lingual regularities and thus potentially help tackle the task better, both theoretically and from a practical perspective. A multilingual viewpoint on figurative language can also provide insights into machine translation and thus advance that area. The main problem is the lack of data for other languages, so more work should be expected and welcomed in this direction. Cross-lingual learning can also support the learning of computational models for low-resource languages and domains.

Benchmark Datasets Like many other tasks, using standard benchmark datasets is a crucial step in FLG research that can enable the development of various models, the assessment of their performances, and comparisons among them. Therefore, creating high-quality datasets is crucial and can greatly advance the progress of this field. In recent years, some (large) datasets of different figures of speech have been produced. However, most datasets are in English and cover a single figurative form (e.g., literal-to-idiom). As future work, more benchmark datasets developed from a multi-figurative and multilingual perspective, as discussed above, are needed and would form a significant contribution. Datasets for other, partially related tasks involving creative language and double meanings, such as tricky riddles [144], could also be considered in the context of FLG.

6.2 Methodology

We limit here our discussion to issues related to neural models as they are the most popular approach to FLG, and the one with the most potential for future development.

Reducing the need for parallel data Fine-tuning PLMs avoids having to train a model from scratch, thereby accelerating the convergence of the network. Although this has recently achieved some promising results in FLG, it still requires good amounts of task-specific parallel data (e.g., literal-idiom pairs). Self-supervised learning has been applied to FLG, for example by masking the figurative words in a sentence and predicting which words should replace these masks [116, 145]. While these works apply self-supervised learning only to a specific downstream task, they also suggest a potential direction for training tailor-made PLMs for FLG. This strategy of training a figurative-mask language model is worth exploring to reduce the need for task-specific parallel data in the fine-tuning stage. The key challenge is designing an effective self-supervised training objective that is close to the target task. Another strategy is to exploit prior domain and task knowledge learned from transfer learning and pre-training to explore unsupervised approaches that do not require parallel data, such as PLM-based back translation and cross-figurative knowledge transfer.
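A self-supervised objective of this kind can be sketched as follows; the figurative-word lookup and the mask token are illustrative assumptions, not the actual scheme of [116, 145]:

```python
def mask_figurative(tokens, figurative_words, mask_token="<mask>"):
    """Replace known figurative words with a mask token; a model is then
    trained to predict the original words from the masked context."""
    masked, targets = [], []
    for tok in tokens:
        if tok.lower() in figurative_words:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_figurative("he is drowning in paperwork".split(),
                                  {"drowning"})
# masked  -> ['he', 'is', '<mask>', 'in', 'paperwork']
# targets -> ['drowning']
```

In a realistic setup, the figurative words would be identified by a detector or an annotated lexicon rather than a hand-written set.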

RL for FLG Although some existing works have employed reinforcement learning algorithms for FLG and achieved promising results, they focused on training neural models from scratch [77, 83, 148]; such approaches have not been well studied in combination with PLMs. Therefore, one possible line of research is to augment PLMs with reward learning. As future work, we believe it is well worth exploring how to build reliable reward strategies that improve and balance multiple core aspects of FLG (e.g., form strength and context preservation), including metric-based reward methods [103], reinforcement learning from human feedback (RLHF) [90], or direct preference optimization (DPO) [100].
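As a concrete, purely hypothetical example of such a reward, one could combine a figurative classifier's probability with a context-similarity score via a harmonic mean, so that the reward is high only when both aspects are satisfied simultaneously; the choice of aggregation is a design decision, not from the cited works:

```python
def flg_reward(form_strength, context_sim):
    """Harmonic mean of form strength (e.g., a figurative classifier's
    probability) and context preservation (e.g., a similarity score),
    both assumed to lie in [0, 1]. The reward collapses to 0 if either
    aspect is 0, discouraging the policy from gaming a single aspect."""
    if form_strength + context_sim == 0:
        return 0.0
    return 2 * form_strength * context_sim / (form_strength + context_sim)
```

A weighted sum would be an alternative, but it allows a model to trade one aspect entirely for the other, which the harmonic mean prevents.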

LLMs in FLG The capabilities of LLMs have been continuously increasing by scaling model sizes, dataset sizes, and computation, with various preference optimization techniques [90, 100]. Some works have applied prompt learning with LLMs in FLG [18, 84, 106], where task instructions with few or no examples are designed to guide the model on the target task, not always with great success. However, as introduced in Section 4.3, most work has primarily focused on GPT-3, while more powerful models such as ChatGPT3 and GPT-4 [88] have not been used for FLG. Additionally, we believe there is a need to explore prompt learning methods based on LLMs specifically for FLG. For instance, what mostly influences LLMs' control over the figure of speech of the generated text? And how can prompts and instructions be designed to better control these variables? On the other hand, optimizing LLMs specifically for FLG has received less attention, although such optimization has recently been employed in the more general domain of creative writing [134].

6.3 Evaluation Methods

As discussed in Section 5, the NLP community has not yet reached fully shared standards and protocols for the evaluation of FLG. For instance, there are many automatic metrics, but the commonly used ones are mainly borrowed from other NLG tasks (e.g., BLEU) and are not necessarily appropriate for this task. Furthermore, to the best of our knowledge, there is no work that focuses on evaluation practices, including a standard framework for conducting human evaluation alongside automatic metrics. Therefore, research on new evaluation metrics that target FLG is an important future direction. A good evaluation metric should reflect the true performance of different models and can be used to guide researchers in developing and improving them. Most importantly, it should correlate highly with human judgements. Ideally, new automatic metrics should be able to evaluate the relation between source and output, between reference and output, or among all of them. Therefore, more work on evaluation practices, that is, employing human judgement to validate automatic metrics, either existing or new ones, is an interesting direction. In the context of LLM-based evaluation, recent work shows that ChatGPT achieves state-of-the-art or competitive correlation with gold human judgments on many NLG tasks by prompting the model with specific task instructions [62, 133]. This suggests that LLMs can also potentially be applied to the evaluation of FLG, which requires more exploration and validation.

As a final note, in the context of human evaluation of LLM output, recent work [20] suggests relying on the Torrance Tests of Creative Thinking (TTCT) for assessing LLMs' creativity. While there is not necessarily a correspondence between creativity and figurative language, this line of work could provide insights for extending and strengthening the human-based evaluation of FLG.

6.4 Applications

Based on the summary of current work, we can expect that the development of FLG not only affects other NLP tasks, but also can be applied to specific downstream applications or studies.

NLP Research FLG can advance downstream NLP tasks. For instance, the non-compositionality of idiomatic expressions exists to different degrees in different languages and is one of the open challenges in idiom translation [31]. Additional control over the figurative form of the translated text can be practically useful and also technically insightful. Computational approaches can also be employed to provide a better understanding of linguistic phenomena, and more specifically of different figures of speech [61].

Conversational Agents These are widely applied across different domains to serve various purposes, from providing automated assistance to companionship. Recent work shows that metaphors with various degrees of perceived warmth and competence can shape users' expectations of an agent, with different effects on aspects such as willingness to use and cooperation [52]. Overall, we expect that FLG can be a valuable complement to conversational agents, enhancing their ability to engage users in natural and expressive conversations in terms of naturalness, emotional expression, personalization, and cultural sensitivity.

Education Recent progress in generative AI has inspired substantial research on how humans could collaborate with LLMs for creative writing [20, 134, 142]. Similarly, FLG can be useful for creative writing assistance or even literary or poetic creation [21]. Educational applications can therefore greatly benefit from FLG, for example in the form of automated modules that support (second) language learning. In particular, since figurative language is an important part of K-12 education, LLM-powered FLG agents could be instrumental in creating diverse educational content, spanning introductions to literal and figurative language, advanced analysis, and writing exercises. Overall, we believe that figurative language generation can serve as a multifaceted tool for linguistic understanding, fostering comprehension, language proficiency, critical thinking, literary analysis, and communication skills.

6.5 Impact of FLG

In recent years, there has been increased attention to ethical issues related to Artificial Intelligence research. As with most language technologies, the development of FLG can benefit and improve human life with applications such as those discussed above. However, it could also produce substantial inaccuracies, stereotypes, or demeaning content, which could be propagated in further processing. For example, there is evidence of negative metaphors used in media discourse, a troubling potential harm [3]. With this in mind, a word of warning is necessary regarding the direct deployment of FLG models: we should be wary of how this technology might be misused, and of who might be harmed by it. Writing explicitly about risks in scientific papers advancing FLG research, and raising awareness of this possibility in the general public, are ways to contain potentially harmful consequences. As practitioners in the field, we should be open to discussion and suggestions to minimise such risks.


7 CONCLUSION

In this survey, we have comprehensively reviewed existing representative research work on figurative language generation, including common figures of speech, corresponding tasks, various approaches and evaluation strategies. Based on the critical analysis of the existing research trends, we have identified a series of key challenges and problems in this field and highlighted several directions for future work. We hope that this survey can provide researchers with a roadmap to easily track current research in FLG and grasp its core challenges, so as to make meaningful advances in this area.

Footnotes

  1. https://github.com/laihuiyuan/figurative-language-generation
  2. We follow previous work [21, 61, 123, 145] in using “context,” instead of “content” which is commonly used in style transfer, suggesting that it is more the general context/topic of the source sentence that has to be preserved rather than its very exact content.
  3. https://openai.com/blog/chatgpt

REFERENCES

  [1] Abe Keiga, Sakamoto Kayo, and Nakagawa Masanori. 2006. A computational model of the metaphor generation process. In Proceedings of the 28th Annual Meeting of the Cognitive Science Society. https://escholarship.org/uc/item/5d96219g
  [2] Aiden Erez Lieberman and Michel Jean-Baptiste. 2011. Quantitative analysis of culture using millions of digitized books. Science 331 (2011), 176–182.
  [3] Arcimaviciene Liudmila and Bağlama Sercan Hamza. 2018. Migration, metaphor and myth in media representations: The ideological dichotomy of “Them” and “Us”. SAGE Open 8 (2018). https://api.semanticscholar.org/CorpusID:149710932
  [4] Bahdanau Dzmitry, Cho Kyung Hyun, and Bengio Yoshua. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. https://arxiv.org/abs/1409.0473
  [5] Bai Shuang and An Shan. 2018. A survey on automatic image caption generation. Neurocomputing 311 (2018), 291–304.
  [6] Baker Collin F., Fillmore Charles J., and Lowe John B. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, Montreal, Quebec, Canada, 86–90.
  [7] Banerjee Satanjeev and Lavie Alon. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72. https://aclanthology.org/W05-0909
  [8] Barrault Loïc, Bojar Ondřej, Costa-jussà Marta R., Federmann Christian, Fishel Mark, Graham Yvette, Haddow Barry, Huck Matthias, Koehn Philipp, Malmasi Shervin, Monz Christof, Müller Mathias, Pal Santanu, Post Matt, and Zampieri Marcos. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the 4th Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Association for Computational Linguistics, Florence, Italy, 1–61.
  [9] Basile Valerio and Bos Johan. 2011. Towards generating text from discourse representation structures. In Proceedings of the 13th European Workshop on Natural Language Generation. Association for Computational Linguistics, Nancy, France, 145–150. https://aclanthology.org/W11-2819
  [10] Klebanov Beata Beigman, Leong Chee Wee, Gutierrez E. Dario, Shutova Ekaterina, and Flor Michael. 2016. Semantic classifications for detection of verb metaphors. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, 101–106.
  [11] Klebanov Beata Beigman, Shutova Ekaterina, Lichtenstein Patricia, Muresan Smaranda, and Wee Chee (Eds.). 2018. Proceedings of the Workshop on Figurative Language Processing. Association for Computational Linguistics, New Orleans, Louisiana.
  [12] Bhavya Bhavya, Xiong Jinjun, and Zhai ChengXiang. 2022. Analogy generation by prompting large language models: A case study of InstructGPT. In Proceedings of the 15th International Conference on Natural Language Generation. Association for Computational Linguistics, Waterville, Maine, USA and virtual meeting, 298–312.
  [13] Bosselut Antoine, Rashkin Hannah, Sap Maarten, Malaviya Chaitanya, Celikyilmaz Asli, and Choi Yejin. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4762–4779.
  [14] Bowes Andrea and Katz Albert. 2011. When sarcasm stings. Discourse Processes 48, 4 (2011), 215–236.
  [15] Briakou Eleftheria, Agrawal Sweta, Tetreault Joel, and Carpuat Marine. 2021. Evaluating the evaluation metrics for style transfer: A case study in multilingual formality transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 1321–1336.
  [16] Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D., Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, Agarwal Sandhini, Herbert-Voss Ariel, Krueger Gretchen, Henighan Tom, Child Rewon, Ramesh Aditya, Ziegler Daniel, Wu Jeffrey, Winter Clemens, Hesse Chris, Chen Mark, Sigler Eric, Litwin Mateusz, Gray Scott, Chess Benjamin, Clark Jack, Berner Christopher, McCandlish Sam, Radford Alec, Sutskever Ilya, and Amodei Dario. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  [17] Chaganty Arun, Mussmann Stephen, and Liang Percy. 2018. The price of debiasing automatic metrics in natural language evaluation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 643–653.
  [18] Chakrabarty Tuhin, Choi Yejin, and Shwartz Vered. 2022. It’s not rocket science: Interpreting figurative language in narratives. Transactions of the Association for Computational Linguistics 10 (2022), 589–606.
  [19] Chakrabarty Tuhin, Ghosh Debanjan, Muresan Smaranda, and Peng Nanyun. 2020. R^3: Reverse, retrieve, and rank for sarcasm generation with commonsense knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7976–7986.
  [20] Chakrabarty Tuhin, Laban Philippe, Agarwal Divyansh, Muresan Smaranda, and Wu Chien-Sheng. 2023. Art or artifice? Large language models and the false promise of creativity. arXiv:2309.14556. https://arxiv.org/abs/2309.14556
[17] Chaganty Arun, Mussmann Stephen, and Liang Percy. 2018. The price of debiasing automatic metrics in natural language evaluation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Gurevych Iryna and Miyao Yusuke (Eds.), Association for Computational Linguistics, Melbourne, Australia, 643–653.
[18] Chakrabarty Tuhin, Choi Yejin, and Shwartz Vered. 2022. It’s not rocket science: Interpreting figurative language in narratives. Transactions of the Association for Computational Linguistics 10 (2022), 589–606.
[19] Chakrabarty Tuhin, Ghosh Debanjan, Muresan Smaranda, and Peng Nanyun. 2020. R^3: Reverse, retrieve, and rank for sarcasm generation with commonsense knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Jurafsky Dan, Chai Joyce, Schluter Natalie, and Tetreault Joel (Eds.), Association for Computational Linguistics, Online, 7976–7986.
[20] Chakrabarty Tuhin, Laban Philippe, Agarwal Divyansh, Muresan Smaranda, and Wu Chien-Sheng. 2023. Art or artifice? Large language models and the false promise of creativity. arXiv:2309.14556. Retrieved from https://arxiv.org/abs/2309.14556
[21] Chakrabarty Tuhin, Muresan Smaranda, and Peng Nanyun. 2020. Generating similes effortlessly like a pro: A style transfer approach for simile generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Webber Bonnie, Cohn Trevor, He Yulan, and Liu Yang (Eds.), Association for Computational Linguistics, Online, 6455–6469.
[22] Chakrabarty Tuhin, Saakyan Arkadiy, Ghosh Debanjan, and Muresan Smaranda. 2022. FLUTE: Figurative language understanding through textual explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Goldberg Yoav, Kozareva Zornitsa, and Zhang Yue (Eds.), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 7139–7159.
[23] Chakrabarty Tuhin, Zhang Xurui, Muresan Smaranda, and Peng Nanyun. 2021. MERMAID: Metaphor generation with symbolism and discriminative decoding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Toutanova Kristina, Rumshisky Anna, Zettlemoyer Luke, Hakkani-Tur Dilek, Beltagy Iz, Bethard Steven, Cotterell Ryan, Chakraborty Tanmoy, and Zhou Yichao (Eds.), Association for Computational Linguistics, Online, 4250–4261.
[24] Chopra Sumit, Auli Michael, and Rush Alexander M.. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Knight Kevin, Nenkova Ani, and Rambow Owen (Eds.), Association for Computational Linguistics, San Diego, California, 93–98.
[25] Colin Emilie, Gardent Claire, M’rabet Yassine, Narayan Shashi, and Perez-Beltrachini Laura. 2016. The WebNLG challenge: Generating text from DBPedia data. In Proceedings of the 9th International Natural Language Generation Conference. Isard Amy, Rieser Verena, and Gkatzia Dimitra (Eds.), Association for Computational Linguistics, Edinburgh, UK, 163–167.
[26] Dankers Verna, Rei Marek, Lewis Martha, and Shutova Ekaterina. 2019. Modelling the interplay of metaphor and emotion through multitask learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2218–2229.
[27] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Burstein Jill, Doran Christy, and Solorio Thamar (Eds.), Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.
[28] Dodge Ellen, Hong Jisup, and Stickles Elise. 2015. MetaNet: Deep semantic automatic metaphor analysis. In Proceedings of the 3rd Workshop on Metaphor in NLP. Shutova Ekaterina, Klebanov Beata Beigman, and Lichtenstein Patricia (Eds.), Association for Computational Linguistics, Denver, Colorado, 40–49.
[29] Dubey Abhijeet, Kumar Lakshya, Somani Arpan, Joshi Aditya, and Bhattacharyya Pushpak. 2019. “When Numbers Matter!”: Detecting sarcasm in numerical portions of text. In Proceedings of the 10th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics, Minneapolis, USA, 72–80.
[30] Esuli Andrea and Sebastiani Fabrizio. 2006. SENTIWORDNET: A publicly available lexical resource for opinion mining. In Proceedings of the 5th International Conference on Language Resources and Evaluation. Calzolari Nicoletta, Choukri Khalid, Gangemi Aldo, Maegaard Bente, Mariani Joseph, Odijk Jan, and Tapias Daniel (Eds.), European Language Resources Association (ELRA), Genoa, Italy. Retrieved from http://www.lrec-conf.org/proceedings/lrec2006/pdf/384_pdf.pdf
[31] Fadaee Marzieh, Bisazza Arianna, and Monz Christof. 2018. Examining the tip of the iceberg: A data set for idiom translation. In Proceedings of the 11th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Miyazaki, Japan. Retrieved from https://aclanthology.org/L18-1148
[32] Fan Angela, Lewis Mike, and Dauphin Yann. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Gurevych Iryna and Miyao Yusuke (Eds.), Association for Computational Linguistics, Melbourne, Australia, 889–898.
[33] Felbo Bjarke, Mislove Alan, Søgaard Anders, Rahwan Iyad, and Lehmann Sune. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 1615–1625.
[34] Fomicheva Marina and Specia Lucia. 2016. Reference bias in monolingual machine translation evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, 77–82.
[35] Gabriel Saadia, Bhagavatula Chandra, Shwartz Vered, Bras Ronan Le, Forbes Maxwell, and Choi Yejin. 2021. Paragraph-level commonsense transformers with recurrent memory. In Proceedings of the AAAI Conference on Artificial Intelligence. 12857–12865. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17521/17328
[36] Garbacea Cristina and Mei Qiaozhu. 2020. Neural language generation: Formulation, methods, and evaluation. arXiv:2007.15780. Retrieved from https://arxiv.org/abs/2007.15780
[37] Gehring Jonas, Auli Michael, Grangier David, Yarats Denis, and Dauphin Yann N.. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70). Precup Doina and Teh Yee Whye (Eds.), PMLR, 1243–1252. Retrieved from https://proceedings.mlr.press/v70/gehring17a/gehring17a.pdf
[38] Gero Katy Ilonka and Chilton Lydia B.. 2019. Metaphoria: An algorithmic companion for metaphor creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK. Association for Computing Machinery, New York, NY, USA, 1–12.
[39] Ghosh Debanjan, Klebanov Beata Beigman, Muresan Smaranda, Feldman Anna, Poria Soujanya, and Chakrabarty Tuhin (Eds.). 2022. Proceedings of the 3rd Workshop on Figurative Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid). Retrieved from https://aclanthology.org/2022.flp-1.0
[40] Ghosh Debanjan, Musi Elena, and Muresan Smaranda. 2020. Interpreting verbal irony: Linguistic strategies and the connection to the type of semantic incongruity. In Proceedings of the Society for Computation in Linguistics 2020. Association for Computational Linguistics, New York, New York, 82–93. Retrieved from https://aclanthology.org/2020.scil-1.10.pdf
[41] Goodfellow Ian, Bengio Yoshua, and Courville Aaron. 2016. Deep Learning. MIT Press. Retrieved from http://www.deeplearningbook.org
[42] Haagsma Hessel, Bos Johan, and Nissim Malvina. 2020. MAGPIE: A large corpus of potentially idiomatic expressions. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 279–287. Retrieved from https://aclanthology.org/2020.lrec-1.35
[43] Harmon Sarah. 2015. FIGURE8: A novel system for generating and evaluating figurative language. In Proceedings of the 6th International Conference on Computational Creativity. 71–77. Retrieved from https://computationalcreativity.net/iccc2015/proceedings/4_1Harmon.pdf
[44] He He, Peng Nanyun, and Liang Percy. 2019. Pun generation with surprise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Burstein Jill, Doran Christy, and Solorio Thamar (Eds.), Association for Computational Linguistics, Minneapolis, Minnesota, 1734–1744.
[45] Hervás Raquel, Costa Rui P., Costa Hugo, Gervás Pablo, and Pereira Francisco C.. 2007. Enrichment of automatically generated texts using metaphor. In Proceedings of the Advances in Artificial Intelligence. Retrieved from https://link.springer.com/chapter/10.1007/978-3-540-76631-5_90
[46] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780. Retrieved from https://www.bioinf.jku.at/publications/older/2604.pdf
[47] Hong Bryan Anthony and Ong Ethel. 2009. Automatically extracting word relationships as templates for pun generation. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity. Feldman Anna and Loenneker-Rodman Birte (Eds.), Association for Computational Linguistics, Boulder, Colorado, 24–31. Retrieved from https://aclanthology.org/W09-2004
[48] Hwang Jena D., Bhagavatula Chandra, Bras Ronan Le, Da Jeff, Sakaguchi Keisuke, Bosselut Antoine, and Choi Yejin. 2021. (COMET-)ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence. 6384–6392. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16792/16599
[49] Jacobs Arthur M.. 2018. The Gutenberg English poetry corpus: Exemplary quantitative narrative analyses. Frontiers in Digital Humanities 5 (2018), 5. Retrieved from https://www.frontiersin.org/articles/10.3389/fdigh.2018.00005/full
[50] Jones Mark Alan. 1992. Generating a specific class of metaphors. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Newark, Delaware, USA, 321–323.
[51] Joshi Aditya, Kunchukuttan Anoop, Bhattacharyya Pushpak, and Carman Mark. 2015. SarcasmBot: An open-source sarcasm-generation module for chatbots. In Proceedings of the 4th International Workshop on Issues of Sentiment Discovery and Opinion Mining. Retrieved from https://www.cse.iitb.ac.in/adityaj/sarcasmbot-wisdom15-kdd.pdf
[52] Khadpe Pranav, Krishna Ranjay, Fei-Fei Li, Hancock Jeffrey T., and Bernstein Michael S.. 2020. Conceptual metaphors impact perceptions of human-AI collaboration. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (2020), 26 pages.
[53] Kim Yoon, Denton Carl, Hoang Luong, and Rush Alexander M.. 2017. Structured attention networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net. Retrieved from https://openreview.net/forum?id=HkE0Nvqlg
[54] Klebanov Beata Beigman, Shutova Ekaterina, Lichtenstein Patricia, Muresan Smaranda, Wee Chee, Feldman Anna, and Ghosh Debanjan (Eds.). 2020. Proceedings of the 2nd Workshop on Figurative Language Processing. Association for Computational Linguistics, Online. Retrieved from https://aclanthology.org/2020.figlang-1.0
[55] Koehn Philipp. 2010. Statistical Machine Translation (1st ed.). Cambridge University Press, USA.
[56] Kong Li, Li Chuanyi, Ge Jidong, Luo Bin, and Ng Vincent. 2020. Identifying exaggerated language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online, 7024–7034.
[57] Korkontzelos Ioannis, Zesch Torsten, Zanzotto Fabio Massimo, and Biemann Chris. 2013. SemEval-2013 task 5: Evaluating phrasal semantics. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, Volume 2: Proceedings of the 7th International Workshop on Semantic Evaluation. Manandhar Suresh and Yuret Deniz (Eds.), Association for Computational Linguistics, Atlanta, Georgia, USA, 39–47. Retrieved from https://aclanthology.org/S13-2007
[58] Kreuz Roger J. and Roberts Richard M.. 1993. The empirical study of figurative language in literature. Poetics 22, 1 (1993), 151–169. Retrieved from https://www.sciencedirect.com/science/article/abs/pii/0304422X9390026D
[59] Kumon-Nakamura Sachi, Glucksberg Sam, and Brown M.. 1995. How about another piece of pie: The allusional pretense theory of discourse irony. Journal of Experimental Psychology: General 124, 1 (1995), 3–21. Retrieved from https://psycnet.apa.org/record/1995-21193-001
[60] Lai Huiyuan, Mao Jiali, Toral Antonio, and Nissim Malvina. 2022. Human judgement as a compass to navigate automatic metrics for formality transfer. In Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems. Association for Computational Linguistics, Dublin, Ireland, 102–115.
[61] Lai Huiyuan and Nissim Malvina. 2022. Multi-figurative language generation. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 5939–5954. Retrieved from https://aclanthology.org/2022.coling-1.519
[62] Lai Huiyuan, Toral Antonio, and Nissim Malvina. 2023. Multidimensional evaluation for text style transfer using ChatGPT. arXiv:2304.13462. Retrieved from https://arxiv.org/abs/2304.13462
[63] Lai Huiyuan, Toral Antonio, and Nissim Malvina. 2023. Multilingual multi-figurative language detection. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 9254–9267.
[64] Leggitt John and Gibbs Raymond. 2000. Emotional reactions to verbal irony. Discourse Processes 29, 1 (2000), 1–24.
[65] Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Veselin, and Zettlemoyer Luke. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Jurafsky Dan, Chai Joyce, Schluter Natalie, and Tetreault Joel (Eds.), Association for Computational Linguistics, Online, 7871–7880.
[66] Li Juncen, Jia Robin, He He, and Liang Percy. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1865–1874.
[67] Li Junyi, Tang Tianyi, Zhao Wayne Xin, Nie Jian-Yun, and Wen Ji-Rong. 2022. A survey of pretrained language models based text generation. arXiv:2201.05273. Retrieved from https://arxiv.org/abs/2201.05273
[68] Li Linlin and Sporleder Caroline. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 315–323. Retrieved from https://aclanthology.org/D09-1033
[69] Li Yucheng, Lin Chenghua, and Guerin Frank. 2022. Nominal metaphor generation with multitask learning. In Proceedings of the 15th International Conference on Natural Language Generation. Shaikh Samira, Ferreira Thiago, and Stent Amanda (Eds.), Association for Computational Linguistics, Waterville, Maine, USA and virtual meeting, 225–235.
[70] Lin Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. Retrieved from https://aclanthology.org/W04-1013
[71] Lin Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. Retrieved from https://aclanthology.org/W04-1013
[72] Liu Changsheng and Hwa Rebecca. 2016. Phrasal substitution of idiomatic expressions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Knight Kevin, Nenkova Ani, and Rambow Owen (Eds.), Association for Computational Linguistics, San Diego, California, 363–373.
[73] Liu Chia-Wei, Lowe Ryan, Serban Iulian, Noseworthy Mike, Charlin Laurent, and Pineau Joelle. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Su Jian, Duh Kevin, and Carreras Xavier (Eds.), Association for Computational Linguistics, Austin, Texas, 2122–2132.
[74] Liu Hugo and Singh Push. 2004. ConceptNet - A practical commonsense reasoning tool-kit. BT Technology Journal 22 (2004).
[75] Liu Pengfei, Yuan Weizhe, Fu Jinlan, Jiang Zhengbao, Hayashi Hiroaki, and Neubig Graham. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9 (2023), 35 pages.
[76] Liu Zhiqiang, Fu Zuohui, Cao Jie, Melo Gerard de, Tam Yik-Cheung, Niu Cheng, and Zhou Jie. 2019. Rhetorically controlled encoder-decoder for modern Chinese poetry generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Korhonen Anna, Traum David, and Màrquez Lluís (Eds.), Association for Computational Linguistics, Florence, Italy, 1992–2001.
[77] Luo Fuli, Li Shunyao, Yang Pengcheng, Li Lei, Chang Baobao, Sui Zhifang, and Sun Xu. 2019. Pun-GAN: Generative adversarial network for pun generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Inui Kentaro, Jiang Jing, Ng Vincent, and Wan Xiaojun (Eds.), Association for Computational Linguistics, Hong Kong, China, 3388–3393.
[78] Mao Rui, Lin Chenghua, and Guerin Frank. 2018. Word embedding and WordNet based metaphor identification and interpretation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1222–1231.
[79] Mathews Alexander, Xie Lexing, and He Xuming. 2016. SentiCap: Generating image descriptions with sentiments. In Proceedings of the 13th AAAI Conference on Artificial Intelligence, Phoenix, Arizona. AAAI Press, 3574–3580. Retrieved from https://xmhe.bitbucket.io/papers/senti_desc_full.pdf
[80] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg S., and Dean Jeff. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems. Burges C.J., Bottou L., Welling M., Ghahramani Z., and Weinberger K.Q. (Eds.), Vol. 26, Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
[81] Miller George A.. 1995. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.
[82] Miller Tristan, Hempelmann Christian, and Gurevych Iryna. 2017. SemEval-2017 task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Bethard Steven, Carpuat Marine, Apidianaki Marianna, Mohammad Saif M., Cer Daniel, and Jurgens David (Eds.), Association for Computational Linguistics, Vancouver, Canada, 58–68.
[83] Mishra Abhijit, Tater Tarun, and Sankaranarayanan Karthik. 2019. A modular architecture for unsupervised sarcasm generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Inui Kentaro, Jiang Jing, Ng Vincent, and Wan Xiaojun (Eds.), Association for Computational Linguistics, Hong Kong, China, 6144–6154.
[84] Mittal Anirudh, Tian Yufei, and Peng Nanyun. 2022. AmbiPun: Generating humorous puns with ambiguous context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Carpuat Marine, Marneffe Marie-Catherine de, and Ruiz Ivan Vladimir Meza (Eds.), Association for Computational Linguistics, Seattle, United States, 1053–1062.
[85] Mohammad Saif, Shutova Ekaterina, and Turney Peter. 2016. Metaphor as a medium for emotion: An empirical study. In Proceedings of the 5th Joint Conference on Lexical and Computational Semantics. Gardent Claire, Bernardi Raffaella, and Titov Ivan (Eds.), Association for Computational Linguistics, Berlin, Germany, 23–33.
[86] Nissim Malvina and Markert Katja. 2003. Syntactic features and word similarity for supervised metonymy resolution. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Sapporo, Japan, 56–63.
[87] Nunberg Geoffrey, Sag Ivan A., and Wasow Thomas. 1994. Idioms. Language 70 (1994), 491–538. Retrieved from https://philpapers.org/rec/NUNI
[88] OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/abs/2303.08774
[89] Otter Dan, Medina Julian Richard, and Kalita Jugal Kumar. 2021. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32 (2021), 604–624. Retrieved from https://ieeexplore.ieee.org/document/9075398
[90] Ouyang Long, Wu Jeffrey, Jiang Xu, Almeida Diogo, Wainwright Carroll, Mishkin Pamela, Zhang Chong, Agarwal Sandhini, Slama Katarina, Ray Alex, Schulman John, Hilton Jacob, Kelton Fraser, Miller Luke, Simens Maddie, Askell Amanda, Welinder Peter, Christiano Paul F., Leike Jan, and Lowe Ryan. 2022. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems. Koyejo S., Mohamed S., Agarwal A., Belgrave D., Cho K., and Oh A. (Eds.), Vol. 35, Curran Associates, Inc., 27730–27744. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[91] Ovchinnikova Ekaterina, Zaytsev Vladimir, Wertheim Suzanne, and Israel Ross. 2014. Generating conceptual metaphors from proposition stores. arXiv:1409.7619. Retrieved from https://arxiv.org/abs/1409.7619
[92] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318.
[93] Partridge Eric. 1999. Usage & Abusage. Penguin Books.
[94] Paul Anthony M.. 1970. Figurative language. Philosophy and Rhetoric 3, 4 (1970), 225–248. Retrieved from https://www.jstor.org/stable/40237206
[95] Peled Lotem and Reichart Roi. 2017. Sarcasm SIGN: Interpreting sarcasm with sentiment based monolingual machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Barzilay Regina and Kan Min-Yen (Eds.), Association for Computational Linguistics, Vancouver, Canada, 1690–1700.
[96] Pereira Francisco, Hervás Raquel, Gervás Pablo, and Cardoso Amílcar. 2006. A multiagent text generator with simple rhetorical habilities. AAAI Workshop - Technical Report (2006). Retrieved from http://nil.fdi.ucm.es/sites/default/files/CABH06.pdf
[97] Petrović Saša and Matthews David. 2013. Unsupervised joke generation from big data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Schuetze Hinrich, Fung Pascale, and Poesio Massimo (Eds.), Association for Computational Linguistics, Sofia, Bulgaria, 228–232. Retrieved from https://aclanthology.org/P13-2041
[98] Radford Alec, Narasimhan Karthik, Salimans Tim, Sutskever Ilya, et al. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
[99] Radford Alec, Wu Jeff, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.
[100] Rafailov Rafael, Sharma Archit, Mitchell Eric, Ermon Stefano, Manning Christopher D., and Finn Chelsea. 2023. Direct preference optimization: Your language model is secretly a reward model. In Proceedings of the ICML 2023 Workshop The Many Facets of Preference-Based Learning. Retrieved from https://openreview.net/forum?id=53HUHMvQLQ
[101] Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J.. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. Retrieved from https://jmlr.org/papers/volume21/20-074/20-074.pdf
[102] Rangwani Harsh, Kulshreshtha Devang, and Singh Anil Kumar. 2018. NLPRL-IITBHU at SemEval-2018 task 3: Combining linguistic features and emoji pre-trained CNN for irony detection in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics, New Orleans, Louisiana, 638–642.
[103] Ranzato Marc’Aurelio, Chopra Sumit, Auli Michael, and Zaremba Wojciech. 2016. Sequence level training with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations. Retrieved from https://arxiv.org/abs/1511.06732
[104] Rao Sudha and Tetreault Joel. 2018. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 129–140.
[105] Regel Stefanie. 2009. The Comprehension of Figurative Language: Electrophysiological Evidence on the Processing of Irony. Ph.D. Dissertation. Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig. Retrieved from https://pure.mpg.de/rest/items/item_726953/component/file_726952/content
[106] Reif Emily, Ippolito Daphne, Yuan Ann, Coenen Andy, Callison-Burch Chris, and Wei Jason. 2022. A recipe for arbitrary text style transfer with large language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Muresan Smaranda, Nakov Preslav, and Villavicencio Aline (Eds.), Association for Computational Linguistics, Dublin, Ireland, 837–848.
[107] Roberts Richard M. and Kreuz Roger J.. 1994. Why do people use figurative language? Psychological Science 5, 3 (1994), 159–163. Retrieved from https://journals.sagepub.com/doi/10.1111/j.1467-9280.1994.tb00653.x
[108] Rohanian Omid, Taslimipoor Shiva, Evans Richard, and Mitkov Ruslan. 2018. WLV at SemEval-2018 task 3: Dissecting tweets in search of irony. In Proceedings of The 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics, New Orleans, Louisiana, 553–559.
[109] Ruan Jie, Wu Yue, Wan Xiaojun, and Zhu Yuesheng. 2024. Describe images in a boring way: Towards cross-modal sarcasm generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5701–5710. Retrieved from https://openaccess.thecvf.com/content/WACV2024/papers/Ruan_Describe_Images_in_a_Boring_Way_Towards_Cross-Modal_Sarcasm_Generation_WACV_2024_paper.pdf
  [110] Salton Giancarlo, Ross Robert, and Kelleher John. 2014. An empirical study of the impact of idioms on phrase based statistical machine translation of English to Brazilian-Portuguese. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation. Banchs Rafael E., Costa-jussà Marta R., Rapp Reinhard, Lambert Patrik, Eberle Kurt, and Babych Bogdan (Eds.), Association for Computational Linguistics, Gothenburg, Sweden, 36–41.
  [111] Schwoebel John, Dews Shelly, Winner Ellen, and Srinivas Kavitha. 2000. Obligatory processing of the literal meaning of ironic utterances: Further evidence. Metaphor and Symbol 15 (2000), 47–61. Retrieved from https://www.tandfonline.com/doi/abs/10.1080/10926488.2000.9678864
  [112] Sellam Thibault, Das Dipanjan, and Parikh Ankur. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7881–7892.
  [113] Shutova Ekaterina. 2010. Automatic metaphor interpretation as a paraphrasing task. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Kaplan Ron, Burstein Jill, Harper Mary, and Penn Gerald (Eds.), Association for Computational Linguistics, Los Angeles, California, 1029–1037. Retrieved from https://aclanthology.org/N10-1147
  [114] Shutova Ekaterina V. 2011. Computational approaches to figurative language. Technical Report (2011). Retrieved from https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-803.pdf
  [115] Stowe Kevin, Beck Nils, and Gurevych Iryna. 2021. Exploring metaphoric paraphrase generation. In Proceedings of the 25th Conference on Computational Natural Language Learning. Bisazza Arianna and Abend Omri (Eds.), Association for Computational Linguistics, Online, 323–336.
  [116] Stowe Kevin, Beck Nils, and Gurevych Iryna. 2021. Exploring metaphoric paraphrase generation. In Proceedings of the 25th Conference on Computational Natural Language Learning. Bisazza Arianna and Abend Omri (Eds.), Association for Computational Linguistics, Online, 323–336.
  [117] Stowe Kevin, Chakrabarty Tuhin, Peng Nanyun, Muresan Smaranda, and Gurevych Iryna. 2021. Metaphor generation with conceptual mappings. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Zong Chengqing, Xia Fei, Li Wenjie, and Navigli Roberto (Eds.), Association for Computational Linguistics, Online, 6724–6736.
  [118] Stowe Kevin, Utama Prasetya, and Gurevych Iryna. 2022. IMPLI: Investigating NLI models' performance on figurative language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Muresan Smaranda, Nakov Preslav, and Villavicencio Aline (Eds.), Association for Computational Linguistics, Dublin, Ireland, 5375–5388.
  [119] Sun Jiao, Narayan-Chen Anjali, Oraby Shereen, Gao Shuyang, Chung Tagyoung, Huang Jing, Liu Yang, and Peng Nanyun. 2022. Context-situated pun generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Goldberg Yoav, Kozareva Zornitsa, and Zhang Yue (Eds.), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 4635–4648.
  [120] Sutskever Ilya, Vinyals Oriol, and Le Quoc V. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (Montreal, Canada). MIT Press, Cambridge, MA, USA, 3104–3112. Retrieved from https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
  [121] Terai Asuka and Nakagawa Masanori. 2010. A computational system of metaphor generation with evaluation mechanism. In Proceedings of the 20th International Conference on Artificial Neural Networks. 142–147. Retrieved from https://link.springer.com/chapter/10.1007/978-3-642-15822-3_18
  [122] Tian Yufei, Sheth Divyanshu, and Peng Nanyun. 2022. A unified framework for pun generation with humor principles. In Findings of the Association for Computational Linguistics: EMNLP 2022. Goldberg Yoav, Kozareva Zornitsa, and Zhang Yue (Eds.), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 3253–3261.
  [123] Tian Yufei, Sridhar Arvind Krishna, and Peng Nanyun. 2021. HypoGen: Hyperbole generation with commonsense and counterfactual knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2021. Moens Marie-Francine, Huang Xuanjing, Specia Lucia, and Yih Scott Wen-tau (Eds.), Association for Computational Linguistics, Punta Cana, Dominican Republic, 1583–1593.
  [124] Troiano Enrica, Strapparava Carlo, Özbal Gözde, and Tekiroğlu Serra Sinem. 2018. A computational exploration of exaggeration. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 3296–3304.
  [125] Tsvetkov Yulia, Boytsov Leonid, Gershman Anatole, Nyberg Eric, and Dyer Chris. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, 248–258.
  [126] Valitutti Alessandro, Toivonen Hannu, Doucet Antoine, and Toivanen Jukka M. 2013. “Let everything turn well in your wife”: Generation of adult humor using lexical constraints. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Schuetze Hinrich, Fung Pascale, and Poesio Massimo (Eds.), Association for Computational Linguistics, Sofia, Bulgaria, 243–248. Retrieved from https://aclanthology.org/P13-2044
  [127] Van Hee Cynthia, Lefever Els, and Hoste Véronique. 2018. SemEval-2018 task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics, New Orleans, Louisiana, 39–50.
  [128] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. Guyon I., Von Luxburg U., Bengio S., Wallach H., Fergus R., Vishwanathan S., and Garnett R. (Eds.), Vol. 30, Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  [129] Veale Tony. 2016. Round up the usual suspects: Knowledge-based metaphor generation. In Proceedings of the 4th Workshop on Metaphor in NLP. Klebanov Beata Beigman, Shutova Ekaterina, and Lichtenstein Patricia (Eds.), Association for Computational Linguistics, San Diego, California, 34–41.
  [130] Veale Tony and Hao Yanfen. 2008. A fluid knowledge representation for understanding and generating creative metaphors. In Proceedings of the 22nd International Conference on Computational Linguistics. Scott Donia and Uszkoreit Hans (Eds.), Coling 2008 Organizing Committee, Manchester, UK, 945–952. Retrieved from https://aclanthology.org/C08-1119
  [131] Vinyals Oriol and Le Quoc. 2015. A neural conversational model. arXiv:1506.05869. Retrieved from https://arxiv.org/abs/1506.05869
  [132] Volk Martin and Weber Nico. 1998. The automatic translation of idioms. Machine translation vs. translation memory systems. Sprachwissenschaft, Computerlinguistik und neue Medien 1 (1998), 167–192. Retrieved from https://www.zora.uzh.ch/id/eprint/19070/
  [133] Wang Jiaan, Liang Yunlong, Meng Fandong, Sun Zengkui, Shi Haoxiang, Li Zhixu, Xu Jinan, Qu Jianfeng, and Zhou Jie. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop. Dong Yue, Xiao Wen, Wang Lu, Liu Fei, and Carenini Giuseppe (Eds.), Association for Computational Linguistics, Singapore, 1–11.
  [134] Wang Tiannan, Chen Jiamin, Jia Qingrui, Wang Shuai, Fang Ruoyu, Wang Huilin, Gao Zhaowei, Xie Chunzhao, Xu Chuou, Dai Jihong, Liu Yibin, Wu Jialong, Ding Shengwei, Li Long, Huang Zhiwei, Deng Xinle, Yu Teng, Ma Gangan, Xiao Han, Chen Zixin, Xiang Danjun, Wang Yunxia, Zhu Yuanyuan, Xiao Yi, Wang Jing, Wang Yiru, Ding Siran, Huang Jiayang, Xu Jiayi, Tayier Yilihamu, Hu Zhenyu, Gao Yuan, Zheng Chengfeng, Ye Yueshu, Li Yihang, Wan Lei, Jiang Xinyue, Wang Yujie, Cheng Siyu, Song Zhule, Tang Xiangru, Xu Xiaohua, Zhang Ningyu, Chen Huajun, Jiang Yuchen Eleanor, and Zhou Wangchunshu. 2024. Weaver: Foundation models for creative writing. arXiv:2401.17268. Retrieved from https://arxiv.org/abs/2401.17268
  [135] Wang Yaqing, Yao Quanming, Kwok James T., and Ni Lionel M. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys 53, 3 (2020), 34 pages.
  [136] Williams Lowri, Bannister Christian, Arribas-Ayllon Michael, Preece Alun, and Spasić Irena. 2015. The role of idioms in sentiment analysis. Expert Systems with Applications 42, 21 (2015), 7375–7385. Retrieved from https://www.sciencedirect.com/science/article/pii/S0957417415003759
  [137] Williams Ronald J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (1992), 229–256. Retrieved from https://link.springer.com/article/10.1007/BF00992696
  [138] Wu Chuhan, Wu Fangzhao, Wu Sixing, Liu Junxin, Yuan Zhigang, and Huang Yongfeng. 2018. THU_NGN at SemEval-2018 task 3: Tweet irony detection with densely connected LSTM and multi-task learning. In Proceedings of the 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics, New Orleans, Louisiana, 51–56.
  [139] Yu Zhiwei, Tan Jiwei, and Wan Xiaojun. 2018. A neural approach to pun generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Gurevych Iryna and Miyao Yusuke (Eds.), Association for Computational Linguistics, Melbourne, Australia, 1650–1660.
  [140] Yu Zhiwei and Wan Xiaojun. 2019. How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Burstein Jill, Doran Christy, and Solorio Thamar (Eds.), Association for Computational Linguistics, Minneapolis, Minnesota, 861–871.
  [141] Yu Zhiwei, Zang Hongyu, and Wan Xiaojun. 2020. Homophonic pun generation with lexically constrained rewriting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Webber Bonnie, Cohn Trevor, He Yulan, and Liu Yang (Eds.), Association for Computational Linguistics, Online, 2870–2876.
  [142] Yuan Ann, Coenen Andy, Reif Emily, and Ippolito Daphne. 2022. Wordcraft: Story writing with large language models. In Proceedings of the 27th International Conference on Intelligent User Interfaces (Helsinki, Finland). Association for Computing Machinery, New York, NY, USA, 841–852.
  [143] Zhang Jiayi, Cui Zhi, Xia Xiaoqiang, Guo Yalong, Li Yanran, Wei Chen, and Cui Jianwei. 2021. Writing polishment with simile: Task, dataset and a neural approach. In Proceedings of the AAAI Conference on Artificial Intelligence. 14383–14392. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17691/17498
  [144] Zhang Yunxiang and Wan Xiaojun. 2021. BiRdQA: A bilingual dataset for question answering on tricky riddles. In Proceedings of the 36th AAAI Conference on Artificial Intelligence. Retrieved from https://api.semanticscholar.org/CorpusID:237605111
  [145] Zhang Yunxiang and Wan Xiaojun. 2022. MOVER: Mask, over-generate and rank for hyperbole generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 6018–6030.
  [146] Zhou Jianing, Gong Hongyu, and Bhat Suma. 2021. From solving a problem boldly to cutting the gordian knot: Idiomatic text generation. arXiv:2104.06541. Retrieved from https://arxiv.org/abs/2104.06541
  [147] Zhou Jianing, Zeng Ziheng, Gong Hongyu, and Bhat Suma. 2022. Idiomatic expression paraphrasing without strong supervision. In Proceedings of the AAAI Conference on Artificial Intelligence. 11774–11782.
  [148] Zhu Mengdi, Yu Zhiwei, and Wan Xiaojun. 2019. A neural approach to irony generation. arXiv:1909.06200. Retrieved from https://arxiv.org/abs/1909.06200


Published in ACM Computing Surveys, Volume 56, Issue 10 (October 2024), 325 pages. ISSN: 0360-0300; EISSN: 1557-7341; Issue DOI: 10.1145/3613652.

Copyright © 2024 held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 14 April 2023; Revised 27 February 2024; Accepted 10 March 2024; Online AM 30 March 2024; Published 14 May 2024.
