“Had you told me when I came to grad school in computer science that I would be buying drugs, carrying burner phones and answering phone calls as names like “Sanjoy Sanchez,” I probably would not have believed you.”—Chris Kanich, spam researcher.
Humor identification is a hard natural language understanding problem. We identify a subproblem — the “that’s what she said” problem — with two distinguishing characteristics: (1) use of nouns that are euphemisms for sexually explicit nouns and (2) structure common in the erotic domain. We address this problem in a classification approach that includes features that model those two characteristics. Experiments on web data demonstrate that our approach improves precision by 12% over baseline techniques that use only word-based features.
[…] To our knowledge, related research has not studied the task of identifying double entendres in text or speech. The task is complex and would require both deep semantic and cultural understanding to recognize the vast array of double entendres. We focus on a subtask of double entendre identification: TWSS recognition. We say a sentence is a TWSS if it is funny to follow that sentence with “that’s what she said”. We frame the problem of TWSS recognition as a type of metaphor identification.
We define three functions to measure how closely related a noun, an adjective, and a verb phrase are to the erotica domain.
The noun sexiness function NS(n) is a real-valued measure of the maximum similarity a noun n ∉ SN has to each of the nouns ∈ SN⁻. For each noun, let the adjective count vector be the vector of the absolute frequencies of each adjective that modifies the noun in the union of the erotica and the Brown corpora. We define NS(n) to be the maximum cosine similarity, over each noun ∈ SN⁻, using term frequency-inverse document frequency (tf-idf) weights of the nouns’ adjective count vectors. […] Example nouns with high NS are “rod” and “meat”.
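To make the NS definition concrete, here is a minimal Python sketch of one way it might be computed. The excerpt does not spell out the tf-idf scheme, so this assumes each noun is treated as a "document" and each adjective as a "term", with raw counts as tf and log(N / df) as idf. The function names, the placeholder nouns `sn_a` and `sn_b` (standing in for the target set of sexually explicit nouns), and the toy counts are all hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(adj_counts):
    """Turn per-noun adjective counts into sparse tf-idf vectors.

    Assumption: tf is the raw count and idf is log(N / df), where N is the
    number of nouns and df is the number of nouns an adjective modifies.
    """
    n_nouns = len(adj_counts)
    df = Counter()
    for counts in adj_counts.values():
        df.update(counts.keys())
    return {
        noun: {adj: c * math.log(n_nouns / df[adj]) for adj, c in counts.items()}
        for noun, counts in adj_counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def noun_sexiness(noun, vectors, sn_minus):
    """NS(n): maximum cosine similarity between n and any noun in the target set."""
    return max(
        (cosine(vectors[noun], vectors[s]) for s in sn_minus if s in vectors),
        default=0.0,
    )


# Hypothetical toy counts; a real run would use counts from the two corpora.
adj_counts = {
    "sn_a": Counter({"hard": 4, "long": 2}),
    "sn_b": Counter({"wet": 3, "hot": 1}),
    "rod": Counter({"hard": 2, "long": 1, "steel": 1}),
    "table": Counter({"wooden": 3, "round": 2}),
}
sn_minus = {"sn_a", "sn_b"}

vectors = tfidf_vectors(adj_counts)
ns_rod = noun_sexiness("rod", vectors, sn_minus)
ns_table = noun_sexiness("table", vectors, sn_minus)
print(f"NS(rod)={ns_rod:.3f}  NS(table)={ns_table:.3f}")
```

In this toy data "rod" shares adjectives with one of the placeholder nouns and so scores well above "table", which shares none. Taking the maximum rather than the mean means a noun only has to resemble one explicit noun closely to score high.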
The adjective sexiness function AS(a) is a real-valued measure of how likely an adjective a is to modify a noun ∈ SN. We define AS(a) to be the relative frequency of a in sentences in the erotica corpus that contain at least one noun ∈ SN. Example adjectives with high AS are “hot” and “wet”.
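The AS definition can be sketched in a similar way. "Relative frequency" admits more than one reading; the sketch below assumes it means the share of adjective tokens, within erotica-corpus sentences containing at least one SN noun, that are the adjective in question. The input format (pre-tagged sentences with "ADJ" and "NOUN" tags), the function name, and the placeholder nouns are hypothetical.

```python
from collections import Counter


def adjective_sexiness(tagged_sentences, sn):
    """AS(a) for every adjective seen in sentences that contain an SN noun.

    Assumption: relative frequency = count of the adjective divided by the
    total number of adjective tokens in the qualifying sentences.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        has_sn_noun = any(tag == "NOUN" and tok in sn for tok, tag in sentence)
        if has_sn_noun:
            counts.update(tok for tok, tag in sentence if tag == "ADJ")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {adj: c / total for adj, c in counts.items()}


# Hypothetical toy corpus of (token, tag) pairs.
sentences = [
    [("the", "DET"), ("hot", "ADJ"), ("sn_a", "NOUN")],
    [("a", "DET"), ("hot", "ADJ"), ("wet", "ADJ"), ("sn_b", "NOUN")],
    [("the", "DET"), ("wooden", "ADJ"), ("table", "NOUN")],
]
sn = {"sn_a", "sn_b"}

as_scores = adjective_sexiness(sentences, sn)
print(as_scores)
```

Adjectives that never co-occur with an SN noun, such as "wooden" here, are simply absent from the result and can be treated as having a score of zero.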