Natural Language Processing

Natural Language Processing:

Morphemes: Minimal informational unit.

Affixes: Prefix and stems and suffixes.

Inflectional morphology: Tense, gender etc.

Derivational morphology: Eg; un- , re-, anti-

E-Insertion rule: Eg


This particular rule is read as saying that the empty string maps to ‘e’ in the context where it is preceded by an s,x, or z and an affix boundary and followed by an s. For instance, this maps boxˆs to boxes.

Finite state automata: Can be used for recognition.

Finite state transducers: To map words to morphemes eg;

Morphological processing: Eg; Porter Stemmer.

Top level morphology: Finite state transducers, to produce words.

Balanced Corpus: Different genres eg; newspapers, fiction etc.

Prediction: Probability of next word based on the last word.

POS tagging: Applies verb/adj etc. to words. Can use bigrams.

Bigrams: Make table with: Sequence| Count | Probability. Count is the number of sentences the sequence appears in. Probability is the number of sentences/count.

Evaluation of POS tagging: Training data 90% test data 10%. The probability of tags given a sequence P(T|W) =P(T)P(W|T).

Baselines: Very basic approach.

Ceiling: Human performance.

Error analysis: Certain problems are common.

Reproducable: The corpus should be available.

Generative grammar: A formal grammar that only produces acceptable sentences of a natural language.

((the (big dog)) slept): Weakly equivalent grammars would produce the same strings.Strongly equivalent grammars would produce the same strings and the same bracketing.

Context Free Grammars: CFG's consist of Non terminals eg VP. Terminals ie words.Rules/productions. Start symbol, S. Eg; S-> NP VP NP-> fish

Chart parsing: Need to remember applied rules to prevent redoing.

Bottom up passive chart parser:

Id

Left

Right

Mother

Daughter

1

0

1

NP

(they)

This algorithm can be packed: List numerous under daughters rather then new row each time.

Can't use FSA for natural language: Due to center embedding A-> aAb. Useful for partial grammars which dont require full recursion.

Problems with CFG's: No account of subject verb agreement. So it fish is allowed as well as they fish. Also no subcategorization, eg Kim adored Sam. Could put V-intrans or V-distrans instead to show subcat but is cumbersome. Also has problem of long distance dependencies.

FS Grammar Fragments: Encode agreement.

Verb-object rule:

Subject verb rule:

MOTHER

CAT

VP

MOTHER

CAT

S

 

AGR

1

 

AGR

1

DTR1

CAT

V

DTR1

CAT

NP

 

AGR

1

 

AGR

1

DTR2

CAT

NP

DTR2

CAT

VP

 

AGR

[]

 

AGR

1

Eg;


Parsing “they like it”

 

Templates: Store words with templates to save space.

Sematics: Can extend to described semantics, eg; they

HEAD

 

CAT

Noun

 

 

AGR

Pl

OBJ

Filled

 

 

SUBJ

Filled

 

 

SEM

Index

1

 

 

Pred

Pron

 

 

ARG1

1

 

Meaning postulates: Vx[bachelor(x) -> man(x) ^ unmarried(x)]

Taxonomies: Dog is the hyponym of anmial. Animal is the hypernym of dog.

Wordnet: Synonyms ordered by hyponomy. Eg; cat->leapoard->leapoardess.

Word sense disambiguation: Supervised learning, machine readable ditionaries.

Yarowsky's unsupervised learning approach to WSD: Identify all examples of the word to be disambiguated in the training corpus and store their contexts. Identify some seeds to disamgiguate a few uses. Train a decision list classifier which gives a list of criteria which are tried in order until an applicable test is found. Apply the decision list classifier to the training set and add all examples which are tagged with greater than a threshold reliability. Iterate until convergence.

Coherence: Discourses have to have connectivity to be coherent. Eg; Kim got into her car.Sandy likes apples.

Factors influencing discourse interpretation: Cue phrases, punctuation, real world content, tense and aspect.

Niall Ferguson is prolific, well-paid and a snappy dresser. Stephen Moss hated him — at least until he spent an hour being charmed in the historian’s Oxford study. (quote taken from the Guardian)

Referent: a real world entity that some piece of text (or speech) refers to. e.g., the two people who are mentioned in this quote.

Referring expressions: bits of language used to perform reference by a speaker. In, the paragraph above, Niall Fergusonhim and the historian are all being used to refer to the same person (they corefer).

Antecedent: the text evoking a referent. Niall Ferguson is the antecedent of him and the historian

Anaphora: the phenomenon of referring to an antecedent: him and the historian areanaphoric because they refer to a previously introduced entity.

Cataphora: When pronouns appear before their referents are introduced, eg the first she in "Although she couldn't see any dogs, Kim was sure she'd heard barking".

Lappin and Leass: Resolves anaphora.

1. Divide by two the global salience factors for each existing equivalence class.

2. Identify referring NPs (i.e., exclude pleonastic it etc)

3. Calculate global salience factors for each NP (see below)

4. Update the discourse model with the referents and their global salience scores.

5. For each pronoun:

(a) Collect potential referents (cut off is four sentences back).

(b) Filter referents according to binding theory and agreement constraints (e.g., remove referents that are plural if the pronoun is it). To do this step properly for gender information requires lexical resources.

(c) Calculate the per pronoun adjustments for each remaining referent (see below).

(d) Select the referent with the highest salience value for its equivalence class plus its per-pronoun adjustment. In case of a tie, prefer the closest referent in the string.

(e) Add the pronoun in to the equivalence class for that referent, and increment the salience factor by the non-duplicate salience factors pertaining to the pronoun.

Eg; On “Niall Ferguson is prolific, well-paid and a snappy dresser. Stephen Moss hated him —at least until he spent an hour being charmed in the historian’s Oxford study.” trying to resolve the he after the -.

N has score 155 + 280 ((subject + non-embedded + non-adverbial + recency)/2 + (direct object + non-embedded + non-adverbial + recency))

S has score 310 (subject + non-embedded + non-adverbial + recency) + same role per-pronoun 35 = 345

H has score 100 (recency) - 175 (cataphora) = -75

O has score 100 (recency) - 175 (cataphora) = -75