Information Retrieval

Information Retrieval

Types of search: Precise information seeking search (eg when was boole born), Known item search (eg find booles book), open ended search (many potential answers exist).

Information Scarcity problem: Needle in haystack problem. As opposed to information abundance problem.

Template recognition: Find predefined relations in unrestricted text. Name entity recognition: Find entities of a certain semantic type, eg names, dates. Co reference resolution: Decide which strings refer to the same entities.

Summarisation by sentence extraction: Use importance indicators. N highest ranking sentences are extracted.

Retrieval Models:

Free indexing language: Use natural language as indexing language. Word meanings isn't 1:1 due to synonmy (many words to one meaning, sofa-couch) and polysemy (1 word to many meanings, bank-bank. Terms may need to be manipulated (decapitalised, stemmed, pos tagged).

Inverted Files: Term | Doc No | Frequency | Offset

Boolean search: Binary decision, document is relevant or not. Presence of term is sufficient for match. Boolean set operations (AND,OR,NOT,BUT). Unranked results, requires expert to form search terms.

Ranked algorithms: Takes frequency of terms into accounts. Eg vector space model (SMART) and probabilistic model (OKAPI).

Vector space model: 3 Dimensions: information, retrieval, system. Select documents with highest document-query similarity. Need to choose weights and proximity measure.

A proximity measure: can be defined either by similarity of dissimilarity. It should be symmetric ( d(j,i)=d(i,j) ), maximal for similarity (or minimal if measuring dissimilarity). A distance metric is a dissimilarity metric that satisfies the triangle inequality d(i,j) + d(i,k) > d(j,k) and is always non negative.

Binary similarity measures: X is the set of all terms occurring in document x, Y is the set of all terms in y.

Raw overlap: raw_overlap(X,Y)=||.

Dice's coefficient: Normalisation by average size of the two original vectors


Jaccard's coefficient: Normalisation by size of combined vector, penalised small number of shared feature values.


Overlap coefficient:


Cosine: Normalisation by vector lengths.


Cosine (or normalised inner product) is the measure of choice for IR.

Distance measures: Euclidean distance: How far apart in vector space.



Manhattan distance: How far apart, measured in city blocks.

Zipf's law: The frequency rank of a word is reciprocally proportional to its frequency.

Eg; "the" is ranked low.


Common words such as "is" can be ignored by IR. Overly specific words and typos can be ignored.

Term Widthing TF*IDF: Words that are frequent in a document but not in the corpus are most useful.


Eg;

Term manipulation: Because of semantic similarities. Same string between punctuation, same prefix, same stem, same linguistic lemma.

Porter Stemmer: Words can be represented by whether each letter is a consonant or vowel. Eg; CVCV. Slightly more complex, [C](VC)^m[V] - square brackets denote arbitary presence of their contents, (VC)^m denote VC repeated m times. m is called the measure of a word, that is how many times the pattern VC occurs.

Rules for removing a suffix are given as: (condition) S1 -> S2. If a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is normally given in terms of m eg (m>1) EMENT -> null - Replacement-> Replac.

The ‘condition’ part may also contain the following:

*S - the stem ends with S (and similarly for the other letters).

*v* - the stem contains a vowel.

*d - the stem ends with a double consonant (e.g. -TT, -SS).

*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)

May also contain and, or, not.

 

Step 1 deals with plurals and past participles, then four steps of derivational morphology then cleaning up.

Step 1a eg; SSES -> SS | caresses -> caress

Step 1b eg; (*v*) ED -> null | plastered -> plaster

if succesfull AT -> ATE | conflat(ed) -> conflate

Step 1c (*v*) Y -> I happy->happi

Step 2 eg; (m>0) ATIONAL -> ATE | relational -> relate

Step 3 eg; (m>0) ICATE -> IC | triplicate -> triplic

Step 4 eg; (m>1) AL -> | revival -> reviv

Suffixes are now removed, so now clean up

Step 5a eg; (m>1) E -> | probate -> probat

Step 5b eg; (m > 1 and *d and *L) -> single letter | controll -> control

IR Evaluation: Predict future from past experience. Hard due to large variation in queries.Also hard to see contribution of collection coverage, document indexing, query formulation, matching algorithm.

Trec conferences queries: Consist of <num>,<title>,<desc>,<narr>.

 

Relevant

Irrelevant

Total

Retrieved

A

B

A+B

Not retrieved

C

D

C+D

Total

A+C

B+D

A+B+C+D

 

Recall: Proportion of retrieved items amongst the relevant items. 

 

Precision: Proportion of relevant items amongst retrieved items. 

 

Accuracy: Proportion of correctly classified items as relevant/irrelevant. Not a good measurement as it likes performance on irrelevant items. 

Precision against recall: Plotting gives recall precision curve which can be used as an evaluation measure. Can also plot precision/recall against no. of documents retrieved.Precision critical task: Web search. Recall critical task: Patent search.

Problems with judgements: Subjective, can be relative in different ways.Countermeasures are to use guidelines and a large sample of queries.


F-Measure: Weighted harmonic mean of P and R. High a means precision is more important, low a means recall is more important. a is normally set to 0.5. An approximation of cross over point of precision and recall.

 

Pooling: Would take a long time to judge with humans all results from a query. The pool is constructed by putting together top N retrieval results from a set of n systems. Humans judge every document in this pool, anything outside is considered to be irrelevant. May wish to measure precision at a certain rank, recall value, at last relevant document.

Mean average Precision: Determines the precision at each point when a new relevant document is retrieved. Average for each query, then over queries.


Where P(doci)=Precision at ith relevant document, N=number of queries, Qj=number of relevant documents for each query .

11 point average precision: Define 11 standard recall points 0,0.1,...,1. Can calculate the precision at each recall point, but need to interpolate between them.


Example:

The internet: Large, redundant, unstructured, herterogenous, dynamic, hyperlinked.

Page rank: Every link is a vote. Simulates a random surfer: Given a random page, follows links for a while with probability q. Gets bored after a while and jumps to the next random page, with probability 1-q. Page rank is the number of visits to each page.


Also have e vector to counteract sinks from pages without outbound links, probability of random surfer jumping to random page.


.

Hypertext Induced Topic Search: HITS aims to find authorities on a certain topic. Hums are recommendation pages with links to high quality pages. Authorities are pages that are recognized by hubs as experts. If a page is linked to by many hubs, probably a good authority. If a page links to many authorities, probably a good hub. Normalize weights after each iteration.

Start with a root set (pages containing query terms), create the base set (root set plus all pages pointing to the root set).


.

Message Understanding Conference: Participants given scenario and a training corpus to adapt to. Analysts manually fill in answer key. Compare against test corpus.

NYU System: Matches to templates, eg X of Company, X is position. X retires from Y as Z- X is person in leave job template, Y is company, Z is position.

Named Entity recognition: ENAMEX (type=person, organization, location) - hard for example is Granada a place or a company? TIMEX (type=time,date). BUMEX (type=money,percent). Can use gazetteers, list of countries, names etc.

Mikheev Cascading Name Entity Recognition:

Uses external (position in sequence etc.) and internal (suffix etc.) information.

1. Apply Grammar Rule Set 1 (Sure fire rules eg; X is a RELATIVE means X is a person)

-> tag as definite NEs of given type

2. Use ML for variants (probabilistic partial match)

  • Generate all possible substrings of sure-fire tagged NEs:

- Adam Kluver Ltd -> Adam Kluver, Adam Ltd, Kluver Ltd

  • ME model gives probability for possible string and NE type

  • Tag all occurrences of NE in text (over prob. threshold) with type

3. Apply Grammar Rule Set 2 (Relaxed rules)

  • Mark anything that looks like a PERSON (using name grammar)

  • Resolve coordination, genitives, sentence initial capitalized modifiers

-Coordinated or possessive name parts, or rest of sentence initial coordinated name seen on their own? If not, assume one name (Murdoch's News Corp,

Daily, Bridge and Mason)

4. Apply ML again (for new variants)

  • X and Y are of same type->resolved typo ``Un7ited States and Russia''

5. Apply specialised ME model to title (capitalisation, different syntax).

Lexico semantic patterns: Flattened out semantic representations with lexemes directly hard wired into them. Eg; <Perpetrator> (APPOSITION) blows/blew/has blowing himself/herself up. Can use patterns (templates with syntactic conditions, eg passive means victim, active means perpetrator).

Snowball: Find locations of companies and associated names.

1 Start from table containing some <o, l> tuples (handful of company,location pairs)

2 Perform NE

3 System searches for occurrences of the example < o, l > tuples in documents

4 System learns extraction patterns from these example contexts, e.g.:

<ORGANIZATION> 's headquarters in <LOCATION>

<LOCATION>-based <ORGANIZATION>

5 Evaluate patterns; use best ones to find new < o; l > tuples

6 Evaluate new tuples, choose most reliable ones as new seed tuples

7 Iteratively repeat the process

A snowball pattern is a 5 part tupl<left,tag1,middle,tag2,right>Want productive and reliable patterns.

Pattern reliability:

Productivity: Conf(P)log2(P.positive)

Questions in Trec: Type of question: Reason, definition, list of instances, sensitive to previous questions. Source of question: Designed for evaluation or real? Type of answer string: Length. Guaranteed existence of answer?

Labels: Ambiguous and unsupported answers are incorrect.

Mean reciprocal rank: 1 over the where the first correct answer in top 5 returns were. Can be 0,0.2,0.23,0.33,0.5,1. The score of a run is the mean over n questions.

Average accuracy: Only one answer per question allowed, accuracy is 

Matching words to question words: Eg; who->PERSON, when->TIME, how often->FREQUENCY.

Trec winner: Variants/feedback loops: morphological, lexical, syntactic, by reasoning.Comparison between answer candidate and question on basis of logical form. Uses morphology (main verb is invent, who is subject),lexical (far is an attribute of distance), semantic alterations (volcano is a mountain).

The Microsoft system: Reorders words, removes question word, morpholohical variations.Matching done by web query. Answers generated with weighted answer strings. Expected answer type improvement.

Rewrite module of microsoft system: outputs search string, position in text where answer is expected, confidence score. Scores each n gram according to the weight of the query that retrieved it. Sums across all summaries containing the ngram.


Merge similar answers. Time sensitive questions.

Text summarisation: Useful, also shows if IR systems understand documents. Informative summaries: Replaces full document. Indictive: Should i read this document? Abstract: Generated text. Extract: Text of snippets.

Full text->(Text analysis)->Semantic representation of full text->(Compression)->Semantic representation of summary->(Generation)->Summary.

Good extracts will have coherence and cohesion.

Summarisation by fact extraction (Radev): Compress several descriptions about the same event. New templates generated by mixing templates. Rules: Change of perspective - if the same source reports conflicting information over time, report both pieces of information. Contradiction- If two or more sources report con#icting information, choose the one that is reported by independent sources. Addition- If additional information is reported in a subsequent article, include the additional information.

Refinement- Prefer more specific information over more general one (name of a terrorist group rather than the fact that it is Palestinian). Agreement- Agreement between two sources is reported as it will heighten the reader's confidence in the reported fact.Superset/Generalization- If the same event is reported from different sources and all of them have incomplete information, report the combination of these pieces of information.Trend- If two or more messages reflect similar patterns over time, these can be reported in one statement (e.g. three consecutive bombings at the same location). No Information- Report the lack of information from a certain source when this would be expected.

Domain specificty is built into the templates: To solve this problem-

  • Split text in units (paragraphs or sentences or text tiles)

  • Assign each unit a score of importance/extract worthiness, using sentential and/or relational features

- Sentential features of a unit can be calculated in isolation, e.g. number of TF/IDF words or location

-Relational features of a unit are calculated in context of other units, e.g. unit with highest amount of shared terms

  • Extract sentences with highest score verbatim as extract

External marking of more important material: Location feature. Paragraph and headline structures. Important terms (tf/idf) mark more important prepositions. Sentence length.

Sentential features: Concept feature: Find concepts using tf*idf, sentence score=no of frequency concepts in sentence. Header feature: Find concepts in title, sentence score=no of title concepts in sentence. Location feature: Divide text into n equal sections.Sentences in section 1 < i < n get sentence score = 1/i. Always used in combination.Paragraph feature: First sentence in paragraph gets a higher score. Cue phrases. First sentence in section feature. Sentence length. Occurrence of bonus or malus word.Occurrence of a named entity. Combination of sentential features: Manual feature combination: Score(S) = aA +bB + wO. A,B,O: Feature scores. a,b,w: Manual weights.

Create examples of sentences that are abstract worthy, calculate their features, using 5 well known features.


Kupiec: Find best match for abstract sentence by automatic similarity measure. One example for a similarity measure is based on the longest common substring:


(where editi,d is the minimum number of deletions and insertions needed to transform X into Y).

Strategies for summary evaluation: Subjective judgements- how much do subjects like this summary? Comparison to gold standard- predefined right answer. Task based evaluation- how well can humans perform the task. Does the recipient of the summary have to change it?