Computer Vision
Bayesian inference
used to integrate prior knowledge about the world with new incoming data
interprets probability as “degree of belief” rather than “frequency of occurrence”
informally
if
H represents a hypothesis about the state of the world (e.g. an object in the image) and
D represents available image data
then
the explanatory conditional probabilities p(H|D) and p(D|H) are related by
p(H|D) = ( p(D|H) . p(H)) / p(D)
this rule can be applied iteratively/recursively for repeatedly updating assessment of a visual hypothesis as more data arrives
use the latest posterior as the new prior
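The recursive update can be sketched as follows. This is a minimal illustration for a binary hypothesis, assuming the observations are conditionally independent given H; the function name `bayes_update` and the likelihood values are hypothetical and chosen only for the example.

```python
def bayes_update(prior, likelihood_h, likelihood_not_h):
    """Return p(H|D) given p(H), p(D|H) and p(D|not H) for a binary hypothesis."""
    # p(D) expanded over the two cases: H true and H false
    evidence = likelihood_h * prior + likelihood_not_h * (1.0 - prior)
    return likelihood_h * prior / evidence


# Hypothetical numbers: each new observation is 3x more likely if H is true.
belief = 0.5  # initial prior p(H)
history = [belief]
for _ in range(3):
    # the latest posterior is used as the prior for the next observation
    belief = bayes_update(belief, likelihood_h=0.6, likelihood_not_h=0.2)
    history.append(belief)
```

With these numbers the belief in H rises from 0.5 to 0.75 and then to 0.9 as supporting data accumulates.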
example priors include
matter cannot just disappear but regularly becomes occluded
a uniform texture on a complex shape is more likely than a complex texture on a simple shape
rotation in 3D is a better explanation for deforming boundaries than boundary changes
could have no useful priors
may need to solve a pattern recognition problem: given some vector of acquired features from an object, decide whether the vector is consistent with a particular class
have to make a decision
same, could be a
correct accept (hit)
false accept (false alarm)
different, could be a
false reject (miss)
correct reject
there will be a cross over in the probability distribution graphs
receiver operating characteristic curves show the trade-offs between false rejects and false accepts as the confidence level is modified
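detectability of a signal is measured as
d' = |μ2 − μ1| / sqrt( (σ1² + σ2²) / 2 ), where μ1, μ2 and σ1, σ2 are the means and standard deviations of the two class distributions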
a larger d' indicates higher decidability of the problem
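As a small worked example, d' can be computed from the means and standard deviations of the two distributions. The sketch assumes the standard definition of d' (difference of means divided by the root-mean-square of the two standard deviations); the function name and the numbers are hypothetical.

```python
import math


def d_prime(mu1, sigma1, mu2, sigma2):
    """Decidability index: separation of the means in units of pooled spread."""
    return abs(mu2 - mu1) / math.sqrt(0.5 * (sigma1 ** 2 + sigma2 ** 2))


# Two hypothetical problems with unit-variance distributions
well_separated = d_prime(0.0, 1.0, 4.0, 1.0)
overlapping = d_prime(0.0, 1.0, 1.0, 1.0)
```

The well separated pair gives the larger d', which corresponds to less overlap between the two distributions and therefore fewer false accepts and false rejects at any threshold.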
what if C2 were much more likely than C1? we wouldn't want to assume something is C1 given a smallish x: need Bayesian probability:
prior probabilities: define P(C1) and P(C2) as their relative proportions
e.g. a versus b – a is 4 times more likely in English than b
can do
p( Ck | x ) = ( p( x | Ck ) . p(Ck)) / p(x)
where
p( Ck | x ) is the posterior probability
classification
if you want to minimise the overall error you assign x to class Ck if
P(Ck | x) > P(Cj | x) for all j ≠ k
could choose another classification criterion in order to trade off between false positives and false negatives
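The minimum-error rule can be sketched for two classes as below. The Gaussian class-conditional densities, the class means, and the priors are assumptions made for illustration; the notes do not specify the form of p(x | Ck).

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Class-conditional density p(x | Ck), assumed Gaussian for this sketch."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def posteriors(x, means, sigmas, priors):
    """Return P(Ck | x) for every class k using Bayes' rule."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, sigmas)])
    joint = likelihoods * np.asarray(priors)  # p(x | Ck) . P(Ck)
    return joint / joint.sum()                # divide by p(x)


def classify(x, means, sigmas, priors):
    """Assign x to the class with the largest posterior (minimum overall error)."""
    return int(np.argmax(posteriors(x, means, sigmas, priors)))


means, sigmas = [0.0, 2.0], [1.0, 1.0]
equal_prior_choice = classify(0.5, means, sigmas, [0.5, 0.5])   # index 0 is C1
skewed_prior_choice = classify(0.5, means, sigmas, [0.2, 0.8])  # index 1 is C2
```

With equal priors a smallish x = 0.5 is assigned to C1, but when C2 is four times more likely a priori the same x is assigned to C2, which is the situation described above.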
3D
in 3D situations our goal is to maximise p(H | D), i.e. to find the most likely hypothesis (surface reconstructions) that explains the image data (reflectance)
Scale space
this is used for detecting edges at multiple scales
it is a plot showing the zero crossings of an image after being convolved with a linear operator
one of the axes (dimensions) is the width of the Gaussian linear operator
image pyramids
similar to a scale space, but
does sub-sampling as well as smoothing
only has discrete scale levels
these are useful for efficiently detecting edges: edges can be located at a coarser level first, which gives a rough idea of where to look for them at the next, more detailed level
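A pyramid of this kind can be sketched as repeated smoothing followed by sub-sampling by two. The 5-tap binomial kernel (a common Gaussian approximation) and the function names are assumptions for illustration, not something specified in the notes.

```python
import numpy as np


def smooth(image):
    """Blur with a separable 5-tap binomial kernel, replicating border pixels."""
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    height, width = image.shape
    padded = np.pad(image, 2, mode="edge")
    # horizontal pass, then vertical pass (the kernel is separable)
    rows = sum(k * padded[:, i:i + width] for i, k in enumerate(kernel))
    return sum(k * rows[i:i + height, :] for i, k in enumerate(kernel))


def pyramid(image, levels):
    """Return a list of images, each smoothed and sub-sampled from the previous."""
    out = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        out.append(smooth(out[-1])[::2, ::2])  # keep every second row and column
    return out


levels = pyramid(np.ones((64, 64)), 4)
shapes = [level.shape for level in levels]
```

Each level has a quarter of the pixels of the one before, so a search at the coarsest level is cheap and can be used to restrict the search at finer levels.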
if edge detection is performed on an image that has been smoothed, it will detect only the edges that are visible at that level of detail
given that we can concatenate filtering (smoothing) and filtering (edge detection) functions, we can create functions that detect edges only at a certain scale of analysis
zero crossings (edges) can only ever merge and hence become fewer when becoming coarser
no new zero crossings can be introduced by using a coarser Gaussian
this property is called “image causality” or the “fingerprint theorem”
the wavelet transform can be regarded as a kind of scale transform
wavelets take projections of image structure onto zero-mean basis functions all of which are dilates, translates, and rotates of each other
the wavelet approximation to an image at a certain scale of analysis uses only wavelets up to a certain size, or dilation
the error signal which remains (the difference between the original and this approximation to it) is the input signal for the next wavelet level
because of orthogonality, this projection is the same as projecting the original signal onto the new level
since wavelets of all different sizes are self-similar, such a transform is a way of analysing the signal at a variety of scales
since the wavelets are dilates, translates, and rotates of each other, such a transform seeks to extract image structure in a way that may be invariant to dilation, translation, and rotation of the original image or pattern

Laplacian operator
e.g. write the Laplacian operator as a 3×3 array
detects edges in all orientations
is a filter
is insensitive to the brightness of the scene due to zero-sum property
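One common way to write the Laplacian as a 3×3 array is the 4-neighbour form used below (an 8-neighbour variant with −8 at the centre also exists). The sketch assumes this 4-neighbour form and uses a hypothetical helper `filter3x3` to show the zero-sum property.

```python
import numpy as np

# 4-neighbour discrete Laplacian; the entries sum to zero
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])


def filter3x3(image, kernel):
    """Filter the valid region with a 3x3 kernel (symmetric, so no flip needed)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * image[dy:dy + h - 2, dx:dx + w - 2]
    return out


flat = np.full((8, 8), 100.0)   # uniformly bright image, no edges
step = np.zeros((8, 8))
step[:, 4:] = 1.0               # vertical step edge between columns 3 and 4

flat_response = filter3x3(flat, LAPLACIAN)
step_response = filter3x3(step, LAPLACIAN)
brighter_response = filter3x3(step + 100.0, LAPLACIAN)
```

The flat image gives zero output everywhere, the step gives a positive then negative response on either side of the edge, and adding a constant brightness to the image leaves the response unchanged.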
Non linear operators
can have advantageous properties such as
reduced noise sensitivity
greater applicability for extracting features that are more complicated than edges
these may be built up from linear operations such as filtering to extract a particular scale of analysis or an orientation of image structure
the actual responsiveness to a feature such as colour or motion or texture requires the artful use of non-linearity
disadvantages
don't have translation invariance
output is phase modulated
motion information:
extracted by “energy detectors”
built by taking the squared-modulus of linear spatio-temporal filters forming a quadrature pair
Colour information
can only be extracted by “discounting the illuminant”
this requires non-linear operations such as taking ratios over neighbouring regions of differently coloured surfaces
permits the spectral reflectance properties of surfaces themselves (e.g. their pigments) to be inferred, independently of the wavelengths of light illuminating them
Stereoscopic information
this requires dichoptic integration by solving the Correspondence Problem (matching up the corresponding points of two images acquired with some displacement disparity)
this requires Cooperativity Processes, which are profoundly non-linear kinds of computation
Quadrature pair: a pair of filters with a 90° phase difference, which can be treated as the real and imaginary parts of a single complex-valued filter
Hilbert Pair
a pair of functions
the Fourier Transform of one is equivalent to the Fourier Transform of the other except that
the phase of all positive frequency components has been increased by π/2
the phase of all negative-frequency components has been decreased by π/2
Hilbert Transform
this can be computed either by
the phase-shifting operation in the Fourier domain described above or by
convolving the original image with the hyperbola 1/x (scaled by 1/π)
modulus of the Hilbert Transform
useful approach for detecting facial features, extracting them, or detecting motion
this amounts to taking the square root of the sum of the squares of its resulting real and imaginary parts (omitting the square root gives the squared modulus, or energy)
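The Fourier-domain route can be sketched in 1D as below. The sign convention is an assumption: the code uses the usual signal-processing convention (multiply positive frequencies by −i and negative frequencies by +i), and the opposite choice simply negates the result. The function name is hypothetical.

```python
import numpy as np


def hilbert_transform(signal):
    """Quadrature partner of a real signal via phase shifts in the Fourier domain."""
    spectrum = np.fft.fft(signal)
    freqs = np.fft.fftfreq(len(signal))
    # shift positive and negative frequencies by a quarter cycle in opposite senses
    shifted = spectrum * (-1j * np.sign(freqs))
    return np.real(np.fft.ifft(shifted))


n = 256
t = np.arange(n)
cosine = np.cos(2.0 * np.pi * 5.0 * t / n)
partner = hilbert_transform(cosine)            # a sine under this convention
modulus = np.sqrt(cosine ** 2 + partner ** 2)  # constant envelope
```

The modulus of the pair is constant even though each member oscillates, which is why it is useful as a phase-independent measure of local signal energy.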
Gabor Logons
these are quadrature pair wavelets
the family of filters which achieve the lowest possible conjoint uncertainty/minimal dispersion/variance in both space and Fourier domain
complex exponential multiplied by the Gaussian
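f(x) = exp( −(x − x0)² / α² ) . exp( −i μ0 (x − x0) ), where x0 is the position, μ0 the modulation frequency, and α the width of the Gaussian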
NB that these wavelets have Fourier transforms with the same functional form but with parameters interchanged
these functions are mutually non-orthogonal
wavelet basis
a family of Gabor functions
parametrised so that they are self similar (are dilates and translates of each other)
these classes of wavelets can be used as the expansion basis for signals
because of self similarity it amounts to analysing a signal at different scales (multi-resolution analysis)
optimal in extracting from an image
what (orientation and modulation of image structure)
where (2D position)
bandpass filter
when x0 = 0 it passes only frequencies within a band around μ0 whose width is inversely proportional to α (on the order of 1/α)
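The bandpass behaviour can be checked numerically. The sketch below assumes a Gabor function of the form Gaussian envelope times complex exponential, with parameters named x0, μ0 and α as in the notes; the sampling grid and the helper names are hypothetical. Because of the sign of the exponent and NumPy's FFT convention, the peak appears at −μ0, so the code compares absolute frequencies.

```python
import numpy as np


def gabor(x, x0, mu0, alpha):
    """Complex exponential multiplied by a Gaussian envelope of width alpha."""
    envelope = np.exp(-((x - x0) ** 2) / alpha ** 2)
    carrier = np.exp(-1j * mu0 * (x - x0))
    return envelope * carrier


def spectrum_stats(alpha, mu0=5.0, n=1024, dx=0.05):
    """Return (|peak angular frequency|, width of the band above peak/e)."""
    x = (np.arange(n) - n // 2) * dx
    magnitude = np.abs(np.fft.fft(gabor(x, 0.0, mu0, alpha)))
    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    peak = abs(omega[np.argmax(magnitude)])
    d_omega = 2.0 * np.pi / (n * dx)  # spacing between frequency samples
    width = np.count_nonzero(magnitude > magnitude.max() / np.e) * d_omega
    return peak, width


peak_a2, width_a2 = spectrum_stats(alpha=2.0)
peak_a4, width_a4 = spectrum_stats(alpha=4.0)
```

Both spectra peak at μ0, and doubling α roughly halves the width of the pass band, which illustrates the trade-off between spatial and spectral resolution.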
2D wavelets are defined as
![]()
unification of the image and the Fourier domains
Gabor wavelets generalise these two as being opposite ends of the spectrum
varying the α parameter
as α → ∞ the Gaussian term becomes 1 and the expansion reduces to the Fourier basis
as α → 0 the Gaussian term becomes a discrete Delta function and the expansion reduces to the pixel by pixel image basis
Convolution
h(x, y) = ∫∫ f(α, β) . g(x − α, y − β) dα dβ
combines the two functions f and g to generate a third function h(x, y), whose value at location (x, y) is equal to the integral of the product of functions f and g after one is flipped and undergoes a relative shift by amount (x, y)
is used to combine two functions using each one to blur the other
is important for computer vision because
it is the basis for filtering (convolving the image with a filter kernel)
convolution theorem
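convolving two functions f and g multiplies their two 2D Fourier transforms in the Fourier domain, i.e.
H(u, v) = F(u, v) . G(u, v), where F, G and H are the 2D Fourier transforms of f, g and their convolution h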
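The theorem can be verified numerically in its discrete form. With the DFT the convolution is circular (indices wrap around), which is an assumption of this sketch; the direct implementation and the random test arrays are illustrative only.

```python
import numpy as np


def circular_convolve2d(f, g):
    """Direct circular convolution: h[y, x] = sum over (b, a) of f[b, a] * g[y-b, x-a]."""
    h = np.zeros(f.shape, dtype=float)
    rows, cols = f.shape
    for b in range(rows):
        for a in range(cols):
            # np.roll shifts g so that entry [y, x] holds g[y - b, x - a]
            h += f[b, a] * np.roll(np.roll(g, b, axis=0), a, axis=1)
    return h


rng = np.random.default_rng(0)
f = rng.random((8, 8))
g = rng.random((8, 8))

direct = circular_convolve2d(f, g)
# multiply the two 2D transforms, then transform back
via_fft = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(g)))
```

The two results agree to floating-point precision, and for large kernels the Fourier route is much cheaper than the direct double sum.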
Correlation
h(x, y) = ∫∫ f(α, β) . g(x + α, y + β) dα dβ (as for convolution, but without flipping either function)
indicates the strength and direction of a linear relationship between two functions (images)
i.e. it is a way of combining two functions, such as an image array and a pattern that one is searching for in the image, in order to generate a third function which would have a large peak if there were a strong correlation between the two
this is used in computer vision for
motion detection
texture classification
image segmentation
pattern matching/recognition
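Pattern matching by correlation can be sketched as follows. The FFT-based circular cross-correlation, the small test pattern, and its position in the image are all assumptions for illustration; practical systems usually also normalise for local brightness and contrast.

```python
import numpy as np


def cross_correlate(image, pattern):
    """Circular cross-correlation via FFT; the pattern is zero-padded to image size."""
    padded = np.zeros(image.shape, dtype=float)
    padded[:pattern.shape[0], :pattern.shape[1]] = pattern
    product = np.fft.fft2(image) * np.conj(np.fft.fft2(padded))
    return np.real(np.fft.ifft2(product))


# Hypothetical pattern hidden in an otherwise empty image at row 10, column 20
pattern = np.array([[1.0, -1.0, 1.0],
                    [-1.0, 4.0, -1.0],
                    [1.0, -1.0, 1.0]])
image = np.zeros((32, 32))
image[10:13, 20:23] = pattern

correlation = cross_correlate(image, pattern)
peak = np.unravel_index(np.argmax(correlation), correlation.shape)
```

The correlation surface has its largest value at the offset where the pattern's top-left corner sits in the image.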
Filters
used to extract edge information
variable properties include
isotropic (circularly symmetric) or anisotropic (directional)
self similar (dilates of each other) or not self similar
separable (expressible as products of 1D functions) or not
size of support (number of “taps”/pixels in the kernel)
preferred non-linear outputs (zero crossings, phasor moduli, energy)
bandpass filtering
this is filtering an image i(x,y) so that certain frequencies are emphasised and certain others are reduced
the function g(x,y) determines the pass band
normally this is done in the Fourier domain rather than the image domain, i.e.
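filtered image = FT⁻¹[ G(u, v) . I(u, v) ], where G and I are the 2D Fourier transforms of g(x, y) and i(x, y)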
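Filtering in the Fourier domain can be sketched with an idealised pass band. The ring-shaped mask (a hard cut-off between two radii) and the three-sinusoid test image are assumptions for illustration; a real g(x, y) would normally have a smoother transform.

```python
import numpy as np


def bandpass(image, low, high):
    """Keep spatial frequencies whose radius lies in [low, high] cycles per image."""
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows) * rows  # vertical frequency in cycles per image
    fx = np.fft.fftfreq(cols) * cols  # horizontal frequency in cycles per image
    radius = np.hypot(fy[:, None], fx[None, :])
    mask = (radius >= low) & (radius <= high)  # plays the role of G(u, v)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))


y, x = np.mgrid[0:64, 0:64]
coarse = np.sin(2.0 * np.pi * 2.0 * x / 64.0)    # 2 cycles per image
medium = np.sin(2.0 * np.pi * 10.0 * x / 64.0)   # 10 cycles per image
fine = np.sin(2.0 * np.pi * 30.0 * y / 64.0)     # 30 cycles per image

filtered = bandpass(coarse + medium + fine, low=5.0, high=15.0)
```

Only the 10-cycle component lies inside the pass band, so the output is the medium-frequency grating with the coarse and fine structure removed.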
Edge detection by derivative zero crossings
this is a technique to find edges in an image by examining the second derivative zero crossings
this can be done at different scales (of blurring) defined by the constant σ
this is the space constant of a Gaussian
[∇² Gσ(x, y)] convolved with I(x, y)
where this changes sign (a zero crossing) there is an edge
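The idea can be sketched in 1D, where the Laplacian of a Gaussian reduces to the second derivative of a Gaussian. The 1D simplification, the kernel truncation at 4σ, the border replication, and the `min_jump` threshold (used to ignore numerical noise in flat regions) are all assumptions of this sketch.

```python
import numpy as np


def gaussian_second_derivative(sigma):
    """Sampled second derivative of a Gaussian with space constant sigma."""
    half = int(4 * sigma)
    x = np.arange(-half, half + 1, dtype=float)
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel = (x ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * g
    return kernel - kernel.mean()  # force a zero sum so flat regions give zero


def zero_crossings(signal, sigma, min_jump=0.01):
    """Indices i where the filtered signal changes sign between i and i + 1."""
    kernel = gaussian_second_derivative(sigma)
    padded = np.pad(signal, len(kernel) // 2, mode="edge")
    response = np.convolve(padded, kernel, mode="valid")
    left, right = response[:-1], response[1:]
    crossing = (left * right < 0) & (np.abs(left - right) > min_jump)
    return np.flatnonzero(crossing)


signal = np.zeros(100)
signal[50:] = 1.0  # a step edge between samples 49 and 50
edges = zero_crossings(signal, sigma=2.0)
```

The single detected crossing lies between samples 49 and 50, where the step is. A larger σ would blur away fine detail and report only coarser edges.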
Correspondence problem
this is the problem of identifying corresponding regions (e.g. the same book) in two images
stereoscopic vision
from spatially displaced cameras
stereoscopic disparity
the same object will project into different places in the two images
the difference in projection between objects in the image is, to a first approximation, proportional to the distance in depth between those objects
it is also proportional to the distance between the cameras
these differences in projection/errors indicate the distance between the objects, if the distance between the cameras is known
motion vision
from temporally displaced cameras (i.e. same place, later time)
disparity
in a similar way the same object will project into a different place in the image
however the error/difference in projection will be proportional to the distance it has travelled between the two frames
this gives us the velocity if we know the time difference between the images
complexity
since we are trying to identify the disparities of position of certain objects in the image, first we must find the objects and then identify them
hence with more objects it is more complex to identify and tag them all
the number of candidate pairings varies quadratically with the number of objects/features n (each of the n features in the first image compared with each of the n in the second)
in theory any object in the first image could be mapped to any object in the second
this means that the number of possible one-to-one assignments is n!
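The two growth rates can be compared with a few lines of arithmetic. The function name is hypothetical, and the sketch assumes n features in each image.

```python
import math


def matching_counts(n):
    """Return (pairwise comparisons, possible one-to-one assignments) for n features."""
    return n * n, math.factorial(n)


counts = {n: matching_counts(n) for n in (5, 10, 20)}
```

For 10 features there are only 100 pairwise comparisons but 3,628,800 complete assignments, which is why exhaustive search over assignments is impractical and the simplifications that follow are needed.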
an approach to simplify the problem – multi scale image pyramids
this helps reduce complexity of finding objects in the first place
these are useful for efficiently detecting edges: edges can be located at a coarser level first, which gives a rough idea of where to look for them at the next, more detailed level
adequate alignments are found for blurred copies of the image pair
the process is repeated on less and less blurred versions, but each time the search space is restricted further by only looking in the areas where matches were found at the coarser levels
another approach to simplify the problem – stochastic relaxation
this helps reduce the complexity of deciding how to relate (already found) objects in the two images
large-deviation hypotheses are no longer considered once sufficient evidence has been amassed to indicate a more conservative solution
The main purpose of powdering one's face is to specify s and n in this expression:
![]()