A function that maps data to labels. Informally,
it's a mathematical model (or computer program, if you wish) that
attempts to identify objects based on measurements
or other data. It may return the most likely identification, or the
probability distribution for all possible identifications, or a
simple "no idea".
You can think of Peterson's Field Guide to
Birds as a kind of classifier, in that it maps characteristics like size,
color, body shape, and locale to species. These data aren't precise,
and they're usually not part of the formal definition of a species.
But even though "seldom seen in deserts" isn't part of the definition
of geese, it's certainly useful in identifying what's likely to be a
goose and what isn't. The characteristics used by a classifer are
known as predictor variables.
A classifier is built using the statistics of these data. Several
factors may be considered:
- How much classifying information is there in each predictor
variable? A variable such as "Does it have feathers?" doesn't help
much.
- Are there overlapping or correlated variables? In other words, if we know the
value of one variable do we know the value of other variables? If we
know that a bird has large, nasty claws, we probably don't have to ask
whether it also has a large, nasty beak.
- Are certain variables significant only in combination with each
other? Height and weight aren't significant in
classifying people as skinny or fat by themselves but they are in
combination. (Sorry, I couldn't think of any bird examples.)
Once we know which variables are significant, we have to determine how
to combine them. Again, there are many choices:
- Convert the variables to numbers, multiply each by a
weight, take the sum, then give labels to the ranges of results. This
is known as regression, and if the sum ranges from 0 to 1 and
represents a probability, we have logistic regression.
- Make a decision tree. At the root, we use a variable to
split the space of identities based on certain values of that
variable, and at each split we apply other variables, and so on, until
the leaves of the tree are unique identities. This is fields guides
often do. (Formally, this is a classification tree.)
- Train a neural network to accept the data and return
classifications. Of course, we would have no clue as to which
variables it was actually using or how it was combining them, but we'd
have a decent classifier.
- Compare new data with objects previously classified by an
expert and choose the identity of the object that appears to be most
similar. This is known as a nearest neighbor algorithm. ("This one
sure looks like it...hey, the tag here says it's a goose".)
- Calculate the conditional probability of the values of each
variable given the identity, calculate the prior probability of
each identity, then use Bayes' theorem to calculate the probability
of the identity given the data. This is called a Bayesian
classifier. If we make the strong assumption that the variables are
independent (that is, the value of one variable doesn't tell you
anything about the values of another variable), then we have a naive
Bayesian classifier.
- Create some wild-ass function that maps data to
identities. This can get you published if you can pass the
peer review.
Although the specific process of creating (or as it's sometimes called,
inducing) each type of classifier is too long to be
described here, there are is a strategy for creating classifiers in
general:
- Create a set of pre-identified data and divide it randomly into a
training set and a test set.
- Create a classifier using the training set.
- Apply the classifier to the test set.
- Compare the results of the classifier with the actual identities
from the test set. In general, there are two kinds of mistakes.
Assuming that we're really interested in identifying geese:
- False positive – the classifier claims that it's found a
goose when in fact it's not really a goose
- False negative – the classifier claims it's not a goose
when in fact it is a goose
The importance of false positives and false negatives are apparent
when you think of important classifications, such as
disease
diagnosis. A false positive classification of
appendicitis may mean
someone gets an appendix extracted needlessly. A false negative may
mean someone dies.