Peer Review of the Stoelting CPS:
Dear Colleague:
At a meeting
organized by the Central Intelligence Agency, a group of distinguished
scientists evaluated the Polyscore algorithm developed by Johns
Hopkins University/Applied Physics Laboratory...the Polyscore
algorithm is used on Axciton and Lafayette instruments...NOT Stoelting.
The meeting was held at the APL facility and was attended by their
representatives.
Excerpts of the scientists' comments are below:
"The effects of amplifiers, filters, and sampling procedures
have clearly adulterated the data."
"This is not a scientific project."
"...there is no reason to believe that the model developed at APL
has any power whatsoever to discriminate between truthful
and deceptive subjects."
"In summary, the contractors [APL] have developed, delivered,
and sold an algorithm to separate DI from NDI subjects that
has no demonstrated validity."
A complete
copy of the peer review of the scoring algorithm used by Axciton
and Lafayette is attached. For your convenience,
we have italicized some of their more significant comments.
Thank you for your interest in Stoelting's Computerized Polygraph
System. Only Stoelting determines a suspect's truthfulness
using software based on data from verified U.S. Secret Service
criminal investigations.
Report of Peer Review of Johns Hopkins University/Applied Physics
Laboratory
Stephen W. Porges, PhD, Chair of Review Committee
Date: March 28-29, 1996 Location: JHU/APL Facility, Laurel, MD
Summary:
A team
of distinguished scientists met to evaluate the research conducted
by JHU/APL on automated scoring of polygraph examinations. The
committee chaired by Dr. Stephen W. Porges included Dr. Raymond
Johnson, Dr. John Kircher, Dr. John Stern, and Dr. John Thornton.
Dr. Porges is a former President of the Society for Psychophysiological
Research, a Professor of Human Development and Psychology at the
University of Maryland, and an active researcher for the past
30 years in the area of autonomic psychophysiology, physiology,
and signal processing. Dr. Porges has chaired and participated
on review committees for several government agencies including
the National Institutes of Health. Dr. Johnson is an Associate
Professor of Psychology at Queens University and has been an active
researcher for over 20 years in the area of forensic psychophysiology.
Dr. John Kircher is an Associate Professor of Educational Psychology
at the University of Utah and has been an active researcher for
about 20 years in the area of forensic psychophysiology with special
expertise in statistical models and methods for detecting deception from
autonomic responses. Dr. John Stern is the former chairman of
the Department of Psychology at Washington University, a former
President of the Society for Psychophysiological Research, and a pioneer
in psychophysiological methodology and application who is currently
involved in evaluating the use of eye movements in the detection
of deception. Dr. John Thornton is a biostatistician and a former
department chairman of biostatistics at a major medical school.
The team had collective research experience covering all areas
of statistics, signal processing, and psychophysiology relevant
to the JHU/APL project. Based on scientific peer-review criteria,
the evaluation team viewed the JHU/APL effort as inadequate in the
application of scientific technology. The committee was perplexed by the theoretical and descriptive
approach conducted by the research team, as well as their lack
of understanding of physiology, psychophysiology, signal processing
strategies, physiological monitoring hardware, and statistical
theory. Specific issues will be discussed below.
SOW History:
JHU/APL was given the task to develop an algorithm that discriminates
between deceptive and non-deceptive in the data set provided.
The data set consisted of polygraph tests on which two expert
polygraphers had agreement. The JHU/APL team developed an algorithm
that works well with the data set provided. The algorithm was
able to replicate with a high degree of accuracy the decisions
of the expert polygraphers. Thus, based upon the initial description
of task, JHU/APL has completed the task. However, from a scientific
point of view, especially in terms of test development and application
of the algorithm to the field, the approach was faulty from the
start. Questions such as whether the algorithm works, under what
circumstances does the algorithm work, and whether the algorithm
works better than other algorithms have not been addressed. From
a scientific approach, these questions are not appropriate, because
the strategy used to develop the algorithms represents an
atheoretical and naive approach to psychophysiological signal
processing and test development.
Basic Problems with the project
1. Generalizability of the JHU/APL algorithm.
JHU/APL approached the question as an empirical task of developing
an algorithm that distinguished between the polygraph tracings
designated as deceptive from those designated as not deceptive.
This empirical direction is frequently articulated by the JHU/APL
team as the “power” or the “strength” of the computers and their
ability to extract the most meaningful patterns. Unfortunately,
no approach is truly empirical. All approaches to statistical
analysis are limited to the quality of the data input and subsequently
dependent upon sampling theory. The true goal is not to separate
the “given” data set into two groups, but to develop an algorithm
that may be “generalized” and used in field applications. Thus,
we must shift from “descriptive” atheoretical empirical descriptions,
to inferential statistics and their associated distributions of
estimators and of course, the probabilities of these estimators.
In the polygrapher’s world, this would be stated in terms of the
“probability” that an individual is innocent or guilty. Methodologically,
the JHU/APL algorithm can be used with the data set from which
it was developed, but generalization of the probabilities based
on the development sample to field tests is not statistically
appropriate. Basically, given the methods
used to develop the algorithm, there is no scientific basis for
the algorithm to be generalized to field application. Thus, the
marketing of the Polyscore algorithm as providing a probability
of deception in the field is unwarranted and professionally irresponsible.
2. Hardware used to collect the data for the development of the
JHU/APL algorithms.
The committee was critical of the Axciton computerized polygraph
system that was used to collect the data. To the committee the
technology of this system appears to represent 40-year-old hardware.
Although the outputs of the amplifiers are computerized, the amplifiers
and the interface with the subject reflect an old technology.
JHU/APL was unaware of the specifications of the amplifiers of
Axciton or the potential confounding effects of the time constants
and filters designed into the amplifiers on raw data. JHU/APL
had no understanding or knowledge of the transfer functions of
the amplifiers. This appeared to the review committee as inappropriate,
since JHU/APL had received a Federal contract to generate a transformation
function to statistically adjust data from another polygraph manufacturer
to be similar to the Axciton data. No appropriate answer was obtained
regarding the decision to select the system. Nor was there an
adequate reason to recommend this system to polygraphers in government
agencies. Apparently, because the JHU/APL algorithms were developed
on data generated from the system, the current marketing of the
algorithm forces users to purchase the Axciton.
It is important
to acknowledge that all AC amplifiers (i.e., self-centering) filter
data, especially lower frequency input. If the slow trend in GSR
is important or if the latency to peak or recovery of GSR is important,
this information might be lost or adulterated by the frequency
response characteristics of the amplifiers. From an engineering
point of view, this reflects poor design specifications. However,
since JHU/APL does not have an experienced research physiologist
or psychophysiologist or biomedical engineer on their team, the
question of attenuation of input signal or corruption of signal
by amplifiers and filters was not in their realm of expertise
and, thus, not considered in conducting their task.
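The filtering effect described above can be illustrated with a minimal Python sketch. It is not taken from the committee's report: it assumes the AC-coupled stage behaves like an ideal first-order high-pass (RC) filter, and the function names, sampling interval, and time constants are hypothetical. The input is a sustained one-unit change in level, standing in for a slow trend in GSR.

```python
import math

def ac_coupled(signal, dt, tau):
    """First-order high-pass (RC) filter, used here as a simple
    stand-in for an AC-coupled (self-centering) amplifier stage."""
    a = tau / (tau + dt)
    out = [0.0]  # assume the amplifier has already centered on the first sample
    for n in range(1, len(signal)):
        out.append(a * (out[-1] + signal[n] - signal[n - 1]))
    return out

def half_recovery_time(output, dt):
    """Seconds from the output peak until it first falls to half the peak."""
    peak_index = max(range(len(output)), key=lambda i: output[i])
    half = output[peak_index] / 2.0
    for i in range(peak_index, len(output)):
        if output[i] <= half:
            return (i - peak_index) * dt
    return None

dt = 0.01  # hypothetical sampling interval in seconds
# Hypothetical input: the level rises by one unit at t = 1 s and stays there.
step = [0.0] * 100 + [1.0] * 1900

for tau in (0.5, 2.0):  # hypothetical amplifier time constants in seconds
    filtered = ac_coupled(step, dt, tau)
    print("tau:", tau,
          "final output:", round(filtered[-1], 4),
          "half-recovery time:", half_recovery_time(filtered, dt))
```

In this sketch the sustained change disappears from the output, and the measured half-recovery time follows the chosen time constant (close to tau times ln 2) even though the input never recovers at all. A recovery feature read from such an output would describe the amplifier rather than the subject.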
3. Software.
The software incorporates several stages of smoothing, filtering,
and signal extraction. Each of the stages represents
a decision to extract specific features from the data. Although
the JHU/APL team argues that the approach is empirical and the
“computer” selects the features, this argument is not totally
accurate. Decisions were made by the team and each decision functioned
to “filter” the entire empirical data set. The transfer functions
of each of these steps need to be “empirically” presented. There
is no doubt that the several iterations of the algorithm were accomplished
via hard work; however, the decision sequences and the analyses
supporting these decisions were not available for review. The
development of an algorithm is not an empirical exercise as suggested
by JHU/APL, but rather it must be based upon statistical theory
and requires clear statements of the assumptions and consequences
for each transformation and each filter.
4. Physiological systems.
Since the 1880s, researchers have been quantifying electrodermal
activity in response to a variety of stimuli including “words”
and “thoughts” and “visualizations.” Even Carl Jung looked at
GSR to words. There is a large literature in psychophysiology,
physiology and biomedical engineering regarding the quantification
and feature detection of physiological data. JHU/APL appeared
to be totally ignorant of this literature on the special problems
of signal processing and filtering of physiological signals.
5. Summary of scientific effort. This
is not a scientific project. The project represents
a contractual relationship between Federal agencies and JHU/APL
to distinguish two groups in an existing data base with a computer
algorithm. If we accept that as the task, the contract has been
fulfilled. However, if we were to evaluate
this project in terms of scientific quality, and hold this project
to the criteria of peer review by psychophysiologists, signal
processors, statistical modelers, or psychometricians, our evaluation
would be negative. For example, the methodology does not meet
the standards for publication in a quality peer-reviewed journal.
Factors contributing to the inadequacy of the project
1. Functioned in scientific vacuum. The
project has functioned as if it were in a scientific vacuum and
has not incorporated knowledge of several related disciplines.
For example, specific features of GSR have been known for over
30 years to be sensitive to psychological phenomena. For over
a decade research in forensic psychophysiology has documented
that several of these features are related to deception. These
features are easily identified and quantified and should, at least,
be entered into the model to determine whether they discriminate
better than the “empirically” derived variables. Empirically derived
variables may, at times, miss theoretically important or relevant
characteristics. Moreover, as stated above, there are no “truly”
empirically derived variables, because decision rules have been
made regarding the collection of the raw data (e.g., digitizing
rates, transfer functions of the amplifiers, etc.) and the processing
of the data (e.g., numbers of samples to look at, features to
detect such as slope or peak or level, etc.).
The evaluation
committee was surprised there was no input into this project by
individuals who were expert in the problems of signal processing
of physiological signals. Physiological signals present special
problems, because of their unique statistical characteristics
such as nonstationarity, dynamic changes, physiological state changes,
and latency characteristics. Once this is understood, it is clear
that transfer functions of the amplifiers and software filters
would impact on true physiological signals, and this impact may
have a predictable pattern. The JHU/APL team was ignorant of these
issues and argued adamantly that these problems were irrelevant.
The evaluation team viewed the JHU/APL team as inappropriate and
unprepared to deal with the research problem. The credentials
of the JHU/APL team do not fit the skills necessary for a successful
treatment of the task.
The JHU/APL
team argued irrelevant points, such as comparing sending a missile
into space (a task to which JHU/APL contributed) with developing
an acceptable algorithm to deal with psychophysiological data.
In addition, several government employees have argued that the
“how and why” of the development of the algorithm does not matter.
Rather, it is whether the algorithm works or does not work. This
is not an appropriate question, because the ability of the system
to work is dependent upon the physical and sampling characteristics
of the data. The effects of the amplifiers,
filters, and sampling procedures have clearly adulterated the
data. Thus, even with the best statistical techniques,
the findings would be limited. However, the statistical techniques
employed suffer from a lack of understanding of statistical theory
and methodologies that would allow one to generalize findings
from a large sample to the field application of a single client.
Finally, there
is no documented scientific product that can be scientifically
evaluated. In science this occurs in terms of peer reviewed publications.
How would this project fare in the Journal of Applied Physiology,
or an IEEE journal or a signal processing journal or a psychophysiology
journal? How would peers evaluate the product? The answer to
these questions is clear: the methodology would be viewed as inadequate.
An interesting
example of this criticism occurred at the 1994 meeting of the
Society for Psychophysiological Research. This meeting occurred
while I was President of the Society. As President of the Society
for Psychophysiological Research, I was involved in the development
of the program for the annual meeting this past October. My charge
to the program chairman was to stimulate submissions from several
affinity groups including forensic psychophysiology. One symposium
focusing on forensic psychophysiology was accepted. I attended
this symposium and was appalled by the attitudes expressed and
the scientific bases of the JHU/APL presentation. My view was
shared by many of the Society members who witnessed the talk.
Following the meeting I received a copy of an email from a member
who is conducting polygraph research in Europe. His message included
the following statements:
To all concerned
scholars and scientists:
During the
last meeting of the Society for Psychophysiology Research, held
in Atlanta, Ga., October 5-9, a symposium was scheduled on the
topic of ‘Decision-Making Algorithms in Forensic Psychophysiology.’
The reason to address you is my concern about the ‘scientific’
contribution by Dale E. Olsen, John C. Harris, and Wendy W.
Chiu (Johns Hopkins University), entitled ‘The Development of
a Physiological Detection of Deception Scoring Algorithm’ and,
to a lesser extent, the contribution by Patrick F. Castelaz,
(Loma Linda University), and John E. Angus (Claremont Graduate
School), entitled ‘Deception Classification via Nonconventional
Processing Techniques.’
I am sending
a copy of this letter to the Board Of Directors of the Society
for Psychophysiological Research, and I will ask them never
to invite non-scientific researchers
like these again for a presentation at any conference organized
by a Society that fosters research relating psychology and physiology.
There are several
issues associated with the above statements. First, the Government
contracts are being conducted by researchers that are not respected
by scientists conducting psychophysiological research. Second,
the research was viewed as severely flawed and of no scientific
value. Third, it not only presents the contract researcher poorly,
but also reflects negatively on the sponsor. Basically, legitimate
and established researchers view the current Government-sponsored
polygraphy research as wasteful and poorly conceived. Additionally,
credible researchers who want to conduct scientific investigation
of polygraphy do not have access to the minimal funding they
require to complete well-designed research that would pass
the criteria of peer review.
2. The fallacy of efficacy research.
The JHU/APL contract is an example of efficacy research. Efficacy
research is short-sighted and often useless when new methodologies
are developed. Efficacy research is focused on proving that existing
technologies work and not at understanding why they work. The
issue is critical in the area of polygraphy. Most researchers
as well as the public believe that polygraphy works with a “hit
rate” greater than chance. The polygraphy community believes that
hit rate approaches 100%, while the scientific community believes
that this is a vast overestimation. There are two approaches
to dealing with this inconsistency in objective effectiveness.
One is to evaluate the effectiveness or improve effectiveness
by identifying strategies of “experts.” The other is to attempt
to understand the mechanisms and processes mediating the features
of physiological activity related to deception. The first assumes
that polygraphy as currently configured works and that the primary
research goal is to “streamline” the scoring procedure to be more
reliable. It is assumed that if reliability is enhanced accuracy
will be enhanced as well. To deal with the reliability issues,
funds have been directed at computerized scoring in order to reduce
scorer bias. The JHU/APL contract reflects this approach. The
second assumes that better signal processing can be used to extract
different physiological response variables, better paradigms
can be developed to isolate physiological responses, and better
decision making algorithms can be developed. The latter is a scientific
approach with a goal of improved detection instruments via understanding
the underlying mechanisms; the former is the approach of an industry
with an unmodifiable and potentially obsolete product attempting
to justify its utility.
Both approaches
cannot co-exist. Their philosophical bases are in direct contradiction.
Moreover, it is clear on which side of this philosophical argument
Government agencies are. Although Government funding has stimulated
novel evoked potential and eye movement research, it has directly
impeded research with the more “traditional” autonomic nervous
system variables. The Government has created two bottlenecks in
the polygraph research program: 1) the dependence on the Axciton
system; and 2) the development of algorithms by the Applied Physics
Laboratory at Johns Hopkins. These two “research” decisions result
in current efficacy research funding being directed at demonstrating
their utility. For example, although Axciton
appears to be a defective device in the transformation of physiological
signals, the Government has allocated funds to “translate” these
signals into a data base similar to that of other traditional
polygraphs. Unfortunately, the traditional polygraphs also adulterate
the physiological signals. Thus, the efficacy approach will result
in wasted funds because the amplifiers distort the underlying
electrophysiological and mechano-physiological signals. The obvious
solution would be to use amplifiers and computer algorithms that
accurately report and represent the physiological signals. This is a simple
exercise that could be evaluated by competent scientists.
Supporting evaluations by Drs. Johnson, Kircher, and Stern
Dr. Raymond Johnson
Based on what
I learned at the site visit, I had a number of concerns about
both the results and the manner in which the contract was executed.
To begin, the contractors made no attempt to take advantage of
nearly 100 years of research either on the nature of the physiological
measures that form the basis of polygraphy or on the methods that
have been developed to quantify these signals. While the contractors
were able to generate a measure for separating the deception indicated
(DI) and no deception indicated (NDI) populations, it is very
likely that they could have done a better job (see below) and
spent considerably less time and money had they used the existing
information.
While one
might argue the above criticisms are irrelevant since they did
derive a discriminator, the main point is that they never tested
their method to determine if it works, and if it does, they did
not determine the extent to which their algorithm works in different
situations. That is, at present, there
is not one single piece of evidence to confirm that this algorithm
will provide an accurate indication of DI or NDI in a person outside
the original sample used to make the model. This is
another instance of the contractors ignoring a long history of existing
knowledge, in this case, research design methods. Had they used
any of a series of standard practices, they would have built their
model on a random selection of one-half of the data, leaving the
other half for testing the efficacy of the finished model. Perhaps
the most appalling aspect of their presentation was the fact that
they felt absolutely no need to test their algorithm before putting
their product on the market. In addition,
they sell this as a tested product and I saw no indication from
any of the polygraphy community that was present that they understood
that this was an entirely untested product.
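The split-half practice described above can be sketched in a few lines of Python. This is an illustration only, not the contractors' procedure or data: the cases are hypothetical random numbers in which the features carry no information about the label, and the "model" is simply the single feature and sign that best separate the development half.

```python
import random

def accuracy(cases, feature, sign):
    """Fraction of cases where the rule 'sign * feature value > 0
    means deceptive' agrees with the label."""
    hits = sum(1 for values, label in cases
               if (sign * values[feature] > 0) == label)
    return hits / len(cases)

def fit_best_rule(cases, n_features):
    """Pick the feature and sign that best separate the given cases."""
    candidates = [(f, s) for f in range(n_features) for s in (1, -1)]
    return max(candidates, key=lambda fs: accuracy(cases, fs[0], fs[1]))

rng = random.Random(0)
n_cases, n_features = 100, 200
# Hypothetical data: features are pure noise, labels are unrelated to them.
cases = [([rng.gauss(0, 1) for _ in range(n_features)], rng.random() < 0.5)
         for _ in range(n_cases)]

# Random split: one half to build the rule, the other half held back for testing.
rng.shuffle(cases)
development = cases[:n_cases // 2]
holdout = cases[n_cases // 2:]

feature, sign = fit_best_rule(development, n_features)
print("development accuracy:", accuracy(development, feature, sign))
print("holdout accuracy:    ", accuracy(holdout, feature, sign))
```

Because the rule is chosen from many candidates to fit the development half, its accuracy there is optimistic by construction. Only the holdout figure says anything about cases outside the sample, which is the evidence the review found missing.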
Along with
a complete failure to test/validate the algorithm, the contractors
also failed entirely to characterize their sample and thereby
delineate properly for whom and what crimes the system might work.
Upon questioning, they had little or no idea about what mix of
subjects (e.g., sex, race, age) or crimes was used to develop
the algorithm. I feel that, without proper studies, it is impossible
to know if the algorithm (even assuming it works and is valid)
can be applied at all, or in the same way, across the entire range
of possible subjects and crimes. I believe that the contractors,
in ignoring this point, have made a serious error that can only
be exacerbated by the atheoretical manner in which they derived
their algorithm. That is, when they chose to base their model
strictly on these particular cases, they tied their
algorithm very tightly to the particular characteristics of their
sample. I would, therefore, not expect the algorithm to generalize
or work well on persons or crimes not well represented in the
original sample.
The idea that
their algorithm does not easily generalize to other situations
is supported by their statement that they would like to continue
building their sample, doubling it if possible. This suggests that the weights, and perhaps the parameters,
are still drifting. This reveals a number of things
about their algorithm. First, it suggests that there is a considerable
amount of variability that remains, raising doubts about using
it to make important decisions about individuals’ futures. Second,
it suggests they have over-modeled their sample, reducing its
generalizability to cases outside the sample. Third, it suggests
that their model does not capture well the essential differences
between DI and NDI responses. Had it done so, they should
have been able to derive their model from a smaller, rather than
a larger, sample. This is particularly important because the model
is to be applied to individuals, rather than groups, in order
to make a vitally important decision.
Failure to
appreciate the fact that the polygraph itself affects the character
as well as the quality of the data is yet another example of the
contractors’ failure to obtain even a rudimentary knowledge of
the subject they were working on. Even
a cursory glance at any basic text on psychophysiological methods
would have revealed the importance of filters and their interaction
with all physiological signals. Consequently, one of their “features,”
the one based on the Autonomic GSR signal, is almost certainly
a measure of the time constant of the amplifier, rather than of
a signal from the subject. Hence, the algorithm will
not work properly if a different manufacturer uses a different
time constant, or if the operator makes a change to the amplifier
settings. Note that the algorithm could be adjusted to compensate
for such changes in the time constant. However, without adequate
knowledge that this part of the algorithm reflects amplifier characteristics
and not subject responses, the contractors had no idea that such
changes in the algorithm would be required. Thus, the accuracy
of the algorithm is affected in ways the contractors did not understand,
or warn about, because of their ignorance of basic information
on the topic they were working in.
In the future,
I feel certain that the contractors will return to government
agencies to obtain additional money for developing their algorithm.
Indeed, they can argue that they did as they said and fulfilled
the contract, whether or not it was done as well, as expeditiously,
or as cheaply as they or someone else might have done had they first
developed an expertise in psychophysiology. Therefore, it is hard
for me to imagine that these investigators will not receive another
government contract to either validate their method, or extend
it to other formats, or both.
Thus, there
are lessons to be learned from the way in which the previous contract
was written and administered. First, any future contracts (with
these or other investigators) must include a provision that any
algorithm for separating DI and NDI persons be TESTED by scientifically
accepted methods as the condition for fulfilling the contract.
Second, the contractors should submit their research/experimental
plan for review by a panel of scientists (as is done in all cases
for non-military grants and contracts) employed by the funding
agency. The funding agency may also want to consider using the
panel to periodically assess the progress on the contracts so
that any problems can be corrected as they occur, rather than
after the contract has been completed. Implementing these procedures
will dramatically reduce contract problems.
In sum, the contractors have developed, delivered and sold an algorithm
to separate DI from NDI subjects that has no demonstrated validity.
I would recommend that all use of this algorithm cease until its
accuracy and validity can be demonstrated. Further, I would not
give these contractors another contract without first assuring
that there will be considerably greater oversight from
scientists knowledgeable in the field of psychophysiology.
Dr. John C. Kircher
1. Overall Impressions. I
was appalled by the procedures used at the APL to develop computer
programs for analyzing polygraph charts. It is difficult for me
to understand why the government would (i) fund efforts to develop
algorithms to analyze signals from hardware that is 40 years out
of date, (ii) fund investigators that have no formal training in the
relevant scientific disciplines (physiology, psychology, and psychophysiology)
and clearly made no attempt to study these areas on their own,
and (iii) fund investigators who, even in their own areas of expertise
(statistics, signal processing) show minimal grasp of basic principles
of classical and modern measurement theory, effects of multiple
nonlinear transformations on their signals, and issues of research
design, generalizability, and shrinkage.
2. Alternative Analytic Techniques.
The JHU/APL team did not consider, nor do they appear to be aware
of, alternative analytic methods such as those developed by David
C. Raskin and me, which are described in detail in several scientific
articles and book chapters.
3. Other Measures from the Existing Data Base.
I am not sure, but I do not believe that any traditional measures
of electrodermal and cardiovascular activity were among the 9,911
features these investigators report having examined (I believe
that, with various types of signal transformations, the actual number
of features examined by these investigators is many times greater
than 10,000). Traditional measures would include features such
as peak amplitude, rise time, half-recovery time, and onset latency
(e.g., Coles, Donchin, & Porges, 1988; Greenfield & Sternbach,
1972; Martin & Venables, 1980; Fowles, Christie, Edelberg, Grings,
Lykken, & Venables, 1981). It might be worth taking a look
at a few traditional measures and correlating them with the features
generated by the investigators at APL. It might give us some idea
of what they are measuring.
4. Hardware.
The major problem with Axciton hardware is the measurement of
skin resistance from stainless steel plates. Even when properly
recorded, SR shows baseline drift that is often greater than the
magnitude of the physiological responses. As a consequence, on
traditional polygraphs the examiner must
frequently re-center the pen. Axciton’s solution to this problem
is to filter the signal to reduce baseline drift. However, this
filter distorts the signal and moves us one step further from
the subject’s cognitive and emotional response to the stimulus.
To measure SRR’s, the Axciton uses stainless steel plates. These
plates are subject to polarization, bias, and movement artifacts.
I do not know
if the Lafayette system uses stainless-steel electrodes, but it
does record skin conductance. I doubt that the Lafayette hardware
filters the SC signal in the same manner as the Axciton, if at
all. If the Lafayette uses Ag-AgCl electrodes with electrode paste
and records with a constant 0.5 V circuit as we do, there would
be less polarization and bias, and minimal effects on sweat gland
activity. I cannot imagine how the Polyscore algorithm, which was
developed with data from the Axciton, could possibly correct for
the differences between the filtered SR signals from the Axciton
and the SC signals from Lafayette.
In the copies
of transparencies provided by the APL and at one point during
the meeting, the investigators noted that SR signals that exceeded
the 12-bit resolution of the Axciton resulted in flat lines at
the maximum (clipping of the waveform). The investigators then
noted that this problem had been corrected. At what point was
this problem corrected? Can it be assumed that many of the cases
in their training set were collected while this was a problem?
Were these strong SRR’s treated as missing values (which would
bias results), or were they measured as they appeared in the data
files? The occurrence of artificial ceilings in SRR’s would seriously
affect the diagnosticity of various ‘percentile’ measures of ‘GSR,’
their selection for the logistic regression model, and the weights
given them by the model.
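To make the concern about artificial ceilings concrete, here is a minimal Python sketch. It is not based on the Axciton data or on Polyscore's actual "percentile" measures: the amplitudes are hypothetical random values, and an ordinary nearest-rank percentile is assumed as a stand-in.

```python
import random

ADC_MAX = 4095  # largest value a 12-bit converter can represent

def percentile(values, p):
    """Nearest-rank percentile (p from 0 to 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * p // 100))  # ceiling of n * p / 100
    return ordered[int(rank) - 1]

rng = random.Random(1)
# Hypothetical response amplitudes in converter counts; some exceed the range.
true_amplitudes = [rng.uniform(500, 6000) for _ in range(200)]
# Clipping: anything above the ceiling is recorded as a flat line at the maximum.
recorded = [min(a, ADC_MAX) for a in true_amplitudes]

clipped_count = sum(1 for a in recorded if a == ADC_MAX)
print("clipped samples:", clipped_count)
print("90th percentile, true:    ", round(percentile(true_amplitudes, 90)))
print("90th percentile, recorded:", round(percentile(recorded, 90)))
print("80th percentile, recorded:", round(percentile(recorded, 80)))
```

When a sizeable share of strong responses is clipped, the upper percentiles of the recorded values tend to sit at the ceiling, so differences between them shrink or vanish. Any model weights fitted to such measures would inherit that distortion.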
5. Algorithm Development: Signal and Feature Transformations.
The
investigators perform numerous nonlinear transformations of signals
prior to extracting features from them. The Axciton hardware filters
low frequencies from the SR channel, and the Polyscore then detrends
the filtered SR data. The signal from which the ‘GSR’ features
are extracted is a nonlinear transformation of another nonlinear
transformation of skin resistance.
Even if the investigators measured something as simple as the
amplitude of this signal, one wonders how that measurement relates
to actual changes in skin resistance. The transformations do not
stop there. They measure response magnitude in terms of pseudo-percentiles
and in some cases differences between pseudo-percentiles. While
these measures are functionally related to the subject’s responses
to test questions, it is difficult to see how, and one wonders
about the effects of variations in gain settings and even the
number of questions on the obtained measures of electrodermal
responses.
The
cardio channel is detrended and then partitioned into high and
low frequency components. I am concerned about the effect of detrending
on measures extracted from these signals as well. The justification
for detrending cardio signals is to correct for leaks in the pneumatic
system. (The same justification is given for respiration channels
since these are recorded from pneumatic bellows.) It
seems to me that the solution to leaks in the recording system
is to find out where there are leaks and fix them. I believe it
is a mistake to transform all signals to ‘correct’ for the possibility
that there might be a leak in the recording system. Again,
with each transformation, the measurements we use to discriminate
between truthful and deceptive subjects are one step further removed
from the psychological processes that give rise to the changes
in physiology we wish to assess.
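The effect of detrending on a measured response can be sketched with a toy example. The code below assumes a plain least-squares straight-line detrend, which is only a stand-in; the detrending method actually used by Polyscore is not described in the materials reviewed. The signal is invented: a flat baseline followed by a sustained rise of one unit.

```python
def linear_detrend(signal):
    """Subtract the least-squares straight line from an evenly sampled signal."""
    n = len(signal)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, signal))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, signal)]


# Made-up signal: flat baseline, then a sustained rise of one unit.
raw = [0.0] * 5 + [1.0] * 5
detrended = linear_detrend(raw)

raw_amplitude = max(raw) - min(raw)
detrended_amplitude = max(detrended) - min(detrended)
print(raw_amplitude, round(detrended_amplitude, 3))
print([round(v, 3) for v in detrended])
```

Because the fitted line absorbs part of the sustained rise, the detrended amplitude is smaller than the raw one, and the flat segments acquire a downward slope that was never in the recording. A measure taken from the detrended trace therefore reflects the response and the correction together.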
Respiration
signals are detrended and ‘baselined’ (pp 13-15 of
the Algorithm Overview handout). I do not see how the respiration
signal would change if only the baselining were done. Ultimately,
the points of maximum expiration are aligned. I
also fail to see any possible advantage in baselining respirations.
I think it must distort the signal in some nonlinear manner to
fit the constraint that all baseline values fall on a straight
line. And why should it matter to the algorithm if
the baseline points are aligned? There might be some advantage
in applying this transformation if the baselined respirations
were to be presented to polygraph examiners for visual inspection.
However, I asked about this, and they do not show the baselined
respirations to the polygraph examiner. I believe the transformations
on respiration channels have only adverse effects on measurements
from these channels because they distort the signals and do so
unnecessarily since no visual analysis is ever performed on the
transformed data.
6.
Algorithm Development: Feature Extraction.
APL approached their task as though no one had ever before quantified
respiration, electrodermal, or cardiovascular activity and used
them to draw inferences about psychological states or processes.
The measures developed by APL bear little if any resemblance to
the types of measurements made by psychophysiologists for many
decades. They did not study this literature. For example, they
did not know that the reductions in cardio pulse amplitude in
their model could reflect an increase in blood pressure or a decrease
in blood pressure depending on the pressure of the occluding cuff.
Are we to believe that these investigators, with no training in
psychophysiology, have identified new ways of characterizing psychophysiological
events that are generally superior to the accumulated wisdom of
research scientists who have been investigating these phenomena
for over 100 years? Standards have been developed
by the scientific community for measuring physiological activity,
and these standards were completely ignored. Massive computation
is no substitute for careful study of the scientific literature
and the application of well established psychophysiological and
psychometric principles.
One potential
advantage in using a computer to quantify physiological reactions
is that it can help to train polygraph examiners to be better
chart interpreters. With all the nonlinear transformations of
signals and features made by Polyscore and the nonstandard types
of features used by this algorithm, there is no way the measurements
made by Polyscore could possibly be used to train polygraph examiners
to be better numerical scorers.
The investigators
at APL have argued that an advantage in using the computer is
that it evaluates aspects of the charts that the human interpreter
could not possibly ‘see.’ The only possible way additional complexity
could be considered advantageous is if it actually improves diagnostic
accuracy, and there is no evidence of this. Otherwise, it is nothing
but smoke and mirrors. If the algorithm is a complete mystery
to the best scientific minds in the field, it could not possibly
be viewed as anything but a black box to the field polygraph examiners
who use it.
7.
Algorithm Development: Feature Selection.
Stepwise logistic regression was used with about 33 blocks of
300 different variables to select the features for the model.
Stepwise regression procedures are notoriously unreliable; they
capitalize on chance, and yield highly inflated estimates of discrimination
(Hocking, 1983). The problem is compounded by the enormous number
of variables provided to the stepwise variable selection algorithm.
Under these conditions, random numbers would yield perfect or
near perfect discrimination between groups, which is precisely
what these investigators report. Their approach borders on the
absurd. It is the most extreme example of overfitting a training
set that I have ever seen or could possibly imagine. I do not
understand how a professional statistician could even consider
developing a prediction equation with more than 10 variables per
subject. Texts on multivariate statistics recommend 10-20 subjects
per variable and certainly no fewer than five subjects per variable.
Not only did these investigators fail to follow these recommendations,
they missed the boat by two orders of magnitude. On purely statistical
grounds, any claim that the results obtained in their training
set will generalize to new cases is completely without merit.
From
a statistical standpoint, there’s no reason to believe that the
model developed at APL has any power whatsoever to discriminate
between truthful and deceptive subjects.
Other indications that their model has limited generalizability
include intimations by the investigators (no data) that the model
for the ZOC test would (or does) not work well for MGQTs or in laboratory
mock crime experiments. There’s no theoretical reason or empirical
research to suggest that the patterns of physiological changes
associated with truthfulness and deception should differ in ZOC
and MGQ tests.
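The point that random numbers can appear to discriminate when many candidate variables are screened on few cases can be sketched as follows. This is not APL's procedure: a simple filter that keeps the features with the largest group mean difference stands in for stepwise logistic regression, and the case counts, feature counts, and seed are all made up for illustration.

```python
import random


def make_noise_cases(n_cases, n_features, rng):
    """Balanced truthful(0)/deceptive(1) labels with features that are pure noise."""
    labels = [i % 2 for i in range(n_cases)]
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                for _ in range(n_cases)]
    return features, labels


def group_mean(features, labels, j, group):
    vals = [row[j] for row, y in zip(features, labels) if y == group]
    return sum(vals) / len(vals)


def select_features(features, labels, k):
    """Keep the k features with the largest group mean difference (with sign)."""
    diffs = []
    for j in range(len(features[0])):
        d = group_mean(features, labels, j, 1) - group_mean(features, labels, j, 0)
        diffs.append((abs(d), j, 1.0 if d > 0 else -1.0))
    diffs.sort(reverse=True)
    return [(j, sign) for _, j, sign in diffs[:k]]


def score(row, selected):
    return sum(sign * row[j] for j, sign in selected)


def accuracy(features, labels, selected, cutoff):
    hits = sum((score(row, selected) > cutoff) == (y == 1)
               for row, y in zip(features, labels))
    return hits / len(labels)


rng = random.Random(0)
train_x, train_y = make_noise_cases(40, 300, rng)    # small training set
fresh_x, fresh_y = make_noise_cases(1000, 300, rng)  # cases never used for selection

selected = select_features(train_x, train_y, k=20)
train_scores = [score(row, selected) for row in train_x]
cutoff = sum(train_scores) / len(train_scores)  # midpoint, since groups are balanced

train_acc = accuracy(train_x, train_y, selected, cutoff)
fresh_acc = accuracy(fresh_x, fresh_y, selected, cutoff)
print(f"training accuracy: {train_acc:.2f}, fresh-case accuracy: {fresh_acc:.2f}")
```

Since every feature is noise, accuracy on fresh cases should hover near one half, while accuracy on the cases used for selection comes out far higher. The gap is produced entirely by the selection step, which is why results from a training set say little without cross-validation.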
I think the
use of logistic regression for computerized polygraphy is appropriate.
It is a non-parametric method that is a bit more conservative
than discriminant analysis. The relative power of these two methods
for modeling differences between truthful and deceptive subjects
depends on the extent to which the assumptions that underlie the
two techniques are reasonable. Under conditions of multivariate
normality, discriminant analysis is likely to be more powerful.
However, I take issue with the claim by these authors that the
probabilities of truthfulness/deception from discriminant analysis
are ‘unreasonable’ (see Algorithm Overview Page entitled ‘Comparing
Logistic Linear Regression’). For our presentation at the 1994
meeting of SPR in Atlanta, we created a discriminant function
and a logistic regression model composed of the same set of measures
from the same set of laboratory and confirmed field cases (89
deceptive, 74 truthful subjects). The correlation between the
probabilities generated by the two models exceeded .999. If discriminant
analysis produces ‘unreasonable’ probabilities, then the same
must be said about logistic regression.
8.
Algorithm Development: Case Selection for Model Development.
The criteria for selecting cases for the training set differed
for individuals classified as deceptive or truthful. Criterion
deceptive subjects had confessed, and there was independent corroborating
evidence that the confession was correct. We do not know the percentage
of test questions within a test to which a subject confessed.
Confessions about all issues covered even in ZOC tests are rare.
What became of the data for questions within a test for which
there were no confessions? Apparently Mike Capps and someone else
decided if there was corroborating evidence. Was there independent
corroborating evidence for every deceptive case (or relevant question)
in the training set? What was the reliability of these judgments,
and why was the corroborating evidence criterion not stated in
the handout (page 6) or in any other descriptions of algorithm
development supplied by the APL?
Apparently,
a guilty plea by a suspect was also used as a basis for assignment
to the deceptive group and perhaps even for ‘clearing’ an innocent
suspect. There are serious problems with this criterion, since
people enter guilty pleas for many reasons, only one of which
is that they committed a crime and lied about it on their polygraph
test. Important details about the reliability of the selection
criteria for deceptive subjects were not presented, and I have
grave concerns about the use of guilty pleas as a criterion for
case selection. From the available information, it is not possible
to estimate the proportion of subjects, or the proportion of answers
to individual questions within tests, that were considered deceptive
but were actually truthful. In all likelihood,
the algorithm was trained on a ‘contaminated’ sample of deceptive
cases.
The manner
in which criterion truthful subjects were selected for the training
set was even more problematic. The vast majority of this sample
was composed of cases in which the original examiner and two independent
evaluators agreed that the subject was truthful. I suspect that
this sample contains few false negative errors, too few in fact.
There is a lot of research which shows that a large percentage
of truthful subjects produce ambiguous charts, especially in field
tests. The outcome of a polygraph test is ambiguous when the subject
responds similarly to control and relevant questions. These cases
were systematically excluded from the sample of truthful cases
since only clearly truthful charts are likely to result in complete
agreement among polygraph examiners. It is not difficult to see
why the algorithm was able to discriminate this homogeneous sample
of clearly truthful charts from cases in the deceptive group.
Inclusion of only clearly truthful charts biases estimates of
the accuracy of outcomes from Polyscore, not only for truthful
subjects but also for deceptive subjects. More importantly, it
is likely to have undesirable effects on the selection of variables
for the model. Variables that could be of great value in increasing
the statistical distance between ambiguous truthful charts and
deceptive ones would not be selected for the model because the
stepwise algorithm does not know that these cases even exist.
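The bias from keeping only clearly truthful charts can be sketched with invented numbers. The composite score, its distribution, and the rule standing in for unanimous examiner agreement are all assumptions made for illustration; nothing here is drawn from the APL data.

```python
import random

rng = random.Random(1)

# Hypothetical composite score: negative looks truthful, positive looks deceptive.
truthful = [rng.gauss(-1.0, 1.0) for _ in range(5000)]


def fraction_called_truthful(scores, cutoff=0.0):
    """Share of truthful subjects that a fixed cutoff would classify correctly."""
    return sum(s < cutoff for s in scores) / len(scores)


# Assumed stand-in for unanimous examiner agreement: only charts that look
# clearly truthful (score below -1) make it into the training sample.
clearly_truthful = [s for s in truthful if s < -1.0]

all_acc = fraction_called_truthful(truthful)
selected_acc = fraction_called_truthful(clearly_truthful)
print(f"all truthful subjects: {all_acc:.2f}, selected sample: {selected_acc:.2f}")
```

Every selected chart is classified correctly by construction, whereas a noticeable share of the full truthful population falls on the wrong side of the cutoff. Accuracy estimated on the selected sample therefore overstates what field use would deliver.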
The investigators
went to great pains to state that their goal was not to determine the
accuracy of polygraph examinations. However, it is not possible
to separate the issue of the ‘accuracy of polygraph examinations’
from the task of developing an algorithm that has utility for
field polygraphy. By systematically excluding inconclusive cases
from the training set, they misrepresented the qualitative and
quantitative nature of psychophysiological differences between
the target populations of truthful and deceptive subjects. These populations
are considerably more heterogeneous than the samples of cases
included in the training set. Consequently, their model was not
optimized to distinguish between truthful and deceptive field
suspects, the observed differences between the groups are biased,
and they fool themselves and their customers about the value of
the algorithm for field applications.
9.
Recommendations and Future Directions.
a. Stop funding the research at the APL. The research
at APL has been an incredible waste of time and money. They completely
ignored scientific standards for psychophysiological measurements
and statistical analysis; they adulterated methods David Raskin
and I pioneered almost 20 years ago; and they promote the algorithm
for field use although it has never been adequately tested. One would
think that a professional statistician with plans to generate
thousands of new and untested measures would think to hold out
a few cases in order to cross-validate the model. The only test
of the validity of the algorithm was completed at DODPI shortly
before the meeting in March. There were only four confirmed truthful
subjects in this cross-validation study, and Polyscore misclassified
one of them (75% correct). Statistically, this is no better than
chance accuracy on truthful subjects. Whatever limited validity
data exist speak more to the power of the control question test than
to the quality of their research. Accuracy on deceptive subjects
was much better (92% I think). However, with little information
about the accuracy on truthful subjects, this figure is difficult
to interpret. If the algorithm is biased against truthful suspects
and yields a high rate of false positives, one would expect highly
accurate decisions on deceptive subjects.
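The statement that three correct out of four is no better than chance can be checked with a short binomial calculation. The sketch assumes that a chance-level classifier gets each truthful subject right with probability 0.5; that baseline is an assumption for illustration, not a figure from the DODPI study.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more correct calls out of n when each is right with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Four confirmed truthful subjects, three classified correctly.
p_chance = prob_at_least(3, 4)
print(p_chance)  # 0.3125
```

A coin flip would do at least this well on four subjects nearly one time in three, so the 75% figure cannot distinguish the algorithm from guessing.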
A far better
use of funds would be to: (i) announce your interest in the development
of computerized polygraph systems and solicit a range of proposals
from the research community; or (ii) organize a meeting of a select
group of psychophysiologists with some knowledge and interest
in polygraphy. (It wouldn’t hurt to invite a few outspoken members
from the user groups whose task would be to keep the feet of ivory
tower types firmly planted on the ground.) The purpose of the
meeting would be to develop the research plan for further development
of computer algorithms for the detection of deception. The research
program would be characterized by a coherent agenda, open discussion,
consensus, and collaborative efforts by knowledgeable investigators.
I am certain that the result would look nothing like the APL/Axciton
system you have now.
b. Model
the decision making of human experts who make accurate
decisions yet tend to extract diagnostic information from the
polygraph charts that is not included in the computer model.
c. Investigate
new sensors. Autonomic measures that show considerable
promise include skin potential, components of variance in heart
period (vagal tone), and blood pressure. Pilot data from our lab
and John Podlesny’s research with the Finapres blood pressure
monitor suggest that measures derived from these channels are
diagnostic. Moreover, they are more likely than additional characterizations
of conventional polygraph channels to make independent contributions
to existing combinations of weighted physiological measures.
John
A. Stern
1. This
is a statistical Tour de Force - or is it a Tour de Farce??
I
feel strongly that it is the latter.
Their attempt at “model building” deals purely with the development
of a statistical model, one that is purely descriptive in nature.
2. The effort
to develop an algorithm that discriminates between deceptive and
non-deceptive is laudable; however, the effort is completely atheoretical,
nor did it have the benefit of skilled polygraphers in suggesting
variables that might (or might not) be entered into the discriminative
equation.
3. The need
for 500+ examinations to develop a “predictive equation” suggests
a lot of error variance in the data base. The suggestion to develop
an equation based on a smaller subset of the data and applying
it to the remaining data set was considered “inappropriate” by them.
They must have had second thoughts about this! We did receive
a copy of a “cross-validation” report in which the data set (624
cases) was divided into four subsets. Using their “modeling” procedures
models were built for each database and they now claim only a
3% accuracy degradation when the model developed on one data base
is applied to the other three sets. They go on to say “It also
shows that although some models had no terms (features) in common,
there is a strong correlation for some features and allowing LOGIT
to choose among them on the basis of the data base results in
equivalently performing models.” How is that for gobbledygook
and scoring a touchdown for the other side? One must now ask,
how many samples are necessary for the development of a “model”
and which is the most appropriate model. If all four worked equally
well one must have doubts about their procedures!!!
At the meeting
they were not satisfied with the number of cases in their data
bank and felt they needed more. One wonders how many more cases
will be required before they are satisfied that the equation is
reliable and valid and field applicable - although it is being
marketed and applied in field investigations. We wonder about
the ethics of marketing an unvalidated product. We believe it
to be highly unethical - regardless of the outcome of a “validation”
study.
4. The models
developed are highly restrictive. New models have to be developed
for Zone scoring and the MGQT procedures. It would seem to me
that a more general model would be desirable - if it is possible.
5. To my surprise
only polygraph data and no personal data - such as gender, age,
race, etc. enter into the predictive equation.
6. Though
not, or only poorly, articulated, it appears that the “features”
or “factors” cannot be described in terms of what component of
the response in question is evaluated. With all their statistical
manipulations it apparently becomes difficult to decipher the
algorithm. The features used in the algorithm are purely empirically
derived and the authors could not or would not specify what each
feature measured nor identify weights associated with each feature.
We were shown one of the Polyscore reports and I noticed that
in the particular report 80% of the weighting was attributed to
electrodermal variables. Ideally one should specify the decision
rules identified in the algorithm and then evaluate these rules
empirically.
7. Their response
to questions was, at best, evasive and sometimes offensive. It
appears that there is little in the way of documentation concerning
their “research” efforts. They alluded to analyses done earlier
but there’s no paper trail allowing for an evaluation of what
they have actually done over the years.
8. The product
“Polyscore” that has been developed is not generalizable, is atheoretical,
and aphysiological.
9. The only
documentation available to us was a second draft of a paper they
hope to publish. It contains little that is not found in their
User’s Guide to the Polyscore system. It is marred with inaccuracies
and would not pass muster with reasonable editors.
10. They are
woefully ignorant about: a. the polygraph used to collect the
data used by them. For example, they had no idea about the filter
characteristics of the amplifiers used. b. about facts associated
with recording techniques. For example, the nature and meaning
of measures derived from the blood pressure recording procedure.
c. the simple fact that speech affects the recording of respiratory
activity and that such activity needs to be taken into consideration
when scoring the data. d. that there are significant differences
in the resting levels of skin conductance as a function of skin
color and probably age.
11. With all
the hype about the reliability and accuracy of data analysis the
question about its relative accuracy is still unanswered.
12. One recommendation
agreed to by the panel was the need for a broad-based look at the
polygraph, the tool used to acquire the information. The current
effort focuses solely on data analytic procedures.
What is needed
is a closer look at: sensors, amplifiers, A/D converters, data
loggers, and data analysis techniques. The polygraph, as currently
configured, has not changed much from the days of the old Keeler
Polygraph. We think that current technology should allow for the
utilization of better amplifiers as well as broadening the array
of physiological measures that might be used.
|