Computerized Polygraph Systems

We carry the Stoelting line of quality polygraph systems, leading the industry since 1886. These computerized polygraph systems are state-of-the-art in "lie detection."

Stoelting Model 85000      Price: $8675
Stoelting Model 85000-1    Price: $8100
Stoelting Model 85300      Price: $6425
Stoelting Model 85300-1    Price: $5850
Stoelting Model 85225      Price: $9675
Stoelting Model 85225-1    Price: $9100

Accessories Available on Request

Export Restrictions:

"The U.S. Department of Commerce requires an export license for any Polygraph System shipment with an ULTIMATE destination other than: Australia, Japan, New Zealand or any NATO Member Countries excluding the Czech Republic, Hungary, and Poland. It is against U.S. Law to ship a Polygraph System to any other country without an Export License. If the ULTIMATE destination is not one of the above listed countries, Stoelting will apply for the necessary Export License. There is no charge or obligation for this service, but we must be furnished with the exact name and address of both the purchaser and end user. It usually takes 6-8 weeks to obtain the required license."


Peer Review of the Stoelting CPS:

Dear Colleague:

At a meeting organized by the Central Intelligence Agency, a group of distinguished scientists evaluated the Polyscore algorithm developed by Johns Hopkins University/Applied Physics Laboratory. The Polyscore algorithm is used on Axciton and Lafayette instruments, NOT Stoelting. The meeting was held at the APL facility and was attended by their representatives.

Excerpts of the scientists' comments are below:

"The effects of amplifiers, filters, and sampling procedures have clearly adulterated the data."

"This is not a scientific project."

"...there is no reason to believe that the model developed at APL has any power whatsoever to discriminate between truthful and deceptive subjects."

"In summary, the contractors [APL] have developed, delivered, and sold an algorithm to separate DI from NDI subjects that has no demonstrated validity."

A complete copy of the peer review of the scoring algorithm used by Axciton and Lafayette is attached. For your convenience, we have italicized some of their more significant comments. Thank you for your interest in Stoelting's Computerized Polygraph System. Only Stoelting determines a suspect's truthfulness using software based on data from verified U.S. Secret Service criminal investigations.

 

Report of Peer Review of Johns Hopkins University/Applied Physics Laboratory
Stephen W. Porges, PhD,
Chair of Review Committee

Date: March 28-29, 1996 Location: JHU/APL Facility, Laurel, MD

 

Summary: A team of distinguished scientists met to evaluate the research conducted by JHU/APL on automated scoring of polygraph examinations. The committee chaired by Dr. Stephen W. Porges included Dr. Raymond Johnson, Dr. John Kircher, Dr. John Stern, and Dr. John Thornton. Dr. Porges is a former President of the Society for Psychophysiological Research, a Professor of Human Development and Psychology at the University of Maryland, and an active researcher for the past 30 years in the area of autonomic psychophysiology, physiology, and signal processing. Dr. Porges has chaired and participated on review committees for several government agencies including the National Institutes of Health. Dr. Johnson is an Associate Professor of Psychology at Queens University and has been an active researcher for over 20 years in the area of forensic psychophysiology. Dr. John Kircher is an Associate Professor of Educational Psychology at the University of Utah and has been an active researcher for about 20 years in the area of forensic psychophysiology with special expertise in statistical models and methods for detecting deception from autonomic responses. Dr. John Stern is the former chairman of the Department of Psychology at Washington University, a former President of the Society for Psychophysiological Research, a pioneer in psychophysiological methodology and application, and is currently involved in evaluating the use of eye movements in the detection of deception. Dr. John Thornton is a biostatistician and a former department chairman of biostatistics at a major medical school. The team had collective research experience covering all areas of statistics, signal processing, and psychophysiology relevant to the JHU/APL project. Based on scientific peer-review criteria, the evaluation team viewed the JHU/APL effort as inadequate in the application of scientific technology.
The committee was perplexed by the theoretical and descriptive approach conducted by the research team, as well as their lack of understanding of physiology, psychophysiology, signal processing strategies, physiological monitoring hardware, and statistical theory. Specific issues will be discussed below.

SOW History: JHU/APL was given the task to develop an algorithm that discriminates between deceptive and non-deceptive cases in the data set provided. The data set consisted of polygraph tests on which two expert polygraphers had agreement. The JHU/APL team developed an algorithm that works well with the data set provided. The algorithm was able to replicate with a high degree of accuracy the decisions of the expert polygraphers. Thus, based upon the initial description of the task, JHU/APL has completed the task. However, from a scientific point of view, especially in terms of test development and application of the algorithm to the field, the approach was faulty from the start. Questions such as whether the algorithm works, under what circumstances the algorithm works, and whether the algorithm works better than other algorithms have not been addressed. From a scientific standpoint, these questions are not appropriate, because the strategy used to develop the algorithms represents an atheoretical and naive approach to psychophysiological signal processing and test development.

Basic Problems with the project

1. Generalizability of the JHU/APL algorithm. JHU/APL approached the question as an empirical task of developing an algorithm that distinguished the polygraph tracings designated as deceptive from those designated as not deceptive. This empirical direction is frequently articulated by the JHU/APL team as the “power” or the “strength” of the computers and their ability to extract the most meaningful patterns. Unfortunately, no approach is truly empirical. All approaches to statistical analysis are limited by the quality of the data input and subsequently dependent upon sampling theory. The true goal is not to separate the “given” data set into two groups, but to develop an algorithm that may be “generalized” and used in field applications. Thus, we must shift from “descriptive” atheoretical empirical descriptions to inferential statistics and their associated distributions of estimators and, of course, the probabilities of these estimators. In the polygrapher’s world, this would be stated in terms of the “probability” that an individual is innocent or guilty. Methodologically, the JHU/APL algorithm can be used with the data set from which it was developed, but generalization of the probabilities based on the development sample to field tests is not statistically appropriate. Basically, given the methods used to develop the algorithm, there is no scientific basis for the algorithm to be generalized to field application. Thus, the marketing of the Polyscore algorithm as providing a probability of deception in the field is unwarranted and professionally irresponsible.

2. Hardware used to collect the data for the development of the JHU/APL algorithms.
The committee was critical of the Axciton computerized polygraph system that was used to collect the data. To the committee the technology of this system appears to represent 40-year-old hardware. Although the outputs of the amplifiers are computerized, the amplifiers and the interface with the subject reflect an old technology. JHU/APL was unaware of the specifications of the amplifiers of Axciton or the potential confounding effects of the time constants and filters designed into the amplifiers on raw data. JHU/APL had no understanding or knowledge of the transfer functions of the amplifiers. This appeared to the review committee as inappropriate, since JHU/APL had received a Federal contract to generate a transformation function to statistically adjust data from another polygraph manufacturer to be similar to the Axciton data. No appropriate answer was obtained regarding the decision to select the system. Nor was there an adequate reason to recommend this system to polygraphers in government agencies. Apparently, because the JHU/APL algorithms were developed on data generated from the system, the current marketing of the algorithm forces users to purchase the Axciton.

It is important to acknowledge that all AC amplifiers (i.e., self-centering) filter data, especially lower frequency input. If the slow trend in GSR is important or if the latency to peak or recovery of GSR is important, this information might be lost or adulterated by the frequency response characteristics of the amplifiers. From an engineering point of view, this reflects poor design specifications. However, since JHU/APL does not have an experienced research physiologist or psychophysiologist or biomedical engineer on their team, the question of attenuation of input signal or corruption of signal by amplifiers and filters was not in their realm of expertise and, thus, not considered in conducting their task.
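The loss of slow-trend information described above can be illustrated with a minimal sketch. It is not taken from the review or from any manufacturer's specification: the first-order RC high-pass model, the 2-second time constant, the 0.1-second sampling interval, and the step-shaped input are all assumptions chosen only to show the effect.

```python
import math

def high_pass(samples, dt, tau):
    """Discrete first-order RC high-pass filter with time constant tau (seconds)."""
    alpha = tau / (tau + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

DT = 0.1    # sampling interval in seconds (assumed)
TAU = 2.0   # amplifier time constant in seconds (assumed)

# Hypothetical input: the level rises by one unit at t = 5 s and stays there.
raw = [0.0] * 50 + [1.0] * 300
filtered = high_pass(raw, DT, TAU)

peak = max(filtered)
peak_index = filtered.index(peak)
# Number of samples until the filtered trace falls below half of its peak.
half_index = next(i for i in range(peak_index, len(filtered))
                  if filtered[i] < peak / 2)
half_recovery_s = (half_index - peak_index) * DT

print(f"raw level at end:      {raw[-1]:.3f}")
print(f"filtered level at end: {filtered[-1]:.6f}")
print(f"apparent half-recovery time: {half_recovery_s:.1f} s "
      f"(tau * ln 2 = {TAU * math.log(2):.2f} s)")
```

In this sketch the input never recovers, yet the filtered trace returns to baseline with a half-recovery time close to tau times ln 2. Under these assumptions, a "recovery" feature measured on the filtered trace describes the filter rather than the input.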

3. Software. The software incorporates several stages of smoothing, filtering, and signal extraction. Each of the stages represents a decision to extract specific features from the data. Although the JHU/APL team argues that the approach is empirical and the “computer” selects the features, this argument is not totally accurate. Decisions were made by the team and each decision functioned to “filter” the entire empirical data set. The transfer functions of each of these steps need to be “empirically” presented. There is no doubt the several iterations of the algorithm were accomplished via hard work. However, the decision sequences and the analyses supporting these decisions were not available for review. The development of an algorithm is not an empirical exercise as suggested by JHU/APL, but rather it must be based upon statistical theory and requires clear statements of these assumptions and consequences for each transformation and each filter.

4. Physiological systems. Since the 1880’s researchers have been quantifying electrodermal activity in response to a variety of stimuli including “words” and “thoughts” and “visualizations.” Even Carl Jung looked at GSR to words. There is a large literature in psychophysiology, physiology and biomedical engineering regarding the quantification and feature detection of physiological data. JHU/APL appeared to be totally ignorant of this literature on the special problems of signal processing and filtering of physiological signals.

5. Summary of scientific effort. This is not a scientific project. The project represents a contractual relationship between Federal agencies and JHU/APL to distinguish two groups in an existing data base with a computer algorithm. If we accept that as the task, the contract has been fulfilled. However, if we were to evaluate this project in terms of scientific quality, and hold this project to the criteria of peer review by psychophysiologists, signal processors, statistical modelers, or psychometricians, our evaluation would be negative. For example, the methodology does not meet the standards for publication in a quality peer-reviewed journal.

Factors contributing to the inadequacy of the project

1. Functioned in scientific vacuum. The project has functioned as if it were in a scientific vacuum and has not incorporated knowledge of several related disciplines. For example, specific features of GSR have been known for over 30 years to be sensitive to psychological phenomena. For over a decade research in forensic psychophysiology has documented that several of these features are related to deception. These features are easily identified and quantified and should, at least, be entered into the model to determine whether they discriminate better than the “empirically” derived variables. Empirically derived variables may, at times, miss theoretically important or relevant characteristics. Moreover, as stated above, there are no “truly” empirically derived variables, because decision rules have been made regarding the collection of the raw data (e.g., digitizing rates, transfer functions of the amplifiers, etc.) and the processing of the data (e.g., numbers of samples to look at, features to detect such as slope or peak or level, etc.).

The evaluation committee was surprised there was no input into this project by individuals who were expert in the problems of signal processing of physiological signals. Physiological signals present special problems, because of their unique statistical characteristics such as nonstationarity, dynamic changes, physiological state changes, and latency characteristics. Once this is understood, it is clear that transfer functions of the amplifiers and software filters would impact on true physiological signals, and this impact may have a predictable pattern. The JHU/APL team was ignorant of these issues and argued adamantly that these problems were irrelevant. The evaluation team viewed the JHU/APL team as inappropriate and unprepared to deal with the research problem. The credentials of the JHU/APL team do not fit the skills necessary for a successful treatment of the task.

The JHU/APL team argued irrelevant points, such as relating sending a missile into space (a task to which JHU/APL contributed) to developing an acceptable algorithm to deal with psychophysiological data. In addition, several government employees have argued that the “how and why” of the development of the algorithm does not matter. Rather, it is whether the algorithm works or does not work. This is not an appropriate question, because the ability of the system to work is dependent upon the physical and sampling characteristics of the data. The effects of the amplifiers, filters, and sampling procedures have clearly adulterated the data. Thus, even with the best statistical techniques, the findings would be limited. However, the statistical techniques employed suffer from a lack of understanding of statistical theory and methodologies that would allow one to generalize findings from a large sample to the field application of a single client.

Finally, there is no documented scientific product that can be scientifically evaluated. In science this occurs in terms of peer-reviewed publications. How would this project fare in the Journal of Applied Physiology, or an IEEE journal or a signal processing journal or a psychophysiology journal? How would peers evaluate the product? The answer to these questions is clear: the methodology would be viewed as inadequate.

An interesting example of this criticism occurred at the 1994 meeting of the Society for Psychophysiological Research. This meeting occurred while I was President of the Society. As President of the Society for Psychophysiological Research, I was involved in the development of the program for the annual meeting this past October. My charge to the program chairman was to stimulate submissions from several affinity groups including forensic psychophysiology. One symposium focusing on forensic psychophysiology was accepted. I attended this symposium and was appalled by the attitudes expressed and the scientific bases of the JHU/APL presentation. My view was shared by many of the Society members who witnessed the talk. Following the meeting I received a copy of an email from a member who is conducting polygraph research in Europe. His message included the following statements:

To all concerned scholars and scientists:

During the last meeting of the Society for Psychophysiological Research, held in Atlanta, Ga., October 5-9, a symposium was scheduled on the topic of ‘Decision-Making Algorithms in Forensic Psychophysiology.’ The reason to address you is my concern about the ‘scientific’ contribution by Dale E. Olsen, John C. Harris, and Wendy W. Chiu (Johns Hopkins University), entitled ‘The Development of a Physiological Detection of Deception Scoring Algorithm’ and, to a lesser extent, the contribution by Patrick F. Castelaz (Loma Linda University) and John E. Angus (Claremont Graduate School), entitled ‘Deception Classification via Nonconventional Processing Techniques.’

I am sending a copy of this letter to the Board Of Directors of the Society for Psychophysiological Research, and I will ask them never to invite non-scientific researchers like these again for a presentation at any conference organized by a Society that fosters research relating psychology and physiology.

There are several issues associated with the above statements. First, the Government contracts are being conducted by researchers who are not respected by scientists conducting psychophysiological research. Second, the research was viewed as severely flawed and of no scientific value. Third, it not only presents the contract researcher poorly, but also reflects negatively on the sponsor. Basically, legitimate and established researchers view the current Government sponsored polygraphy research as wasteful and poorly conceived. Additionally, credible researchers who want to conduct scientific investigation of polygraphy do not have access to the minimal funding they require to complete well-designed research that would pass the criteria of peer review.

2. The fallacy of efficacy research. The JHU/APL contract is an example of efficacy research. Efficacy research is short-sighted and often useless when new methodologies are developed. Efficacy research is focused on proving that existing technologies work and not at understanding why they work. The issue is critical in the area of polygraphy. Most researchers as well as the public believe that polygraphy works with a “hit rate” greater than chance. The polygraphy community believes that the hit rate approaches 100%, while the scientific community believes that this is a vast overestimation. There are two approaches to dealing with this inconsistency in objective effectiveness. One is to evaluate the effectiveness or improve effectiveness by identifying strategies of “experts.” The other is to attempt to understand the mechanisms and processes mediating the features of physiological activity related to deception. The first assumes that polygraphy as currently configured works and that the primary research goal is to “streamline” the scoring procedure to be more reliable. It is assumed that if reliability is enhanced accuracy will be enhanced as well. To deal with the reliability issues, funds have been directed at computerized scoring to reduce scorer bias. The JHU/APL contract reflects this approach. The second assumes that better signal processing can be used to extract different physiological response variables, better paradigms can be developed to isolate physiological responses, and better decision making algorithms can be developed. The latter is a scientific approach with a goal of improved detection instruments via understanding the underlying mechanisms; the former is the approach in the industry with an unmodifiable and potentially obsolete product attempting to justify its utility.

Both approaches cannot co-exist. Their philosophical bases are in direct contradiction. Moreover, it is clear on which side of this philosophical argument Government agencies are. Although Government funding has stimulated novel evoked potential and eye movement research, it has directly impeded research with the more “traditional” autonomic nervous system variables. The Government has created two bottlenecks in the polygraph research program: 1) the dependence on the Axciton system; and 2) the development of algorithms by the Applied Physics Laboratory at Johns Hopkins. These two “research” decisions result in current efficacy research funding being directed at demonstrating their utility. For example, although Axciton appears to be a defective device in the transformation of physiological signals, the Government has allocated funds to “translate” these signals into a data base similar to that of other traditional polygraphs. Unfortunately, the traditional polygraphs also adulterate the physiological signals. Thus, the efficacy approach will result in wasted funds because the amplifiers distort the underlying electrophysiological and mechano-physiological signals. The obvious solution would be to use amplifiers and computer algorithms that accurately report and represent the physiological signals. This is a simple exercise that could be evaluated by competent scientists.

 

Supporting evaluations by Drs. Johnson, Kircher, and Stern

Dr. Raymond Johnson

Based on what I learned at the site visit, I had a number of concerns about both the results and the manner in which the contract was executed. To begin, the contractors made no attempt to take advantage of nearly 100 years of research either on the nature of the physiological measures that form the basis of polygraphy or on the methods that have been developed to quantify these signals. While the contractors were able to generate a measure for separating the deception indicated (DI) and no deception indicated (NDI) populations, it is very likely that they could have done a better job (see below) and spent considerably less time and money had they used the existing information.

While one might argue the above criticisms are irrelevant since they did derive a discriminator, the main point is that they never tested their method to determine if it works, and if it does, they did not determine the extent to which their algorithm works in different situations. That is, at present, there is not one single piece of evidence to confirm that this algorithm will provide an accurate indication of DI or NDI in a person outside the original sample used to make the model. This is another instance of the contractors ignoring a long history of existing knowledge, in this case, research design methods. Had they used any of a series of standard practices, they would have built their model on a random selection of one-half of the data, leaving the other half for testing the efficacy of the finished model. Perhaps the most appalling aspect of their presentation was the fact that they felt absolutely no need to test their algorithm before putting their product on the market. In addition, they sell this as a tested product and I saw no indication from any of the polygraphy community that was present that they understood that this was an entirely untested product.
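The split-half practice described above can be sketched as follows. This is a minimal illustration and not the contractors' procedure or data: the 80 cases, the 2,000 candidate features, and the rule "predict the label from a single binary feature" are all hypothetical, and every feature is deliberately pure noise so that any apparent discrimination comes from selection alone.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_CASES = 80
N_FEATURES = 2000

# Hypothetical data: labels and features are independent coin flips,
# so no feature carries any real information about the label.
labels = [random.randint(0, 1) for _ in range(N_CASES)]
features = [[random.randint(0, 1) for _ in range(N_CASES)]
            for _ in range(N_FEATURES)]

# Random split into a development half and a held-out half.
indices = list(range(N_CASES))
random.shuffle(indices)
development = indices[:N_CASES // 2]
held_out = indices[N_CASES // 2:]

def accuracy(feature, cases):
    """Fraction of the given cases where the feature value equals the label."""
    hits = sum(1 for i in cases if feature[i] == labels[i])
    return hits / len(cases)

# "Empirical" selection: keep whichever feature best fits the development half.
best = max(features, key=lambda f: accuracy(f, development))
development_accuracy = accuracy(best, development)
held_out_accuracy = accuracy(best, held_out)

print(f"development accuracy: {development_accuracy:.2f}")
print(f"held-out accuracy:    {held_out_accuracy:.2f}")
```

Because the feature is chosen as the best of many on the development half, its development accuracy is inflated well above chance even though the data are noise. The held-out half was not used for selection, so its accuracy is an honest estimate, which for noise features is expected to sit near chance. Only a test of this kind can show whether a selected model discriminates outside the sample that produced it.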

Along with a complete failure to test/validate the algorithm, the contractors also failed entirely to characterize their sample and thereby delineate properly for whom and what crimes the system might work. Upon questioning, they had little or no idea about what mix of subjects (e.g., sex, race, age) or crimes was used to develop the algorithm. I feel that, without proper studies, it is impossible to know if the algorithm (even assuming it works and is valid) can be applied at all, or in the same way, across the entire range of possible subjects and crimes. I believe that the contractors, in ignoring this point, have made a serious error that can only be exacerbated by the atheoretical manner in which they derived their algorithm. That is, when they chose to base their model strictly on these particular cases, they tied their algorithm very tightly to the particular characteristics of their sample. I would, therefore, not expect the algorithm to generalize or work well on persons or crimes not well represented in the original sample.

The idea that their algorithm does not easily generalize to other situations is supported by their statement that they would like to continue building their sample, doubling it if possible. This suggests that the weights, and perhaps the parameters, are still drifting. This reveals a number of things about their algorithm. First, it suggests that there is a considerable amount of variability that remains, raising doubts about using it to make important decisions about individuals’ futures. Second, it suggests they have over-modeled their sample, reducing its generalizability to cases outside the sample. Third, it suggests that their model does not capture well the essential differences between DI and NDI responses. Had this been the case, they should have been able to derive their model from a smaller, rather than a larger, sample. This is particularly important because the model is to be applied to individuals, rather than groups, in order to make a vitally important decision.

Failure to appreciate the fact that the polygraph itself affects the character as well as the quality of the data is yet another example of the contractors’ failure to obtain even a rudimentary knowledge of the subject they were working on. Even a cursory glance at any basic text on psychophysiological methods would have revealed the importance of filters and their interaction with all physiological signals. Consequently, one of their “features,” the one based on the Autonomic GSR signal, is almost certainly a measure of the time constant of the amplifier, rather than of a signal from the subject. Hence, the algorithm will not work properly if a different manufacturer uses a different time constant, or if the operator makes a change to the amplifier settings. Note that the algorithm could be adjusted to compensate for such changes in the time constant. However, without adequate knowledge that this part of the algorithm reflects amplifier characteristics and not subject responses, the contractors had no idea that such changes in the algorithm would be required. Thus, the accuracy of the algorithm is affected in ways the contractors did not understand, or warn about, because of their ignorance of basic information on the topic they were working in.

In the future, I feel certain that the contractors will return to government agencies to obtain additional money for developing their algorithm. Indeed, they can argue that they did as they said and fulfilled the contract, whether or not it was done as well, expeditiously or cheaply as they or someone else might have done had they first developed an expertise in psychophysiology. Therefore, it is hard for me to imagine that these investigators will not receive another government contract to either validate their method, or extend it to other formats, or both.

Thus, there are lessons to be learned from the way in which the previous contract was written and administered. First, any future contracts (with these or other investigators) must include a provision that any algorithm for separating DI and NDI persons be TESTED by scientifically accepted methods as the condition for fulfilling the contract. Second, the contractors should submit their research/experimental plan for review by a panel of scientists (as is done in all cases for non-military grants and contracts) employed by the funding agency. The funding agency may also want to consider using the panel to periodically assess the progress on the contracts so that any problems can be corrected as they occur, rather than after the contract has been completed. Implementing these procedures will dramatically reduce contract problems.

In sum, the contractors have developed, delivered and sold an algorithm to separate DI from NDI subjects that has no demonstrated validity. I would recommend that all use of this algorithm cease until its accuracy and validity can be demonstrated. Further, I would not give these contractors another contract without first assuring that there will be a considerably greater deal of oversight from scientists knowledgeable in the field of psychophysiology.

 

Dr. John C. Kircher

1. Overall Impressions. I was appalled by the procedures used at the APL to develop computer programs for analyzing polygraph charts. It is difficult for me to understand why the government would (i) fund efforts to develop algorithms to analyze signals from hardware that is 40 years out of date, (ii) fund investigators who have no formal training in the relevant scientific disciplines (physiology, psychology, and psychophysiology) and clearly made no attempt to study these areas on their own, and (iii) fund investigators who, even in their own areas of expertise (statistics, signal processing), show minimal grasp of basic principles of classical and modern measurement theory, effects of multiple nonlinear transformations on their signals, and issues of research design, generalizability, and shrinkage.

2. Alternative Analytic Techniques. The JHU/APL team did not consider, nor do they appear to be aware of alternative analytic methods such as those developed by David C. Raskin and me, which are described in detail in several scientific articles and book chapters.

3. Other Measures from the Existing Data Base. I am not sure, but I do not believe that any traditional measures of electrodermal and cardiovascular activity were among the 9,911 features these investigators report having examined (I believe with various types of signal transformations, the actual number of features examined by these investigators is many times greater than 10,000). Traditional measures would include features such as peak amplitude, rise time, half-recovery time, and onset latency (e.g., Coles, Donchin, & Porges, 1988; Greenfield & Sternbach, 1972; Martin & Venables, 1980; Fowles, Christie, Edelberg, Grings, Lykken, & Venables, 1981). It might be worth taking a look at a few traditional measures and correlating them with the features generated by the investigators at APL. It might give us some idea of what they are measuring.

4. Hardware. The major problem with Axciton hardware is the measurement of skin resistance from stainless steel plates. Even when properly recorded, SR shows baseline drift that is often greater than the magnitude of the physiological responses. As a consequence, on traditional polygraphs the examiner must frequently re-center the pen. Axciton’s solution to this problem is to filter the signal to reduce baseline drift. However, this filter distorts the signal and moves us one step further from the subject’s cognitive and emotional response to the stimulus. To measure SRR’s, the Axciton uses stainless steel plates. These plates are subject to polarization, bias, and movement artifacts.

I do not know if the Lafayette system uses stainless-steel electrodes, but it does record skin conductance. I doubt that the Lafayette hardware filters the SC signal in the same manner as the Axciton, if at all. If the Lafayette uses Ag-AgCl electrodes with electrode paste and records with a constant 0.5 V circuit as we do, there would be less polarization, bias, and minimal effects on sweat gland activity. I cannot imagine how the Polyscore algorithm, which was developed with data from the Axciton, could possibly correct for the differences between the filtered SR signals from the Axciton and the SC signals from Lafayette.

In the copies of transparencies provided by the APL and at one point during the meeting, the investigators noted that SR signals that exceeded the 12-bit resolution of the Axciton resulted in flat lines at the maximum (clipping of the waveform). The investigators then noted that this problem had been corrected. At what point was this problem corrected? Can it be assumed that many of the cases in their training set were collected while this was a problem? Were these strong SRR’s treated as missing values (which would bias results), or were they measured as they appeared in the data files? The occurrence of artificial ceilings in SRR’s would seriously affect the diagnosticity of various ‘percentile’ measures of ‘GSR,’ their selection for the logistic regression model, and the weights given them by the model.
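The effect of an artificial ceiling on percentile measures can be shown with a small sketch. The numbers are hypothetical and the percentile is the ordinary nearest-rank definition, not the Polyscore "pseudo-percentile", whose definition is not given here. The only fact taken from the text is the 12-bit limit, which allows counts from 0 to 4095.

```python
import math

ADC_MAX = 4095  # largest count a 12-bit converter can represent

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical response amplitudes in ADC counts, some exceeding the ceiling.
true_counts = [i * 100 for i in range(61)]          # 0, 100, ..., 6000
recorded = [min(c, ADC_MAX) for c in true_counts]   # clipped at the ceiling

for p in (50, 75, 90):
    print(p, percentile(true_counts, p), percentile(recorded, p))
```

In this example the 50th percentile is unchanged at 3000, while the 75th and 90th percentiles (4500 and 5400 before clipping) both collapse to 4095. Any feature built from upper percentiles, or from differences between them, therefore loses exactly the strong responses that clipping affects.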

5. Algorithm Development: Signal and Feature Transformations. The investigators perform numerous nonlinear transformations of signals prior to extracting features from them. The Axciton hardware filters low frequencies from the SR channel, and the Polyscore then detrends the filtered SR data. The signal from which the ‘GSR’ features are extracted is a nonlinear transformation of another nonlinear transformation of skin resistance. Even if the investigators measured something as simple as the amplitude of this signal, one wonders how that measurement relates to actual changes in skin resistance. The transformations do not stop there. They measure response magnitude in terms of pseudo-percentiles and in some cases differences between pseudo-percentiles. While these measures are functionally related to the subject’s responses to test questions, it is difficult to see how, and one wonders about the effects of variations in gain settings and even the number of questions on the obtained measures of electrodermal responses.
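To make the concern about filtering concrete, the sketch below passes a made-up skin resistance record through a generic first-order high-pass filter. The actual filter characteristics of the Axciton are not stated in this review, so the filter form and the constant `alpha` are assumptions chosen only for illustration.

```python
def high_pass(samples, alpha=0.95):
    """Generic first-order (RC-style) high-pass filter.

    alpha is a hypothetical constant, not an Axciton parameter.
    """
    out = [0.0]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


# Made-up record: flat baseline, then a sustained 10-unit change
# that persists for the rest of the record.
raw = [100.0] * 20 + [110.0] * 100
filtered = high_pass(raw)

true_change = raw[-1] - raw[0]  # 10.0 units in the unfiltered record
at_onset = filtered[20]         # filtered value at the first sample of the change
at_end = filtered[-1]           # filtered value 99 samples later

print(true_change, round(at_onset, 2), round(at_end, 4))
```

In this sketch the filter removes the drift, but it also makes a sustained change decay toward zero, so an amplitude measured on the filtered signal depends on when it is measured rather than only on the size of the underlying change.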

The cardio channel is detrended and then partitioned into high and low frequency components. I am concerned about the effect of detrending on measures extracted from these signals as well. The justification for detrending cardio signals is to correct for leaks in the pneumatic system. (The same justification is given for respiration channels since these are recorded from pneumatic bellows.) It seems to me that the solution to leaks in the recording system is to find out where there are leaks and fix them. I believe it is a mistake to transform all signals to ‘correct’ for the possibility that there might be a leak in the recording system. Again, with each transformation, the measurements we use to discriminate between truthful and deceptive subjects are one step further removed from the psychological processes that give rise to the changes in physiology we wish to assess.

Respiration signals are detrended and ‘baselined’ (pp 13-15 of the Algorithm Overview handout). I do not see how the respiration signal would change if only the baselining were done. Ultimately, the points of maximum expiration are aligned. I also fail to see any possible advantage in baselining respirations. I think it must distort the signal in some nonlinear manner to fit the constraint that all baseline values fall on a straight line. And why should it matter to the algorithm if the baseline points are aligned? There might be some advantage in applying this transformation if the baselined respirations were to be presented to polygraph examiners for visual inspection. However, I asked about this, and they do not show the baselined respirations to the polygraph examiner. I believe the transformations on respiration channels have only adverse effects on measurements from these channels because they distort the signals and do so unnecessarily since no visual analysis is ever performed on the transformed data.

6. Algorithm Development: Feature Extraction. APL approached their task as though no one had ever before quantified respiration, electrodermal, or cardiovascular activity and used them to draw inferences about psychological states or processes. The measures developed by APL bear little if any resemblance to the types of measurements made by psychophysiologists for many decades. They did not study this literature. For example, they did not know that the reductions in cardio pulse amplitude in their model could reflect an increase in blood pressure or a decrease in blood pressure depending on the pressure of the occluding cuff. Are we to believe that these investigators, with no training in psychophysiology, have identified new ways of characterizing psychophysiological events that are generally superior to the accumulated wisdom of research scientists who have been investigating these phenomena for over 100 years? Standards have been developed by the scientific community for measuring physiological activity, and these standards were completely ignored. Massive computation is no substitute for careful study of the scientific literature and the application of well established psychophysiological and psychometric principles.

One potential advantage in using a computer to quantify physiological reactions is that it can help to train polygraph examiners to be better chart interpreters. With all the nonlinear transformations of signals and features made by Polyscore and the nonstandard types of features used by this algorithm, there is no way the measurements made by Polyscore could possibly be used to train polygraph examiners to be better numerical scorers.

The investigators at APL have argued that an advantage in using the computer is that it evaluates aspects of the charts that the human interpreter could not possibly ‘see.’ The only possible way additional complexity could be considered advantageous is if it actually improves diagnostic accuracy, and there is no evidence of this. Otherwise, it is nothing but smoke and mirrors. If the algorithm is a complete mystery to the best scientific minds in the field, it could not possibly be viewed as anything but a black box to the field polygraph examiners who use it.

7. Algorithm Development: Feature Selection. Stepwise logistic regression was used with about 33 blocks of 300 different variables to select the features for the model. Stepwise regression procedures are notoriously unreliable: they capitalize on chance and yield highly inflated estimates of discrimination (Hocking, 1983). The problem is compounded by the enormous number of variables provided to the stepwise variable selection algorithm. Under these conditions, random numbers would yield perfect or near perfect discrimination between groups, which is precisely what these investigators report. Their approach borders on the absurd. It is the most extreme example of overfitting a training set that I have ever seen or could possibly imagine. I do not understand how a professional statistician could even consider developing a prediction equation with more than 10 variables per subject. Texts on multivariate statistics recommend 10-20 subjects per variable and certainly no fewer than five subjects per variable. Not only did these investigators fail to follow these recommendations, they missed the boat by two orders of magnitude. On purely statistical grounds, any claim that the results obtained in their training set will generalize to new cases is completely without merit.
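The claim that random numbers would discriminate well under these conditions can be checked with a small simulation. The sketch below is not stepwise logistic regression and is not the APL procedure; it swaps in a much simpler stand-in (keep the features with the largest group mean difference, then classify by their signed sum) and uses purely random features. The sample sizes, feature counts, and function names are all hypothetical.

```python
import random
import statistics


def noise_data(n_subjects, n_features, rng):
    """Pure noise: the features carry no information about the labels."""
    labels = [i % 2 for i in range(n_subjects)]  # 1 = 'deceptive', 0 = 'truthful'
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_subjects)]
    return rows, labels


def screen_features(rows, labels, k):
    """Keep the k features with the largest group mean difference."""
    ranked = []
    for j in range(len(rows[0])):
        ones = [r[j] for r, y in zip(rows, labels) if y == 1]
        zeros = [r[j] for r, y in zip(rows, labels) if y == 0]
        diff = statistics.fmean(ones) - statistics.fmean(zeros)
        ranked.append((abs(diff), j, 1.0 if diff > 0 else -1.0))
    ranked.sort(reverse=True)
    return [(j, sign) for _, j, sign in ranked[:k]]


def accuracy(rows, labels, model):
    """Call a case 'deceptive' when the signed sum of features is positive."""
    hits = 0
    for r, y in zip(rows, labels):
        score = sum(sign * r[j] for j, sign in model)
        hits += int((score > 0) == (y == 1))
    return hits / len(labels)


def run(seed=0, n_train=40, n_holdout=200, n_features=2000, k=20):
    """Select features on the training noise, then score fresh noise."""
    rng = random.Random(seed)
    train_rows, train_labels = noise_data(n_train, n_features, rng)
    hold_rows, hold_labels = noise_data(n_holdout, n_features, rng)
    model = screen_features(train_rows, train_labels, k)
    return (accuracy(train_rows, train_labels, model),
            accuracy(hold_rows, hold_labels, model))


if __name__ == "__main__":
    train_acc, holdout_acc = run()
    print(f"training accuracy: {train_acc:.2f}")
    print(f"holdout accuracy:  {holdout_acc:.2f}")
```

With far more candidate variables than subjects, the selected noise features should separate the training cases well while performing near chance on held-out cases, which is the pattern the paragraph above warns about.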

From a statistical standpoint, there’s no reason to believe that the model developed at APL has any power whatsoever to discriminate between truthful and deceptive subjects. Other indications that their model has limited generalizability include intimations by the investigators (no data) that the model for the ZOC test would (or does) not work well for MGQTs or in laboratory mock crime experiments. There’s no theoretical reason or empirical research to suggest that the patterns of physiological changes associated with truthfulness and deception should differ in ZOC and MGQ tests.

I think the use of logistic regression for computerized polygraphy is appropriate. It is a non-parametric method that is a bit more conservative than discriminant analysis. The relative power of these two methods for modeling differences between truthful and deceptive subjects depends on the extent to which the assumptions that underlie the two techniques are reasonable. Under conditions of multivariate normality, discriminant analysis is likely to be more powerful. However, I take issue with the claim by these authors that the probabilities of truthfulness/deception from discriminant analysis are ‘unreasonable’ (see Algorithm Overview Page entitled ‘Comparing Logistic Linear Regression’). For our presentation at the 1994 meeting of SPR in Atlanta, we created a discriminant function and a logistic regression model composed of the same set of measures from the same set of laboratory and confirmed field cases (89 deceptive, 74 truthful subjects). The correlation between the probabilities generated by the two models exceeded .999. If discriminant analysis produces ‘unreasonable’ probabilities, then the same must be said about logistic regression.

8. Algorithm Development: Case Selection for Model Development. The criteria for selecting cases for the training set differed for individuals classified as deceptive or truthful. Criterion deceptive subjects had confessed, and there was independent corroborating evidence that the confession was correct. We do not know the percentage of test questions within a test to which a subject confessed. Confessions about all issues covered even in ZOC tests are rare. What became of the data for questions within a test for which there were no confessions? Apparently Mike Capps and someone else decided if there was corroborating evidence. Was there independent corroborating evidence for every deceptive case (or relevant question) in the training set? What was the reliability of these judgments, and why was the corroborating evidence criterion not stated in the handout (page 6) or in any other descriptions of algorithm development supplied by the APL?

Apparently, a guilty plea by a suspect was also used as a basis for assignment to the deceptive group and perhaps even for ‘clearing’ an innocent suspect. There are serious problems with this criterion, since people enter guilty pleas for many reasons, only one of which is that they committed a crime and lied about it on their polygraph test. Important details about the reliability of the selection criteria for deceptive subjects were not presented, and I have grave concerns about the use of guilty pleas as a criterion for case selection. From the available information, it is not possible to estimate the proportion of subjects, or the proportion of answers to individual questions within tests, that were considered deceptive but were actually truthful. In all likelihood, the algorithm was trained on a ‘contaminated’ sample of deceptive cases.

The manner in which criterion truthful subjects were selected for the training set was even more problematic. The vast majority of this sample was composed of cases in which the original examiner and two independent evaluators agreed that the subject was truthful. I suspect that this sample contains few false negative errors, too few in fact. There is a lot of research which shows that a large percentage of truthful subjects produce ambiguous charts, especially in field tests. The outcome of a polygraph test is ambiguous when the subject responds similarly to control and relevant questions. These cases were systematically excluded from the sample of truthful cases since only clearly truthful charts are likely to result in complete agreement among polygraph examiners. It is not difficult to see why the algorithm was able to discriminate this homogeneous sample of clearly truthful charts from cases in the deceptive group. Inclusion of only clearly truthful charts biases estimates of the accuracy of outcomes from Polyscore, not only for truthful subjects but also for deceptive subjects. More importantly, it is likely to have undesirable effects on the selection of variables for the model. Variables that could be of great value in increasing the statistical distance between ambiguous truthful charts and deceptive ones would not be selected for the model because the stepwise algorithm does not know that these cases even exist.
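The effect of keeping only clearly truthful charts can be sketched with a toy simulation. Everything in it is an assumption made for illustration: each chart is reduced to a single score, truthful subjects are drawn from a normal distribution centered at +1, a chart counts as "clearly truthful" when its score exceeds a cutoff (a stand-in for unanimous examiner agreement), and a positive score is scored as a truthful outcome.

```python
import random


def simulate(seed=1, n=5000, clear_cutoff=1.0):
    """Compare accuracy on all truthful charts vs. only 'clear' ones."""
    rng = random.Random(seed)
    # Hypothetical chart scores for truthful subjects; many fall near zero
    # (ambiguous charts) and some fall below it.
    truthful = [rng.gauss(1.0, 1.0) for _ in range(n)]
    # Stand-in for unanimous examiner agreement: keep only high scores.
    clear_only = [s for s in truthful if s > clear_cutoff]

    def hit_rate(scores):
        """Fraction of charts scored as truthful (score above zero)."""
        return sum(s > 0 for s in scores) / len(scores)

    return hit_rate(truthful), hit_rate(clear_only), len(clear_only) / n


if __name__ == "__main__":
    all_rate, clear_rate, kept = simulate()
    print(f"accuracy on all truthful charts:   {all_rate:.2f}")
    print(f"accuracy on 'clear' charts only:   {clear_rate:.2f}")
    print(f"fraction of truthful charts kept:  {kept:.2f}")
```

In this toy model the filtered sample is classified perfectly by construction, while the full population of truthful charts is not, which mirrors the bias described above. It says nothing about the actual size of the bias in the APL training set.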

The investigators took great pains to state that their goal was not to determine the accuracy of polygraph examinations. However, it is not possible to separate the issue of the ‘accuracy of polygraph examinations’ from the task of developing an algorithm that has utility for field polygraphy. By systematically excluding inconclusive cases from the training set, they misrepresented the qualitative and quantitative nature of psychophysiological differences between the target populations of truthful and deceptive subjects. These populations are considerably more heterogeneous than the samples of cases included in the training set. Consequently, their model was not optimized to distinguish between truthful and deceptive field suspects, the observed differences between the groups are biased, and they fool themselves and their customers about the value of the algorithm for field applications.

9. Recommendations and Future Directions.
a. Stop funding the research at the APL. The research at APL has been an incredible waste of time and money. They completely ignored scientific standards for psychophysiological measurements and statistical analysis; they adulterated methods David Raskin and I pioneered almost 20 years ago; and they promote the algorithm for field use though it has never been adequately tested. One would think that a professional statistician with plans to generate thousands of new and untested measures would think to hold out a few cases in order to cross-validate the model. The only test of the validity of the algorithm was completed at DODPI shortly before the meeting in March. There were only four confirmed truthful subjects in this cross-validation study, and Polyscore misclassified one of them (75% correct). Statistically, this is no better than chance accuracy on truthful subjects. Whatever limited validity data exist speak more to the power of the control question test than to the quality of the research. Accuracy on deceptive subjects was much better (92% I think). However, with little information about the accuracy on truthful subjects, this figure is difficult to interpret. If the algorithm is biased against truthful suspects and yields a high rate of false positives, one would expect highly accurate decisions on deceptive subjects.

A far better use of funds would be to: (i) announce your interest in the development of computerized polygraph systems and solicit a range of proposals from the research community; or (ii) organize a meeting of a select group of psychophysiologists with some knowledge and interest in polygraphy. (It wouldn’t hurt to invite a few outspoken members from the user groups whose task would be to keep the feet of ivory tower types firmly planted on the ground.) The purpose of the meeting would be to develop the research plan for further development of computer algorithms for the detection of deception. The research program would be characterized by a coherent agenda, open discussion, consensus, and collaborative efforts by knowledgeable investigators. I am certain that the result would look nothing like the APL/Axciton system you have now.

b. Model the decision making of human experts who make accurate decisions yet tend to extract diagnostic information from the polygraph charts that is not included in the computer model.

c. Investigate new sensors. Autonomic measures that show considerable promise include skin potential, components of variance in heart period (vagal tone), and blood pressure. Pilot data from our lab and John Podlesny’s research with the Finapres blood pressure monitor suggest that measures derived from these channels are diagnostic. Moreover, they are more likely than additional characterizations of conventional polygraph channels to make independent contributions to existing combinations of weighted physiological measures.
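Returning to the cross-validation figure cited under recommendation (a), where three of four confirmed truthful subjects were classified correctly: a short calculation shows why such a result cannot be distinguished from guessing. The sketch assumes that "chance" means an independent 50% probability of a correct decision on each subject; that assumption and the function name are introduced here for illustration only.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more correct out of n if each is correct with prob. p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Three or more correct out of four by guessing alone.
print(prob_at_least(3, 4))  # 0.3125
```

Under that assumption, a guesser would do at least this well on four subjects about 31% of the time, so the observed 75% provides essentially no evidence of accuracy on truthful subjects.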

 

John A. Stern

1. This is a statistical Tour de Force - or is it a Tour de Farce?? I feel strongly that it is the latter. Their attempt at “model building” deals purely with the development of a statistical model, one that is purely descriptive in nature.

2. The effort to develop an algorithm that discriminates between deceptive and non-deceptive subjects is laudable; however, the effort is completely atheoretical, nor did it have the benefit of skilled polygraphers in suggesting variables that might (or might not) be entered into the discriminative equation.

3. The need for 500+ examinations to develop a “predictive equation” suggests a lot of error variance in the data base. The suggestion to develop an equation based on a smaller subset of the data and apply it to the remaining data set was considered “inappropriate” by them. They must have had second thoughts about this! We did receive a copy of a “cross-validation” report in which the data set (624 cases) was divided into four subsets. Using their “modeling” procedures, models were built for each data base, and they now claim only a 3% accuracy degradation when the model developed on one data base is applied to the other three sets. They go on to say “It also shows that although some models had no terms (features) in common, there is a strong correlation for some features and allowing LOGIT to choose among them on the basis of the data base results in equivalently performing models.” How is that for gobbledygook and scoring a touchdown for the other side? One must now ask how many samples are necessary for the development of a “model” and which is the most appropriate model. If all four worked equally well one must have doubts about their procedures!!!

At the meeting they were not satisfied with the number of cases in their data bank and felt they needed more. One wonders how many more cases will be required before they are satisfied that the equation is reliable and valid and field applicable - although it is being marketed and applied in field investigations. We wonder about the ethics of marketing an unvalidated product. We believe it to be highly unethical - regardless of the outcome of a “validation” study.

4. The models developed are highly restrictive. New models have to be developed for Zone scoring and the MGQT procedures. It would seem to me that a more general model would be desirable - if it is possible.

5. To my surprise, only polygraph data and no personal data - such as gender, age, race, etc. - enter into the predictive equation.

6. Though not articulated, or only poorly so, it appears that the “features” or “factors” cannot be described in terms of what component of the response in question is evaluated. With all their statistical manipulations it apparently becomes difficult to decipher the algorithm. The features used in the algorithm are purely empirically derived and the authors could not or would not specify what each feature measured nor identify weights associated with each feature. We were shown one of the Polyscore reports and I noticed that in that particular report 80% of the weighting was attributed to electrodermal variables. Ideally one should specify the decision rules identified in the algorithm and then evaluate these rules empirically.

7. Their response to questions was, at best, evasive and sometimes offensive. It appears that there is little in the way of documentation concerning their “research” efforts. They alluded to analyses done earlier but there’s no paper trail allowing for an evaluation of what they have actually done over the years.

8. The product “Polyscore” that has been developed is not generalizable, and it is atheoretical and aphysiological.

9. The only documentation available to us was a second draft of a paper they hope to publish. It contains little that is not found in their User’s Guide to the Polyscore system. It is marred with inaccuracies and would not pass muster with reasonable editors.

10. They are woefully ignorant about: a. the polygraph used to collect the data used by them. For example, they had no idea about the filter characteristics of the amplifiers used. b. facts associated with recording techniques. For example, the nature and meaning of measures derived from the blood pressure recording procedure. c. the simple fact that speech affects the recording of respiratory activity and that such activity needs to be taken into consideration when scoring the data. d. the fact that there are significant differences in the resting levels of skin conductance as a function of skin color and probably age.

11. With all the hype about the reliability and accuracy of data analysis, the question about its relative accuracy is still unanswered.

12. One recommendation agreed to by the panel was the need for a broad-based look at the polygraph, the tool used to acquire the information. The current effort focuses solely on data analytic procedures.

What is needed is a closer look at: sensors, amplifiers, A/D converters, data loggers, and data analysis techniques. The polygraph, as currently configured, has not changed much from the days of the old Keeler Polygraph. We think that current technology should allow for the utilization of better amplifiers as well as broadening the array of physiological measures that might be used.


This Web Site and Content Within are Copyright ©1999 Tracer Technology
All other logos and information are copyright by their respective owners