Yesterday marked the start of the Zimmerman case Frye hearing.  The purpose of this hearing is to determine whether expert testimony will be allowed at trial on the identity of the person screaming in the background of the Witness #11 911 call.

The relevant portion of that 911 call can be heard here.

Yesterday I posted what we knew about the various speaker recognition experts and their findings prior to the start of the Frye hearing.  Moving forward, I’ll be posting on the Frye testimony of each of the experts, starting with Dr. Nakasone.

[Disclaimer:  this post is based on typed notes taken in real time.  I used my best efforts to capture the information as accurately as possible, and to stay true to the intended content in summarizing the testimony, but small errors might be found if compared to some formal transcript.  I apologize for any such beforehand.]

Dr. Hirotaka Nakasone, PhD, Senior Scientist, FBI, Voice Recognition Program

As is usually the case with expert testimony, Dr. Nakasone was first asked to describe his background, work, and expertise in speaker recognition.  He testified that he works within the FBI’s Operational Technology Division in Quantico, VA.  The OTD houses the FBI’s world-class voice recognition capability.  He has a senior level position, reporting directly to the Head of that division, and oversees all work related to speaker recognition, forensic audio, and the development of voice biometrics technology.  He has been with the Bureau for 17 years, and in his current position since 2009.

In terms of academic and research background, he received his PhD in speech science in 1984, and publishes frequently in peer-reviewed scientific journals.  He is a member of numerous professional scientific associations, and is the head of a working group formed in 2012 with the mandate to create the first formal guidelines for the professional application of speech technologies and scientific research (such standards do not currently exist).

Highlights

  • Speaker identification methodologies are a useful and advancing, but fragile, technology.
  • Formal national and international standards for speaker identification methodologies and their application have not yet been developed; that work has only just begun, in a recently established working group led by Dr. Hirotaka Nakasone.
  • The FBI maintains world-class speaker identification and speech recognition capabilities, also led by the witness, Dr. Hirotaka Nakasone.
  • The FBI uses an SOP of quality assessment followed by a two-stage analysis.
  • The first stage of analysis involves the use of a computer-algorithm approach, using many thousands of samples in a database to evaluate the probability of a match (or no match).
  • The second stage of analysis is conducted independently and involves a certified and trained speech identification expert applying aural-perceptual techniques.
  • Only if BOTH methodologies determine that there exists a match does the final report conclude that a match exists.
  • Even if BOTH methodologies determine the highest confidence match, neither the report nor the experts involved are permitted to testify to that finding in a trial setting.  The findings are considered sufficiently robust only for investigative, not prosecutorial, purposes.  Trial use of these speech methods and findings remains years away.
  • No analysis will even be begun, however, unless the audio sample first meets certain minimal levels of quality—otherwise the result is Garbage-In, Garbage-Out, and not suitable even for investigative purposes.
  • Many factors can degrade the practical quality of a recording—that is, the usefulness of a recording for analysis.
  • These include intraspeaker variation, technology-induced distortions, environmental confounders, and the duration and phonetic balance of speech in the recording.
  • Intraspeaker variation refers to natural variations in any single individual’s speech—no person says the same word exactly the same way twice.  Despite this, good success in speaker identification can be achieved when working with “normal speech”.
  • Even small variations from “normal speech” easily negatively impact analysis.
  • Speaking with emphasis or while tired can confound analysis.  High levels of stress or intoxication are even more problematic.
  • Screaming in moments of intense stress cannot realistically be analyzed for speaker identification purposes, because the physiological mechanisms by which such a scream is produced are profoundly different from those used for normal speech.
  • Technology-induced distortions include the compression commonly used in cell phones and consumer VoIP telephony to ease data loads.  The inexpensive components commonly built into these systems also result in a vastly degraded recording if the speaker is more than a few feet away from the microphone.  The effect of these distortions is to “muddy” the recording and strip from it the minute sound elements required for accurate speaker identification.
  • Environmental confounders include background noise and the physical environment itself—different settings have very different acoustic characteristics.
  • Environmental confounders typically increase greatly with increasing distance between speaker and microphone, accentuating the degradation of quality of the recorded speech as distance increases.
  • The recorded speech also needs to contain some minimal amount and variation of content (data) in order to be subject to reliable analysis.  Too short a recording, or too little variation in a recording—e.g., a one-second “uh”—confounds analysis.
  • The minimum recording duration the FBI will consider for evaluation is 16 seconds.
  • In addition, they require at least 20 well-spoken words of high quality with a balance of phonemes before they will initiate analysis.
  • The portion of the Witness #11 911 call in which the scream can be isolated from confounding background noises was found by the FBI to be less than 3 seconds—well below the minimum 16 seconds required by the Bureau.
  • The screams recorded are by definition not “normal speech”, but rather the desperate screams of someone under the existential stress of facing imminent death, confounding their analysis.
  • The screams were recorded over a cell phone and then digitally recorded as a 911 call, subjecting them to compression degradation.
  • The person or persons screaming were many yards beyond the optimal range of the cell phone’s microphone, inducing further technological degradation.
  • The FBI was unable to submit the recording to actual analysis because the output would have been meaningless.
  • The FBI’s findings, then, were necessarily inconclusive.

THE LONG VERSION

Speaker Identification:  An Advancing but Fragile Technology

When asked to comment generally on the state of the art of speaker identification, Dr. Nakasone noted it is the consensus among speech scientists that the technology has advanced considerably in recent years—but with the important caveat that the technology remains “fragile—it can break without warning.”  This fragility is also part of the consensus of the scientific community.

Speech Recognition versus Speaker Identification

On a matter of technical clarification, Dr. Nakasone noted that there are important distinctions between speech recognition and speaker identification.  In speech recognition the goal is to understand what is being said.  We’ve all experienced this in the form of automated phone systems, in which one can speak information into a phone—“Representative!”—rather than press a button—“0”.  Speech recognition is not at all interested in distinguishing between different speakers—indeed, to work properly it is important that it not distinguish between different speakers.  It must be able to properly discern a spoken word regardless of whether it was spoken by a male or female, old person or young, alert or tired, and so on.  As a result, speech recognition systems generally sacrifice unique details of an audio transmission in pursuit of improved performance.

In contrast, in speaker identification the emphasis is on the unique characteristics of one audio recording that are consistent with, or that differ from, another audio recording.  Here the emphasis is on highlighting these unique characteristics, even at the cost of losing the intended content of the communication.  If the goal is to match (or differentiate) two recordings it is not really important if that goal is achieved using the spoken word “yes” versus “no”, “up” versus “down”, and so on, even though those variations would profoundly change the intended message of the communication.  As a result, methodologies attuned to speaker identification typically sacrifice the contextual elements of the recording.  A good example of this is a spectrograph, which produces a visual representation of an audio file.  Such a visual representation allows one to literally “look” for similarities and differences between audio files, but does not communicate the actual content of the spoken communication.

Three Methods of Speaker Identification:  Spectrograph, Aural-Perceptual, Computer Algorithm

Dr. Nakasone also spoke to the three primary methods of conducting speaker identification:  spectrograph, aural-perceptual, and computer algorithm.

Spectrograph

Spectral voice comparison is the oldest technology-based method used for speaker identification, having been first introduced in the early 1960s.  (This approach, or a variation of it, is purported to have been used in this case by speech expert Dr. Alan R. Reich.)  Although once promoted as offering a “voiceprint” unique to every person, much as a fingerprint, it has in fact fallen far short of this promise.  Today, Dr. Nakasone testified, the spectrograph “has run out of time to prove its utility.”  The interpretation of what is being displayed by a spectrogram has been shown to be entirely dependent on the subjective characterization of the observer.  Even trained speech scientists attempting voice comparison can look at the same spectrogram and come to completely different conclusions, making it impossible to scientifically validate the technology.  Indeed, scientific efforts to do so were abandoned as long ago as 1972.

Aural-Perceptual

The aural-perceptual method of speech analysis is the oldest of the three methodologies, as it essentially involves simply disciplined listening.  (Dr. Nakasone humorously noted that a basic form of this methodology dates back at least to Biblical times.)  Dr. Nakasone did not spend much time discussing this approach, other than to note that the FBI does use it as a key component of its speech identification methodology—albeit only in combination with computer algorithm methods.

He also noted the extreme subjectivity of aural-perceptual analysis.  There is a tendency in human nature, referred to as linguistic fitting, whereby a person who is told there is speech buried within background noise will strive to discern what words are being spoken.

Often the result is merely a guess, but once a guess has been made there is a tendency to believe with increasing confidence that the guess was correct—it seems to the observer that the word becomes clearer with repeated listening.  Unfortunately for scientific analysis, different researchers working independently often emerge with equally high levels of confidence that they have heard different words.

In addition, if they are not working independently there is a high tendency for the conclusions of one researcher to bias the findings of another.  This notion of suggestibility is well recognized in the speaker recognition community.  In the FBI, speaker recognition experts develop their findings independently, to avoid such bias.

Computer Algorithm

The computer algorithm methodology of speaker identification relies on a vast database of voice recordings and the use of computer analysis to find commonalities and differences between submitted audio recordings.   When an audio recording of adequate quality is submitted to computer analysis the outcome is a numerical score that is translated to a qualitative assessment of confidence of match (or mismatch).  Formal cross-organization standards for this type of evaluation are only in the earliest stages of being developed under Dr. Nakasone’s leadership in the working group.

The FBI SOP for Speaker Identification:  Quality Assessment & Analysis

The FBI has formally defined Standard Operating Procedures (SOPs) for any speaker recognition analysis it undertakes, to ensure that it is following the same steps in each case.  The first step is a quality assessment of the audio file to be analyzed.  I’ll actually discuss this second in this post.  Instead, we’ll first discuss the second half of the process, which is the two-stage analysis of the audio recording.

Analysis:  Computer and Human

When an audio recording of adequate quality is submitted to the FBI, the first analytical step uses the computer-algorithm methodology.  This results in a logarithmic score that is a function of the ratio of two numbers.  The numerator is the probability that the recording is a match with the target speaker, and the denominator is the probability that the recording was generated by a different speaker in the database.  This quantitative score is then converted to one of five qualitative conclusions:  match, probable match, inconclusive, probable not match, or not match.  (Note that neither “match” nor “not match” is an absolute finding; each merely reflects a higher level of confidence.  There is no absolute match possible.)
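As a rough illustration of the scoring just described, the sketch below computes a log-likelihood ratio from two probabilities and maps it onto the five qualitative conclusions.  The function names, the choice of a base-10 logarithm, and the cut-off values are all hypothetical, chosen only for illustration; the testimony did not disclose the FBI’s actual thresholds.

```python
import math

# Hypothetical cut-offs on a base-10 log-likelihood-ratio scale.
# The testimony did not give the FBI's actual thresholds.
STRONG = 2.0   # ratio of roughly 100:1 or better
WEAK = 1.0     # ratio of roughly 10:1 or better


def log_likelihood_ratio(p_target: float, p_other: float) -> float:
    """Log10 of P(recording | target speaker) / P(recording | other speaker)."""
    if p_target <= 0 or p_other <= 0:
        raise ValueError("probabilities must be positive")
    return math.log10(p_target / p_other)


def qualitative_conclusion(score: float) -> str:
    """Map a log-likelihood-ratio score onto the five-level scale."""
    if score >= STRONG:
        return "match"
    if score >= WEAK:
        return "probable match"
    if score > -WEAK:
        return "inconclusive"
    if score > -STRONG:
        return "probable not match"
    return "not match"
```

With these illustrative cut-offs, a score of 0 (both explanations equally likely) falls in the “inconclusive” band, and the bands are symmetric around zero.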

Once the computer algorithm methodology is complete, the audio recording is given to an FBI technician trained and certified in aural-perceptual techniques.  This analysis is done independent of the computer findings, to avoid bias.  (In this case, Dr. Nakasone himself conducted this analysis.)

The final finding of a match (or mismatch) is a fusion of the computer-generated score with the conclusion of the human analyst.  Only if both the computer and the technician have concluded that there is a match will the FBI assign “match” as a final determination for the recording.  If the technician decides that there is a match, but the computer does not, the final finding will be “inconclusive”.
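The fusion rule can be sketched in a few lines.  The testimony covered only two cases (both say “match”, and the technician says “match” while the computer does not); the sketch below generalizes that by assuming any agreement yields the shared conclusion and any disagreement yields “inconclusive”.  That generalization, and the function name `fuse`, are assumptions made for illustration and not a description of the FBI’s actual procedure.

```python
SCALE = (
    "match",
    "probable match",
    "inconclusive",
    "probable not match",
    "not match",
)


def fuse(computer: str, examiner: str) -> str:
    """Combine two independently reached conclusions into a final finding.

    Assumption: agreement returns the shared conclusion; any disagreement
    is reported as "inconclusive".
    """
    for conclusion in (computer, examiner):
        if conclusion not in SCALE:
            raise ValueError(f"unknown conclusion: {conclusion!r}")
    if computer == examiner:
        return computer
    return "inconclusive"
```

Under this rule, `fuse("probable match", "match")` returns “inconclusive”, which is consistent with the example given in the testimony.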

Even a Maximum Level Match Does Not Qualify Findings as Trial-worthy Evidence

Importantly, even if the FBI’s computer-algorithm methodology and its human methodology both conclude that the recording is a match for a particular individual, the Bureau’s experts are still not permitted to testify to that effect in court.  The FBI issues its “match” report merely as investigative guidance.  They do not go into court and testify as to the results of the analysis, and being able to do so is years in the future.  “We have a long way to go before the system is ready to be rendered as opinions in the court room.”

Quality Assessment

Dr. Nakasone’s testimony made very clear that the quality of a recording, or lack thereof, is perhaps the single most important factor in being able to confidently identify a particular speaker or identify particular speech.  The quality of audio recordings subject to analysis varies enormously based on many factors, including intraspeaker variation, technology-induced distortions, environmental confounders, and the duration and phonetic balance of the speech.

Intraspeaker Variation

Even a single individual repeating the same word into a studio-grade microphone in a sound-controlled environment never records exactly the same audio twice.  There are simply too many physiological components involved in speech for them ever to be precisely duplicated.

When a person is speaking normally, these variations are generally slight.  Even modest changes in a person’s affect can tremendously change their voice patterns, however.  Merely speaking with emphasis, or while tired, can greatly increase the difficulty of matching the recording to that person’s normal speaking voice.  High levels of stress or being subject to meaningful levels of intoxication can make a match practically impossible.

This, of course, is common to all our experience—it’s why we can tell when someone “sounds tired,” or “sounds angry”, or “sounds drunk.”  At the extremes of human stress it becomes impossible to extract any useful identifying information from an audio recording—even differences in gender or great differences in age cannot be confidently discerned.

Technology-induced Distortions

It is, of course, possible to make very high-quality audio recordings using professional recording equipment in a well-designed sound studio.  Such a recording would carry a maximum of the data necessary for optimal speaker identification.  In the real world, however, a great many recordings that undergo speaker identification analysis are the result of recorded phone conversations.  Modern land-line phones induce only modest distortions in audio recordings.  Cell phone and internet (VoIP) telephony, however, both can introduce major distortions.

Compression

There are two technological causes for the major distortions introduced by cell phone and VoIP telephony.  First is the degree of compression used.  Compression refers to the mathematical process of stripping an audio communication of what is considered unimportant information, so as to reduce the amount of data that must be transmitted to the recipient.  In the case of telephony it is far more important that the recipient understand what is being said than that they can identify, from the audio, the identity of the speaker (typically, the speaker’s identity is communicated explicitly—“Hey Dan, it’s Joe in accounting.”)  These systems are therefore designed to prioritize the content of the speaker’s communication over the speaker’s identity.  As a result, many of the specific variations that distinguish one individual’s speech from another’s—or that mark two recordings as having come from the same individual—are stripped out in the process of being transmitted over a cell phone or VoIP.

Poor Quality Components of Limited Capability 

A second cause of technology-induced distortion is the relatively low cost of the components built into cell phones and most consumer VoIP equipment, and the limited range of circumstances for which their use is optimized.  A cell phone’s microphone, for example, might cost well under a dollar (orders of magnitude less than even an inexpensive studio microphone), and is designed to work best for a single speaker whose mouth is within inches.  Even then, much identifying information has been stripped out, as already discussed.  By the time the speaker has moved as many as three yards from the phone the quality of the recording is badly degraded and very difficult to decipher even under the best of circumstances.

Environmental Confounders

The issue of distance between the speaker and the microphone also raises its ugly head in the context of environmental confounders.  The greater the distance, the more likely it is that sounds other than the speech of interest will be recorded along with, or over, that speech.  Background noise and vibration can be obvious problems, especially if they are located closer to the microphone and therefore are more dominant than the speech of interest.  The physical environment itself can also make speaker identification very difficult.  Our voices are merely sound waves that emanate from our mouths and bounce around our environment.  Different environments induce different distortions into that speech.  A well-designed theater will cause little distortion of the actors’ voices, but other environments (especially the outdoors) are not as cooperative.  Perhaps no common experience better illustrates the distortions induced by both distance and environment than the conference call using a speaker phone—we’ve all heard the abrupt increase in quality when the other speaker “picks up the phone.”

Duration and Phonetic Balance

The duration of an audio recording is of critical importance in speaker identification because it defines the quantity of raw data with which an analyst can work.  It is intuitive that there must be some minimal level of content essential for any analysis of reasonable confidence.  The FBI and other speech analysis experts in the scientific community, such as the MIT Lincoln Laboratory, have arrived at a consensus minimum of 16 seconds.  Any recording of shorter duration is deemed not suitable for speaker identification.

Of course, it’s not the nominal length of the recording that’s important, but the length of recording that contains only the sound of the speech of interest.  It is common that the speech of interest is only a fraction of the total recording, say in a recording of two persons having a conversation when only one of them is a target of interest.  Even then, that person of interest’s speech must be isolated from all other sounds. If there is concurrent sound “stepping on” the speech, it cannot be effectively analyzed.  The result of these restrictions is that the length of audio being analyzed, especially when captured from a non-studio setting, is almost invariably a great deal shorter than the length of audio submitted for analysis.
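To make the distinction between nominal and usable length concrete, here is a minimal sketch that subtracts every interval containing other sound from the intervals containing the speech of interest and totals what remains.  The function names and the timestamps in the usage note are hypothetical, and the sketch assumes the target intervals do not overlap one another; it is not drawn from any FBI tool.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def subtract(target: Interval, blockers: List[Interval]) -> List[Interval]:
    """Return the parts of `target` not covered by any blocker interval."""
    start, end = target
    clean: List[Interval] = []
    cursor = start  # left edge of the region not yet examined
    for b_start, b_end in sorted(blockers):
        if b_end <= cursor:
            continue  # blocker lies entirely before the remaining region
        if b_start >= end:
            break     # blocker (and all later ones) start after the target
        if b_start > cursor:
            clean.append((cursor, b_start))  # gap before this blocker
        cursor = max(cursor, b_end)
        if cursor >= end:
            break
    if cursor < end:
        clean.append((cursor, end))  # tail after the last blocker
    return clean


def usable_seconds(targets: List[Interval], blockers: List[Interval]) -> float:
    """Total duration of target speech that no other sound overlaps."""
    return sum(e - s for t in targets for s, e in subtract(t, blockers))
```

For example, a 10-second stretch of target speech interrupted by other sound from 2 to 4 seconds and from 6 to 7 seconds, `usable_seconds([(0.0, 10.0)], [(2.0, 4.0), (6.0, 7.0)])`, leaves 7 usable seconds.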

The length of the recording suitable for analysis is important not merely in terms of duration but also in terms of variation of content.  Obviously, a single tone continued for 16 seconds is of limited analytical utility.  Rather, analysts need a certain minimal level of variation in the recording.

One form of such variation that is required is what is termed phonetic balance.  There are about 44 different sounds, termed phonemes, in American English.  The word “hello”, for example, would involve four phonemes:  “H”, “E”, “L” and “O”.  Effective speaker identification requires a rich mixture, or balance, of different phonemes.  A recording of 30 seconds typically provides some combination of all 44 phonemes, but as the duration of the usable recording diminishes the number and variation of phonemes tends to diminish, as well.  In order to do even a spectrographic analysis (abandoned by the FBI as its primary speaker identification tool in favor of computer algorithm methods) it is necessary to have at least 20 well-spoken words of high quality (e.g., not subject to background noise or other distorting effects).
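The minimums described in the testimony can be pictured as a simple gate applied before any analysis begins.  The sketch below uses the two numbers cited (16 seconds and 20 words) plus a flag for whether the sample is normal speech; the `Sample` class, its field names, and the way the checks are combined are hypothetical.  No numeric threshold for phonetic balance was given, so none is modeled here.

```python
from dataclasses import dataclass
from typing import List

MIN_SECONDS = 16.0  # minimum usable duration cited in the testimony
MIN_WORDS = 20      # minimum number of well-spoken, high-quality words cited


@dataclass
class Sample:
    usable_seconds: float  # isolated speech of interest only
    clear_words: int       # well-spoken words free of noise or distortion
    normal_speech: bool    # False for screams or other extreme vocalization


def failed_checks(sample: Sample) -> List[str]:
    """Return the reasons a sample is unsuitable; an empty list means proceed."""
    reasons: List[str] = []
    if sample.usable_seconds < MIN_SECONDS:
        reasons.append("too short")
    if sample.clear_words < MIN_WORDS:
        reasons.append("too few clear words")
    if not sample.normal_speech:
        reasons.append("not normal speech")
    return reasons
```

A sample with under 3 usable seconds of screaming and no clear words would fail all three checks, while a 30-second sample of 45 clearly spoken words in a normal voice would pass.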

The Witness #11 911 Recording

At this point in the testimony the defense played the relevant portion of the recording of Witness #11’s 911 call for the court, and Dr. Nakasone testified that the audio played was substantially the same as that submitted to the FBI for speaker identification.

The relevant portion of the recording can be heard here.

Dr. Nakasone then discussed the process by which the FBI sought to determine the identity of the voice or voices shouting in the background of the recording.  He personally oversaw this analysis.

Upon receipt of the recording from the Tampa FBI office an enhanced version of the audio recording was produced.  Wanting to take no chances on missing important evidence, they conducted their analysis on both the original and enhanced versions.

The entire recording was a little over 45 seconds in duration, with the screams occurring for a period of about 18 seconds.  Dr. Nakasone went through these 18 seconds in their entirety and eliminated all sounds except where he could hear only the relatively isolated screaming voice—so excluding the portion of the recording when Witness #11 was speaking, or the 911 operator, and so on.

The duration of the resultant audio was less than 3 seconds.  If the recording is at 16 seconds there is some discretion for the examiner, but less than that and we have to terminate the examination at that point, determine that it is unusable and advise the sending agency that the FBI cannot do anything with that voice sample.

In addition, they quickly determined that the screaming was not normal speech, but was more akin to speech being uttered by someone facing an imminent threat of death.  Spectral analysis confirmed that the speech captured was outside the normal range of speech.  Even the FBI’s world-leading speaker identification system cannot effectively analyze speech that is abnormal or produced at a high emotional level.  Under stress, not merely the pitch of the speech changes, but all of the other articulating vocal musculature, the different muscle groups used to produce screams like that, are totally different than what occurs in normal speech.  The current state of speech science cannot yet tell all those different conditions apart.

The science does not yet exist to make those distinctions when the speech is well outside the bounds of normal speaking.  We could not submit the audio recording for actual analysis, because the output would have been meaningless.  The result was that our findings were necessarily inconclusive.

Cross-Examination by the Prosecution

The goal for the prosecution in this Frye hearing is to develop evidence supporting their argument that the methodologies used by their speaker identification and speech recognition experts are not novel and are generally accepted in the scientific community.

The prosecutor confirmed with Dr. Nakasone that the FBI continues to use all three speech analysis methods—spectrograph, aural-perceptual, and computer-algorithm—to at least some degree.

He also emphasized that when the FBI received the original audio file they didn’t simply discard it, they at least attempted an analysis, and that it wasn’t immediately obvious that the effort would be fruitless.  Dr. Nakasone concurred.

He asked if it wasn’t true that Dr. Nakasone’s concerns were not with the methodologies themselves—“I agree”—and that different scientists using the same methodologies may sometimes come to different conclusions.  “Unfortunately, yes.”

The prosecution obtained Dr. Nakasone’s acknowledgement that the working group he is leading to establish the first guidelines for speaker identification in fact represents only a minority of all speech experts.  He also pointed out that the FBI standards may or may not be copied by any working group.

He noted that there are a lot of variables that can affect how a recording is made that will impact whether it can be analyzed. “Correct.”

And that even when variables exist in some cases it is possible to make a speech comparison.  “Yes, assessing the boundaries very carefully, there are only a few cases where we can do meaningful comparison.”

But you still try to do them, because sometimes you can.  “Correct.”

Re-Direct by the Defense

The defense asked if the variables complicating how the FBI conducts its analysis (intraspeaker variation, technology-induced distortions, environmental confounders, and the duration and phonetic balance of the speech) are unique to the FBI’s procedure or would affect any methodology for speaker identification, whether it be spectrograph, aural-perceptual, or computer-algorithm.  Dr. Nakasone agreed that they would affect any methodology.

The defense also sought to confirm Dr. Nakasone’s earlier statement that the computer-aided part of the analytical methodologies was still a couple of years away from being used by the FBI in a trial setting.  “At least, maybe longer.”

Defense:  “Are you aware of any methodology available for voice identification that would produce a reliable speaker identification result for the speech sample in this case?”

Dr. Nakasone:  “No, not any system of today.”

Defense:  “So if someone claimed they could apply the methods we’ve talked about in this case, the application of that method to such a small sample would truly be new and novel.”

Dr. Nakasone:  “I would prefer to use a different word—disturbing.  I don’t think the science is there and the technology can’t handle that.  If someone claims that . . . in my opinion it’s a breakthrough in the scientific community.  It would overshadow a lot of science in this field right now.  I’ve never seen such a system, and NIST [the standards setting federal agency] has not even addressed speech samples of the type represented in this case, the screaming part.  The lack of evidence of existence of such scientific endeavor tells me that it’s not really feasible and not probable, and I’m sure that claim not only disturbs myself but other scientists in the community.”

Prosecution—Re-cross examination

So, when you are asked your opinion of the speech sample, it should not be used for identification.  “Yes.”

But there’s a big difference between suggesting that someone has invented a whole new methodology and merely having a different opinion about whether the existing methodology could produce a result.  “People can have their own views, but the technology . . .”

If people apply the same sort of methodologies as you used to determine this voice sample can not be used, your quarrel is with their opinion.  “Correct.”

Wrap-Up

This blog post is already outrageously long, and what I would have included in the wrap-up I’ve already written as “Highlights” at the start of this blog.

Andrew F. Branca is a MA lawyer with a long-standing interest in the law of self defense.  He authored the seminal book “The Law of Self Defense” (second edition shipping June 22–save 30% and pre-order TODAY!), and manages the Law of Self Defense web site and blog.  Many thanks to the Professor for the invitation to guest-blog on the Zimmerman trial here on Legal Insurrection!