Follow-Up on DNA and Cold Hits
This is a follow-up to this morning’s post on DNA, cold hits, and statistics.
Prof. David Kaye, whom I cited in this morning’s post, has responded to my e-mail and given me permission to quote him.
Thanks for your inquiry. This is a surprisingly subtle statistical question. I have devoted two chapters to it in a forthcoming book to be published by Harvard University Press and have circulated a manuscript on the California cases and the general issue to law reviews. I served on the 1996 NRC committee that recommended adjustment, but I now find it difficult to defend that recommendation. Basically, there are two distinct questions:
Question 1. What is the chance that a database composed entirely of innocent people (with respect to [the] crime being investigated) will show a match? For databases that are small relative to the number of people who could have committed the crime, the NRC adjustment makes sense. The British experience mentioned in [the] article shows that this chance is much larger than the random match probability. But why is this “innocent database” probability important when considering what the evidence of a match to a named individual proves?
Question 2. How much does [the] fact that the defendant identified by a trawl through the database matches – and no one else in the database does — change the odds that he is the source of the DNA at the crime-scene? This is the question that is of interest to a jury trying to weigh the evidence. It is the one that Peter Donnelly and other statisticians have addressed. The answer is that the single match in the database raises the odds even more (but only slightly more) than does testing a single person at random and finding that he matches. As you point out, in the limit of a database that includes every person on earth, the evidence of a single match in the database becomes conclusive. How can the value of the evidence possibly decline as small databases get slightly bigger, then somehow switch direction and get immensely stronger as they get bigger still?
The discussion of the issue in the news and the courts is oversimplified and misleading (but entertaining). The manuscript of the law review article is attached. Feel free to quote from it as “submitted for publication.”
With best wishes,
DHK
I quoted from Prof. Kaye’s article in comments to the previous post. Let me quote one of those passages here, because I think it sheds light on the issue:
We can approach this question in two steps. First, we consider what the import of the DNA evidence would be if it consisted only of the one match between the defendant’s DNA and the crime-scene sample (because he was the only person tested). Then, we compare the impact of the match when the data from the trawl are added to give the full picture. . . . In the database trawl case . . . [i]f anything, the omitted evidence makes it more probable that the defendant is the source. On reflection, this result is entirely natural. When there is a trawl, the DNA evidence is more complete. It includes not only the fact that the defendant matches, but also the fact that other people were tested and did not match. The more people who are excluded, the more probable it is that any one of the remaining individuals — including the defendant — is the source. Compared to testing only the defendant, trawling therefore increases the probability that the defendant is the source. A database search is more probative than a single-suspect search.
Interesting.
I should note that Prof. Kaye’s exposition of the two relevant questions is similar to, but somewhat different from, the questions that I posed in my original post. In an attempt to illustrate what I believed to be the questions addressed by the two competing camps, I posited two similar questions:
1. What are the chances that a search of this database will turn up a match with the DNA profile?
2. What are the chances that any one person whose DNA matches a DNA profile is indeed the person who left the DNA from which the profile is taken?
Prof. Kaye’s questions state the issue in a more refined and, I believe, more accurate manner. As to my questions, he says in a follow-up e-mail:
The answer to your question #1 depends on the chance that the database contains the source (and, if “a match” means exactly one match, no one else with the matching type). That is not the question that the statisticians who favor an adjustment to the random-match probability are considering. The proposed statistical adjustment relates to the following modified version of your #1:
1′. What is the chance that a search of a database will turn up exactly one match when the source of the crime-scene DNA is someone who is unrelated to everyone in the database?
Likewise, the statisticians who argue that the database search is better evidence than the single-suspect search (and they are the majority of those writing on the topic) focus on a variation of your second question:
2′. What is the chance that the named individual whose DNA matches is the source?
I confess that I did not read the coin example you provided too closely. I suspect that it is correct. I have an example along these lines in my article (inspired by an example in the Donnelly-Friedman article).
Thus, I think the thrust of your remarks [is] on target, but some of the details of your analysis could be refined.
I thank Prof. Kaye for his correspondence. And yes, the coin example was rather long.
Incidentally, I have an e-mail in to Prof. Peter Donnelly, the Oxford statistician whom I cited in my earlier post. He is out until May 13.
I also received a nice e-mail from Jason Felch, one of the authors of the L.A. Times article, in a response to an e-mail I sent him. I have asked him for permission to quote from the e-mail and am awaiting his reply.
An interesting point, I’ll look forward to seeing his article.
SPQR (26be8b) — 5/4/2008 @ 7:44 pm

Kaye considers it important that there was exactly one match in the database. But suppose the reason there was exactly one match is that the search was stopped with the first match?
James B. Shearer (fc887e) — 5/4/2008 @ 8:05 pm

I find this to be intellectually stimulating and agree that the database, or size of it, is irrelevant.
Rather, the derivation of the statistics behind the “1:1.1 million” and the makeup of the database seem much more relevant.
In this case, if the database consisted only of randomly selected members of the general population (which I think it doesn’t), then the chance of a cold hit is still “1:1.1 million” (which I understand is based on the genetics of the human population).
However, if the database were further specialized to contain only criminal records, then the chance of a hit dramatically increases, since criminal activity was the reason for the trawl.
This seemingly increases the significance of a hit (as Prof. Kaye rightly points out) because it excludes a lot of high-probability suspects and it now matches a known behavior with crime-scene evidence.
However I doubt this would be enough for conviction…as “1:1.1 million”, while small, isn’t zero…
But the significance further increases (and database size decreases) when the search is further limited to known rapists living in SF at the time of the crime. A cold hit here would seem to have very high significance and the odds of error approaching zero.
On the other hand, if the database were made up solely of people born after the crime was committed, the chance of a cold hit isn’t zero (as one would assume) but rather it remains at 1:1.1 million.
There are people alive today who match 5 1/2 markers (of the crime scene DNA) but how many were alive, of proper age, in SF at the time of the crime and with a history of rape??
DB (3c0940) — 5/4/2008 @ 9:03 pm

Illuminating. Professor Kaye does an excellent job of elucidation here.
Thank you for taking the trouble to follow this up; I find it helpful in refreshing my memory, which I am getting old enough to need to do. 🙁
#2 James B. Shearer:
I’m stumped as to why you think that situation would arise?
And I find the use of “exactly” in your proposition somewhat troublesome: in this context there is a difference between a single match (against the database), and an exact match of the intron DNA. Which is it that you mean?
EW1(SG) (84e813) — 5/4/2008 @ 9:10 pm

This is just another cheap ploy by the anti-death penalty, pro-defendant crowd to rig the game. It doesn’t surprise me in the least that the Slimes would cover it the way they did. The fact that there are no doubt many logic-challenged people who will fall for this ridiculously transparent sleight-of-hand is sad indeed.
CraigC (89ea49) — 5/4/2008 @ 9:18 pm

I don’t understand this whole conversation. I thought Kato Kaelin said DNA stood for Dude Needs Apartment.
Comment by stef
daleyrocks (906622) — 5/4/2008 @ 9:57 pm

I read up a bit on genetic fingerprinting at Wikipedia.
There are a number of techniques for “fingerprinting” DNA. Most of these start by breaking DNA into short fragments using enzymes that target specific patterns. Restriction enzymes, for example, break DNA at specific palindromic sequences. The idea behind fingerprinting is that in the parts of the DNA that are more variable (in “junk DNA”, for example), the fragments resulting from breaking the DNA wherever specific sequences occur will be similarly variable. Both the length and the content of the strands will vary.
One technique, using short tandem repeats (STR), depends on sequences of DNA which are highly variable in human populations. Each given sequence turns up in anywhere from 5% to 20% of the population. If you have a fingerprint with 13 markers in a sample of DNA, all of which are present in 20% of the population and none of which are correlated with each other, the odds of a “lottery winner” match with a particular fingerprint are about one in 1.2 billion. The existence of correlations will almost certainly change the odds, and not all markers will be that frequent in the population, so we can get very long odds.
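For anyone who wants to check that arithmetic, here is a minimal Python sketch using the illustrative assumptions above (13 uncorrelated markers, each present in 20% of the population):

```python
# Random-match probability for independent markers, as described above.
# Assumes 13 uncorrelated markers, each present in 20% of the population
# (illustrative figures, not the actual case numbers).

marker_frequency = 0.20
num_markers = 13

random_match_probability = marker_frequency ** num_markers
print(f"Random-match probability: 1 in {1 / random_match_probability:,.0f}")
# -> roughly 1 in 1.2 billion
```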
Notice, though, I refer to a “lottery winner” match. This parallels the case where one person buys a lottery ticket and wins the jackpot. There is no particular reason why any ticket holder should be the winner.
Let’s start thinking about the cold hit in your Saturday evening post as winning a jackpot with odds of one in 1.1 million.
In a case like this, if we pick out any one individual at random, the odds that he will win the jackpot are one in 1.1 million. If we pick two random people, we have two chances that one of them will have won that jackpot, so the total odds are two in 1.1 million. If we were to pick 1.1 million people at random, we’d expect to pick one winner. In fact, since we’re assuming the lottery numbers (markers) assort at random, the math doesn’t quite work that way: if we pick 1.1 million people at random, there is only about a 63% chance that at least one of them hits the jackpot, and only about a 37% chance of exactly one hit. This is because the matches land at random: sometimes two or more people in the group happen to share the winning set of markers, and sometimes no one does. (We’re not looking for any two people who match each other; we’re looking for people who match one specific set of markers. That makes a big difference in the odds.)
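Those figures are easy to verify; this is a minimal sketch under the same assumption of independent, 1-in-1.1-million per-person match probabilities:

```python
# Chance of matching one specific profile when sampling N random people,
# each with an independent 1-in-1.1-million chance of matching.

p = 1 / 1_100_000      # per-person match probability
N = 1_100_000          # number of people sampled

p_none = (1 - p) ** N
p_at_least_one = 1 - p_none
p_exactly_one = N * p * (1 - p) ** (N - 1)

print(f"P(no match)     = {p_none:.3f}")         # ~0.368
print(f"P(at least one) = {p_at_least_one:.3f}") # ~0.632
print(f"P(exactly one)  = {p_exactly_one:.3f}")  # ~0.368
```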
In particular, the LA Times is stating that
This is actually wrong. I can lead you through the math involved, but I don’t think that’s necessary. The chance that at least one individual listed in the database would be a match for the genetic fingerprint by sheer chance is 26.4%, or one in 3.78. The calculation offered by the LA Times claims one in 3.25. I think we’ll both agree that’s not enough of a difference to scotch the story.
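A few lines of Python reproduce both numbers; the 1-in-3.25 figure corresponds to the simple expected-count ratio, which appears to be the calculation the LA Times used:

```python
# Chance that a 338,000-person database of unrelated people yields at
# least one coincidental match to a 1-in-1.1-million profile.

p = 1 / 1_100_000
db_size = 338_000

p_coincidental_hit = 1 - (1 - p) ** db_size
print(f"P(at least one hit) = {p_coincidental_hit:.3f} "
      f"(about 1 in {1 / p_coincidental_hit:.2f})")        # ~0.264, 1 in 3.78

expected_hits = db_size * p
print(f"Expected hits = {expected_hits:.3f} "
      f"(about 1 in {1 / expected_hits:.2f})")              # ~0.307, 1 in 3.25
```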
Your objection to this point is:
You’re neglecting the denominator in this problem.
Let’s take a database of a million people, and a fingerprint with odds of one in two million of a “hit”. As I grind through the math, I can show that there’s a 39% chance of at least one false positive. Any given group of a million people has that same chance of at least one false positive. So to gauge how widely such hits could occur, you multiply this 39% by the number of million-person groups in the world — 6700. We’d expect roughly 2,600 groups containing at least one hit, give or take.
Now let’s grow the database, keeping the odds of a “hit” the same. Given a ten million person database, we’re looking at one chance in twenty million of a “hit”. The percentage chance of at least one “hit” in any group of ten million remains the same — 39%. The number of groups available in the world is smaller: 670 such groups, yielding about 260 with at least one hit.
At one billion, it’s still 39%. Multiply by 6.7, and we expect two or three such groups (2.6 on average).
As the size of the database grows, keeping the probability of a “hit” the same, the number of candidates outside the database shrinks. In the limit, when your database includes everyone in the world, the number of candidates outside the database is zero. That’s why, if you have one genetic match in a database that includes everyone on the planet, you actually have no chance of having identified the wrong person.
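Here is a short sketch of that trend: hold the within-database hit probability fixed, grow the database, and count the expected coincidental matchers left outside it. The 6.7-billion world population and the scaling of the random-match probability with database size follow the comment above; this version tallies matching individuals rather than million-person groups, so the counts differ a bit, but the shrink toward zero is the same.

```python
# The "denominator" argument: keep the chance of a spurious hit inside the
# database fixed, grow the database, and count how many coincidental
# matchers we expect in the rest of the world.

WORLD_POP = 6_700_000_000   # round figure used in the comment

for db_size in (1_000_000, 10_000_000, 1_000_000_000, WORLD_POP):
    p = 1 / (2 * db_size)                  # random-match probability, scaled as above
    p_hit_in_db = 1 - (1 - p) ** db_size   # ~39% in every case
    outside = WORLD_POP - db_size
    expected_outside_matches = outside * p
    print(f"db={db_size:>13,}  P(hit in db)={p_hit_in_db:.2f}  "
          f"expected matchers outside={expected_outside_matches:,.1f}")
# As the database approaches the whole world, the matchers outside drop to zero.
```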
I hope that’s reasonably clear.
Karl Lembke (652d57) — 5/4/2008 @ 11:08 pm

#4
“I’m stumped as to why you think that situation would arise?”
Because people stop looking when they think they have found the culprit. For example, you might first run the sample against an LA County database, then, if there is no match, against a California database, and finally, if there is still no match, against a national database. If you run it against LA County, get one match, and stop, that sounds bad for the person matched. If you run it against a national database and get 99 more matches, it doesn’t sound so bad.
James B. Shearer (47baae) — 5/5/2008 @ 1:37 am

“A database search is more probative than a single-suspect search.”
I think this is incorrect in practice. Suppose we have a group of k suspects. We believe the chance that one of them did it is p but have no favorites among them. Suppose the random match probability is 1/n. Suppose we run the DNA sample against the database and get exactly one match with X. Assume there is no chance of error in the tests. Then if I have calculated correctly the new probability that the group of k suspects contains the guilty party is p/(p+(1-p)(k/n)). Clearly this is also the probability that X is guilty.
Let us consider some examples. We will assume n is one million, so that the random match probability is 1 in a million. Suppose a girl is raped and murdered and there is an obvious suspect (for example, a single man living in the other half of a two-family house). Suppose before running a DNA test we estimate a probability of guilt of .5 (or 50%), fairly likely but certainly not beyond a reasonable doubt. We run the test and get a match. Here k=1, so by the formula above the new estimated probability of guilt is 1000000/1000001. Or to put it another way, the chance of innocence has gone from .5 to .000001, which most people would deem guilt beyond a reasonable doubt. This conclusion is not too sensitive to your initial estimate p. Suppose for example you conservatively estimate a low initial probability of guilt of .01 (1%). Again, by the above formula, after a DNA match the new probability of guilt becomes .01/(.01+.99/1000000) = .9999+. So the probability of innocence has gone from .99 to about .0001, which most people would still consider beyond a reasonable doubt.
Next suppose we don’t have a single suspect but we do have a DNA database with 100000 names in it. Suppose we estimate there is a 50% chance that the murderer is in the database but we have no reason to prefer anybody among the 100000. So the chance that any particular person in the database is guilty is .5/100000 = .000005. Now suppose we run our DNA sample against the database and get a single match with X. By the above formula (with k=100000) the probability that X is guilty is .5/(.5+.05) = .9091 = 90.91%. This is a lot higher than .000005, but a 9+% chance of innocence is probably reasonable doubt. And here, if you conservatively estimate that there is only a 10% chance that the database contains the murderer, then X’s probability of guilt is only 52.63%, hardly beyond a reasonable doubt.
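Shearer’s formula is easy to check numerically; here is a minimal sketch (the function name is mine) that reproduces the four examples above:

```python
# Posterior probability that the single database match X is the source,
# given a prior probability p that the group of k suspects contains the
# source and a random-match probability of 1/n (the formula in the comment).

def posterior_guilt(p: float, k: int, n: int) -> float:
    """p / (p + (1 - p) * k / n)"""
    return p / (p + (1 - p) * k / n)

print(posterior_guilt(p=0.5,  k=1,       n=1_000_000))  # ~0.999999
print(posterior_guilt(p=0.01, k=1,       n=1_000_000))  # ~0.9999
print(posterior_guilt(p=0.5,  k=100_000, n=1_000_000))  # ~0.9091
print(posterior_guilt(p=0.1,  k=100_000, n=1_000_000))  # ~0.5263
```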
It is true that, all other things being equal, bigger databases are more likely to contain the murderer. But all other things are generally not equal. You run against a single suspect when you already have good reasons to suspect him. You run against a large database when you don’t have any reason to suspect any of them in particular.
James B. Shearer (47baae) — 5/5/2008 @ 2:50 am

How exactly is the DNA database constructed?
Many years ago, they ran fingerprint analysis by having a library of index cards with fingerprints on them. If a print or partial print was lifted, it was someone’s job to look at each index card for a matching print.
Then, with computerization, a fingerprint was reduced to a number based on where the whorls and other distinct features of the print started and stopped. The target fingerprint was reduced to a similar number; then, the actual images were compared if the numbers matched.
If after testing, the DNA sample is reduced to some number, then:
1) what analysis has been done on the reducing algorithm to check for the uniqueness of the end result?
2) after the numbers match, is the actual DNA retested or the visuals of the DNA actually compared or does the match come strictly from matching numbers?
Adriane (09d132) — 5/5/2008 @ 2:57 am

#8 James B. Shearer:
Ah, I understand now. I was confused because I was aware that the California data is maintained at the state level, i.e., there isn’t an “LA county database” and then a separate “Ventura county database”; a search of the California database is comprehensive.
AFAIK, California is unique in the size and comprehensiveness of a DNA database maintained at the state level. The other large, comprehensive database is maintained by the FBI, and IIRC many states participate in that one.
So if one searches in California for a cold hit on a California crime and gets a hit, it really does not make sense to search a national database unless the subject of the hit is a partial match and there is exculpatory evidence to suggest that the subject is alibied for the crime; or such inculpatory evidence as exists may be “thin” and the investigator does want to go “the extra mile” and confirm that no more matches are available.
I’m not sure what the standard practice is, so perhaps somebody else can enlighten me.
EW1(SG) (84e813) — 5/5/2008 @ 3:32 am

I think Mr. Shearer made an error in the analysis, in that the match in the database overrides the original 50% estimate. The estimate means nothing when definitive evidence contradicts or resolves it. If the 50% is based on true statistics due to other evidence that satisfies independence, then it would apply, and a composite probability could be computed from Bayes’ theorem.
Ken from Camarillo (245846) — 5/5/2008 @ 3:48 am

#10 Adriane:
The DNA profiles come from actual counts of the components that make up the unique parts of DNA (at specific loci).
If you graph it out, it looks rather like a spectrograph. But the database is composed of the number and length of DNA fragments found at a specific locus, so an example would be “D5S818 10/13” where at locus “D5S818” the subject has two fragments of DNA, one length 10, and the other length 13. (Another person’s DNA might be 11/12 at that particular locus.)
Once the samples are tested the graphs are compared, but the search is pretty specific.
EW1(SG) (84e813) — 5/5/2008 @ 4:30 am

#10 Adriane: To simplify my earlier answer, DNA used for identification is a sequence of 15 arrays of numbers, and although the number of arrays is variable, the numbers in the arrays are discrete, making this kind of identification extremely well suited to computerized methods.
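A toy sketch of what such a discrete, locus-by-locus comparison looks like in code; D5S818 comes from the comment above, while the other locus names and all allele values are made up for illustration:

```python
# Why STR profiles suit computerized matching: a profile is just a mapping
# from locus name to a pair of discrete values, so comparison is exact,
# locus by locus. Allele values below are illustrative, not case data.

crime_scene = {"D5S818": (10, 13), "D13S317": (11, 12), "D7S820": (8, 10)}
subject     = {"D5S818": (10, 13), "D13S317": (11, 12), "D7S820": (8, 11)}

def matching_loci(a, b):
    """Return the loci at which both profiles carry the same allele pair."""
    return [locus for locus in a
            if locus in b and sorted(a[locus]) == sorted(b[locus])]

print(matching_loci(crime_scene, subject))   # ['D5S818', 'D13S317']
```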
EW1(SG) (84e813) — 5/5/2008 @ 10:11 am

#12
I did compute a composite probability using Bayes’ Theorem.
The real problem is that lay people tend to think that if the random match probability is 1 in a million and the suspect’s DNA matches, then there is only a 1 in a million chance the suspect is innocent. This is not correct. A match just means the odds ratio (the probability of guilt divided by the probability of innocence) has increased by a factor of one million. But this is not conclusive if the probability of guilt was very low to start with (as it is for someone in a large DNA database). The fact that everyone else in the database misses does not change things much unless you have strong reason to believe the database contains the guilty person, which is not true in practice with the current databases.
To see how the computations work, consider a rape-murder with an obvious suspect (say, as above, a single man living in the other half of a duplex). Suppose before doing the DNA test you estimate a 50% probability that the suspect is guilty. So if the situation arose 2000000 times you would have a pool of 1000000 guilty men and 1000000 innocent men. Now run the DNA test. This would eliminate none of the guilty men from the pool but would eliminate most of the innocent men. Suppose the test has a false positive rate of 1 in a million. Then on average only one innocent man will not be eliminated. So you will end up with a pool of 1000001 men, only 1 of whom is innocent. So you will have 1000000 guilty men for every innocent man after the DNA match. In other words, the match has increased the ratio of guilty men to innocent men by a factor of one million, from 1:1 to 1000000:1. This is pretty conclusive.
But suppose we don’t start with a good suspect but with some random guy from the entire city of LA. Clearly before the DNA test the probability of guilt is pretty low; let’s say 1 in 1000001. In other words, the initial odds ratio is 1:1000000, one million innocent men for every guilty man. So let us imagine starting with a pool of 1000001 men, one guilty and 1000000 innocent, and running the test. The guilty man remains and all but (on average) 1 of the innocent men are eliminated. So we are left with a new pool of two men, one of whom is guilty. So again the odds ratio has increased by a factor of 1000000, from 1:1000000 to 1:1, but this is not sufficient to prove guilt beyond a reasonable doubt.
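Here is a small sketch of the odds-ratio bookkeeping in the two scenarios just described; the labels and code are illustrative, not part of the original comment:

```python
# A clean DNA match multiplies the prior odds of guilt by roughly n (here
# one million), but the resulting probability still depends heavily on
# where the prior started.

LIKELIHOOD_RATIO = 1_000_000   # 1 / random-match probability

scenarios = {
    "obvious suspect, prior odds 1:1": 1.0,
    "random man from LA, prior odds 1:1,000,000": 1e-6,
}

for label, prior_odds in scenarios.items():
    posterior_odds = prior_odds * LIKELIHOOD_RATIO
    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(f"{label}: P(guilt) = {posterior_prob:.6f}")
# -> 0.999999 for the obvious suspect, 0.500000 for the random man
```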
James B. Shearer (fc887e) — 5/5/2008 @ 11:55 am

One thing not mentioned is the possibility of duplicates. With a limited DNA sample, the number of unique variations is relatively small. In this case there are 1.1M variations. Therefore, one could expect to find a match in any set of 1.1M samples, even if the set does not include the actual source.
The larger the set of samples, the greater the chance of such a match. In this case, the sample set was 338K, so the chance of an unrelated match is about 1 in 3. The LATimes is (mirabile dictu!) right about that. The match by itself would not be enough to convict.
If the set of samples was 11M, it would probably produce multiple matches.
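As a rough check, a simple Poisson model of unrelated profiles bears out the “probably multiple matches” claim for an 11-million-person set:

```python
# Expected number of coincidental matches to a 1-in-1.1-million profile in
# unrelated sample sets of various sizes, plus the chance of two or more
# matches under a Poisson approximation.

import math

p = 1 / 1_100_000
for db_size in (338_000, 1_100_000, 11_000_000):
    expected = db_size * p
    p_two_or_more = 1 - math.exp(-expected) * (1 + expected)   # Poisson P(X >= 2)
    print(f"N={db_size:>10,}  expected matches={expected:.2f}  "
          f"P(2 or more)={p_two_or_more:.2f}")
# At N = 11 million, about 10 matches are expected, and multiple matches
# are nearly certain.
```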
This is not to say that partial DNA matches are not useful; they can exclude most suspects, which is very helpful.
In this case, the partial DNA match is good for a 2/3 probability of guilt by itself, which is a pretty good start. When this is combined with the suspect’s actual presence in the area at the time, and the other corroborating elements, I think the case is made.
Going to the extreme case of a full global DNA database… Suppose one has a partial sample with 14 billion variations. One cannot assume that all the samples in the database are unique. There is about a 50% chance that any one sample has a “twin” somewhere. A “cold match”, then, would have a 50% chance of being a false positive. Of course with the full 13 quadrillion variant sample, the chance of a duplicate is 1 in 2,000; and circumstantial evidence can exclude all but a relative handful of people.
On the other hand, the partial sample from the case in question would produce thousands of matches.
Rich Rostrom (7c21fc) — 5/5/2008 @ 1:03 pm

#16 Rich Rostrom:
No, you can’t say that. It is true that
but that doesn’t imply that we could or should expect to find a “match” if we expand the number of samples. It may be that the sample fragment is enough to uniquely identify a single person out of all the world’s population, but the only thing we know for sure without sampling the world population is that the fragment of DNA we have on hand has enough heritable characteristics in it that we cannot expect to find it more than once in 1,100,000 individuals. Which doesn’t have anything to do with guilt or innocence, as Patterico has pointed out.
Another thing you cannot say is
because if you don’t have the “actual source” (and are perhaps just looking for a match to a randomly generated example) then you don’t know if the example you are searching for is even capable of existing in nature.
EW1(SG) (84e813) — 5/5/2008 @ 5:00 pm

Why is it that Karl Lembke (#7) is being ignored? Seems to me he pretty much covered the statistical questions in a straightforward manner.
Apogee (366e8b) — 5/5/2008 @ 5:23 pm

#18 Apogee:
Good question.
He did a good job of covering some of the material. Could be because he used some of the number thingies.
EW1(SG) (84e813) — 5/5/2008 @ 5:40 pm

EW1(SG) – Yeah, and I’d already bought my wooden stick, paint, staple gun and 2×3 cardboard. I’d just finished painting “50% Innocent!” on the sign and was on my way to the protest when I read his all-numbery post that logically spelled out the issue. Damn him! How does he expect me to get emotionally involved? Now I won’t get to protest, won’t get to meet that hottie in the bandanna, won’t get to charm her with my ‘truth to power’ spiel, and won’t get laid.
I hate Lembke.
Apogee (366e8b) — 5/5/2008 @ 5:47 pm

Which was a one in 17.4 quadrillion chance anyway.
/:poke:
EW1(SG) (84e813) — 5/5/2008 @ 6:10 pm

I’d be insulted, but I’m simply not good with numbers.
Apogee (366e8b) — 5/5/2008 @ 6:17 pm

In this case, the sample set was 338K, so the chance of an unrelated match is about 1 in 3. The LATimes is (mirabile dictu!) right about that.
No, that’s the chance of a match.
It’s the chance of an unrelated match only if we know that the database is composed entirely of people who didn’t donate the DNA in the evidence sample, and who are unrelated to the person who did.
I think I’m saying that right.
Patterico (4bda0b) — 5/5/2008 @ 8:57 pm

It’s the chance of an unrelated match only if we know that the database is composed entirely of people who didn’t donate the DNA in the evidence sample, and who are unrelated to the person who did.
I think I’m saying that right.
You are.
The point is, as near as I can tell from the LA Times article, there seems to have been no a priori reason to believe the perpetrator of the rape was actually listed in the database. Thus, running the DNA sample from the crime scene against the database is mathematically equivalent to running it against any sample of 338,000 people selected at random. There is a 26.4% chance of obtaining a “hit” on any such random group of that size.
Remember, I’m discussing the statistics — not any other evidence that may have been developed as a result of following up on the match.
Now, the other evidence mentioned in the LATimes article may or may not have been telling. Maybe there was enough of a match between Puckett’s MO and the details of the crime in question to convict. (Maybe other evidence, unaccountably omitted from the LATimes article, was more compelling.) But based on the DNA “match”, there’s not enough to convict.
Karl Lembke (1d7861) — 5/6/2008 @ 9:41 am

The IBM site “Ponder This” poses a puzzle based on this subject. Their answer will be available at the end of June. It will be fun to see their approved approach.
Gale Greenlee (178b5a) — 6/4/2008 @ 12:26 pm