## Revisiting the medical tests example with Python and Lea

May 27, 2015

### warning

This post is **more than 5 years old**. While math doesn't
age, code and operating systems do. Please use the code/ideas with caution and
**expect some issues** due to the age of the content. I am
keeping these posts up for archival purposes because I still find them useful
for reference, *even when they are out of date!*

In this post I will use Python, and the probabilistic programming package Lea , to re-analyze an example of Bayes' Theorem covered in an earlier post: Medical tests, a first example of Bayesian calculations . The focus will be on translating the by-hand calculations into Python code. If you are unfamiliar with the example, it would be helpful to read the previous post before continuing. Assuming you are set, let's get started!

First, let's restate the problem:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

This is an interesting and important problem to think about because many people get this wrong. Often people assume that a positive mammogram means that breast cancer is certain, or at least very likely. Given the probabilities stated above, that assumption is incorrect-- let's revisit this important problem.

I'll be using Lea , a package for probabilistic programming, to answer this question and provide some context. See my post Probabilistic programming with Python and Lea for installation instructions and a first example of using the Lea package. Our first step is to import Lea and some convenience functions:

`from __future__ import division, print_function`

from lea import Lea

Some notes:

- This post is using Lea version 2.1.1, the current version at this time
- A gist, containing all of the code is available here

Next, following the by-hand post Medical tests, a first example of Bayesian calculations , we
define the probability of cancer for women at age 40 (this can be considered
the **prior** in this problem-- the probability of cancer
before data from the mammogram is available):

`# define cancer dist`

cancer = Lea.fromValFreqs(('yes', 1),

('no', 99))

print('\nCancer Distribution',

'P(C)',

cancer.asPct(),

sep='\n')

producing...

`Cancer Distribution`

P(C)

no : 99.0 %

yes : 1.0 %

This corresponds to \( P(C=\textrm{yes}) = 0.01 \) and \( P(C=\textrm{no}) = 0.99 \) in the previous post. These values reflect the probability of cancer in this age group after extensive investigation to be sure that cancer is present.

Next, we have to define *two distributions* that describe the ability of
the mammogram to detect cancer accurately:

#### First, one for women known to have cancer

`# prob for mamm given cancer == yes`

mamm_g_cancer = Lea.fromValFreqs(('pos', 80),

('neg', 20))

print('\nProb for mammogram given cancer',

'P(M|C=yes)',

mamm_g_cancer.asPct(),

sep='\n')

producing...

`Prob for mammogram given cancer`

P(M|C=yes)

neg : 20.0 %

pos : 80.0 %

These probabilities correspond to
\( P(M=\textrm{pos} \vert C=\textrm{yes}) = 0.80 \) and
\( P(M=\textrm{neg} \vert C=\textrm{yes}) = 0.20 \) in the previous post. So,
if a woman *has cancer for sure* the mammogram will be positive 80%
of the time-- these are **true positives**.

####
Next, one for women known to be __cancer free__:

`# prob for mamm given cancer == no`

mamm_g_no_cancer = Lea.fromValFreqs(('pos', 96),

('neg', 1000-96))

print('\nProb for mammogram given NO cancer',

'P(M|C=no)',

mamm_g_no_cancer.asPct(),

sep='\n')

producing...

`Prob for mammogram given NO cancer`

P(M|C=no)

neg : 90.4 %

pos : 9.6 %

These probabilities correspond to
\( P(M=\textrm{pos} \vert C=\textrm{no}) = 0.096 \) and
\( P(M=\textrm{neg} \vert C=\textrm{no}) = 0.904 \). So, if a woman *does
not have cancer* the probability of a positive mammogram is
9.6%--of course, these are

**false positives**.

Finally, at least for the setup, we need to connect the probabilities of cancer and the mammogram using Lea. To do this we use a conditional probability table:

`# conditional probability table`

mammograms = Lea.buildCPT((cancer == 'yes', mamm_g_cancer),

(cancer == 'no', mamm_g_no_cancer))

print('\nMammograms',

'P(M)',

mammograms.asPct(),

sep='\n')

producing...

`Mammograms`

P(M)

neg : 89.7 %

pos : 10.3 %

The values are the marginal probabilities for the mammogram results:
\( P(M=\textrm{pos}) = 0.103 \) and \( P(M=\textrm{neg}) = 0.897 \).
This says that the probability of a positive mammogram is about 10%! However,
we have to keep in mind that this includes *both* true positives and
false positives.

Lea will let us look at the *joint probabilities* and separate out these effects as follows:

`# get joint probs for all events`

joint_probs = Lea.cprod(mammograms, cancer)

print('\nJoint Probabilities',

'P(M, C)',

joint_probs.asPct(),

sep='\n')

producing...

`Joint Probabilities`

P(M, C)

('neg', 'no') : 89.5 %

('neg', 'yes') : 0.2 %

('pos', 'no') : 9.5 %

('pos', 'yes') : 0.8 %

From this result we can see that the 10.3% positive mammograms is split into
9.5% *false positives* and 0.8% *true positives*! So, why do
we use mammograms? Well, they do provide information, but __not certainty__,
about the presence of cancer.

We can calculate the conditional probability that the original question asks
for using Lea: What is probability of cancer *given a positive
mammogram?*

`# prob cancer GIVEN mammogram==pos`

print('\nThe Answer',

'P(C|M=pos)',

cancer.given(mammograms == 'pos').asPct(),

sep='\n')

producing...

`The Answer`

P(C|M=pos)

no : 92.2 %

yes : 7.8 %

As we can see, this calculation provides both \( P(C=\textrm{no} \vert M=\textrm{pos}) \) and \( P(C=\textrm{yes} \vert M=\textrm{pos}) \). The code does not make it clear that Bayes' Theorem is being used in the calculation, but that's what is done to get the desired conditional probability-- see the previous post and the by-hand calculations to see how this works.

From the result we can see that the probability of cancer given a positive
mammogram is about 8%, that is
\( P(C=\textrm{yes} \vert M=\textrm{pos})=0.078 \).
This should be compared with the probability of 1%,
\( P(C=\textrm{yes})=0.01 \), without the information from a mammogram:
**there is an eight-fold increase in the probabilty of cancer that
might lead to further investigation**.
However, a positive mammogram *does not* mean
that breast cancer is likely; only more probable. Of course, the downside of
all this is that many women with *no cancer* will get positive
mammograms leading to further (unneeded tests).

Note, a negative mammogram also provides information:

`# prob cancer GIVEN mammogram==neg`

print('\nExtra Info',

'P(C|M=neg)',

cancer.given(mammograms == 'neg').asPct(),

sep='\n')

producing...

`Extra Info`

P(C|M=neg)

no : 99.8 %

yes : 0.2 %

This deceases the probability of cancer from 1% before the mammogram to 0.2% after the negative mammogram. Again, this is not certain or absolute, just the probabilities have changed.

So, that's it. A re-analysis of the medical test example using Python and Lea . Be sure to check out the package for more information and examples. As always, comments and questions are very welcome.