[MUSIC] My name is Andre Kleensang. I'm a research associate at the Center for Alternatives to Animal Testing. Within the Tox21C lecture series, I will be talking for the next 30 to 40 minutes about biometry and data mining. This lecture is divided into three sections. In the first section, section A, I will talk a little bit about the definition of biometry and biostatistics, and we will revisit some classical statistical basics. In section B, I will talk about the different types of Tox21 data and the importance of data mining from external data sources for Tox21. In section C, I will talk about available bioinformatics tools for the interpretation of Tox21 data. This lecture does not stand alone, so please also have a look at the other lectures, because for some things that are only sketched here, you can find more detailed information there. For example, the lecture of Dr. Martins, which is entitled Process of Toxicity for Data. Now we go to section A: the definition of biometry and biostatistics, and revisiting some statistical basics. For section A, there are three learning objectives. First, after hearing this lecture, you should be able to define biometry and biostatistics: what kind of data do they deal with, and what are the different variable scales and distributions? Second, what is descriptive statistics, what are typical tools used there, what is inferential statistics, and what is the difference between the two? Third, especially in the area of inferential statistics, I would like to revisit what a p-value is, and what alpha, beta, and power are. So let's start with a definition of statistics and biometry. Statistics is, in general, the collection, analysis, interpretation, and presentation of any kind of data.
Biostatistics, or as some people call it, biometry, is a subsection of that: the application of statistical analysis especially to data of biological origin. So, methods of biometry. First, one important thing is study design: biometry can help you design your studies in a way that leads to robust results. With descriptive statistics, you quantitatively describe or summarize features of your data, and with inferential statistics, you test hypotheses and derive population estimates. Another very important part, as you will also see later in this lecture, is data mining, which means you try to harvest large amounts of data for the things you are interested in. So let's start with something very simple. You should have seen something like this before, but I think it is always important to go back and start with some basics. What we measure in the lab as the result of our experiments are normally variables, and these variables can have different scales. You distinguish two primary scales: categorical and numerical. Let's start with categorical. Categorical variables can have unordered or ordered categories. If things are not in an order, so you cannot give them ranks, like sex, blood group, or hair color, the scale is called nominal. If you are able to assign ranks, like a disease stage or toxic potency, it is called an ordinal scale. On the other hand, you have the numerical scales, which are divided into discrete and continuous. Discrete variables are things you can describe with integer numbers. If you toss a coin four times, for example, you can see heads one, two, three, or four times, but you cannot see it three and a half times. In general, you can say: counts, things you can count. Continuous variables, on the other hand, can take any value in a range.
Typical continuous variables are, for example, OD measurements in the lab, but also height or blood pressure. There are also several other types of data: for example percentages, such as viability in a cell culture after exposure to a toxicant. Or you can have ratios; the body mass index is an example of a ratio. You can have quotients, rates, or scores, and you can also have censored data, which means your detection has a left and a right limit, and you are not able to measure things that lie outside of these limits. Most of the time, these kinds of variables are simply treated as continuous measurements. Why am I saying all this? Depending on which scale your data is on, you apply different tools and methods in statistics. Let's start with descriptive statistics and location scores. If you have a continuous distribution, the arithmetic mean is a typical measure, but you can also use the median. In the case of discrete data, the mode, which is the most frequent value, is quite often used. And for other distributions, for example a lognormal distribution, a geometric mean can be indicated. The location score is one thing; in addition, in descriptive statistics, you talk about dispersion, variability, and spread. One of these measures is the range, which is simply the maximum value in a data set minus the minimum value. But since this measure is sensitive to outliers, most of the time you use ranges which are derived from percentiles; a typical measure is the interquartile range. Other measures are given on this slide: the empirical variance, the standard deviation, the coefficient of variation, and the standard error of the mean. And just to point out that not only numbers are important: at the bottom of the slide I also gave some visualization of the data with box plots and histograms.
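The location and dispersion measures just listed can all be computed with the Python standard library. The data values below are made up purely for illustration; they stand in for the OD measurements mentioned above.

```python
import math
import statistics

# Hypothetical OD measurements from a cell-culture experiment
# (made-up numbers, for illustration only).
data = [0.42, 0.51, 0.39, 0.47, 0.55, 0.44, 0.61, 0.48]

n = len(data)
mean = statistics.mean(data)          # arithmetic mean (location)
median = statistics.median(data)      # robust location measure
data_range = max(data) - min(data)    # range: sensitive to outliers
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range
var = statistics.variance(data)       # empirical variance (n - 1 denominator)
sd = statistics.stdev(data)           # standard deviation
cv = sd / mean                        # coefficient of variation
sem = sd / math.sqrt(n)               # standard error of the mean

# Geometric mean: indicated for lognormal-looking data
geo_mean = statistics.geometric_mean(data)
```

Note that `statistics.variance` uses the n − 1 denominator, matching the empirical variance formula usually shown on such slides.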
Because it is not only important to calculate some descriptive statistics, but also to do a visual inspection, for example with the histograms or box plots given at the bottom of the slide. Some things you cannot see in the calculated descriptive statistics, but you can easily see in a visualization of the data. So never just trust the numbers; visually inspect the data you are dealing with. Now I would like to talk a little bit about distributions. The famous normal distribution, which is especially important for inferential statistics, is shown in the upper left of the slide. The distribution is described by two parameters, the mean and the variance, mu and sigma. It is symmetric around the mean, and in this case the mean is zero and the standard deviation is one. One of the nice properties, which later also allows you to do inferential statistics, is that mu plus or minus 1.96 standard deviations covers 95% of the probability mass. A very close relative is the lognormal distribution: a variable is lognormally distributed if its logarithm is normally distributed. Especially in toxicology, you will see that a lot of the data you generate in the lab is actually more lognormally than normally distributed. Why is it so important to talk about the normal distribution? It is well known, and it allows for a series of parametric statistical tests, so-called hypothesis testing. So if your data looks normally distributed, you are fine. If your data does not look normally distributed, think about whether you can apply a transformation function to transform your data in a way that it looks normally distributed afterwards. Ways to check normality are graphical tools; we already saw histograms and box plots, and there are also scatterplots. But there are also inferential statistics that can test whether the assumption of normality is met, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test.
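The two properties just mentioned, the 95% coverage of mu ± 1.96 sigma and the relation between the normal and lognormal distributions, can be checked with a small simulation. This is only an illustrative sketch using the standard library random generator:

```python
import math
import random
import statistics

random.seed(1)  # fixed seed for reproducibility

# Draw from a standard normal distribution (mu = 0, sigma = 1).
mu, sigma = 0.0, 1.0
draws = [random.gauss(mu, sigma) for _ in range(100_000)]

# Fraction of draws inside mu +/- 1.96 * sigma: should be close to 95%.
inside = sum(1 for x in draws if mu - 1.96 * sigma <= x <= mu + 1.96 * sigma)
coverage = inside / len(draws)

# A lognormal variable is one whose logarithm is normally distributed,
# so exponentiating normal draws gives lognormal data.
lognormal = [math.exp(x) for x in draws]

# The lognormal distribution is skewed: its mean exceeds its median,
# which is one reason to look at a histogram and not only at the mean.
skewed = statistics.mean(lognormal) > statistics.median(lognormal)
```

Taking the logarithm of the `lognormal` values would recover the original normal sample, which is exactly the transformation trick suggested above for data that does not look normally distributed.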
A second very important distribution is the binomial distribution. This is a discrete distribution, as you can see from the two coins on the left side of the slide. It is defined as the number of successes k in n independent dichotomous experiments, each with success probability p. So if your data comes from such a discrete yes/no process, you are dealing with a binomial distribution. Now I would like to talk about the statistical test, which is an important part of inferential statistics. The basis of a statistical test are the hypotheses: the null hypothesis and the alternative hypothesis, or H1. A typical question would be, for example: do females have a different shoe size than males? The null hypothesis would be, in this case, that the arithmetic mean of the shoe size of females is equal to the arithmetic mean of the shoe size of males, whereas the alternative hypothesis would be that the arithmetic mean of the shoe size of females is different from that of males. One very important thing to know about a statistical test is that it is a decision procedure. You make a decision, and you have two choices: you continue to believe in the null hypothesis, or you leave the null hypothesis and decide for the alternative hypothesis. You can already see in the brackets that if you stay with the null hypothesis, this is a weak decision, and if you leave the null hypothesis and go to the alternative, this is a strong decision. Why is that the case? Let's have a look at the confusion matrix on the lower right side of the slide. In the columns you see the truth, which is unknown, and in the rows the decision that you have taken based on your test. You can have two correct answers: the null hypothesis is true and I decided for the null hypothesis, or the alternative hypothesis is true and I decided to go to the alternative hypothesis.
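The definition of the binomial distribution just given, k successes in n independent dichotomous experiments with probability p, translates directly into a small probability function. This is a sketch using the coin example from above:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent
    dichotomous experiments, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A fair coin tossed 4 times: probability of seeing heads exactly twice.
p_two_heads = binom_pmf(2, 4, 0.5)  # comb(4, 2) / 2**4 = 6/16 = 0.375

# Sanity check: the probabilities over all possible k sum to 1.
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
```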
And then you have two possibilities to make an error, the so-called alpha and beta errors. The beta error is quite often called the missed chance, and the alpha error is quite often called the false alarm. This is a good way to remember them. The famous p-value is an estimate of alpha: based on the data, we calculate the alpha and call that the p-value. And if this calculated p-value is below a given threshold probability, normally the famous 5%, then we say our evidence that the null hypothesis is correct is so weak that we leave the null hypothesis and go to the alternative one. Since we can only calculate the alpha error, and not the beta error in parallel, staying with the null hypothesis is a weak decision: we cannot say whether the null hypothesis or the alternative is true, because we have no idea about the beta error. But if we go for the alternative hypothesis because the p-value is lower than the previously agreed threshold, this is a strong decision, because we have controlled the alpha error. In general, you have two different types of statistics: parametric and nonparametric. Parametric statistics assume a distribution of the data, which can be, for example, a normal distribution. Nonparametric statistics do not assume any distribution of the data and can be applied to all kinds of data, because in a first step they transform the data into ranks, and ranks always have the same distribution; that means the statistics can always be correctly applied. Last but not least, one minus beta is the so-called power of a statistical test. That means: if you would run the same experiment 100 times, and you know that there is truly an effect, i.e. the alternative hypothesis is correct, the power describes the percentage of cases in which your statistical test would be able to find the difference.
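The decision procedure described above, calculate a p-value, compare it to the agreed threshold, and make a weak or strong decision, can be sketched with an exact one-sided binomial test on the coin example. The numbers are illustrative only:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# H0: the coin is fair (p = 0.5); H1: heads are more likely (p > 0.5).
n, observed_heads, alpha = 10, 9, 0.05

# One-sided p-value: probability, under H0, of a result at least
# as extreme as the one observed.
p_value = sum(binom_pmf(k, n, 0.5) for k in range(observed_heads, n + 1))
# Here p_value = 11/1024, roughly 0.011.

if p_value < alpha:
    # Strong decision: the alpha error (false alarm) is controlled.
    decision = "reject H0"
else:
    # Weak decision: the beta error (missed chance) is unknown.
    decision = "stay with H0"
```

With 9 heads in 10 tosses the p-value falls below the famous 5%, so the evidence for the null hypothesis is weak enough to make the strong decision and go to the alternative.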
I mentioned already that one minus beta is the power, and a common mistake when dealing with biological data and doing experiments in your lab is that studies are underpowered. So I would like to sketch an exemplary sample size calculation without going much into detail. The study objective is to test the between-laboratory reproducibility of a new in vitro assay that has been developed to replace the animal tests of OECD test guidelines 406 and 429, and the target is to evaluate the reliability, which means the within- and between-laboratory reliability, of three methods. The results between laboratories can be concordant, yes, or discordant, which would then be no. In order to do a sample size calculation, you need, in principle, to define three things: the effect size, the type I error, and the power. Let's start with the effect size. We define, and this is taken from the literature, an expected proportion of concordant classifications, which is in this case 90%. Based on that, you define the effect size, the deviation from that value that your statistic should be able to detect, which is in this case 25 percentage points. That means 90% minus 25 percentage points gives 65% as the lower border for pi. The type I error that we are willing to accept has been defined as 5%, and the power of the study has been defined as 75%. If you do the sample size calculation, the outcome is that you need to test 21 chemicals in each lab to establish and test the between-laboratory reproducibility. Now, in the last 20 minutes, we talked about some basics of biometry and statistics and revisited some things about scales, and all of this was completely independent of Tox21 data. In the next section, we really go into the types of Tox21 data, the importance of data mining from external data sources, and how this can be done. [MUSIC]
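As a footnote to the sample size example in this section: the logic of choosing an expected proportion, a lower border, an alpha, and a power can be sketched as an exact binomial search. This is an assumption about the design, not the published calculation; the actual study behind the n = 21 result may use different conventions (two-sided testing, normal approximations, or other corrections), so the n found here need not match.

```python
from math import comb

def tail_prob(c: int, n: int, p: float) -> float:
    """P(X >= c) for X ~ Binomial(n, p); an empty sum gives 0."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

# Assumed framing (a sketch only): H0 puts the concordance pi at the
# lower border 0.65; we test one-sided at alpha = 0.05 and want power
# 0.75 against the expected concordance pi = 0.90.
p0, p1, alpha, target_power = 0.65, 0.90, 0.05, 0.75

def smallest_n(p0, p1, alpha, target_power, n_max=100):
    """Smallest n (with its critical value c) such that an exact
    one-sided binomial test controls alpha and reaches the power."""
    for n in range(1, n_max + 1):
        # smallest critical value c with P(X >= c | H0) <= alpha
        c = next(c for c in range(n + 2) if tail_prob(c, n, p0) <= alpha)
        if tail_prob(c, n, p1) >= target_power:
            return n, c
    return None

n_chemicals, critical = smallest_n(p0, p1, alpha, target_power)
```

By construction, testing `n_chemicals` items and calling the labs concordant when at least `critical` classifications agree keeps the alpha error at or below 5% while reaching at least 75% power against the expected 90% concordance.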