Why probability is important to statistics

What roles do probability and statistics play in the data science field? Let us make their significance clear and logical.

Making predictions and searching for structure in data are among the most important parts of data science, and probability and statistics matter because they supply the tools for these analytical tasks. Statistics is a set of principles used to extract information from data in order to make decisions; it unveils what is hidden in the data. Probability and statistics are involved in many of the predictive algorithms used in machine learning.

They also help in deciding how reliable data are. Certain theorems, such as Bayes' theorem, play a very important role in statistics. A person working in data science should know the terminology broadly practiced in statistics. Let us go over some of these terms:

Population - the entire source from which the data are to be collected.
Sample - a subset of the population.
Variable - a data item, a number or a thing, that can be measured or counted.

Statistical parameter - a quantity that indexes a probability distribution, such as the mean, median, or mode. Statistical analysis is the science of exploring large datasets to discover hidden patterns and trends.

These types of analyses are applied to every sort of data, for example in research and in multiple industries, in order to reach well-founded decisions. There are mainly two types of statistical analysis. Quantitative analysis: the science of collecting and interpreting data with graphs and numbers to search for underlying hidden trends.

Employers in this area often prefer mathematicians and statisticians to people with an economics background. Politics is very much about strategy.

How should an election campaign be fought? How should a government deal with other powers? How much money should the health service receive? To find a good strategy, politicians need to understand public opinion, know about the structure of society and assess risks. The government employs many statisticians to help them with this. They can conduct and evaluate a census, and work out the risk of there being an epidemic, or of the world economy plunging.

During the Cold War, game theory, which is closely related to probability theory, was used to decide whether the US strategy of arming itself to the teeth to deter an attack from the USSR was effective. When you produce a product, be it a car or a light bulb, you want to know how reliable it is. To find out, you take a sample of your light bulbs or cars and test them. Just as in an opinion poll, you can use statistical methods to gain information about the quality of your product from this sample.

Reliability theory has become a very important branch within statistics. Statistics is often used in law. Suppose that someone working for a company has been accused of falsifying their expense account.

Statistical tests can show whether the numbers in the expense account are likely to have been made up. But there is also a lot of controversy around statistics and probability in law, because they can be misused. A few years ago, a woman called Sally Clark was jailed for the murder of her two children after an expert witness argued that the chance of two children in one family both dying of natural causes was vanishingly small, a figure he obtained by multiplying the probability of a single such death by itself. But this reasoning is flawed: it treats the two deaths as independent events, ignoring genetic and environmental factors shared within a family. Finally, we all need a basic understanding of statistics.

The newspapers and TV news are full of statistics that we need to understand. We also need to make decisions based on risk. There are two measures that are commonly used to evaluate the performance of screening tests: the sensitivity and specificity of the test.

The sensitivity of the test reflects the probability that the screening test will be positive among those who are diseased. In contrast, the specificity of the test reflects the probability that the screening test will be negative among those who, in fact, do not have the disease. A total of N patients complete both the screening test and the diagnostic test.

The data are often organized with the results of the screening test shown in the rows and the results of the diagnostic test shown in the columns. The false positive fraction is 1 - specificity, and the false negative fraction is 1 - sensitivity. Therefore, knowing sensitivity and specificity captures the information in the false positive and false negative fractions; these are simply alternate ways of expressing the same information. Oftentimes, sensitivity and the false positive fraction are reported for a test.
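These fractions can be computed directly from the four cells of such a table. Here is a minimal sketch; the counts and variable names are hypothetical illustrations, not values from the text:

```python
# Hypothetical 2x2 counts: screening result (rows) vs. diagnostic result (columns).
true_pos = 90    # screen positive, disease present
false_pos = 40   # screen positive, disease absent
false_neg = 10   # screen negative, disease present
true_neg = 860   # screen negative, disease absent

sensitivity = true_pos / (true_pos + false_neg)   # P(screen + | disease)
specificity = true_neg / (true_neg + false_pos)   # P(screen - | no disease)
false_positive_fraction = 1 - specificity
false_negative_fraction = 1 - sensitivity

print(sensitivity, specificity, false_positive_fraction, false_negative_fraction)
```

Note that sensitivity and the false negative fraction sum to 1, as do specificity and the false positive fraction, which is why reporting one of each pair suffices.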

However, the false positive and false negative fractions quantify the errors of the test, and these errors are often of greatest concern. Sensitivity and the false positive fraction are usually reported for screening tests, but for some tests the specificity and false negative fraction may be the most important.

The most important characteristics of any screening test depend on the implications of an error. In all cases, it is important to understand the performance characteristics of any screening test to appropriately interpret results and their implications. Consider the results of a screening test from the patient's perspective! If the screening test is positive, the patient wants to know, "What is the probability that I actually have the disease?" And if it is negative, "What is the probability that I am actually disease-free?"

These questions refer to the positive and negative predictive values of the screening test, and they can be answered with conditional probabilities. The sensitivity and specificity of a screening test are characteristics of the test's performance at a given cut-off point (criterion of positivity).

However, the positive predictive value of a screening test will be influenced not only by the sensitivity and specificity of the test, but also by the prevalence of the disease in the population that is being screened. In this example, the positive predictive value is very low.

This is because the disease is rare: as a disease becomes more prevalent, subjects fall more frequently in the "affected" or "diseased" column, so the probability of disease among subjects with positive tests becomes higher; when the disease is rare, that probability is correspondingly low. So, while this screening test has good performance characteristics (high sensitivity and specificity), its predictive value is limited by the low prevalence. And because positive and negative predictive values depend on the prevalence of the disease, they cannot be estimated in case-control designs.
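The dependence of predictive values on prevalence is easy to demonstrate numerically. The sketch below assumes a hypothetical test with 90% sensitivity and 95% specificity; the function name is illustrative:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), computed via Bayes' theorem."""
    true_pos = sensitivity * prevalence                 # P(+ and diseased)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(+ and not diseased)
    return true_pos / (true_pos + false_pos)

# Same test, different prevalences: PPV rises sharply with prevalence.
for prev in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(0.90, 0.95, prev)
    print(f"prevalence {prev:>6}: PPV = {ppv:.3f}")
```

With these assumed operating characteristics, the same test yields a PPV under 2% when prevalence is 0.1%, but over 90% when prevalence is 50%, which is exactly why predictive values cannot be read off a case-control sample.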

In probability, two events are said to be independent if the probability of one is not affected by the occurrence or non-occurrence of the other. This definition requires further explanation, so consider the following example. Suppose we have a different test for prostate cancer. This prostate test produces a numerical risk that classifies a man as at low, moderate or high risk for prostate cancer.

A sample of men underwent the new test and also had a biopsy. The data from the biopsy results are summarized below. Note that regardless of whether the hypothetical prostate test result was low, moderate, or high, the probability that a subject had cancer was the same.

In other words, knowing a man's prostate test result does not affect the likelihood that he has prostate cancer in this example. In this case, the probability that a man has prostate cancer is independent of his prostate test result. Consider two events, call them A and B (e.g., a low-risk test result and a cancer diagnosis). The equality of the conditional and unconditional probabilities indicates independence. In other words, the probability of the patient having a diagnosis of prostate cancer given a low-risk "prostate test" (the conditional probability) is the same as the overall probability of having a diagnosis of prostate cancer (the unconditional probability).
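An independence check of this kind can be sketched with a small table of counts; the numbers below are hypothetical, not the study's actual data:

```python
# Hypothetical counts: biopsy result by test risk category.
counts = {
    "low":      {"cancer": 10, "no_cancer": 90},
    "moderate": {"cancer": 20, "no_cancer": 180},
    "high":     {"cancer": 30, "no_cancer": 270},
}

total = sum(c["cancer"] + c["no_cancer"] for c in counts.values())
total_cancer = sum(c["cancer"] for c in counts.values())
p_cancer = total_cancer / total  # unconditional probability of cancer

for risk, c in counts.items():
    # conditional probability of cancer given this test result
    p_given = c["cancer"] / (c["cancer"] + c["no_cancer"])
    print(f"P(cancer | {risk}) = {p_given:.2f}, P(cancer) = {p_cancer:.2f}")
```

In this made-up table every conditional probability equals the unconditional one, so the test result and the diagnosis are independent; any row where they differed would indicate dependence.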

Each individual is also classified in terms of having a family history of cardiovascular disease. In this analysis, family history is defined as a first-degree relative (parent or sibling) with diagnosed cardiovascular disease before a specified age. Are family history and prevalent CVD independent? Is there a relationship between family history and prevalent CVD? This is a question of independence of events.

Again, it makes no difference which definition is used; the conclusion will be identical. We will compare the conditional probability to the unconditional probability as follows.

The probability of prevalent CVD given a family history is higher than the probability of prevalent CVD in the overall population. Since these probabilities are not equal, family history and prevalent CVD are not independent. Chris Wiggins, an associate professor of applied mathematics at Columbia University, posed the following question in an article in Scientific American.

The doctor performs a test with 99 percent reliability--that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick.

Now the question is: if the patient tests positive, what are the chances the patient is sick? The intuitive answer is 99 percent, but the correct answer is 50 percent. The solution to this question can easily be calculated using Bayes's theorem.
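The 50 percent answer can be verified directly from the numbers stated above (1 percent prevalence, 99 percent sensitivity and specificity); a minimal sketch:

```python
prevalence = 0.01     # 1 percent of the population is sick
sensitivity = 0.99    # P(test + | sick)
specificity = 0.99    # P(test - | healthy)

# Bayes' theorem: P(sick | +) = P(+ | sick) * P(sick) / P(+)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_sick_given_pos = sensitivity * prevalence / p_pos
print(round(p_sick_given_pos, 4))  # 0.5
```

The true positives (0.99 × 0.01) and the false positives (0.01 × 0.99) are equally numerous, which is why a positive test leaves the patient at even odds.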

Bayes, an eighteenth-century reverend, stated that the probability that you test positive AND are sick is the product of the likelihood that you test positive GIVEN that you are sick and the "prior" probability that you are sick (the prevalence in the population).

Bayes's theorem allows one to compute a conditional probability based on the available information. Wiggins's explanation can be summarized with the help of the following table, which illustrates the scenario in a hypothetical population of 10,000 people:

Therefore, in a population of 10,000 there will be 100 diseased people and 9,900 non-diseased people, and since 99 percent of healthy people test negative, 99 of the 9,900 non-diseased people will nonetheless have a positive test. Now suppose a patient exhibits symptoms that make her physician concerned that she may have a particular disease, one that is relatively rare in this population. Before agreeing to the screening test, the patient wants to know what will be learned from the test; specifically, she wants to know the probability of disease given a positive test result, i.e., the positive predictive value.

Based on the available information, we could piece this together using a large hypothetical population. Given the available information, this test would produce the results summarized in the table below.

We can now substitute the values into the above equation to compute the desired probability.

If the patient undergoes the test and it comes back positive, her probability of actually having the disease is still small, although considerably higher than the prevalence alone. Note, however, that without the test her probability of disease is simply the prevalence. In view of this, do you think the patient should have the screening test? Another important question that the patient might ask is: what is the chance of a false positive result?

We can compute this conditional probability from the available information using Bayes' Theorem; the sensitivity, specificity, and prevalence together determine the chance of a false positive result. The events Disease and No Disease are called complementary events: the "No Disease" group includes all members of the population not in the "Disease" group, and the sum of the probabilities of complementary events must equal 1, i.e., P(Disease) + P(No Disease) = 1. To compute the probabilities in the previous section, we counted the number of participants that had a particular outcome or characteristic of interest and divided by the population size.

For conditional probabilities, the population size denominator was modified to reflect the sub-population of interest. In each of the examples in the previous sections, we had a tabulation of the population the sampling frame that allowed us to compute the desired probabilities. However, there are instances in which a complete tabulation is not available. In some of these instances, probability models or mathematical equations can be used to generate probabilities.

There are many probability models, and the model appropriate for a specific application depends on the specific attributes of the application. Two probability models are particularly useful: the binomial distribution model and the normal distribution model. These probability models are extremely important in statistical inference, and we will discuss them next. The binomial distribution model is an important probability model that is used when there are two possible outcomes (hence "binomial").

In a situation in which there were more than two distinct outcomes, a multinomial probability model might be appropriate, but here we focus on the situation in which the outcome is dichotomous. For example, adults with allergies might report relief with medication or not, children with a bacterial infection might respond to antibiotic therapy or not, adults who suffer a myocardial infarction might survive the heart attack or not, a medical device such as a coronary stent might be successfully implanted or not.

These are just a few examples of applications or processes in which the outcome of interest has two possible values, i.e., is dichotomous. The two outcomes are often labeled "success" and "failure", with success indicating the presence of the outcome of interest. Note, however, that for many medical and public health questions the outcome or event of interest is the occurrence of disease, which is obviously not really a success.

Nevertheless, this terminology is typically used when discussing the binomial distribution model. As a result, whenever using the binomial distribution, we must clearly specify which outcome is the "success" and which is the "failure". The binomial distribution model allows us to compute the probability of observing a specified number of "successes" when the process is repeated a specific number of times.

We must first introduce some notation which is necessary for the binomial distribution model. First, we let "n" denote the number of observations or the number of times the process is repeated, and "x" denotes the number of "successes" or events of interest occurring during "n" observations.

The probability of "success" or occurrence of the outcome of interest is indicated by "p". The binomial equation also uses factorials: the probability of observing exactly x successes in n observations is

P(x successes) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x)

Because data used in statistical analyses often involve some amount of "chance" or random variation, understanding probability helps us to understand statistics and how to apply it. Probability and statistics are in fact extensively linked. For instance, when a scientist performs a measurement, the outcome of that measurement has a certain amount of "chance" associated with it: factors such as electronic noise in the equipment, minor fluctuations in environmental conditions, and even human error have a random effect on the measurement.
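The binomial model described earlier, with n observations, x successes, and success probability p, can be sketched in a few lines; the patient scenario below is hypothetical:

```python
from math import comb  # exact binomial coefficient n! / (x! (n - x)!)

def binomial_probability(n, x, p):
    """P(exactly x successes in n independent trials, each with success prob p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical example: 5 patients, each responding to treatment with probability 0.3.
prob = binomial_probability(5, 2, 0.3)
print(round(prob, 4))  # 0.3087
```

As a sanity check, summing this probability over all possible values of x from 0 to n gives 1, since some number of successes must occur.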

Often, the variations caused by these factors are minor, but in many cases they have a significant effect. As a result, the scientist cannot expect to get the exact same measurement result in every case, and this variation requires that he describe his measurements statistically (for instance, using a mean and standard deviation).

Likewise, when an anthropologist considers a small group of people from a larger population, the results of his study (assuming they involve numerical data) will involve some random variation that he must take into account using statistics. This type of link between probability (randomness or "chance") and statistics applies to a wide variety of fields that deal with numbers.

It therefore behooves us to present some of the basic aspects of probability theory as they relate to statistics. Although the concept of randomness or chance is difficult to define, we will simply assume that an experiment or observation whose outcome cannot be predicted is a random experiment. The outcome of a random experiment is the result of a single instance of the experiment.

A set of possible outcomes is called an event--an event can consist of a single outcome or multiple outcomes. For a particular random experiment, the range of potential outcomes may be limited or unlimited; in either case, we call this range the sample space of the experiment.

If two events from a particular sample space have no outcomes in common, then those events are mutually exclusive. A function that is defined for the sample space of some random experiment and that has a finite probability for each value or interval in that sample space is called a random variable.

Of course, to understand the definition of a random variable, we must also know what a "probability" is. Recall the definition of the relative frequency of a data value: a number between 0 and 1 that expresses a particular datum's fraction of occurrences in the data set. If we conduct a random experiment a large number of times, then the probability of a particular outcome is its relative frequency.

Ideally, we would have to conduct the experiment an infinite number of times to truly discover the probability. Consider, for instance, a random variable X that corresponds to the outcome of the roll of a single die.

If the die is a "fair die," then each outcome has an equal chance of being rolled. That is to say, for any outcome a, where a can be any number in the sample space {1, 2, 3, 4, 5, 6}, P(X = a) = 1/6. Why is this result the case? Because if we roll the die a large number of times, each outcome in the sample space should occur (approximately) the same number of times.
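A quick simulation shows relative frequencies settling near 1/6; a minimal sketch (the seed and number of rolls are arbitrary choices):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed for a reproducible illustration
rolls = 60_000
counts = Counter(random.randint(1, 6) for _ in range(rolls))  # fair six-sided die

for face in range(1, 7):
    rel_freq = counts[face] / rolls
    print(f"P(X = {face}) ~ {rel_freq:.3f}")  # each close to 1/6 ~ 0.167
```

The more rolls we simulate, the closer each relative frequency tends to the true probability, which is the idea behind defining probability as long-run relative frequency.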

Occasionally, we will use set notation to describe events.


