Sampling theory and random sample

fog37 · Jan 10, 2024

In inferential statistics, we have a large population, collect data from it to get a random sample of size ##n##, and infer the population parameters from that single sample.

I read that the random sample can be interpreted as the collection of ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable, and the height of each individual in the random sample is a realization of that r.v. However, a more correct interpretation of a random sample is the following: each element of the random sample, for example the 5 heights ##[6, 5.4, 6.1, 5.5, 6.4]##, is the realization of a different random variable. So the random sample is the realization of a random vector, a sequence of i.i.d. random variables ##[X_1, X_2, X_3, X_4, X_5]## with a joint probability distribution ##f(x_1, x_2, x_3, x_4, x_5)##. Why is this the correct interpretation of the random sample and not the first one with a single r.v.? Are the two interpretations somehow equivalent to each other? How?

When we perform regression analysis on some random sample of data, are we dealing with a pair of random variables, ##X## and ##Y##, i.e. a 2D random vector ##Z=(X,Y)##? Or with two random vectors, ##X=[X_1, X_2, X_3, X_4, X_5]## and ##Y=[Y_1, Y_2, Y_3, Y_4, Y_5]##, where each ##x## value and each ##y## value is the realization of a different random variable ##X_i## or ##Y_i##?

Thank you as always for any comment and correction.

FactChecker · Jan 10, 2024

fog37 said:

I read that the random sample can be interpreted as the collection of ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable, and the height of each individual in the random sample is a realization of that r.v. However, a more correct

Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every draw is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that there is more than one distribution involved in the sample.

CORRECTION: I missed the IID part of the description of the second interpretation. I see no practical difference between the two interpretations.

fog37 · Jan 10, 2024

FactChecker said:

Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every draw is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that there is more than one distribution involved in the sample.

Well, I have found this interpretation in several places. For example:

The population is an infinite set of values drawn from a random variable ##X##. Sampling from a population is the same as repeatedly drawing new values from ##X##. A random sample of size ##n## is a collection of individual draws from ##X##.

The point seems to be that making ##n## independent draws from a random variable ##X## is equivalent to making one draw of ##n## i.i.d. random variables ##X_1, X_2, \dots, X_n##. Is that really the case? Can you help me appreciate why the two scenarios are equivalent?

FactChecker · Jan 10, 2024

Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.
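One way to see why there is no practical difference: under the i.i.d. assumption the joint density factorizes as ##f(x_1, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i)##, so one draw of the vector ##(X_1, \dots, X_n)## has the same distribution as ##n## independent draws of ##X##. The sketch below is only an illustration under assumed values: the normal model, the numbers `MU` and `SIGMA`, and the function names are hypothetical and do not come from the thread. It builds samples both ways and compares the distribution of the sample mean.

```python
import numpy as np

# Hypothetical height model (assumed values, not from the thread).
MU, SIGMA = 5.8, 0.4
N, REPS = 5, 20_000  # sample size and number of simulated samples


def samples_from_single_rv(rng):
    """REPS samples, each made of N successive independent draws of one r.v. X."""
    flat = rng.normal(MU, SIGMA, size=REPS * N)
    return flat.reshape(REPS, N)


def samples_from_random_vector(rng):
    """REPS samples, each a single draw of the random vector (X_1, ..., X_N).

    The diagonal covariance encodes independence; equal means and variances
    encode "identically distributed".
    """
    mean = np.full(N, MU)
    cov = SIGMA**2 * np.eye(N)
    return rng.multivariate_normal(mean, cov, size=REPS)


def summarize(samples):
    """Mean and standard deviation of the sample mean across simulated samples."""
    means = samples.mean(axis=1)
    return means.mean(), means.std(ddof=1)


rng = np.random.default_rng(0)
summary_a = summarize(samples_from_single_rv(rng))
summary_b = summarize(samples_from_random_vector(rng))
print("n draws of X      :", summary_a)
print("one draw of vector:", summary_b)
```

Both summaries should agree up to simulation noise, with the standard deviation of the sample mean close to ##\sigma/\sqrt{n}##.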

fog37 · Jan 11, 2024

FactChecker said:

Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.

Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the 2nd interpretation allows the sample statistics to also be random variables, as they are...

So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality but described in two different ways... Is one more technically correct than the other? As mentioned, when we talk about regression analysis, it seems better to regard each pair of ##x## and ##y## values in the random sample as a realization of two random variables ##X## and ##Y##, instead of two sequences of random variables, one for the ##x## values and one for the ##y## values...

For example, in the case of tossing a die multiple times, is the outcome of each toss the realization of a single random variable, OR are the outcomes the realizations of different random variables?

Thank you!

FactChecker · Jan 11, 2024

fog37 said:

Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the 2nd interpretation allows the sample statistics to also be random variables, as they are...

I agree. It is a distinction that I have probably been careless about in the past. There is a difference between a sample, which is an already collected set of data, and the random variables that gave you that sample. I think it is standard to use lower case (##x_i##) for the data and upper case (##X_i##) for the random variables.
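A small sketch of that lower-case versus upper-case distinction, assuming (purely for illustration) a normal model for the ##X_i## with made-up values `MU` and `SIGMA`: the mean ##\bar{x}## of the collected data is a fixed number, while ##\bar{X}## takes a different value each time the ##X_i## are realized afresh.

```python
import numpy as np

# Observed data (lower case x_i): the five heights quoted in the thread.
x = np.array([6.0, 5.4, 6.1, 5.5, 6.4])
x_bar = x.mean()  # a fixed number once the data are collected

# Hypothetical model for the random variables X_i (upper case); assumed values.
MU, SIGMA = 5.8, 0.4
rng = np.random.default_rng(1)


def draw_sample_mean(n=5):
    """Realize X_1, ..., X_n afresh and return the resulting sample mean."""
    return rng.normal(MU, SIGMA, size=n).mean()


# Many realizations of the random variable X_bar under the assumed model.
X_bar_realizations = np.array([draw_sample_mean() for _ in range(1000)])

print("observed x_bar:", x_bar)
print("spread of X_bar over repeated samples:", X_bar_realizations.std(ddof=1))
```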

fog37 said:

So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality but described in two different ways... Is one more technically correct than the other?

IMO, one situation where the distinction is significant is when you collect data in stages, so that some data has been collected while other data is not yet collected and is still a random variable. You might see this in stopping problems. Suppose that you were doing an experiment where collecting data was expensive or difficult and you needed to decide whether to collect more data. Also, I think that the distinction would be significant in many Bayesian methods with prior and posterior distributions, and in bootstrap methods.
I have no real experience with these types of problems and will have to leave this discussion to others.
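For the bootstrap case mentioned above, here is a minimal sketch of a nonparametric bootstrap of the sample mean, using the five heights from the first post. The number of resamples `B` and the seed are arbitrary choices. The collected data ##x_i## stay fixed, and each resample is treated as a fresh realization of random variables drawn from the empirical distribution of the data.

```python
import numpy as np

x = np.array([6.0, 5.4, 6.1, 5.5, 6.4])  # observed heights from the thread
B = 10_000  # number of bootstrap resamples (arbitrary choice)
rng = np.random.default_rng(2)

# Each row is one resample of size n drawn with replacement from the data.
resamples = rng.choice(x, size=(B, x.size), replace=True)

# The mean of each resample is one realization of the bootstrap sample mean.
boot_means = resamples.mean(axis=1)

# Their spread estimates the standard error of the sample mean.
boot_se = boot_means.std(ddof=1)
print("bootstrap standard error of the mean:", boot_se)
```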
