Linear Regression, Population, Sample

fog37 · Oct 12, 2023

Hello,

1) Let's consider a population of 1,000,000 data points with each data point being represented by the pair of values (x,y).
Let's assume that, when plotted on a graph, the 1,000,000 points look like a spread out cloud with an overall positive linear trend. These 1,000,000 points represent the population. The best-fit line calculated using all the 1,000,000 points will have a specific slope and intercept.
Given a particular x value, the y value provided by the computed best-fit line equation will exactly represent the arithmetic average of all the y values of the data points that have the same x value. Is that correct? In essence, the average value of the y variable depends linearly on the value of the x variable.

2) In this case, instead of using all 1,000,000 data points to plot the graph and calculate the best-fit line, we only use a random sample of 100 points. The best-fit line obtained using the 100 random data points is a different line from the best-fit line calculated using the 1,000,000 points. We can take a different sample of 100 random points and the best-fit line will again be different (but similar in intercept and slope to the previous sample line). In essence, both the slope and the intercept, calculated for each different random sample of size 100, are random variables. Very often we can only work with a sample and not with the 1,000,000 data point population. Under which conditions will the sample best-fit line be a good approximation of the population best-fit line? The larger the sample, the closer the sample best-fit line will be to the population best-fit line... What conditions must be met to guarantee that the sample line is close to the population line?
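The sampling behavior described in 2) can be checked with a small simulation. This is only a sketch under assumed values: the population below is synthetic, and its intercept (2.0), slope (0.5), noise level (3.0), and x range are hypothetical choices, not anything measured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000,000 (x, y) pairs with a positive linear trend.
N = 1_000_000
x_pop = rng.uniform(0.0, 10.0, size=N)
y_pop = 2.0 + 0.5 * x_pop + rng.normal(0.0, 3.0, size=N)

# Best-fit line computed from every point in the population.
pop_slope, pop_intercept = np.polyfit(x_pop, y_pop, deg=1)

def sample_fit(n):
    """Fit a least-squares line to a simple random sample of size n."""
    idx = rng.choice(N, size=n, replace=False)
    slope, intercept = np.polyfit(x_pop[idx], y_pop[idx], deg=1)
    return slope, intercept

# Repeat the "draw 100 points and fit" experiment 500 times.
fits = np.array([sample_fit(100) for _ in range(500)])

print("population slope, intercept:", pop_slope, pop_intercept)
print("mean of sample slopes:      ", fits[:, 0].mean())
print("spread (std) of slopes:     ", fits[:, 0].std(ddof=1))
print("spread (std) of intercepts: ", fits[:, 1].std(ddof=1))
```

Each row of `fits` is one realization of the random slope and intercept. In this setup the sample slopes scatter around the population slope rather than matching it.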

Thank you!


FactChecker · Oct 12, 2023

fog37 said:

Given a particular x value, the y value provided by the computed best-fit line equation will exactly represent the arithmetic average of all the y values of the data points that have the same x value. Is that correct?

That is what linear regression does. If you want to use an X value to get the best estimate of the associated Y value, that is the thing to use. There are other calculations (see principal component analysis) that find the line with the smallest sum of squared perpendicular distances from the points to the line.
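To make the distinction concrete, here is a minimal sketch comparing the two lines on synthetic data. The data-generating values are hypothetical, and the perpendicular-distance line is taken as the first principal axis of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data with a positive linear trend.
x = rng.normal(0.0, 2.0, size=5000)
y = 1.0 + 0.5 * x + rng.normal(0.0, 2.0, size=5000)

cov = np.cov(x, y)  # 2x2 sample covariance matrix

# Least-squares regression of y on x: minimizes squared VERTICAL distances.
ols_slope = cov[0, 1] / cov[0, 0]

# First principal axis: minimizes squared PERPENDICULAR distances.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]                      # direction of largest variance
pca_slope = v[1] / v[0]

print("regression slope:     ", ols_slope)
print("principal-axis slope: ", pca_slope)
```

With this much scatter the two slopes differ noticeably. The regression slope is the one suited to predicting y from x, while the principal axis treats x and y symmetrically.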

fog37 said:

What conditions must be met to guarantee that the sample line is close to the population line?

If the 100 sample points have a large variation (scatter off of the line), then you can expect that the line calculated from them will vary a lot from the population line. Standard linear regression software uses the ratio of the scatter of the y values about the fitted line to the total scatter of the y values of the 100 points (see Assessing Goodness-of-Fit in a Regression Model).
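As a rough illustration of that ratio, the sketch below computes the residual and total sums of squares for one sample of 100 points, the resulting R-squared, and the usual standard error of the slope under the classical assumptions (independent errors with constant variance). The data are synthetic and the generating values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical sample of 100 points.
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 3.0, size=n)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

ss_res = np.sum(residuals**2)        # scatter of y about the fitted line
ss_tot = np.sum((y - y.mean())**2)   # total scatter of y about its mean
r_squared = 1.0 - ss_res / ss_tot

# Standard error of the slope: residual variance over the spread of x.
s2 = ss_res / (n - 2)
se_slope = np.sqrt(s2 / np.sum((x - x.mean())**2))

print("ss_res / ss_tot:", ss_res / ss_tot)
print("R-squared:      ", r_squared)
print("slope:          ", slope, "+/-", se_slope)
```

The standard error is the quantity that speaks most directly to how far the sample slope is likely to sit from the population slope: it grows with the scatter about the line and shrinks as the sample size and the spread of the x values increase.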

statdad · Oct 16, 2023

"Under which conditions will the sample best-fit line be a good approximation of the population best fit line?"

What are you using as your measure for assessing "good approximation"? You could make a very simple argument that for any given set of data, as long as you're referring to linear regression AND using least squares as your way to estimate slope and intercept, the resulting fitted line is a good approximation GIVEN the data set you're working with.

"The larger the sample, the closer the sample best-fit line will be to the population best-fit line... What conditions must be met to guarantee that the sample line is close to the population line?"

I'm not sure I agree with that first statement. For me the issue is knowing the quality of the sample: simply having a larger sample isn't enough to ensure what you say it does. (I know that basic probability says that both the sample intercept and slope will converge in probability to the population values, but I still have my doubts.)

I think my biggest bit of uncertainty here is the lack of a clear meaning of what you think constitutes a "good approximation".

BWV · Oct 16, 2023

It’s not as if the population gives you a ‘true’ regression model in the same sense that you can get a true mean or variance. Typically you are not trying to sample from a population with OLS, but rather trying to infer some relationship between variables. In many cases, such as in financial markets, you are not sampling: you are using the complete data set, not a sample of it. Even with a sample, the tiny differences between sample estimates and population values of intercepts and betas tend to be immaterial compared to the overall error.

fog37 · Oct 16, 2023

BWV said:

It’s not as if the population gives you a ‘true’ regression model in the same sense that you can get a true mean or variance. Typically you are not trying to sample from a population with OLS, but rather trying to infer some relationship between variables. In many cases, such as in financial markets, you are not sampling: you are using the complete data set, not a sample of it. Even with a sample, the tiny differences between sample estimates and population values of intercepts and betas tend to be immaterial compared to the overall error.

Thank you BWV. I like your explanation. We have a bunch of data points and the linear regression model attempts to explain the collective behavior of these points, the goal being to find the relation between the dependent and independent variables (simple linear regression). It may be incorrect to call the best-fit line calculated using the entire population the "true" best-fit line.

@statdad points out the meaning of "good approximation"... I would say that, assuming we collect a good simple random sample of size ##N##, the sample best-fit line approximates the population best fit line better (the betas and intercepts are closer in value) if the sample size ##N## is large (whatever large means)...
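The effect of the sample size can be explored with a simulation along these lines. It is a sketch that assumes simple random sampling from a synthetic population; the population parameters and the list of sample sizes are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population of 1,000,000 points with a linear trend.
N = 1_000_000
x_pop = rng.uniform(0.0, 10.0, size=N)
y_pop = 2.0 + 0.5 * x_pop + rng.normal(0.0, 3.0, size=N)
pop_slope, pop_intercept = np.polyfit(x_pop, y_pop, deg=1)

def slope_spread(n, reps=300):
    """Std of the sample slope over `reps` simple random samples of size n."""
    slopes = np.empty(reps)
    for r in range(reps):
        idx = rng.choice(N, size=n, replace=False)
        slopes[r] = np.polyfit(x_pop[idx], y_pop[idx], deg=1)[0]
    return slopes.std(ddof=1)

sizes = [25, 100, 400, 1600]
spreads = [slope_spread(n) for n in sizes]

for n, s in zip(sizes, spreads):
    print(f"n = {n:5d}   std of sample slope = {s:.4f}")
```

In this setup, quadrupling the sample size roughly halves the spread of the sample slope around the population slope, which is the familiar ##1/\sqrt{N}## behavior. That holds here because the samples are drawn at random; a large but biased sample would not behave this way.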
