# Need Help with "Regression" Analysis

1. ## Need Help with "Regression" Analysis

Experts:

I need some assistance with *REGRESSION* analysis in Excel. While I have some fundamental understanding of how (linear) regression works, I do need some clarification on both a) interpretation and b) potentially restructuring the sample data. Allow me to provide some background on the XLS (attached) and process first.

Sample File:
- Sample file includes purely made-up/random data; however, it mimics my organization's actual concept.
- Attached XLS includes 3 tabs:
1. "SampleData" -- contains 2 columns (Crime and State)
2. "SampleData_Recoded" -- given the error (i.e., "Input range contains non-numeric data"), I re-coded the string values into numeric values (see tab "Lookup")
3. "Lookup" -- summarizes the re-coding from string to numeric value

Regression Process:
- On tab "SampleData_Recoded", the SUMMARY OUTPUT is included (starting in column P)
- Based on the regression analysis, the "Coefficients" column gives me both the intercept (3.029850746) and State value (0.179104478)

Regression Model (based on the Coefficients information):
- Cell P23 includes a drop-down list (1, 2, 3 indicating CA, FL, and TX, respectively)
- Cell Q23 uses the formula which **predicts** the dependent variable (i.e., Crime).
- Given cell Q23 includes decimals, I am using cell R24 to round to 0 decimals.
- Lastly, cell S23 uses the INDEX/MATCH to cross-reference the numeric value to the crime (string).

My question(s):
1. Based on this (small) **sample** data set, does it mean that the most prevalent crime = "Homicide" if CA (1) OR FL (2) is selected in cell P23?
2. Alternatively, does it also mean that "Kidnapping" is the most frequently occurring crime if TX (3) is selected in cell P23?
3. If my interpretation of Q1 and Q2 are incorrect, what does the regression (model) tell me in this example?
4. Finally, is there another way to restructure the data set and perform the regression differently so that I get some form of matrix (for each state) and their respective crimes?

EEH  Register To Reply

2. ## Re: Need Help with "Regression" Analysis

Your p value is too small and would be considered as strong evidence against null hypothesis (i.e. claim made about the population).

In this case I'd interpret it as... what state crime occurs in is not valid predictor of what type of crime is committed.  Register To Reply

3. ## Re: Need Help with "Regression" Analysis

CK76 - yes, in principle, I agree that the p-value is NOT statistically significant. Keep in mind that this -- as mentioned -- is random sample data.

At this time, I merely want to explore the concept and determine if my **interpretation** of the regression output is stately correctly. Also, per Q4, do you know of another way to run the data differently to obtain a broader perspective? That is, I do NOT want to use any form of COUNTIFs and histograms to outline "how many crimes/state"? Instead, is there another statistical method that I can use to somehow **predict** the # of crimes per state (always keep in mind this is random/sample data only).

Thanks!
EEH  Register To Reply

4. ## Re: Need Help with "Regression" Analysis

These questions feel more statistical than Excel.

When I want to think through a regression analysis, I like to include a scatter chart with the resulting trendline that allows me to visualize what is going on. If you add a scatter chart, switch the x and y ranges (so column A is Y and column B is X -- Excel will default to A is X and B is Y), then add a trendline to the chart, you can see what you linear regression is doing. A scatter chart should show you that your recoding of the data is simply creating a "grid" in the chart, and then the trendline simply represents "the middle" of the grid.

Questions 1 and 2) "Prevalence" (or other expression of frequency) is at best only weakly represented in the raw data, so the regression really cannot say anything about prevalence.
Question 3) While it is true that Excel cannot perform a regression on text, simply assigning arbitrary numbers that have no meaning does not assure that your regression will have meaning. I see no meaning to be derived from the regression.
Question 4) I'm sure there are many ways to restructure the data set to get something meaningful out. The first step is to replace the text data with meaningful (not arbitrary) numbers. Q1 and Q2 suggest that you are trying to understand something about "prevalence" -- a count of how frequently each crime occurs as a function of state. If that is the interest, then I would expect part of the restructure to include a counting step (whether by pivot table or COUNTIFS() functions or however you like to count data in a spreadsheet). "State" is difficult to use as the independent variable, so you would want to substitute some meaningful number that will represent something quantifiable about each state. Maybe "year of admission to the union" to explore how and if older states impacts crime rates. Or maybe population or average elevation or land area or average price of milk or price of gasoline or ... (which leads me to reflect on Tyler Vigen's spurious correlations: http://tylervigen.com/page?page=1 ).

In short -- in order for regression to have meaning, the dependent and independent variables need to have some real meaning. Understanding what the inputs means will go a long way to understanding what the regression output means.  Register To Reply

5. ## Re: Need Help with "Regression" Analysis

MrShorty -- you made some very excellent points.

- Copy on the scatter plot... I also got the grid and realized that it's not really a scatter after all
- Frequency counts on the crime rates is very helpful
- I'll continue to think how to best represent "state"... I'm thinking to use population estimate (normalized)

Again, thank you for the great info... VERY HELPFUL! Cheers,
EEH  Register To Reply