# Average values while removing outliers

1. ## Average values while removing outliers

hello all,

I am looking for a way to achieve an average while removing outlining values.

I attached a sample of some data that im working with...

What i am hoping is that i can take an average of Hrs worked per Assy in Column I where The value in Columb B and D match but exclude any lines that are more than XX% different from the other entries. *the % to be used is still TBD*

I initially tried to add a helper column that checked the hours worked against a calculated average and i was going to have it just display "yes" or "no" if the hours work was greater than 10% different than the average but the problem with that is the major outliers were still weighing the average and thus botching the results.

Im sure i did a poor job of explaining this but if you can follow along at all please let me know if you have ideas or questions.  Register To Reply

2. ## Re: Average values while removing outliers

I'm not sure I understand exactly what you want. Your data is quite scattered to where it is difficult to intuitively see which values you intend to include in the averages and which you intend to exclude. If you could maybe work out a few example manually so we can see what you intend.

I am not a real statistician, but every "outlier" algorithm that I know of requires that you first calculate the "unfiltered" average before determining which points are outliers. If you are trying to use any of those algorithms, I don't know of a way to implement those algorithms without first computing the average.

Or maybe you are just wanting to add two additional criteria to the existing AVERAGEIFS(). AVERAGEIFS(...,I:I,">"&I2*0.9,I:I,"<"&I2*1.1) which simply adds the condition that you only want to average values that are within 10% of the current value in column I.

Help us understand what you are trying to do, and we will try to help you implement it.  Register To Reply

3. ## Re: Average values while removing outliers

(Note: I would never use TRIMMEAN, which arbitrarily excludes the largest and smallest x% of the data, even though many people (incorrectly) describe its purpose as removing "outliers".)

One statistical definition of an outlier is: a value that is less than q1-IQR*x or more than q3+IQR*x, where q1 is the 25%ile, q3 is the 75%ile, IQR (interquartile range) = q3-q1 (middle 50%), and x is some IQR factor (TBD below).

Caveat: Just because a value is an outlier, that does not necessarily mean it should be excluded. Usually, a value that is identified as an outlier should be excluded only if it is a mistake. That said, many people do indeed arbitrarily exclude outliers, with the intent of trying to determine some "centrality" of the data.

To that end, I would calculate the following:

M15 (q1): =QUARTILE(I2:I13,1)
M16 (q3): =QUARTILE(I2:I13,3)
M17 (IQR): =M16-M15
M18 (out1): =M15-M17*1.7
M19 (out3): =M16+M17*1.7
M20 (avg): =AVERAGEIFS(I2:I13, I2:I13, ">=" & M18, I2:I13, "<=" & M19)

That average excludes I3 and I9 in your example. You can determine that by entering the following formula into M2 and copying down through M13:

=IF(AND(\$M\$18<=I2, I2<=\$M\$19), "in", "out")

-----

Note that I choose x=1.7. This is a subject of some debate and disagreement.

Many people choose 2 or 2.5, even 1.5, IIRC. Those factors are easy to remember.

I choose 1.7 because in a normal distribution, that corresponds to +/-3sd ("sd" is standard deviations).

(But the IQR method of identifying outliers is not limited to normal distributions.)

To explain.... For a normal distribution, two criteria are commonly used to determine the outlier limits: +/-3sd ("weak" limits), and +/-4sd ("strong" limits).

And for a normal distribution, an IQR factor of 1.7 corresponds to +/-3sd; 2.5 corresponds to +/-4sd.

I prefer the "weak" limits because they include 99.73% of normally-distributed data, and it is more likely to exclude some data.

In contrast, the "strong" limits include 99.99% of normally-distributed data, and it is less likely to exclude any data.

For your example, it does not matter because the outliers are so extreme. But it might make a difference with other data.

Anyway, you should feel free to choose any IQR factor between 1.5 and 2.5.  Register To Reply

4. ## Re: Average values while removing outliers

PS.... Originally Posted by joeu2004 Caveat: Just because a value is an outlier, that does not necessarily mean it should be excluded. Usually, a value that is identified as an outlier should be excluded only if it is a mistake. That said, many people do indeed arbitrarily exclude outliers, with the intent of trying to determine some "centrality" of the data.
Speaking of "centrality" of data, you might consider calculating the MEDIAN, not the AVERAGE excluding outliers.

The median is the value such that (about) 50% of the data is above and 50% of the data is below it.

It is less sensitive to extreme differences in the data, such as yours.

And it avoids arbitrarily excluding valid data just because they seem to be "outliers".

That is why, for example, we talk about the median price of homes, not the average price. That way, the billionaire's \$50M mansion and the minimalist's 1000 sq-ft one-room home do not affect the statistic in an otherwise "middle class" neighborhood.  Register To Reply

There are currently 1 users browsing this thread. (0 members and 1 guests) 