1st, 2nd & 3rd most common ranges within a single column of values

**IO4lZE** · 02-22-2015, 10:12 AM

Hi. Thanks for stopping by my question. Would you know how I can analyze data to give you the most common range, for a more meaningful analysis? I guess it would require that we define some parameters, but what are those parameters, and would this be a formula or array? Essentially, it would be, conceptually, like a scatterplot, but without the plot itself. I just want the data to be automatically analyzed to output the most common ranges. You probably already got what I'm asking, but here's a quick, unsophisticated example, assume 1 column like this:

3
91
95
96
96.5
89
157
177
175
181
250
299
301
793

So, I'd want the output for the above example to be:

1st most common range of values: 89-96.5
2nd most common range of values: 177-181
3rd most common range of values: 299-301

I'm not sure how/where I'd tell Excel what constitutes a "range", or how this even works.

Thanks for showing me the right direction/formula/array, since I think this might be more meaningful than just looking for 1 value that represents "mode", when often, there is no single number that occurs more than once in a data set and more meaningful data appears in clusters that are probably just ranges. Thanks for helping me out and guiding me down the right path.

**XOR LX** · 02-22-2015, 10:31 AM

Hi.

How are you defining "most common" in terms of a "range" of values?

How is it determined what constitutes the start and end values for one of these "most common" ranges?

Regards

**sourabhg98** · 02-22-2015, 10:34 AM

Hello
Just a wild try
May help...

**MrShorty** · 02-22-2015, 11:45 AM

My first thought would be to start with the FREQUENCY() function: https://support.office.com/en-us/art...rs=en-US&ad=US

I'm not exactly sure how you would come up with "bins" (maybe every 50 -- 0,50,100,150,...?). Once you come uyp with suitable bins, the FREQUENCY() function will calculate how many are in each bin, then you can use the LARGE() function ( https://support.office.com/en-us/art...rs=en-US&ad=US )to return the 3 largest frequencies. If needed, you could then use a lookup function to return the bin associated with that frequency.

**IO4lZE** · 02-22-2015, 04:08 PM

Exactly. Those are the key questions I'm basically also asking. In a way, what I'm looking to do is find a mathematical solution to what you and I normally do intuitively when we scan and automatically "make sense of" a scatterplot and summarize, "ok, so basically, values between here & here are important". I'm no mathematician, that's for sure, but I think, to answer these questions, it seems like you'd almost have to analyze each data set (my 1 column) uniquely first, in order to come up with statistical things like "variance", etc ... but I'm not trained in that, so I can't quite think how a statistician would think. So we'd want to find the most common ranges within any data set, where, indeed, the span between clusters or bins or ranges may be highly variable based on the data set. That's all I can think of for now. Hope it helps some. Please let me know if you have additional thoughts. Thanks for your post.

**newdoverman** · 02-22-2015, 04:19 PM

Would this be something along the lines that you are looking for?

	A	B	C	D	E	F	G
1	Data	Bins	Frequency	Range	Large to Small
2	3	0	0	100	6
3	91	100	6	200	4
4	95	200	4	300	2
5	96	300	2
6	96.5	400	1
7	89	500	0
8	157	600	0
9	177	700	0		=IF(ROW()-1>3,"",LARGE($C$1:$C$11,ROW(INDIRECT("1:3"))))
10	175	800	1
11	181	900	0	=INDEX($B$1:$B$11,MATCH(G1,$C$1:$C$11,0))
12	250	1000	0
13	299		=FREQUENCY(A1:A14,B1:B11)
14	301
15	793

**IO4lZE** · 02-22-2015, 04:21 PM

Originally Posted by MrShorty

Mathematics is the native language of the natural world.

MrShorty,

Thanks for your thoughts there re frequency and bins. Yeah, the bins thing is really the key, but I need a mathematical solution (your quote is perfect for this post) to what our brains do intuitively/second nature when we visually scan, say, a scatterplot and are able to quickly come up with a solution. This question kind of feels like a "Brain Games meets Excel" challenge. Thanks for your post and hopefully we can all figure it out. My post to Xor LX has some related thoughts. Feels like someone with a statistics type of background could really lend some valuable insights here for us as well. I'm really interested to learn more and find a solution to this puzzle. Thanks

**IO4lZE** · 02-22-2015, 04:22 PM

newdoverman

let me have a looksee, and get back to you. thank you

**IO4lZE** · 02-22-2015, 04:32 PM

newdoverman,

First, thanks for your post. Ok, to answer your question directly, no, I don't think bins with arbitrary 100,200, 300 etc cutoffs is the way to go because those are not the natural cutoffs inherent to the data. It's not what the human eye/our brain would intuitively do, eg, when examining a scatterplot, to use my example above.

I almost feel the solution would be like this (I'm a beginner honestly):

There would be "2 runs" (at minimum), where (I guess) an array would do a "first pass" evaluating the data for conformity, variance, or whatever statisticians would understand way better than me. That first run will essentially "create the bins". The 2nd pass will do the counts. The key is "what are the bins (the ranges)?" The funny thing is that the DATA clusters themselves should "tell us" what those bins are....and maybe someone will have a frequency solution, combined with some stats that won't even need "bins". These are just some thoughts.

PS A follow up thought I had: I wonder if an array would help out. Here's why I'm thinking that: Basically an array could (?) continually "sweep" or do passes thru the data, using unique, variable increments with each "sweep" (each pass) until if finds the right "bin" (range). I really don't know the solution though. Thanks to everyone for their help so far. It will be interesting to see what the solution is in the end.

**MarvinP** · 02-22-2015, 04:35 PM

Hi IO4IZE,

I have a little different spreadsheet. See if this helps.

**IO4lZE** · 02-22-2015, 04:40 PM

sourabh, will review your earlier .xlsx and respond shortly, thank you.

**newdoverman** · 02-22-2015, 04:47 PM

Here is another way of looking at this. Use of Pivot Table ranges:

**IO4lZE** · 02-22-2015, 04:49 PM

sourabh & marvinp, can you explain the .xlsx, and the logic as we go from A thru I (sourabh), or A thru N (marvinp)?

I'd appreciate it. That will help me follow along.

**IO4lZE** · 02-22-2015, 04:50 PM

To all, Please don't just post excel files. Please include an explanation of your logic so I can follow along. Thanks a lot.

**IO4lZE** · 02-22-2015, 04:57 PM

In the meantime, I'm trying to go thru the 3 .xlsx posted so far, to see if I can make sense of them. Right now I'm looking at MarvinP's since it looks like an updated version of sourabh's.

**MrShorty** · 02-22-2015, 05:09 PM

Originally Posted by IO4lZE

MrShorty,

Thanks for your thoughts there re frequency and bins. Yeah, the bins thing is really the key, but I need a mathematical solution (your quote is perfect for this post) to what our brains do intuitively/second nature when we visually scan, say, a scatterplot and are able to quickly come up with a solution. This question kind of feels like a "Brain Games meets Excel" challenge. Thanks for your post and hopefully we can all figure it out. My post to Xor LX has some related thoughts. Feels like someone with a statistics type of background could really lend some valuable insights here for us as well. I'm really interested to learn more and find a solution to this puzzle. Thanks

It is an interesting question. I might suggest that, in many ways, in much the same way as we might see this as a mathematical problem first, I might also suggest that this is more of a generic "algorithm development" question. The way I see this sort of thing happening:

1) mathematical definitions and concepts.
2) Algorithm development -- this is where we take those "intuitive" steps and try to formalize and systematize them into something a series of steps and processes that will accomplish the intuitive part with the mathematical parts.
3) Then, we work on implementing our algorithm in our chosen programming language (in this case -- spreadsheet formulas and functions).

Since we are kind of stuck on step 2, I would probably first start with some research into others' algorithms for similar analyses (since I have a hard time believing that we are the first people in all of human history to wonder about this kind of analysis). A quick internet search comes up with this from Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis
This one: http://www.readperiodicals.com/201006/2108164771.html
This one: http://www.data-miners.com/companion/Chapter11.ppt
and many others.

I haven't got time right now to digest such a broad topic, but I would start by researching these algorithms and see how they are implemented. Then come back to Excel and look at how I might implement these in Excel.

**IO4lZE** · 02-22-2015, 05:10 PM

ignore this post #17. Pls see instead my next post #21, since I completely rewrote this post. Thanks!

**IO4lZE** · 02-22-2015, 05:13 PM

Mr Shorty, thank you for that. Good. Yeah, it looks like we're making progress right now with sourab & marvinp, and I'm reviewing newdoverman's last post. Have an awesome day, and thanks again.

**IO4lZE** · 02-22-2015, 05:22 PM

newdoverman, I see where you're going for sure, thanks! So I can see you've tightened the ranges from 100s down to 20s. But the prob with that, respectfully, is that they're still arbitrary (you make them up), instead of the perhaps preferred approach in marvinp et al, which is (almost) fun 'cuz you can try and walk thru the logic one column at a time (ie helper columns...which is the premie stage I'm at, lol). I'm still awaiting a response to my last question to them (re logic in Column D). newdoverman, I really appreciate you.

**MrShorty** · 02-22-2015, 05:34 PM

From an internet search for "cluster analysis in Excel":

http://www.lynda.com/Excel-tutorials.../165438-2.html
http://www.neilson.co.za/k-means-clu...crosoft-excel/ (though this one is more VBA than Excel).

There were many other hits as well.

**IO4lZE** · 02-22-2015, 05:44 PM

sourabh & marvinp,

I've figured out what the "60" is in your .xlsx in $D$1.

It's an Average difference from 1 value to the next (an average of the column C differences).

Since all the bins are ultimately based on Column D, and Column D2:D14 are relative to the "60" value, I'm trying to determine how you got the right Upper & Lower limits of those ranges (Column E & F) based on the yes/no comparison to D$1$. Still reviewing that.

**IO4lZE** · 02-22-2015, 06:10 PM

sourabh & marvinp,

I'm presently evaluating D & E ... pretty thoughtful approach. I'll need to think more on it. First, thanks. Second, I think I see an error:

Why is the value "89" from column A missed? Shouldn't that be the first lower limit of the first range, instead of 91, esp since the very first value is only 3?

Thanks for your thoughts.

I'm stepping out for a quick bite. Will check back later. Kind regards and thanks for the help so far...

**MarvinP** · 02-22-2015, 07:21 PM

Hi IO4IZE,

Lots of problems are clearer if looked at from a different direction. This problem seemed to have "funny" ranges or Bin limits in it. They weren't evenly spaces, just kind of random. I was interested to see what Frequency would do with your numbers with funny bin limits. I needed to do an Array Formula (CSE formula) to get the Frequency Function to work.

Then it occurred to me we could do this problem easier/differently just using CountIF. See the attached where I simply counted the number of numbers less than the bin limits in Column A. Countif() isn't an Array function and works a lot faster. Then I simply subtracted the countif below each row to see how many were in that specific bin.

When I look at the problem in this way, who needs Frequency when we have Countif? See the attached. Does this help? Do you still want to see which bin is fullest?

**IO4lZE** · 02-22-2015, 08:40 PM

Marvin,

Thanks for the follow up.

Here's the problem imho. In this more recent CountIF version, you're falling into a similar trap encountered earlier in this thread, which is that you're assigning arbitrary bin cutoffs. That's exactly what I don't want. I want the opposite. I want the bins (ranges) to be automatically detected as shown below. I followed the counts on your previous version (array in M, etc). The primary challenge is not so much the counting, but rather first finding those Lower & Upper limits, not unlike you had previously.

In a sense, the "smart" bins, or ranges, will automatically detect "clusters" of data, and then report that back. Using my initial example from my first post, that would be like:

3
91
95
96
96.5
89
157
177
175
181
250
299
301
793

where I'd want the output for the above example to be:

1st most common range of values: 89-96.5
2nd most common range of values: 177-181
3rd most common range of values: 299-301

Where to go from here...

I think the previous version, which focused on the main challenge, namely determining the ranges, is the best place to continue from. However, since you've already shown a lot of sophistication with that, I'm wondering (dare I ask?) if you or others might be able to make it more sophisticated (more accurate), as follows: It's good but not great in one respect: It's basically saying "as soon as you take a leap that's greater than the average leap, then mark that as the lower limit" (and related logic for upper limit). But, the fault (or area for improvement) is if we can figure out a smarter way (statistically speaking), a logic that's more accurate than "average leap", since, if there's a HUGE gap somewhere in the data, then that gap will ultimately be responsible for missing (not calling) some ranges of values.

Also, a quick side note: I'm learning that thinking in terms of "bins" is the wrong way to go, since we don't want predetermined, arbitrary bins with arbitrary cutoffs since assigning arbitrary cutoffs (funny or not) misses the point completely; rather we want the formulas/arrays to report back the ranges (lower & upper limit) of the actual data "clusters", if you will, that is, where values are most common, using an approach like your initial one. So just wondering if we can improve from "average leap" to something more accurate, and avoid the notion of "bins" altogether, continuing from Frequency with strange Bin Ranges.xlsx, and simply modifying/improving the logic from D thru F. Thanks for your help so far.

**MarvinP** · 02-22-2015, 09:03 PM

Ok,

So you don't know the low and high value for each bin. You want "the code" (us) to tell you what these are. Can you tell me how many bins there will be? How many ranges/buckets/bins are we supposed to come up with? That might help!!

This sounds more like a statistics problem now. I'm thinking of bell shaped curves over "clusters" of points/bins. How do we know if 5 clusters are better than 4? Some Least Squares Regressions tool?

I'm starting to get a feeling for the problem but this might be one of those without enough information to give an exact answer.

**IO4lZE** · 02-22-2015, 10:33 PM

How many ranges/buckets/bins are we supposed to come up with? That might help!!

marvin? Thanks for your post. You'll notice that that's already been discussed at length in this thread. Also please don't be angry or pretend to be angry at me. From your eyes, emoji and extra punctuation, it looks like you are bugging out. Please keep focused and professional thanks. Nah, I didn't want "the code". That's weird cuz I never even used that term and I'm not like that; and for anyone reading the whole thread, they can see that's an unfair mischaracterization. I've been very open-minded, inspired, and have been just as interested in the theory and the math as much as any solution. Actually, you're asking me to provide the

ranges/buckets/bins

but that's actually the point of the post is to come up with the ranges, very much like I described in post#24 which is the same as I posted originally. Appointing arbitrary "buckets" or "bins" is like trying to push a square peg through a circular hole, pigeon holing really. And it means you (theoretically) didn't understand the initial post or other posts where I talk about "scatterplots" and "intuitive" and math, etc.

Marvin, what's strange is that you specifically did not use bins in your first submission and had a completely different solution that didn't involve bins in your attachment Frequency with strange Bin Ranges.xlsx, so I thought you already understood what we're trying to do here? I thought I was being very genuine and kind and helpful to others in explaining the .xlsx you submitted...but simply attached but without an explanation or any discussion. I think you misread my entire last post #24 or are acting like you took it personally because your last post was very defensive. You misread or want to be perceived as having misread my previous post before this one.

When you say

I'm starting to get a feeling for the problem but this might be one of those without enough information to give an exact answer.

, please don't speak on behalf of others. Let others make their own decisions and help and come up with their own methods. No, there is enough information. You may just not have the answer or want to say that. This is not a simple black and white question as you apparently already knew from your Frequency with strange Bin Ranges.xlsx, Columns D-F. So you knew all this already. This is quite strange.

I thought you would be more gratious since the xlsx you submitted was a carbon copy of the one initially posted by another user (sourabh), buit just added a count? That part confused me. Who came up with that non-bin logic? Like I stated in my post before yours, or post #24, that seemed to show an understanding of the challenge initially posted, quite clearly I might add. No need to post back if you have additional negative comments, sarcasm, or insults. Those were inappropriate.

In my last paragraph there I provided additional clarity in response to others' (not mine) question about "bins":

I'm learning that thinking in terms of "bins" is the wrong way to go, since we don't want predetermined, arbitrary bins with arbitrary cutoffs since assigning arbitrary cutoffs (funny or not) misses the point completely; rather we want the formulas/arrays to report back the ranges (lower & upper limit) of the actual data "clusters", if you will, that is, where values are most common, using an approach like your initial one. So just wondering if we can improve from "average leap" to something more accurate, and avoid the notion of "bins" altogether, continuing from Frequency with strange Bin Ranges.xlsx, and simply modifying/improving the logic from D thru F. Thanks for your help so far.

For others, post#24, and really all my posts, summarizes the issue well, and solutions are welcomed.

**MrShorty** · 02-22-2015, 11:08 PM

I don't know if anyone looked at some of the links I posted. I looked a little closer at the k-means algorithm page. Seems simple enough, especially for 1D data like we have here:

1) Choose how many "clusters" there should be. Note that the documentation says that this is more difficult if you want the algorithm to determine how many clusters to find, so, as a first look at the algorithm, specify how many clusters. Your intuitive analysis suggest 3 clusters.
2) "Guess" at the means for the three clusters (maybe 90, 150, and 300). Some suggest that these could be "random" for the first iteration, though random choices don't seem to work well with this set and our first ever look at this kind of algorithm.
3) "Assign" each data point to an "average".
4) Compute three new averages based on these assignments.
5) Return to step 2 with the new averages, reassign data points to the appropriate averages.
6) Repeat until the three averages stop changing.

It's not very good, it was put together quickly by a novice in these cluster algorithms, but maybe it is a start. In any case, I think that, if someone were to take the time and effort to become familiar and somewhat proficient in these kind of algorithms, such a problem shoudl be solvable.

**IO4lZE** · 02-23-2015, 01:36 AM

MrShorty,

Your kungfu is strong. First, I apologize because, for some reason, I didn't see your post #20. Second, huge thanks for providing a sense of direction. Ok, it's quite late here and I want to study up on these most recent 2 posts from you when I get back to the office tomorrow morning. I'm looking forward to checking that out for sure. and following up here of course with you.

Have an awesome evening.

**sourabhg98** · 02-23-2015, 09:41 AM

Hi
Sorry for the late reply
Actually there is no specific way to extract ranges from a large array that are most frequent because there is no specific criteria... until we get some specific ranges so that we can extract frequencies of the specified ranges and then them most frequent ones...so what I have done is this...

--In the column B I have arranged the observations from column A in ascending order
--In column C I subtract each value in column B with the value above it...
--Now in cell D1 I have the average of those differences which we got in column C
--Now D column Checks if the difference of each observation (with the value above it) is smaller than the average difference or not...if yes then gives a 1 otherwise returns 2
--Now if see D column we have values in the form 2,1,1,1,1,2,1,1,1,2,1,1,2 ..we can observe that every 1 after a 2 represents the lower limit and every 1 with a 2 after it represents the upper limit...
--So column E extracts the upper limits and F the upper limits
--Now G and H columns just remove the blank cells from column E and F
--Finally we combine the columns G and H with a " - " in column I

Hope this helps!

**IO4lZE** · 02-23-2015, 02:34 PM

Sourabh,

Yes, I was able to figure all that out on my own back on posts #15 & #21. thanks anyway.

And I even summarized it in post #24, which is the current direction of this thread:

It's good but not great in one respect: It's basically saying "as soon as you take a leap that's greater than the average leap, then mark that as the lower limit" (and related logic for upper limit). But, the fault (or area for improvement) is if we can figure out a smarter way (statistically speaking), a logic that's more accurate than "average leap", since, if there's a HUGE gap somewhere in the data, then that gap will ultimately be responsible for missing (not calling) some ranges of values.

You wrote,

Actually there is no specific way to extract ranges from a large array

However, that's actually not true. I think you also may have missed MrShorty's post #27. Damn, I feel like a moderator, lol. Remember, if you're gonna set a fire, stick around.

OK, thanks.

**IO4lZE** · 02-23-2015, 09:59 PM

MrShorty, Just a quick note here to let you know I've fallen a little behind, and haven't forgotten, but mostly to express thanks, and let you know I promise to follow up here as soon as I can...thanks/thanks all. We had some wild weather here today with the first snow in decades, perhaps like most of the country & probably world, so had to get out with the fam for scenic photos and that whole thing. I hope you're doing great and sending my respects along.

**IO4lZE** · 02-24-2015, 02:05 AM

MrShorty,

Right on! k means cluster analysis would definitely serve as that more sophisticated method and is pretty much what I was getting at. Reading thru that made me feel probably how a caveman felt when he did his first cave drawing, or when he first said "mama"- haha. I like that idea, esp as described:

“In statistics and data mining, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean.

That seems like a step up from what I was calling the "average leap". I was really surprised to see how much coding, that Sheldon Neilson (kudos), eg, put into this concept, and the k means cluster analysis really seems like a good answer to my original question in this thread. I like that it runs thru all the iterations (why I originally thought an array might be best) to find these clusters

The k-Means algorithm is an iteration of the following steps until stability is achieved i.e. the cluster assignments of individual records are no longer changing

this quote from the http://www.neilson.co.za/k-means-clu...crosoft-excel/ previously provided. And I also like that data observations belong to the cluster with the nearest mean...which is a variant, albeit, of "average"; only it doesn't compare it to the average of the entire data set, making it a more accurate approach I believe.

Well, I still have to go thru your latest table, and will follow up thereafter for sure. And, this week I'll definitely want to do some more independent research on k means, as you wisely suggested. Thanks for providing the guidance you did thru this clusterf. It's cool if you think about all these posts as individual observations in a data set, and how it all kind of came together (like a cluster) that wasn't necessarily that obvious; just a neat notion I think. Equally cool was that, while I'd mentioned ranges and "scatterplots" early on, your post #20 was first to mention the term clusters, and I hadn't actually seen it (my bad) til after I'd mentioned "clusters" in post#24. This thread has been a pretty cool experience and I learned a lot, thanks to the collective enterprise. I'll follow up as promised. Thanks for your patience in the meantime.

**IO4lZE** · 02-24-2015, 02:29 AM

I'm also learning a lot from this wikipedia page on:

http://en.wikipedia.org/wiki/K-means_clustering

There's a lot on that page that shed a lot of light on various aspects from this thread, things like cluster shape, some of the history of k means, initiation methods, Forgy vs Random partition methods, fuzzy stuff, algorithms, variations, and the discussion section, and the importance of k means in signal processing and data mining.

Just wanted to share that follow on with everyone.

Thanks again MrShorty. Will keep in touch.

**MrShorty** · 02-24-2015, 09:56 AM

If you want more research and more complication, after looking briefly at marvinp's and sourabhg98's proposals, they look like variations on the "basic sequential algorithm" scheme: http://en.wikipedia.org/wiki/Basic_s...rithmic_scheme

If you are interested in a different looking algorithm for this problem.

**IO4lZE** · 02-25-2015, 09:46 PM

Thanks. Well, I looked at the BSA, and imho, sourabh/marvinp's didn't look like that. Theirs is simply finding the average distance between values, and then creates an upper or lower limit as soon as one number goes to the next but only if that distance > the average difference.

In the process of reviewing your last link, though, I came upon another nice page with various clustering methods here: http://en.wikipedia.org/wiki/Categor...ing_algorithms , which, interestingly, includes "K-means ++ (basically an improvement over k means) with more info here http://en.wikipedia.org/wiki/K-means%2B%2B . However, perhaps even a basic k means would be cool to work out; either way, thanks. Ok, from here on I apologize if I wrote too much:

Ok, I've checked out your most recent prop, and I do admire & appreciate that it takes what appears to be a more accurate approach...but I wonder how much the arbitrary 100, 180, 300 cutoffs had to do with that, since it sort of relies on a "visual first" (which I assume played a role in creating them). What I liked, of all things though, was how how you corrected for outliers, by comparing the distance to these arbitrarily-defined centroids to the SD of the resulting #s within that kluster.

I was thinking the problem is really all about our initiation method. Your most recent prop kind of relies, first, on picking arbitrary cutoffs for your centroids (100, 180, 300). Plus we kind of picked "3" as our # of klusters, and then arbitrarily picked 3 (perhaps convenient) cutoffs. I was thinking that, if I were doing it, it would be easy to "cheat", meaning one could just pick # of klusters and the actual centroids for initiation, by sneaking a peak first (which any smart person would do if they could). But in cases with massive amounts of data, it's just not feasilble (like the way I've conveniently bolded certain sets for our outcome). What I'd like is to be able to not have to study the data first, visually, but to be able to do this "blind", if you will. It looks like the initiation method and the iterations are the 2 key steps. It seems we'd need to first figure out the smartest initiation method, but again blindly and randomly. From http://en.wikipedia.org/wiki/K-means_clustering:

Initialization methods

Commonly used initialization methods are Forgy and Random Partition.[9] The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al.,[9] the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable.

I guess yours is more the Forgy partition style (referring to my first wiki link above), although I was wondering if there was an equally straight-forward way that didn't necessarily arbitrarily pick #klusters and the exact centroids initially. Wondering, eg (once we figure out best initiation method), if a modern day array would help us to quickly run thru iterations to help determine the both the # of clusters and the actual "centroids", a la some combination of one of these (preferably Random) partition initiation methods plus k means or k means ++, or something similar. Ultimately, I want to stick with the main idea of k means, namely partitioning n observations into k clusters in which each observation belongs to the
cluster with the nearest mean.

Having sort of stumbled here with too much written, what I'd like to achieve (life: a work in progress) is shown in the DIAGRAM below the paragraph on Initiation (shown above) within the link above. In that diagram, it shows 4 STEPS. K means++ has a different seeding method, and is apparently twice as fast computationally (on iterations) and 1000x more accurate. Proposed in 2007, it says:

This seeding method yields considerable improvement in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as vanilla k-means in both speed and error.

Generally, at least to me, it seems like the key to finding clusters is identifying the fact that the distance between each of the points in that cluster is (a lot) less than the larger (eg "average") difference between data points throughout the entire data set. So, I think an array that runs through iterations finding these "nearest means" is the key. A lot of this sounds like identifying star systems or galaxies by their relative proximity, etc.

In fact, another method called "k-medians" made me think that one could also use that in combination, in other words, while running thru the means iterations, it tries to find the median or central number within the (draft) cluster, and then keeps feeding that back into the array. So, I was thinking a combination of iterations that use both mean and median to identify the clusters. I know I'm mostly babblying and probably looking foolish around now. But I do feel the thought process is important. I know you're the same. Enjoying this, even though reading up on all these methods makes my eyes a little blurry. I'd obviously be more than happy implementing something like k means. Well, thanks for allowing me to ramble. Sorry, I know I wrote too much today. Kind regards.

**Pete_UK** · 02-25-2015, 10:29 PM

Here's a recent thread that you might find useful:

http://www.excelforum.com/excel-prog...ml#post3998817

Hope this helps.

Pete

1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Re: 1st, 2nd & 3rd most common ranges within a single column of values

Thread Information

Users Browsing this Thread

Similar Threads

[SOLVED] need vba code to compare two column get result of common values of both column

Bookmarks

Bookmarks

Posting Permissions