+ Reply to Thread
Results 1 to 9 of 9

Iteration that finds concetration of numbers in a normal distribution

  1. #1
    Registered User
    Join Date
    05-25-2017
    Location
    Buchs, Switzlerand
    MS-Off Ver
    2019
    Posts
    75

    Iteration that finds concetration of numbers in a normal distribution

    Hi there!

    My Situation:
    My file has 16 worksheets.
    Each worksheet contains about 1500 more or less normally distributed values.
    Every worksheet has the identical quantity of entries, but the value ranges differ greatly. In the example provided (excel file) the number can range between 0% and 6%, but on another worksheet the range can be -50000% to +50000%
    On each worksheet there are 15 sections each with a cell holding a “low” value and a “high” value (like on the excel file provided).

    Aim:
    What the macro must do is to change the values of cell H2 (“low”) and H3(“high”) in an iteration, the aim is to satisfy the following conditions:
    1) find a concentration of occurrences so that either scenario 1 or scenario 2 lies at 67% or above.
    AND
    2) The number of observations is to be at least 2% or higher.
    In the example provided the green cell G6 shows a value of 68% for scenario 1 (and 32% for the other scenario), so condition 1 is satisfied.
    Condition 2: the number of observations is 53 (see cell F11) or 3,48% of the total (1548), therefore this condition is also met and we have a first solution.
    Now the macro should jump to the next section and repeat the iteration with the numbers in O3 and O4. To work systematically through the entire number range O3 should be the minimum of the next range. If no solutions can be found with the number in O3 as a minimum then the value O3 must be increased in small increments and so on (we are thus moving in “inch-worm fashion” up the number range and scanning for concentrations.
    To graze through all the numbers it is best to start from the absolute min. (0% in this example) and work upwards (up until the absolute max.). It is possible that less than 15 solutions or even no solutions are found.
    In that case, the macro can simply move on to the next worksheet and proceed in the same manner until all worksheets are worked through. Hope I didn’t confuse you totally. Please come back with questions if I did!

    Best regards,
    Excel_Arate
    Attached Files Attached Files

  2. #2
    Forum Guru
    Join Date
    04-13-2005
    Location
    North America
    MS-Off Ver
    2002/XP and 2007
    Posts
    15,803

    Re: Iteration that finds concetration of numbers in a normal distribution

    I don't see how you came to the solution you give in H3 and H4. When I change H3 and H4, nothing changes. Same for O3 and O4.

    In a generic sense, what you describe sounds like something Solver could do (for one scenario at least). I cannot suggest anything specific, since I see no connection between your "objective cells" (G6 and G9) and your "decision variables" (by changing cells H3 and H4). Solver does not use a brute force "inchworm" type method like you describe, but prefers to use algorithms like Newton's algorithm or an evolutionary algorithm that are more efficient than brute force algorithms.

    Once you get a handle on solving one scenario with Solver, then you can use Solver in VBA to "loop through" your different scenarios. There are lots of examples around this forum and the internet, if this proves to be a useful approach. You might start here: https://peltiertech.com/Excel/SolverVBA.html
    Quote Originally Posted by shg
    Mathematics is the native language of the natural world. Just trying to become literate.

  3. #3
    Registered User
    Join Date
    05-25-2017
    Location
    Buchs, Switzlerand
    MS-Off Ver
    2019
    Posts
    75

    Re: Iteration that finds concetration of numbers in a normal distribution

    Dear Mr. Shorty,

    Thank you for your swift reply. I wanted to keep things simple, so I only stuck in the values to help visualize the problem. Now I included the underlying data for two of the worksheets along with the formulae. You will see that changing the values in H3 and H4 will change the %-concentration for the scenarios.

    So what the macro should do is to run an iteration for the values in H2 and H3 (start by taking the absolute minimum from B2) and then find the ranges where either G6 or G9 >= 67% AND F9 >= 1.85%. The macro should run through thecomplete number range (up until the abs. max. is reached) and find the sections if any that satisfy these conditions. When finished with that worksheet, the macro should then hop on to the next worksheet.

    I have now done this manually for two sections for case 1 and one section for case 2.
    Hope this is clearer now. If not, please ask. Thank you!

    Best regards,
    Excel_Arate
    Attached Files Attached Files

  4. #4
    Forum Guru
    Join Date
    04-13-2005
    Location
    North America
    MS-Off Ver
    2002/XP and 2007
    Posts
    15,803

    Re: Iteration that finds concetration of numbers in a normal distribution

    Can you explain in a little more detail exactly how you manually came to those solutions? Exactly what does this "brute force" algorithm that you describe look like? It looks to me like there are many potential solutions, are you looking for any solution that works, or do you need to find a specific solution.

    Here's what I see happening, and my start to analyzing the problem.

    1) Looking at G6 and G9, and the desired target values (G6>0.67 [G9<0.33] or G9>0.67 [G6<0.33]), it looks like the basic goal is to find when E7 is twice E10 or E7 is half E10. E7 and E10 are basically conditional counts of columns A and B, respectively. The problem seems to boil down to "find a bin where E7 is either E10*2 or E10/2". Am I seeing this correctly?
    2) Because the problem looks very much like this kind of count of bins, my first step was to create histograms of both columns. (Note, this was developed on case 2 sheet)
    2a) IN E63, I entered 0.5 (the size of bin I decided to try). In E65, I entered 0. In E66, I entered =E65+$E$63 (note the mix of relative and absolute references). Copied down to E145 to get a series of 0.5 sized bins from 0 to 40.
    2b) To get the frequency distributions, I used the FREQUENC() function (https://support.office.com/en-us/art...7-fd9ea898fdb9 ). I selected F65:F146, Entered =FREQUENCY(A7:A818,$E$65$E$146) and confirmed with ctrl-shift-enter (FREQUENCY() is an array function). Copied this across to G65:G146. Now I have a frequencies for each column, and can create a histogram (column) chart if I want.
    3a) At this point, I know that I am looking for bins where the frequency in column F is half or double the frequency in G, so I add a column to get the ratio of frequencies. In H65 =G65/F65 and copy down.
    3b) You also say that the total for each bin should be 1.85% of 1524, so I enter these two numbers into two convenient cells (I used J61 and J62) and then multiply them together [I used J63 =PRODUCT(J61:J62)]. Then I added another column I65 =SUM(F65:G65) and copied down.
    4) To help identify the entries that meet the criteria, I entered a few boolean columns.
    4a) J65 =I65>J$63
    4b) K65 =OR(H65<=0.5,H65>=2)
    4c) L65 =AND(J65:K65).
    5a) Scan down column L -- Any TRUE result in column L represents a solution to the problem (if I understand correctly).
    5b) Scan down column K -- Any TRUE result in column K is almost a solution, and you may find a solution by expanding the bin a little. For example, I can see from column K that the 5 to 5.5 bin is close, if I expand this to be 4.6 to 5.6, I get another solution.

    I see a lot of possible solutions by trying different bin sizes. It is not clear to me if it is enough to just find one solution or many solutions, but I see a possible approach here that could be more efficient than a brute force try every possible bin definition until you find one that works.

    I copied it onto the case 1 sheet. The distributions here seem to be fairly heavily skewed towards the low end. By trial and error, I found that a bin size of 0.00015 found solutions for the .00015-.00030 bin, 0.00180-.00195 bin, and the .00270-.00285 bin. There are probably many others to be found.

    Does it seem that I am understanding the problem correctly? Does it look like something like this can work?

  5. #5
    Registered User
    Join Date
    05-25-2017
    Location
    Buchs, Switzlerand
    MS-Off Ver
    2019
    Posts
    75

    Re: Iteration that finds concetration of numbers in a normal distribution

    Dear Mr. Shorty,

    Many thanks for your detailed and response, which is well thought through! You are pretty much spot on with your understanding of the problem and I have worked your solution into the attached (updated) file. I took up your idea on the “ratio of frequencies” (item3a of your message) and inserted a grey column starting at K66, which simply states if there is a potential fit for either scenario, going only by the ratio of frequencies and not by the level of relevance (1.85%).

    I hadn’t used a histogram but instead depicted normal distributions, like the on inserted in “case 1”- From this I got a visual impression of where the numbers lie and where they are spread out more or are bunched together more densely. I start at the absolute minimum (0.00%) and take this as a first “low” value. I then take a number which is just a tiny bit larger (0.01%) and use this as the first “high” setting. By gradually increasing the higher setting you find those sections that satisfy the 2 conditions you know of.
    Let’s continue with this example. Raising the “High” setting to 0.03% lets the red scenario gain ground to 62%, but when I increase it further to 0.04%, scenario 2 drops back to 58%. So here, no solution is found (I have inserted the numbers in the second row of those section boxes on case 1).
    As a sudden drop in the probability of one scenario logically indicates that the other scenario is becoming more likely, I take the old “high” value as a new “low” and again continue increasing the “high” value, expecting that other scenario’s probability to increase (and hopefully surpass the threshold criteria). Admittedly, this is a “brute” manual procedure (and very time consuming). That’s why I thought a macro could plough through the data.

    While your way is far more elegant, the hitch is that we are forced to take equal bin sizes, while all I really want is a bin size that has a minimum number of entries (1.85% of the total data sample) with a minimum level of “success” (67%) for either scenario. So the bin sizes must vary to find all the solutions. In fact, the bin sizes vary quite considerably from what I saw - some solutions are very strong having more than 50 items of data with a success rate of 69%, while others just “eek by”. A thought: Possibly, a macro could alternate the bin sizes and copy out and intermediate results found on the way?
    Also, I am looking for as many solutions as possible. There could be 10 sections, just 1 or sometimes zero sections that fulfill the criteria. That is why my idea was that macro was to “play” with the numbers in H3 and H4 and then move on O4 and O5 and so on (after that the next of the 16 worksheets should be worked through).

    Side note: each worksheet will have the equal amount of entries but extremely different distributions. Also, the 67% and 1.85% criteria are arbitrary, but I could change those in the macro. At this stage this is a side issue of course.
    Hope I have clarified things sufficiently. I didn’t want to be too long-winded...
    How best to proceed from here?

    Best regards,
    Excel_Arate
    Attached Files Attached Files

  6. #6
    Forum Guru
    Join Date
    04-13-2005
    Location
    North America
    MS-Off Ver
    2002/XP and 2007
    Posts
    15,803

    Re: Iteration that finds concetration of numbers in a normal distribution

    While your way is far more elegant, the hitch is that we are forced to take equal bin sizes,
    I don't know that the approach I started exploring is necessarily limited to equal bin sizes. As a starting point for developing a working algorithm equal bin sizes was the easiest way to start, but I see no reason why it would require equal bin sizes. Of course, it adds the complexity of "how do I decide how large to make each bin" to the algorithm, but I see no reason why that couldn't be added once one understood how it all works.

    Which got me to thinking that maybe a clustering algorithm might be a useful part of this. It seems that the basic idea is to find "clusters" where A has a cluster in regions where B does not, and B clusters where A does not. Here's a spreadsheet example of k-means clustering: https://www.excelforum.com/excel-pro...ng-tester.html
    Wikipedia's k-means clustering: https://en.wikipedia.org/wiki/K-means_clustering Note that Wikipedia suggests that k-means could be NP-hard -- meaning that there may not be a good, non-brute force algorithm for this -- especially if you want to find all possible bins and know that you have found all possible bins.
    Wikipedia's general clustering analysis page: https://en.wikipedia.org/wiki/Cluster_analysis

    Have you tested your brute force algorithm? It seems simple enough, but are you certain it will find all possible bins? As far as programming your brute force algorithm it looks like it should be:
    1) assign low value
    2) Add step value to low and assign to high value. Do you have an idea of what kind of step value you would use? Always use 0.01%, or is there more involved to choosing a step value?
    3a) Test if this is a solution. If so, reset low value and return to step 1.
    3b) Test if this is trending towards a solution. If so, add step to high value and return to step 2.
    3c) Test if this is not going to be part of a solution. If so, reset low value and return to step 1.

    The exact details of the algorithm obviously still need to be worked out. Which of those steps do you have trouble with in VBA? How much is still the pre-programming step of figuring out the details of the different tests and resets? Do you have a few sample data sets where you already know all possible solutions to test the algorithm against?

  7. #7
    Registered User
    Join Date
    05-25-2017
    Location
    Buchs, Switzlerand
    MS-Off Ver
    2019
    Posts
    75

    Re: Iteration that finds concetration of numbers in a normal distribution

    Thank you for being so resourceful on this! I looked at the k-means clustering macros which clusters one set of data and yes, your idea of adapting this to “relative” clustering (if I understand you correctly) i.e. having one scenario that is highly concentrated (at least the double frequency) versus the other scenario appears very promising. How to do this I fear is quite beyond me, as my skills in VBA are limited to the use of macro recordings and adapting them arduously to fit a certain situation.

    Regarding the bin sizes, I looked at how they actually influence the results in case 1 (this tab I had to remove from hte file becvause of Memory sace Problems here on this portal). Beginning with the original bin size of 0.015% we started off at and gradually increased it to 0.021%. It is quite striking how some solutions just disappear and others emerge! I pursued that idea of simply increasing the bin sizes from 0 and copying the results received “on the way” into a table (planning to fine-adjust later manually). If you look the remaining tab "results for testing", it works, albeit in a “brute” style which wouldn’t win any prizes for efficiency. Some parts also need some refining and in others I simply lack the programming skills, but perhaps you can guide me.
    First please take a look at the “results for testing” tab.

    1. A loop will continuously increase the bin sizes (as you can see in C3, C4, C5 and so on). Eventually, the bin size will reach 2.81% (see D26). I have auto filtered for “relevant solutions” under I27 and get scenario 2 as a result, the solution range being {LOW:-5%;HIGH:-2,277%}. For testing purposes the Auto filter must always be refreshed every time a new value is entered in D26.
    2. The first successful bin size is copied into K27 and beneath it a 2 is entered (I could use the auto-filter but maybe the macro would be too slow with that, so I may try an index() function.
    3. The successful range (LOW and HIGH value) is copied into H2 and H3. For the next successful bin size of 3.48% I get two results, which are then copied into (N2, N3) and (T2, T3) respectively.

    The general procedure is:
    I. Create a loop for all worksheets (16)
    Establish the range for the progressive bin sizes. This was quite challenging, as the distributions on every worksheet are worlds apart. For the first bin size I decided to take the range between the 2nd lowest to the 2nd highest value (eliminating outliers) and divide the result by 40000 (taking a min. of 0.01%). The step size will be 10% of that initial number. Allowing for a loop of 40000 will then cover all the possible result. It probably takes quite a while to run this, but that’s OK, I’ll have a nice cup of coffee while I wait (at the outset I have a lot more classes then entries but that doesn’t bother me)

    II. Create a loop within which takes the progressive bin sizes, enters them into D27 and processes the results as described above
    Now, since the bin size increases by small amounts, I must avoid listing the same section countless times. Therefore, when I receive an answer range, I test for intersection with the previous result with 3 possible outcomes:
    1) The new solution intersects with the old one but doesn’t have more hits >>> ignore
    2) The new solution intersects with the old one but doesn’t have more hits >>> improvement: replace existing solution under 1)
    3) The new solution doesn’t intersect with the old one >>> new solution
    I guess, I must use “if …then” statements, but could you give me a clearer idea how this might look?

    Also, testing for intersection is OK if you have one solution and compare it to one other solution. I simply check if either of the high or low values is between the high and low values of the new bandwidth. However, as soon as I have more than 1 solution either with the new or the old bandwidth, I run into difficulties. I tried setting up an array formula, but it just wouldn’t work. Would you know what to do in such a case?
    I do appreciate your help very much!

    Best regards,
    Exel_Arate
    Attached Files Attached Files

  8. #8
    Forum Guru
    Join Date
    04-13-2005
    Location
    North America
    MS-Off Ver
    2002/XP and 2007
    Posts
    15,803

    Re: Iteration that finds concetration of numbers in a normal distribution

    I have let this percolate a little. Here's where my thoughts are going now.

    1) Clustering algorithms are beyond me, as well. However, with the correct choice of "programming language/platform", there could be a library/addin/etc. with the clustering algorithms preprogrammed. In addition to MarvinP's Excel/VBA implementation, I see that R has a library of clustering algorithms (https://www.statmethods.net/advstats/cluster.html ) and Matlab (https://www.mathworks.com/help/stats/kmeans.html ). For that matter, if you can figure out what to call this analysis you are trying to do (I assume you aren't the first in all of human history to try this), you might find preprogrammed solutions for it already out there.

    2) I see two large, overall parts to your question -- how to develop the algorithm (independent of programming language) and how to implement the algorithm in a given language (you seem to prefer something that combines Excel and VBA). Those are both fairly broad problems, and I find it is helpful for us if you can clarify exactly what you need help with. Most of your questions seem to assume the algorithm you want is developed (and I keep throwing new algorithms at you as if it is not), so they focus on program language specific things (like loops in VBA and reading/writing data from/to cells. When we get to this point, are you already familiar with VBA's loop structures? Are you familiar with using the Range object (specifically the .Value property) in order to read/write data from/to a cell? If you are not already familiar with these basic programming structures, then it might be valuable to spend some time becoming familiar with these programming structures before we worry too much about how they will fit into the larger algorithm. It is certainly easier for us to explain a small error you are making in a specific VBA statement than to develop an entire program for you (especially if you already knew most of the programming).

    3) Your latest implementation used ~40000 bins for a data set that contain ~1500 entries. This seemed excessive to me, as there should be no more than 1500 bins (of 1 each) possible. Which led me to think what would happen if we used the entries themselves to define bins? If we use every entry, we will get ~1500 bins of 1 each. If we use every 10 bins, we will get ~150 bins of 10 each, and so on. This led me to:
    A) combine the data in A and B into a single, sorted list.
    B) Extract every 30th value (SMALL(combined list,0) SMALL(combined list,30) and so on), then use the two FREQUENCY() functions to get the counts for each of these bins and so on like before. With careful attention to relative and absolute references, I could then make 30 copies of this block of cells to get all 30 possible combinations of bins using every 30th value. I could then work on what this looks like using every 35th entry, every 40th entry, and so on until I am satisfied that I have found all of the solutions (I don't have the patience to carry it out that far, yet -- At this point, I don't know how to know when you have found all possible solutions). Note that, at this point, I did not use any VBA -- just making copies of the calculation.
    C) At this point, I just need to extract the solutions from the various copies. If I arrange the calculation in a flat list, then a simple filter like you used could show all of the solutions. Then, as you note analyze the solutions to extract "unique solutions". Assuming you can figure that out, then we should be pretty close to an algorithm that will work.

    At this point, I assume you can figure out how to implement something like that in Excel/VBA, so I have just left the algorithm description only. How does that look?

  9. #9
    Registered User
    Join Date
    05-25-2017
    Location
    Buchs, Switzlerand
    MS-Off Ver
    2019
    Posts
    75

    Re: Iteration that finds concetration of numbers in a normal distribution

    Hi Mr. Shorty,

    I appreciate your effort and your thoughts! At this point I had to make a fundamental decision: either take step back and rebuild the procedure or to pursue that idea of increasing bin sizes and see where it leads me. I went for the latter and I’m almost there!
    First let me come back on your comment regarding the 40000 bins. I agree absolutely that this number appears excessive. But if we have 1500 entries, then we would only have 1500 bins (with 1 entry each) if they were evenly distributed. As we are both aware, when the data gets lumpy, you may have vast fields of nothing and then all of a sudden a you may have a cluster. The problem is, however, that you still have to test tiny bin sizes, because things get “fuzzy” very quickly hwne the bin sizes increase. Thinking back of the old days with ultra short wave radio stations if you spun the radio dial to quickly, you wouldn’t find the stations. I have however put in an “Exit for” which allows an “early” exit. This exit is prompted when the concentration of the data exceeds a pre-specified level.

    Could you please help me on the following (please look at the table in the attachment): The macro works quite fine and copies the newly found ranges into the columns R:X. The soultions are compared to what I have and basically 3 situations can arise:

    1) The new range intersects with the old one but doesn’t represent an improvement (the number of results (33) is less than that of the old range (34) and can be deleted.
    2) The new range intersects with the old one represents an improvement (the number of results (34) is at least to that of the old range (34). It’s an improvement because the new range is slightly larger as the macro is gradually increasing the bin sizes.
    3) The new range doesn’t intersect with the old one and is a new solution!

    Now, the side of the “old” solutions gradually compiles the solutions as the program runs its course and could hold up to 30 ranges. The aim of the game is to compare the new “suggestions” that are continuously copied into the “new” section and to either take them to replace those that they improve (it’s a one-on-one relationship), added on at the bottom of the “old” section or ignored. When done the new section is cleared so new entries can be copied into it.
    What I have set up so far is a series of complicated loop structures, that sort of works but when you have 3 new entries in the “new” section and they are to be tested against 5 exiting “old” solutions, it gets rather confusing…
    Would you be able to help me here with a VBA code for this task (using the table as you see it)? That would be excellent!

    Best regards,
    Exel_Arate
    Attached Files Attached Files

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. Replies: 14
    Last Post: 11-19-2019, 09:17 AM
  2. ways to determine a normal or log normal distribution in excel
    By aprildu in forum Excel Formulas & Functions
    Replies: 6
    Last Post: 10-29-2016, 06:32 PM
  3. Replies: 9
    Last Post: 07-24-2015, 03:40 PM
  4. How do you generate random numbers in a normal distribution?
    By Ladee_bird in forum Excel Programming / VBA / Macros
    Replies: 5
    Last Post: 03-13-2013, 08:18 AM
  5. Replies: 5
    Last Post: 01-31-2011, 12:21 AM
  6. [SOLVED] NORMAL DISTRIBUTION
    By FLKULCHAR in forum Excel Formulas & Functions
    Replies: 13
    Last Post: 10-27-2005, 06:05 PM
  7. Normal Distribution?
    By Anthony Slater in forum Excel General
    Replies: 3
    Last Post: 03-08-2005, 04:06 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1