+ Reply to Thread
Results 1 to 23 of 23

Aggregating data based on a search key, across a very large dataset possible?

  1. #1
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Aggregating data based on a search key, across a very large dataset possible?

    Hey friends,

    I have a database full of receipt IDs and need to filter out particular records (i.e. receipts belonging to buyers/sellers that no longer exist). I have spent almost 3-4 hours after work, every day, over the last 5 weeks, and have only finished around ~2200 records, one by one. I underestimated how long it would take, as there are around 330,000 records. So I’m 0.67% of the way there...

    I’m thinking there must be a faster way to do this in Excel?

    So the process is I’m performing:
    1) A look-up of a specific string (in the ‘sub_seller’ column) that can appear in one to many (1:M) records in another table’s (‘Search Table’) column called ‘id’, and
    2) Once found, across all the rows found, aggregates all the values in the column next to it (‘owner_id’), including the search/lookup string itself, and dumping this result set into the adjacent cell in the original table
    3) I’d then like to repeat this process for the other column (obj_buyer) in that same original table.
    4) Once the process (lookup, aggregation, data dump) has been performed for both tables, I’d like to join only the common values across the two data dumps only the common values (i.e. INTERSECT operation) into a separate column (‘INTERSECT’).
    5) I then have a filter/exclusion table, where if any of the values (id’s) in ‘UNION’ (sorry I just realised I meant to use 'INTERSECT') are found in said table, they are to be excluded from the results (which will help to also exclude all the receipts associated with those excluded id’s


    I’ve attached a clean file containing a complete example of this process above. I’ve also put screenshots below to show what each step above looks like / is doing:
    (Also just a note, I realised '1636' (lookup value) should also be in the 'aggregate_seller_owners' column, apologies!

    Step 1)


    Step 2)

    Step 3)

    Step 4)

    Step 5)


    Any ideas? Any assistance would be greatly appreciated.

    Thanks so much guys!
    Attached Images Attached Images
    Last edited by Restrictedz; 08-15-2019 at 07:28 AM. Reason: I meant 'intersection', not 'union'

  2. #2
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Apologies, but I thought the pictures attached would fall below each step but doesn't look like to be the case! Hopefully all the pictures are in order for you (they are for me!) - Steps 1 to 5.
    Link is here: dropbox dot com/s/hzu0a0nhjk18qv2/receipts_db_Cleaned.xlsx
    Last edited by Restrictedz; 08-06-2019 at 10:13 AM.

  3. #3
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Bump - any ideas at least on the aggregation part? (Steps 1 & 2)?

  4. #4
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Does anyone have any ideas or suggestions? Thank you

  5. #5
    Forum Guru
    Join Date
    08-28-2014
    Location
    USA
    MS-Off Ver
    Excel 2019
    Posts
    17,535

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Hello Restrictedz and Welcome to Excel Forum.
    Please upload the file directly to this site by clicking on the GO ADVANCED button below the Quick Reply window and then scrolling down to Manage Attachments to open the upload window.
    Also your profile states that you are using Excel 2001(?). Prior to Excel 2003 .xlsx was not supported, so I assume the profile to be incorrect. This is important because the proposals offered may be tailored to the version of Excel you have, for example if you are actually using 2016 or later you may be able to use Power Query (Get & Transform) to do some of what you are asking, so please update that information.
    Let us know if you have any questions.
    Consider taking the time to add to the reputation of everybody that has taken the time to respond to your query.

  6. #6
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Thanks for your reply JeteMc. I have updated my profile accordingly to correctly reflect the Excel version I'm currently using.

    I have tried to upload, but given the size of the original file (16MB), I had to delete a lot of rows to reduce it down to say, 6MB, which, when compressed into a .zip, falls under the 9.79MB file limit for a .zip however I cannot upload it as the website doesn't allow me to: "receipts_db_Cleaned.zip - Upload of file failed.", or I still get the error "file too large to upload". Please advise? Could you perhaps try downloading it via the Dropbox URL I pasted above and uploading it? I'm not sure how else to upload this file, but would really appreciate any assistance.
    Last edited by AliGW; 08-11-2019 at 06:39 AM. Reason: Please don't quote unnecessarily!

  7. #7
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Deleted a majority of the records in the file to reduce file down as close to 1MB as possible. Attempt #2 at upload.

    JeteMc, can I please get your help with this? You're the only one who seemed to notice my post. Steps 4 & 5 is what I'm trying to automate instead of doing this manually which I've completed 2500 rows over the last 7 days, putting in 4 hours each night.
    Attached Files Attached Files
    Last edited by Restrictedz; 08-11-2019 at 06:41 AM.

  8. #8
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Cool Hi ! Try this ‼


    This demonstration works only with numeric ID or its execution raises an error …

    To paste to the Database worksheet module :

    PHP Code: 
    Function FilterEval(ByVal S$, ByVal T$)
             
    FilterEval Filter(Evaluate("IF(ISNUMBER(MATCH({" "},{" "},0)),{" "})"), FalseFalse)
    End Function

    Sub Demo1()
      Const 
    ", "
        
    Dim VX$, R&, T$(3), W
            V 
    = ['Search Table'!A1].CurrentRegion.Columns("A:B").Value2
        With 
    ['Exclusion Table'!A1].CurrentRegion.Columns(1).Rows
            X 
    Join(Application.Transpose(.Item("2:" & .Count)), S)
        
    End With
            Application
    .ScreenUpdating False
    With CreateObject
    ("Scripting.Dictionary")
        For 
    2 To UBound(V)
            If .
    Exists(V(R1)) Then .Item(V(R1)) = .Item(V(R1)) & V(R2) Else .Add V(R1), V(R2)
        
    Next
            V 
    = [A1].CurrentRegion.Columns("C:D").Value2
        
    For 2 To UBound(V)
            If .
    Exists(V(R1)) Then T(0) = .Item(V(R1))
            If .
    Exists(V(R2)) Then
                      W 
    = .Item(V(R2))
                   
    T(1) = V(R2) & W
                
    If T(0) > "" Then
                        W 
    FilterEval(T(0), W)
                    If 
    UBound(W) > -1 Then
                        T
    (2) = Join(WS)
                        
    FilterEval(T(2), X)
                        If 
    UBound(W) > -1 Then T(3) = "YES: '" Join(WS) & "'"
                    
    End If
                
    End If
            
    End If
                If 
    T(0) & T(1) > "" Then Cells(R5).Resize(, 4).Value2 TErase T
        Next
           
    .RemoveAll
    End With
            Application
    .ScreenUpdating True
    End Sub 
    Do you like it ? So thanks to click on bottom left star icon « Add Reputation » !
    Last edited by Marc L; 08-11-2019 at 11:10 AM. Reason: code typo …

  9. #9
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Hi ! Try this ‼

    This works well with the file attached - thanks so much.
    The only issue is that the code breaks on row/line 52412, with a run-time error "Type mismatch" for the FilterEval() function.

    Any ideas why it breaks? If you can please refer to the dropbox link (dropbox dot com/s/hzu0a0nhjk18qv2/receipts_db_Cleaned.xlsx)I put in one of my posts above for the full file which I am running your code on.
    Unfortunately I cannot attach the file directly here because the file size is too big, even if I compress it...

    Mods, I have even uploaded the same file as a .xlsb, which is under file size limit, but I get a "Upload of file failed." error; same happens if I upload a .zip very well under the size limit. I think it's a host issue...

    Thanks Marc
    Last edited by Restrictedz; 08-12-2019 at 10:39 AM.

  10. #10
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow

    No idea, as I warned maybe a non numeric …

    As your dropbox is not accessible, the better is to attach a new workbook (no needs much rows)
    but this time with all the possible data configurations in order to have all your initial post steps result …

  11. #11
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Quote Originally Posted by Marc L View Post
    No idea, as I warned maybe a non numeric …

    As your dropbox is not accessible, the better is to attach a new workbook (no needs much rows)
    but this time with all the possible data configurations in order to have all your initial post steps result …
    That's what I thought as well, but I can confirm they are all numeric (there is no non-numeric values) - I checked using TYPE(), and all the values in the two (2) columns are '1' (numeric).
    Also, the Dropbox link works for me - what did you mean by not accessible? If you can replace the spaces and the 'dot' with an actual ".", it should work fine.

    I have attached a snippet of one part which broke (hopefully the root cause of where it breaks is the reason for the rest of the other breaks).
    Attached Files Attached Files
    Last edited by Restrictedz; 08-12-2019 at 09:43 PM.

  12. #12
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow


    There is only a worksheet within this attachment but as the initial explanation is based on 3 worksheets like my demonstration …

  13. #13
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12
    Hi Marc the only reason for that is because I need to reduce the file size to below 1MB. What I intended is if you can please insert the other two tabs from the first workbook into this recently-uploaded workbook and proceed as usual?

    Alternatively, copy the contents of the 'Database' tab from the recently-uploaded workbook, into the same tab in the initial workbook uploaded.

    Hope that makes sense - really appreciate your help in looking into this Marc.
    Last edited by Restrictedz; 08-13-2019 at 07:34 PM.

  14. #14
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow

    After the new content copy I run Demo1 without any issue !

    So wait for another one may reach your dropbox, I hope you shared the file without displaying and restriction …

  15. #15
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Arrow

    I finished to succeed to download from dropbox (first time I had such issue, a mess) and I got any issue running Demo1 !
    But as my test computer has an old Excel version limited to 65 536 rows max
    I will grab later (~ half a day) another computer with a more recent one …

  16. #16
    Forum Guru
    Join Date
    08-15-2004
    Location
    Tokyo, Japan
    MS-Off Ver
    2013 O.365
    Posts
    22,516

    Re: Aggregating data based on a search key, across a very large dataset possible?

    See if this works.
    Please Login or Register  to view this content.

  17. #17
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Not sure what you meant by: '
    Quote Originally Posted by Marc L View Post
    So wait for another one may reach your dropbox, I hope you shared the file without displaying and restriction …
    Quote Originally Posted by Marc L View Post
    I finished to succeed to download from dropbox (first time I had such issue, a mess) and I got any issue running Demo1 !
    But as my test computer has an old Excel version limited to 65 536 rows max
    I will grab later (~ half a day) another computer with a more recent one …
    Thanks for looking into this Marc; I'm interested to see whether we can find what made the code break and how we can resolve it.

  18. #18
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Quote Originally Posted by jindon View Post
    See if this works.
    This doesn't cause any breaks (versus Marc's code), and runs through the entire 330,000 rows, but, the aggregation of the strings is buggy for the 'buyer' data. For instance, in the first row, this is the output in column F:
    7249884, 72498837249884, 51485837249884, 5115537

    However, expected output should be: (Marc's code has no issue with this aspect)
    7249884, 7249883, 5148583, 5115537

    Looks like it's aggregating the search string (from column D) to some of the results.

    Also, there's nothing listed in the 'IN_EXCLUSION' column (column H) - so not exactly sure what the code is doing for records that fall in the exclusion table.

  19. #19
    Forum Guru
    Join Date
    08-15-2004
    Location
    Tokyo, Japan
    MS-Off Ver
    2013 O.365
    Posts
    22,516

    Re: Aggregating data based on a search key, across a very large dataset possible?

    If this doesn't work, just ignore.
    Please Login or Register  to view this content.

  20. #20
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Lightbulb

    Quote Originally Posted by Restrictedz View Post
    interested to see whether we can find what made the code break
    and how we can resolve it.
    With all your data I got the same issue, occuring on row #5413 in the FilterEval formula.
    I tested with small arrays with same elements # (9 & 33 like 1, 2, ) : no error !
    So I suppose it's some formula string size limitation …

    This new demonstration where you can follow its execution within the Excel status bar (at screen bottom left)
    works as text even with numeric ID, just paste it this time to an empty standard module (or to ThisWorkbook) Edit v3 :

    PHP Code: 
    Sub Demo2()
      Const 
    "Database"", "
        
    Dim R&, oDic(4) As ObjectVK$, WT$(), L#, N#, U, X, P%
            
    Sheets(D).UsedRange.Offset(1).Columns("E:H").ClearContents
            Application
    .StatusBar "Initializing …"
            
    For 0 To 4:  Set oDic(R) = CreateObject("Scripting.Dictionary"):  Next
            V 
    = ['Search Table'!A1].CurrentRegion.Columns("A:B").Value2
        
    For 2 To UBound(V)
                
    V(R1)
            If 
    oDic(2).Exists(KThen
                W 
    oDic(2)(K)
                
    ReDim Preserve W(UBound(W) + 1)
                
    W(UBound(W)) = CStr(V(R2))
                
    oDic(2)(K) = W
            
    Else
                
    oDic(2).Add K, Array(CStr(V(R2)))
            
    End If
        
    Next
            V 
    = ['Exclusion Table'!A1].CurrentRegion.Columns(1).Value2
            
    For 2 To UBound(V):  oDic(3)(CStr(V(R1))) = "":  Next
            V 
    Sheets(D).[A1].CurrentRegion.Columns("C:D").Value2
            ReDim T
    (2 To UBound(V), 3)
            
    Application.StatusBar "1%"
            
    UBound(V) / 20
            N 
    L
        
    For 2 To UBound(V)
                
    V(R1)
                If 
    oDic(2).Exists(KThen U oDic(2)(K): T(R0) = Join$(US)
                
    V(R2)
            If 
    oDic(2).Exists(KThen
                   W 
    oDic(2)(K)
                   
    T(R1) = Join$(WS)
                If 
    T(R0) > "" Then
                    
    For Each X In W:  oDic(0)(X) = "":  Next
                    
    For Each X In U
                        
    If oDic(0).Exists(XThen
                           oDic
    (1)(X) = ""
                            
    If oDic(3).Exists(XThen oDic(4)(X) = ""
                        
    End If
                    
    Next
                       oDic
    (0).RemoveAll
                    
    If oDic(1).Count Then
                        T
    (R2) = Join$(oDic(1).KeysS)
                        
    oDic(1).RemoveAll
                     
    If oDic(4).Count Then T(R3) = "YES: '" Join$(oDic(4).KeysS) & "'"oDic(4).RemoveAll
                    End 
    If
                
    End If
            
    End If
                If 
    >= N Then P 5Application.StatusBar "%"L
        Next
            oDic
    (2).RemoveAll:  oDic(3).RemoveAll
            Erase oDic
            Application
    .StatusBar "Writing data …"
            
    DoEvents
            Sheets
    (D).Rows("2:" UBound(T)).Columns("E:H").Value2 T
            Application
    .StatusBar False
    End Sub 
    Do you like it ? So thanks to click on bottom left star icon « Add Reputation » !
    Last edited by Marc L; 08-16-2019 at 06:06 PM. Reason: optimization …

  21. #21
    Registered User
    Join Date
    10-09-2014
    Location
    Sydney, Australia
    MS-Off Ver
    2016
    Posts
    12

    Re: Aggregating data based on a search key, across a very large dataset possible?

    Quote Originally Posted by jindon View Post
    If this doesn't work, just ignore.
    Thanks for your attempt to help with this jindon; this worked really well as well but has errors in the final column (not outputting correct results; false positives etc); but have also given you reputation for looking into this and giving it ago as I would say this was a challenging task.

    Quote Originally Posted by Marc L View Post
    With all your data I got the same issue, occuring on row #5413 in the FilterEval formula.
    I tested with small arrays with same elements # (9 & 33 like 1, 2, ) : no error !
    So I suppose it's some formula string size limitation …

    This new demonstration where you can follow its execution within the Excel status bar (at screen bottom left)
    works as text even with numeric ID, just paste it this time to an empty standard module (or to ThisWorkbook) Edit v2 :
    Thanks so much Marc, this is great. I also suspect it must have something to do with the 255 string limit that caused the break.
    Consider this closed/solved. Marc, you've saved me with countless months of work! I estimated this task would have taken me another 6-7 months every weekend... bless y

  22. #22
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Thumbs up

    Thanks for the rep' !

    With your big data it's a benefit the formula way fails as the dictionaries way is faster …

  23. #23
    Forum Expert
    Join Date
    11-24-2013
    Location
    Paris, France
    MS-Off Ver
    Excel 2003 / 2010
    Posts
    9,831

    Lightbulb


    New v3 version in post #20 : just a tweak to reduce a little the execution time …

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. Transposing grouped data (large dataset)
    By Bengan1 in forum Excel General
    Replies: 2
    Last Post: 02-20-2018, 08:53 AM
  2. Replies: 3
    Last Post: 10-20-2017, 12:40 AM
  3. Best method of extracting data from large dataset
    By Manchesterpolstudent in forum Excel General
    Replies: 2
    Last Post: 08-07-2017, 03:01 PM
  4. how to perform t test on a large dataset based on dynamic change in values
    By amalchacko in forum Excel Programming / VBA / Macros
    Replies: 0
    Last Post: 07-06-2015, 06:05 PM
  5. Replies: 3
    Last Post: 01-23-2014, 10:13 AM
  6. Create New Worksheets based on one column in a Large Dataset
    By shaolinsnake in forum Excel Programming / VBA / Macros
    Replies: 1
    Last Post: 07-20-2012, 06:29 PM
  7. Filtering on large dataset based on rule
    By not4google in forum Excel General
    Replies: 2
    Last Post: 11-11-2009, 06:15 AM

Tags for this Thread

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1