
Filter large amounts of data

  1. #1
    Registered User
    Join Date
    09-09-2019
    Location
    Amsterdam, the Netherlands
    MS-Off Ver
    Office 365 MSO (16.0.11929.20234) 32-bit
    Posts
    2

    Filter large amounts of data

    Hello everybody,

    For a project for the Amsterdam Center for World Food Studies, I have a vast dataset: approximately 2.5 million cells.
    However, the data contains mistakes: sometimes the researchers filled in a description of an item instead of its code.
    What I want to do is the following: replace words that I come across frequently with a code. For example, I found the word "candy" 50 times in the data, and I want to replace it with "342", the code that I assigned to candy.
    How should I go about this? At first I wanted to use a load of IF-formulas, but an IF-formula requires a value for when the condition is false, and I want those cells to stay the same, so that does not work.
    Any suggestions on how to do this? And if Excel is not suitable for the task, any suggestions for different software?

    I hope my explanation is not too unclear, if so please tell me and I will try to clarify.

    Kind regards and thanks in advance,
    Wibraldus
    Amsterdam

  2. #2
    Forum Expert CK76's Avatar
    Join Date
    06-16-2015
    Location
    ONT, Canada
    MS-Off Ver
    MS365 Apps for enterprise
    Posts
    5,887

    Re: Filter large amounts of data

    Try using CTRL + H and replace cells holding "*candy*" with 342?

    Other alternatives are... load the data into Power Query (PQ), then use lookup tables to replace values or create a translation column.

    Exact method will depend on your data structure.

    Excel can certainly perform the required operation from the sound of it (especially using PowerQuery & PowerPivot using DAX measures).
    But depending on data structure, you may want to use additional transformation steps prior to your operation.

    If you need more specific help, I'd recommend uploading a sample workbook with about 50~100 rows of data (enough to demonstrate your issue), along with a few lines of the expected outcome.
    You can desensitize the data, but keep the structure and data types the same as your original.

    To upload, use the "Go Advanced" button and follow the "Manage Attachments" hyperlink. It will launch a new tab/window for managing uploads.
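    Outside Excel, the same lookup-table idea can be sketched in a few lines of Python (the table contents and the column name here are hypothetical, not from the OP's data):

```python
# Hypothetical lookup table: description -> assigned code.
CODES = {"candy": "342", "sweets": "342", "bread": "101"}

def replace_with_codes(rows, column):
    """Replace known descriptions in `column` with their codes,
    leaving every other value (including valid codes) untouched."""
    for row in rows:
        value = row[column].strip().lower()
        if value in CODES:
            row[column] = CODES[value]
    return rows

rows = [{"item": "Candy"}, {"item": "7"}, {"item": "sweets"}]
print(replace_with_codes(rows, "item"))
# -> [{'item': '342'}, {'item': '7'}, {'item': '342'}]
```

    The same mapping could just as well be loaded from a two-column lookup worksheet or CSV instead of being hard-coded, which is essentially what a PQ merge against a translation table does.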
    "Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something."
    ― Robert A. Heinlein

  3. #3
    Forum Expert ben_hensel's Avatar
    Join Date
    01-26-2012
    Location
    Northeast USA
    MS-Off Ver
    Office 365
    Posts
    2,043

    Re: Filter large amounts of data

    I don't even think you can use Excel -- the upper limit on the number of rows is a million-and-change, so your dataset is apparently just straight-up too big for Excel.

    What format is the file in? It would have to be some kind of database, right? Like MySQL or something.

    I guess the "right thing" to do depends on what you want your results to look like.

    If you just want a non-reversible, once-and-done change of "candy" to "352" for a specific example of the general case, then you probably want something like a regex pass that overwrites the wrong codes: a script that runs through the database and repeatedly replaces text strings as a user-mandated operation whenever you import/append data from wherever into your set.

    This dataset is too big for me to see a good way to do a dynamic check where you can just append more rows, or update cells, and it will dynamically re-calculate an "output" for each row based on some input. (I'm not saying it's impossible, I'm saying you are far enough into the territory of Database Land that I don't know the geography).
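    For what it's worth, that once-and-done regex pass might look something like this in Python (the pattern and the replacement code are placeholders, not the OP's actual mapping):

```python
import re

# Hypothetical non-reversible pass: overwrite one wrong description with its code.
PATTERN = re.compile(r"\bcandy\b", re.IGNORECASE)

def overwrite(text, code="352"):
    """Replace every whole-word match of PATTERN with `code`."""
    return PATTERN.sub(code, text)

print(overwrite("candy,7,Candy,apple"))
# -> 352,7,352,apple
```

    The word boundaries (`\b`) keep it from mangling words that merely contain the term, e.g. "candle" passes through unchanged.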
    Click the [* Add Reputation] Button to thank people for help!
    Post Attachment: Click [Go Advanced] under the reply box; then click [Manage Attachments] to launch the attachments dialogue.
    Mark Thread Solved: Go to the top of the first post; Select [Thread Tools] dropdown; Pick the [Mark thread as Solved] option.

  4. #4
    Forum Expert CK76's Avatar
    Join Date
    06-16-2015
    Location
    ONT, Canada
    MS-Off Ver
    MS365 Apps for enterprise
    Posts
    5,887

    Re: Filter large amounts of data

    "the upper limit on the number of rows is a million-and-change"
    The OP stated 2.5 million cells; I assumed that to be spread out between rows and columns.

    Also, with Power Query and Power Pivot (which come standard with an Office 365 subscription), and depending on machine spec, you can handle 12+ million rows without issue.

  5. #5
    Registered User
    Join Date
    09-09-2019
    Location
    Amsterdam, the Netherlands
    MS-Off Ver
    Office 365 MSO (16.0.11929.20234) 32-bit
    Posts
    2

    Re: Filter large amounts of data

    Hello,

    Thanks for the reactions! I am new to this forum, and am pleasantly surprised by how active it is! As you can probably tell, I am not very experienced with Excel, so I appreciate the help.

    CK76, the Ctrl+H option may actually do (part of) the job, as there is a pattern of common mistakes in the data: approximately 85 terms, in different spellings, that should instead be expressed as codes make up the bulk of the mistakes. I will try that today and keep you posted on the progress! I agree with you that Excel is probably able to do the job in some way. If Ctrl+H turns out to be too much work because it has to be done manually, I will come back and try the other options you suggested.

    ben_hensel, I think the size of the dataset will not be a problem, because (1) as CK76 said, the 2.5 million cells are spread out between rows and columns, and (2) the data can easily be split into 7 parts, and into smaller parts if need be. Ideally, I would indeed have a regex that finds text and replaces it with values. For example: I tell Excel to find cells that contain "candy", "sweets", "confectionary" or "tidbits" (plus some extra spellings that I came across) in a certain column of the document, and replace them with 352. Like I said, there are approximately 85 of these common mistakes.
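    That multi-term replacement could be sketched in Python roughly as follows, assuming a hypothetical synonym table (only two of the ~85 groups shown, with made-up codes):

```python
import re

# Hypothetical mapping: each code covers several spellings/synonyms.
SYNONYMS = {
    "352": ["candy", "sweets", "confectionary", "tidbits"],
    "108": ["bread", "loaf"],
}

# One case-insensitive pattern per code, matching whole cell values only,
# so cells that already hold a valid code are left alone.
PATTERNS = {
    code: re.compile(r"^\s*(" + "|".join(map(re.escape, terms)) + r")\s*$",
                     re.IGNORECASE)
    for code, terms in SYNONYMS.items()
}

def to_code(cell):
    """Return the code if the cell matches a known description,
    otherwise return the cell unchanged."""
    for code, pattern in PATTERNS.items():
        if pattern.match(cell):
            return code
    return cell

print([to_code(c) for c in ["Sweets", "352", "tidbits", "fish"]])
# -> ['352', '352', '352', 'fish']
```

    Applied over just the one column in question, this would reduce the "other" entries in a single automated pass rather than 85 manual Ctrl+H rounds.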

    As of right now, 25 per cent of the data is an "other" (a term expressed in text that should be a code). If I make something that filters this down to 10 per cent, that would be a massive improvement to the dataset and the project in general.
    It's all about finding a way to do the job that automates it at least to some degree. I will also consult one of my colleagues today; maybe he can help.
    I'll keep you posted on my progress.

    Wibraldus

