Originally Posted by
JayUSA
DonkeyOte, I'm very intrigued by your cascade of logical rules, but I struggled to fully understand and answer your questions in the context of the challenge described in this post.
First, note given the fact that your current scenario:
-- search for 100 terms within one string and if any found return true
is in reality the complete opposite of our initial interpretation of your scenario:
-- search 100 strings for one term and count matches
then it follows that my earlier "cascade of logical rules" aren't really applicable in this instance given they relate to the 2nd scenario, however, I suspect it's worthwhile illustrating what I meant by means of a much simplified example of the first scenario.
Consider
a) following values in A1:A5
b) following search terms in B1:B2
I will attempt to illustrate the following "cascade of logical rules" based on the above data / terms:
So
Using "and", this is a very real possibility given it appears embedded within words / terms within the string being search - eg operand, Streisand
Using "zx", no this would appear to be highly unlikely.
We use the above test to determine as to whether or not a basic COUNTIF with wildcard characters will be sufficient...
we can demo the "viability" of the COUNTIF by adding to to Column C in our test, eg:
using our example data C1 will return 4, C2 will return 2.
C1 returns 4 given "and" appears twice as an outright term (A1 & A4) and twice embedded : "operand" (A2) & "Streisand" (A5).
So we move to the next question regards counting embedded matches (eg operand / Streisand)
So in short, if we are not concerned with finding "and" as a word/term in it's own right (ie not embedded) then we can obviously use the earlier COUNTIF wildcard approach.
If on the other hand we want to treat "operand" & "Streisand" as "false positives" and thus discount them from our calcs we realistically* have to dispense with COUNTIF as a viable approach.
Why ? Well, we are going to need to manipulate the input values (A1:A5) as part of the calculation process and we can not do that with COUNTIF.
We'll assume we do wish to discount - ie search for "and" as a self contained word / term and thus we come to our next question:
So by "common delimiter" we mean a consistent character that separates each term within our strings (A1:A5), traditionally a space or comma.
If we do have a common delimiter it means we can be more selective in our search term by looking for the term encased within the delimiter, eg:
The above now correctly returns 2 given " and " is found only twice in our adjusted A1:A5 values.
Re: Adjusted... you will note that each string being searched has been appended such that the "common delimiter" has been added to both the beginning and end of the string.
Why? This is to account for the possibility that the search term may appear either at the beginning or end of any given string and thus would not otherwise be encased within the common delimiter as a term mid string would be.
To illustrate:
this now returns 1 ... the "and" in A4 is disregarded given it appears at the end of the string and thus does not have a trailing space - " and " <> " and"
As outlined in the logic - the use of SUMPRODUCT is of course dependent upon the use of a common delimiter.
The above point is hopefully self explanatory given the above examples ?
It follows that if the delimiter between terms is inconsistent defining the appropriate search term within the SUMPRODUCT itself becomes nigh on impossible given the leading & trailing chars surrounding the term itself are not always going to calculate correctly.
To illustrate - if we were to add a full stop to A4 such it reads:
you will note that D1 now returns 1 rather than 2 as was previously the case... this is because " and " can no longer be found in A4.
This is where we enter the potential "world of pain" - if we had only 2 delimiters we could of course use an embedded SUBSTITUTE to "normalise" - eg:
which would correctly return 2 ... we remove the period delimiter such that we have only one delimiter in each string - given the consistent delimiter it follows we can apply similar logic to before
(note in the above we have to make assumption that period delimiter will not always appear at the end of the string - hence we persist with adding delimiter to end of each string)
Obviously though in some cases - particularly when each string is a sentence - the reality is such that a multitude of delimiters will exist and the use of SUBSTITUTE essentially becomes cumbersome and will slow the calculation ... there is also the restriction in terms of how many may be embedded.
At this point one would think about reverting to UDF ... potentially utilising Regular Expressions to look for "patterns" of characters....
I hope the above has helped clear the muddied waters a little.
I will post regards your actual question on a subsequent post (if it's not been resolved in the time it's taken to write this!)
footnote re: COUNTIF:
*I say "realistically" because there are a few nuances that may still permit use of COUNTIF - ie common delimiter and search term never being either first/last term in any given string.
Bookmarks