Very special occurrence count

**terryhenderson** · 05-04-2019, 10:03 AM

Hi All,
I'm in the process of analyzing a pretty large data sheet with two columns (A & B) and over 8000 rows of data.

ColA carries all the data strings in its cells. The strings vary in length and can reach 32000 chars per cell.

ColB has very particular words and phrases that pertain specifically to the adjacent cells. So B1 has words and phrases from A1, B30 has words and phrases from A30, and so forth.

ColB words and phrases also vary in length, from 6 to 400 chars.

My task is to get a 100% accurate count of the occurrences of every cell in ColB within its neighbour in ColA.

To do that I directly used a formula in ColC (not shown here), yielding the occurrence count of ColB strings in ColA.

The job would've been finished if the data were in English. It is in Arabic, however, with full diacritical marks in both ColA and ColB.

I tried normalizing the text by removing all the diacritical marks, but this actually made the problem worse, because the text loses its meaning without them. The formula yielded totally wrong counts for words that look the same once normalized but are entirely different words, with different meanings, when their diacritical marks are present.

To solve this problem, I need to run a smart script that uses a dictionary to record all the ColA and ColB data "as they are", with their diacritical marks, and detects the occurrences of every cell in ColB in its adjacent cell of ColA, yielding the count in ColC.

Unfortunately I cannot send a data sample of that file, and it would be really difficult to make up an example, which is why I had to explain my problem in detail.

Can I get some precious assistance with this problem?

Always, many thanks in advance.

T.

**shg** · 05-04-2019, 12:32 PM

I have no experience with Arabic, but am surprised that Substitute doesn't work rigorously with Unicode strings.

**terryhenderson** · 05-04-2019, 04:14 PM

Substitute worked fine with normalized Arabic text, and the formula did count the occurrence of the text from ColB in ColA, but it wasn't right at all. I tried to use UPPER with original text with diacritical marks, but of course it didn't make any difference.
Thanks shg.

**shg** · 05-04-2019, 04:30 PM

"Wasn't right at all" -- what does that mean?

Were the results not always integer, smaller than expected, larger than expected?

Are columns A and B consistent as to composition? See https://en.wikipedia.org/wiki/Unicod...ite_characters

Edit: This in particular:

For example, é can be represented in Unicode as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Thus, in many cases, users have multiple ways of encoding the same character.

Those would look the same, but would differ as strings.
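To make that concrete, here is a minimal Python sketch (not Excel, and not anything from the workbook in question) showing that the two encodings of é compare as different strings until both are brought to the same normalization form. It also shows the same effect for two Arabic marks typed in a different order. Using NFC here is an assumption about what would suit this data; note that NFC keeps the diacritical marks, it only makes their encoding consistent.

```python
import unicodedata

# Two encodings of the same visible character "é".
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# A plain comparison works on code points, so these differ.
equal_raw = decomposed == precomposed

# After NFC normalization both sides use the precomposed form.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

# Arabic BEH with SHADDA and FATHA, typed in two different orders.
shadda_first = "\u0628\u0651\u064E"
fatha_first = "\u0628\u064E\u0651"
marks_equal_raw = shadda_first == fatha_first
marks_equal_nfc = (
    unicodedata.normalize("NFC", shadda_first)
    == unicodedata.normalize("NFC", fatha_first)
)

print(equal_raw, equal_nfc, marks_equal_raw, marks_equal_nfc)
```

If columns A and B came from different sources, a difference of this kind would make an exact substring search undercount even though the cells look identical on screen.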

**shg** · 05-04-2019, 05:06 PM

Also, see https://en.wikipedia.org/wiki/Unicode_equivalence

I think you have a tiger by the tail.

**terryhenderson** · 05-04-2019, 05:09 PM

"Wasn't right at all" -- what does that mean?

It meant that the occurrence count in ColA for the text in ColB was not correct. Once the text is normalized it is reduced to its base letters, and the difference in meaning between two words with the same letters but different diacritical marks is eliminated. So, for example, instead of counting 6 for the occurrences of a word with its diacritical marks, it would count 10 without them, because stripping the marks removes the differences between words that share the same letters.
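The inflation described here can be reproduced with a small Python sketch. The sample words are made up for illustration (two vocalizations of the letters ع ل م), and `strip_marks` is a hypothetical helper that drops every combining mark; neither comes from the actual sheet.

```python
import unicodedata

def strip_marks(text):
    """Remove all nonspacing combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Same three base letters (AIN, LAM, MEEM) with different diacritics.
alima = "\u0639\u064E\u0644\u0650\u0645\u064E"  # AIN+FATHA, LAM+KASRA, MEEM+FATHA
ilm = "\u0639\u0650\u0644\u0652\u0645"          # AIN+KASRA, LAM+SUKUN, MEEM

text = " ".join([alima, ilm, alima])

# Exact search keeps the diacritics significant: only the two real matches.
exact_count = text.count(alima)

# After stripping, all three words collapse to the same bare letters.
stripped_count = strip_marks(text).count(strip_marks(alima))

print(exact_count, stripped_count)
```

Here the exact search finds 2 occurrences while the stripped search finds 3, which is the same kind of overcount as the 6 versus 10 above.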

Yes, ColA and ColB are consistent.

I hope I made it much clearer.

**shg** · 05-04-2019, 05:25 PM

How about comparing one cell in col A to the same (apparent) string in col B? What happens?

**terryhenderson** · 05-04-2019, 05:26 PM

I think we'll be going the wrong way if we look at the Arabic text as text rather than as bits. In other words, we should deal with it as if it were English, but using a scripting dictionary.

**shg** · 05-04-2019, 05:39 PM

I don't know what that means.

Substitute is a mindless comparison of characters. If you have two strings that should be the same, but Substitute tells you they're not, it's easy enough to compare their hex strings and see how they differ.
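As a rough illustration of that hex comparison, here is a Python sketch rather than a worksheet formula. The helper names `code_points` and `first_difference` are made up for this example; the idea is only to list each string's code points and report where two strings first diverge.

```python
def code_points(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def first_difference(left, right):
    """Return the index where two strings first differ, or None if equal."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    if len(left) != len(right):
        # One string is a prefix of the other.
        return min(len(left), len(right))
    return None

# Two strings that render the same but are encoded differently.
from_col_a = "e\u0301"
from_col_b = "\u00e9"

print(code_points(from_col_a))
print(code_points(from_col_b))
print(first_difference(from_col_a, from_col_b))
```

Pasting one ColA cell and its ColB counterpart into something like this would show whether the two columns really use the same sequence of code points.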
