Hi All,
I'm analyzing a fairly large worksheet with two columns (A and B) and over 8,000 rows of data.
ColA holds the data strings. They vary in length and can reach 32,000 characters per cell.
ColB holds specific words and phrases that pertain to the adjacent cell in ColA. So B1 has words and phrases from A1, B30 has words and phrases from A30, and so forth.
The ColB words and phrases also vary in length, from 6 to 400 characters.
My task is to get a 100% accurate count of the occurrences of every ColB cell in its neighbour in ColA.
To do that, I used a formula directly in ColC that yields the occurrence count of each ColB string in ColA.
The job would have been finished if the data were in English. Instead, it is in Arabic, with full diacritical marks in both ColA and ColB.
I tried normalizing the text by removing all the diacritical marks, but this made the problem worse, because the text loses its meaning without them. The formula returned completely wrong counts for words that look the same once normalized but are entirely different words, with different meanings, when their diacritical marks are present.
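For illustration only, here is a minimal Python sketch of the effect described above. It is not the Excel formula, and the three words are hypothetical stand-ins chosen for this sketch rather than data from the file. The helper name strip_diacritics is also an assumption. It shows that three differently vowelled words collapse into one bare spelling once the marks are removed, so the count is inflated.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Hypothetical sample words, written as escapes so the code points are explicit.
ILM = "\u0639\u0650\u0644\u0652\u0645"          # 'ilm, "knowledge"
ALAM = "\u0639\u064E\u0644\u064E\u0645"         # 'alam, "flag"
ALIMA = "\u0639\u064E\u0644\u0650\u0645\u064E"  # 'alima, "he knew"

text = f"{ILM} {ALAM} {ILM} {ALIMA}"

# Exact search keeps the marks, so only the two real occurrences match.
exact_count = text.count(ILM)

# After stripping, all four words share the same three bare letters.
stripped_count = strip_diacritics(text).count(strip_diacritics(ILM))

print(exact_count, stripped_count)  # prints: 2 4
```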
To solve this problem, I need a script that uses a dictionary to record all the ColA and ColB data "as they are", diacritical marks included, detects the occurrences of every ColB cell in its adjacent ColA cell, and writes the count to the corresponding cell in ColC.
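The kind of count described here could be sketched roughly as follows, in Python rather than as an Excel script. The function name count_exact and the sample rows are hypothetical, and the sketch assumes each ColB cell is searched for as one whole string. It keeps every diacritical mark and only applies Unicode NFC normalization to both cells, so that stacked marks typed in a different order (for example shadda then fatha versus fatha then shadda) still compare as equal.

```python
import unicodedata

def count_exact(cell_a: str, cell_b: str) -> int:
    """Count non-overlapping occurrences of cell_b in cell_a, marks included."""
    # NFC removes no diacritics; it only puts combining marks into one
    # canonical order so that equivalent spellings match.
    haystack = unicodedata.normalize("NFC", cell_a)
    needle = unicodedata.normalize("NFC", cell_b)
    if not needle:
        return 0  # str.count("") would otherwise return len(haystack) + 1
    return haystack.count(needle)

# Hypothetical stand-ins for (ColA, ColB) pairs; not real data.
ILM = "\u0639\u0650\u0644\u0652\u0645"   # 'ilm, "knowledge"
ALAM = "\u0639\u064E\u0644\u064E\u0645"  # 'alam, "flag"
rows = [
    (f"{ILM} {ALAM} {ILM}", ILM),
    (f"{ALAM} {ILM}", ALAM),
]

# One count per row, in the role of ColC.
col_c = [count_exact(a, b) for a, b in rows]
print(col_c)  # prints: [2, 1]
```

Note that this is a plain substring count, so a ColB phrase that appears inside a longer word in ColA is counted as well. Whether that is wanted depends on the data.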
Unfortunately, I cannot share a sample of the file, and it would be really difficult to make up an example, which is why I have explained my problem in detail.
Could I get some help with this problem?
As always, many thanks in advance.
T.