Hi,
I have a large text dataset and I'm trying to find the co-occurrence of words. The two possible orderings of each word pair are stored in two different columns. For instance, column A has the co-word "apple_orange" and column B of the same row has its flipped co-word "orange_apple"; the two are equivalent. This also means that every value in column A appears somewhere in column B and vice versa, but in a different row. For instance, consider the following:
Col A Col B
apple_fruit fruit_apple
apple_mango mango_apple
apple_orange orange_apple
fruit_apple apple_fruit
juice_mango mango_juice
juice_orange orange_juice
mango_apple apple_mango
mango_juice juice_mango
orange_apple apple_orange
orange_juice juice_orange
I need to accurately identify and remove all the duplicate rows, where a row counts as a duplicate if its column A value appears in column B of another row and vice versa. This means that half of the rows in the matrix have to be removed, but the challenge is identifying which ones. For instance, in the above example, rows 1 and 4 are duplicates of each other, as are rows 2 and 7, and so forth, so one row from each such pair needs to be removed.
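To make the requirement concrete, here is a minimal sketch in Python. The post does not name a tool, so the choice of Python is an assumption, and the names `canonical_key` and `drop_mirrored_rows` are hypothetical. It assumes every co-word is made of words joined by a single "_" and that keeping the first row of each mirrored pair is acceptable. The idea is to sort the words inside each co-word so that "apple_fruit" and "fruit_apple" map to the same key, then keep only the first row seen for each key.

```python
def canonical_key(co_word, sep="_"):
    """Order-independent key: 'fruit_apple' and 'apple_fruit' both give 'apple_fruit'."""
    return sep.join(sorted(co_word.split(sep)))


def drop_mirrored_rows(rows, sep="_"):
    """Keep the first row of each mirrored pair, preserving the original order."""
    seen = set()
    kept = []
    for col_a, col_b in rows:
        key = canonical_key(col_a, sep)
        if key not in seen:
            seen.add(key)
            kept.append((col_a, col_b))
    return kept


# The ten example rows from the post, as (Col A, Col B) pairs.
rows = [
    ("apple_fruit", "fruit_apple"),
    ("apple_mango", "mango_apple"),
    ("apple_orange", "orange_apple"),
    ("fruit_apple", "apple_fruit"),
    ("juice_mango", "mango_juice"),
    ("juice_orange", "orange_juice"),
    ("mango_apple", "apple_mango"),
    ("mango_juice", "juice_mango"),
    ("orange_apple", "apple_orange"),
    ("orange_juice", "juice_orange"),
]

unique_rows = drop_mirrored_rows(rows)
for col_a, col_b in unique_rows:
    print(col_a, col_b)
```

On the example data this keeps rows 1, 2, 3, 5 and 6 and drops rows 4, 7, 8, 9 and 10, which is half of the rows. The same sorted-key idea could be applied in a spreadsheet with a helper column, but that is not shown here.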
I have tried different formulae and techniques but failed. Any help would be highly appreciated.
Best regards,
guest2013