plagiarism detection

**abousetta** · 08-04-2010, 02:31 AM

Hi everyone,

Let me begin by thanking everyone who has contributed to this wonderful community. I truly believe in the power of shared experiences and learning. A special thank you to shg, who has helped me out a lot and shown me avenues that I never thought were even possible with an everyday program such as Excel. Keep up the good work everyone.

OK, now my question... Plagiarism is a problem in many fields, but its detection is not always simple and straightforward. In a recent thread, shg demonstrated how to compute the Levenshtein distance between two strings, s and t, and so determine their similarity. This works very well, but with large strings it carries a major computing burden. Does anyone know of a way of checking two strings for similarity based on the similarity of words? Maybe this can be an extension of shg's code to compare words instead of characters... Maybe this can be a completely different approach... I am very interested in understanding more ways of comparing strings in an efficient manner, but have found very little on the internet on the subject.
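One way to read "compare words instead of characters" is to run the same Levenshtein recurrence over word tokens. The following is a minimal sketch in Python rather than VBA (shg's original code is not reproduced in this thread); the function names `word_levenshtein` and `word_similarity` are made up for illustration, and splitting on whitespace is an assumption.

```python
def word_levenshtein(s: str, t: str) -> int:
    """Edit distance counted in whole words rather than characters."""
    a = s.lower().split()
    b = t.lower().split()
    # previous[j] = distance between the words of a seen so far and the first j words of b
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1,          # delete a word
                               current[j - 1] + 1,       # insert a word
                               previous[j - 1] + cost))  # substitute a word
        previous = current
    return previous[-1]


def word_similarity(s: str, t: str) -> float:
    """Scale the word distance to a 0..1 similarity (1 means identical)."""
    longest = max(len(s.split()), len(t.split()))
    return 1.0 if longest == 0 else 1 - word_levenshtein(s, t) / longest
```

Because the table is indexed by words, its size grows with the product of the word counts instead of the character counts, which is where the saving on long strings would come from.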

So the simple question is... how can we detect possible plagiarism between two groups of words, code, etc.?

All thoughts, ideas and suggestions are welcome.

Thanks.

abousetta

**MarvinP** · 08-04-2010, 02:38 AM

Search the net for Fuzzy Logic. There are many articles on the topic.

**DonkeyOte** · 08-04-2010, 02:43 AM

a related thread:

http://www.excelforum.com/excel-gene...ercentage.html

below as a basic UDF put together re: the above

[code not included in this copy of the thread]

How applicable the above transpires to be is all rather dependent on the specific rules you wish to apply.

**abousetta** · 08-04-2010, 03:25 AM

Hi MarvinP and DonkeyOte,

I will look into Fuzzy Logic. This name came across when I was doing my search but I guess I didn't understand its full application in this case. I will have a closer look.

DonkeyOte, thank you so much for sending me the link to the original thread. This is very similar to what I am looking for (I believe). I have attempted to modify your code but am having difficulty. Could you possibly have a look?

Also one possible modification I could really use help with is adding a way for the macro to conditionally format the cells above a user-defined cutoff. This cutoff I will use an input box for the user to indicate (e.g. 50%).

Thank you...

abousetta

**DonkeyOte** · 08-04-2010, 03:30 AM

Originally Posted by abousetta

I have attempted to modify your code but am having difficulty. Could you possibly have a look?

Could you first outline

a) what you're comparing

and

b) some expected / desired results based on sample data

Originally Posted by abousetta

Also one possible modification I could really use help with is adding a way for the macro to conditionally format the cells above a user-defined cutoff. This cutoff I will use an input box for the user to indicate (e.g. 50%).

Conditional Formatting is super-volatile - I would be wary of using this UDF in a Volatile manner given use of Dictionary Object etc...
(a UDF can't format a cell directly of course)

**abousetta** · 08-04-2010, 03:37 AM

Hi DonkeyOte,

I am comparing the cells in each column separately (B2 vs. B3, then B2 vs. B4, etc..). What I am looking for is similarity between the words in the two cells. How similar are they? Capitalization is not a problem.

As for conditional formatting, I just was looking for a way to easily identify the ones that are similar. This can be done post hoc manually but with a large number of similar cells, it would prove difficult. Any thoughts?

Thanks again.

abousetta

**DonkeyOte** · 08-04-2010, 03:45 AM

So how do you wish to tabulate your results ?

For ex.

[code not included in this copy of the thread]

will tell you the closest % match of current string against remainder

[code not included in this copy of the thread]

will return the string associated with closest match (on a unique word basis)

I confess I've not gone through the results to test.

edit: you might need to consider discounting numerics from the string comparison (eg year)
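For readers without access to the formulas above, here is a rough sketch of the idea in Python (not the VBA UDF from this post, which is not shown). It scores one cell against every other cell on a unique-word basis, returns both the best score and the position of the best match, and can optionally drop purely numeric tokens such as years. All names are hypothetical, and dividing by the unique-word count of the current cell is an assumption.

```python
import re


def unique_words(text, ignore_numbers=True):
    """Lower-cased set of words; purely numeric tokens are optionally dropped."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    if ignore_numbers:
        words = {w for w in words if not w.isdigit()}
    return words


def percent_match(a, b, ignore_numbers=True):
    """Share of a's unique words that also occur in b (0.0 to 1.0)."""
    wa = unique_words(a, ignore_numbers)
    wb = unique_words(b, ignore_numbers)
    return len(wa & wb) / len(wa) if wa else 0.0


def closest_match(index, cells, ignore_numbers=True):
    """Return (best_score, best_index) for cells[index] against all other cells."""
    best_score, best_index = 0.0, None
    for j, other in enumerate(cells):
        if j == index:
            continue  # never compare a cell with itself
        score = percent_match(cells[index], other, ignore_numbers)
        if score > best_score:
            best_score, best_index = score, j
    return best_score, best_index
```

With `ignore_numbers=True`, two cells that differ only in the year would score as a full match, which may or may not be what is wanted.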

**abousetta** · 08-05-2010, 04:22 AM

Hi DonkeyOte,

Thanks for this code. I thought I replied yesterday but I guess my message got lost in cyberspace. Anyways... Now I have been thinking about how best to incorporate your macro into my work and I have made a few alterations but can't complete the task without your help.

My attached file contains a sample of what I am comparing. Column A - E contain the original data that is being compared. The rest of the columns contain the formulas that you provided me with. In the two examples, Row 2 & 3 and row 4 & 5, you can see that the percentage comparison is 100% because the data in column A is the same for these two rows. But when we look at the data for the corresponding rows in the rest of the columns, they are different. What I hope to be able to do is create a 'best overall' percentage using one or all of the columns. The exact columns should be left for the user to decide. So if the user only wanted to compare column A they could and if they wanted to compare all the columns and get an average they could.

I hope I have explained myself clearly and please let me know if I need to further elaborate.

BW

abousetta

**DonkeyOte** · 08-05-2010, 06:48 AM

You've lost me a little I'm afraid...

I understand what you're trying to do (to an extent) I'm just not sure I understand the mechanics you wish to apply to achieve that aim.

If you wish to be able to compare any combination of columns to determine a closest overall match then I think you must first decide on some sort of weighting.

However, to me the most likely source of plagiarism will be in the Abstract (or poss. Keywords?).
Neither Citation nor Title would seem to be a good basis for comparison - or if used they should have minimal weight - at least that would be my expectation.
Whether or not two people publish something with the same title (or are the same author) should have no real bearing on determining whether or not something is plagiarised - that (sh)would be based on the content alone, no ?

On that basis I would suggest you base on the Abstract (or possibly the keywords ?)

The next thing would be to determine what it is you're looking to do thereafter ?

If you wish to return all info. of the closest match (ie A:E) then it might be an idea to revise the UDF to return the row number of the closest match in a single cell.
To retrieve A:E in additional columns simply use an INDEX with the UDF result rather than repeating what is an expensive UDF calculation - ie establish the closest match once only per record.

Another thing to bear in mind in this scenario is I suspect the date... if something is published before something else I suspect you should be disregarding that record in your calculation altogether.......

edit: and presumably same author also

**snb** · 08-05-2010, 09:01 AM

I tried this one:

[code not included in this copy of the thread]

**abousetta** · 08-06-2010, 11:32 PM

Hi DonkeyOte,

You are correct that weighting is important, and I am still testing which weighting works best to get the most precise results (sensitivity and specificity). At the same time, versatility is important to me because my situation changes from run to run. Some runs compare citations coming from several different sources and so may contain the same citation more than once, which isn't plagiarism but just a duplicate listing. Unfortunately, to complicate things, different databases don't index in exactly the same way (or sometimes introduce minor errors during import or export), and therefore exact matches don't work.

I will scratch my head a couple more times and see if the lightbulb on top of my head lights up with any new ideas. Also, I will try to modify the code to use INDEX as you suggested, because it's only logical to do that instead of rerunning the code several times.

I hope you don't mind, but I will probably be back again for advice soon.

Thanks for everything.

abousetta

**abousetta** · 08-06-2010, 11:32 PM

Hi snb,

I will give this a try and let you know how it goes.

Thanks again.

abousetta

**abousetta** · 08-07-2010, 12:57 AM

Hi DonkeyOte,

I have created a userform that I think will (I hope) simplify what I hope to achieve. The user can choose among:

Criteria for comparison:
* Author, Journal, Year
* Title
* Abstract
* Key Words

AND

Closeness of Matches:
* Identical Match
* Best Match
* Percentage Match

If the user chooses more than one checkbox from the first batch (e.g. Title and Abstract), then I would like the macro to count the words shared between the two strings for each field separately and then add the counts together (I have probably lost you here, so let me explain):

String 1:
Title (3 words): The moon rises.
Abstract (5 words): The moon rises at night

String 2:
Title (3 words): The sun rises.
Abstract (6 words): The sun rises in the morning

The current function calculates the following percentage:

Title: 2/3 = 66%
Abstract: 2/6 = 33%

What I am hoping to do is:

Title + Abstract: 4/9 = 44%
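The arithmetic above can be checked with a short sketch. This is Python rather than the VBA UDF, the helper names are invented for illustration, and it assumes (as the 2/6 figure suggests) that matches are counted once per unique shared word while the divisor is the total word count of String 2.

```python
import re


def tokens(text):
    """Lower-cased words with punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_counts(first, second):
    """(unique words shared by both strings, total words in `second`)."""
    shared = set(tokens(first)) & set(tokens(second))
    return len(shared), len(tokens(second))


def pooled_match(pairs):
    """Add the matched and total word counts across fields, then divide once."""
    matched = total = 0
    for first, second in pairs:
        m, n = match_counts(first, second)
        matched += m
        total += n
    return matched / total if total else 0.0


pairs = [
    ("The moon rises.", "The sun rises."),                           # Title
    ("The moon rises at night", "The sun rises in the morning"),     # Abstract
]
print(pooled_match(pairs))  # 4 matched words out of 9
```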

I hope you can help me develop this further. Please let me know if it's getting too complex.

Thanks again.

abousetta

**DonkeyOte** · 08-09-2010, 04:29 AM

I confess I'm still not sure I understand fully the mechanics behind the implementation...

For ex. how are you intending to use the UserForm exactly ?

Is it your intention for the UF to act as a control panel for the UDF calls - ie store the chosen settings and have the UDF calls adjust accordingly ?

Or, are you looking to pick a string and invoke the UF to return a specific result for that one specific string ?

It's not really very clear I'm afraid

Regards the matching - I confess I don't really understand how you intend to implement "Closeness of Match" within the UDF ?
I can't really envisage instances where the choice would generate different results - the UDF is always going to return the closest match (ie exact where exact or greatest % where partial).

In terms of handling the multiple ranges - again how you handle this will largely depend on the first point regards the UF implementation.

It may be that you will want to use a ParamArray else modify the UDF to pass all ranges and process only those columns requested.
I've not tested but pending volume of data I suspect the "multiple column" search will have a relatively significant impact on performance of the UDF overall.

I am happy to continue looking at this but I think you need to start nailing down the specific requirements - ie avoiding feature creep and the hypothetical
In short, given the (potential) complexity of what you're trying to do it's important (for all) to avoid ambiguity.

As you've already seen there are others looking at this thread who will invariably offer alternatives - how appropriate any / all of the code offered will be will invariably stem from how clear the requirements are in the first instance.

To avoid confusion always good to outline expected results for any sample data provided - for all scenarios.

**abousetta** · 08-09-2010, 07:05 AM

Hi DonkeyOte,

Thanks for having a look at this and I apologize for any confusion. You bring up some important points and I will do my best to clarify.

(1) UserForm: The UF would act as a control panel for the user to choose one or a combination of options. This adds needed versatility depending on the situation at hand. By checking which criteria to compare, the user can increase the sensitivity or specificity as needed. For example, if I was looking for an exact match on all four Criteria for Comparison (i.e. Author, Journal, Year; Title; Abstract; Key Words), then I would check all four boxes and under Closeness of Matches I would choose Identical Match. The result would be a reference that was identical on all four criteria (highly specific). If I decreased the specificity to increase sensitivity (e.g. reduced the criteria to an identical Title only), then it would pick up more results (more sensitive but less specific).

(2) Regarding the Closeness of Match, the UDF that you provided is excellent and works for a single column. Honestly, I am not familiar with a ParamArray so I can't comment on that, but what I was envisioning is comparing each column chosen by the user via the userform separately and then adding the fractions (absolute closeness) rather than the percentages (relative closeness). What I mean to say is that instead of averaging the percentages of closeness from each column, which gives equal weight to all columns, I hope to weight the columns by the number of words in each cell. Therefore, if the title had 10 words and the abstract 90 words, then the title would have 10% of the power to control the Closeness of Match and the abstract 90%.

I hope that my explanations are clearer this time around, and I apologize in advance if I am still not clear. Providing clear, concise answers is an art I am still unable to grasp, hence my ramblings.
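The weighting described in (2) can be written out explicitly. As a sketch, let $m_i$ be the number of matched words in field $i$ and $n_i$ the number of words in that field (these symbols are introduced here for illustration and do not come from the UDF):

$$
P \;=\; \frac{\sum_i m_i}{\sum_i n_i} \;=\; \sum_i w_i \, p_i ,
\qquad p_i = \frac{m_i}{n_i}, \qquad w_i = \frac{n_i}{\sum_j n_j}
$$

So adding the fractions' numerators and denominators is the same as taking a weighted average of the per-field percentages, with each field weighted by its share of the words. For a 10-word title and a 90-word abstract this gives $w_{\text{title}} = 0.1$ and $w_{\text{abstract}} = 0.9$.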

Thanks again for everything.

abousetta

**snb** · 08-09-2010, 10:02 AM

In the attachment a suggestion.

The userform contains 2 spinbuttons.
The left one is meant to choose the row which has to be compared.
The right one is meant to choose the row with which it has to be compared to.

So:
- choose the row you want to compare (left spinbutton)
- choose the row with which that row has to be compared to (right spinbutton)
- choose the elements you want to compare (checkboxes)

Because I do not understand your distinction of comparing methods I stick to the one I posted earlier.

**abousetta** · 08-10-2010, 01:13 AM

Hi snb,

Thank you for all your efforts. I will have to study your design further to learn your advanced techniques. Unfortunately, the userform can only be used as a control form because the comparisons will be among thousands of rows of data, and so I would not be able to compare them manually without the results being copied to an Excel sheet. Having said that, I will study your approach because I have learned a lot from your posts over the past few months and am encouraged that I will be able to find a way forward.

Thanks again for posting this code.

abousetta

**DonkeyOte** · 08-10-2010, 06:06 AM

edit: file reloaded at 11:10 UK Time - typos

abousetta, attached is a much simplified approach but one that I hope is heading in the right direction ?

In the attached the UDF is utilised only in Column F to return row number of "closest" match and INDEX calls are used thereafter to retrieve the info.

A number of your requirements are not included in this at this stage - the above is merely meant to test the waters to see if it's along the right lines.

That is to say - in the attachment:

- the User Form is not utilised - instead a basic Y/N table on CP sheet is used
- we're only considering the inclusion/exclusion of certain fields - ie there is no "closeness" of match scale
- the UDF is at present just aggregating the % field matches to find the greatest match as opposed to weighting on a word match basis
(eg 3/5 -> 60% + 30/50 -> 60% -> 120% would still supersede 1/5 -> 20% + 40/50 -> 80% -> 100% even though on a word count basis 41 > 33)
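The ranking flip in that last point can be reproduced with a few lines. This is an illustrative Python sketch, not the UDF in the attachment; each candidate is given as a list of hypothetical (matched words, total words) pairs, one per field.

```python
def summed_percentages(fields):
    """Add the per-field percentages (what the current UDF is described as doing)."""
    return sum(m / n for m, n in fields)


def pooled_words(fields):
    """Add matched and total word counts across fields, then divide once."""
    return sum(m for m, _ in fields) / sum(n for _, n in fields)


candidate_a = [(3, 5), (30, 50)]   # 60% + 60%, 33 matched words
candidate_b = [(1, 5), (40, 50)]   # 20% + 80%, 41 matched words

# Candidate A wins on summed percentages, candidate B wins on pooled words.
print(summed_percentages(candidate_a), summed_percentages(candidate_b))
print(pooled_words(candidate_a), pooled_words(candidate_b))
```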

I still have some fairly major reservations regards whether or not this is going to trap plagiarism which was the original intent ?
I think (?) to that end my earlier points still hold true - ie same author records should be discounted and publish date is significant (ie earlier publish date = exclusion).

Regards calculating the overall percentage - whether or not the % should be based on a (word) weighted basis or field specific basis is - I would say - open to some discussion.

The code itself is ungainly and with more thought can be streamlined, however, at this stage it's more important to nail down the fundamental logic / workflow - once finalised the code can be tidied.

In the sample you will find that if you opt to exclude Title then the results for rows 7 & 9 will alter accordingly.

Code below:

[code not included in this copy of the thread]

**snb** · 08-10-2010, 06:22 AM

Based on our latest information I added a new userform that will automatically compare all rows on the indicated fields.
Cfr. the attachment.

**DonkeyOte** · 08-10-2010, 06:30 AM

@snb - looks impressive but I noted some oddities, for ex. at first glance I noted:

row 3 to row 3 not a 100% match for any field except Author
row 3 & row 4 show 100% match for all fields

I didn't check any further ... as is I suspect our codes are approaching things a little differently but hopefully between the two abousetta will resolve to satisfaction.

edit: PM'd snb as I think there was some confusion re: significance of legacy UDF information (Cols F onwards) - should prove to be a quick fix

**DonkeyOte** · 08-10-2010, 09:22 AM

@snb - I suspect (but might be wrong - often am) that OP would want to avoid partial word matches which will result from the use of InStr.

If you compare Title of rows 3 & 5 (having adjusted the +7 to +1) the first 3 to 5 will return 0% whereas the latter (5 to 3) will return 16.67% given Med (row 5) embedded within Medicophys (row 3) - ie InStr > 0

My assumption is that OP would expect both to return 0% but again this might be a misinterpretation on my part.
If not, were you to conduct a Numeric Match test of wd against a 2nd Array (ie c05 based on jj) that would I think resolve that issue... again this is all assuming it's not intentional functionality.
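To illustrate the difference between an InStr-style substring test and a whole-word test, here is a small Python sketch (not snb's VBA). The two titles are made-up stand-ins for rows 3 and 5, the function names are hypothetical, and words are split on whitespace only.

```python
def substring_share(source, target):
    """Share of words in `source` found anywhere inside `target` (InStr-like)."""
    words = source.lower().split()
    haystack = target.lower()
    return sum(w in haystack for w in words) / len(words) if words else 0.0


def whole_word_share(source, target):
    """Share of words in `source` that equal a complete word of `target`."""
    words = source.lower().split()
    target_words = set(target.lower().split())
    return sum(w in target_words for w in words) / len(words) if words else 0.0


row3 = "Europa Medicophys"   # hypothetical title
row5 = "Med Care"            # hypothetical title

print(substring_share(row3, row5), substring_share(row5, row3))    # asymmetric
print(whole_word_share(row3, row5), whole_word_share(row5, row3))  # both zero
```

The substring test is asymmetric because "med" sits inside "medicophys" but not the other way round; matching against a set of complete words removes that effect.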

edit: of course - goes without saying - apologies in advance if I have misinterpreted the code etc..

**snb** · 08-10-2010, 11:54 AM

@DonkeyOte

Thanks for pointing that out to me. I had some difficulty understanding what abousetta intended. That's why I made my suggestions. Testing those could clarify his/her intentions.

The partial match resulting from using Instr I tried to avoid with some minor alterations.
What has to be compared to what is still a riddle for me.
But assuming you are right I changed the code accordingly.
I hope the OP will volunteer to test it and provide us with some feedback.

[code not included in this copy of the thread]

**romperstomper** · 08-10-2010, 12:02 PM

I'm bookmarking this as an example for when people ask me why you should use meaningful variable/procedure names and code comments.

**DonkeyOte** · 08-10-2010, 12:28 PM

@snb - thanks for posting the revision
(in my version of your code I made the change to 4 in the Resize per earlier post so as to permit additional records)

The % calculation is obviously open to debate but I am not convinced this method as it stands will be appropriate given duplicates are excluded in the match but included in the divisor.

Consider:

c03 = "test test"
c04 = "test test test"

The iteration will loop "test" and "test" in c03 and in both instances "test" will be found in c04 (assuming you correct c04 to have leading and trailing space).
However, only the first instance will count as match given the 2nd instance should (in theory) be found within c05 (ie the incrementing string of unique words in c03 found thus far in c04).

Conversely the divisor remains total word count of c04 irrespective of duplication so in this instance you end up with 1/3
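A short Python sketch (again, not snb's VBA) reproduces the 1/3 result and shows one possible alternative that counts repeated words on both sides. The multiset variant is only an illustration of a consistent numerator and divisor; it was not proposed in the thread, and the function names are hypothetical.

```python
from collections import Counter


def unique_over_total(source, target):
    """Matches counted once per unique word, divisor counts every word in target."""
    total = len(target.split())
    matched = set(source.split()) & set(target.split())
    return len(matched) / total if total else 0.0


def multiset_overlap(source, target):
    """Each repeated word counts as often as it occurs in both strings."""
    total = len(target.split())
    shared = Counter(source.split()) & Counter(target.split())  # per-word minimum
    return sum(shared.values()) / total if total else 0.0


c03 = "test test"
c04 = "test test test"
print(unique_over_total(c03, c04))  # one match over three words
print(multiset_overlap(c03, c04))   # two matches over three words
```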

FWIW, I am not convinced any of the methods offered thus far are best fit but as you say we need OP to tell us either way !

@R, I think if you have a personal style and only you maintain the code then fair enough.
I confess it took me minutes rather than seconds to work through snb's code but I'm sure for snb it's a matter of seconds ... we all have our own styles - jindon's isn't easy to read through either (not a : in sight!)

**romperstomper** · 08-10-2010, 12:52 PM

@DO,
That's true, but given that it's the answer to a question, doesn't really seem to apply here.

I appreciate that many just take code without ever trying to understand it, in which case on their own heads be it, but I will stand by what I said, which I think was pretty reasonable especially compared to what I could, and probably should, have said about some other posts.
Each to their own, I guess, as long as it's not actively bad advice.

**snb** · 08-10-2010, 02:46 PM

cfr.........

**romperstomper** · 08-10-2010, 03:00 PM

The irony of the response is not lost on me.

**abousetta** · 08-11-2010, 01:28 AM

Hi DonkeyOte, snb and romperstomper,

Thank you all for your interest in my thread and all your innovations and thoughts. I never imagined that this thread would take on a life of its own, but I guess I underestimated the potential complexity of the issue. My examples so far have been simplified versions of what I am currently doing (and what I hope to achieve with the macro).

Step 1: Search multiple databases of citations to published literature (journal articles, books, dissertations, conference proceedings, etc.)
Step 2: Compile all these citations into a single reference management database (in Excel)
Step 3: Use conditional formatting to identify potential identical duplicates across the different fields (Author, Journal, Year; Title; Abstract)
Step 4: Remove all identical duplicates (same citation that came from different databases but are really references to the same article and so is not any form of plagiarism or duplicate publication of the work)

The steps above are possible using the built-in functions in Excel, but are extremely time consuming, and accuracy is sometimes questionable because it is done by eyeballing the results. That's why I have to have someone double-check all the results to make sure they are actual identical entries and not different citations.

.... after this point here is where it becomes close to impossible to do 100% accurately without some sort of quantitative guidance...

Step 5: Go through the individual 'unique' citations and compare them against each other to find 'similar' citations that may be plagiarised work (whether true plagiarism by another person or self-plagiarism). The similarities may be due to the same author publishing work on similar populations, drug doses, etc., or because the work was published on several different platforms (e.g. conferences and journal articles), OR, more importantly, someone stealing (plagiarising) work from someone else.

That's why in the userform that I suggested I had three options to choose from under Closeness of Matches:
* Identical Match (can be used as a first run to identify duplicate entries coming from different databases)
* Percentage Match (user-defined minimal percentage to retain matches (e.g. only citations that are at least 70% similar are reported))
* Best Match (Most accurate when the list has been cut down to the ones that are potential plagiarism).
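As a sketch of how these three options might act on a set of similarity scores that has already been computed, here is a small Python function (not part of any code posted in the thread). The name `select_matches`, the mode strings and the 0.7 default are assumptions made for illustration.

```python
def select_matches(scores, mode, cutoff=0.7):
    """Filter {row: similarity in 0..1} according to the chosen closeness mode."""
    if mode == "identical":
        return [row for row, s in scores.items() if s == 1.0]
    if mode == "percentage":
        return [row for row, s in scores.items() if s >= cutoff]
    if mode == "best":
        if not scores:
            return []
        top = max(scores.values())
        return [row for row, s in scores.items() if s == top]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical scores of one citation against rows 2, 3 and 4.
scores = {2: 1.0, 3: 0.75, 4: 0.4}
print(select_matches(scores, "identical"))
print(select_matches(scores, "percentage", cutoff=0.7))
print(select_matches(scores, "best"))
```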

Combining these three runs in sequence should allow the user to compare the large number of citations to find the 'needle in the haystack'. The actual database consists of an average of 8000 to 12000 citations, and so could be computationally challenging. That's why I have proposed the use of sequential runs, in addition to the flexibility that it offers the user.

DonkeyOte, I agree that we could use the year as a limiting factor, but just one point I should clarify. Sometimes there is no date reported by the database or is imported incorrectly. So you could get a citation that looks like:

* abousetta, Journal of ExcelForum, 2010
or
* abousetta, Journal of ExcelForum

Therefore, if dates are used, the macro must not exclude blank dates or citations published in the same year.
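A minimal sketch of such a date rule, in Python for illustration only. It assumes the rule is read as "a record can only have copied from something published in the same year or earlier", and that a missing year is represented as `None`; the function name is hypothetical.

```python
def may_compare(record_year, other_year):
    """Decide whether `other` stays in the comparison for `record`.

    Blank years (None) never exclude a pair, and the same year is allowed.
    """
    if record_year is None or other_year is None:
        return True
    return other_year <= record_year


print(may_compare(2010, None))   # blank date is kept
print(may_compare(2010, 2010))   # same year is kept
print(may_compare(2009, 2010))   # later publication is dropped as a possible source
```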

I am happy to elaborate further or discuss any of the options further.

Thanks again everyone for your interest and all your help. I have learned a lot from the posts over the past few days.

abousetta
