Damocles was a plagiarism detection system for electronic documents developed by Dr David Squire. From 2000 until 2014 at Monash University, Damocles was used with success to detect plagiarism from documents on the web, as well as student-student copying, in hundreds of subjects in faculties including Information Technology, Arts, Law, Business and Economics, and Engineering. It was also used at Monash College. As of January 2014, more than 45,000 student essays had been analysed by Damocles.
Damocles, first developed in 2000, works in several phases. It begins by converting documents in PostScript, PDF, HTML or Word (.doc) formats (and several others) to plain text, using a variety of tools. The output of these tools is passed through a heuristic text-repair filter, which fixes words broken across lines, characters lost to font ligatures, and other common faults introduced by the conversion process.
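As an illustration of the kind of repair such a filter performs, here is a minimal sketch in Python. It is not the Damocles filter, whose heuristics are not described here. The function name `repair_text` and the ligature table are assumptions made for the example. It only covers the simpler case where a ligature survives conversion as a single Unicode glyph; recovering characters that were dropped entirely would need something like a dictionary lookup, which is not shown.

```python
import re

# Assumption: ligatures arrive as single Unicode glyphs (U+FB00..U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def repair_text(raw: str) -> str:
    """Apply a few simple repairs to text produced by a format converter."""
    # Expand ligature glyphs into their component letters.
    for glyph, letters in LIGATURES.items():
        raw = raw.replace(glyph, letters)
    # Rejoin words hyphenated across a line break: "detec-\ntion" -> "detection".
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    raw = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Turn single line breaks into spaces; blank lines (paragraph breaks) are kept.
    raw = re.sub(r"(?<!\n)\n(?!\n)", " ", raw)
    return raw
```

A production filter would need more care than this, for example checking a rejoined word against a word list before removing the hyphen.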
These text files are then scanned for URLs, since some students in a class will have cited their on-line sources. These URLs are crawled to a specified depth, and the retrieved documents' paragraphs are indexed to create a database, using text retrieval techniques such as word stemming and inverted files. The student documents can also be included in this database so that student-student copying can be detected.
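The URL scan and the paragraph index can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Damocles implementation: the names `extract_urls`, `stem`, `terms` and `ParagraphIndex` are hypothetical, the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and ranking by the number of shared stems is an assumed scoring rule. Crawling itself is omitted.

```python
import re
from collections import Counter, defaultdict

# Assumption: only http(s) URLs are of interest.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the URLs found in text, without trailing punctuation."""
    return [u.rstrip(".,;:") for u in URL_RE.findall(text)]


def stem(word):
    """Crude suffix stripper (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lower-case the text, split it into words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z0-9]+", text.lower())]


class ParagraphIndex:
    """Inverted file: maps each stem to the set of paragraphs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.paragraphs = {}

    def add(self, para_id, text):
        self.paragraphs[para_id] = text
        for t in set(terms(text)):
            self.postings[t].add(para_id)

    def query(self, text, n=5):
        """Return the ids of the n paragraphs sharing the most stems with text."""
        scores = Counter()
        for t in set(terms(text)):
            for para_id in self.postings.get(t, ()):
                scores[para_id] += 1
        return [para_id for para_id, _ in scores.most_common(n)]
```

Because both crawled web paragraphs and student paragraphs can be added to the same index, one query can surface web sources and student-student copying alike.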
Each paragraph of each student document is then used as a query against this database. The n best-matching results are checked against the query paragraph for runs of consecutive matching words above a given length threshold (typically 10). If there are no matches, a web search is launched, using sentence fragments. If a search result is a sufficiently good match, that document is retrieved. The Damocles index is periodically rebuilt to include these new documents, which increases the chance of a local hit and reduces web traffic. This subject-specific database can be reused in future years, or for related subjects.
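The run check can be illustrated with a short Python sketch. How Damocles finds runs is not described here, so the dynamic-programming approach below is an assumption, as are the names `words` and `matching_runs`. Only the default threshold of 10 words comes from the description above.

```python
import re


def words(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matching_runs(query, candidate, min_run=10):
    """Find maximal runs of consecutive words shared by two paragraphs.

    Returns (query_start, candidate_start, length) tuples, with positions
    given as word offsets, for every run at least min_run words long.
    """
    q, c = words(query), words(candidate)
    runs = []
    # prev[j] holds the length of the common run ending at q[i-2], c[j-1].
    prev = [0] * (len(c) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(c) + 1)
        for j in range(1, len(c) + 1):
            if q[i - 1] == c[j - 1]:
                cur[j] = prev[j - 1] + 1
                # The run is maximal if the next pair of words does not match.
                ends_here = i == len(q) or j == len(c) or q[i] != c[j]
                if ends_here and cur[j] >= min_run:
                    length = cur[j]
                    runs.append((i - length, j - length, length))
        prev = cur
    return runs
```

This costs time proportional to the product of the two paragraph lengths, which is acceptable when it is applied only to the few best-matching paragraphs returned by the index rather than to the whole database.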
After all documents have been checked, reports are generated in which runs of matching words are highlighted in colour, with each colour corresponding to a different matched document. Beside each paragraph the titles of the matched documents are shown, along with links that when clicked will show the matched paragraph and source paragraph in parallel, along with a link to the original matched document on the web. Below is an example of part of a Damocles report:
|6|Biologists say they have found what is quite likely to be the first [25 words] documented case of "re-evolution", suggesting that nature does indeed offer second chances - a species can evolve a new characteristic, lose it and then regain it.|Walking Sticks, Just Winging It (washingtonpost.com) Matching text|
The report is arranged in three columns. The leftmost indicates the paragraph number (here "6") in the source document (i.e. the one for which the report was generated), which is useful when one wishes to refer to a particular part of the report.
The centre column, which occupies most of the report, shows the text of the query document. Runs of consecutive words which occur in matched documents are shown in colour and boldface, with the run length given as a purple superscript at the start of the run (see note 1 below). Words which occur in the matched document(s), but not in a sufficiently long run, are shown in colour, but not in boldface. Each matched document is assigned a colour, and all matches to that document will be shown in that colour (here nearly all the matches are to the same document, and are shown in red).
The rightmost column shows the title of each matched document, with colour-coded links labelled "Matching Text". Clicking on one of these links will generate a page showing both the query and matched paragraphs for direct comparison. A link is also provided to the location where the matched document was found on the web (you can try the one above).
Full details of the matched documents, including their URLs, appear at the bottom of the report.
This example is from a report which was generated for an article from The Age newspaper, which had been taken from one in the Washington Post, with attribution. Damocles found the source article. It is interesting to note that the two articles do not match exactly: some editing has been done for the Australian audience of The Age.
1. If overlapping runs occur in different matched document paragraphs, the run length for the run which starts second will appear after the first run terminates, but the run length will be from the start of the second run. Consequently the number of highlighted words for the second run will appear to be less than the stated run length, since some matched words will occur before the run length number, and will be highlighted in the colour of the first run.
Anya Daly, In a knowledge economy, plagiarism is just a matter of degree, The Age, August 24, 2012.
2007-10-24: Damocles wins a contest investigating plagiarism in academic publications run by the University of Geneva (in French). See the original submission (in English).
Maurie Hasen and Michèle Y Huppert, The Trial of Damocles: an investigation into the incidence of plagiarism at an Australian university, in Proceedings of the AARE Conference, 2005.
Debora Weber-Wulff, Kurse über Plagiat Fremde Federn Finden, August, 2004.
Damocles ranked equal 1st in benchmarking of plagiarism detection tools on German text.
Jim Buckell, Plagiarists under survey, but detection delayed, The Australian, January 15, 2003.
Jim Buckell, Software to snare plagiarists, The Australian, August 14, 2002.