Plagiarism Detection: A Brief Overview

For a more detailed report on our project, please see our paper here.

A motivating example -- consider these three paragraphs from Harry Potter and the Sorcerer's Stone:

"But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes -- the getups you saw on young people! He supposed this was some stupid new fashion. He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together.

Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt -- these people were obviously collecting for something...yes, that would be it. The traffic moved on and a few minutes later, Mr. Dursley arrived in the parking lot, his mind back on drills.

The evil of the actual disparity in their ages (and Mr. Woodhouse had not married early) was much increased by his constitution and habits; for having been a valetudinarian all his life, without activity of mind or body, he was a much older man in ways than in years; and though everywhere beloved for the friendliness of his heart and his amiable temper, his talents could not have recommended him at any time."

Intrinsic Detection

"Hmmm...that third paragraph seems much more sophisticated than the first two."
Use stylometric features of the text to detect passages whose writing style is "different" from the rest of the document.
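
As a rough illustration of the idea, the Python sketch below scores each passage by how far two simple stylometric features (average word length and average sentence length) sit from the document-wide mean. The feature choice, the function names stylometric_features and outlier_scores, and the z-score-style scoring are assumptions made for illustration, not necessarily the features or scoring used in the project.

```python
import re
from statistics import mean, pstdev


def stylometric_features(passage):
    """Return (average word length, average sentence length in words).

    Assumes the passage contains at least one word. Sentences are split
    naively on '.', '!' and '?', so abbreviations such as "Mr." are
    counted as sentence breaks.
    """
    words = re.findall(r"[A-Za-z']+", passage)
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    avg_word_len = mean(len(w) for w in words)
    avg_sentence_len = len(words) / len(sentences)
    return (avg_word_len, avg_sentence_len)


def outlier_scores(passages):
    """Score each passage by its distance from the document mean.

    For every feature, the absolute deviation from the mean is divided by
    the population standard deviation, and the results are summed per
    passage. A higher score means the passage looks more "different".
    """
    vectors = [stylometric_features(p) for p in passages]
    scores = [0.0] * len(passages)
    for column in zip(*vectors):
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # every passage agrees on this feature
        for i, value in enumerate(column):
            scores[i] += abs(value - mu) / sigma
    return scores
```

A caller would pass one string per paragraph and inspect the passages with the highest scores. Real systems typically use many more features and a proper clustering or outlier-detection step; this sketch only shows the shape of the computation.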

Extrinsic Detection

"Hmmm...that third paragraph is from Emma, not Harry Potter and the Sorcerer's Stone."
Find passages that are similar to passages in an external corpus.
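
One simple way to make "similar" concrete is to compare overlapping word n-grams. The Python sketch below is a minimal illustration under that assumption: it breaks each text into word trigrams and ranks source documents by Jaccard similarity to the suspicious passage. The names shingles, jaccard, and best_source, and the choice of trigrams, are hypothetical and are not taken from the project itself.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_source(passage, corpus, n=3):
    """Return (name, similarity) for the most similar source document.

    corpus maps a document name to its text and is assumed to be non-empty.
    """
    target = shingles(passage, n)
    scored = ((name, jaccard(target, shingles(text, n)))
              for name, text in corpus.items())
    return max(scored, key=lambda pair: pair[1])
```

Comparing every passage against every source document this way does not scale to a large corpus, so a practical system would usually add an indexing or fingerprinting step before the pairwise comparison.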

Intrinsic Plagiarism Detection

The goal of intrinsic plagiarism detection is to find passages within a document that appear to be significantly different from the rest of the document. To do so, we break the process down into three steps.

Extrinsic Plagiarism Detection

Extrinsic plagiarism detection has more information to work with: in addition to a suspicious document, we are also given a number of external documents, or source documents, to compare against the suspicious document. The extrinsic detection process can be broken into three steps: