Wednesday, April 7, 2010

Moving to the Design Phase

Today I begin the Design Phase of the project. So, for now, I will leave the reading on the subject of Plagiarism Detection behind me and try to apply my recently acquired knowledge to something useful.

The Design Phase will consist of a lot of decision making. I have already decided that the automatic plagiarism detection system will be implemented in Python. But how should it be implemented? What should the implementation process look like? What Integrated Development Environment (IDE) should be used? How should the implementation be built and tested? What name should it have? And so on.

To help my decision-making process I will sketch a lot and try out different ideas. But before that can commence I have realised something: I need to read some more. This time the reading will focus on the general area of Natural Language Processing (NLP) and Python. The next text I will lay my eyes on is the Style Guide for Python Code (PEP 8).





Thursday, April 1, 2010

A strategy for detecting plagiarism

We have decided on a strategy for automatically detecting plagiarism. It will be a hybrid of techniques from nearby research areas.

Our aim is to catch plagiarism both semantically and stylistically. We have a nice word space model that will be used to capture semantic features of the text, and for style recognition we will use techniques from the authorship identification research field.

I will implement two baseline algorithms against which to measure our results. The first will be a really naïve one and will act as a lower bound that we should never get close to. The second will represent the "state of the art" plagiarism detection tool that we will strive to surpass; it will probably be the winner of the 1st International Competition on Plagiarism Detection, namely ENCOPLOT.
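
What the naïve baseline will actually do is not settled yet. Purely as an illustration, and assuming it ends up being something like a word n-gram overlap check, a minimal sketch could look like the following. The function names and the choice of n = 3 are placeholders of mine, not decisions.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source.

    Hypothetical naive baseline: 0.0 means no shared n-grams,
    1.0 means every n-gram of the suspicious text appears in the source.
    """
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        # Text shorter than n words: nothing to compare.
        return 0.0
    shared = suspicious_ngrams & word_ngrams(source, n)
    return len(shared) / len(suspicious_ngrams)
```

A check like this only sees verbatim word sequences, so it would miss substitutions and reordering, which is roughly what one would expect from a lower bound.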

A definition of Plagiarism

I hereby (and hereafter) define plagiarism as:


source : {(exact copy | (word, phrase) (addition | deletion | substitution) | phrase (split | merge | reorder)) & uncredited} -> plagiarism
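
To make the rule above a bit more concrete, here is a small sketch of it as a Python predicate. It assumes that someone (or something) has already labelled how a passage was derived from its source; the label strings and the function name are my own and only illustrate the logic of the definition, not how detection will work.

```python
# The transformations named in the definition, written out as plain labels.
# The label strings themselves are an assumption made for this sketch.
TRANSFORMATIONS = {
    "exact copy",
    "word addition", "word deletion", "word substitution",
    "phrase addition", "phrase deletion", "phrase substitution",
    "phrase split", "phrase merge", "phrase reorder",
}


def is_plagiarism(transformations, credited):
    """Apply the definition: derived from a source AND uncredited.

    transformations: labels describing how the passage relates to the source
    credited: True if the source is credited
    """
    derived = any(t in TRANSFORMATIONS for t in transformations)
    return derived and not credited
```

For example, `is_plagiarism(["phrase reorder", "word substitution"], credited=False)` is true under this reading, while the same passage with a credited source is not.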