Thursday, March 25, 2010

Weekly meeting

Today J, M, and I had a meeting about the progress of the project.

Right now we are in the knowledge acquisition phase. That is, I will read up on the current research and its various implementations. Next week we will move on to the design phase.

To conclude the knowledge phase, I will present the current state of the automatic plagiarism detection field so that we get an overview. I will do this by constructing a box diagram in which every box represents an idea, algorithm, concept, technique, etc. that has previously been used and tested for plagiarism detection. We will then use this box diagram in the design phase to help us decide how to solve the problem. I will focus on the boxes that deal with the actual classification of whether a text sequence is plagiarized. There will also be some boxes concerning preprocessing (such as information retrieval) and postprocessing.

I will also set some time aside to get to know the data a little better. PAN has provided us with a large training corpus that consists of original and suspicious documents. Some of the suspicious documents contain plagiarized text sequences that are marked up. I will try out the machine learning framework Weka and the Natural Language Toolkit (NLTK), and learn how to use them to classify documents in different ways.
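To make the classification idea concrete, here is a minimal sketch of what a first experiment on one suspicious/original document pair could look like. It is plain Python, not Weka or NLTK, and everything in it is an assumption for illustration: the feature (word trigram containment), the threshold of 0.5, and the function names are hypothetical and do not come from PAN or from either framework.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    suspicious_grams = word_ngrams(suspicious, n)
    if not suspicious_grams:
        # Text shorter than n words: nothing to compare.
        return 0.0
    shared = suspicious_grams & word_ngrams(source, n)
    return len(shared) / len(suspicious_grams)


def looks_plagiarized(suspicious, source, n=3, threshold=0.5):
    """Flag the pair when enough n-grams are shared (threshold is an assumed value)."""
    return containment(suspicious, source, n) >= threshold
```

A score like this would only be one possible box in the diagram. It catches verbatim copying but not paraphrasing, so a real classifier would need more features than this.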

We decided to move the weekly meeting from Wednesdays to Thursdays so that I can attend a machine learning course without missing half the lectures.
