Thursday, May 20, 2010

So we probably found the system bottleneck..

I guess we have suspected for a while now that our system will have quite a long runtime when performing its task of detecting plagiarism.

I ran some calculations in order to tell where we might put some effort into optimisation. As part of the external analysis we will check the similarities between sentences (or perhaps some other text sequence). These similarities will be measured with the cosine similarity measure, which by itself might be a good and quite fast method, but our training data consists of 22 million sentences in the set of source documents and another 22 million sentences in the set of suspicious documents. So there will be a need to compute 4.84 × 10^14 cosine similarities!
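To make the size of the problem concrete, here is a minimal sketch of the pairwise comparison, assuming each sentence is represented as a simple bag-of-words count vector (the actual representation in our system may well end up being different):

    import math
    from collections import Counter

    def sentence_vector(sentence):
        """Toy bag-of-words vector: lowercased token counts."""
        return Counter(sentence.lower().split())

    def cosine_similarity(vec_a, vec_b):
        """Cosine of the angle between two sparse count vectors."""
        dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # Brute force is quadratic: 22e6 source sentences x 22e6 suspicious
    # sentences = 4.84e14 similarity computations. Even at an optimistic
    # one million similarities per second that is about
    # 4.84e14 / 1e6 / 86400 / 365 ≈ 15 years of computing.

Even if each single call is cheap, the sheer number of pairs is what kills us.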

I have now started to search for a solution to, or at least a reduction of, this problem, but so far I have found none.. wish me luck, or else we will spend most of our upcoming days staring at a system process that runs and runs and runs... and most likely won't be done for a while..

Monday, May 10, 2010

Sentence classification

I got an assignment from J to dig into the world of textual sentences from a stylometric point of view and try to fill in the blank in the statement: a sentence is _.

So a sentence is (a rough sketch of how a few of these properties could be measured follows the list):

  1. short
  2. long
  3. simple
  4. complex
  5. correct, grammatically well formed
  6. common
  7. factual
  8. journalistic
  9. legal
  10. scientific
  11. temporal: past, now, future
  12. narrative: first/second/third person, grammatical gender
  13. non-alphabetic: numerical, symbols
  14. part of a language: English, German, Spanish
  15. in matrix form
  16. compound
  17. made up of difficult words
  18. declarative
  19. imperative
  20. interrogative
  21. exclamatory
  22. conditional
  23. regular
  24. irregular
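
As promised above, here is a rough, purely hypothetical sketch of how a few of these properties could be turned into measurable features (the thresholds and feature names are my own assumptions, not something we have settled on):

    import re

    def sentence_features(sentence):
        """Measure a handful of the listed properties for one sentence."""
        tokens = sentence.split()
        n_tokens = len(tokens)
        alpha_chars = sum(c.isalpha() for c in sentence)
        return {
            "short": n_tokens <= 8,                                            # 1
            "long": n_tokens >= 25,                                            # 2
            "non_alpha_ratio": 1 - alpha_chars / max(len(sentence), 1),        # 13
            "contains_digits": any(c.isdigit() for c in sentence),             # 13
            "first_person": bool(re.search(r"\b(I|we|my|our)\b", sentence)),   # 12
            "interrogative": sentence.rstrip().endswith("?"),                  # 20
            "exclamatory": sentence.rstrip().endswith("!"),                    # 21
        }

    print(sentence_features("Will the thesis be finished before the 1st of June?"))

Most of the other properties (complex, journalistic, legal, ...) will of course need far more than a regular expression to capture.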

Status update

So, it has been a while since my last post, and that is mainly because we have run into some problems in the project.

First we had a problem with detecting plagiarism at a fine-grained level. Our models allowed us to decide whether or not a document contained plagiarized passages, but we needed to be able to detect them at the character level. After some thinking we decided it might be OK to give up some granularity, so we will now try to detect plagiarism at the sentence level. We hope that this level of granularity will provide enough detail, and we claim that someone who plagiarizes will mostly do so in full sentences. Let us hope that the upcoming experiments prove this to be correct.. :)

Since we decided to change the level of granularity, I had to update the tagging in the training data so that we can learn at the sentence level instead of the previous character level. In doing so I ran into some difficulties, but I hope that I have gotten past them now.. although I expect that there might be a bug somewhere, because there was some strange behavior when I did some testing.
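
For reference, a minimal sketch of how I think about the re-tagging, assuming the original annotations are character spans (start and end offsets) of plagiarized passages and that a sentence gets the label 1 as soon as it overlaps such a span (the names and the overlap rule are my assumptions, not necessarily what ends up in the system):

    def overlaps(a_start, a_end, b_start, b_end):
        """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
        return a_start < b_end and b_start < a_end

    def tag_sentences(sentence_spans, plagiarized_spans):
        """sentence_spans: (start, end) character offsets of each sentence.
        plagiarized_spans: (start, end) offsets of annotated plagiarism.
        Returns one 0/1 label per sentence."""
        return [
            1 if any(overlaps(s_start, s_end, p_start, p_end)
                     for p_start, p_end in plagiarized_spans) else 0
            for s_start, s_end in sentence_spans
        ]

    # The second sentence overlaps the annotated span (12, 40); the others do not.
    print(tag_sentences([(0, 11), (12, 60), (61, 90)], [(12, 40)]))  # -> [0, 1, 0]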

I would say that we are now past the Design Phase of the project and into the Implementation Phase (at least I am.. :) ). So I expect that there will be a lot more problems and difficulties ahead. All in all there will be some exciting weeks coming up, and on the 1st of June we have to be done with the implementation, so cross your fingers that everything will proceed in the best possible way...

Until then..

Wednesday, April 7, 2010

Moving to the Design Phase

Today I begin the Design Phase of the project. So, for now, I will leave the reading up on the subject of Plagiarism Detection behind me and start trying to apply my recently acquired knowledge to something useful.

The Design Phase will consist of a lot of decision making. I have already decided that the automatic plagiarism detection system will be implemented in Python. But how should it be implemented? What should the implementation process look like? What Integrated Development Environment (IDE) should be used? How should the implementation be built and tested? What name should it have? etc..

To help my decision-making process I will sketch a lot and try out different ideas. But before that can commence I have realised something.. I need to read some more.. This time the focus will be more on the general area of Natural Language Processing (NLP) and Python. The next text I will lay my eyes on is the Style Guide for Python Code (PEP 8).


Thursday, April 1, 2010

A strategy for detecting plagiarism

We have decided on a strategy for how to automatically detect plagiarism. It will be a hybrid of techniques from nearby research areas.

Our aim is to catch plagiarism both semantically and stylistically. We have a nice word space model that will be used to capture semantic features of the text, and for style recognition we will use techniques from the authorship identification research field.

I will implement two baseline algorithms to be used to measure our results against. The first one will be a really naïve one and will act as a lower bound that we should never get close to. The second will represent "the state of the art" in plagiarism detection tools, the one we will strive to surpass, and will probably be the winner of the 1st International Competition on Plagiarism Detection, namely ENCOPLOT.
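
The post above does not pin down what the naïve baseline will look like, but one possible lower bound in this spirit is to flag a suspicious sentence as plagiarized only when it occurs verbatim (after some simple normalisation) among the source sentences; a small sketch of that idea:

    import re

    def normalise(sentence):
        """Lowercase and collapse everything that is not a letter or a digit."""
        return re.sub(r"[^a-z0-9]+", " ", sentence.lower()).strip()

    def naive_baseline(source_sentences, suspicious_sentences):
        """Flag a suspicious sentence only on an exact (normalised) match."""
        source_set = {normalise(s) for s in source_sentences}
        return [normalise(s) in source_set for s in suspicious_sentences]

    print(naive_baseline(
        ["The quick brown fox jumps over the lazy dog."],
        ["The quick brown fox jumps over the lazy dog!", "Something original."],
    ))  # -> [True, False]

Such a baseline will miss every paraphrase, which is exactly why it should act as a lower bound that any serious detector must clear.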