I guess we have suspected for a while now that our system will need quite a long runtime to perform its task of detecting plagiarism.
I ran some calculations to find out where we should put effort into optimisation. As part of the external analysis we will check the similarities between sentences (or perhaps some other text sequence). These similarities will be measured with cosine similarity, which by itself might be a good and quite fast method, but our training data consists of 22 million sentences in the set of source documents and another 22 million sentences in the set of suspicious documents. Comparing every source sentence with every suspicious sentence means 22 million × 22 million comparisons, so there will be a need to compute 4.84 × 10^14 cosine similarities!
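To make the scale concrete, here is a minimal sketch of the calculation. The cosine similarity below works on raw word counts, which is only an assumption about how the sentence vectors might be built, and the throughput figure is a made-up number used purely to illustrate the order of magnitude, not a measurement of our system.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two sentences using raw term counts (an assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


n_source = 22_000_000      # sentences in the source documents
n_suspicious = 22_000_000  # sentences in the suspicious documents
n_pairs = n_source * n_suspicious  # 4.84 × 10^14 pairwise comparisons

# Hypothetical rate, chosen only for illustration: one million comparisons per second.
assumed_pairs_per_second = 1_000_000
seconds = n_pairs / assumed_pairs_per_second
years = seconds / (60 * 60 * 24 * 365)

print(f"{n_pairs:.2e} comparisons, about {years:.0f} years at the assumed rate")
```

Even at that optimistic assumed rate, a brute-force all-pairs comparison would run for roughly 15 years on a single process, which is why some way of reducing the number of candidate pairs is needed.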
I have now started to search for a solution to this problem, or at least a way to reduce it, but have not found one yet. Wish me luck, or else we will spend most of our upcoming days staring at a system process that runs and runs and runs, and most likely won't be done for a while.
Thursday, May 20, 2010
Monday, May 10, 2010
Sentence classification
I got an assignment from J to dig into the world of textual sentences from a stylometric point of view and try to fill in the blank in the statement: a sentence is _.
So a sentence is:
- short
- long
- simple
- complex
- correct, grammatically well formed
- common
- factual
- journalistic
- legal
- scientific
- temporal: past, now, future
- narrative: first/second/third person, gender
- non alpha: numerical, symbols
- part of a language: English, German, Spanish
- in matrix form
- compound
- made up of difficult words
- declarative
- imperative
- interrogative
- exclamatory
- conditional
- regular
- irregular
Status update
So, it has been a while since my last post, and that is mainly because we have run into some problems in the project.
First we had a problem with detecting plagiarism on a fine-grained level. Our models allowed us to decide whether or not a document had plagiarized passages, but we needed to be able to detect it on a character level. After some thinking we decided it might be OK to give up some granularity, so we will now try to detect plagiarism on a sentence level. We hope that this level of granularity will provide enough detail, and we claim that someone who plagiarizes will mostly do so in full sentences. Let us hope that the upcoming experiments will prove this to be correct.. :)
Since we decided to change the level of granularity I had to update the tagging in the training data so that we will be able to learn on a sentence level instead of the previous character level. In doing so I ran into some difficulties, but I hope that I have gotten past them now, although I expect that there might be a bug somewhere because there was some strange behavior when I did some testing.
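As an illustration of what moving the tags from character level to sentence level can look like, here is a minimal sketch. It is not the actual retagging code: the function name `label_sentences`, the span representation and the 50% overlap threshold are all assumptions made for the example, and it assumes that the plagiarized spans do not overlap each other.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start offset, end offset), end exclusive


def label_sentences(sentences: List[Span],
                    plagiarized: List[Span],
                    min_overlap: float = 0.5) -> List[bool]:
    """Mark a sentence as plagiarized when enough of its characters are tagged.

    min_overlap is a hypothetical threshold: the fraction of the sentence
    that must be covered by character-level plagiarism spans.
    """
    labels = []
    for s_start, s_end in sentences:
        length = s_end - s_start
        # Count the characters of this sentence that fall inside any tagged span.
        covered = sum(
            max(0, min(s_end, p_end) - max(s_start, p_start))
            for p_start, p_end in plagiarized
        )
        labels.append(length > 0 and covered / length >= min_overlap)
    return labels


# Three sentences and one character-level annotation covering offsets 15..55.
sentences = [(0, 20), (20, 50), (50, 80)]
plagiarized = [(15, 55)]
print(label_sentences(sentences, plagiarized))  # [False, True, False]
```

The threshold is where granularity is lost: a span that only touches the edge of a sentence no longer counts, which fits the claim that plagiarism mostly happens in full sentences but may also be a place where odd behavior shows up in testing.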
I would say that we are now past the Design Phase of the project and are now in the Implementation Phase (at least I am.. :) ). So I expect that there will be a lot more problems or difficulties ahead. All in all there will be some exciting weeks coming up, and by the 1st of June we have to be done with the implementation, so cross your fingers that everything will proceed in the best possible way...
Until then..