Thursday, May 20, 2010

So we've probably found the system bottleneck...

I guess we have suspected for a while now that our system will have quite a long runtime to perform its task of detecting plagiarism.

I ran some calculations to see where we might focus our optimisation effort. As part of the external analysis we will check the similarities between sentences (or perhaps some other text sequences). These similarities will be measured with cosine similarity, which by itself is a fairly fast method, but our training data consists of 22 million sentences in the set of source documents and another 22 million sentences in the set of suspicious documents. A brute-force comparison of every source sentence against every suspicious sentence would therefore require 22 × 10^6 × 22 × 10^6 = 4.84 × 10^14 cosine similarity computations!
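To make the cost concrete, here is a minimal sketch of the kind of computation involved, assuming sentences are represented as simple term-frequency vectors (the actual feature representation in our system may well differ):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    # Dot product only over terms the two sentences share.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two toy sentences (hypothetical examples, not from our data).
s1 = Counter("the cat sat on the mat".split())
s2 = Counter("the cat lay on the mat".split())
sim = cosine_similarity(s1, s2)  # 0.875 for these two sentences

# The brute-force pairing that makes this a bottleneck:
n_source = 22_000_000
n_suspicious = 22_000_000
pairs = n_source * n_suspicious  # 4.84e14 similarity computations
```

Even at, say, a million similarity computations per second, 4.84 × 10^14 pairs would take over fifteen years, which is why the all-pairs approach is a non-starter without some reduction.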

I have now started to search for a solution, or at least a reduction, to this problem, but so far I have found none... wish me luck, or else we will spend most of our upcoming days staring at a system process that runs and runs and runs... and most likely won't be done for a while.

1 comment:

  1. For the web searching you mentioned, CopyCatch at http://www.getcopycatch.com works very well. I think it's only available for schools, but they offer free trials for professors and teachers at http://getcopycatch.com/pricing.html. I just filled out the form and they contacted me with credentials to log in. WCopyFind is useful, but it doesn't really come into play for web searches, which are absolutely necessary for catching essay plagiarism (which can come from many, many sources on the web).
