Thursday, May 20, 2010

So we probably found the system bottleneck..

I guess we have suspected for a while now that our system will have quite a long runtime when performing its task of detecting plagiarism.

I ran some calculations in order to tell where we might put in some optimisation effort. As part of the external analysis we will check the similarities between sentences (or perhaps some other text sequence). These similarities will be measured with the cosine similarity measure, which by itself might be a good and quite fast method, but.. our training data consists of 22 million sentences in the set of source documents and another 22 million sentences in the set of suspicious documents. So there will be a need to compute 4.84 × 10^14 cosine similarities!
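
For a sense of scale, this is roughly the computation we would repeat for every sentence pair (a minimal sketch assuming plain bag-of-words term-frequency vectors; the real vectors will come from our word space model):

    import math
    from collections import Counter

    def cosine_similarity(sentence_a, sentence_b):
        # Represent each sentence as a bag-of-words term-frequency vector.
        a = Counter(sentence_a.lower().split())
        b = Counter(sentence_b.lower().split())
        # Dot product over the terms the two sentences share.
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    print(cosine_similarity("the cat sat on the mat",
                            "a cat sat on a mat"))

Fast per call, but 4.84 × 10^14 calls is still 4.84 × 10^14 calls.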

I have now started to search for a solution to, or a way to reduce, this problem, but so far I have found none.. wish me luck, or else we will spend most of our upcoming days staring at a system process that runs and runs and runs... but most likely won't be done for a while..

Monday, May 10, 2010

Sentence classification

I got an assignment from J to dig into the world of textual sentences from a stylometric point of view and try to fill in the blank in the statement: a sentence is _.

So a sentence is (a small sketch turning a few of these properties into features follows the list):

  1. short
  2. long
  3. simple
  4. complex
  5. correct, grammatically well formed
  6. common
  7. factual
  8. journalistic
  9. legal
  10. scientific
  11. temporal: past, now, future
  12. narrative: first/second/third person, genus
  13. non-alpha: numerical, symbols
  14. part of a language: English, German, Spanish
  15. in matrix form
  16. compound
  17. made up of difficult words
  18. declarative
  19. imperative
  20. interrogative
  21. exclamatory
  22. conditional
  23. regular
  24. irregular
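
To make this a bit more concrete, here is a rough sketch of how a few of the properties above could be turned into features; the heuristics are my own naive simplifications, not a settled design:

    import re

    def sentence_features(sentence):
        # Naive versions of a few of the properties listed above.
        tokens = sentence.split()
        return {
            "length": len(tokens),                            # short vs. long
            "non_alpha_ratio": sum(not t.isalpha() for t in tokens)
                               / float(max(len(tokens), 1)),  # numerical, symbols
            "interrogative": sentence.rstrip().endswith("?"),
            "exclamatory": sentence.rstrip().endswith("!"),
            "first_person": bool(re.search(r"\b(i|we|my|our)\b",
                                           sentence, re.IGNORECASE)),  # narrative person
        }

    print(sentence_features("We computed 484 000 000 000 000 similarities!"))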

Status update

So, it has been a while since my last post, and that is mainly because we have run into some problems in the project.

First we had a problem with detecting plagiarism on a fine grained level. Our models allowed us to decide whether or not a document contained plagiarized passages, but we needed to be able to detect plagiarism at the character level. After some thinking we decided it might be OK to give up some granularity, so we will now try to detect plagiarism at the sentence level. We hope that this level of granularity will provide enough detail, and we claim that someone who plagiarizes will mostly do so in full sentences. Let us hope that the upcoming experiments will prove this to be correct.. :)

Since we decided to change the level of granularity I had to update the tagging in the training data so that we can learn at the sentence level instead of the previous character level. In doing so I ran into some difficulties, but I hope that I have gotten past them now.. although I expect that there might still be a bug somewhere, because I saw some strange behavior when testing.
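
For the record, the re-tagging boils down to something like this sketch (the annotation format here, a list of (offset, length) pairs per document, is my simplification of the real PAN XML annotations):

    def sentence_spans(text, sentences):
        # Yield (start, end) character spans, assuming the sentences
        # appear in order and verbatim in the original text.
        pos = 0
        for s in sentences:
            start = text.index(s, pos)
            end = start + len(s)
            pos = end
            yield start, end

    def label_sentences(text, sentences, plagiarized_spans):
        # A sentence is labelled plagiarized if it overlaps any annotated span.
        labels = []
        for start, end in sentence_spans(text, sentences):
            hit = any(start < off + length and off < end
                      for off, length in plagiarized_spans)
            labels.append(hit)
        return labels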

I would say that we are now past the Design Phase of the project and in the Implementation Phase (at least I am.. :) ). So I expect that there will be a lot more problems or difficulties ahead. All in all there will be some exciting weeks coming up, and on the 1st of June we have to be done with the implementation, so cross your fingers that everything will proceed in the best possible way...

Until then..

Wednesday, April 7, 2010

Moving to the Design Phase

Today I begin the Design Phase of the project. So, for now, I will leave the reading up on the subject of Plagiarism Detection behind me and start trying to apply my recently acquired knowledge to something useful.

The Design Phase will consist of a lot of decision making. I have already decided that the automatic plagiarism detection system will be implemented in Python. But how should it be implemented? What should the implementation process look like? What Integrated Development Environment (IDE) should be used? How should the implementation be built and tested? What name should it have? etc..

To help my decision making process I will sketch a lot and try out different ideas. But before that can commence I have realised something.. I need to read some more.. This time it will focus more on the general area of Natural Language Processing (NLP) and Python. The next text I will lay my eyes on is the Style Guide for Python Code (PEP 8).

Thursday, April 1, 2010

A strategy for detecting plagiarism

We have decided on a strategy for how to automatically detect plagiarism. It will be a hybrid of techniques from nearby research areas.

Our aim is to catch plagiarism both semantically and stylistically. We have a nice word space model that will be used to capture semantic features of the text, and for style recognition we will use techniques from the authorship identification research field.

I will implement two baseline algorithms to measure our results against. The first one will be a really naïve one and will act as a lower bound that we should never get close to. The second will represent the state of the art in plagiarism detection that we will strive to surpass, namely ENCOPLOT, the winner of the 1st International Competition on Plagiarism Detection.
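
To give an idea of what the naïve baseline could look like (this is my own placeholder choice, not a final decision), one option is to score a suspicious document by how many exact word 5-grams it shares with a source document:

    def word_ngrams(text, n=5):
        words = text.lower().split()
        return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def ngram_overlap(suspicious_text, source_text, n=5):
        # Fraction of the suspicious document's n-grams that also occur
        # verbatim in the source document.
        susp = word_ngrams(suspicious_text, n)
        src = word_ngrams(source_text, n)
        if not susp:
            return 0.0
        return len(susp & src) / float(len(susp))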

A definition of Plagiarism

I hereby (and hereafter) define plagiarism as:


source : {(exact copy | (word, phrase) (addition | deletion | substitution) | phrase (split | merge | reorder)) & uncredited} -> plagiarism

Friday, March 26, 2010

Inspired by Statistics

I spent this morning reading up a bit on statistical analysis. It all started when I read the blog posts Statistics For Programmers, Programmers Need To Learn Statistics Or I Will Kill Them All, and Measuring Measures. Next thing I knew, I found myself trying out the R programming language..

Well.. I still think it was quite a fun and useful morning.. :)

Thursday, March 25, 2010

Weekly meeting

Today J, M, and I had a meeting about the progress of the project.

Right now we are in the knowledge-acquisition phase. That is, I will read up on the current research and its various implementations. Next week we will move on to the design phase.

To conclude the knowledge phase I will present the current status of the automatic plagiarism detection field so that we get an overview. This will be achieved by constructing a box diagram where every box represents an idea, algorithm, concept, technique, etc. that has previously been used and tested for plagiarism detection. This box diagram will then be used in the design phase to help us decide how we should solve the problem. I will focus on the boxes that have to do with the actual classification of whether or not a text sequence is plagiarism, but there will be some boxes concerning preprocessing (like information retrieval) and postprocessing too.

I will also put some time aside to get to know the data a little better. PAN has provided us with a large training corpus that consists of original and suspicious documents. Some of the suspicious documents contain plagiarized text sequences that are marked up. I will try out the machine learning frameworks Weka and NLTK and learn how to use them to classify documents in different ways.
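
A first toy run with NLTK might look like the sketch below (the documents and labels are placeholders, not the real PAN corpus):

    import nltk

    def bag_of_words(text):
        # The simplest possible feature set: which words occur in the text.
        return dict((word, True) for word in text.lower().split())

    train = [
        (bag_of_words("this passage was copied word for word"), "plagiarized"),
        (bag_of_words("an original discussion written from scratch"), "original"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(bag_of_words("a passage copied word for word")))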

We decided to change the weekly meeting to Thursdays instead of Wednesdays so I can attend a machine learning course without missing half the lectures.

Wednesday, March 24, 2010

List of terms and concepts

(character, word) n-gram, skipgram, word space model (WSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), string kernel, external plagiarism, intrinsic plagiarism, (dis)similarity measure, recall, precision, granularity, subsequence, synonyms, longest common subsequence, author style, genre, morpheme, suffix array, suffix tree, obfuscation, permutation, antonym, index, inverted index, crowd-sourcing, coding, hash function, stemming, Jaccard coefficient, Kullback–Leibler divergence, Levenshtein distance, stylometry, (character, word) frequency, cosine distance, sliding window, bag of (characters, words), outlier detection, space partitioning, kd-tree, metric tree, curse of dimensionality, dimensionality reduction, Principal Component Analysis (PCA), Isomap, locality sensitive hashing (LSH), authorship identification, stop word, cluster pruning, part of speech tag (POS or POST), Penn Treebank part of speech tag set, Snowball stemming algorithms, hyponym, hypernym, Kolmogorov Complexity measure, Lempel-Ziv compression, cohesion word, readability test, compression, Support Vector Machine (SVM), Artificial Neural Net, boosting, mean average precision (MAP), clipping, synset, arg max, kappa statistics, token, corpus, tf-idf, context-free grammar (CFG), (semantic, syntactic) class, Levin's verb classes, decision tree (DT), quasi-Newton method, sentence-to-sentence similarity, word correlation factor, n-gram phrase correlation, dot plot, Fuzzy fingerprint, text chunk, text statistics, closed word class, Zipf's law, vocabulary richness, shingle, near duplicate detection, average word frequency class, text complexity, understandability, readability, Context Dependent Thinning (CDT), Random Permutation (RP)

Dealing with new terminology

When you first come into contact with a new research field you are almost always held back by the field-specific terminology. In order to understand information about the field you first have to understand the meaning of the terms and how they relate. I have gained some knowledge of the terms used in the fields I am approaching (that is, Information Theory, Information Retrieval, Computational Linguistics, and Machine Learning), but as I learn more terms it seems like I come into contact with even more terms that I do not understand... This is both fun and frustrating, and therefore I will start a list of unknown terms that I hope to have gained an understanding of by the end of this project.

Project specification draft

Finally finished a second draft of the project specification, describing the problem to be studied and how to proceed in solving it.

Friday, March 19, 2010

Blog post about Plagiarism Detection and Python

http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/

Sweet toolkit

I just stumbled upon the following toolkit http://www.nltk.org/ which I believe will be very useful since it can be used to run Weka algorithms through Python.

Wednesday, March 17, 2010

The infamous project Gantt chart

First day at the office

Today was my first day at the office.

I met up with J at 10 a.m. and then we resolved all my administrative needs. It was nice to see how smoothly these things can work in a smaller organisation.

Next, J, M, and I met up at Salongen and had a long and thorough discussion about my upcoming work, but more about that in the next post. It was quite productive, and we decided to make it a recurring thing: hereafter we will have a weekly meeting on Wednesdays at 10 a.m. where we talk about the project's progress. At the end O joined in on our meeting.

After the meeting there was time for lunch.

After lunch there was a little free time and I started working on the Gantt chart that will be needed in the project specification.

At 1 p.m. I got the chance to meet most of the UserWare lab since they held their weekly meeting. They have a lot going on and most of them only work part time at SICS.

Then I went back to working on the Gantt chart some more and also created this blog.

It's fun to have started working and to see that some things are getting clearer. Overall a nice day at the office.

Monday, March 8, 2010

Project rationale

I should probably say something about why we will perform this plagiarism detection project.

First off, about me. I am a 27 year old student who is about to finish his Master of Science degree in Computer Science and Engineering at the Royal Institute of Technology (KTH). The plagiarism detection project is also my Master's Thesis project, so my incentives to finish the project are quite high... :)

J and M are researchers at SICS who work on computational linguistics tasks, like finding different styles and genres in the way we write texts, or finding measures of how alike words and sentences are. This plagiarism detection problem belongs to their field of work, and therefore they are interested in it.

There is a yearly conference organised by the Cross-Language Evaluation Forum (CLEF), this year named CLEF2010. A part of that conference is the PAN2010 labs, and one of these labs is especially interesting for us, namely the lab that deals with plagiarism detection. This is a call for participation put out by the research community, and therefore we will participate.

Monday, March 1, 2010

Initial project ideas

I will do my Master's Thesis project at the Swedish Institute of Computer Science (SICS), and there we will try to solve the mysteries of Plagiarism Detection. This means that I will have to dig into the fields of Computational Linguistics, Information Retrieval and Machine Learning.. yay! :D

J and M at SICS have a lot of ideas about how to detect plagiarism, and these ideas might be kind of unique. The ideas deal with finding the meaning of a text and other semantic patterns, which might be something that has not been done much in plagiarism detection. From what I have heard, today's methods mostly rely on statistics over the words used in the text rather than the actual meaning of the text.

We have decided upon four different linguistic hypotheses that will be used when detecting plagiarism. But in order to describe them we need some notation: s_j denotes the sentence at index j in a textual document, w_i the word at index i in a sentence s, o_i is a synonym of the word w_i, and w_i and x_i are both words but not necessarily the same word or even synonyms.
Then two parts of the same text are considered the same if (a toy code check for the first three follows the list):
1. (Equality) s1 = w1 + w2 + ... + wn and s2 = w1 + w2 + ... + wn
2. (Synonyms) s1 = w1 + w2 + ... + wn and s2 = w1 + o2 + ... + wn
3. (Permutation) s1 = w1 + w2 + ... + wn and s2 = w2 + w1 + ... + wn
4. (Topicality) s1 = w1 + w2 + ... + wn and s2 = x1 + x2 + ... + xn, but s1 and s2 have the same semantic meaning.
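
A toy check of the first three hypotheses (Topicality needs the word space model, so it is left out, and the synonym table is a stand-in, not a real thesaurus):

    SYNONYMS = {"quick": set(["fast", "rapid"]), "big": set(["large", "huge"])}

    def equality(s1, s2):
        return s1 == s2

    def synonyms(s1, s2):
        # Same length and every word pair is equal or a listed synonym.
        if len(s1) != len(s2):
            return False
        return all(a == b or b in SYNONYMS.get(a, set()) for a, b in zip(s1, s2))

    def permutation(s1, s2):
        return sorted(s1) == sorted(s2)

    s1 = "the quick brown fox".split()
    print(equality(s1, "the quick brown fox".split()))      # hypothesis 1
    print(synonyms(s1, "the fast brown fox".split()))       # hypothesis 2
    print(permutation(s1, "quick the brown fox".split()))   # hypothesis 3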

From these hypotheses we hope to revolutionise the world of plagiarism detection! Let's hope it works...