Friday, March 26, 2010

Inspired by Statistics

I spent this morning reading up a bit on statistical analysis. It all started when I read the blog posts Statistics For Programmers, Programmers Need To Learn Statistics Or I Will Kill Them All, and Measuring Measures. Next thing I knew I found myself trying out the R programming language.

Well... I still think it was quite a fun and useful morning. :)

Thursday, March 25, 2010

Weekly meeting

Today J, M, and I had a meeting about the progress of the project.

Right now we are in the knowledge acquisition phase, that is, I will read up on the current research and its various implementations. Next week we will move on to the design phase.

To conclude the knowledge phase I will present the current status of the automatic plagiarism detection field so we get an overview. This will be achieved by constructing a box diagram where every box represents an idea, algorithm, concept, technique, etc. that has previously been used and tested for plagiarism detection. This box diagram will then be used in the design phase to help us decide how we should solve the problem. I will focus on the boxes that have to do with the actual classification of whether or not a text sequence is plagiarized, but there will be some boxes concerning preprocessing (like information retrieval) and postprocessing too.

I will also put some time aside to get to know the data a little bit more. PAN has provided us with a large training corpus that consists of original and suspicious documents. Some of the suspicious documents contain plagiarized text sequences that are marked up. I will try out the machine learning frameworks Weka and NLTK and learn how to use them to classify documents in different ways.
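Even before firing up Weka or NLTK, splitting a suspicious document into overlapping chunks for comparison can be done in plain Python. A rough sketch of a sliding window over whitespace tokens (the function name and the window/step sizes are arbitrary choices of mine, not anything prescribed by PAN):

```python
def sliding_windows(text, size=50, step=25):
    """Yield overlapping chunks of `size` tokens, advancing `step` tokens at a time."""
    tokens = text.split()
    for start in range(0, max(len(tokens) - size + 1, 1), step):
        yield " ".join(tokens[start:start + size])

# Each chunk could then be compared against the original documents.
for chunk in sliding_windows("one two three four five six", size=3, step=2):
    print(chunk)
```

With a 50% overlap between consecutive windows, a plagiarized passage is unlikely to be split evenly across two chunk boundaries, which is the usual reason for overlapping rather than disjoint chunks.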

We decided to change the weekly meeting to Thursdays instead of Wednesdays so I can attend a machine learning course without missing half the lectures.

Wednesday, March 24, 2010

List of terms and concepts

(character, word) n-gram, skipgram, word space model (WSM), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), string kernel, external plagiarism, intrinsic plagiarism, (dis)similarity measure, recall, precision, granularity, subsequence, synonyms, longest common subsequence, author style, genre, morpheme, suffix array, suffix tree, obfuscation, permutation, antonym, index, inverted index, crowd-sourcing, coding, hash function, stemming, Jaccard coefficient, Kullback–Leibler divergence, Levenshtein distance, stylometry, (character, word) frequency, cosine distance, sliding window, bag of (characters, words), outlier detection, space partitioning, kd-tree, metric tree, curse of dimensionality, dimensionality reduction, Principal Component Analysis (PCA), Isomap, locality sensitive hashing (LSH), authorship identification, stop word, cluster pruning, part of speech tag (POS or POST), Penn Treebank part of speech tag set, Snowball stemming algorithms, hyponym, hypernym, Kolmogorov Complexity measure, Lempel-Ziv compression, cohesion word, readability test, compression, Support Vector Machine (SVM), Artificial Neural Net, boosting, mean average precision (MAP), clipping, synset, arg max, kappa statistic, token, corpus, tf-idf, context-free grammar (CFG), (semantic, syntactic) class, Levin's verb classes, decision tree (DT), quasi-Newton method, sentence-to-sentence similarity, word correlation factor, n-gram phrase correlation, dot plot, Fuzzy fingerprint, text chunk, text statistics, closed word class, Zipf's law, vocabulary richness, shingle, near duplicate detection, average word frequency class, text complexity, understandability, readability, Context Dependent Thinning (CDT), Random Permutation (RP)
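Two of these terms, character n-grams and the Jaccard coefficient, already combine into a crude text similarity measure. A minimal sketch in plain Python (the function names and the choice of n=3 are just mine, for illustration):

```python
def ngrams(text, n=3):
    """The set of all overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard coefficient |A & B| / |A | B| over character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Identical strings score 1.0, disjoint strings 0.0, similar strings in between.
print(jaccard("the quick brown fox", "the quick brown dog"))
```

The same coefficient works over word n-grams (shingles) instead of character n-grams, which is the usual setup in near duplicate detection.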

Dealing with new terminology

When you first come into contact with a new research field you are almost always restrained by its field-specific terminology. In order to understand information about the field you first have to understand the meaning of the terms and how they relate. I have gained some knowledge of the terms used in the fields I am approaching (that is, Information Theory, Information Retrieval, Computational Linguistics, and Machine Learning), but as I learn more terms it seems like I come into contact with even more terms that I do not understand... This is both fun and frustrating, so I will start a list of unknown terms that I hope to have gained an understanding of by the end of this project.

Project specification draft

Finally finished a second draft of the project specification, describing the problem to be studied and how to proceed in solving it.

Friday, March 19, 2010

Blog post about Plagiarism Detection and Python

http://doingdigitalhistory.wordpress.com/2010/02/06/diy-plagiarism-detection/

Sweet toolkit

I just stumbled upon the following toolkit http://www.nltk.org/ which I believe will be very useful since it can be used to run Weka algorithms through Python.

Wednesday, March 17, 2010

The infamous project Gantt chart

First day at the office

Today was my first day at the office.

I met up with J at 10 a.m. and then we resolved all my administrative needs. It was nice to see how smoothly these things can work in a smaller organisation.

Next, J, M, and I met up at Salongen and had a long and thorough discussion about my upcoming work (more about this in the next post). It was quite productive and we decided to make it a recurring thing: hereafter we will have a weekly meeting on Wednesdays at 10 a.m. where we talk about the project's progress. Towards the end O joined in on our meeting.

After the meeting there was time for lunch.

After lunch there was a little free time and I started working on the Gantt chart that will be needed in the project specification.

At 1 p.m. I got the chance to meet most of the UserWare lab since they held their weekly meeting. They have a lot going on and most of them are only working part time at SICS.

Then I got back to work some more on the Gantt chart and also to create this blog.

It's fun to have started working and to see that some things are getting clearer. Overall a nice day at the office.

Monday, March 8, 2010

Project rationale

I should probably say something about why we will perform this plagiarism detection project.

First off, about me. I am a 27-year-old student about to finish my Master of Science degree in Computer Science and Engineering at the Royal Institute of Technology (KTH). The plagiarism detection project is also my Master's Thesis project, so my incentives to finish it are quite high... :)

J and M are researchers at SICS who work on computational linguistics tasks, like identifying different styles and genres in written text, or finding measures of how alike words and sentences are. This plagiarism detection problem thus belongs to their field of work, which is why they are interested in it.

There is a yearly conference organised by the Cross-Language Evaluation Forum (CLEF), this year named CLEF2010. Part of that conference is the PAN2010 labs, and one of these labs is especially interesting for us, namely the one dealing with plagiarism detection. It is a call for participation put out by the research community, and therefore we will participate.

Monday, March 1, 2010

Initial project ideas

I will do my Master's Thesis project at the Swedish Institute of Computer Science (SICS), and there we will try to solve the mysteries of Plagiarism Detection. This means that I will have to dig into the fields of Computational Linguistics, Information Retrieval, and Machine Learning... yay! :D

J and M at SICS have a lot of ideas about how to detect plagiarism, and these ideas might be fairly unique. They deal with finding the meaning of a text and other semantic patterns, which might be something that has not been done much in plagiarism detection. From what I have heard, today's methods mostly use statistical measures based on the words used in the text rather than the actual meaning of the text.

We have decided upon four different linguistic hypotheses that will be used when detecting plagiarism. In order to describe them we need some notation: sj denotes the sentence at index j in a textual document, wi the word at index i in a sentence s, oi is a synonym of the word wi, and wi and xi are both words but not necessarily the same word or even synonyms.
Then two parts of the same text are considered the same if:
1. (Equality) s1 = w1 + w2 + ... + wn and s2 = w1 + w2 + ... + wn
2. (Synonyms) s1 = w1 + w2 + ... + wn and s2 = w1 + o2 + ... + wn
3. (Permutation) s1 = w1 + w2 + ... + wn and s2 = w2 + w1 + ... + wn
4. (Topicality) s1 = w1 + w2 + ... + wn and s2 = x1 + x2 + ... + xn, but s1 and s2 have the same semantic meaning.
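A toy illustration of how the first three relations could be checked for two tokenized sentences. The synonym lexicon here is a stand-in dictionary I made up for the example; a real system would use something like WordNet synsets:

```python
# Hypothetical toy lexicon of synonym pairs -- a real system would query WordNet.
SYNONYMS = {("big", "large"), ("quick", "fast")}

def are_synonyms(w1, w2):
    return w1 == w2 or (w1, w2) in SYNONYMS or (w2, w1) in SYNONYMS

def equality(s1, s2):
    """Hypothesis 1: identical word sequences."""
    return s1 == s2

def synonyms(s1, s2):
    """Hypothesis 2: same length, each word pair equal or synonymous."""
    return len(s1) == len(s2) and all(are_synonyms(a, b) for a, b in zip(s1, s2))

def permutation(s1, s2):
    """Hypothesis 3: same multiset of words in any order."""
    return sorted(s1) == sorted(s2)

s1 = "the quick fox".split()
print(synonyms(s1, "the fast fox".split()))      # → True
print(permutation(s1, "fox the quick".split()))  # → True
```

Hypothesis 4 (topicality) is the hard one: it cannot be decided by token-level comparison at all and needs some semantic model of the sentences, which is presumably where the LSA/LDA-style techniques come in.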

From these hypotheses we hope to revolutionise the world of plagiarism detection! Let's hope it works...