Computer Science Clay · 14-06-2009, 01:41 AM

Seminar Report
On
PLAGIARISM DETECTION TECHNIQUES
Submitted by
JAYA P A
In the partial fulfillment of requirements in degree of Master of Technology
(M Tech)
In
Computer and Information Science
DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
KOCHI 682022
2007

ACKNOWLEDGEMENT
I would like to express my sincere thanks to Lord Almighty without whose blessings I
would not have completed my seminars. I would like to thank all those who have
contributed to the completion of the seminar and helped me with valuable suggestions for
improvement.
I am extremely grateful to Prof. Dr. K Poulose Jacob, Director, Dept. of Computer
Science, for providing me with the best facilities and atmosphere for creative work,
guidance and encouragement.
I would like to thank my coordinator, G. Santhosh Kumar, Lecturer, Dept. of Computer
Science, CUSAT, for all the help and support extended to me. I thank all staff members of my
college and friends for extending their cooperation during my seminars.
Above all I would like to thank my parents, without whose blessings I would not have
been able to accomplish my goal.

ABSTRACT
This paper gives an overview of plagiarism and the detection techniques used. Plagiarism
in the sense of "theft of intellectual property" has been around for as long as humans have
produced work of art and research. However, easy access to the Web, large databases,
and telecommunication in general, has turned plagiarism into a serious problem for
publishers, researchers and educational institutions. More and more people are beginning to
realize that plagiarism is a moral phenomenon that can't exist in a society with high ethical
standards. Nowadays many methods to fight plagiarism are being developed and used.
In this paper, I concentrate on plagiarism detection methods and the features of these
methods. After that, an analysis of already developed tools is presented.
Key words: Plagiarism, plagiarism prevention, plagiarism detection, similarity measures

CONTENTS
1. Introduction
2. Describing plagiarism
3. Methods to reduce plagiarism
4. Prevention methods
5. Detection methods
5.1 Document Source Comparison
5.2 Manual Search of Characteristic Phrases
5.3 Quiz method
6. Available tools
6.1 Attributes of detection tools
6.2 Turnitin
6.3 Glatt
6.4 JPlag
6.5 WCopyfind
7. Limitations of detection tools
8. Conclusion
9. References

Plagiarism Detection Techniques
Department of Computer Science, CUSAT
1
1. INTRODUCTION
Plagiarism is a significant problem on almost every college and university campus. The
problems of plagiarism go beyond the campus, and have become an issue in industry,
journalism, and government activities. Although plagiarism has been a problem for
centuries, the Internet and Copy/Paste operation makes plagiarism very easy and
attractive for students in the twenty-first century.
In order to detect plagiarism, comparisons must be made between a target document (the
suspect) and reference documents. A second method is an expansion of the document
check in which the set of reference documents is 'everything' that is reachable on the
internet and the candidate to be checked for is a characteristic paragraph or sentence
rather than the entire document. The emergence of tools such as Google has made this
type of check feasible.
The remainder of this paper is organized as follows. The next section gives some ideas
about plagiarism methods and then about plagiarism reduction. Then different methods
for plagiarism detection are described. After that, an analysis of already developed tools is
presented. Finally, some conclusions are given.
2. DESCRIBING PLAGIARISM
Plagiarism can be described as:
• turning in someone else's work as your own
• copying words or ideas from someone else without giving credit
• failing to put a quotation in quotation marks
• giving incorrect information about the source of a quotation
• changing words but copying the sentence structure of a source without giving credit
• copying so many words or ideas from a source that it makes up the majority of
your work, whether you give credit or not

Plagiarism is derived from the Latin word plagiarius, which means kidnapper. It is
defined as the passing off of another person's work as if it were one's own, by claiming
credit for something that was actually done by someone else. Plagiarism is not always
intentional, nor does it always involve stealing from someone else; it can be unintentional or
accidental, and may consist of self-stealing. The broader categories of plagiarism
include:
• Accidental: due to a lack of knowledge about plagiarism and of understanding of the
citation or referencing style practiced at an institute.
• Unintentional: the vastness of available information influences thoughts, and the same
ideas may come out via spoken or written expressions as one's own.
• Intentional: a deliberate act of copying all or part of someone else's work
without giving proper credit to the original creator.
• Self-plagiarism: using one's own published work in some other form without referring to
the original.
In practice there are different plagiarism methods. Some of them include:
• Copy-paste plagiarism (copying textual information word for word);
• Paraphrasing (restating the same content in different words);
• Translated plagiarism (translating content and using it without reference to the original work);
• Artistic plagiarism (presenting the same work using different media: text, images etc.);
• Idea plagiarism (using similar ideas which are not common knowledge);
• Code plagiarism (using program code without permission or reference);
• No proper use of quotation marks (failing to identify the exact parts of borrowed content);
• Misinformation of references (adding references to incorrect or non-existent sources).

3. METHODS TO REDUCE PLAGIARISM
Nowadays many methods to fight against plagiarism are developed and used. These
methods can be divided into two classes:
(1) Methods for plagiarism prevention, and
(2) Methods for plagiarism detection.
If we consider plagiarism a kind of social illness, then methods of the first class are
precautionary measures that aim to prevent the illness from arising, while
methods of the second class are cures that aim to treat an existing illness. Some
examples of methods in each class are as follows: for plagiarism prevention, honesty
policies and/or punishment systems; for plagiarism detection, software tools that reveal
plagiarism automatically.
Each method has a set of attributes that determine its application. Two main attributes
which are common to all methods are:
1) Work intensity of the method's implementation;
2) Duration of the method's efficiency.
Work intensity of a method's implementation means the amount of resources (mainly time)
needed to develop the method and bring it into use. Plagiarism prevention
methods are usually time-consuming to realize, while plagiarism detection
methods require less time.
Duration of a method's efficiency means the period of time during which the positive effect of
the method's realization lasts. Implementation of prevention methods gives a long-term
positive effect. In contrast, implementation of detection methods gives a short-term
positive effect. The methods differ in the duration of their positive effect because of the opposite
approaches they use to fight plagiarism: detection methods are based on
intimidating society, while prevention methods rely more on changing society's
attitude towards plagiarism.
                              | Attributes of method
Method                        | Implementation work intensity  | Duration of positive effect
------------------------------+--------------------------------+------------------------------------------------------
Plagiarism prevention methods | Require more time to implement | Positive effect isn't momentary, but it is long-term
Plagiarism detection methods  | Require less time to implement | Positive effect is momentary, but it is short-term
Table 3.1: Attributes of plagiarism detection and prevention methods
Despite the differences between prevention and detection methods, all of them are used to
achieve one common goal: to fight plagiarism. To make this fight efficient, a
systematic approach to the plagiarism problem is needed, i.e. plagiarism prevention and
detection methods must be combined. To achieve momentary, short-term
positive results, plagiarism detection methods must be applied at the problem's initial stages,
but to achieve positive results over the long term, plagiarism prevention methods must
be put into action. Plagiarism detection methods can only minimize plagiarism, whereas
plagiarism prevention methods can fully eliminate plagiarism or at least decrease it to a
great extent. That is why plagiarism prevention methods are without doubt the
more significant measures against plagiarism. Unfortunately, plagiarism
prevention is a problem for society as a whole, i.e. it is at least a nationwide problem
which cannot be solved by the efforts of one university or its department.

4. PREVENTION METHODS
Honesty Policies: Although plagiarism is reasonably well defined and explained in many
forums, the penalty for detected cases varies from case to case and institution to
institution. Many universities have well-defined policies to classify and deal with
academic misconduct. Rules and information regarding it are made available to students
during the enrolment process, via information brochures and the university web sites.
Academic dishonesty can be dealt with at teacher-student level or institute-student
level. The penalties that can be imposed by teachers include written or verbal warnings,
failing or lower grades and extra assignments. Institutional case handling involves a
hearing and investigation by an appropriate committee, with the accused aware of and part
of the whole process. Institutional level punishments may include official censure,
academic integrity training exercises, social work, transcript notation, suspension,
expulsion, revocation of degree or certificate and possibly even referral of the case to
legal authorities.
5. PLAGIARISM DETECTION METHODS
5.1 DOCUMENT SOURCE COMPARISON
Plagiarism detection usually is based on comparison of two or more documents. A
collection of submitted work is known as a corpus. Where the source and copy
documents are both within a corpus this is known as intra-corpal plagiarism, or
occasionally as collusion. Where the copy is inside the corpus and the source outside, for
instance in a textbook, a submission from a student who took the assessment in a
previous session, or on the Web, this is known as extra-corpal plagiarism.
This approach can be further divided into two categories: one operates locally on the
client computer and analyses local databases of documents or performs internet
searches; the other is server-based technology where the user uploads the document and
the detection processes take place remotely. The most commonly used techniques in
current document source comparison involve word stemming or fingerprinting. The core
fingerprinting idea has been modified and enhanced by various researchers to improve
similarity detection. Many current commercial plagiarism detection service providers
claim to have proprietary fingerprinting and comparison mechanisms.
Figure 5.1.1: A generic structure of document source comparison based plagiarism
detection system.
In order to compare two or more documents and to reason about the degree of similarity
between them, it is necessary to assign a numeric value, a so-called similarity score, to each
document. This score can be based on different metrics. There are many parameters and
aspects of a document which can be used as metrics.
5.1.1 CLASSIFICATIONS OF METRICS
I) NUMBER OF SUBMISSIONS PROCESSED BY THE METRICS USED
The first classification is based on the number of documents involved in the metric
calculation process. The methods that detection engines use to find similarity are of
most academic interest. It is suggested that metrics can be differentiated from one another
based on the number of documents that are processed together to generate them, a set of
classifications not previously found in the literature. Studying technical descriptions of
detection engines in terms of the documents they process, two main types of metrics have
been identified; it is proposed to name these singular metrics and paired metrics. These
operate on one and two documents at a time respectively to generate a numeric value.
Paired metrics are intended to give more information than could be gleaned by simply
computing and combining two singular metrics.
For completeness corpal metrics and multi-dimensional metrics are also defined here,
each of which operates simultaneously on a greater number of documents. A corpal
metric operates on an entire corpus at a time, for instance to find some general property
of it. One use of this might be to compare the standard of work from different tutor
groups. A multi-dimensional metric operates on a chosen number of submissions, so a
singular metric would be 1-dimensional, a paired metric 2-dimensional and a corpal
metric n-dimensional where n is the size of the corpus. Multi-dimensional metrics might
be useful for finding clusters of similar submissions.
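Metric type       | Source Code                             | Free Text
------------------+-----------------------------------------+------------------------------------------
Singular          | Mean number of characters per line.     | Mean number of words per sentence.
Metrics           | Proportion of 'while' loops to 'for'    | Proportion of use of 'their' compared
                  | loops.                                  | to 'there'.
------------------+-----------------------------------------+------------------------------------------
Paired            | Number of keywords common to two        | Number of capitalised words common to
Metrics           | source code submissions.                | two free text submissions.
                  | The length of the longest tokenisation  | The length of the longest substring
                  | substring common to both.               | common to both.
------------------+-----------------------------------------+------------------------------------------
Multi-Dimensional | The proportions of keywords common      | The proportion of words from a chosen
Metrics           | to a set of submissions.                | group common to a set of submissions.
------------------+-----------------------------------------+------------------------------------------
Corpal            | The proportion of source code           | The proportion of submissions using
Metrics           | submissions using the keyword 'while'.  | the word 'hence'.
TABLE 5.1 - Examples of Dimensional Metrics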

Table 5.1 contains some examples of possible metrics that fall under each classification.
Examples are given for both source code and free text although some might prove to be
inappropriate for detection.
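The difference between a singular and a paired metric can be illustrated with two of the free-text examples from Table 5.1. The following is only a minimal sketch in Python: the function names, the naive sentence splitting and the regular expression for capitalised words are assumptions made for illustration, not part of any detection engine described here.

```python
import re


def mean_words_per_sentence(text):
    """Singular metric: computed from one document on its own."""
    # Naive sentence split on ., ! and ? (an assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def common_capitalised_words(text_a, text_b):
    """Paired metric: computed from two documents taken together."""
    pattern = r"\b[A-Z][a-z]+\b"
    caps_a = set(re.findall(pattern, text_a))
    caps_b = set(re.findall(pattern, text_b))
    return len(caps_a & caps_b)


doc_a = "Plagiarism is old. The Internet makes copying easy."
doc_b = "The Internet changed everything. Copying is now trivial."

print(mean_words_per_sentence(doc_a))          # one document in, one number out
print(common_capitalised_words(doc_a, doc_b))  # two documents in, one number out
```

Two singular values can be compared after the fact, but only the paired metric looks at which words the two documents actually share.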
II) COMPLEXITY OF THE METRICS USED
The second classification is based on the computational complexity of the methods employed to
find similarities. These groups have been named superficial metrics and structural
metrics. A superficial metric is a measure of similarity that can be gauged simply by
looking at one or more student submissions. No knowledge of the structure of a
programming language or the linguistic features of natural language is necessary. A
structural metric is a measure of similarity that requires knowledge of the structure of one
or more documents. For source code submissions this might involve a parse of the
submissions. For free text submissions this could involve reducing words to their
linguistic root form.
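Metric type | Source Code                        | Free Text
------------+------------------------------------+-----------------------------------
Superficial | The count of the reserved          | The number of runs of five words
Metrics     | keyword 'while'.                   | common to two submissions.
------------+------------------------------------+-----------------------------------
Structural  | The number of operational          | The size of the parse tree for a
Metrics     | paths through a program.           | submission.
TABLE 5.2 - Examples of Operational Complexity Metrics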
Although these categories are fully inclusive and mutually exclusive, it is impossible to
give a definition that can be consistently applied in every case. The borderline between
where a superficial metric stops and a structural metric begins is necessarily a fuzzy one.
For instance, if a submission is tokenized and a superficial metric is then applied, the whole
process could instead be thought of as just a structural metric, since tokenization is a
structure-dependent process. Hence in some cases these definitions are open to individual
interpretation.
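The superficial free-text metric from Table 5.2, the number of runs of five words common to two submissions, needs no linguistic knowledge and can be sketched in a few lines. This is an illustrative Python sketch only; the function names and the choice to lower-case and split on whitespace are assumptions.

```python
def word_runs(text, n=5):
    """Return the set of all runs of n consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_runs(text_a, text_b, n=5):
    """Superficial paired metric: distinct n-word runs found in both texts."""
    return len(word_runs(text_a, n) & word_runs(text_b, n))


a = "the quick brown fox jumps over the lazy dog"
b = "a quick brown fox jumps over the fence"
print(shared_runs(a, b))  # 2 shared runs of five words
```

A structural variant would first reduce each word to its root form before building the runs, which is exactly where the borderline between the two categories becomes fuzzy.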

Most intra-corpal plagiarism detection engines work by comparing every submission with
every other possible submission, giving time complexity proportional to the square of the
number of submissions (known as O(n^2) for n submissions). This means that processing
time increases quadratically as the number of submissions grows. More computationally
efficient comparison methods may take less time, an issue which is important when
considering scalability, for instance when a free-text engine is being linked to a sizeable
database of possible sources.
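The growth in the number of pairwise comparisons can be checked with a short Python sketch. The helper names are assumptions for illustration; the count itself is the standard n(n-1)/2 number of unordered pairs.

```python
from itertools import combinations


def comparison_count(n):
    """Number of unordered pairs among n submissions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs(submissions):
    """Every pair an all-against-all intra-corpal engine would compare."""
    return list(combinations(submissions, 2))


for n in (10, 20, 40):
    # Doubling the corpus roughly quadruples the work.
    print(n, comparison_count(n))
```

For 10, 20 and 40 submissions this gives 45, 190 and 780 comparisons respectively.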
Another way to classify metrics is according to the main principle built into them, i.e.
whether the analysis of the documents' contents is based on semantic or statistical methods. In statistical
methods there is no need to understand the meaning of the document. A common
statistical approach is the construction of document vectors based on values describing
the document, such as the frequencies of words, compression metrics, Lancaster word pairs
and other metrics. Statistical metrics can be language-independent or language-sensitive.
A purely statistical method is the N-gram approach, where text is characterized by sequences
of N consecutive characters. Based on statistical measures, each document can be
described by so-called fingerprints: the n-grams are hashed and some of the hashes are then
selected to be fingerprints. There can also be measures which contain probabilities.
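The fingerprinting idea (hash the n-grams, then keep only some of the hashes) can be sketched as follows. This Python sketch rests on several assumptions that are not taken from any particular tool: character 5-grams over lower-cased alphanumeric text, CRC32 as the hash, the "keep hashes divisible by a modulus" selection rule, and set overlap as the comparison.

```python
import zlib


def char_ngrams(text, n=5):
    """All sequences of n consecutive characters, ignoring case and punctuation."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def fingerprint(text, n=5, modulus=4):
    """Hash every n-gram and keep only the hashes divisible by `modulus`."""
    selected = set()
    for gram in char_ngrams(text, n):
        h = zlib.crc32(gram.encode("utf-8"))
        if h % modulus == 0:  # assumed selection rule; modulus=1 keeps all
            selected.add(h)
    return selected


def resemblance(text_a, text_b, n=5, modulus=4):
    """Share of selected hashes that the two fingerprints have in common."""
    fa = fingerprint(text_a, n, modulus)
    fb = fingerprint(text_b, n, modulus)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because only a fraction of the hashes is stored, the fingerprint is much smaller than the document, at the cost of possibly missing short overlaps.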
In many cases the similarity score between two documents is calculated as the Euclidean
distance between document vectors. Under this measure the distance between identical documents is zero.
Similarity can also be calculated as the scalar product of the document vectors divided by the
product of their lengths. This is equivalent to the cosine of the angle between the two document vectors seen
from the origin. In many cases document vectors are composed of word frequencies and
word weights, which are automatically calculated for each document. Word frequency is
taken into account in the proportion function. Whether a similarity measure is symmetric or
asymmetric gives one more classification. Asymmetric similarity measures include the heavy frequency vector
and the heavy inclusion proportion model, which are derived from the cosine function and
the proportion function by combining the asymmetric similarity concept with the heavy frequency
vector. Asymmetric similarity measures can be used to search for subset copying. Statistical
methods are usually the ones implemented in tools because of their simplicity.
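Both vector measures mentioned above can be sketched on plain word-frequency vectors. This is a minimal Python illustration; the whitespace tokenisation and the absence of word weights are simplifying assumptions.

```python
import math
from collections import Counter


def word_vector(text):
    """Document vector of raw word frequencies (no weighting)."""
    return Counter(text.lower().split())


def euclidean_distance(va, vb):
    """Distance between two vectors; 0.0 for identical documents."""
    vocab = set(va) | set(vb)
    return math.sqrt(sum((va[w] - vb[w]) ** 2 for w in vocab))


def cosine_similarity(va, vb):
    """Scalar product divided by the product of the vector lengths."""
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Note the opposite conventions: the Euclidean measure is a distance (smaller means more alike), while the cosine is a similarity (values near 1 mean more alike).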

One of the best-known methods for string comparison is Running Karp-Rabin
Matching and Greedy String Tiling (RKR-GST). The algorithm is described by Wise as a
method for comparing amino acid biosequences. Despite its origins in biology, the
method has application in plagiarism detection. The RKR-GST algorithm appears to be the principal method
used in most commercial plagiarism detection software. This algorithm attempts to detect
the longest possible strings common to both documents. With the RKR-GST algorithm it is
not necessary for the matched strings to be contiguous. This is a powerful
concept because it means that matches can be detected even if some of the text is deleted,
or if additional text has been inserted. It is possible for RKR-GST to detect matches even
when portions of multiple documents have been combined to create a patchwork of
plagiarized material. The algorithm can be further enhanced by parsing the documents to
remove trivial words and tokens.
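The tiling idea can be illustrated with a simplified sketch. The Python code below implements only a plain greedy string tiling by brute-force scanning; it deliberately omits the Running Karp-Rabin hashing that makes the full algorithm fast, and the function name, the tuple format and the `min_match` parameter are assumptions made for this illustration.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Cover token lists a and b with non-overlapping common tiles.

    Returns a list of (start_in_a, start_in_b, length) tuples.
    """
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find the longest runs of unmarked, equal tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        # Turn the maximal matches into tiles, skipping any that overlap
        # a tile already placed in this round.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for off in range(k):
                marked_a[i + off] = True
                marked_b[j + off] = True
            tiles.append((i, j, k))
    return tiles


original = "a b c d e f".split()
edited = "a b c x d e f".split()   # one token inserted in the middle
print(greedy_string_tiling(original, edited))  # [(0, 0, 3), (3, 4, 3)]
```

The inserted token splits the match into two tiles, but together they still cover all six original tokens, which is the behaviour described above for inserted or deleted text.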
5.2 MANUAL SEARCH OF CHARACTERISTIC PHRASES
Using this approach the instructor or examiner selects some phrases or sentences
representing core concepts of a paper. These phrases are then searched across the internet
using single or multiple search engines. Let us explain this by means of an example.
Suppose we detect the following sentence in a student's essay:
Let us call them eAssistants. They will be not much bigger than a credit card, with a fast
processor, gigabytes of internal memory, a combination of mobile-phone, computer,
camera
Since eAssistant is an uncommon term, it makes sense to input the term into a Google
query. Indeed, if this is done the query produces:
"(Maurer H., Oliver R.) The Future of PCs and Implications on Society - Let us call them
eAssistants. They will be not much bigger than a credit card, with a fast processor,
gigabytes of internal memory, a combination of...
jucsjucs_9_4/the_future_of_pcs/Maurer_H_2.html - 34k -"

This shows, without any further tools, that the student has used part of a paper published in the
Journal of Universal Computer Science. This approach is clearly labor-intensive,
so some automation makes sense.
Another method is the cataloging of past student papers. Some institutions have
maintained vaults of student composition papers cross-indexed in several ways. A teacher
who suspected plagiarism could descend into the vault (or more likely, send a teaching
assistant) to search for a paper submitted during a previous semester. Many faculty
members detect plagiarism by observing writing styles. Sometimes a paper seems to be
too professionally written to have been prepared by a student. Another clue is a sudden
shift in writing styles. Ironically, the Copy/Paste process that makes plagiarism so easy
also betrays the crime because students forget to reformat the text into a uniform font.
All of the manual methods have serious deficiencies. It is impossible to know all of the
literature on most topics that undergraduate students are likely to be writing about. The
problem is exacerbated by the growing number of informal, unpublished papers available
over the Internet.
5.3 QUIZ METHOD
The Glatt Plagiarism Screening System involves the quiz method. The program removes
words from a student's paper and asks the student to replace the missing words. A score
is generated based on the accuracy of the student responses and the amount of time it
takes for students to complete the task.
Another method of detecting plagiarism is quizzing students about their written work. A
student who has produced their own paper should be familiar with its contents and should
be able to answer questions about it. Effective questioning can involve asking students
about ideas that were left out or rejected.

6. AVAILABLE TOOLS
Several applications and services exist to help academia detect intellectual dishonesty.
I have selected some of these tools which are currently particularly popular and describe
their main features in what follows.
6.1 ATTRIBUTES OF DETECTION TOOLS
According to analytical information available on the Web, the leader among detection tools
is Turnitin, due to its functionality. Each tool has a set of attributes that determine its
application. Two main attributes which are common to all tools are:
1) Type of text the tool operates on;
2) Type of corpus the tool operates on.
According to the type of text they operate on, tools can be divided into two
groups: tools that operate on non-structured (free) text and tools that operate on
structured (source code) text. In fact, detection tools are not limited to operating on free text
or source code. They may also be used to find similarity in spreadsheets, diagrams, scientific
experiments, music or any other non-textual corpora.
According to the type of corpus they operate on, tools can be divided into three
groups: tools that operate only intra-corpally (where the source and copy documents are
both within a corpus), tools that operate only extra-corpally (where the copy is inside the
corpus and the source outside) and tools that operate both intra- and extra-corpally.
6.2 Turnitin
This is a product from iParadigms [iParadigm 2006]. It is a web based service. A team of
researchers at UC Berkeley developed the computer programs in 1996 to monitor
plagiarism in undergraduate classes. A digital portfolio service that will provide storage
and retrieval of academic documents is a coming feature.

Detection and processing are done remotely. The user uploads the suspected document to
the system database. The system creates a complete fingerprint of the document and
stores it. Proprietary algorithms are used to query the three main sources: a
current and extensively indexed archive of the Internet with approximately 4.5 billion pages;
books and journals in the ProQuest™ database; and 10 million documents already
submitted to the Turnitin database.
Figure 6.1.1
Turnitin offers different account types: consortium, institute, department
and individual instructor. Each account type can create accounts of the types listed after it
and has management capabilities. At the instructor account level, teachers can create classes
and generate class enrolment passwords. These passwords are distributed to students,
who use them to join the class and to submit assignments. Figures 6.1.1 and 6.1.2
give an idea of the system's user interface.
The system generates the originality report within some minutes of submission. The
report contains all the matches detected and links to the original sources, with color codes
describing the intensity of plagiarism. It is, however, not a final statement of plagiarism. A
higher percentage of similarities does not necessarily mean that it actually is a case
of plagiarism; one has to interpret each identified match to deduce whether it is a false
alarm or actually needs attention.
Figure 6.1.2: Turnitin, originality report of a submission
6.3 Glatt
Glatt Plagiarism Services was founded in 1987 by Dr. Barbara S. Glatt. It uses Wilson
Taylor's (1953) cloze procedure. This method is based on writing styles and patterns. In
this approach every fifth word in a suspected document is removed and the writer is
asked to fill in the missing spaces. The number of correct responses and the answering time are
used to calculate a plagiarism probability. For example, suppose the examiner suspects that the
following paragraph is plagiarized.

The proposed framework is a very effective approach to deal with information available
to any individual. It provides precise and selected news and information with a very high
degree of convenience due to its capabilities of natural interactions with users. The
proposed user modelling and information domain ontology offers a very useful tool for
browsing the information repository, keeping the private and public aspects of
information retrieval separate. Work is underway to develop and integrate seed resource
knowledge structures forming basis of news ontology and
user models using.....
The writer is asked to take a test and fill in periodic blank spaces in the text to verify the
claim of authorship. A sample test based on the above paragraph is shown in Figure 6.3.1.
Figure 6.3.1: Glatt Plagiarism Self-Detection Program

Fig 6.3.2: Glatt Plagiarism Detection Results
The percentage of correct answers can be used to determine whether the writing is from the
same person or not. The result of the mentioned test is shown in Figure 6.3.2. This
approach is not always feasible in an academic environment where large numbers of
documents need to be processed, but it provides a very effective secondary layer of
detection to confirm and verify the results.
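The cloze procedure described in this section can be sketched in a few lines. This Python sketch is only an illustration of the "remove every fifth word and score the answers" idea; the function names and the blank marker are assumptions, the answering time that the text mentions is not modelled, and nothing here reflects the actual Glatt program.

```python
def make_cloze(text, step=5):
    """Blank out every `step`-th word; return the test text and the answers."""
    words = text.split()
    answers = []
    for i in range(step - 1, len(words), step):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers


def score_cloze(answers, responses):
    """Fraction of blanks the writer filled in correctly (case-insensitive)."""
    if not answers:
        return 0.0
    correct = sum(a.lower() == r.lower() for a, r in zip(answers, responses))
    return correct / len(answers)


test_text, key = make_cloze(
    "one two three four five six seven eight nine ten eleven")
print(test_text)                           # blanks replace "five" and "ten"
print(score_cloze(key, ["five", "nine"]))  # 0.5
```

A genuine author is expected to score noticeably higher than someone who copied the text, which is why the result is treated as a probability rather than a proof.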
6.4 JPlag
JPlag is an internet-based service used to detect plagiarism among a set of
programs, i.e. software plagiarism. It finds similarities among
multiple sets of source code files. Created by Guido Malpohl, JPlag currently supports
Java, C, C++, Scheme, and also natural language text. Users upload the files to be
compared and the system presents a report identifying matches.

JPlag takes as input a set of programs, compares these programs pairwise (computing for
each pair a total similarity value and a set of similarity regions), and provides as output a
set of HTML pages that allow for exploring and understanding the similarities found in
detail. JPlag works by converting each program into a stream of canonical tokens and
then trying to cover one such token string by substrings taken from the other (string
tiling). JPlag's analysis is aware of programming language syntax and structure.
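The first step, converting a program into a stream of canonical tokens, can be illustrated with Python's standard `tokenize` module. This is a sketch under stated assumptions: it tokenizes Python source as a stand-in (the text lists Java, C, C++ and Scheme for JPlag), and the token names "ID", "NUM" and "STR" are invented for this example and are not JPlag's actual token set.

```python
import io
import keyword
import tokenize


def canonical_tokens(source):
    """Map Python source to a token stream that ignores names and layout."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        if tok.type == tokenize.NAME:
            # Keywords stay as they are; every identifier becomes "ID".
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # operators and punctuation
    return out


original = "total = 0\nfor x in items:\n    total = total + x\n"
renamed = "s = 0  # sum\nfor item in data:\n    s = s + item\n"
print(canonical_tokens(original) == canonical_tokens(renamed))  # True
```

Renaming variables and adding comments leave the token stream unchanged, so a string tiling step run on these streams would still match the two programs.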
6.5 WCopyfind
Developed by Lou Bloomfield, Professor of Physics, University of Virginia, this program
examines a group of documents that an instructor selects, and pulls out text portions of
those documents with matching words in phrases of a specified minimum length. The
program cannot find such "shared" phrases from documents that are "external" or those
not entered for testing. Recent versions of software can handle web documents.
Since WCopyfind works at the string-of-text level, language is unimportant and
matches are readily identified from the candidate documents submitted for analysis. Note
that such a procedure cannot find plagiarism based on documents not submitted, for
example Web resident documents. Of course, a small subset can be
submitted for further Web-based document comparison, with Google for example. In this case a
sample of the identified within-cohort plagiarized text was submitted for a Google search
and immediately revealed a source on the Web containing the same text.
Figure 6.3.1, Figure 6.3.2, and Figure 6.3.3 show the WCopyfind system interface,
the WCopyfind report, and the WCopyfind document comparison, respectively.

Figure 6.3.1: WCopyfind - system interface
Figure 6.3.2: WCopyfind - report

Figure 6.3.3: WCopyfind - document comparison
The comparison window and matches.txt both list two match counts. The first, Total
Match, is the number of perfectly matching words that have been marked in the pair of
documents. The second, Basic Match, is the number of perfectly matching words in
phrases of at least Shortest Phrase to Match words. The second value is essentially the
value that would have been obtained if no imperfections were allowed in the matching.
In fact, if the Most Imperfections to Allow parameter is set to zero, Total Match and
Basic Match will be the same. In the reports, perfect matches are indicated by
red-underlined words, and bridging but non-matching words are indicated by
green-underlined words.
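The relation between the two counts can be illustrated with a small sketch. It rests on strong simplifying assumptions that are not taken from WCopyfind itself: the two passages are assumed to be already aligned word by word, an imperfection is modelled only as a substituted word, and the function name `match_counts` and its parameters are hypothetical stand-ins for Shortest Phrase to Match and Most Imperfections to Allow.

```python
def match_counts(a, b, min_phrase=3, max_imperfections=1):
    """Return (total_match, basic_match) for two word lists compared position by position.

    Assumption of this sketch: the lists are already aligned, so word i of
    `a` is compared only with word i of `b`.
    """
    flags = [x == y for x, y in zip(a, b)]
    n = len(flags)

    # Total: matching words in stretches that may bridge a few mismatches.
    total = 0
    i = 0
    while i < n:
        if not flags[i]:
            i += 1
            continue
        matched = misses = 0
        j = i
        while j < n and (flags[j] or misses < max_imperfections):
            if flags[j]:
                matched += 1
            else:
                misses += 1  # a bridged, non-matching word
            j += 1
        if matched >= min_phrase:
            total += matched
        i = j

    # Basic: matching words in unbroken runs of at least min_phrase words.
    basic = 0
    run = 0
    for f in flags + [False]:
        if f:
            run += 1
        else:
            if run >= min_phrase:
                basic += run
            run = 0
    return total, basic

a = "we hold these truths to be self evident".split()
b = "we hold those truths to be self evident".split()
print(match_counts(a, b, min_phrase=3, max_imperfections=1))
print(match_counts(a, b, min_phrase=3, max_imperfections=0))
```

In this example one substituted word splits the passage into a two-word run and a five-word run. Allowing one imperfection bridges the gap and gives a total of seven matching words against a basic count of five; with no imperfections allowed both counts are five, which agrees with the statement above that the two values coincide when the imperfection limit is zero.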
The operation of plagiarism detection tools is based on statistical methods, semantic
methods, or both in order to get better results. Since statistical methods are easier to
implement in software, most detection tools use statistical methods to detect plagiarism.
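As an illustration of why statistical methods are easy to implement, the sketch below compares two texts by the cosine similarity of their word-frequency vectors. It is one possible statistical measure chosen for this example, not the method of any particular tool named in this report; the function name `cosine_similarity` and the tokenizing rule are assumptions.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of word-frequency vectors, between 0.0 and 1.0."""
    fa = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    fb = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    # Dot product over the words that the two texts have in common.
    dot = sum(fa[w] * fb[w] for w in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(c * c for c in fa.values()))
    norm_b = math.sqrt(sum(c * c for c in fb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no words
    return dot / (norm_a * norm_b)

print(cosine_similarity("plagiarism is theft of ideas", "theft of ideas is plagiarism"))
print(cosine_similarity("plagiarism is theft", "the weather is fine"))
```

A measure of this kind looks only at word counts, so reordered copies of the same words score as highly similar, while a paraphrase that keeps the meaning but changes the vocabulary scores low. Recognizing the paraphrase would require semantic methods.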

7. LIMITATIONS OF DETECTION TOOLS
The drawbacks of detection tools are:
• Inability to distinguish correctly cited text from plagiarized text.
• Books are typically not searched by these services.
• Detect plagiarized words, not plagiarized thoughts or ideas.
• Inability to process textual images for similarity checks.
Analysis of the known plagiarism detection tools shows that although these tools provide
excellent service in detecting matching text between documents, even advanced
plagiarism detection software cannot detect plagiarism as well as a human can. The
inability of plagiarism detection tools to distinguish correctly cited text from plagiarised
text is one of their most serious drawbacks. That is why human intervention is necessary
before a paper is declared plagiarised: manual checking and human judgment are still needed.
8. CONCLUSION
In the age of information technology, plagiarism has become more pressing and has turned
into a serious problem. This paper discusses ways to reduce plagiarism. Plagiarism
prevention methods based on changing society's attitude towards plagiarism are without
doubt the most significant means of fighting plagiarism, but implementing these methods
is a challenge for society as a whole. The human brain is a universal plagiarism
detection tool: it is able to analyze a document using statistical and semantic methods
and to operate with both textual and non-textual information. At present, such abilities
are not available in plagiarism detection software tools. Nevertheless, computer-based
plagiarism detection tools can considerably help to find plagiarised documents.

9. REFERENCES
1) Romans Lukashenko, Vita Graudina, Janis Grundspenkis. Computer-Based Plagiarism
Detection Methods and Tools: An Overview. Proceedings of the 2007 International
Conference on Computer Systems and Technologies, vol. 285.
2) Lancaster, T., F. Culwin. Classifications of Plagiarism Detection Engines. ITALICS,
vol. 4, no. 2, 2005.
3) Maurer, H., F. Kappe, B. Zaka. Plagiarism - A Survey. Journal of Universal Computer
Science, vol. 12, no. 8, pp. 1050-1084, 2006.
4) plagiarism.org
5) http://educause.edu/ir/library/pdf/ser07017b.pdf
6) https://ipd.uni-karlsruhe.de/jplag
7) http://plagiarism.phys.virginia.edu/WCopyfind%202.5.exe
