
Machine Learning and AI for Plagiarism Detection

SydneyF

The general concept behind plagiarism is relatively simple: plagiarism is wrongfully taking credit for original work that was created by someone else. The word plagiarism comes from the Latin word plagiarius, meaning kidnapper, and was first used by a Roman poet to describe someone wrongfully claiming credit for his creative work.

The widely held view that plagiarism is unethical, and the corresponding ideal of originality, rose to prominence in Europe during the late 18th century as a part of Romanticism (an intellectual movement that emphasized emotion and individualism). It continues to be an important concept in our culture today, particularly in the context of academia.

It is important to note here that plagiarism is not a legal concept. Copyright infringement is. Plagiarism is considered dishonest and unethical and is often punished by institutions (for example, by expulsion from school or the loss of a job), but it is not the same thing as copyright infringement. It is possible to commit plagiarism without copyright infringement, or to commit copyright infringement without plagiarism. Plagiarism is entirely a moral concept, enforced by institutions and the court of public opinion.

The Internet by Anthony Clark: https://nedroidcomics.tumblr.com/post/41879001445/the-internet

In the age of the internet and AI, it is in some ways both easier and harder than ever to plagiarize (and get away with it). Because the internet is gigantic, and many of our activities are somewhat anonymized, it can be easy to get away with plagiarism. But the same tools that make it easy to plagiarize make it easy to find instances of plagiarism.

Defining how plagiarism applies to different fields has also become a little more ambiguous now that we are more globally connected and copyable creative mediums have expanded and changed.

Can code, a technical language used to communicate with computers, be plagiarized? Is it plagiarism to pay someone else on the internet to write a paper for you? To what extent does something need to be reworked for it to no longer be considered plagiarized? Thinking about this in terms of music, is drawing inspiration from or sampling another track or artist plagiarism?

Questions like these hint at the tricky nature of plagiarism as it functions in modern culture.

Despite ambiguity and change, plagiarism continues to be an important moral concept, particularly in creative fields and academia. So much so that software companies continue to develop solutions for plagiarism detection and identification, leveraging text-matching algorithms as well as machine learning and AI.

Plagiarism Checkers

There is a thriving software industry built around plagiarism detection, as well as general writing and grading assistance. Recently, the company Turnitin, which primarily sells a plagiarism detection service commonly used by high schools and universities, was acquired for 1.735 billion dollars. Plagiarism detection applications, like the one provided by Turnitin, use algorithms that compare a submitted paper to text in a database, searching for identical or near-identical passages. Typically, the plagiarism checker returns a report with a percent value showing how much of the text was identified as matching. Users of the software are encouraged to review the report and make their own assessment of whether the paper was plagiarized. In practice, the metrics from these reports may instead feed an automated system.
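To make the idea of a "percent match" concrete, here is a minimal sketch of one common text-matching approach: break each document into overlapping word n-grams and report what share of the submission's n-grams also appear in the database. This is only an illustration under stated assumptions. The function names (`shingles`, `percent_match`), the n-gram size, and the toy texts are all hypothetical, and nothing here describes how Turnitin or any other commercial checker actually works.

```python
import re

def shingles(text, n=3):
    """Return the set of overlapping word n-grams in text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def percent_match(submission, database, n=3):
    """Percent of the submission's n-grams found in any database document."""
    submitted = shingles(submission, n)
    if not submitted:
        return 0.0  # text too short to form a single n-gram
    known = set()
    for document in database:
        known |= shingles(document, n)
    return 100.0 * len(submitted & known) / len(submitted)

# Toy data, invented for illustration only.
database = ["the quick brown fox jumps over the lazy dog"]
submission = "A quick brown fox jumps over a sleeping cat"
print(round(percent_match(submission, database), 1))  # prints 42.9
```

Three of the submission's seven word trigrams appear in the database text, so the score is about 43 percent. A number like this says nothing about intent, quotation, or citation, which is why the report still needs a human reader.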

The algorithms used by plagiarism detection services are liable to produce false positives (the checker detects plagiarism where none occurred) and false negatives (the checker fails to detect plagiarism that did occur). If no human reviews these detections, people may be wrongfully punished (e.g., an admissions application that is never reviewed because it exceeded an arbitrary percent match threshold) or never caught (e.g., a student who tunes their papers with plagiarism detection software marketed to students so that their paper does not get flagged).

Despite the seemingly widespread use of plagiarism detection software in education, feelings about both plagiarism checkers and the business models of the companies that provide them are mixed. On a social level, there are concerns that plagiarism checkers imply that teachers don’t trust their students, and that students must hand over their intellectual property to a company to be added to its database. Critics of plagiarism detection software also argue that people will use these technologies as a crutch in the review process and stop looking for more obvious indicators of plagiarism, like a shift in style or tone, or a difference in voice between submissions, which the algorithms in plagiarism checking software are less likely to detect.

Stylometry: Plagiarism Detector 2.0?

Another type of plagiarism that is becoming more and more popular is students purchasing “ghostwritten” essays and turning them in as their own work. Because the work is technically original, it is not going to be detected by a text-matching plagiarism checker. This can be seen as an example of plagiarism that is not copyright infringement. Although the writer was paid for the writing and willingly provided it with the intent of having it used and claimed by someone else, the student is still misrepresenting authorship, and it is considered to be cheating in an academic context.

A more advanced approach to identifying plagiarism than text matching is to use stylometry, the study of linguistic style (it has also been applied to music and art), in combination with machine learning to identify authorship.

At a high level, the idea of stylometry is that everyone has a writing “fingerprint”. The way that I write is somewhat unique to me and different from the way you would write. If we were both to write a 1,000-word paper on the same topic, the way we went about describing and discussing that topic would be very different (let me know if that's an experiment you'd like to try, I'm game).
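One simple way to picture a writing fingerprint is as a vector of how often an author uses common function words, which tend to be chosen out of habit rather than because of the topic. The sketch below builds such a vector and compares two of them with cosine similarity. It is a toy illustration, not the method used in any particular study: the ten-word list, the function names, and the choice of cosine similarity are all assumptions, and real stylometric work relies on far more features and much longer texts.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set; real analyses use many more features.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "but", "i"]

def fingerprint(text):
    """Relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Placeholder strings; in practice these would be full-length documents.
known_sample = "It was the best of times, and it was the worst of times."
disputed_sample = "I think that the answer is in the data, but I am not sure of it."
print(cosine_similarity(fingerprint(known_sample), fingerprint(disputed_sample)))
```

A score near 1.0 means the two texts use these words in similar proportions. On its own that is weak evidence. An attribution study would compare the disputed text against several candidate authors and many features before drawing any conclusion.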

You may remember several years ago (specifically, 2013) when people found out that the widely beloved author J.K. Rowling had been writing books under the pen name Robert Galbraith. The initial investigation of this claim leveraged stylometry. Using statistics and a computer program called the Java Graphical Authorship Attribution Program (JGAAP), analyst Patrick Juola analyzed the books by each "author" and found that Rowling’s work was very similar to Galbraith’s. Soon after these findings were published, she came out as having written the book.

Stylometry combined with machine learning is a potential solution to trickier cases of plagiarism detection, and it is an active area of research in both traditional writing and computer programming.

Coding Stylometry

Plagiarism is most frequently talked about in terms of writing. It also applies to music and art. One arena that you might not think about right away is writing code.

Coding is a creative act. This might not be the first thing that comes to mind when people think of programmers working on creating software, but to be able to have a computer do what you tell it to, you need to learn to communicate in a new language, and you need to use that language creatively. That is part of what makes programming challenging. It is both a logical and creative pursuit.

The same way that J.K. Rowling left a “fingerprint” in the work she wrote under a pseudonym, authors of code leave a stylometric fingerprint on their work. Numerous research papers have been published over the past 20 years on using (deep) neural networks or other machine learning algorithms together with code stylometry to identify the authors of code, with high levels of success. A fair amount of this research is funded by groups falling under the Department of Defense (DOD), because these algorithms can be used forensically to identify the authors of malicious code.
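To give a rough sense of what a code fingerprint might be built from, the sketch below measures a few layout habits in a piece of source code: tabs versus spaces, how many blank lines are left, how long lines run, and how often comments appear. The feature names and the function `layout_features` are hypothetical and chosen only for illustration. Published approaches use much richer feature sets and then train a classifier on them, which this sketch does not attempt.

```python
def layout_features(source):
    """Compute a few simple layout features from a source-code string."""
    lines = source.splitlines()
    nonblank = [line for line in lines if line.strip()]
    n = len(nonblank) or 1  # avoid dividing by zero on empty input
    return {
        # Share of non-blank lines indented with a tab vs. with a space.
        "tab_indent_ratio": sum(line.startswith("\t") for line in nonblank) / n,
        "space_indent_ratio": sum(line.startswith(" ") for line in nonblank) / n,
        # Share of all lines that are blank.
        "blank_line_ratio": (len(lines) - len(nonblank)) / (len(lines) or 1),
        # Mean length of non-blank lines, in characters.
        "avg_line_length": sum(len(line) for line in nonblank) / n,
        # Share of non-blank lines that are '#' comments (Python-style).
        "comment_line_ratio": sum(line.lstrip().startswith("#") for line in nonblank) / n,
    }

# Toy snippet, invented for illustration.
sample = "def f(x):\n\treturn x\n\n# done\n"
print(layout_features(sample))
```

Each author's code would be turned into a feature dictionary like this one, and a model would learn which combinations of habits go with which author. Habits like these are also easy to change on purpose, which is one reason imitation and framing are a real concern.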

There are concerns with these types of algorithms violating the privacy of coders. There has also been evidence that coders can imitate the style of another coder, potentially framing them for code they didn't write.

Although these more advanced algorithms can potentially capture more plagiarism, plagiarism checking software will likely be criticized regardless of how it is implemented. The criticisms are similar to those leveled at any algorithm that makes decisions about people when the people affected do not understand the algorithm or the data driving those decisions.

Copyright Infringement and Coding

It's really important to remember that plagiarism and copyright infringement are not the same thing (and both are different from patent protection). Plagiarism is saying you created something you didn't. Copyright infringement is using a creative, authored work whose copyright belongs to someone else without permission. Copyright infringement around code and technology is an area of law that is actively being defined through many different court cases.

Oracle vs. Google

In 2009, Oracle acquired Sun Microsystems, the company that had created the Java programming language starting in 1991. After acquiring Sun Microsystems, Oracle filed a lawsuit against Google over its use of Java APIs in the Android SDK (Software Development Kit). We are now in year 10 of the litigation, with Google currently attempting to appeal to the U.S. Supreme Court.

Google openly admits to using the same names, organization, and functionality of Java APIs when creating their own version of Java for the Android OS after negotiations to license Java SE from Sun Microsystems fell through. The contention, in this case, is whether APIs are copyrightable at all and if they are, whether using these APIs constitutes fair use.
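The distinction between an API and its implementation is easier to see in code. In the sketch below, two hypothetical libraries expose the same name, organization, and behavior (a `max` function on a math class) but have independently written bodies, and a caller written against one works unchanged with the other. This is a Python analogy invented for illustration. The class and function names are assumptions, and it is not code from either party in the case.

```python
class OriginalMath:
    """Hypothetical 'original' library: defines the API and one implementation."""

    @staticmethod
    def max(a, b):
        """Return the larger of a and b."""
        return a if a >= b else b


class ReimplementedMath:
    """Hypothetical competitor: same name and signature, different body."""

    @staticmethod
    def max(a, b):
        """Return the larger of a and b."""
        return sorted((a, b))[1]


def largest_reading(math_lib, readings):
    """Caller code that depends only on the API, not on who implemented it."""
    result = readings[0]
    for reading in readings[1:]:
        result = math_lib.max(result, reading)
    return result


print(largest_reading(OriginalMath, [3, 9, 4]))       # prints 9
print(largest_reading(ReimplementedMath, [3, 9, 4]))  # prints 9
```

The caller never changes, which is the interoperability argument in miniature. The legal question is whether the shared part (the names, signatures, and their organization) is itself protected expression.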

In the first trial, the district court that heard the case ruled that APIs were not copyrightable. This was appealed and overturned by the Court of Appeals for the Federal Circuit. The case was heard again, and the finding in the second district court trial was that although the APIs are copyrightable, Google’s use of them falls under fair use. This was also overturned by the Federal Circuit. There is now a case going for damages, and Google is trying to appeal the Federal Circuit's second ruling.

The Federal Circuit's view is that Google chose to copy the APIs for the sake of expedience, not because writing its own API packages was impossible. Google argues that it copied the APIs to ensure that applications developed for Java would work in Android without alteration, encouraging interoperability.

Copyright does not cover ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries. It covers works of authorship, including literature, music, dramatic works, images, audiovisual works, and architecture. The work needs to have a degree of originality and creativity. The question is really whether specific API code is a creative work (a specific expression of an idea) or a method of operation for a specific software language. Only a limited number of cases set legal precedent here. The most famous is Lotus v. Borland, in which a federal appeals court held that a spreadsheet program's menu command hierarchy was a method of operation and therefore not copyrightable. The Supreme Court split evenly on appeal, which left in place a ruling that has allowed software developers to create competing versions of copyrighted software without infringement.

Many software developers, computer scientists, and technologists, including the Electronic Frontier Foundation, support Google. The concern with a ruling in favor of Oracle is that making APIs and similar bits of code copyrightable and not subject to fair use may stifle development and progress in computer science. Open use of APIs allows and encourages software interoperability, so limiting their use may slow software innovation. The intent of an API is to make it easy, or even possible, to interface with a piece of software. Another concern with making APIs copyrightable is that it will make licensing mandatory, increasing the cost of development for programmers and sparking an increase in nuisance lawsuits.

Smartcar vs. Otonomo

There is a currently developing conflict between two connected-car API companies, Smartcar and Otonomo. Smartcar has publicly accused Otonomo of copying all of its APIs and documentation verbatim, down to variable names, example snippets, and spelling errors. This case is similar to the Oracle vs. Google case, but it differs in that a directly competing company is accused of copying an entire API, including help documentation and examples. Otonomo has since pulled down its API documentation page and is investigating the accusation, but holds that its “rigorous standards of integrity remain uncompromised.”

Whether this case makes it to court remains to be seen, but it will certainly be interesting to follow as a related case.

Plagiarism in Technology and Media Today

Plagiarism is widely considered to be morally wrong – it is wrongfully stealing something creative and lying to take the credit for it. We search for plagiarism so diligently in the work of our students because it is important to learn by doing an assignment yourself, but also because it is important in our culture to instill an aversion to plagiarism early in an academic and professional career.

In some ways, our society is struggling to define the culture of reuse, attribution, and what is morally acceptable around code and other creative mediums as they emerge and change. This is what makes cases like Oracle vs. Google and the developing conflict between Smartcar and Otonomo important – they establish and define what will be acceptable going forward.

In legal terms, copyright does cover creative works. If you are a writer, you can write about the same topics or story as other writers, but the words of an individual writer on a topic are protected. If you’re going to use something for inspiration or as a resource, change the words, combine it with other resources, and cite them all.

Where coding falls in the world of copyright is tricky – are the individual words of a program protected, or is it simply a way to execute a procedure or idea, which are not copyrightable? This is still something being defined and explored in court, but for the most part, APIs are intended to be used to make programs interoperable. The Google case is on the more egregious end of API usage – they openly and directly used APIs of a language to create a new, competing language to avoid a licensing agreement. Despite this, most programmers and computer scientists are in support of Google. It is common practice in coding to incorporate the work of other programmers. In this context, I think what matters most is acknowledging the sources you are starting with (citation = not plagiarism) and incorporating them into a new, creative work you can comfortably call your own.
