Shingling for Similarity and Plagiarism Detection
This article introduces you to the concept of shingling, the basics of the shingling technique, Jaccard similarity, advanced techniques, and optimizations.
Join the DZone community and get the full member experience.
Join For FreeIn the digital age, where information is readily available and easily accessible, there is a need for techniques that can detect plagiarism (intentional or unintentional) from content duplication to enhancing natural language processing capabilities. What sets shingling's capabilities apart is the way it extends to various applications, including but not limited to, document clustering, information retrieval, and content recommendation systems.
The article outlines the following:
- Understand the concept of shingling
- Explore the basics of the shingling technique
- Jaccard similarity: measuring textual similarity
- Advanced techniques and optimizations
- Conclusion and further reading
Understanding the Concept of Shingling
Shingling is a widely used technique in detecting and mitigating textual similarities. This article introduces you to the concept of shingling, the basics of shingling technique, Jaccard similarity, advanced techniques, and optimizations. The process of converting a string of text in documents into a set of overlapping sequences of words or letters is called Shingling. Programmatically, think of this as a list of substrings from a string value.
Let's take a string: "Generative AI is evolving rapidly." Let's denote the length of the shingle as k and set the value of k to 5.
The result is a set of five letters:
{'i is ', ' evol', 'apidl', 'e ai ', 'ai is', 'erati', 've ai', 'rapid', 'idly.', 'ing r', ' ai i', 's evo', 'volvi', 'nerat', ' is e', 'ving ', 'tive ', 'enera', 'ng ra', 'is ev', 'gener', 'ative', 'evolv', 'pidly', ' rapi', 'olvin', 'rativ', 'lving', 'ive a', 'g rap'}
This set of overlapping sequences are called "shingles" or "n-grams." Shingles consist of consecutive words or characters from the text, creating a series of overlapping segments. The length of a shingle denoted above as "k," varies depending on the specific requirements of the analysis, with a common practice involving the creation of shingles containing three to five words or characters.
Explore the Basics of Shingling Technique
Shingling is part of a three-step process.
Tokenization
If you are familiar with prompt engineering, you should have heard about Tokenization. It is the process of breaking up a sequence of text into smaller units called tokens. Tokens can be words, subwords, characters, or other meaningful units. This step prepares the text data for further processing by models. With word tokenization, the above example "Generative AI is evolving rapidly" will be tokenized into:
['Generative', 'AI', 'is', 'evolving', 'rapidly', '.']
For tokenization, you can use either a simple Python `split`
method or Regex. There are libraries like NLTK (Natural Language ToolKit) and spaCy that provide advanced options like stopwords etc.,
Link to the code.
Shingling
As you know by now, Shingling, also known as n-gramming, is the process of creating sets of contiguous sequences of tokens (n-grams or shingles) from the tokenized text. For example, with k=3, the sentence "Generative AI is evolving rapidly." would produce shingles like
[['Generative', 'AI', 'is'], ['AI', 'is', 'evolving'], ['is', 'evolving', 'rapidly.']]
This is a list of shingles. Shingling helps capture local word order and context.
Hashing
Hashing simply means using special functions to turn any kind of data, like text or shingles, into fixed-size codes. Some popular hashing methods include MinHash, SimHash, and Locality Sensitive Hashing (LSH). Hashing enables efficient comparison, indexing, and retrieval of similar text segments. When you turn documents into sets of shingle codes, it's much simpler for you to compare them and spot similarities or possible plagiarism.
Simple Shingling
Let's consider two short text passages that are widely used to explain simple shingling
- Passage 1: "The quick brown fox jumps over the lazy dog."
- Passage 2: "The quick brown fox jumps over the sleeping cat."
With a word size of 4, using the w-shingle Python above, the shingles for Passage 1 would be:
python w_shingle.py "The quick brown fox jumps over the lazy dog." -w 4
[['The', 'quick', 'brown', 'fox'], ['quick', 'brown', 'fox', 'jumps'], ['brown', 'fox', 'jumps', 'over'], ['fox', 'jumps', 'over', 'the'], ['jumps', 'over', 'the', 'lazy'], ['over', 'the', 'lazy', 'dog.']]
For passage 2, the shingles would be:
python w_shingle.py "The quick brown fox jumps over the sleeping cat" -w 4
[['The', 'quick', 'brown', 'fox'], ['quick', 'brown', 'fox', 'jumps'], ['brown', 'fox', 'jumps', 'over'], ['fox', 'jumps', 'over', 'the'], ['jumps', 'over', 'the', 'sleeping'], ['over', 'the', 'sleeping', 'cat']]
By comparing the sets of shingles, you can see that the first four shingles are identical, indicating a high degree of similarity between the two passages.
Shingling sets the stage for more detailed analysis, like measuring similarities using things like Jaccard similarity. Picking the right shingle size "k" is crucial. Smaller shingles can catch small language details, while larger ones might show bigger-picture connections.
Jaccard Similarity: Measuring Textual Similarity
In text analysis, Jaccard similarity is considered a key metric. It is the similarity between two text samples by calculating the ratio of the number of shared shingles to the total number of unique shingles across both samples.
J(A,B) = (A ∩ B) / (A ∪ B)
Jaccard similarity is defined as the size of the intersection divided by the size of the union of the shingle sets from each text. Though it sounds straightforward and simple, this technique is powerful as it provides a means to calculate textual similarity, offering insights into how closely related two pieces of text are based on their content. Using Jaccard similarity enables researchers and AI models to compare analyses of text data with precision. It is used in tasks like document clustering, similarity detection, and content categorization.
Shingling can also be used to cluster similar documents together. By representing each document as a set of shingles and calculating the similarity between these sets (e.g., using the Jaccard coefficient or cosine similarity), you can group documents with high similarity scores into clusters. This approach is useful in various applications, such as search engine result clustering, topic modeling, and document categorization.
While implementing Jaccard similarity in programming languages like Python, the choice of shingle size (k) and the conversion to lowercase ensures a consistent basis for comparison, showcasing the technique’s utility in discerning textual similarities.
Let’s calculate the Jaccard similarity between two sentences:
def create_shingles(text, k=5):
"""Generates a set of shingles for given text."""
return set(text[i : i + k] for i in range(len(text) - k + 1))
def compute_jaccard_similarity(text_a, text_b, k):
"""Calculates the Jaccard similarity between two shingle sets."""
shingles_a = create_shingles(text_a.lower(), k)
print("Shingles for text_a is ", shingles_a)
shingles_b = create_shingles(text_b.lower(), k)
print("Shingles for text_b is ", shingles_b)
intersection = len(shingles_a & shingles_b)
union = len(shingles_a | shingles_b)
print("Intersection - text_a ∩ text_b: ", intersection)
print("Union - text_a ∪ text_b: ", union)
return intersection / union
Link to the code repository.
Example
text_a
= "Generative AI is evolving rapidly."
text_b
= "The field of generative AI evolves swiftly."
shingles_a = {'enera', 's evo', 'evolv', 'rativ', 'ving ', 'idly.', 'ative', 'nerat', ' is e', 'is ev', 'olvin', 'i is ', 'pidly', 'ing r', 'rapid', 'apidl', 've ai', ' rapi', 'tive ', 'gener', ' evol', 'volvi', 'erati', 'ive a', ' ai i', 'g rap', 'ng ra', 'e ai ', 'lving', 'ai is'}
shingles_b = {'enera', 'e fie', 'evolv', 'volve', 'wiftl', 'olves', 'rativ', 'f gen', 'he fi', ' ai e', ' fiel', 'lves ', 'ield ', ' gene', 'ative', ' swif', 'nerat', 'es sw', ' of g', 'ftly.', 'ld of', 've ai', 'ves s', 'of ge', 'ai ev', 'tive ', 'gener', 'the f', ' evol', 'erati', 'iftly', 's swi', 'ive a', 'swift', 'd of ', 'e ai ', 'i evo', 'field', 'eld o'}
J(A,B) = (A ∩ B) / (A ∪ B) = 12 / 57 = 0.2105
So, the Jaccard Similarity is 0.2105. The score signifies that the two sets are 21.05 % (0.2105 * 100) similar.
Example
Instead of passages, let’s consider two sets of numbers:
A = { 1,3,6,9}
B = {0,1,4,5,6,8}
(A ∩ B) = Common numbers in both the sets = {1,6} = 2
(A ∪ B) = Total numbers in both the sets = {0,1,3,4,5,6,8,9} = 8
Calculate Jaccard similarity to see how similar these two sets of numbers are
(A ∩ B) / (A ∪ B) = 2/8 = 0.25.
To calculate dissimilarity, just subtract the score from 1.
1- 0.25 = 0.75
So both the sets are 25% similar and 75% dissimilar.
Advanced Techniques and Optimizations
Advanced shingling and hashing techniques and optimizations are crucial for efficient similarity detection and plagiarism detection in large datasets. Here are some advanced techniques and optimizations, along with examples and links to code implementations:
Locality-Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is an advanced technique that improves the efficiency of shingling and hashing for similarity detection. It involves creating a signature matrix and using multiple hash functions to reduce the dimensionality of the data, making it efficient to find similar documents.
The key idea behind LSH is to hash similar items into the same bucket with high probability, while dissimilar items are hashed into different buckets. This is achieved by using a family of locality-sensitive hash functions that hash similar items to the same value with higher probability than dissimilar items.
Example
Consider two documents, A and B, represented as sets of shingles:
- Document A: {"the quick brown", "quick brown fox", "brown fox jumps"}
- Document B: {"a fast brown", "fast brown fox", "brown fox leaps"}
We can apply LSH by:
- Generating a signature matrix using multiple hash functions on the shingles.
- Hashing each shingle using the hash functions to obtain a signature vector.
- Banding the signature vectors into bands.
- Hashing each band to obtain a bucket key.
- Documents with the same bucket key are considered potential candidates for similarity.
This process significantly reduces the number of document pairs that need to be compared, making similarity detection more efficient.
For a detailed implementation of LSH in Python, refer to the GitHub repository.
Minhashing
Minhashing is a technique used to quickly estimate the similarity between two sets by using a set of hash functions. It's commonly applied in large-scale data processing tasks where calculating the exact similarity between sets is computationally expensive. Minhashing approximates the Jaccard similarity between sets, which measures the overlap between two sets.
Here's how Minhashing works:
Generate Signature Matrix
- Given a set of items, represent each item as a set of shingles.
- Construct a signature matrix where each row corresponds to a hash function, and each column corresponds to a shingle.
- Apply hash functions to each shingle in the set, and for each hash function, record the index of the first shingle hashed to 1 (the minimum hash value) in the corresponding row of the matrix.
Estimate Similarity
- To estimate the similarity between the two sets, compare their respective signature matrices.
- Count the number of positions where the signatures agree (i.e., both sets have the same minimum hash value for that hash function).
- Divide the count of agreements by the total number of hash functions to estimate the Jaccard similarity.
Minhashing allows for a significant reduction in the amount of data needed to represent sets while providing a good approximation of their similarity.
Example: Consider Two Sets
- Set A = {1, 2, 3, 4, 5}
- Set B = {3, 4, 5, 6, 7}
We can represent these sets as shingles:
- Set A shingles: {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5}, {5}
- Set B shingles: {3, 4}, {4, 5}, {5, 6}, {6, 7}, {3}, {4}, {5}, {6}, {7}
Now, let's generate the signature matrix using Minhashing:
Hash Function |
Shingle 1 |
Shingle 2 |
Shingle 3 |
Shingle 4 |
Shingle 5 |
Hash 1 |
0 |
0 |
1 |
0 |
1 |
Hash 2 |
1 |
1 |
1 |
0 |
0 |
Hash 3 |
0 |
0 |
1 |
0 |
1 |
Now, let's estimate the similarity between sets A and B:
- Number of agreements = 2 (for Shingle 3 and Shingle 5)
- Total number of hash functions = 3
- Jaccard similarity ≈ 2/3 ≈ 0.67
Code Implementation: You can implement Minhashing in Python using libraries like NumPy and datasketch.
Banding and Bucketing
Banding and bucketing are advanced optimization techniques used in conjunction with Minhashing to efficiently identify similar sets within large datasets. These techniques are particularly valuable when dealing with massive collections of documents or data points.
Banding
Banding involves dividing the Minhash signature matrix into multiple bands, each containing several rows. By partitioning the matrix vertically into bands, we reduce the number of comparisons needed between sets. Instead of comparing every pair of rows across the entire matrix, we only compare rows within the same band. This significantly reduces the computational overhead, especially for large datasets, as we only need to consider a subset of rows at a time.
Bucketing
Bucketing complements banding by further narrowing down the comparison process within each band. Within each band, we hash the rows into a fixed number of buckets. Each bucket contains a subset of rows from the band. When comparing sets for similarity, we only need to compare pairs of sets that hash to the same bucket within each band. This drastically reduces the number of pairwise comparisons required, making the process more efficient.
Example
Let's say we have a Minhash signature matrix with 100 rows and 20 bands. Within each band, we hash the rows into 10 buckets. When comparing sets, instead of comparing all 100 rows, we only need to compare pairs of sets that hash to the same bucket within each band. This greatly reduces the number of comparisons needed, leading to significant performance gains, especially for large datasets.
Benefits
- Efficiency: Banding and bucketing dramatically reduce the number of pairwise comparisons needed, making similarity analysis more computationally efficient.
- Scalability: These techniques enable the processing of large datasets that would otherwise be impractical due to computational constraints.
- Memory optimization: By reducing the number of comparisons, banding, and bucketing also lower memory requirements, making the process more memory-efficient.
Several open-source implementations, such as the datasketch library in Python and the lsh library in Java, provide functionality for shingling, minhashing, and banded LSH with bucketing.
Candidate Pairs
Candidate pairs is an advanced technique used in conjunction with shingling and minhashing for efficient plagiarism detection and near-duplicate identification. In the context of shingling, candidate pairs work as follows:
Shingling
Documents are first converted into sets of k-shingles, which are contiguous sequences of k tokens (words or characters) extracted from the text. This step represents documents as sets of overlapping k-grams, enabling similarity comparisons.
Minhashing
The shingle sets are then converted into compact minhash signatures, which are vectors of fixed length, using the minhashing technique. Minhash signatures preserve similarity between documents, allowing efficient estimation of Jaccard similarity.
Banding
The minhash signatures are split into multiple bands, where each band is a smaller sub-vector of the original signature.
Bucketing
Within each band, the sub-vectors are hashed into buckets using a hash function. Documents with the same hash value for a particular band are placed in the same bucket.
Candidate Pair Generation
Two documents are considered a candidate pair for similarity comparison if they share at least one bucket across all bands. In other words, if their sub-vectors collide in at least one band, they are considered a candidate pair.
The key advantage of using candidate pairs is that it significantly reduces the number of document pairs that need to be compared for similarity, as only candidate pairs are considered. This makes the plagiarism detection process much more efficient, especially for large datasets.
By carefully choosing the number of bands and the band size, a trade-off can be made between the accuracy of similarity detection and the computational complexity. More bands generally lead to higher accuracy but also increase the computational cost.
Conclusion
In conclusion, the combination of shingling, minhashing, banding, and Locality Sensitive Hashing (LSH) provides a powerful and efficient approach for plagiarism detection and near-duplicate identification in large document collections.
Shingling converts documents into sets of k-shingles, which are contiguous sequences of k tokens (words or characters), enabling similarity comparisons. Minhashing then compresses these shingle sets into compact signatures, preserving similarity between documents.
To further improve efficiency, banding splits the minhash signatures into multiple bands, and bucketing hashes each band into buckets, grouping similar documents together. This process generates candidate pairs, which are pairs of documents that share at least one bucket across all bands, significantly reducing the number of document pairs that need to be compared for similarity.
The actual similarity computation is then performed only on the candidate pairs, using the original minhash signatures to estimate the Jaccard similarity. Pairs with similarity above a specified threshold are considered potential plagiarism cases or near-duplicates.
This approach offers several advantages:
- Scalability: By focusing on candidate pairs, the computational complexity is significantly reduced, making it feasible to handle large datasets.
- Accuracy: Shingling and minhashing can detect plagiarism even when content is paraphrased or reordered, as they rely on overlapping k-shingles.
- Flexibility: The choice of the number of bands and the band size allows for a trade-off between accuracy and computational complexity, enabling optimization for specific use cases.
Several open-source implementations, such as the datasketch library in Python and the lsh library in Java, provide functionality for shingling, minhashing, and banded LSH with bucketing and candidate pair generation, making it easier to integrate these techniques into plagiarism detection systems or other applications requiring efficient similarity search.
Overall, the combination of shingling, minhashing, banding, and LSH offers a powerful and efficient solution for plagiarism detection and near-duplicate identification, with applications across academia, publishing, and content management systems.
Further Reading
- Plagiarism selection in text collections
- Minhash using datasketch
- chrisjmccormick/MinHash: Python implementation with tutorial
- MinHashing - Kaggle: Comprehensive exploration of minhashing
- Stack Overflow discussion: Suggestions for minhash implementations
Opinions expressed by DZone contributors are their own.
Comments