
In natural language processing, a w-shingling is the set of unique shingles (contiguous subsequences of tokens, i.e. n-grams) in a document, which can be used to gauge the similarity between documents. The symbol w denotes the number of tokens in each shingle.

The document, "a rose is a rose is a rose" can therefore be maximally tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (4-grams, so w = 4) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

Removing the duplicates leaves the 4-shingling

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
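A minimal sketch of this construction in Python (the function name shingles and the whitespace tokenization are illustrative assumptions, not part of any standard library):

    def shingles(tokens, w):
        """Return the w-shingling: the set of all contiguous w-token subsequences."""
        return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

    tokens = "a rose is a rose is a rose".split()  # simple whitespace tokenization
    print(shingles(tokens, 4))
    # {('a', 'rose', 'is', 'a'), ('rose', 'is', 'a', 'rose'), ('is', 'a', 'rose', 'is')}

Because the result is a set, the duplicate 4-grams are removed automatically, mirroring the reduction shown above.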

Resemblance

For a given shingle size w, the degree to which two documents A and B resemble each other can be expressed as the ratio of the sizes of their shinglings' intersection and union:

r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

where S(A) denotes the w-shingling of document A and |S| denotes the size of set S. The resemblance is a number in the range [0, 1], where 1 indicates that the two documents are identical. This definition coincides with the Jaccard coefficient, which measures the similarity and diversity of sample sets.
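Continuing the sketch above, the resemblance can be computed directly from the two shingle sets (again an illustrative Python sketch, with the shingling helper inlined so the snippet stands alone):

    def resemblance(tokens_a, tokens_b, w):
        """Jaccard resemblance of two token sequences for shingle size w."""
        shingle = lambda t: {tuple(t[i:i + w]) for i in range(len(t) - w + 1)}
        a, b = shingle(tokens_a), shingle(tokens_b)
        return len(a & b) / len(a | b)  # assumes at least one shingling is non-empty

    doc = "a rose is a rose is a rose".split()
    print(resemblance(doc, doc, 4))  # 1.0: identical documents resemble each other fully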
