This template creates a link that can be used to store a Wikipedia search box query. A search link is useful for collaborative search on Talk pages and most other pages, but it is not to be used in articles. If used in articles, it outputs the above warning.


Basics

((Search link|first|second|third))

The name of the template is Search link, or sl for short.
The second and third parameters are optional and have defaults, so the short form is ((sl|query)).

Both a search link and a search box go to the same search engine. The same query produces the same result.

The basic search covers articles. It finds words and phrases composed of letters and numbers very quickly, but a basic search can also query for all articles that contain a string that includes punctuation, math, and other symbols as seen in the page content or as seen in the page wikitext.

Basic search principles when using a search link
1 ((Search link
|"search engine query"))

"search engine query"

There is one search term, a phrase that produces 18 results, including a redirect. For one term, the page ranking rule is simple: title matches, on top. Two pages hit on "Search Engine Query" and one on "[[Search engine (computing)|search engine]] query".
2 ((Search link
|"search engine query"

insource:/"search engine query"/))
"search engine query" insource:/"search engine query"/

Added a term: insource:/"slash delimited regex"/. Now there are 15 results. Three were filtered out because a regex matches only exact strings. All other searches ignore capitalization, punctuation, math, and other symbols, like the ]] above. This proves a basic difference from search 1: only insource: searches wikitext. All other terms search what is rendered.
3 ((Search link
|search engine query))

search engine query

There are three search terms. They produce 1169 results. Many page ranking rules apply to make the top most likely and the bottom least likely, probably.
4 ((Search link
|search engine query
insource:/"search engine query"/))

search engine query insource:/"search engine query"/

Similar to search 3, the regex crawled character-by-character through the same 1169-page filter to produce the same 15 results. That's nothing compared to what would happen if you ran an unfiltered (unaccompanied) regex against all namespaces, some 30 million pages (some users' default search domain), to produce those 15 results.[1]
5 ((search link
|insource:/"2 + 2"/
prefix:Arithmetic
|"Arithmetic" titles & "2 + 2"))

"Arithmetic" titles & "2 + 2"
The regexp is the first term, but the prefix: term first filters titles that start with the characters A-r-i-t-h-m-e-t-i-c, then the regexp crawls character-wise. Perhaps such a label conveys this to your team.

This template differs from the search box superficially when searching for an equals sign. In the search box you just say =, but here you must use the five-character string ((=)).[2]

In search 5, notice the need for the double quotes around the search pattern: insource:/"slash delimited regexp"/. These protect any characters from being interpreted as regex metacharacters, and ensure they are interpreted literally. In advanced searches the double quotes are not used, so that the metacharacters can act as conditional and branching operators to create general patterns that match.
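The effect of that quoting can be sketched with Python's re module. This is only an analogy: Python's regex dialect is not the search engine's, and re.escape stands in here for the double quotes. The sample text is made up.

```python
import re

text = "Everyone knows that 2 + 2 = 4."

# Unquoted: "+" is a metacharacter ("one or more of the preceding space"),
# so the pattern no longer describes the literal string "2 + 2".
unquoted = re.search("2 + 2", text)

# Quoted: re.escape plays the role of the double quotes and makes
# every character literal.
quoted = re.search(re.escape("2 + 2"), text)

print(unquoted)        # None
print(quoted.group())  # 2 + 2
```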

In search 5, also note the prefix: filter used. We use filters with regex searches. The easiest filter to apply is like that in search 2: just take the same phrase and make it a separate term. That will act like a filter, and it will speed through the database index to produce a list of pages that might actually match the regex. Prefix is a filter at the end, and a namespace name at the beginning is another easy filter to apply.

The next section covers Search link arguments more in depth.

Advanced

Further information: Help:Searching and Search engine queries

Here are the template parameters for Search link.

1 or |query= The search query. It becomes the text of the search link (how the link will look) so it accepts |text=.
2 or |label= A label to replace the default text. A new look to the link, so it also accepts |link=. Defaults to show the search query.
3 through 20 (|3|4|5|…|20) or |ns= The search domain: one or more namespaces abbreviated "nsx", where x is any namespace number. Written as |nsx|nsx|nsx|…|nsx, or ns=nsx&nsx&nsx…&nsx, or ns=all. Defaults to ns0.

Namespaces
Subject namespaces Talk namespaces
0 (Main/Article) Talk 1
2 User User talk 3
4 Wikipedia Wikipedia talk 5
6 File File talk 7
8 MediaWiki MediaWiki talk 9
10 Template Template talk 11
12 Help Help talk 13
14 Category Category talk 15
100 Portal Portal talk 101
118 Draft Draft talk 119
710 TimedText TimedText talk 711
828 Module Module talk 829
Former namespaces
108 Book Book talk 109
442 Course Course talk 443
444 Institution Institution talk 445
446 Education Program Education Program talk 447
2300 Gadget Gadget talk 2301
2302 Gadget definition Gadget definition talk 2303
2600 Topic 2601
Virtual namespaces
-1 Special
-2 Media
Current list (API call)

When the query goes through this template, the default search domain is article space, just as it is for basic users. The default search domain of a user, logged in or not, is article space unless the user has set their preference.[3] But no matter who uses a search link, the results will always be the same. "Cut and paste" can never guarantee the same results for a search, but a search link can, because the search domain is either just article space for everyone or the set of namespaces you set for everyone.

If you know a few search domain numbers you just type them in: ns=ns0&ns1&ns2600. You learn them from the namespace table. Otherwise you refine your query and search domain on the search results page, whose Advanced interface is designed to select and adjust namespaces with no knowledge of the namespace numbers. Once that produces satisfactory results, you copy the namespaces string from the URL (in your browser's address bar) and paste it into |ns=. Then you copy the query from the search box on the search results page and paste it as the query, and that's your search link.

You can use "all" to specify all namespaces:

((sl|query|ns=all))
((sl|query|label|all))

but when specifying "all", the query time is about seven times greater because there are that many more pages on the wiki than there are articles. If a more targeted search is possible, it runs much more quickly than the "all" search.

For example, if you have a query for which you know the search domain is 10 and 11, and you want no label, then you need a parameter 3, but you need no parameter 2, so per the template parameter rules the search link can be made in four general ways:

For another example, if you select the "Wikipedia" and "Help" namespaces, then run a query, the URL will show ns4=1&ns12=1. Copy that and paste it to |ns=ns4=1&ns12=1. (Note: you can ignore the "=1" part from the URL.)
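The copy step can also be sketched in code. The following is a minimal sketch, not part of the template: the helper name ns_from_url and the sample URL are assumptions for illustration. It keeps the namespace keys of a search URL and drops the "=1" parts.

```python
from urllib.parse import urlsplit, parse_qsl

def ns_from_url(url):
    """Return the namespace keys of a search URL joined for |ns=."""
    params = parse_qsl(urlsplit(url).query)
    # Keep keys such as "ns4" or "ns12"; the "=1" values are dropped.
    keys = [k for k, _ in params if k.startswith("ns") and k[2:].isdigit()]
    return "&".join(keys)

# Hypothetical address-bar URL after selecting Wikipedia and Help.
url = "https://en.wikipedia.org/w/index.php?search=example&ns4=1&ns12=1"
print(ns_from_url(url))  # ns4&ns12
```

The result is what you would paste after |ns=.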

Note how the URL contains ns0, ns1, ns2, and ns3, and how it got them:

((sl|systems operations|3=ns2|4=ns1|ns=ns3|20=ns0))
systems operations
((sl|query = systems operations|||ns2|ns1|ns3|ns0))
systems operations
((sl|systems operations|3=ns2&ns1&ns3&ns0))
systems operations

If you need to develop a specific search domain with a more elaborate set of namespaces, you can cut and paste the entire string from the URL of a search that has already run from a Special:Search Advanced page. Paste it into the one named parameter |ns=.

To type in namespaces 0, 1, 2, 3, 4 and 5, with no label, the two easiest ways are:

Advanced examples

For examples, see Help:Searching/Regex/Sandboxing.

All these involve insource:/slash delimited regex/ with filters. Any search link with an insource:/regex/ search should always provide the additional query terms that would filter (reduce) the search domain as much as possible. This template defaults to article space if no namespace is given, which is a filter itself in some cases.

Quoting

The need to match an equals in an article is not surprising, and is basic. You have to use ((=)) or |query= or |1= just to get the equals sign in your query to the search engine, or ((!)) to get the pipe character to the search engine. Both pipe characters and the equals sign are template sensitive for all templates, so you always quote them with curly brackets like that inside templates. Although the search box can take = and | directly quoting would also be necessary in the search box.

Regex are sensitive to punctuation, brackets, math and other symbolic characters, collectively known as "punctuation" so you quote them to accommodate that when targeting a punctuation character literally in the text. Otherwise they have their regex metacharacter meaning. The "metacharacters" of CirrusSearch have claimed most punctuation characters as functions in their regex, but you don't have to know all the metacharacter functions just to search for them as targets literally. You can simply quote all punctuation to search for them as literal targets in wikitext. The way to easily quote every character in an entire regexp is to put the whole term in quotes: insource:/"regexp with literal characters"/

To get a pipe character through both the template and the search engine to target it as a character in wikitext, you have to quote it twice, hence the frequent need for the six characters \((!)) in an advanced search link.
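As a rough illustration of this quoting, here is a small sketch that prepares a query for use inside a search link. The function name quote_for_template is hypothetical, and the markers ((=)) and ((!)) are written exactly as this page writes them.

```python
def quote_for_template(query):
    """Replace characters that a template would otherwise consume."""
    # ((=)) and ((!)) are written here exactly as this page writes them.
    return query.replace("=", "((=))").replace("|", "((!))")

print(quote_for_template('insource:/"x=1"/'))  # insource:/"x((=))1"/
print(quote_for_template(r"insource:/a\|b/"))  # insource:/a\((!))b/
```

The second call shows the doubly quoted pipe: the backslash quotes it for the regex, and the marker quotes it for the template.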

To generate advanced regex searches, see about doing so at ((regex)).

Search engine features

The search engine can

A search matches what you see rendered on the screen and in a print preview. The raw "source" wikitext is searchable by employing the insource parameter. For these two kinds of searches a word is any string of consecutive letters and numbers matching a whole word or phrase. All other keyboard characters like punctuation marks, brackets and slashes, math and other symbols, are not normally searchable.

By default Search will also stem the words and match them too. It automatically sorts results by the frequency and location of these, but also can boost page ranking by time, template usage, or even similarity to other pages.

Search is a search engine that does a full text search by querying an index database. It offers search syntax and parameters exceeding the capabilities and control of other public search engines that could search Wikipedia.

Page score

Say the search box is given two words. The search starts with two index lookups, and the two results are combined with a logical AND. But before they are displayed as search results, they must all be assigned a final score before the top twenty (listed on the first page) can be displayed, and they must be formatted with snippets and highlighting. Page ranking deals quickly with very large numbers of pages, by approaching things statistically, and taking several swipes through the data.

  1. The frequency and location of each word determines the first sorting.[4]
  2. The order of the words determines the second sorting. If the two words happen to be found in the same order on a page, that page is boosted again.
  3. The number of incoming links.[5]

These attributes for a word earn that page a higher score:

There can be several other scoring mechanisms. The parameters that you can control are morelike, boost-template, and prefer-recent.

General description

There are now eleven parameters for various approaches to searching the many namespaces. Four of the seven new parameters now offer to target these page characteristics: hastemplate and linksto, insource and insource:/regexp/. The other three now offer to target page ranking: morelike works all alone, a prefer-recent term can be added to any query, and there is now also a boost-template parameter. The other four, preserved in name only, from the entirely rewritten previous version of Search, are intitle, incategory, prefix, and namespace.

Any search will feature one of these approaches

The concept of a search domain plays an important part in all this. By default it is just article space, but in general a search domain starts out as a set of namespaces, and ends up as all the pages in the search result.

One term of a query will set the search domain for another term in the same query. The order is optimized by the search engine. The query term1 term2 transforms the search domain twice to get those search results. For example, a bare namespace returns the pages of the namespace. The query term1 term2 regexp relies heavily on the first two terms to reduce the search domain size.

All terms in a query are indexed searches unless they are a regexp. Indexed terms run word-wise instantly, and a regexp runs character-wise slowly. Even the most basic use of a regexp, just to find an exact string, should always limit the size of its search domain as much as possible. This can be as simple as adding a few terms (as covered below), because each term in a query tends to reduce the number of pages. Never run a bare regexp on the wiki, especially if your user profile is preset to Everything. The search engine limits the number of regexp searches that can run at once. Without the proper filter running alongside it, a regexp will run for up to twenty seconds and then incur an HTML timeout.

On the search results page, the initial search domain on which the query was run is indicated by the following, given in increasing power to override the others:

For example, if the namespace parameter is all, the size of the initial search domain will be the 60,679,532 pages in all namespaces: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 100, 101, 118, 119, 710, 711, 828, 829. A prefix parameter specifies just one of those namespaces, in whole or part. If the initial search domain is the default, Content pages, its size is the 6,824,560 pages in namespace 0 (article space).

A search can be set into a link to specialize and share searches: [[Special:Search/search]]. Such a query should always be fully specified by giving an initial search domain, so as to avoid user profile discrepancies. This way it gives the same results for everyone. For example, if more than one namespace is needed, use ((search link)).[6]

Other helpful approaches to the search engine features are

Syntax

Greyspace characters are the non-alphanumeric characters: ~!@#$%^&*()_+{}|[]\:";'<>?,./. Any string of greyspace characters and/or whitespace characters is "greyspace".

Greyspace is ignored except where it has meaning as a modifier in syntax.

Parameters also accept words and phrases, but each can search their own index and interpret their own arguments, such as for

The delimiters:

Colon : character:

Word and phrase

A search is a query with one or more terms. The query does not actually search the page database, but rather, a search queries a prebuilt, constantly maintained, search index database. When creating the search index of words on the wiki, or when entering a query, a word boundary is greyspace. Greyspace characters can create a multi-word_phrase. We must say tab and newline even though we cannot put those characters in our query; this is because of the important fact that the same analysis that is done on the wikitext is also done on the query. A word boundary is whitespace characters (tab, space, or newline) or greyspace characters. Greyspace characters and whitespace characters are all folded together as one, just as special characters like æ (ae) or á (a) are folded into the standard keyboard characters.

A phrase expresses an ordering of words,[7] and there are three ways to make one, depending on how aggressively you want the phrase to match.

  1. "Quotation marks" make an "exact phrase", so called because it matches exact wording: stemming, fuzzy search, and wildcards are not used in an "exact phrase". Like the rest of Search, an "exact phrase" tolerates greyspace between words.
  2. Joining_with_non-alphanumeric(characters) only makes a phrase that employs stemming on the words.
  3. CamelCaseNaming or letter222number transitions match the phrase in greyspace, with stemming, and additionally match the word itself.

Parameters can require the quotation marks to include whitespace in their input.

The wikitext is searched by employing the insource parameter. The insource parameter ignores greyspace characters too.

For example, to find the phrase http://en.wikipedia.org/wiki/Search_engine, use http://en.wikipedia.org/wiki/Search_engine, or use insource: "http en wikipedia org wiki search engine".
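Why both forms find the same thing can be sketched in a few lines. This is a simplified model, assuming only that greyspace and whitespace act as word boundaries and that case is folded; it ignores stemming, accent folding, and camelCase handling.

```python
import re

def words(text):
    """Split on greyspace and whitespace, folding case."""
    return re.findall(r"[a-z0-9]+", text.lower())

url = "http://en.wikipedia.org/wiki/Search_engine"
phrase = "http en wikipedia org wiki search engine"
print(words(url))                   # ['http', 'en', 'wikipedia', 'org', 'wiki', 'search', 'engine']
print(words(url) == words(phrase))  # True
```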

When you search for a word, that word is just looked up in an index. An indexed search instantly concludes with all search result titles, without having to search the wiki itself.

Each word you see in a page's content (a title's content) is already in an index, where it points to all its other prearranged results. A word is indexed to a list of page names, where it is seen in the text, or it is seen in the title only.
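A toy version of such an index may make this concrete. The sketch below is an assumption-laden miniature, not the engine's implementation: the three pages are invented, and each query word is one index lookup, with the lookups combined by a logical AND.

```python
import re
from collections import defaultdict

pages = {  # hypothetical page titles and text
    "Web search engine": "A search engine answers a query from an index.",
    "Query language": "A query language expresses a query.",
    "Steam engine": "A steam engine is a heat engine.",
}

# Build the index once: each word points to the titles where it is seen.
index = defaultdict(set)
for title, text in pages.items():
    for word in re.findall(r"[a-z0-9]+", (title + " " + text).lower()):
        index[word].add(title)

def search(query):
    """AND together one index lookup per query word."""
    lookups = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*lookups) if lookups else set()

print(sorted(search("engine query")))  # ['Web search engine']
```

Note that search never reads the page text again; it only consults the prebuilt index.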

Each indexed word is seen as

For transitions from lower to upper case (camelCase), and for letter-digit or digit-letter transitions, the parts match singly or together. In other words you don't need the space to find the whole word, but a space also works to find either "word" of a camelCase or mixed alphanumeric word. Non-alphanumeric characters are treated as that same null space.

We may call these "word" characters or "alphanumeric" characters at times as opposed to the "non-word" characters, which are ignored except as to function as a word boundary. Usually a word boundary is just a space character.

These words are case-insensitive: a-z is equivalent to A-Z, so Search box will navigate to a pagename regardless of capitalization (even though wikilinks and URLs must match capitalization apart from the initial character).

Each word is aliased to all its word-stems, so cloud, clouding, clouds, clouded, cloudy will all point to the same index entry.

In Search the characters !@#$%^&*()_+-={}|[]\:;'<>,.?/ are ignored. Any mix of whitespace characters and these non-word characters, we may refer to as grey-space. Grey-space, then, is all non-word characters except the double quote character, which is not ignored.

Grey-space is a string of one or more characters such as brackets and math symbols and punctuation and space. Now, a search-indexed word will be found between grey-space, and grey-space is an implied AND of two words in a search query, but the AND is not always implied: when two phrases exist side-by-side the AND is required.

Exceptions to what "words" are indexed are these portioned words:

The word boundary between such numeric portions and alphabetic portions may include grey-space or not, but a phrase search turns off portioning, because it is an "exact phrase search": the words in the phrase match only alphanumeric words delimited by grey-space.

Words joined only by non-alphanumerics are treated like a phrase. So word1_word2&word3 is the same as "word1 word2 word3". However, they will also match camelCase and letter-number transitions. An exact phrase search will not match camelCase or letter-number transitions. For example, terms like wgCanonicalNamespace and !wgCanonicalSpecialPageName can be found by looking for canonical page name.

For example:

The following match the single term txt2regEx on a page: txt , 2 , regex , reg , ex , txt2 , 2reg , 2regex. None of those portions would match in a phrase search; only "txt2regex" would match.[8]
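The portioning can be approximated with a short sketch. The splitting rule below is an assumption that reproduces the single portions listed here; it does not claim to be the analyzer the search engine actually uses.

```python
import re

def portions(word):
    """Split a word at camelCase and letter-number transitions."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", word)
    return [p.lower() for p in parts]

print(portions("txt2regEx"))             # ['txt', '2', 'reg', 'ex']
print(portions("wgCanonicalNamespace"))  # ['wg', 'canonical', 'namespace']
```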

The following match the two terms 2 + 2 : 2 or "2" , 2 2 or "2 2" , "2 2" or "2" , "2+2" or 2+2 , "2-2" or 2-2 , "2.2" or 2.2 Each term is a query, and the grey-space is an AND.

Fuzzy search, wildcards, and stemming

Stemming is a way to match meaning "ambitiously", to get the numbers up, for possible semantic matching, such that run_shoe also matches running shoes. Stemming is a spelling algorithm only distantly reliant on any dictionary.[9] The algorithm attempts to find the same word, but in all its word endings.

A fuzzy search will match a different word. Words (but not phrases) accept approximate string matching or "fuzzy search". A tilde ~ character is appended for this "sounds like" search. The other word must differ by no more than two letters.

But it can differ by one letter in these ways. A fuzzy search matches the word exactly plus words like it.
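One common way to measure "differs by no more than two letters" is edit distance. The sketch below uses the standard Levenshtein distance as a stand-in; whether the search engine counts differences in exactly this way is an assumption, and the function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def fuzzy_match(term, word, limit=2):
    """True if word is within `limit` edits of term."""
    return edit_distance(term.lower(), word.lower()) <= limit

print(fuzzy_match("serch", "search"))    # True
print(fuzzy_match("serch", "research"))  # False
```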

With wildcards you can specify which letters change, including the first two letters, and you can increase the number of letters that can change. Wildcards have their own rules:

While the word indexes are being built and updated, stemming automatically adds aliases to most entries. An actual dictionary is not used. Instead it runs an algorithm that applies generic English syntax rules for word endings. The results are imperfect.[10] Even misspelled words, non-words, and words with numbers in them are indexed and stemmed in this way. By adding different forms of the same word to the indexed search query, stemming is a standard method search engines use to aggressively garner more search results to then run a bunch of page-ranking rules against.

For example, stemming will alias cloud, clouds, clouded, and clouding. It will not alias the word cloudy, but it will alias the various forms of cloud to the non-word cloudion, because -ion is a common word ending.
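The cloud example can be imitated with a toy suffix stripper. This is a deliberately crude sketch, assuming a made-up list of four endings; the real stemming algorithm has many more rules.

```python
SUFFIXES = ("ing", "ion", "ed", "s")  # hypothetical, far from complete

def toy_stem(word):
    """Strip one common ending, keeping at least three letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

for w in ["cloud", "clouds", "clouded", "clouding", "cloudion", "cloudy"]:
    print(w, toy_stem(w))
```

No dictionary is consulted, which is why the non-word cloudion lands on the same stem as the real forms while cloudy does not.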

Stemming is automatically turned off for insource searches:

To turn stemming off, put the word in quotation marks; this is an "exact phrase" search.[11]

For example: gameFolks, game!folks, game:folks matches FolksSoul

Proximity

An "Exact phrase" or a word will match in a title. And creating a phrase "with tilde"~ just turns on stemming, (which is equivalent to forming a phrase by joining the words with_greyspace). But "exact phrase"~1 matches the wording in that order plus allows any one extra word to fall between the two words.

For example

"hitch4 hiker2" finds the two "words" in that order, (possibly separated by punctuation or brackets or other keyboard symbols like math symbols), and without the quotes finds them in the same article. In both cases the article is listed when the space satisfies the logical AND meaning.

hello_dolly does the same thing as "hello dolly" does, but the double quotes version offers a proximity filter. After the closing quote you add a tilde ~ and a number that indicates the total number of words allowed between all the terms.
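The tilde number can be pictured with a small in-order check. This is a simplified sketch under stated assumptions: it only handles terms in the given order, it counts the extra words between the first and last term, and the function name is hypothetical. The engine's own proximity scoring is more involved.

```python
def in_order_within(page_words, terms, extra):
    """True if terms occur in order with at most `extra` words between them in total."""
    for start, word in enumerate(page_words):
        if word != terms[0]:
            continue
        pos = start
        try:
            for term in terms[1:]:
                # Earliest later occurrence keeps the span as short as possible.
                pos = page_words.index(term, pos + 1)
        except ValueError:
            break  # a later term never occurs after this start
        if pos - start - (len(terms) - 1) <= extra:
            return True
    return False

page = "word1 word2 word3".split()
print(in_order_within(page, ["word1", "word3"], 0))  # False
print(in_order_within(page, ["word1", "word3"], 1))  # True
```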

Backward proximity works too, but includes the two end words between each segment. Proximity cannot make the last word proximate to the first. The proximity can be a large number, like 500 or 1000.

Say a page has word1 word2 word3 in that order.[12]

Two search terms with no quotes are two filters, and a bunch of page-ranking rules.

Search logic

Truth logic is AND, OR, and not.

Logical OR increases results, whereas logical AND decreases them. Logical not is a good way to refine a query by removing any kind of term except the prefix parameter.

For example: while -refining -unwanted search results. As another example, credit card -"credit card" finds all articles with "card" and "credit" that do not contain the exact phrase "credit card".
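In set terms, AND is an intersection, OR is a union, and not is a difference. The sketch below works through the credit card query on three invented articles; the helper names and the article text are assumptions for illustration only.

```python
articles = {  # hypothetical titles and text
    "Credit card": "a credit card is a payment card",
    "Card game": "players may keep score and earn credit on each card",
    "Credit score": "a number describing credit risk",
}

def has_word(word):
    return {t for t, text in articles.items() if word in text.split()}

def has_phrase(phrase):
    return {t for t, text in articles.items() if phrase in text}

# credit card -"credit card"
result = (has_word("credit") & has_word("card")) - has_phrase("credit card")
print(sorted(result))  # ['Card game']
```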

Prefix and namespace

Prefix and namespace are the only positional parameters, and namespace is an unnamed search parameter. One or the other of them is used in a query to override the initial search domain set by user profile or by the search bar. They aren't used together: prefix overrides namespace.

The namespace argument must be at the beginning of a query, and the prefix: parameter must be at the end of a query.

Namespace

Namespace: is an unnamed search parameter that goes at the beginning of a query.[13] The namespace matches a namespace name and is followed by a colon, followed by zero or more whitespace characters. The namespace names and "all" work as expected, but seeing one in the search box does not guarantee that it represents the search results, as explained below.

In addition to the usual namespace names and their aliases

Pages with namespaces outnumber pages without them 7 to 1.

On the search bar at the search results page

These differ from namespace "all" by matching your search terms inside a pdf on a help:file page, that item on the search results page says "(matches file content)".

For example file:"885.7 seconds" matches inside a pdf, but all:"885.7 seconds" does not.

Prefix

prefix:namespace: string  filters a namespace down to one or more pages where string matches the pagename's beginning characters.[16] For example, prefix:help:t  finds Help pagenames that begin with "T".

Prefix can perform the function of the namespace filter, plus it can isolate a single article whereas intitle cannot. Prefix cannot isolate a single page if it has subpages.

An alternative to a prefix query is Special:PrefixIndex:

Compared

Comparing the namespace and prefix parameters:

The following methods set an initial search domain by namespace:

These are in the order of precedence. A prefix overrides a namespace overrides the GUI. The argument to the prefix parameter is a fullpagename, which conveys a namespace.

When alternating search domains, with the various techniques, and because of their priorities, it deserves repeating: check the search bar indication; it is most subtle. [17] The Advanced namespace selection pane from the search bar is not so subtle. It will remain for as long as the earlier selection "remember selection for future searches" is in effect. You can "remember" article space and then either 1) press Content, 2) choose another search bar search domain, or 3) remove all instances of &profile=advanced from the URL.

Page attributes

These five search parameters filter a namespace according to an input word or phrase.

These parameter names must be in all-lowercase letters.

Intitle

Intitle finds a word or phrase in a pagename. As with a word or phrase search, stemming and fuzzy search can apply.

To find a match in a redirect title, or to apply a proximity search to a title you can rely on page ranking software to boost title matches before content matches. So a basic word or phrase search, or proximity search, is an alternative to intitle.

For example

intitle: "forest ridge" finds one, while the proximity search
"forest ridge"~3 finds a dozen related titles immediately.
intitle: image_label shows stemming while intitle: "image label" does not.
intitle:juggle shows stemming.
intitle:sun intitle:moon shows how to search for two words in one title.

Incategory

Incategory has the general format

incategory: "category|category|...|category"

and selects from the pages section of given category pages, those pages that are also in the search domain.

Because many pages outside the mainspace are also categorized, the counts often won't match the category unless the search domain is the entire wiki:

Multi-category input counts a page only once. The following two categories have 209 pages in article space, with six pages found in both categories:

incategory:"Information retrieval techniques" incategory:"Natural language processing" (6)
incategory: "Natural language processing" (159)
incategory: "Information retrieval techniques" (50)
incategory: "Information retrieval techniques|Natural language processing" (203:= 209−6)
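The counts above fit together by simple arithmetic, sketched here with the figures quoted in this section (the variable names are made up):

```python
nlp = 159        # incategory: "Natural language processing"
retrieval = 50   # incategory: "Information retrieval techniques"
both = 6         # pages found by requiring both categories

either = nlp + retrieval - both  # pages in both were counted twice
print(nlp + retrieval, either)   # 209 203
```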

On the other hand these are disparate categories:

Because of the nature of Wikipedia:categorization these categories share no pages:

Categories and Search are synergistic.

In the following examples, note how the page description in the category namespace shows category sizes instead of page sizes.

Hastemplate

Hastemplate finds pages that transclude a given template. Finds template usage, not just a name pattern, because it will find all pages where the template content itself was used in any way. The results differ slightly depending on the alias you give.

If you don't find the searched template name on the wikitext of the page, it can mean either that you gave the canonical pagename but it found an alias, or that it was called as a secondary template by way of a template that is shown in the wikitext. To find visible (primary) calls only, use insource.

Insource

Insource: term   finds a word or phrase in wikitext.

Unlike a normal search insource doesn't find things "sourced" by a transclusion.

Insource targets wikitext in two ways. They look similar, but the regexp form employs the slash / character to delimit the regexp.[19]

  1. insource: term   finds an indexed word or phrase.
  2. insource:/regexp/   targets the entire wikitext of every page in the search domain as one long string of characters per page, either having a pattern or not. This is the "regular expression" (or regexp, or regex). Its metacharacters can represent multiple possibilities for a character position or a range of character positions within a page, using metacharacters for truth logic, grouping, counting, and modifying the characters to be found.

A basic regexp is an easy way to find a specific /"exact string"/, as shown below. The double quotes are field delimiters. They are escape characters that quote the whole set of characters between them and keep their interpretation literal (they keep any metacharacter interpretation from occurring).

An advanced regexp uses the metacharacters to program general string patterns. It finds everything, even pieces and parts of words, conveying no notion of "words", but only that of a string of characters in a sequence. Metacharacters are interpreted unless quoted by a backslash, double quotes, or square brackets. See the section on regex. The obvious example is that you must quote any slash in your pattern so it won't be interpreted as the closing slash delimiter, using \/ instead of / to match a literal slash. A regexp interprets all metacharacters. Testing a regexp pattern responsibly requires limiting the search domain.

Abusing regexp will not harm Wikipedia performance, but it limits regex search information from flowing elsewhere.

Only regex interpret greyspace characters. The regular insource, as everywhere else, ignores greyspace characters. So insource:"M S" matches m/s, as do insource:"M-S" and insource:"m=s". But insource:/M\/S/ will match it, and the filtered version will too: insource:"M/S" insource:/M\/S/. The insource:"word1 word2" filter is the most obvious filter for insource:/word1 word2/, where the two wikitext words are only separated by punctuation and space. Say the target string is ((Val|9999|ul=m/s|fmt=commas)):

Insource matches words sequentially, but the match could occur anywhere on the page, not necessarily inside the ((template markup)). For this there is ((template usage)), and it matches any regex inside the template.

For thorough precision, use /regex/. For example, to find any bare URL inside <ref name=name>...</ref>, possibly with [external link brackets label] and possibly with ref name=name, you can't use the simpler insource:"ref http server com". Taking a cautious approach, before trying the full regexp, create a search domain under 10,000 pages. Starting with two filters, prefix and insource:

  1. insource: "ref http" prefix:A gives 98000, which is too many to start.
  2. insource: "ref http" prefix:AA gives 1000, which is good.
  3. So you try adding a regex term insource:/\<ref[^>]\> *\[?https?:\/\/[^][<> "]+\]? */ and get zero for prefix:AA, one for prefix:AB.
  4. So you try just insource:/\<ref[^>]\>/ instead, and then try prefix:AA: zero; try AB: one.
  5. You notice you forgot the modifier for [^>]*.
  6. insource: "ref http" insource:/\<ref[^>]*\>/ prefix:AB. There are 3700, and that is OK.
  7. Experiment further. Then decide to do the project in segments AA, AB, AC, ... ZZ.
  8. insource:/\<ref[^>]*\> *\[?https?:\/\/[^][<> "]+\]? */ insource: ref prefix:AA
We have the only possible filter insource: ref prefix:AA. That filter produces a regex search domain of only 2300. The filter insource: ref prefix:A produces a search domain of 264000. Running the regex on that many pages is possible, and produces 64000 results.

To find a more targeted URL, say yahoo.brand.edgar.com, use insource: "http yahoo brand edgar com"   (or cut and paste the entire URL, slashes, dots, and all; it doesn't matter). Do another search with the https version. These searches are capable of more flexibility than Special:LinkSearch. No filter is needed, but every search always benefits from extra information: any word, any phrase, and most parameters.

Linksto

Linksto Reports wikilinks to a page name.

Linksto reports wikilinks to a page name, even if the wikilink is

Linksto can differ from the "What links here" tool, because the search domain for "What links here" is all. Linksto search results are in your default search domain. (Also linksto reports the count, as do all searches.)

In addition to wikitext, it searches inside a page's transcluded content.

first, and then scan the contents.[20] For example

linksto:"Mozart and scatology"

will report a list of 300 articles that link to it, as will "What links here". But Mozart and scatology is actually linked only 15 times by content authors. The rest are due to Mozart and scatology in Template:Wolfgang Amadeus Mozart on the unwanted pages. The template is wanted, but the "links to" reference is probably not.[21]

The trick to getting around this, and just finding all authorship links to an article is a regexp search:

: insource:"pagename"   insource:/\[\[ *[Pp]agename *[]|]/

That search will find articles only, because the initial : limits the initial search domain to article space, no matter how your default search domain happens to be set. It will find all of the links many times more quickly than a bare regexp would, because the first insource term instantly creates the refined search domain that sets the proper limits for the regexp search. A regexp can accommodate the variations that wikilink syntax permits: 1) the metacharacter * allows for "zero or more" space characters before and after the title, 2) the [character class] at the beginning allows for the relaxed capitalization of the first character in any pagename, and 3) the character class at the end finds the link whether it is labeled via the pipe character | or closed via the square bracket ] of the wikilink.
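The three accommodations can be tried out offline. The sketch below builds the same kind of pattern for an arbitrary pagename in Python's re dialect; the helper name wikilink_regex is hypothetical, and the search engine's own regex dialect may treat some characters differently, so this only illustrates the logic.

```python
import re


def wikilink_regex(pagename: str) -> str:
    """Build a pattern for [[pagename]] or [[pagename|label]] links.

    Illustrative helper (not part of any template): the first letter may
    be in either case, spaces may surround the title, and the link may
    end with a pipe or a closing bracket.
    """
    first, rest = pagename[0], pagename[1:]
    if first.isalpha():
        first_class = "[" + first.upper() + first.lower() + "]"
    else:
        first_class = re.escape(first)
    return r"\[\[ *" + first_class + re.escape(rest) + r" *[\]|]"


if __name__ == "__main__":
    pattern = re.compile(wikilink_regex("pagename"))
    for sample in ["[[pagename]]", "[[ Pagename | label]]", "[[pagenames]]"]:
        print(sample, bool(pattern.search(sample)))
```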

Links to transclusions are handled by hastemplate.

Sorting results

A page's overall score determines its place in the search results.

A better match will raise the score.

Wikiproject "importance" and article quality assessments can factor in. Searching from a page, its categories, wikidata, and geo-location can factor in.

Knowing this you may be able to better find, for example, a half-remembered title. Using intitle may skew the results too much because of the order of the words. Use those in a word search, and depend on page ranking. The titular words will show up on top.

To get an idea of how CirrusSearch might work see mw:Search/Old#Search_Weighting_Ideas.

To sort search results by date, use prefer-recent. To sort search results by template usage, use boost-template.

Morelike

The morelike search parameter lists all articles that compare in word frequency and word length to one or more given articles.

morelike: pagename | pagename2 | ... | pagename50

Morelike calculates a multi-word search.

: word1 word2 ... wordN

See them highlighted in the snippet.

Morelike looks up the given pagename(s) in the search index, creates a word-frequency aggregate and a word-length aggregate from all the words, and calculates a multi-word search based on those, plus internal, variable settings. It is an expensive search.

For example, say you search for

morelike:William H. Stewart

then pick a name from that list and add it

morelike:William H. Stewart|Leroy Edgar Burney

then add more names, until you have five input pagenames. Then you could begin blindly adjusting this automatically calculated morelike query, saying the following sorts of things: Make the calculated query

Then, say, you adjust the number of input pagenames that have a word to two (out of five). https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&search=morelike:ant%7Cbee%7Cwasp%7CEusociality%7Ctermite&fulltext=Search&cirrusMltUseFields=yes&cirrusMltFields=opening_text&limit=1150

It can also find similar articles based on just the title, or on just the headings, or on just the lead section.

The search results depend on internal (Mlt, More like this) variables, settable via the URL, concerning which words to search with:

&cirrusMltMinDocFreq How many articles with a search word, minimally
&cirrusMltMaxDocFreq How many articles with a chosen word, maximally
&cirrusMltMaxQueryTerms number of search words, maximum
&cirrusMltMinTermFreq Minimum word frequency of a chosen word.
&cirrusMltMinWordLength Minimal length of a term to be considered. Defaults to 0.
&cirrusMltMaxWordLength The maximum word length above which words will be ignored. Defaults to unbounded (0).
&cirrusMltFields A comma separated list of the fields to use. Allowed fields are title, text, auxiliary_text, opening_text, headings and all.
&cirrusMltUseFields (true or false) use only the field data. Defaults to false: the system will extract the content of the text field to build the query.
&cirrusMltPercentTermsToMatch The percentage of terms to match on. Defaults to 0.3 (30 percent).

For example, here is what the address bar (turned search bar) looks like for a morelike search on the lead sections of two articles, as compared to other lead sections: https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&search=morelike:William+H.+Stewart%7CLeroy+Edgar+Burney&fulltext=Search&cirrusMltUseFields=yes&cirrusMltFields=opening_text Notice the two URL parameters added at the end, which activate a morelike capability.
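Such URLs are tedious to assemble by hand. The following is a minimal sketch in Python of one way to build them, using only the parameter names listed above. The function name morelike_url is hypothetical, and the value "yes" for cirrusMltUseFields is copied from the example URL rather than confirmed against the software.

```python
from urllib.parse import urlencode

BASE = "https://en.wikipedia.org/w/index.php"


def morelike_url(pagenames, fields=None, **mlt_settings):
    """Return a Special:Search URL for a morelike query.

    pagenames    -- list of page titles to compare against
    fields       -- optional list such as ["opening_text"]
    mlt_settings -- any other cirrusMlt... URL parameters, passed through
    """
    params = {
        "title": "Special:Search",
        "profile": "default",
        "search": "morelike:" + "|".join(pagenames),
        "fulltext": "Search",
    }
    if fields:
        # Assumption: "yes" is accepted here, as in the example URL above.
        params["cirrusMltUseFields"] = "yes"
        params["cirrusMltFields"] = ",".join(fields)
    params.update(mlt_settings)
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(morelike_url(["William H. Stewart", "Leroy Edgar Burney"],
                       fields=["opening_text"]))
```

urlencode percent-encodes the colon and the pipe character, which is equivalent to typing them literally in the address bar.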

Prefer-recent

You can sort search results by date.

It can go anywhere in the query. By default it treats 160 days as "recent", and applies its boost formula to 60% of the score. The formula is not the usual multiplier; it is an exponential multiplier, potentially much more powerful. This enables it to work even where "recent", instead of being 160 days, is as little as 9 seconds. If your "recent" means 9 seconds, use prefer-recent:0.0001

For example, if you're only interested in the relatively few articles that have changed in the last week, use 7 instead. How this works is that all articles older than seven days are only boosted half as much, and all articles older than 14 days are boosted half as much again, and so on.

The boost is more than the usual multiplier; it is exponential. The factor used in the exponent is the time since the last edit. The bigger the time since the last edit, the less the boost. The formula is e^(−t), where t is either the interval in days or the interval of interest.

Add prefer-recent to the beginning of a search. It will give the more recently edited articles a boost in the search results. The general form is

prefer-recent:proportion_of_score_to_scale,half_life_in_days

This parameter accepts two comma-separated arguments that allow adjusting the default settings. By default this will scale 60% of the score exponentially with the time since the last edit, with a half-life of 160 days. So the default is prefer-recent:0.6,160.

This can be changed to increase the weight:

prefer-recent:0.8,360

or decrease it:

prefer-recent:0.4,10

The proportion_of_score_to_scale must be a number between 0 and 1 inclusive. The half_life_in_days must be greater than 0 but allows decimal points, and so works pretty well to sort close edit times if very small.

For example prefer-recent:0.6,0.0001 operates with a half-life of 8.64 seconds
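The arithmetic behind the two arguments can be sketched in Python. This is only an illustration of half-life decay as described above, not the search engine's actual scoring code. In particular, the way scaled_score combines the scaled and unscaled parts of the score is an assumption, and all function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def half_life_seconds(half_life_in_days: float) -> float:
    """Convert the second prefer-recent argument from days to seconds."""
    return half_life_in_days * SECONDS_PER_DAY


def decay_factor(days_since_edit: float, half_life_in_days: float = 160.0) -> float:
    """Halve the boost for every half-life that has passed since the edit."""
    if half_life_in_days <= 0:
        raise ValueError("half_life_in_days must be greater than 0")
    return 0.5 ** (days_since_edit / half_life_in_days)


def scaled_score(score: float, days_since_edit: float,
                 proportion: float = 0.6,
                 half_life_in_days: float = 160.0) -> float:
    """Assumed combination: only `proportion` of the score decays."""
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1 inclusive")
    untouched = score * (1 - proportion)
    decayed = score * proportion * decay_factor(days_since_edit, half_life_in_days)
    return untouched + decayed


if __name__ == "__main__":
    print(half_life_seconds(0.0001))   # about 8.64 seconds
    print(decay_factor(14, 7))         # two half-lives have passed
```

Under these assumptions, a page last edited exactly one half-life ago keeps all of the unscaled 40% of its score and half of the scaled 60%.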

This will eventually be on by default for Wikinews.

Boost-templates

Boost-templates:" " adds weight to pages with the given template or templates (plural). Using this search parameter overrides the normal template-boosting function of Search. Don't use this search parameter without supplying the weight-boosting argument unless you mean to disable the template weighting function for the search.

The general format is

boost-templates:"Template:pagename|parameter Template:pagename|parameter"

Normally the system message[22] titled MediaWiki:cirrussearch-boost-templates boosts the score of the following fullpagenames: Template:Featured article|200% Template:Featured picture|200% Template:Featured sound|200% Template:Featured list|175% Template:Good article|150% Template:Sockpuppet category|5% Template:Maintenance category|5% Template:Hidden category|5% Template:Tracking category|5% Template:Category class|5% Template:Category importance|5% Template:CatTrack|5% Template:Template category|5%. These are the actual template names and their actual boosts. They are replaced whenever boost-templates is used.

For example a search for "phenom" AND "lecture", with the templates Search link and regexp having the weighting score of the pages they are on multiplied by 1.5 and 2.25 respectively, ignoring all other templates (halting the addition of any score for any other template):

phenom lecture boost-templates:"Template:search link|150% tlusage|225%"
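The relationship between a multiplier and its percentage can be sketched in Python. The helper boost_templates_term is hypothetical; it only formats the search term in the general format shown above and does not talk to the search engine.

```python
def boost_templates_term(boosts):
    """Format a boost-templates term from {template name: multiplier}.

    A multiplier of 1.5 becomes 150%, 2.25 becomes 225%, and so on.
    """
    parts = []
    for template, multiplier in boosts.items():
        percent = multiplier * 100
        parts.append(f"{template}|{percent:g}%")
    return 'boost-templates:"' + " ".join(parts) + '"'


if __name__ == "__main__":
    term = boost_templates_term({"Template:search link": 1.5, "tlusage": 2.25})
    print("phenom lecture " + term)
```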

Boost-templates differs from hastemplate in

If you just want your search results to include only pages with certain templates, use hastemplate one or more times instead, to filter out pages that don't have them. Otherwise, choose a multiplier similar to those in the system message shown above. Multiplying a page score by 10 is done with 1000%, but this will probably mask all other weighting functions: a rule such as "when the search words match in the title" will then have little effect on the presentation of search results. It is not recommended, because it affects the order of the entire list.

Either hastemplate or boost-templates can go anywhere in the query, with other terms on either side of it.

Bugs

Relevant issues in CirrusSearch:

Workarounds

Troubleshooting


Indexed search

All pages on Wikipedia are scanned and indexed by Wikipedia's own search engine. The entire wiki is treated as one "full text" kept in a separate database (an "index") built just for searching. It's like the index in a book, but practically every word and every number is indexed to every page.[23]

Since each word in the prebuilt search index already points to the pages that contain it, a keyword search usually corresponds to a single record lookup in the index. (This is also true for phrases, to a certain extent.) "Index searches" take basically no time to execute. They are cheap and plentiful.

There are separate indexes kept updated for:

Any text transcluded from a template is indexed as if it were really present on its target page. (In other words, by default, a keyword search is done on the text of the rendered Wikipedia page, not on the page source itself. However, you can change this by using insource:keyword to search the source markup instead of the rendered page.)

Preparing and maintaining the search indexes is done by Wikipedia's servers, in the background, in near real time. As soon as you save the page, a few seconds later you can search for the changes you just made. For templates that are transcluded onto many many pages, the propagation of those changes to all the pages in the index might take a while.

The index is based on alphanumeric characters; it stores no information on non-alphanumeric characters. If you type any punctuation or brackets into the search box when doing an indexed search, those characters will be silently discarded.
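The idea of an index lookup that ignores punctuation can be illustrated with a toy inverted index in Python. This is a deliberately simplified sketch: the real index also handles stemming, word positions, and ranking, and the function names and sample pages here are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Keep runs of letters and digits; drop punctuation and brackets."""
    return re.findall(r"[^\W_]+", text.lower())


def build_index(pages):
    """Map each word to the set of page titles that contain it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in tokenize(text):
            index[word].add(title)
    return index


def keyword_search(index, query):
    """Return titles containing every word of the query (word order ignored)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    pages = {
        "Page A": "The speed is 3 m/s.",
        "Page B": "A [[search engine]] query",
    }
    index = build_index(pages)
    print(keyword_search(index, "M-S"))
    print(keyword_search(index, "search engine]]"))
```

Because the query passes through the same tokenizer as the page text, "m/s", "M-S", and "m=s" all become the same two words before the lookup.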

A basic indexed search

Regular expression search

Instead of doing a basic indexed search on keywords, you can perform a regex search, which bypasses the index. A regex search scans the text of each page on Wikipedia in real time, character by character, to find pages that match a specific sequence or pattern of characters. Unlike keyword searching, regex searching is by default case-sensitive, does not ignore punctuation, and operates directly on the page source (MediaWiki markup) rather than on the rendered contents of the page.

To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/. The expression regex denotes a regular expression in MediaWiki-flavored regular expression syntax.

Use regexes responsibly

Because regex searching scans each page character by character, it is generally much slower than an index search. You can — and should — add additional search terms when using insource:/regex/ to reduce the amount of text being processed. For example:

Adding an index-based search term to reduce the amount of text being scanned is important simply to make your own regex search finish in a reasonable amount of time. Regex searches that take too long will "time out" and return only partial results. Overuse of slow regex searches might cause temporary throttling of the feature for yourself and/or everyone on Wikipedia. (However, you cannot affect the site performance of Wikipedia as a whole simply by abusing regex search.) Remember that a single regex search can take multiple seconds, and there are currently 47,412,346 registered users on Wikipedia. Use regex search responsibly.
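Why a filter helps can be seen in a small simulation. The Python sketch below applies a cheap word filter first and runs the regex only on the pages that survive, counting how many pages the regex had to scan. It is a toy model under stated assumptions: the real search engine looks the filter words up in a prebuilt index rather than tokenizing each page, and the function name and sample pages are hypothetical.

```python
import re


def filtered_regex_search(pages, filter_words, pattern):
    """Run `pattern` only on pages that contain all of `filter_words`.

    Returns (matching titles, number of pages the regex scanned).
    """
    regex = re.compile(pattern)
    wanted = [word.lower() for word in filter_words]
    hits = []
    scanned = 0
    for title, wikitext in pages.items():
        # Stand-in for the index lookup: a set of the page's words.
        words = set(re.findall(r"[^\W_]+", wikitext.lower()))
        if not all(word in words for word in wanted):
            continue  # filtered out cheaply; the regex never sees this page
        scanned += 1
        if regex.search(wikitext):
            hits.append(title)
    return hits, scanned


if __name__ == "__main__":
    pages = {
        "Alpha": "Speed: ((val|3|u=m/s))",
        "Beta": "No templates here",
        "Gamma": "Area: ((val|2|u=ft2))",
    }
    print(filtered_regex_search(pages, ["val"], r"u=m/s"))
```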

Metacharacters

MediaWiki's regular expression syntax works like this:

There are a few additional quirks of the syntax:

Workarounds for some character classes

Although character classes \n, \s, \S are not supported, you may use these workarounds:

PCRE MediaWiki Description
\n [^ -􏿽] A newline (also a tabulation character can be found[1])
[^\n] [ -􏿽] Any character except a newline and tabulation
\s [^!-􏿽] A whitespace character: space, newline, or tabulation
\S [!-􏿽] Any character except whitespace

^ To exclude the tabulation character as well, copy it and add it to the character set.

In these ranges, " " (space) is the character immediately following the control characters, "!" is the character immediately following space, and "􏿽" is U+10FFFF, the last character in Unicode. Thus, the range from " " to "􏿽" includes all characters except for control characters (of which articles may contain newlines and tabulation), while the range from "!" to "􏿽" includes all characters except for control characters and space.
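The four ranges can be checked offline. The sketch below rebuilds them in Python's re dialect, which accepts the same kind of character range. It only demonstrates the logic of the ranges and does not prove that the search engine treats every character identically. The constant names are hypothetical.

```python
import re

LAST = "\U0010FFFF"  # the last character in Unicode

# Same ranges as in the table, written for Python's re module.
NEWLINE_OR_TAB = re.compile(f"[^ -{LAST}]")   # stands in for \n
NOT_NEWLINE = re.compile(f"[ -{LAST}]")       # stands in for [^\n]
WHITESPACE = re.compile(f"[^!-{LAST}]")       # stands in for \s
NON_WHITESPACE = re.compile(f"[!-{LAST}]")    # stands in for \S

if __name__ == "__main__":
    for name, regex in [("newline-or-tab", NEWLINE_OR_TAB),
                        ("whitespace", WHITESPACE)]:
        for char in ["\n", "\t", " ", "a"]:
            print(name, repr(char), bool(regex.fullmatch(char)))
```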

Sandboxing procedure

Entering an equals sign, a pipe character, and "quotes around phrases" is straightforward in the search box, but it is still easiest to use a regex-based search-link template, ((regex)) or ((tlusage)), on the page with the sample data, because then you can focus on the target data there and on writing the regexp pattern. It is easier, that is, if you already understand how templates "escape" the pipe character and the equals sign. See Help:Template#Parameters for other important details.

The procedure here is an iterative, read-evaluate-modify cycle. Regex development requires that you study the target data while writing and rewriting its pattern.

  1. Navigate to a page with the wikitext instances you are interested in mining. Or create one yourself, and save it to the database so the query will find it.
  2. Open the wikitext, and enter a ((regex)) or ((tlusage)).
  3. Show preview, and activate the search link. On the search results page, note the bold text in each match.
  4. Go back in your browser. Modify the regexp, and cycle until done. (Or don't go back, you may want to modify the query at the search box.)
  5. Expand the search domain, and test the accuracy of those results. You can trim or expand the number of the results using prefix:.

Caveat emptor: if you change the target for an immediate retesting, you'll have to save and purge, but not if you just change the regexp.


Examples

As an ad hoc sandbox, you can show the wikitext of a section like this, (already saved in the database), modify some of the patterns in the regex-search-link template calls on this page, do a Show Preview, and see what matches when you click on the newly formed regex search-link, all quite safely, and without changing a thing in the database.

The template calls that produce "ft/s, 2 sq ft, 3 m/s, 4 m*s-2, 5 ft.s-2, 6 °C/J, and J/C" appear in the wikitext of this section like this:

  1. ((val|1|ul=ft/s|fmt = commas))
  2. ((val|2|u=ft2))
  3. ((val|3|u=m/s| fmt =commas ))
  4. ((val|4|u=m*s-2))
  5. ((val|5|u=ft.s-2))
  6. ((val|6|u=C/J))
  7. ((val|7|ul=J/C))

Note how the above targets are |numbered|, then click on the links below.

Query Search link Answer
Q1 Using ((search link)), does this page employ template Val ? ((sl|hastemplate: Val))hastemplate: Val A. No, because this pagename is in Help not Article space.(Search link default). 1300 search results.
Q2 Using ((search link)) responsibly, does this page use Val's fmt parameter? ((sl|insource:/\{[Vv]al\((!))[^}]*fmt/ prefix:((FULLPAGENAME))))

insource:/\{[Vv]al\|[^}]*fmt/ prefix:Template:Search link/doc

A2.1. Look for 1 and 3 in the search results in bold text. (Adds an appropriate filter.)
Using ((regex)) instead... ((slre|\{[Vv]al\((!))[^}]*fmt))

insource:/\{[Vv]al\|[^}]*fmt/ prefix:Template:Search link/doc

A2.2 Less typing than ((search link)).
Using ((template usage)) instead... ((tlre|Val|pattern=fmt))

Testing fmt on this page

A2.3 Easiest for templates.
Q3. Who uses u=ft OR ul=ft? (one-letter differs) ((regex|ul?=ft))

insource:/ul?=ft/ prefix:Template:Search link/doc

A. Look for 1, 2, and 5 in bold text.
Using ((template usage))... ((tlre|val|pattern = ul?=ft))

Testing ul?=ft on this page

Finds same pattern, but only inside a Val template.
Q4. AND of these, who also uses fmt=commas after that? ((slre|ul?=ft.*commas))

insource:/regexp/ prefix:Template:Search link/doc

A. No context shown, but article title is shown. A half a Bug?
Who has one space before the word "commas"? ((slre|. commas))insource:/. commas/ prefix:Template:Search link/doc A. 1 but not 2.
Q5. Who uses either u or ul with "ft" OR uses "fmt=commas". ((slre|(ul? *= *ft((!))fmt *= *commas)))

insource:/regexp/ prefix:Template:Search link/doc

A. 1, 2, 3, and 5. (The pattern matches all possible spacing.)
Q6. Who uses ft or m, in |u= or |ul=? ((slre|ul? *((=)) *(ft((!))m)))

insource:/ul? *= *(ft|m)/ prefix:Template:Search link/doc

A. 1, 2, 3, 4, and 5.

Used ((!)) for the alternation metacharacter. Used ((=)). (Could have used named 1 = or nicely named pattern = .)

Q7. Who uses . or * in the unit code? ((tlre|val|pattern = u *= *(\.((!))\*)/))

Testing u *= *(\.|\*)/ on this page

A. 4 and 5.
Who uses a pipe? ((regex|\|)) insource:/\|/ prefix:Template:Search link/doc All of them
Q8. Who uses / or - within the |u= or |ul= parameter? ((tlre|val|ul? *= *[^((!))}]+(\/((!))-)))

Testing ul? *= *[^|}]+(\/|-) on this page

A. 1,3,4,5,6 and 7.
Q9. Where is Val used in the template namespace for numbers only, (no u, ul, up, or upl parameters). ((tlre|val|pattern = ~(u[lp].)|prefix = 10))

hastemplate:"val" insource:/\{\{ *[Vv]al *\|[^}]*~(u[lp].)/ prefix:Template:

A. In the 30 or so templates listed.
Q10. Which articles use ((Convert))'s and(-) option? ((tlre|convert|pattern=and\(-\)| prefix=0))

hastemplate:"convert" insource:/\{\{ *[Cc]onvert *\|[^}]*and\(-\)/ prefix::

A Coast Range Arc and Skipjack shad

In Q2, notice how the MediaWiki software ignores the spaces around parameters, but how in Q4 the same MediaWiki software processes the spaces inside parameters. Q2 might have been solved with a plain insource:val fmt search because "fmt" and "val" are whole words, and fmt is rarely seen apart from inside Val. How about hastemplate:val insource:fmt?
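Several of the answers above can be reproduced offline by running the patterns against the seven numbered targets. The following Python sketch does this for Q3, Q5, Q6, and Q8, whose patterns do not depend on the template braces. The targets are written as they appear in this section, and Python's re dialect is assumed to be close enough to the search engine's for these simple patterns.

```python
import re

# The numbered targets, as they appear in this section.
TARGETS = {
    1: "((val|1|ul=ft/s|fmt = commas))",
    2: "((val|2|u=ft2))",
    3: "((val|3|u=m/s| fmt =commas ))",
    4: "((val|4|u=m*s-2))",
    5: "((val|5|u=ft.s-2))",
    6: "((val|6|u=C/J))",
    7: "((val|7|ul=J/C))",
}

# Patterns from the table, with | and = written literally.
PATTERNS = {
    "Q3": r"ul?=ft",
    "Q5": r"(ul? *= *ft|fmt *= *commas)",
    "Q6": r"ul? *= *(ft|m)",
    "Q8": r"ul? *= *[^|}]+(/|-)",
}


def matching_targets(pattern):
    """Return the numbers of the targets in which the pattern matches."""
    regex = re.compile(pattern)
    return [number for number, text in TARGETS.items() if regex.search(text)]


if __name__ == "__main__":
    for question, pattern in PATTERNS.items():
        print(question, matching_targets(pattern))
```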


Notes

  1. ^ The search engine protects itself against bare regex searches by limiting all regex searches. A bare regex that crawls through millions of pages can take over twenty seconds, and may even cost you an HTML timeout. During that time very few other regex searches are allowed. Always use a filter with regex.
  2. ^ Searching for an equals sign requires using a regexp. As with any template, use ((=)) or |1= to pass in an equals sign to any parameter, even the link label.
  3. ^ Advancing editors who begin to search for Wikipedia's other pages may at times set their default search domain (at Special:Search Advanced) to all. Setting search to all is the most likely scenario to "set and forget". Since that includes article space, the usual results are comparable.
  4. ^ Unlike other data that score a page ranking, word frequency and location data can be kept updated in the index at all times. For each word on the wiki, the index stores a list of page names where that word can be found. Along with page name, the word's locations and count are also stored. Apache Lucene is the indexer, and it maintains the data; it uses the term frequency algorithm. For how it does this, see TFIDF Similarity.
  5. ^ Unlike for search indexes, page-ranking data is not immediately updated. When the number of incoming links has changed more than 20%, then it is updated.
  6. ^ ((search link)) always produces fully specified queries, even if no namespace is given, because it defaults to article space.
  7. ^ A phrase will extend over whitespace unless it contains a bullet. A phrase can extend over an ordered list item, but not an unordered list item. In other words it can extend over a number # sign, but not an asterisk * character. The asterisk has special meaning to the analyzer. It is used to make an item in an unordered list, plus it is used as a modifier in search.
  8. ^ See the ElasticSearch "tokenizer" that CirrusSearch developed.
  9. ^ Stemming, like page ranking, is just a computer algorithm, and prone to needing occasional adjustments.
  10. ^ CirrusSearch uses kstem for the stemmer package, per T56022.
  11. ^ You can equally well use the insource parameter to turn stemming off. Also, please note that T113838 details this related bug: when stemming is turned off for a word, the pages listed in the search results are correct (they don't have stemmed-only variants; they all have the word as given), but any stems in the snippet are, incorrectly, highlighted.
  12. ^ This can't be proven in an example search of this page, but it will work on another page not containing this example. This is because the match, showing in bold as proof here, prefers the proper order. It can be proved by putting the target text on another page, then changing the query initiated here (on the search results page) to target that page.
  13. ^ The search namespace matches in the first parameter of a query. This is consistent with its usage in navigation, wikilinking, transclusion, and page naming, where it is always the first word in the field.
  14. ^ To see all namespaces go to the search results page and click on Advanced. The default namespace shows in parenthesis.
  15. ^ The full text of every word on the wiki plus every word in every uploaded attachment, is all indexed together in a search database. CirrusSearch can parse and index thousands of formats.
  16. ^ Characters not allowed in pagenames are # < > [ ] | { }.
  17. ^ Always check the search bar for its indication. Activating the Advanced pane can show the default search domain, and the search box is very obvious with a namespace or prefix term. One way to do this is to click on the search bar search domain instead of clicking on the search button. The only time this does not work is when changing search domains in the Advanced tab: after you change them you must press Search, not Advanced.
  18. ^ To get deepcat as a search parameter install a gadget which automatically produces incategory:pagename1|pagename2|...|pagename70. To see the number of subcategories to see if there was more or less than 69, either go fwd and bwd in the browser history, or see the source HTML of the search results page, the <title> attribute
  19. ^ In computing it is common to delimit a /regular expression/ with slashes.
  20. ^ The search is not actually done page by page, but the index for the wiki is built page by page in this way.
  21. ^ By doing things like adding a Mozart navigation template to each page about Mozart, wikignomes shore up the wiki infrastructure. Authorship, on the other hand, writes the prose of a page, one page at a time. (You cannot remove the unwanted links with -hastemplate:"Wolfgang Amadeus Mozart".)
  22. ^ A system message is the value of a MediaWiki operations variable. It can consist of a snippet of plain text, wiki text, CSS, or Javascript. A message is used to customize the behavior of MediaWiki, especially as pertains to the user interface as seen by readers, but also including the way it itself appears as a simple message, and these for each language and locale.
  23. ^ When you do a basic keyword search on Wikipedia, you aren't scanning pages in real time; you are simply looking up an entry in the index. All content is at all times "known" and resides in indexes. So when you read something like "search for pages containing...", you can mentally replace "search for..." with "search the index for..."

Developing regular expressions in an ad hoc sandbox

Regular expressions are little computer programs, so it is characteristic of regex searches that they must be written while studying the target data, and tested to achieve their potential precision and thoroughness. However, only a few of these intensive searches are technically able to run at a time against the database.[1] A sandbox minimizes your footprint, and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that.

Although a normal search targeting the entire wiki will run quickly, a regexp search should target as few pages as possible by using filters in order to run quickly. A filter is part or whole of a database query. Filters include:

Order is not important because the search is optimized by the software before it is run.

To target just one page while experimenting with or developing a regex search, target a fullpagename. From the search box use the filter prefix:fullpagename. From the edit box (of any section of the page with the target data), you can always just write prefix:((FULLPAGENAME)) and it will "expand" for you to the fullpagename. Although you can edit a history page, technically a "history page" is not a page (in the database), and so ((FULLPAGENAME)) there will point to the database version (not its own rendering). For the same reason, you cannot search for the wikitext on a page that is not already saved (to the database), although you can certainly change the search parameters again and again with no need to save them.

Fullpagename is namespace:pagename. Knowing this, you can adjust your prefix parameter. Although prefix can filter down to one page, it can also filter up to a namespace, and it accepts the beginning letter(s) of a set of pagenames if you want to reduce the namespace search domain.

Regex sandboxing uses an ad hoc sandbox made by editing any page containing the target data, and using it as a "sandbox" (not editing it to save it). You then develop the regexp by adding a search link that includes insource:/regexp/, with the filter prefix:((FULLPAGENAME)) alongside.

Use of a sandbox enables the smallest possible footprint by using filters to limit the search domain. Once your regexp pattern is honed, you increase the search domain. A regex search is best run with filters, not alone, even if it is a polished regexp.

References

See also

Templates for searching Wikipedia

Search links

A search link stores a query in a link that takes you to live search results for that stored search. Search links are found on user pages and talk pages. Use one to make the full feature set of MediaWiki Search, or features of external search engines, available to users unfamiliar with their search parameters.

One type of search link is a wikilink with all the capabilities of Search (search box), and with standard wikilink syntax: [[Special:Search/query| label]]. So this search link will (1) navigate: [[Special:search/Wales]] → Special:search/Wales or (2) search: [[Special:search/~Wales | search/~Wales]]search/~Wales if you prefix a ~ tilde character.

All other search links are made from a template that builds a URL instead of a wikilink. A URL can, for example, call off-site search engines to search Wikipedia.

Search boxes

Search boxes are made by <inputbox> tags. See mw:Extension:InputBox.

Page title searches

For searches with exact matches, exact in upper and lower cases, or in punctuation marks, see Help:Searching § grep.
