Title-Body Search
Wikipedia illustrates the Title-Body pattern for search. This pattern is characterized by documents that have a descriptive title and a body that contains the main content. On this type of search-enabled website, you typically interact with the search function by entering terms in a search field. When you click the Search button, the UI submits a search query with those terms as the search criteria. The search engine then matches those terms against the documents in the corpus and returns a ranked list of matching documents.
For example, if you enter software as a service, a CloudSearch query could specify those terms with the q parameter:
https://<domain's search endpoint>/2011-02-01/search?q=software as a service&return-fields=title...
The CloudSearch response contains the first set of matching documents, sorted by their text relevance scores (in descending order).
{ "rank": "-text_relevance",
"match-expr": "(label 'software as a service')",
"hits": {
"found": 29166,
"start": 0,
"hit": [
{ "id": "wikipedia21560276",
"data": { "title": ["Software business"] }
},
{ "id": "wikipedia22947099",
"data": { "title": [
"Service-oriented software engineering" ] }
},
... ]
},
...
}
However, this isn't a very good set of results. Ideally, the main Software as a Service page should be at the top. What happened?
The CloudSearch text relevance score is based on two components: tf-idf and proximity. Tf-idf measures the frequency of the query terms across the whole document corpus and scales by the number of occurrences in each matching document. A match scores higher if it has many occurrences of infrequently-occurring terms. Proximity measures the proximity of the query terms in each matching document. A match scores higher if it has many occurrences of the query terms in close proximity and in the correct order.
When ranking documents that follow the Title-Body pattern, you usually want to boost the value of matches in the document title. If you're searching for software as a service, the best matches probably contain software and service in their titles. Boosting title matches brings these documents to the top of the results.
In CloudSearch, you can define custom rank functions statically using the console, command line tools, or API, or dynamically within the query itself. To define a rank function within a query, you use the rank-<name> parameter. To select the rank function you want to use to rank matches, you use the rank parameter. For example:
https://<search endpoint>/2011-02-01/search?q=software as a service&return-fields=title&rank-title_boost=cs.text_relevance({weights:{"title":4.0}})&rank=-title_boost
The rank-title_boost=cs.text_relevance({weights:{"title":4.0}}) URL parameter defines a new rank function called title_boost that multiplies the relevance of title matches by 4. By boosting the title, we get better matches listed first, including the main Software as a Service page right at the top:
{ "rank": "-title_boost",
"match-expr": "(label 'software as a service')",
"hits": {
"found": 29166,
"start": 0,
"hit": [{ "id": "wikipedia2262333",
"data": { "title": [ "Software as a service" ] }
},
{
"id": "wikipedia29176559",
"data": { "title": [
"Nsite Software (Platform as a Service)"] }
},...} ... }
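The example URL above is shown unencoded for readability. As a minimal sketch of how a client might assemble and URL-encode the same request, the following uses Python's standard library. The endpoint value and the function name are placeholders invented for illustration; only the parameter names come from the example above.

```python
from urllib.parse import urlencode, quote

# Hypothetical endpoint; substitute your own domain's search endpoint.
SEARCH_ENDPOINT = "https://search-example.invalid"


def build_title_boost_url(terms, weight=4.0):
    """Build a search URL that defines and applies the title_boost rank."""
    rank_expr = 'cs.text_relevance({weights:{"title":%s}})' % weight
    params = [
        ("q", terms),
        ("return-fields", "title"),
        ("rank-title_boost", rank_expr),  # defines the rank function
        ("rank", "-title_boost"),         # sorts by it, highest first
    ]
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return SEARCH_ENDPOINT + "/2011-02-01/search?" + query


url = build_title_boost_url("software as a service")
print(url)
```

Keeping the weight as an argument makes it easy to try different boosts while tuning.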
Social Search Patterns
Many of today's applications have a social component that enables users to find and follow their friends and rate, review, and share content. These users expect search results to reflect their connections and contributions.
One of the ways that social data can influence search is through surfacing the most popular content. First, an application must record the user-generated popularity in the data source. For example, if your application supports "liking" a post, you increment a counter in the database each time a user likes a post.
Next, you need to add a corresponding number_of_likes field to your index schema. In CloudSearch, you configure this field as an integer. Each time a post's like counter is incremented, you update the corresponding document so its number_of_likes field reflects the current popularity.
To reflect the popularity in search results, you build a rank function for the domain that takes into account the number of likes and use that function to rank the results:
rank_likes=text_relevance+log10(number_of_likes)*50
When defining rank expressions, keep in mind that the text_relevance function computes a score in the range 0 to 1000. You often need to scale the contribution from document fields to avoid overwhelming the text relevance score. For example, if the number_of_likes field ranges into the millions, use a log or normalizing constant. In rank_likes, we take the base-10 log of the number of likes to scale it roughly from 0 to 6 and then multiply by 50 to put it in the range of about 0 to 300. This enables it to influence result ranking without dominating.
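To see how the scaling behaves, you can reproduce the arithmetic of rank_likes locally. This is only a sketch of the formula, not of how CloudSearch evaluates it. Treating zero likes as one (so the logarithm is defined) is an assumption made here for illustration.

```python
import math


def rank_likes(text_relevance, number_of_likes):
    """Local mirror of: text_relevance + log10(number_of_likes) * 50."""
    # Assumption: clamp to 1 so log10 is defined for posts with no likes.
    likes = max(number_of_likes, 1)
    return text_relevance + math.log10(likes) * 50


# Compare the popularity contribution at a few orders of magnitude.
for likes in (1, 1_000, 1_000_000):
    print(likes, rank_likes(306, likes))
```

Even at a million likes the popularity term adds only about 300 points, so a strong text match can still outrank a popular but weakly matching post.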
When tuning relevance for your application, gather some common and long-tail queries that you expect or are already receiving. You can use the CloudSearch return-fields parameter to get the computed text_relevance score or the computed value for any of your own rank expressions. For example:
q=star+wars&return-fields=text_relevance
This returns the computed relevance value for each document:
"rank": "-text_relevance",
"match-expr": "(label 'star wars')",
"hits": {"found": 7, "start": 0,
"hit": [{
"id": "tt1185834",
"data": {"text_relevance": ["306"]}}] ...
Mobile Search Patterns
Information about a user's immediate context can improve result relevance. For example, searching for pizza parlor in a location-aware application should find nearby pizza parlors. In CloudSearch, query-time rank expressions enable you to incorporate a user's current location into the rank function so you can boost matches that are located near the user.
First, you need to associate a location with each document. You can do this by transforming the location's latitude and longitude to unsigned integers and embedding those values with the document data (see the CloudSearch documentation for details):
...
"fields": {
"name":"Joe's Pizzeria",
"latitude":"12785",
"longitude":"5751",
...
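The exact transformation is left to the CloudSearch documentation. As a rough sketch only, one simple option is to shift latitude by 90 and longitude by 180 so that both are non-negative, then scale and round. The function name, the offsets, and the scale factor of 100 are assumptions for illustration, not the documented method.

```python
def to_uint_coords(lat_deg, lon_deg, scale=100):
    """Map latitude/longitude in degrees to non-negative integers.

    Assumption: shift by 90 and 180 degrees to remove negative values,
    then multiply by `scale` (hundredths of a degree by default).
    """
    if not (-90 <= lat_deg <= 90 and -180 <= lon_deg <= 180):
        raise ValueError("coordinates out of range")
    return round((lat_deg + 90) * scale), round((lon_deg + 180) * scale)


latitude, longitude = to_uint_coords(37.85, -122.49)
print(latitude, longitude)
```

A larger scale gives finer resolution at the cost of larger integer values. Whatever you choose, apply the same transformation to documents and to the user's location at query time.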
You can use this embedded location information to perform bounding-box searches with integer range searches of the latitude and longitude fields. For example, you can search within a box that contains San Francisco with the query:
bq=(and 'pizzeria' latitude:12700..12900 longitude:5700..5800)
In CloudSearch, ranges can either be open- or closed-ended. 12700..12900 restricts matches to documents from 12700 to 12900, inclusive. To specify an open-ended range, you simply omit the upper or lower bound. For example, 1000.. matches documents with values greater than or equal to 1000.
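A bounding-box filter like the one above can be generated from a center point and a box size. The sketch below assumes the coordinates are already transformed to unsigned integers. The function and argument names are hypothetical, and search terms containing a single quote are rejected rather than escaped.

```python
def bounding_box_bq(term, lat, lon, half_height, half_width):
    """Build a bq expression matching `term` inside a box around (lat, lon)."""
    if "'" in term:
        raise ValueError("quote handling is out of scope for this sketch")
    # Unsigned integer fields cannot be negative, so clamp lower bounds at 0.
    lat_lo, lon_lo = max(lat - half_height, 0), max(lon - half_width, 0)
    return "(and '%s' latitude:%d..%d longitude:%d..%d)" % (
        term, lat_lo, lat + half_height, lon_lo, lon + half_width)


print(bounding_box_bq("pizzeria", 12800, 5750, 100, 50))
```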
To sort results by distance from the user, embed the user's current location in a query-time rank function and use it to sort the results. The Cartesian distance function provides a fast-to-compute sorting function for user queries:
d = sqrt((x - x_0)^2 + (y - y_0)^2)
You can insert the user's current location as the (x_0, y_0) point and use the document's latitude and longitude as the (x, y) point in a rank function. For example:
rank-geo=sqrt(pow((userlat-latitude),2)+pow((userlon-longitude),2))&rank=geo
(Of course, you replace userlat and userlon with the user's actual latitude and longitude and URL-encode the query before submitting it.)
Note that the rank parameter specifies the geo function to sort in ascending order, closest first. In other examples, we've used a negative function (-text_relevance) to sort in descending order.
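As a sketch of how a client might fill in the user's location and encode the request, the following builds the rank-geo and rank parameters and URL-encodes them together with a bounding-box filter. The function names and the sample coordinates are illustrative assumptions. The local distance helper only mirrors the formula for sanity checks.

```python
import math
from urllib.parse import urlencode, quote


def geo_rank_params(user_lat, user_lon):
    """Return parameters that define the geo rank function and select it."""
    expr = "sqrt(pow((%d-latitude),2)+pow((%d-longitude),2))" % (
        user_lat, user_lon)
    # rank=geo (no minus sign) sorts ascending: closest first.
    return [("rank-geo", expr), ("rank", "geo")]


def local_distance(user, doc):
    """Same Cartesian distance, computed locally on (lat, lon) pairs."""
    return math.hypot(user[0] - doc[0], user[1] - doc[1])


params = [("bq", "(and 'pizzeria' latitude:12700..12900 longitude:5700..5800)")]
params += geo_rank_params(12790, 5760)
query_string = urlencode(params, quote_via=quote)
print(query_string)
```

Because this distance treats a unit of latitude and a unit of longitude as equal, it is an approximation that is adequate for ordering nearby results, not for reporting real-world distances.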
eCommerce Patterns
Search is a central component of eCommerce applications, responsible for connecting users with the products that they want to buy. In a typical eCommerce application, users narrow text-based search results using faceted drilldown to zero in on what they want.
A facet is a single attribute of a product, such as brand, size, or color. In a product search application, the UI typically shows the values for the facet below the facet name, often with a count of documents containing that value.
CloudSearch makes it easy to create this UI. First, add a field for the facet to each SDF document. Then, enable the Facet option for the field in the domain's indexing options, for example through the AWS Management Console for Amazon CloudSearch.
Now when you search, you can use the facet parameter to request facet counts for the field:
http://<search endpoint>/search?q=shirt&facet=color
In the response, the facet information is included after the hits:
{"rank": "-text_relevance",
"match-expr": "(label 'shirt')",
"hits": { "found": 389, "start": 0,
"hit": [ ... ]
},
"facets": {
"color": {
"constraints": [
{ "value": "Black", "count": 30 },
{ "value": "White", "count": 28 },
{ "value": "One Color", "count": 14 }
] } }, ...
}
When a user selects a facet in the UI, you restrict the search results by adding the facet's value as a filter to the query:
http://<search endpoint>/search?bq=(and 'shirt' color:'White')
This query returns documents that match shirt and have the value White in the color field.
Sometimes facets comprise a hierarchical set of categories. For example, you might have a three-level hierarchical categorization scheme for products: department / sub-department / leaf department. A single product would be placed in one leaf department like Clothing & Accessories/Men/Tops & Tees. By creating three fields in the search document, one for each level of the tree, you can provide users with result counts at all levels and enable them to narrow their searches:
...
"fields": {
"dept":"Clothing & Accessories",
"sub_dept":"Clothing & Accessories/Men",
"leaf_dept":"Clothing & Accessories/Men/Tops & Tees"
...
To retrieve counts across the search results at each level of the hierarchy, specify
facet=dept,sub_dept,leaf_dept
in the query. When displaying these counts to the user, you simply remove the prefix path. You can restrict searches to any level of the hierarchy by including the appropriate filter parameter:
bq=(and 'shirt' sub_dept:'Clothing & Accessories/Men')
This query retrieves all matches for the string shirt in any leaf category below Clothing & Accessories/Men. To support exact matching, be sure to configure these fields as literal fields.
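The three hierarchy fields can be derived from a single category path when you generate the document, and the prefix can be stripped again for display. This is a minimal sketch. The helper names are hypothetical, and it assumes the separator never appears inside a category name.

```python
def hierarchy_fields(path, names=("dept", "sub_dept", "leaf_dept"), sep="/"):
    """Expand a category path into one prefixed field per hierarchy level."""
    parts = path.split(sep)
    if len(parts) != len(names):
        raise ValueError("expected %d levels, got %d" % (len(names), len(parts)))
    # Each level stores the full path down to that level.
    return {name: sep.join(parts[:i + 1]) for i, name in enumerate(names)}


def display_label(value, sep="/"):
    """Strip the prefix path so only the last level is shown to the user."""
    return value.rsplit(sep, 1)[-1]


fields = hierarchy_fields("Clothing & Accessories/Men/Tops & Tees")
print(fields)
print(display_label(fields["sub_dept"]))
```

Storing the full prefix at each level keeps values unique across branches, so two sub-departments that share a name under different departments get separate counts.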
For attributes like color, you usually present all choices and allow the user to select more than one option. If you pass in a query with a filter such as color:'green', you won't receive correct counts for any of the other values; the results will all have green for their color. Instead, you can submit two queries, one to get the search results with green selected, and a second to get the facet counts without any selection for color.
If you have a second multi-select facet, such as size, then you run one query with both the size and color selections to get the query results, a second query with just the size selections to get the correct color counts, and a third query with just the color selections to get the correct size counts.
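The pattern generalizes: one query for the results with every selection applied, plus one query per multi-select facet that omits that facet's own selections. The sketch below generates those parameter sets. The function names and the dictionary layout are assumptions for illustration, and values are assumed to contain no single quotes.

```python
def query_params(term, selections, skip=None):
    """Build q/bq parameters for `term`, filtered by selected facet values.

    `selections` maps a facet field to the list of values the user picked.
    `skip` names one facet whose selections are left out of the filter.
    """
    filters = []
    for field in sorted(selections):
        values = selections[field]
        if field == skip or not values:
            continue
        clauses = ["%s:'%s'" % (field, value) for value in values]
        # Several values for one facet are alternatives, so OR them.
        filters.append(clauses[0] if len(clauses) == 1
                       else "(or %s)" % " ".join(clauses))
    if not filters:
        return {"q": term}
    return {"bq": "(and '%s' %s)" % (term, " ".join(filters))}


def multi_select_queries(term, selections):
    """Return the results query plus one facet-count query per facet."""
    queries = {"results": query_params(term, selections)}
    for field in sorted(selections):
        params = query_params(term, selections, skip=field)
        params["facet"] = field
        queries["counts:" + field] = params
    return queries


for name, params in multi_select_queries(
        "shirt", {"color": ["green"], "size": ["M"]}).items():
    print(name, params)
```

With N multi-select facets this issues N + 1 queries per page, so it is worth sending them in parallel.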
You can also enable users to narrow their search by price. To narrow matches to a range of prices, include the price as a field for each of the products. (In CloudSearch, uint fields are the only type of numeric field, so store prices in cents by multiplying the price by 100.)
Specifying an integer field as a facet returns the min and max values in that field across all documents instead of counts for individual values. This means you can get the full range of prices for a search with the facet parameter:
http://<search endpoint>/search?q=shirt&facet=price
This query returns:
{"rank": "-text_relevance",
...
"facets": {
"price": {
"min": 1,
"max": 349900
}
} ...
To get counts for particular price ranges, you need to specify the ranges using the facet-<field name>-constraints parameter. For example, this query requests counts for prices from $0 to $25, $25 to $50, $50 to $100, $100 to $150, and $150 and up:
http://<search endpoint>/search?q=shirt&facet=price&facet-price-constraints=..2500,2500..5000,5000..10000,10000..15000,15000..
In the response, the constraint information is included with the facet information:
{...
"facets": {
"price": {
"min": 1,
"max": 349900,
"constraints": [
{ "value": "..2500", "count": 11194 },
{ "value": "2500..5000", "count": 3888 },
{ "value": "5000..10000", "count": 1526 },
{ "value": "15000..", "count": 603 },
{ "value": "10000..15000", "count": 452 }
]
}
} ...
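The constraint list can be generated from a list of price boundaries in dollars. This is a sketch with a hypothetical helper name. It converts each boundary to cents and emits an open-ended range at each end.

```python
def price_constraints(boundaries_dollars):
    """Turn ascending dollar boundaries into a facet constraints string."""
    cents = [int(round(b * 100)) for b in boundaries_dollars]
    if not cents or cents != sorted(cents):
        raise ValueError("boundaries must be a non-empty ascending list")
    ranges = ["..%d" % cents[0]]                       # open lower end
    ranges += ["%d..%d" % (lo, hi) for lo, hi in zip(cents, cents[1:])]
    ranges.append("%d.." % cents[-1])                  # open upper end
    return ",".join(ranges)


print("facet-price-constraints=" + price_constraints([25, 50, 100, 150]))
```

Because ranges are inclusive at both ends, a price that sits exactly on a boundary matches both adjacent ranges. If that matters for your counts, you could end each range one cent below the next boundary instead.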
To narrow subsequent searches, you include a filter in the query. For example, if the user selects the $25 to $50 price range, you add the constraint to the bq parameter:
http://<search endpoint>/search?bq=(and 'shirt' price:2500..5000)
Mixed Data Source Patterns
If you have documents of different types, you often want to enable users to search one type of document at a time. For example, if you're presenting a collection of hotels, cars, and flights, you want to support searching only hotels, only cars, or only flights.
Two main patterns enable you to manage and search a mix of document types:
- Create a separate search domain for each document type. With this approach, you use the search type to channel queries to the appropriate domain. This is the best solution when you have a large collection of each document type, a limited number of document types, and heterogeneous fields within each type of document. The key advantage of using separate domains is that each type's domain is updated and scaled independently.
- House all document types in a single search domain. To do this, you add a type field to every document that identifies the document type:
... "fields" : {
"type" : "flight",
...
In each query, you include a filter to restrict the matches to a particular document type:
bq=(and type:'flight' ...)
The advantage of using a single domain is that the combined index can be ideally packed into memory and efficiently partitioned across multiple hosts. However, it means that scaling is less granular—a single highly-active type can force scaling that the other types don't need. It also means that configuration changes for one type require rebuilding the index for every type. Combining types in a single domain also has an impact on relevance ranking. The text_relevance function depends on the distribution of tokens in the index—when there are multiple types of documents, the text_relevance score is aggregated across types, which can result in less accurate ranking.
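The two patterns differ mainly in where the document type enters the request. The sketch below contrasts them: the first picks an endpoint by type, and the second always uses one endpoint and adds a type filter. All endpoint values and function names are hypothetical placeholders.

```python
# Hypothetical endpoints; substitute your own domains' search endpoints.
SEPARATE_ENDPOINTS = {
    "hotel": "https://search-hotels.example.invalid",
    "car": "https://search-cars.example.invalid",
    "flight": "https://search-flights.example.invalid",
}
COMBINED_ENDPOINT = "https://search-travel.example.invalid"


def separate_domain_request(doc_type, terms):
    """Pattern 1: route the query to the domain for this document type."""
    return SEPARATE_ENDPOINTS[doc_type], {"q": terms}


def single_domain_request(doc_type, terms):
    """Pattern 2: one domain, with a type filter added to every query."""
    return COMBINED_ENDPOINT, {"bq": "(and type:'%s' '%s')" % (doc_type, terms)}


print(separate_domain_request("flight", "seattle"))
print(single_domain_request("flight", "seattle"))
```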
Access Control Patterns
You can use a pattern similar to the single-domain mixed document pattern to restrict access to documents based on the identity of the user. You add a user_id field to each of the documents in the search domain and add a filter to the query to restrict the results to the documents the user owns.
It gets more complicated when users have access rights to documents that they do not own. The simplest solution is to maintain a list of users with access to each document as part of the document data:
... "fields" : {
"user_id" : ["12345", "67890"],
...
The drawback to this approach is that you have to update each document whenever a user gains or loses access. In the worst case, you might have to update every document in the domain.
A better solution is to organize users in groups or documents in folders and manage user access outside of the search engine. In this case, you maintain a list of groups with access to each document as part of the document data:
... "fields" : {
"groups" : ["11111", "12345", "87654"]
...
Your search queries then filter using ORs of the groups to which the user belongs:
bq=(and 'query string' (or groups:'11111' groups:'22222'...))
When user permissions change, it happens outside of the search engine and doesn't require configuration changes or reindexing. However, if you have a large number of groups, your queries become more complicated, which can degrade performance.
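A group filter of this kind can be built from the list of groups your application has on record for the current user. This is a minimal sketch. The function name is hypothetical, the field defaults to the groups field from the document data above, and the group lookup itself is assumed to happen elsewhere in your application.

```python
def access_filtered_bq(query_string, group_ids, field="groups"):
    """Restrict a query to documents shared with any of the user's groups."""
    if not group_ids:
        # A user in no groups can see nothing; fail rather than search openly.
        raise ValueError("user belongs to no groups")
    clauses = ["%s:'%s'" % (field, group_id) for group_id in group_ids]
    group_filter = (clauses[0] if len(clauses) == 1
                    else "(or %s)" % " ".join(clauses))
    return "(and '%s' %s)" % (query_string, group_filter)


print(access_filtered_bq("query string", ["11111", "22222"]))
```

Build the filter on the server from trusted membership data rather than accepting group IDs from the client, since the filter is the only thing enforcing access.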