Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Working With Schemas in a Schema-Flexible Datastore

DZone's Guide to

Working With Schemas in a Schema-Flexible Datastore

See how N1QL’s INFER command can help you write queries and define indexes in a schema-flexible environment.

· Database Zone
Free Resource

Learn how to create flexible schemas in a relational database using SQL for JSON.

A schema-flexible JSON datastore, such as Couchbase, provides a big advantage in permitting the application developer to manage schema evolution and referential integrity in a way best suited to the requirements of the application. There’s no need to bring down the database or get permission from the DBAs to change the schema, the application can just start writing data in a new schema, with the application logic handling new and old data schemas as appropriate.

This flexibility comes at a cost, however, beyond the need to write more sophisticated application code. Without a formal schema definition, it can be much harder for people to know the data structure details that are needed to write application code, queries, or define indexes:

  • How many distinct document schemas exist in a data store?
  • How frequently are fields present?
  • What is the data like in each field? Type? Format? Range?

Couchbase 4.5 introduced a new web-based query interface.

Screen Shot 2016-12-16 at 10.21.26 AM.png


This interface includes the “Bucket Analysis” panel, where if you click the triangle next to a bucket name, something very much like a schema appears:

Screen Shot 2016-12-16 at 10.24.52 AM.png


How is this possible, when Couchbase doesn’t have or enforce schemas? Behind the scenes, this panel is using N1QL’s INFER command, which takes a random sample of documents from a bucket, and infers what schema(s) exist for those documents. The example above looks at the beer-sample bucket (which is available with every instance of Couchbase). Note that:

  • Infer found not one but two schemas with different sets of fields, these schemas are called “flavors”. There are some fields in common between the flavors (description, name, and type), but more fields are different between the two flavors.
  • Hovering the mouse over a field name shows some sample values for that field.
  • Since infer keeps track of some of the sample values for each field, it knows which fields only have a single distinct value for the documents in flavor. In this case, it sees only a single value for the field named, “type”, so it lists that field as something “in-common” for all documents in each flavor. Using a “type” field is a common pattern when data modeling with Couchbase, yet there are no semantics to the name, it could be called “type” or “kind” or “category” or anything. The INFER command helps users learn what approach was taken by the person doing data modeling.

INFER can be run as a regular N1QL command to derive much more details about a bucket’s documents. How does INFER work?

  • INFER starts with a random sample of documents from a bucket. By default, it asks for 1000 documents, but a different sample size may be specified.
  • INFER extracts a schema from each document (field names, types), maintaining a hash table of all distinct schemas seen thus far. Two schemas are identical if they have the exact same fields and types. Maintaining all distinct schemas could be expensive if there were many of them - in theory every document in a bucket could have a different schema - but in practice most business data follows regular patterns and there aren’t many distinct schemas. Sample values for each field are maintained to help users understand a field’s format, though long values are truncated to 32 bytes to save space. When two identical schemas are combined, the sample values for each are combined, up to a specified maximum (by default at most 5 sample values are stored). INFER also keeps track of the number of distinct documents matching each schema.
  • Once all distinct schemas are found, similar schemas are merged into schema “flavors”. Two schemas are considered similar if the fraction of shared top-level fields is greater than the “similarity_metric” (default 0.6). If a field is a simple type, it is considered shared if both schemas have a field with the same name and type. If a field has an object type, then the similarity is considered recursively on the subtypes. If you set the similarity_metric to 0, then all schemas are merged into a single, universal-relation-like schema. At the other extreme, a similarity metric of 1.0 would mean that each distinct schema would become its own flavor. Flavors are different from schemas in one important way: with a schema, a field is either present or not. With a flavor, a field can be present some of the time, thus flavors keep track of the frequency that each field appears.
  • Each flavor is described using a JSON document whose format is an extension of the draft IETF standard for JSON schemas (http://json-schema.org). The extensions add fields for describing sample values and how often each field is found.

Let’s look at an example: the output of INFER for the ‘beer-sample’ bucket. Note how this is an array of flavors, each flavor following the json-schema format. At the top level each flavor has a “#docs” field which indicates how many documents from the sample matched this flavor. In the schema description, each field has a name and type, and also “#docs” and “%docs” indicating how often the field occurred. Sample values are also included for each field.

[
  [
    {
      "#docs": 807,
      "$schema": "http://json-schema.org/schema#",
      "Flavor": "type = \"beer\"",
      "properties": {
        "abv": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            5,
            5.4,
            0,
            4.9,
            8
          ],
          "type": "number"
        },
        "brewery_id": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            "limburg_beer_company",
            "caldera_brewing",
            "western_reserve_brewing",
            "anchor_brewing",
            "traquair_house_brewery"
          ],
          "type": "string"
        },
        "category": {
          "#docs": 593,
          "%docs": 73.48,
          "samples": [
            "North American Lager",
            "North American Ale",
            "British Ale",
            "Irish Ale",
            "Other Style"
          ],
          "type": "string"
        },
        "description": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            "Westmalle Dubbel is a dark, re...",
            "A conundrum in the world of be...",
            "Black, smooth and easy to drin...",
            "A true German-style Kölsch. O...",
            ""
          ],
          "type": "string"
        },
        "ibu": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            0,
            41,
            30,
            38,
            35
          ],
          "type": "number"
        },
        "name": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            "Witbier",
            "Dry Hop Orange",
            "Cloud Nine Witbier",
            "Anchor Steam",
            "Jacobite Ale"
          ],
          "type": "string"
        },
        "srm": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            0,
            12.5,
            47,
            35,
            45
          ],
          "type": "number"
        },
        "style": {
          "#docs": 593,
          "%docs": 73.48,
          "samples": [
            "American-Style Lager",
            "American-Style Amber/Red Ale",
            "American-Style Pale Ale",
            "American-Style Stout",
            "Classic English-Style Pale Ale"
          ],
          "type": "string"
        },
        "type": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            "beer"
          ],
          "type": "string"
        },
        "upc": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            0,
            7
          ],
          "type": "number"
        },
        "updated": {
          "#docs": 807,
          "%docs": 100,
          "samples": [
            "2010-07-22 20:00:20",
            "2011-07-23 20:37:34",
            "2011-08-15 11:46:53",
            "2011-05-17 03:19:48",
            "2011-02-16 12:38:43"
          ],
          "type": "string"
        }
      }
    },
    {
      "#docs": 193,
      "$schema": "http://json-schema.org/schema#",
      "Flavor": "type = \"brewery\"",
      "properties": {
        "address": {
          "#docs": 193,
          "%docs": 100,
          "items": {
            "#schema": "FieldType",
            "subtype": null,
            "type": "string"
          },
          "maxItems": 1,
          "minItems": 0,
          "samples": [
            [
              "7556 Pine Road"
            ],
            [
              "Val Dieu 225"
            ],
            [
              "6 Chapel Close"
            ],
            [
              "210 Swanson Avenue"
            ],
            [
              "808 West Main Street"
            ]
          ],
          "type": "array"
        },
        "city": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "Arena",
            "Aubel",
            "South Stoke",
            "Lake Havasu City",
            "Ashland"
          ],
          "type": "string"
        },
        "code": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "53503",
            "",
            "86403",
            "54806",
            "94019"
          ],
          "type": "string"
        },
        "country": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "United States",
            "Belgium",
            "United Kingdom",
            "Canada",
            "Denmark"
          ],
          "type": "string"
        },
        "description": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "Since firing up its brew kettl...",
            "(512) Brewing Company is a mic...",
            "An artisanal (non-conglomerate...",
            "Tröegs Brewing Company was es...",
            ""
          ],
          "type": "string"
        },
        "geo": {
          "#docs": 175,
          "%docs": 90.67,
          "properties": {
            "accuracy": {
              "#docs": 175,
              "%docs": 100,
              "samples": [
                "RANGE_INTERPOLATED",
                "GEOMETRIC_CENTER",
                "ROOFTOP",
                "APPROXIMATE"
              ],
              "type": "string"
            },
            "lat": {
              "#docs": 175,
              "%docs": 100,
              "samples": [
                43.1706,
                50.7046,
                51.5462,
                34.4686,
                46.5872
              ],
              "type": "number"
            },
            "lon": {
              "#docs": 175,
              "%docs": 100,
              "samples": [
                -89.9324,
                5.822,
                -1.1355,
                -114.341,
                -90.8921
              ],
              "type": "number"
            }
          },
          "samples": [
            {
              "accuracy": "RANGE_INTERPOLATED",
              "lat": 43.1706,
              "lon": -89.9324
            },
            {
              "accuracy": "GEOMETRIC_CENTER",
              "lat": 50.7046,
              "lon": 5.822
            },
            {
              "accuracy": "RANGE_INTERPOLATED",
              "lat": 51.5462,
              "lon": -1.1355
            },
            {
              "accuracy": "ROOFTOP",
              "lat": 34.4686,
              "lon": -114.341
            },
            {
              "accuracy": "ROOFTOP",
              "lat": 46.5872,
              "lon": -90.8921
            }
          ],
          "type": "object"
        },
        "name": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "Lake Louie Brewing",
            "Brasserie de l'Abbaye Val-Dieu",
            "Ridgeway Brewing",
            "Mudshark Brewing",
            "South Shore Brewery"
          ],
          "type": "string"
        },
        "phone": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "1-608-753-2675",
            "32-087-68-75-87",
            "44-(01491)-873474",
            "1-928-453-2981",
            "1-715-682-9199"
          ],
          "type": "string"
        },
        "state": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "Wisconsin",
            "Lige",
            "Oxford",
            "Arizona",
            "Brabant Wallon"
          ],
          "type": "string"
        },
        "type": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "brewery"
          ],
          "type": "string"
        },
        "updated": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "2010-07-22 20:00:20",
            "2010-12-13 19:30:01"
          ],
          "type": "string"
        },
        "website": {
          "#docs": 193,
          "%docs": 100,
          "samples": [
            "http://www.jenningsbrewery.co....",
            "http://www.mudsharkbrewingco.c...",
            "http://512brewing.com/",
            "http://www.darkhorsebrewery.co...",
            ""
          ],
          "type": "string"
        }
      }
    }
  ]
]


The format for running the INFER command is as follows:

 INFER [ WITH { “parameter” : value, ... } ]  

The most common parameters are:

  • sample_size - how many documents to use when inferring schemas (default 1000).
  • num_sample_values - how many sample values to retain for each field (default 5).
  • similarity_metric - how similar two schemas must be to be merged into the same flavor (default 0.6).

For example, to INFER schemas for beer-sample using a random sample of 2000 documents, and maintaining 10 sample values for each field, you would use:

 INFER `beer-sample` with {“sample_size”: 2000, “num_sample_values”: 10} 

The output from INFER is quite detailed, and can be very useful for:

  • Learning about what fields exist and their data formats.
  • Identifying different schema versions, and finding out how many documents exist for each version.
  • Data cleaning, identifying fields that are used rarely or that should always be present but aren’t.

It should be noted that INFER, being based on random sampling, is not deterministic: You will likely get slightly different results each time you run it. You also might not pick up certain document schemas if they occur too rarely with respect to the specified sample size. On the whole, however, it is a powerful tool to help people understand what schemas are present in the documents of their schema-flexible data store.

Create flexible schemas using dynamic columns for semi-structured data. Learn how.

Topics:
couchbase ,json ,schema ,database ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}