Indexing Nested data



Sample data

  • The examples in this article are based on the following Classes and Sample Data:

    class OnlineShop:
        def __init__(self, shop_name: str = None, email: str = None, t_shirts: List[TShirt] = None):
            self.shop_name = shop_name
            self.email = email
            self.t_shirts = t_shirts
    
        @classmethod
        def from_json(cls, json_data: Dict[str, Any]) -> OnlineShop:
            return cls(
                json_data["shop_name"],
                json_data["email"],
                [TShirt.from_json(shirt_json_dict) for shirt_json_dict in json_data["t_shirts"]],
            )
    
        def to_json(self) -> Dict[str, Any]:
            return {
                "shop_name": self.shop_name,
                "email": self.email,
                "t_shirts": [tshirt.to_json() for tshirt in self.t_shirts],
            }
    
    
    class TShirt:
        def __init__(self, color: str = None, size: str = None, logo: str = None, price: float = None, sold: int = None):
            self.color = color
            self.size = size
            self.logo = logo
            self.price = price
            self.sold = sold
    
        @classmethod
        def from_json(cls, json_data: Dict[str, Any]) -> TShirt:
            return cls(json_data["color"], json_data["size"], json_data["logo"], json_data["price"], json_data["sold"])
    
        def to_json(self) -> Dict[str, Any]:
            return {"color": self.color, "size": self.size, "logo": self.logo, "price": self.price, "sold": self.sold}
    # Creating sample data for the examples in this article:
    # ======================================================
    shop1_tshirts = [
        TShirt(color="Red", size="S", logo="Bytes and Beyond", price=25, sold=2),
        TShirt(color="Red", size="M", logo="Bytes and Beyond", price=25, sold=4),
        TShirt(color="Blue", size="M", logo="Query Everything", price=28, sold=5),
        TShirt(color="Green", size="L", logo="Data Driver", price=30, sold=3),
    ]
    
    shop2_tshirts = [
        TShirt(color="Blue", size="S", logo="Coffee, Code, Repeat", price=22, sold=12),
        TShirt(color="Blue", size="M", logo="Coffee, Code, Repeat", price=22, sold=7),
        TShirt(color="Green", size="M", logo="Big Data Dreamer", price=25, sold=9),
        TShirt(color="Black", size="L", logo="Data Mining Expert", price=20, sold=11),
    ]
    
    shop3_tshirts = [
        TShirt(color="Red", size="S", logo="Bytes of Wisdom", price=18, sold=2),
        TShirt(color="Blue", size="M", logo="Data Geek", price=20, sold=6),
        TShirt(color="Black", size="L", logo="Data Revolution", price=15, sold=8),
        TShirt(color="Black", size="XL", logo="Data Revolution", price=15, sold=10),
    ]
    
    online_shops = [
        OnlineShop(shop_name="Shop1", email="sales@shop1.com", t_shirts=shop1_tshirts),
        OnlineShop(shop_name="Shop2", email="sales@shop2.com", t_shirts=shop2_tshirts),
        OnlineShop(shop_name="Shop3", email="sales@shop3.com", t_shirts=shop3_tshirts),
    ]
    
    Shops_ByTShirt_Simple().execute(store)
    Shops_ByTShirt_Fanout().execute(store)
    Sales_ByTShirtColor_Fanout().execute(store)
    
    with store.open_session() as session:
        for shop in online_shops:
            session.store(shop)
    
        session.save_changes()

Simple index - Single index-entry per document

  • The index:

    class Shops_ByTShirt_Simple(AbstractIndexCreationTask):
        class IndexEntry:
            def __init__(self, colors: List[str] = None, sizes: List[str] = None, logos: List[str] = None):
                # The index-fields:
                self.colors = colors
                self.sizes = sizes
                self.logos = logos
    
        def __init__(self):
            super().__init__()
            # Creating a SINGLE index-entry per document:
            self.map = (
                "from shop in docs.OnlineShops "
                "select new { "
                # Each index-field will hold a collection of nested values from the document
                "    colors = shop.t_shirts.Select(x => x.color),"
                "    sizes = shop.t_shirts.Select(x => x.size),"
                "    logos = shop.t_shirts.Select(x => x.logo)"
                "}"
            )
  • The index-entries:

    Simple - index-entries

    A single index-entry per document

    1. The index-entries content is visible from the Studio Query view.

    2. Check option: Show raw index-entries instead of Matching documents.

    3. Each row represents an index-entry.
      The index has a single index-entry per document (3 entries in this example).

    4. The index-field contains a collection of ALL nested values from the document.
      e.g. The third index-entry has the following values in the Colors index-field:
      {"black", "blue", "red"}

  • Querying the index:

    # Query for all shop documents that have a red TShirt
    shops_that_have_red_shirts = list(
        session.query_index_type(Shops_ByTShirt_Simple, Shops_ByTShirt_Simple.IndexEntry)
        .contains_any("colors", ["Red"])
        .of_type(OnlineShop)
    )
    from index "Shops/ByTShirt/Simple"
    where Colors == "red"

    # Results will include the following shop documents:
    # ==================================================
    # * Shop1
    # * Shop3
  • When to use:

    • This type of index structure is effective for retrieving documents when filtering the query by any of the inner nested values that were indexed.

    • However, due to the way the index-entries are generated, this index cannot provide results for a query searching for documents that contain specific sub-objects which satisfy some and_also condition.
      For example:

      # You want to query for shops containing "Large Green TShirts",
      # aiming to get only "Shop1" as a result since it has such a combination,
      # so you attempt this query:
      green_and_large = list(
          session.query_index_type(Shops_ByTShirt_Simple, Shops_ByTShirt_Simple.IndexEntry)
          .contains_any("colors", ["green"])
          .and_also()
          .contains_any("sizes", "L")
          .of_type(OnlineShop)
      )
      
      # But, the results of this query will include BOTH "Shop1" & "Shop2"
      # since the index-queries do not keep the original sub-subjects structure.
    • To address this, you must use a Fanout index - as described below.

Fanout index - Multiple index-entries per document

  • What is a Fanout index:

    • A fanout index is an index that outputs multiple index-entries per document.
      A separate index-entry is created for each nested sub-object from the document.

    • The fanout index is useful when you need to retrieve documents matching query criteria
      that search for specific sub-objects that comply with some logical conditions.

  • Fanout index - Map index example:

    # A fanout map-index:
    # ===================
    class Shops_ByTShirt_Fanout(AbstractIndexCreationTask):
        class IndexEntry:
            def __init__(self, color: str = None, size: str = None, logo: str = None):
                self.color = color
                self.size = size
                self.logo = logo
    
        def __init__(self):
            super().__init__()
            # Creating MULTIPLE index-entries per document,
            # an index-entry for each sub-object in the TShirts list
            self.map = (
                "from shop in docs.OnlineShops from shirt in shop.t_shirts "
                "select new {"
                "    color = shirt.color,"
                "    size = shirt.size,"
                "    logo = shirt.logo"
                "}"
            )
    class Shops_ByTShirt_JS(AbstractJavaScriptIndexCreationTask):
        def __init__(self):
            super().__init__()
            self.maps = {
                """
                map('OnlineShops', function (shop){ 
                           var res = [];
                           shop.t_shirts.forEach(shirt => {
                               res.push({
                                   color: shirt.color,
                                   size: shirt.size,
                                   logo: shirt.logo
                               })
                            });
                            return res;
                        })
                """
            }

    # Query the fanout index:
    # =======================
    shops_that_have_medium_red_shirts = list(
        session.query_index_type(Shops_ByTShirt_Fanout, Shops_ByTShirt_Fanout.IndexEntry)
        # Query for documents that have a "Medium Red TShirt"
        .where_equals("color", "red")
        .and_also()
        .where_equals("size", "M")
        .of_type(OnlineShop)
    )
    from index "Shops/ByTShirt/Fanout" 
    where Color == "red" and Size == "M"

    # Query results:
    # ==============
    #
    # Only the 'Shop1' document will be returned,
    # since it is the only document that has the requested combination within the TShirt list.
  • The index-entries: Fanout - index-entries

    1. The index-entries content is visible from the Studio Query view.

    2. Check option: Show raw index-entries instead of Matching documents.

    3. Each row represents an index-entry.
      Each index-entry corresponds to an inner item in the TShirt list.

    4. In this example, the total number of index-entries is 12,
      which is the total number of inner items in the TShirt list in all 3 documents in the collection.

  • Fanout index - Map-Reduce index example:

    • The fanout index concept applies to map-reduce indexes as well:

      class Sales_ByTShirtColor_Fanout(AbstractIndexCreationTask):
          class IndexEntry:
              def __init__(self, color: str = None, items_sold: int = None, total_sales: float = None):
                  self.color = color
                  self.items_sold = items_sold
                  self.total_sales = total_sales
      
          def __init__(self):
              super().__init__()
              # Creating MULTIPLE index-entries per document,
              # an index-entry for each sub-object in the TShirts list
              self.map = (
                  "from shop in docs.OnlineShops from shirt in shop.t_shirts "
                  "select new {"
                  "    color = shirt.color, "
                  "    items_sold = shirt.sold, "
                  "    total_sales = shirt.price * shirt.sold"
                  "}"
              )
              self.reduce = (
                  "from result in results group result by result.color into g select new {"
                  "    color = g.Key,"
                  "    items_sold = g.Sum(x => x.items_sold),"
                  "    total_sales = g.Sum(x => x.total_sales)"
                  "}"
              )

      # Query the fanout index:
      # =======================
      query_result = (
          session.query_index_type(Sales_ByTShirtColor_Fanout, Sales_ByTShirtColor_Fanout.IndexEntry)
          # Query for index-entries that contain "black"
          .where_equals("color", "black").first()
      )
      
      # Get total sales for black TShirts
      black_shirts_sales = query_result.total_sales or 0
      from index "Sales/ByTShirtColor/Fanout"
      where Color == "black"
  • Fanout index - Performance hints:

    • Fanout indexes are typically more resource-intensive than other indexes as RavenDB has to index a large number of index-entries. This increased workload can lead to higher CPU and memory utilization, potentially causing a decline in the overall performance of the index.

    • When the number of index-entries generated from a single document exceeds a configurable limit,
      RavenDB will issue a High indexing fanout ratio alert in the Studio notification center.

    • You can control when this performance hint is created by setting the PerformanceHints.Indexing.MaxIndexOutputsPerDocument configuration key (default is 1024).

    • So, for example, adding another OnlineShop document with a tShirt object containing 1025 items
      will trigger the following alert:

      Figure 1. High indexing fanout ratio notification

      High indexing fanout ratio notification

    • Clicking the 'Details' button will show the following info:

      Figure 2. Fanout index, performance hint details

      Fanout index, performance hint details

  • Fanout index - Paging:

    • A fanout index has more index-entries than the number of documents in the collection indexed.
      Multiple index-entries "point" to the same document from which they originated,
      as can be seen in the above index-entries example.

    • When making a fanout index query that should return full documents (without projecting results),
      the TotalResults property (available via the QueryStatistics object) will contain
      the total number of index-entries and Not the total number of resulting documents.

    • To overcome this when paging results, you must take into account the number of "duplicate"
      index-entries that are skipped internally by the server when serving the resulting documents.

    • Please refer to paging through tampered results for further explanation and examples.