Indexing Nested data
-
JSON documents can have nested structures, where one document contains other objects or arrays of objects.
-
Use a static-index to facilitate querying for documents based on the nested data.
-
In this page:
Sample data
-
The examples in this article are based on the following Classes and Sample Data:
class OnlineShop: def __init__(self, shop_name: str = None, email: str = None, t_shirts: List[TShirt] = None): self.shop_name = shop_name self.email = email self.t_shirts = t_shirts @classmethod def from_json(cls, json_data: Dict[str, Any]) -> OnlineShop: return cls( json_data["shop_name"], json_data["email"], [TShirt.from_json(shirt_json_dict) for shirt_json_dict in json_data["t_shirts"]], ) def to_json(self) -> Dict[str, Any]: return { "shop_name": self.shop_name, "email": self.email, "t_shirts": [tshirt.to_json() for tshirt in self.t_shirts], } class TShirt: def __init__(self, color: str = None, size: str = None, logo: str = None, price: float = None, sold: int = None): self.color = color self.size = size self.logo = logo self.price = price self.sold = sold @classmethod def from_json(cls, json_data: Dict[str, Any]) -> TShirt: return cls(json_data["color"], json_data["size"], json_data["logo"], json_data["price"], json_data["sold"]) def to_json(self) -> Dict[str, Any]: return {"color": self.color, "size": self.size, "logo": self.logo, "price": self.price, "sold": self.sold}
# Creating sample data for the examples in this article: # ====================================================== shop1_tshirts = [ TShirt(color="Red", size="S", logo="Bytes and Beyond", price=25, sold=2), TShirt(color="Red", size="M", logo="Bytes and Beyond", price=25, sold=4), TShirt(color="Blue", size="M", logo="Query Everything", price=28, sold=5), TShirt(color="Green", size="L", logo="Data Driver", price=30, sold=3), ] shop2_tshirts = [ TShirt(color="Blue", size="S", logo="Coffee, Code, Repeat", price=22, sold=12), TShirt(color="Blue", size="M", logo="Coffee, Code, Repeat", price=22, sold=7), TShirt(color="Green", size="M", logo="Big Data Dreamer", price=25, sold=9), TShirt(color="Black", size="L", logo="Data Mining Expert", price=20, sold=11), ] shop3_tshirts = [ TShirt(color="Red", size="S", logo="Bytes of Wisdom", price=18, sold=2), TShirt(color="Blue", size="M", logo="Data Geek", price=20, sold=6), TShirt(color="Black", size="L", logo="Data Revolution", price=15, sold=8), TShirt(color="Black", size="XL", logo="Data Revolution", price=15, sold=10), ] online_shops = [ OnlineShop(shop_name="Shop1", email="sales@shop1.com", t_shirts=shop1_tshirts), OnlineShop(shop_name="Shop2", email="sales@shop2.com", t_shirts=shop2_tshirts), OnlineShop(shop_name="Shop3", email="sales@shop3.com", t_shirts=shop3_tshirts), ] Shops_ByTShirt_Simple().execute(store) Shops_ByTShirt_Fanout().execute(store) Sales_ByTShirtColor_Fanout().execute(store) with store.open_session() as session: for shop in online_shops: session.store(shop) session.save_changes()
Simple index - Single index-entry per document
-
class Shops_ByTShirt_Simple(AbstractIndexCreationTask): class IndexEntry: def __init__(self, colors: List[str] = None, sizes: List[str] = None, logos: List[str] = None): # The index-fields: self.colors = colors self.sizes = sizes self.logos = logos def __init__(self): super().__init__() # Creating a SINGLE index-entry per document: self.map = ( "from shop in docs.OnlineShops " "select new { " # Each index-field will hold a collection of nested values from the document " colors = shop.t_shirts.Select(x => x.color)," " sizes = shop.t_shirts.Select(x => x.size)," " logos = shop.t_shirts.Select(x => x.logo)" "}" )
-
A single index-entry per document
-
The index-entries content is visible from the Studio Query view.
-
Check option: Show raw index-entries instead of Matching documents.
-
Each row represents an index-entry.
The index has a single index-entry per document (3 entries in this example). -
The index-field contains a collection of ALL nested values from the document.
e.g. The third index-entry has the following values in the Colors index-field:
{"black", "blue", "red"}
-
-
# Query for all shop documents that have a red TShirt shops_that_have_red_shirts = list( session.query_index_type(Shops_ByTShirt_Simple, Shops_ByTShirt_Simple.IndexEntry) .contains_any("colors", ["Red"]) .of_type(OnlineShop) )
from index "Shops/ByTShirt/Simple" where Colors == "red"
# Results will include the following shop documents: # ================================================== # * Shop1 # * Shop3
-
-
This type of index structure is effective for retrieving documents when filtering the query by any of the inner nested values that were indexed.
-
However, due to the way the index-entries are generated, this index cannot provide results for a query searching for documents that contain specific sub-objects which satisfy some
and_also
condition.
For example:
# You want to query for shops containing "Large Green TShirts", # aiming to get only "Shop1" as a result since it has such a combination, # so you attempt this query: green_and_large = list( session.query_index_type(Shops_ByTShirt_Simple, Shops_ByTShirt_Simple.IndexEntry) .contains_any("colors", ["green"]) .and_also() .contains_any("sizes", "L") .of_type(OnlineShop) ) # But, the results of this query will include BOTH "Shop1" & "Shop2" # since the index-queries do not keep the original sub-subjects structure.
-
To address this, you must use a Fanout index - as described below.
-
Fanout index - Multiple index-entries per document
-
-
A fanout index is an index that outputs multiple index-entries per document.
A separate index-entry is created for each nested sub-object from the document. -
The fanout index is useful when you need to retrieve documents matching query criteria
that search for specific sub-objects that comply with some logical conditions.
-
-
Fanout index - Map index example:
# A fanout map-index: # =================== class Shops_ByTShirt_Fanout(AbstractIndexCreationTask): class IndexEntry: def __init__(self, color: str = None, size: str = None, logo: str = None): self.color = color self.size = size self.logo = logo def __init__(self): super().__init__() # Creating MULTIPLE index-entries per document, # an index-entry for each sub-object in the TShirts list self.map = ( "from shop in docs.OnlineShops from shirt in shop.t_shirts " "select new {" " color = shirt.color," " size = shirt.size," " logo = shirt.logo" "}" )
class Shops_ByTShirt_JS(AbstractJavaScriptIndexCreationTask): def __init__(self): super().__init__() self.maps = { """ map('OnlineShops', function (shop){ var res = []; shop.t_shirts.forEach(shirt => { res.push({ color: shirt.color, size: shirt.size, logo: shirt.logo }) }); return res; }) """ }
# Query the fanout index: # ======================= shops_that_have_medium_red_shirts = list( session.query_index_type(Shops_ByTShirt_Fanout, Shops_ByTShirt_Fanout.IndexEntry) # Query for documents that have a "Medium Red TShirt" .where_equals("color", "red") .and_also() .where_equals("size", "M") .of_type(OnlineShop) )
from index "Shops/ByTShirt/Fanout" where Color == "red" and Size == "M"
# Query results: # ============== # # Only the 'Shop1' document will be returned, # since it is the only document that has the requested combination within the TShirt list.
-
-
The index-entries content is visible from the Studio Query view.
-
Check option: Show raw index-entries instead of Matching documents.
-
Each row represents an index-entry.
Each index-entry corresponds to an inner item in the TShirt list. -
In this example, the total number of index-entries is 12,
which is the total number of inner items in the TShirt list in all 3 documents in the collection.
-
-
Fanout index - Map-Reduce index example:
-
The fanout index concept applies to map-reduce indexes as well:
class Sales_ByTShirtColor_Fanout(AbstractIndexCreationTask): class IndexEntry: def __init__(self, color: str = None, items_sold: int = None, total_sales: float = None): self.color = color self.items_sold = items_sold self.total_sales = total_sales def __init__(self): super().__init__() # Creating MULTIPLE index-entries per document, # an index-entry for each sub-object in the TShirts list self.map = ( "from shop in docs.OnlineShops from shirt in shop.t_shirts " "select new {" " color = shirt.color, " " items_sold = shirt.sold, " " total_sales = shirt.price * shirt.sold" "}" ) self.reduce = ( "from result in results group result by result.color into g select new {" " color = g.Key," " items_sold = g.Sum(x => x.items_sold)," " total_sales = g.Sum(x => x.total_sales)" "}" )
# Query the fanout index: # ======================= query_result = ( session.query_index_type(Sales_ByTShirtColor_Fanout, Sales_ByTShirtColor_Fanout.IndexEntry) # Query for index-entries that contain "black" .where_equals("color", "black").first() ) # Get total sales for black TShirts black_shirts_sales = query_result.total_sales or 0
from index "Sales/ByTShirtColor/Fanout" where Color == "black"
-
-
Fanout index - Performance hints:
-
Fanout indexes are typically more resource-intensive than other indexes as RavenDB has to index a large number of index-entries. This increased workload can lead to higher CPU and memory utilization, potentially causing a decline in the overall performance of the index.
-
When the number of index-entries generated from a single document exceeds a configurable limit,
RavenDB will issue a High indexing fanout ratio alert in the Studio notification center. -
You can control when this performance hint is created by setting the PerformanceHints.Indexing.MaxIndexOutputsPerDocument configuration key (default is 1024).
-
So, for example, adding another OnlineShop document with a
tShirt
object containing 1025 items
will trigger the following alert:High indexing fanout ratio notification
-
Clicking the 'Details' button will show the following info:
Fanout index, performance hint details
-
-
-
A fanout index has more index-entries than the number of documents in the collection indexed.
Multiple index-entries "point" to the same document from which they originated,
as can be seen in the above index-entries example. -
When making a fanout index query that should return full documents (without projecting results),
theTotalResults
property (available via theQueryStatistics
object) will contain
the total number of index-entries and Not the total number of resulting documents. -
To overcome this when paging results, you must take into account the number of "duplicate"
index-entries that are skipped internally by the server when serving the resulting documents. -
Please refer to paging through tampered results for further explanation and examples.
-