Full-Text Indexing in Nebula Graph 2.0

1. Introduction

Nebula Graph 2.0 supports full-text indexing by using an external full-text search engine. To understand this new feature, let's review the architecture and storage model of Nebula Graph 2.0.

1.1 Architecture of Nebula Graph

Architecture of Nebula Graph

As shown in the preceding figure, the Storage Service consists of three layers. The bottom one is the Store Engine. It is a standalone local store engine that supports get, put, scan, and delete operations on local data. The related interfaces are in the kvstore/KVEngine.h file. Users can customize the local store plugins to meet their own needs. Currently, Nebula Graph provides a RocksDB-based store engine.

Above the local store engine, the multi-group Raft consensus algorithm is implemented. With this implementation, each partition corresponds to one Raft group, where a partition is a data shard in Nebula Graph. Nebula Graph uses hash-based sharding. For more information about how the hash functions work, see Section 1.2.1, Data Storage in Nebula Graph. To create a graph space in Nebula Graph, the number of partitions is required, and it cannot be changed after creation. Choose a number of partitions that leaves room for your business to grow.
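
As a rough illustration of hash-based sharding with a fixed partition count, the sketch below maps a vertex ID to a partition. The hash function (a truncated MD5 digest), the 1-based partition numbering, and the name partition_for are assumptions made for this example; they are not taken from the Nebula Graph source code.

```python
import hashlib


def partition_for(vertex_id: str, num_partitions: int) -> int:
    """Map a vertex ID to a partition in the range 1..num_partitions.

    Assumption: a stable stand-in hash (first 8 bytes of an MD5 digest)
    and 1-based partition IDs. The hash Nebula Graph uses may differ.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.md5(vertex_id.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")
    return hashed % num_partitions + 1


if __name__ == "__main__":
    for vid in ("Tim Duncan", "Tony Parker", "Chris Paul"):
        print(vid, "->", partition_for(vid, 3))
```

Because the result depends on num_partitions, changing the count would move most keys to a different partition, which is one plausible reason why the number of partitions is fixed at creation time.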

The top layer is the storage interfaces. A set of graph-related APIs is implemented in this layer. The API requests are translated into a set of KV operations on the corresponding partitions. This layer makes our storage service real graph storage. Without it, the Storage Service of Nebula Graph would be just a KV storage solution. In Nebula Graph, the KV storage is not provided as a separate service. The main reason is that executing a WHERE clause requires a lot of computation, and the computation needs the schema of the graph, but the schema is not implemented in the KV store layer. The design implemented in Nebula Graph makes computation pushdown easier.

1.2 Storage of Nebula Graph

In Nebula Graph 2.0, the storage structure for vertices, edges, and indexes is improved. Now let's review the storage structure of Nebula Graph 2.0, which can help you understand the implementation of data scanning and index scanning in Nebula Graph 2.0.

1.2.1 Data Storage in Nebula Graph

Nebula Graph stores vertices and edges based on the key-value storage model. This section introduces the storage structure of the keys. The keys are composed of the following items:

  • Type: One byte. It represents the key type, such as vertex, edge, index, or system.
  • PartID: Three bytes. It represents a partition. This field makes it easy to scan all the data of a partition by prefix when the partition is re-balanced.
  • VertexID: n bytes. For an outgoing edge, it represents the ID of the source vertex. For an incoming edge, it represents the ID of the destination vertex.
  • Edge Type: Four bytes. It represents the type of edge. If it is greater than 0, the edge is outgoing. If it is less than 0, the edge is incoming.
  • Rank: Eight bytes. It is used to distinguish edges of the same edge type that have the same source and destination vertices. Users can use it to represent their own business attributes, such as transaction time, transaction serial number, or a sorting weight.
  • PlaceHolder: One byte. It is invisible to users now. In the future, it will be used when we implement distributed transactions.
  • TagID: Four bytes. It represents the type of tag.

1.2.1.1 Vertex Key Format

Type (1 byte) | PartID (3 bytes) | VertexID (n bytes) | TagID (4 bytes)

1.2.1.2 Edge Key Format

Type (1 byte) | PartID (3 bytes) | VertexID (n bytes) | EdgeType (4 bytes) | Rank (8 bytes) | VertexID (n bytes) | PlaceHolder (1 byte)
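
The two key layouts above can be sketched as byte strings. This is only an illustration of the field order and widths: the type codes, the big-endian byte order, the zero-padding of vertex IDs to a fixed length, and the placeholder value are all assumptions, not the actual on-disk encoding.

```python
import struct

# Hypothetical one-byte type codes, chosen only for this example.
KEY_TYPE_VERTEX = 0x01
KEY_TYPE_EDGE = 0x02


def _prefix(key_type: int, part_id: int) -> bytes:
    """Type (1 byte) followed by PartID (3 bytes)."""
    return bytes([key_type]) + part_id.to_bytes(3, "big")


def _fixed_vid(vid: bytes, vid_len: int) -> bytes:
    """Pad a vertex ID with zero bytes to the fixed length n."""
    if len(vid) > vid_len:
        raise ValueError("vertex ID is longer than the fixed VID length")
    return vid.ljust(vid_len, b"\x00")


def vertex_key(part_id: int, vid: bytes, tag_id: int, vid_len: int = 8) -> bytes:
    """Type | PartID | VertexID | TagID."""
    return (_prefix(KEY_TYPE_VERTEX, part_id)
            + _fixed_vid(vid, vid_len)
            + struct.pack(">i", tag_id))


def edge_key(part_id: int, src: bytes, edge_type: int, rank: int,
             dst: bytes, vid_len: int = 8) -> bytes:
    """Type | PartID | VertexID | EdgeType | Rank | VertexID | PlaceHolder."""
    return (_prefix(KEY_TYPE_EDGE, part_id)
            + _fixed_vid(src, vid_len)
            + struct.pack(">i", edge_type)   # signed: > 0 outgoing, < 0 incoming
            + struct.pack(">q", rank)
            + _fixed_vid(dst, vid_len)
            + b"\x01")                       # placeholder byte, value assumed
```

With an 8-byte vertex ID, a vertex key is 16 bytes and an edge key is 33 bytes. Every key of one type in one partition shares the same 4-byte prefix, which is what makes a prefix scan over a partition possible.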

1.2.2 Index Storage in Nebula Graph

  • props binary (n bytes): It represents the value of a property of a tag or an edge type. If the property value is NULL, 0xFF is used.
  • nullable bitset (2 bytes): It indicates whether the value of a property is NULL. It is two bytes (16 bits) long, which means that an index can contain a maximum of 16 properties.

1.2.2.1 Tag Index Key Format

Type (1 byte) | PartID (3 bytes) | IndexID (4 bytes) | props binary (n bytes) | nullable bitset (2 bytes) | VertexID (n bytes)

1.2.2.2 Edge Index Key Format

Type (1 byte) | PartID (3 bytes) | IndexID (4 bytes) | props binary (n bytes) | nullable bitset (2 bytes) | VertexID (n bytes) | Rank (8 bytes) | VertexID (n bytes)
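
To make the props binary and nullable bitset fields more concrete, here is a minimal sketch. The bit order (bit i for the i-th property), the big-endian layout, the zero-padding of non-NULL values, and the function names are assumptions for illustration only.

```python
from typing import Optional, Sequence

MAX_INDEX_PROPS = 16  # a 2-byte bitset has 16 bits


def nullable_bitset(values: Sequence[Optional[bytes]]) -> bytes:
    """Return 2 bytes in which bit i is set when values[i] is NULL (None)."""
    if len(values) > MAX_INDEX_PROPS:
        raise ValueError("an index can contain at most 16 properties")
    bits = 0
    for i, value in enumerate(values):
        if value is None:
            bits |= 1 << i
    return bits.to_bytes(2, "big")


def props_binary(values: Sequence[Optional[bytes]],
                 widths: Sequence[int]) -> bytes:
    """Concatenate fixed-width property values, filling NULLs with 0xFF."""
    out = b""
    for value, width in zip(values, widths):
        if value is None:
            out += b"\xff" * width
        elif len(value) > width:
            raise ValueError("value is longer than its column width")
        else:
            out += value.ljust(width, b"\x00")
    return out
```

For example, nullable_bitset([b"a", None, b"c"]) sets only bit 1, and a 17th property cannot be represented, which matches the 16-property limit described above.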

1.3 Why Is an External Full-Text Search Engine Used?

From the preceding key formats, you can see that a fuzzy text query on a property requires a full table scan or a full index scan, after which the data is filtered row by row. This compromises query performance. If the amount of data is large, an out-of-memory error may occur before the scan is done. Besides, inverted indexing goes against the initial design principle of indexing in Nebula Graph, so it is not implemented for text search. After some research and discussion, to make full-text search work well, we decided to introduce a third-party full-text search engine. This ensures query performance and reduces the development cost of the Nebula Graph kernel.

2. Goal

2.1 Functionalities

In Nebula Graph 2.0, only LOOKUP supports text search. This means that when an external full-text search engine is available, users can run a LOOKUP statement to perform a text search. For the external full-text search engine, only some basic functionalities, such as inserting data and querying data, are implemented. To support more complex plain-text queries, Nebula Graph needs to be polished further. Any suggestions from the Nebula Graph community are welcome. The following text search expressions are supported by Nebula Graph 2.0:

  • Fuzzy search.
  • Prefix search.
  • Wildcard search.
  • Regular expression search.

2.2 Performance

In this article, I will discuss data synchronization performance and query performance.

  • Data synchronization performance: Because an external full-text search engine is used, a copy of the data must be stored in the external engine. It has been verified that the import performance of an external full-text search engine is lower than that of Nebula Graph. Therefore, in order not to decrease the data import performance of Nebula Graph, we decided to use an asynchronous synchronization solution to import data into the external full-text search engine. For more information, see the following sections.
  • Query performance: As mentioned above, without an external full-text search engine, full-text search would be a nightmare for Nebula Graph. At present, with an external full-text search engine, LOOKUP supports text search, but the performance is inevitably lower than that of the native index scan of Nebula Graph, and sometimes the query performance of the external engine itself is low. To solve this problem, a timeliness mechanism, LIMIT and TIMEOUT, is needed to ensure the query performance. For more information, see the following sections.

3. Glossary

Term | Meaning
Tag | Defines the property structure of vertices. Tags are identified by tagId. Multiple tags can be attached to one vertex.
Edge | Defines the property structure of edges. Edge types are identified by edgetype.
Property | Defines the properties of a tag or an edge type. Its data type is defined in a tag or an edge type.
Partition | Represents the smallest logical store unit in Nebula Graph. A Storage Engine contains multiple partitions. The Leader or Follower role can be assigned to a partition. Raftex ensures data consistency between Leaders and Followers.
Graph space | Each graph space is an independent business graph unit. Each graph space has its own independent set of tags and edge types. A Nebula Graph cluster can have multiple graph spaces.
Index | In the following sections, an index refers to an index on the properties of vertices or edges in Nebula Graph. Its data type is determined by the tag or edge type definition.
TagIndex | Represents an index on a tag. A tag can have more than one index. Indexes across multiple tags are not supported yet.
EdgeIndex | Represents an index on an edge type. An edge type can have more than one index. Indexes across multiple edge types are not supported yet.
Scan Policy | Defines the index scan policy. Generally, a query statement can use multiple index scan policies, and Scan Policy decides which policy is used.
Optimizer | Optimizes the query conditions to improve the query efficiency, for example, by sorting, splitting, and merging sub-expression nodes on the expression tree of the WHERE clause.

4. Implementation

Elasticsearch is the external full-text search engine that is supported by Nebula Graph. In this section, I will introduce how Elasticsearch works with Nebula Graph 2.0.

4.1 Storage Structure

4.1.1 DocID

partId (10 bytes) | schemaId (10 bytes) | encoded_columnName (32 bytes) | encoded_val (max 344 bytes)
  • partId: Corresponds to the partition ID of Nebula Graph. It is not used in Nebula Graph 2.0. In the future, it will be used for query pushdown and the routing feature of Elasticsearch.
  • schemaId: Corresponds to the tagId or edgetype in Nebula Graph.
  • encoded_columnName: Corresponds to the property name of a tag or an edge type. It is encoded with the MD5 algorithm to avoid characters that are incompatible with the Elasticsearch docID.
  • encoded_val: The maximum length is 344 bytes. Some visible characters in property values are not supported in an Elasticsearch docID, so the property values are encoded with the Base64 algorithm, and a 256-byte value grows to 344 bytes after encoding. The actual property value is therefore limited to 256 bytes. Why 256 bytes? In the beginning, we just wanted to enable LOOKUP to perform text search. As in MySQL, the length of an index in Nebula Graph is limited, and the recommended maximum length is 256 bytes. Therefore, the 256-byte limit is also applied to the external search engine. So far, full-text search for long texts has not been supported.
  • The maximum length of an Elasticsearch docID is 512 bytes. So far, about 100 bytes are reserved.
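
The length arithmetic above can be checked with a short sketch that assembles a docID from the four fields. The zero-padded decimal encoding of partId and schemaId, the hexadecimal MD5 digest, the standard Base64 alphabet, and the name make_doc_id are assumptions for illustration; only the field widths come from the layout above.

```python
import base64
import hashlib

MAX_VALUE_BYTES = 256    # limit on the raw property value
MAX_DOC_ID_BYTES = 512   # Elasticsearch docID limit stated above


def make_doc_id(part_id: int, schema_id: int, column_name: str, value: str) -> str:
    """Concatenate partId, schemaId, encoded_columnName, and encoded_val."""
    raw = value.encode("utf-8")
    if len(raw) > MAX_VALUE_BYTES:
        raise ValueError("property value exceeds 256 bytes")
    part = f"{part_id:010d}"        # 10 characters (assumed decimal, zero-padded)
    schema = f"{schema_id:010d}"    # 10 characters (assumed decimal, zero-padded)
    column = hashlib.md5(column_name.encode("utf-8")).hexdigest()  # 32 hex characters
    encoded = base64.b64encode(raw).decode("ascii")                # at most 344 characters
    return part + schema + column + encoded


if __name__ == "__main__":
    longest = make_doc_id(1, 2, "name", "x" * 256)
    print(len(longest), MAX_DOC_ID_BYTES - len(longest))
```

Base64 turns every 3 input bytes into 4 output characters, so 256 bytes become ceil(256 / 3) × 4 = 344 characters. The longest docID is then 10 + 10 + 32 + 344 = 396 characters, which leaves 116 of the 512 bytes unused, in line with the roughly 100 reserved bytes mentioned above.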

4.1.2 Doc Fields

  • schema_id: Corresponds to the tagId or edgetype in Nebula Graph.
  • column_id: Corresponds to the property code of a tag or an edge type in Nebula Graph.
  • value: Corresponds to the property value of the native index in Nebula Graph.

4.2 Synchronizing Data

4.2.1 Leader and Listener

In this section, I will introduce the details of synchronizing data asynchronously. Understanding the Leader and the Listener in Nebula Graph will help you understand the synchronization mechanism.

  • Leader: Nebula Graph is a horizontally scalable distributed system, and its distributed protocol is Raft. In Nebula Graph, different roles can be assigned to a partition, such as Leader, Follower, and Learner. To write a new record to Nebula Graph, the Leader initiates a WAL synchronization event and synchronizes the event with the Followers and the Learners. When network or disk abnormalities occur, the partition role is switched accordingly. Such a mechanism ensures the data security of the distributed database. Leaders, Followers, and Learners are managed by the nebula-storaged process, and the parameters are set in nebula-storaged.conf.
  • Listener: Unlike Leaders, Followers, and Learners, Listeners are managed by a separate process, and their configuration parameters are specified in nebula-storaged-listener.conf. A Listener passively receives the WAL sent by the Leader, parses the WAL regularly, and calls the data insertion API of the external full-text search engine to synchronize the data with the external engine. Nebula Graph 2.0 supports the PUT and BULK interfaces of Elasticsearch.
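
For readers unfamiliar with the BULK interface, the Elasticsearch Bulk API takes a newline-delimited JSON body: one action line followed by one document line per record, ending with a newline. The sketch below builds such a body. The index name and docID values are hypothetical, and this is not necessarily the exact payload that the Listener sends.

```python
import json
from typing import Iterable, Tuple


def bulk_body(index: str, docs: Iterable[Tuple[str, dict]]) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch Bulk API.

    Each (doc_id, document) pair becomes an "index" action line followed
    by the document itself. The body must end with a newline.
    """
    lines = []
    for doc_id, document in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(document))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "nebula_demo_index" and the docIDs are made-up values for illustration.
    body = bulk_body("nebula_demo_index", [
        ("doc-1", {"schema_id": 2, "column_id": "c1", "value": "Tim Duncan"}),
        ("doc-2", {"schema_id": 2, "column_id": "c1", "value": "Tony Parker"}),
    ])
    print(body, end="")
```

Batching several records into one request in this way is the usual reason to prefer BULK over one PUT per record when a lot of WAL entries are waiting.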

Now, let's see how the data is synchronized:

  1. Vertices or edges are inserted via Client or Console.
  2. On the Graph Service layer, the related partition is computed based on the Vertex ID.
  3. On the Graph Service layer, the INSERT request is sent to the Leader of the related partition via storageClient.
  4. The Leader parses the INSERT request and then synchronizes the WAL with the Listener.
  5. The Listener processes the newly synchronized WAL regularly, parses the WAL, and then obtains the STRING property values of the tags or edge types.
  6. The metadata and the property values of the tags and the edge types are assembled into a data structure that is compatible with Elasticsearch.
  7. The data is written into Elasticsearch via the PUT or BULK interface.
  8. If writing the data fails, go back to Step 5 and retry the failed WAL until the writing succeeds.
  9. When the writing succeeds, the Log ID and Term ID of the successful write are recorded as the starting point for synchronizing the next WAL.
  10. Go back to Step 5 to process the new WAL.

In the preceding steps, if the Elasticsearch cluster or the Listener process crashes, the synchronization of the WAL stops. When the system is restored, data synchronization continues from the last successful Log ID. We recommend that DBAs monitor the state of the Elasticsearch cluster in real time with an external monitoring tool. If the Elasticsearch cluster is inactive for a long time, a large number of logs accumulate on the Listener and queries cannot be performed normally.

4.3 Querying Data
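
The retry-and-resume behavior described in the steps above can be sketched as a small loop. Everything here is a simplified model: WAL entries are plain tuples, the write callback stands in for the PUT or BULK call, FlakySink is a made-up test double, and the retry count is bounded so that the example terminates, whereas the steps above retry until the write succeeds.

```python
from typing import Callable, Iterable, Tuple

WalEntry = Tuple[int, int, dict]  # (log_id, term_id, document), a simplified model


def sync_wal(entries: Iterable[WalEntry],
             write: Callable[[dict], None],
             start_after: Tuple[int, int] = (0, 0),
             max_attempts: int = 5) -> Tuple[int, int]:
    """Write WAL entries in order and return the last (log_id, term_id) written."""
    last = start_after
    for log_id, term_id, document in entries:
        if log_id <= last[0]:
            continue  # already synchronized before a crash or restart
        for _ in range(max_attempts):
            try:
                write(document)
                break
            except ConnectionError:
                continue  # Step 8: retry the same entry
        else:
            return last  # sink still down; resume from `last` on the next run
        last = (log_id, term_id)  # Step 9: record the successful position
    return last


class FlakySink:
    """Hypothetical sink that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.docs = []

    def __call__(self, document: dict) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("search engine unavailable")
        self.docs.append(document)
```

Calling sync_wal again with start_after set to the returned position skips the entries that were already written, which mirrors how synchronization continues from the last successful Log ID after a restart.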

Querying Data Flow Diagram

From the preceding figure, we can see the key steps in a text search as follows:

  1. Send Fulltext Scan Request: Generates a search request for the full-text index based on the query conditions, Schema ID, and Property ID. That is, the CURL command for Elasticsearch is encapsulated.
  2. Fulltext Cluster: Sends the query request to Elasticsearch and obtains the result.
  3. Collect Constant Values: Uses the returned result as constant values to generate an internal query expression of Nebula Graph. For example, if the original request is to query the values of the C1 property that start with "A", and the returned result contains both "A1" and "A2", the expression C1 == "A1" OR C1 == "A2" is generated.
  4. IndexScan Optimizer: Based on the newly generated expression, finds the optimal internal index of Nebula Graph by using RBO and generates the optimal execution plan.
  5. Fulltext Cluster: In this step, the query may be slow or a large amount of data may be returned. Therefore, LIMIT and TIMEOUT are adopted to interrupt the query on the Elasticsearch side in real time.
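
Steps 1, 3, and 5 above can be sketched as two small helpers. The request body uses standard Elasticsearch query DSL keywords (bool, term, prefix, size, timeout) together with the document fields from Section 4.1.2, but the exact request that Nebula Graph builds is not shown in this article, so treat the body shape and the function names as assumptions.

```python
import json
from typing import Iterable


def build_prefix_request(schema_id: int, column_id: str, prefix: str,
                         limit: int, timeout_ms: int) -> str:
    """Build a search body for a prefix match, bounded by LIMIT and TIMEOUT."""
    body = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"schema_id": schema_id}},
                    {"term": {"column_id": column_id}},
                    {"prefix": {"value": prefix}},
                ]
            }
        },
        "size": limit,                 # LIMIT: cap the number of hits
        "timeout": f"{timeout_ms}ms",  # TIMEOUT: bound the search time
    }
    return json.dumps(body)


def to_or_expression(prop: str, values: Iterable[str]) -> str:
    """Turn returned values into an expression such as C1 == "A1" OR C1 == "A2"."""
    return " OR ".join(f"{prop} == {json.dumps(value)}" for value in values)


if __name__ == "__main__":
    print(build_prefix_request(2, "c1", "A", limit=100, timeout_ms=200))
    print(to_or_expression("C1", ["A1", "A2"]))
```

The generated expression contains only equality comparisons on constants, which is the kind of condition that the native index scan can serve after the optimizer picks an index.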

5. Demonstration

5.1 Deploying External Elasticsearch Cluster

I assume that you are already familiar with the deployment of an Elasticsearch cluster, so I won't describe it in detail. Note that after the Elasticsearch cluster is started successfully, you must create a general template as follows.

{
 "template": "nebula*",
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties" : {
            "tag_id" : { "type" : "long" },
            "column_id" : { "type" : "text" },
            "value" : { "type" : "keyword" }
        }
  }
}

5.2 Deploying Nebula Listener

  • According to the actual environment, modify the configuration parameters in nebula-storaged-listener.conf.
  • Run this command to start the Listener: ./bin/nebula-storaged --flagfile $listener_config_path/nebula-storaged-listener.conf

5.3 Signing In to Text Search Clients

nebula> SIGN IN TEXT SERVICE (127.0.0.1:9200);
nebula> SHOW TEXT SEARCH CLIENTS;
+-------------+------+
| Host        | Port |
+-------------+------+
| "127.0.0.1" | 9200 |
+-------------+------+
| "127.0.0.1" | 9200 |
+-------------+------+
| "127.0.0.1" | 9200 |
+-------------+------+

5.4 Creating a Graph Space of Nebula Graph

CREATE SPACE basketballplayer (partition_num=3,replica_factor=1, vid_type=fixed_string(30));
 
USE basketballplayer;

5.5 Adding Listeners

nebula> ADD LISTENER ELASTICSEARCH 192.168.8.5:46780,192.168.8.6:46780;
nebula> SHOW LISTENER;
+--------+-----------------+-----------------------+----------+
| PartId | Type            | Host                  | Status   |
+--------+-----------------+-----------------------+----------+
| 1      | "ELASTICSEARCH" | "[192.168.8.5:46780]" | "ONLINE" |
+--------+-----------------+-----------------------+----------+
| 2      | "ELASTICSEARCH" | "[192.168.8.5:46780]" | "ONLINE" |
+--------+-----------------+-----------------------+----------+
| 3      | "ELASTICSEARCH" | "[192.168.8.5:46780]" | "ONLINE" |
+--------+-----------------+-----------------------+----------+

5.6 Creating Tags, Edge Types, and Indexes

The name property should be shorter than 256 bytes. If the business permits, the name property of the player tag should be of the fixed_string type, and its length should be less than 256 bytes.

nebula> CREATE TAG player(name string, age int);
nebula> CREATE TAG INDEX name ON player(name(20));

5.7 Inserting Data

nebula> INSERT VERTEX player(name, age) VALUES
  "Russell Westbrook": ("Russell Westbrook", 30),
  "Chris Paul": ("Chris Paul", 33),
  "Boris Diaw": ("Boris Diaw", 36),
  "David West": ("David West", 38),
  "Danny Green": ("Danny Green", 31),
  "Tim Duncan": ("Tim Duncan", 42),
  "James Harden": ("James Harden", 29),
  "Tony Parker": ("Tony Parker", 36),
  "Aron Baynes": ("Aron Baynes", 32),
  "Ben Simmons": ("Ben Simmons", 22),
  "Blake Griffin": ("Blake Griffin", 30);

5.8 Querying Data

nebula> LOOKUP ON player WHERE PREFIX(player.name, "B");
+-----------------+
| _vid            |
+-----------------+
| "Boris Diaw"    |
+-----------------+
| "Ben Simmons"   |
+-----------------+
| "Blake Griffin" |
+-----------------+
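
To tie this result back to Section 1.3, the sketch below reproduces the same prefix match in plain Python by checking every stored value one by one. It is only a model of the row-by-row filtering that a full scan implies, not of how Nebula Graph or Elasticsearch evaluates PREFIX; the list and the function name are made up for the example.

```python
# The names inserted in Section 5.7, in insertion order.
NAMES = [
    "Russell Westbrook", "Chris Paul", "Boris Diaw", "David West",
    "Danny Green", "Tim Duncan", "James Harden", "Tony Parker",
    "Aron Baynes", "Ben Simmons", "Blake Griffin",
]


def prefix_scan(values, prefix):
    """Return every value that starts with prefix, visiting each value once."""
    return [value for value in values if value.startswith(prefix)]


if __name__ == "__main__":
    for name in prefix_scan(NAMES, "B"):
        print(name)
```

The scan touches all 11 values to return 3 of them. With millions of values, that per-row cost is what the external full-text index is meant to avoid.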

6. Monitoring and Fixing Issues

When you set up the system environment, an error in any step can prevent the functionalities from working normally. Based on user feedback, I have summarized three possible error types. Here is how to analyze and solve these problems:

  • Problem: The Listeners cannot be started or cannot work after startup.
    • Check the Listener configuration file and make sure that the IP:Port configuration of the Listeners does not conflict with that of the existing nebula-storaged process.
    • Check the Listener configuration file and make sure that the IP:Port configuration of Meta is consistent with that of the nebula-storaged process.
    • Check the Listener configuration file and make sure that the PIDs directory and the logs directory are independent and do not conflict with those of the nebula-storaged process.
    • If the Listeners started successfully, the configuration was then modified to fix an error, and the Listeners cannot work normally after a restart, the Meta-related metadata must be cleared. For more information about the commands, see Nebula Graph Database Manual.
  • Problem: The data cannot be synchronized with the Elasticsearch cluster.
    • Make sure that the Listeners have received the WAL from the Leader by checking whether there are any files in the directory specified for listener_path in the nebula-storaged-listener.conf file.
    • Open vlog by running UPDATE CONFIGS storage:v=3 and make sure that the CURL command is executed successfully. If the execution fails, check the Elasticsearch configuration or the compatibility between versions.
  • Problem: There is data in the Elasticsearch cluster but no correct result is returned.
    • Open vlog by running UPDATE CONFIGS graph:v=3 and check the graph logs to find the reasons for the CURL command failures.
    • Only lowercase characters, but not uppercase ones, can be recognized during the query. This may be caused by errors in the Elasticsearch template. For more information, see Nebula Graph Database Manual.

7. TODO

  • Creating full-text indexes on specified tags or edge varieties.
  • Rebuilding full-text indexes (REBUILD).

Want to know more about Nebula Graph? Join the Slack channel!

