Proposals/Elasticsearch
From Mahara Wiki
A few notes on the Elasticsearch implementation in Mahara.
Elasticsearch: installation
A deb package is available so installation is pretty painless.
Check the file htdocs/lib/elastica/README.Mahara to see which version of the Elastica PHP library is currently bundled with Mahara. Each version of Elastica is implemented to specifically support the matching version of Elasticsearch, so for best compatibility you'll want to use that same version, although other versions may work as well.
For Mahara 1.8, 1.9, and 1.10, the supported Elasticsearch version is therefore 0.90.1.0.
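To compare a running server against the bundled library version, one option is to read the version number that Elasticsearch reports at its root URL. The sketch below assumes the server listens on http://localhost:9200 (the default HTTP port) and that only the first three components of the Elastica version need to match the server; both are assumptions, and the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def server_version(info):
    """Extract the version string from the JSON document returned by GET /."""
    return info["version"]["number"]


def same_release(server, bundled, parts=3):
    """Compare the first `parts` dot-separated components of two version strings."""
    return server.split(".")[:parts] == bundled.split(".")[:parts]


def check(base_url="http://localhost:9200", bundled="0.90.1.0"):
    """Ask a running node for its version and compare it (needs a live server)."""
    # base_url is an assumed address; adjust it to your installation.
    with urlopen(base_url) as response:
        info = json.load(response)
    return same_release(server_version(info), bundled)
```

Calling `check()` returns True when the server release lines up with the bundled Elastica version under the assumption above.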
Nodes, shards, replicas
Development Environment
When creating an index, the first question that we need to ask ourselves is "what number of shards and replicas should I use?". The second question is "But... what are a shard and a replica?". The terms are used a lot in the docs, but it takes hours of googling to find an explanation. This is what I understood (so it might not be technically right):
- Node
- A node is a running Elasticsearch instance; in the setups described here, 1 node = 1 server.
- Shard
- Every index is spread over n shards. So how much is n? To answer that, I watched that video so you don't have to.
- Shards can be used to make the search more efficient, but each shard has a cost. There's no definite rule for the maximum size of a shard, so the best approach is to start with 1 shard and see how that works. Note that the number of shards cannot be changed after the index is created; the index would have to be re-created.
- The developers outlined 2 major cases where shards become useful:
- user based:
- If users are most likely looking for their own content, it's a good idea to have 1 shard per user. When indexing content, routing will be necessary to send the data to the right shard (look at the docs for more info). Remember that each shard has a cost, so in the case of Mahara, with lots of users but very few pieces of content per user, this is not a good solution.
- date based:
- For example, logs will most likely be searched by date, and it would make sense to organize shards that way.
- Mahara: Given that the full text search is designed to search everything, there's no strategic way to organize the data. So 1 shard should be enough, although 1.5 GB might be close to the upper limit for one shard, which, as noted above, is not formally defined.
- Replica
- A replica is a copy of a shard, meant to live on a second node. In our development environment, with only 1 instance running, we don't need a replica, so the number of replicas is 0.
- Cluster
- Elasticsearch is clustered: nodes that share the same cluster name join together to form a cluster. In our development environment, it doesn't really matter.
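To make the routing idea more concrete, here is a rough sketch of how a routing value (by default the document id, or a user id for user-based routing) is mapped to a shard: hash the value and take it modulo the number of primary shards. The hash used here (crc32) is only a stand-in for Elasticsearch's internal hash function, and the user names are invented. It also hints at why the number of shards is fixed at index creation: changing it changes where existing documents would be looked up.

```python
import zlib


def shard_for(routing_key, number_of_shards):
    """Pick a shard as hash(routing key) modulo the number of primary shards."""
    # crc32 is a stand-in; Elasticsearch uses its own hash function internally.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


# Hypothetical routing values, e.g. one per Mahara user.
users = ["user-1", "user-2", "user-3", "user-4"]
for n in (1, 2, 4):
    # The same key can land on a different shard once n changes.
    print(n, [shard_for(user, n) for user in users])
```

With 1 shard every key maps to shard 0, which matches the "search everything" case described for Mahara.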
Diagram 1: 1 node, 1 shard, 0 replicas
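As an illustration of the development setup (1 shard, 0 replicas), the sketch below builds the settings body that is sent when an index is created and, optionally, sends it with a PUT request. It talks to Elasticsearch over plain HTTP rather than through Mahara or Elastica; the address http://localhost:9200 and any index name passed in are assumptions.

```python
import json
from urllib.request import Request, urlopen


def index_settings(shards=1, replicas=0):
    """Build the request body for creating an index with the given layout."""
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}


def create_index(name, shards=1, replicas=0, base_url="http://localhost:9200"):
    """Send PUT /<name> with the settings body (needs a running node)."""
    body = json.dumps(index_settings(shards, replicas)).encode("utf-8")
    request = Request(f"{base_url}/{name}", data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    with urlopen(request) as response:
        return json.load(response)


# Development defaults: 1 shard, 0 replicas.
print(json.dumps(index_settings(), indent=2))
```

Because the shard count is fixed at creation time, changing it later means creating a new index with a different body and re-indexing.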
Production Environment
Let's set the number of shards to 2.
Diagram 2: 1 node, 2 shards, 0 replicas
If a new node is started on the same network with the same cluster name, Elasticsearch automatically redistributes the cluster's shards across the nodes.
Diagram 3: 2 nodes, 2 shards, 0 replicas
That's why it's necessary after installation (and before indexing) to change the default cluster name; otherwise it might get a bit messy if 2 different apps are using Elasticsearch with the same cluster name.
To change the name, edit the YAML settings file (/etc/elasticsearch/elasticsearch.yml) and set cluster.name to, for example, MaharaCluster. Elasticsearch must be restarted for the change to take effect.
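After the restart, one way to confirm that the node runs under the new name, and that no unexpected nodes have joined, is to read the cluster health endpoint. The following is a minimal sketch under assumptions: the node answers on http://localhost:9200, and the helper names are made up for illustration.

```python
import json
from urllib.request import urlopen


def cluster_problems(health, expected_name, expected_nodes):
    """Return a list of human-readable problems found in a health response."""
    problems = []
    if health.get("cluster_name") != expected_name:
        problems.append(f"unexpected cluster name: {health.get('cluster_name')!r}")
    if health.get("number_of_nodes") != expected_nodes:
        problems.append(
            f"expected {expected_nodes} node(s), found {health.get('number_of_nodes')}"
        )
    return problems


def fetch_health(base_url="http://localhost:9200"):
    """Read GET /_cluster/health from a running node (assumed address)."""
    with urlopen(f"{base_url}/_cluster/health") as response:
        return json.load(response)
```

For example, `cluster_problems(fetch_health(), "MaharaCluster", 1)` should return an empty list on a single development node that uses the name suggested above.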
This is where replicas become useful. If we set the number of replicas to 1, each shard is copied to the other instance, which means we have the full index on both nodes (servers), and if one fails, search remains fully usable.
Diagram 4: 2 nodes, 2 shards, 1 replica
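The layouts in these diagrams can be reproduced with a toy allocator. This is not how Elasticsearch actually balances shards; it is a simplified sketch that only respects one rule, namely that two copies of the same shard never share a node, and otherwise puts each copy on the least loaded node. Labels such as P0 (primary of shard 0) and R0 (replica of shard 0) are invented for the example.

```python
def allocate(nodes, shards, replicas):
    """Toy allocator: spread shard copies, never two copies of a shard on one node."""
    placement = {node: [] for node in range(nodes)}
    unassigned = []
    for copy in range(replicas + 1):  # copy 0 is the primary
        for shard in range(shards):
            label = ("P" if copy == 0 else "R") + str(shard)
            # Only nodes that hold no copy of this shard yet are eligible.
            candidates = [n for n in placement
                          if all(held[1:] != str(shard) for held in placement[n])]
            if not candidates:
                unassigned.append(label)
                continue
            target = min(candidates, key=lambda n: len(placement[n]))
            placement[target].append(label)
    return placement, unassigned


print(allocate(2, 2, 1))  # ({0: ['P0', 'R1'], 1: ['P1', 'R0']}, [])
print(allocate(1, 1, 1))  # ({0: ['P0']}, ['R0'])
```

The first call shows both nodes holding a copy of every shard. The second shows why a replica is pointless on a single development node: there is nowhere to put it.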
Proposed configuration: Elasticsearch is meant to be installed on a 2+2 server configuration (2 in Wellington + 2 in Auckland). In that case, we could set the number of shards to 4, and the number of replicas to 3, so each of the 4 servers would have a copy of the complete index.
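The arithmetic behind this proposal can be checked quickly. The sketch assumes that copies are spread evenly and that copies of the same shard always sit on different nodes; the function name is made up for illustration.

```python
def describe(nodes, shards, replicas):
    """Summarize how many shard copies a layout produces and where they end up."""
    copies_per_shard = 1 + replicas  # the primary plus its replicas
    total_copies = shards * copies_per_shard
    return {
        "total_shard_copies": total_copies,
        "copies_per_node": total_copies / nodes,  # assumes an even spread
        # Copies of one shard sit on distinct nodes, so every node holds the
        # whole index exactly when each shard has as many copies as there are nodes.
        "full_index_on_every_node": copies_per_shard == nodes,
    }


print(describe(nodes=4, shards=4, replicas=3))
```

For 4 nodes, 4 shards and 3 replicas this gives 16 shard copies, 4 per node, with the full index on every node.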
Conclusion
This is meant to be a quick start guide. I didn't want to go into much detail (primary shards, etc.), and it might not be 100% correct. More configuration might be needed to improve performance in production.
At the moment, the number of shards is hard-coded in the elasticsearch/lib.php file (default values). It might be a good idea to add such parameters to the plug-in configuration page.
One useful feature is the ability to specify 2 index names, one for search and one for the indexing process. It allows administrators to change the search engine parameters and re-index the content while the old index is still available for search.
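The re-indexing workflow that the two index names allow can be sketched as follows. This is not the plug-in's code; the class, function and index names are hypothetical and only illustrate the order of the steps.

```python
from dataclasses import dataclass


@dataclass
class IndexNames:
    """Hypothetical pair of settings: where searches read, where indexing writes."""
    search: str
    write: str


def begin_reindex(names, new_index):
    """Point indexing at a new index; searches keep using the old one."""
    return IndexNames(search=names.search, write=new_index)


def finish_reindex(names):
    """Once the new index is fully populated, switch searches over to it."""
    return IndexNames(search=names.write, write=names.write)


names = IndexNames(search="mahara_v1", write="mahara_v1")
names = begin_reindex(names, "mahara_v2")  # users still search mahara_v1
names = finish_reindex(names)              # searches now hit mahara_v2
print(names)
```

Between the two steps the old index keeps answering searches, so a change of shard count or other index parameters does not take search offline.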
One very cool app that can be used by developers and sysadmins to monitor the search engine is Bigdesk.