Setronica | June 27th, 2024
Elasticsearch is a search engine that has become a go-to solution for storing, searching, and analyzing large volumes of data. At the heart of Elasticsearch lies its indexing mechanism, which plays a crucial role in determining the performance and effectiveness of the search engine.
This mechanism involves creating a structured representation of the data, known as an index. It includes breaking down the raw data into individual terms, analyzing their relevance, and storing them in a highly optimized data structure called an inverted index. By doing so, Elasticsearch can quickly locate and retrieve relevant documents based on user queries, providing near real-time search capabilities.
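To make the idea concrete, here is a toy sketch in Java (not Elasticsearch's actual internal format) of what an inverted index does: each term maps to the documents that contain it, so a term lookup is a single map access.

import java.util.List;
import java.util.Map;

// Toy inverted index: term -> IDs of the documents containing that term.
Map<String, List<Integer>> invertedIndex = Map.of(
        "coffee",   List.of(1, 3),    // documents 1 and 3 mention "coffee"
        "delivery", List.of(2, 3));
// Finding the documents for a query term is then a direct lookup:
List<Integer> hits = invertedIndex.get("coffee"); // -> [1, 3]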
One of the key advantages of Elasticsearch indexing is its ability to handle various data formats, including structured, semi-structured, and unstructured data. This flexibility allows organizations to index and search through a wide range of data sources, such as logs, documents, databases, and even real-time data streams, enabling them to gain valuable insights and make data-driven decisions.
While Elasticsearch indexing offers numerous benefits, it also presents challenges that can degrade search performance if not addressed properly.
Since Setronica uses Elasticsearch in its projects, slow search performance is a serious issue for us. Search is one of the most heavily loaded services in an app, so when a response takes 3 to 5 seconds or more, the problem becomes critical.
To uncover the root cause, let's examine the search architecture of an app that connects users to goods and services nearby.
We needed to reduce the redundancy of the data coming from the search service when processing a search query, so we came up with an idea for making the responses from Elasticsearch more relevant.
To enhance efficiency, the client search query should incorporate geodata from the outset, allowing the Catalog service to filter out sellers based on distance directly within Elasticsearch.
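For a fixed radius, Elasticsearch's built-in geo_distance filter would be enough. Here is a minimal sketch using the Elasticsearch Java client (the field name venues.location is taken from the script shown later in this article; the 1.5 km radius is illustrative). As we explain below, a fixed radius turned out not to be sufficient for our case.

import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Keep only sellers with at least one sales point within 1.5 km of the client.
QueryBuilder distanceFilter = QueryBuilders.geoDistanceQuery("venues.location")
        .point(clientLat, clientLon)             // client coordinates from the request
        .distance(1.5, DistanceUnit.KILOMETERS); // illustrative fixed radius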
Elasticsearch distance calculations are based on the great-circle distance, which can sometimes lead to discrepancies when compared with actual street distances. The example below shows the difference between the distance as the crow flies and the actual street distance.
Generally, this discrepancy is not critical because the straight-line distance is always less than or equal to the actual distance. For example, if the maximum delivery distance is set to 1 km and the straight-line distance from the client to the seller is 1.2 km, the actual distance will be greater, ensuring the seller does not appear in the search results.
However, this can create some redundancy. For instance, if the maximum distance is set to 1.5 km, the seller may pass the Elasticsearch query but will be filtered out later by the Catalog service. Nonetheless, this redundancy should not significantly impact the final results.
To enhance distance filtering, we need the coordinates of sales points within specific operation areas. As part of the recent implementation of the sorting functionality, the merchant index already includes coordinates for all sales points: they are used to order search offers, and each merchant record carries an array of sales point coordinates across all its operation areas.
Given this data, why not use it for distance filtering?
While it might seem logical to use these coordinates for distance filtering to ensure sales points close to the client are not excluded, the maximum distance is a dynamic parameter. It can vary between different areas and can be redefined by the seller for each specific area. This creates a one-to-many relationship between the seller and their delivery settings within a specific delivery area.
Fortunately, this relationship is already reflected in the merchant index structure. For each seller record in the index, there are nested records for each operation area where the seller has at least one active sales point.
Thus, we added two additional fields to each nested operation-area record in the seller index: the sales point coordinates and the effective maximum delivery distance for that area.
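The original field names are not given, so here is a hedged sketch of what such a nested mapping might look like, built with the Java client's XContentBuilder. The path venues matches the document path used in the script below, while maxDistance is an assumed name for the per-area delivery limit.

import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

// Sketch of the merchant index mapping (throws IOException).
XContentBuilder mapping = XContentFactory.jsonBuilder()
    .startObject()
        .startObject("properties")
            .startObject("venues")                // one nested record per operation area
                .field("type", "nested")
                .startObject("properties")
                    .startObject("location")      // sales point coordinates in this area
                        .field("type", "geo_point")
                    .endObject()
                    .startObject("maxDistance")   // max delivery distance in meters (assumed name)
                        .field("type", "double")
                    .endObject()
                .endObject()
            .endObject()
        .endObject()
    .endObject();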
The third challenge involved ensuring the delivery of up-to-date data on distance settings and sales point coordinates to the index. Updating the sales points was straightforward: we simply updated the Data Transfer Object (DTO) for indexing with the required fields. Once this was done, the data began to flow into the index through the existing queue-based indexing mechanism.
To update delivery distance settings, the mechanisms for modifying delivery parameters within the operation area and the seller’s personal settings were refined. When these settings were changed, events were triggered to reindex the merchants in the affected area, ensuring that the index remained current.
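A rough sketch of that trigger is shown below; every name here is hypothetical and stands in for the real services and queue.

// Hypothetical event handler: when delivery settings change in an operation
// area, enqueue the merchants active in that area for reindexing.
void onDeliverySettingsChanged(DeliverySettingsChangedEvent event) {
    List<Long> merchantIds =
            merchantRepository.findIdsByOperationArea(event.getAreaId());
    indexingQueue.publish(new ReindexMerchantsMessage(merchantIds));
}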
With all the necessary data now in the index, the next step was to apply an appropriate filtering function. However, the filtering needed to occur within a subquery, and there was no standard, ready-made solution for filtering by a calculated dynamic value. This meant a custom script was needed for the query.
Moreover, it became evident that the Elasticsearch arcDistance function could not compute distances for fields that are arrays of coordinates. Therefore, we had to manually implement the minimum distance calculation using the Haversine formula and perform the filtering based on the maximum delivery distance within the same script.
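For reference, the Haversine formula for the great-circle distance between two points $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ on a sphere of radius $R$ is:

$$d = 2R \arcsin\sqrt{\sin^2\left(\tfrac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\tfrac{\lambda_2 - \lambda_1}{2}\right)}$$

The script below computes the squared-sine terms through the identity $2\sin^2(\theta/2) = 1 - \cos\theta$, which is why you will see intermediate values of the form 1 - cos(...).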
Here is an example of a script that you can use as a base for your solutions. The resulting code will depend on your parameters and data.
public static ScriptQueryBuilder buildFilterMerchantWorktimeByMaxDistanceScript(@Nonnull GeoPointDto geoPointDto) {
    // Painless script: compute the minimum Haversine distance from the client
    // (passed in via params) to each of the seller's sales points, then keep the
    // seller only if that distance is within the delivery limit stored in the index.
    // Assumes LAT_PARAM = "lat" and LON_PARAM = "lon"; 'venues.maxDistance' is an
    // illustrative name for the per-area maximum delivery distance field.
    String source =
          "double dist = Double.MAX_VALUE;"
        + "double lat2 = params.lat;"
        + "double lon2 = params.lon;"
        + "double TO_METERS = 6371008.7714;"   // mean Earth radius in meters
        + "double TO_RADIANS = Math.PI / 180.0;"
        + "for (int i = 0; i < doc['venues.location'].values.length; i++) {"
        + "  double lat1 = doc['venues.location'][i].lat;"
        + "  double lon1 = doc['venues.location'][i].lon;"
        + "  double x1 = lat1 * TO_RADIANS;"
        + "  double x2 = lat2 * TO_RADIANS;"
        + "  double h1 = 1 - Math.cos(x1 - x2);"
        + "  double h2 = 1 - Math.cos((lon1 - lon2) * TO_RADIANS);"
        + "  double h = h1 + Math.cos(x1) * Math.cos(x2) * h2;"
        + "  double cdist = TO_METERS * 2 * Math.asin(Math.min(1.0, Math.sqrt(h * 0.5)));"
        + "  dist = Math.min(dist, cdist);"
        + "}"
        + "return dist <= doc['venues.maxDistance'].value;";
    Map<String, Object> params = new HashMap<>(2);
    params.put(LAT_PARAM, geoPointDto.getLat());
    params.put(LON_PARAM, geoPointDto.getLon());
    Script script = new Script(Script.DEFAULT_SCRIPT_TYPE, PAINLESS_PROGRAMMING_LANG, source, params);
    return QueryBuilders.scriptQuery(script);
}
We sequentially compute the distance from each sales point coordinate to the client. If the distance is less than the specified threshold, we consider the offer relevant. This additional condition is applied to all queries in the merchant index as a must among all other criteria:
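In terms of the Java client, that composition might look like this; otherCriteria stands in for the query's existing conditions.

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// The distance script is just one more mandatory clause among the existing ones.
BoolQueryBuilder merchantQuery = QueryBuilders.boolQuery()
        .must(otherCriteria)   // text relevance, availability, and other filters
        .must(buildFilterMerchantWorktimeByMaxDistanceScript(clientLocation));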
This approach allows us to instantly filter out results that are irrelevant in terms of delivery distance for specific sellers.
The product index is separate and does not include coordinates, and its data volume is significantly larger, so adding sales point coordinates to product records would be overly complex. Previously, unavailable products were filtered out by checking the sellers of the matched products through another service; now this step is handled within the Search service itself.
After retrieving products from the index, an additional lightweight query is executed against the merchant index using the product seller identifiers extracted from product index records, applying our custom filtering script. This ensures that products from unavailable sellers are immediately excluded, significantly reducing the data sent to the Catalog service.
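A hedged sketch of that follow-up query is shown below; sellerId is an assumed field name, and sellerIds is the list extracted from the product hits.

// Lightweight check against the merchant index: of the sellers whose products
// matched, keep only those that also pass the distance filter.
BoolQueryBuilder merchantCheck = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("sellerId", sellerIds))
        .filter(buildFilterMerchantWorktimeByMaxDistanceScript(clientLocation));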
One of Setronica's long-term clients was experiencing slow search performance. To address this, Setronica implemented the features described above in the client's app and conducted load testing on a dataset equivalent to the production environment.
The load testing results were so impressive that the feature was enabled in the production environment. After some time in production, an analysis of the search API response times revealed a significant improvement: clients now receive responses almost instantly.
The enhancements implemented by Setronica have significantly improved search performance for their long-term client. By refining the filtering mechanism and optimizing the indexing process, Setronica was able to reduce response times dramatically. The systematic approach, from load testing in a controlled environment to enabling the feature in production, demonstrated a clear reduction in database load and a substantial increase in query efficiency.
The results speak for themselves: the client now enjoys near-instantaneous search responses, enhancing user experience and operational efficiency. This case study showcases how thoughtful implementation of advanced search functionalities can solve performance issues and deliver tangible benefits.
Contact us today to discuss your project and see how we can help bring your vision to life. To learn about our team and expertise, visit our 'About Us' webpage.