Over years of building services at Amazon, we've experienced various versions of the following scenario: We build a new service, and this service needs to make some network calls to fulfill its requests. Perhaps these calls are to a relational database, to an AWS service like Amazon DynamoDB, or to another internal service. In simple tests or at low request rates the service works great, but we notice a problem on the horizon. The problem might be that calls to this other service are slow, or that the database is expensive to scale out as call volume increases. We also notice that many requests are using the same downstream resource or the same query results, so we think that caching this data could be the answer to our problems.

We add a cache, and our service appears much improved. We observe that request latency is down, costs are reduced, and small downstream availability drops are smoothed over. After a while, no one can remember life before the cache. Dependencies reduce their fleet sizes accordingly, and the database is scaled down. Yet just when everything appears to be going well, the service could be poised for disaster. Changes in traffic patterns, failure of the cache fleet, or other unexpected circumstances could lead to a cold or otherwise unavailable cache. This in turn could cause a surge of traffic to downstream services that leads to outages both in our dependencies and in our own service.

We've just described a service that has become addicted to its cache. The cache has been inadvertently elevated from a helpful addition into a necessary and critical part of the service's ability to operate. At the heart of this issue is the modal behavior introduced by the cache: the service behaves differently depending on whether a given object is cached. An unanticipated shift in the distribution of this modal behavior can potentially lead to disaster.

We have experienced both the benefits and challenges of caching in the course of building and operating services at Amazon. The remainder of this article describes our lessons learned, best practices, and considerations for using caches.

Several factors lead us to consider adding a cache to our system. Many times this begins with an observation about a dependency's latency or efficiency at a given request rate, for example when we determine that a dependency might start throttling or otherwise be unable to keep up with the anticipated load. We've also found it helpful to consider caching when we encounter uneven request patterns that lead to hot-key or hot-partition throttling.

Data from a dependency is a good candidate for caching if such a cache would have a good cache hit ratio across requests. That is, results of calls to the dependency can be reused across multiple requests or operations. If each request typically requires a unique query to the dependent service with unique-per-request results, then a cache would have a negligible hit rate and would do no good. A second consideration is how tolerant a team's service and its clients are to eventual consistency. Cached data necessarily grows inconsistent with the source over time, so caching can only be successful if both the service and its clients compensate accordingly. The rate of change of the source data, as well as the cache policy for refreshing data, determines how inconsistent the data tends to be. For example, relatively static or slow-changing data can be cached for longer periods of time.

Service caches can be implemented either in memory or external to the service. On-box caches, commonly implemented in process memory, are relatively quick and easy to implement and can provide significant improvements with minimal work. They are often the first approach implemented and evaluated once the need for caching is identified. In contrast to external caches, they come with no additional operational overhead, so they are fairly low-risk to integrate into an existing service. We often implement an on-box cache as an in-memory hash table that is managed through application logic (for example, by explicitly placing results into the cache after the service calls are completed) or embedded in the service client (for example, by using a caching HTTP client).

Despite the benefits and seductive simplicity of in-memory caches, they do come with several downsides. One is that the cached data will be inconsistent from server to server across the fleet, manifesting a cache coherence problem.
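As a rough sketch of an on-box cache of the kind described above, the following shows an in-memory hash table managed through application logic, with a per-entry TTL so that slow-changing data can be cached longer, and hit/miss counters so the cache hit ratio can be observed. The class and method names here are illustrative assumptions, not code from any Amazon service:

```python
import time


class TTLCache:
    """A minimal on-box cache: an in-memory hash table with per-entry TTLs."""

    def __init__(self, default_ttl_seconds=60.0):
        self._store = {}  # key -> (value, expiry timestamp)
        self._default_ttl = default_ttl_seconds
        self.hits = 0
        self.misses = 0

    def get(self, key, loader, ttl_seconds=None):
        """Return the cached value for key, or call loader(key) and cache the result."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: fetch from the dependency and refresh the cache.
        self.misses += 1
        value = loader(key)  # e.g. a database query or downstream service call
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        self._store[key] = (value, now + ttl)
        return value

    def hit_ratio(self):
        """Fraction of lookups served from memory; near zero means caching does no good."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Illustrative usage: cache results for five minutes.
profile_cache = TTLCache(default_ttl_seconds=300.0)
value = profile_cache.get("user-123", loader=lambda key: {"id": key})  # miss: calls loader
value = profile_cache.get("user-123", loader=lambda key: {"id": key})  # hit: served from memory
```

The TTL is the lever the text alludes to: the slower the source data changes, and the more tolerant clients are of eventual consistency, the longer entries can safely live before being refreshed.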