[improvement](fe) Improve external catalog meta cache observability #63809
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The catalog_meta_cache_statistics table exposed cumulative eviction count for external catalog meta cache entries, but did not provide a direct replacement frequency metric. Users had to derive replacement pressure manually when checking whether the configured meta cache capacity was too small. This change adds EVICTION_RATE, computed as eviction_count / request_count with zero returned when there are no requests, and exposes it beside EVICTION_COUNT in the information schema result.
### Release note
Add EVICTION_RATE to information_schema.catalog_meta_cache_statistics for observing catalog meta cache replacement frequency.
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest
- Behavior changed: Yes (catalog_meta_cache_statistics now includes EVICTION_RATE)
- Does this need documentation: No
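The EVICTION_RATE formula in the Problem Summary above can be sketched as follows. This is a minimal illustration, not the code in the PR: the class and method names are hypothetical, and it assumes request_count means total lookups (hits plus misses).

```java
/** Hypothetical helper for illustration; not the actual Doris MetaCacheEntry code. */
final class EvictionRateSketch {
    private EvictionRateSketch() {
    }

    /** Returns evictionCount / requestCount, or 0.0 when there were no requests. */
    static double evictionRate(long evictionCount, long requestCount) {
        if (requestCount <= 0) {
            return 0.0; // an unused cache reports 0 instead of dividing by zero
        }
        return (double) evictionCount / requestCount;
    }

    public static void main(String[] args) {
        System.out.println(evictionRate(0, 0));      // prints 0.0
        System.out.println(evictionRate(250, 1000)); // prints 0.25
    }
}
```

A rate that stays close to the miss ratio suggests that almost every load pushes another entry out, which is the signal that the configured capacity may be too small.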
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The default Hive partition metadata cache capacity was too small for common external catalog workloads, which could cause frequent evictions even without explicit capacity tuning. This change raises the single Hive partition metadata cache default from 10,000 to 100,000 and raises the Hive partitioned-table values cache default from 1,000 to 10,000. External fuzzy testing now includes the new default-size candidates. While checking similar cache entries, MaxCompute partition_values was found to cache table-level partition value structures but reused the single-partition Hive capacity; it now follows the table-level partition values capacity instead.
### Release note
Increase default Hive partition meta cache capacities to reduce frequent evictions. MaxCompute partition_values cache now uses the table-level partition values capacity setting.
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.datasource.hive.HiveMetaStoreCacheTest,org.apache.doris.datasource.maxcompute.MaxComputeExternalMetaCacheTest
- Behavior changed: Yes (default Hive partition meta cache capacities are larger, and MaxCompute partition_values uses the table-level partition values capacity)
- Does this need documentation: No
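To see why an undersized capacity leads to frequent evictions, the sketch below replays a repeated scan over more partitions than the cache can hold. It is a simplified stand-in under stated assumptions: it uses a plain LRU built on `LinkedHashMap` rather than the Caffeine cache Doris uses, the workload of 50,000 partitions scanned twice is invented for illustration, and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical simulation; a plain LRU stand-in, not Doris's Caffeine-based cache. */
final class CapacitySketch {
    private CapacitySketch() {
    }

    /** Minimal LRU map that counts how many entries it evicts. */
    static final class CountingLru<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;
        long evictions;

        CountingLru(int capacity) {
            super(16, 0.75f, true); // access order gives LRU behaviour
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            boolean evict = size() > capacity;
            if (evict) {
                evictions++;
            }
            return evict;
        }
    }

    /** Scans `partitions` distinct keys `passes` times and returns the eviction count. */
    static long simulate(int capacity, int partitions, int passes) {
        CountingLru<Integer, String> cache = new CountingLru<>(capacity);
        for (int pass = 0; pass < passes; pass++) {
            for (int p = 0; p < partitions; p++) {
                if (cache.get(p) == null) {
                    cache.put(p, "partition-" + p); // simulated metadata load
                }
            }
        }
        return cache.evictions;
    }

    public static void main(String[] args) {
        System.out.println(simulate(10_000, 50_000, 2));  // old default capacity
        System.out.println(simulate(100_000, 50_000, 2)); // new default capacity
    }
}
```

With the old capacity of 10,000 the second pass misses on every lookup, so the simulation reports 90,000 evictions; with 100,000 the whole working set fits and it reports 0. A real Caffeine cache uses a different eviction policy, so the exact numbers would differ, but the qualitative effect is the same.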
run buildall
TPC-H: Total hot run time: 31577 ms
TPC-DS: Total hot run time: 172185 ms
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
/review
Review opinion: no blocking issues found in the PR diff.
Critical checkpoint conclusions:
- Goal/test: The PR adds EVICTION_RATE, increases Hive partition metadata cache defaults, and adjusts MaxCompute partition_values capacity. The changed code and added unit tests cover these goals.
- Scope: The changes are small and focused on cache configuration and information_schema metadata exposure.
- Concurrency/lifecycle: No new concurrency or special lifecycle risk found; cache specs are still captured during catalog cache initialization as before.
- Configuration: Existing config defaults are changed, with fuzzy config updated accordingly. No new dynamic config behavior is introduced.
- Compatibility: The information_schema table gains a new trailing column; row generation was updated in the same order.
- Parallel paths: Hive and MaxCompute partition-values cache paths were checked; the MaxCompute path now uses the intended table-level capacity.
- Tests: I attempted to run ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest, but the runner lacks thirdparty/installed/bin/protoc, so generated-source setup failed before the test executed. I did not run the remaining FE unit tests for the same environment reason.
- Observability: The new EVICTION_RATE is exposed through catalog_meta_cache_statistics and computed safely as 0 when there are no requests.
User focus: No additional user-provided review focus was present.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63809
Problem Summary: The catalog_meta_cache_statistics system table added EVICTION_RATE in FE schema and result rows, but the BE information_schema schema scanner still used the old column list. Querying information_schema.catalog_meta_cache_statistics through schema scan failed during slot matching with no match column for EVICTION_RATE. This change adds EVICTION_RATE to the BE scanner column descriptors in the same order as the FE schema.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check
- Compared FE and BE catalog_meta_cache_statistics column order
- build-support/check-format.sh could not run because local clang-format is version 22 while Doris requires version 16
- Behavior changed: No
- Does this need documentation: No
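The slot-matching failure described in this commit message can be illustrated with a small sketch. It is an assumption-laden model, not the BE scanner code (which is C++): the class, the method, and the exact-name comparison are hypothetical, and only the two column names come from the PR.

```java
import java.util.List;

/** Hypothetical model of name-based slot matching; not the actual BE scanner code. */
final class SlotMatchSketch {
    private SlotMatchSketch() {
    }

    /** Returns the index of the scanner column backing a slot, or fails if none matches. */
    static int matchSlot(String slotName, List<String> scannerColumns) {
        int index = scannerColumns.indexOf(slotName); // exact-name match is an assumption
        if (index < 0) {
            throw new IllegalStateException("no match column for " + slotName);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> staleScanner = List.of("EVICTION_COUNT");
        List<String> syncedScanner = List.of("EVICTION_COUNT", "EVICTION_RATE");
        System.out.println(matchSlot("EVICTION_RATE", syncedScanner)); // prints 1
        try {
            matchSlot("EVICTION_RATE", staleScanner);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints no match column for EVICTION_RATE
        }
    }
}
```

Under this model, a slot planned from the FE schema only resolves once the scanner's column list contains the same name, which is why the fix adds the descriptor in the same order as the FE schema.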
run buildall
/review
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test coverage: The PR adds eviction-rate observability for external meta cache entries and adjusts partition-cache capacities. Unit tests cover the new stats value and default capacities, but the new information_schema column lacks mixed-version compatibility coverage.
- Scope/focus: The implementation is otherwise small and focused.
- Concurrency/lifecycle: No new mutable shared state or lifecycle-sensitive objects were added; stats are read from existing Caffeine snapshots and immutable snapshots.
- Configuration: Existing capacity configs are reused; no new config item was added.
- FE/BE protocol compatibility: Blocking issue found. The new CATALOG_META_CACHE_STATISTICS column is added to both FE and BE, but FE emits rows unconditionally while BE strictly checks the returned column count. Mixed old/new FE/BE during rolling upgrade can fail this schema-table query.
- Parallel paths: Hive and MaxCompute partition-value capacity paths were updated consistently.
- Tests: Relevant unit tests were added, but no compatibility test covers old/new FE/BE column-count behavior.
- Observability: The added EVICTION_RATE metric is useful once compatibility is handled.
- User focus: No additional user-provided review focus was specified.
    trow.addToColumnValue(new TCell().setDoubleVal(
            entryStats.getAverageLoadPenaltyNanos() / TimeUnit.MILLISECONDS.toNanos(1)));
    trow.addToColumnValue(new TCell().setLongVal(entryStats.getEvictionCount())); // EVICTION_COUNT
    trow.addToColumnValue(new TCell().setDoubleVal(entryStats.getEvictionRate())); // EVICTION_RATE
Adding this value unconditionally makes CATALOG_META_CACHE_STATISTICS fail across mixed FE/BE versions. The BE scanner sends its _s_tbls_columns in columns_name and then rejects any FE response whose TRow.column_value.size() differs from its local column count. During rolling upgrade, an old BE querying a new FE will expect the old column count but receive this extra cell; a new BE querying an old FE will expect this cell but receive one fewer. Please make this schema-table path honor the requested column list (or otherwise tolerate missing/extra columns) before adding the new column, and cover the old/new count mismatch with a test.
Thanks for the review. For this information_schema system table, we do not require FE/BE mixed-version schema compatibility. A version mismatch on this path will not cause a coredump: the BE scanner returns an InternalError when the returned column count does not match, and FE-side fetchSchemaTableData catches exceptions and returns an error result. The actual new-FE/new-BE failure was fixed by syncing the BE scanner column descriptor with the new EVICTION_RATE column.
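For readers following this thread, the two behaviours under discussion can be sketched side by side: a strict cell-count check like the one the reviewer attributes to the BE scanner, and the tolerant alternative of emitting only the requested columns. This is a hypothetical Java model, not Doris code; every name below is invented for illustration, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the mixed-version row exchange; not actual Doris code. */
final class SchemaRowSketch {
    private SchemaRowSketch() {
    }

    /** Strict check: reject a row whose cell count differs from the local column count. */
    static void checkColumnCount(List<String> localColumns, List<Object> row) {
        if (row.size() != localColumns.size()) {
            throw new IllegalStateException("column count mismatch: expected "
                    + localColumns.size() + ", got " + row.size());
        }
    }

    /** Tolerant alternative: emit only the requested columns, in the requested order. */
    static List<Object> projectRow(List<String> requestedColumns, Map<String, Object> allValues) {
        List<Object> row = new ArrayList<>(requestedColumns.size());
        for (String name : requestedColumns) {
            row.add(allValues.get(name)); // null when this side does not know the column
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> oldBeColumns = List.of("EVICTION_COUNT");
        Map<String, Object> newFeValues = new LinkedHashMap<>();
        newFeValues.put("EVICTION_COUNT", 3L);
        newFeValues.put("EVICTION_RATE", 0.1);

        // Projecting by the requested names yields one cell, so the strict check passes.
        checkColumnCount(oldBeColumns, projectRow(oldBeColumns, newFeValues));

        // Emitting every value unconditionally yields two cells and is rejected.
        try {
            checkColumnCount(oldBeColumns, new ArrayList<>(newFeValues.values()));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The author's position in the reply above is that the strict failure is acceptable for this system table because it surfaces as an error result rather than a crash; the projection variant only shows what honoring the requested column list would look like.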
TPC-H: Total hot run time: 29000 ms
TPC-DS: Total hot run time: 170347 ms
PR approved by at least one committer and no changes requested.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
External catalog meta cache statistics exposed cumulative eviction count, but did not provide a direct replacement frequency metric for judging whether cache capacity is too small. This PR adds EVICTION_RATE to information_schema.catalog_meta_cache_statistics, calculated as eviction_count / request_count and returned as 0 when there are no requests.

Hive partition metadata cache defaults were also too small for common external catalog workloads, causing frequent evictions without explicit tuning. This PR increases the default Hive single-partition cache capacity from 10,000 to 100,000 and the Hive partitioned-table values cache capacity from 1,000 to 10,000. While checking similar cache entries, MaxCompute partition_values was found to cache table-level partition value structures but use the Hive single-partition capacity; it now follows the table-level partition values capacity.
### Release note
Add EVICTION_RATE to information_schema.catalog_meta_cache_statistics, increase default Hive partition meta cache capacities, and make MaxCompute partition_values use the table-level partition values capacity.
### Check List (For Author)
- Test:
- ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest
- ./run-fe-ut.sh --run org.apache.doris.datasource.hive.HiveMetaStoreCacheTest,org.apache.doris.datasource.maxcompute.MaxComputeExternalMetaCacheTest
- Behavior changed: catalog_meta_cache_statistics includes EVICTION_RATE; default Hive partition meta cache capacities are larger; MaxCompute partition_values uses the table-level partition values capacity.
- Does this need documentation?
### Check List (For Reviewer who merge this PR)