pax_global_header comment=12accdabe18b8a856edb1b8b1e47aa1109770fb1
pgnodemx-2.0.1/.gitignore
*.o
*.so
pgnodemx-2.0.1/LICENSE.md
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License.

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License.

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.

You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

(a) You must give any other recipients of the Work or Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions.

Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks.

This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty.

Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.

In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability.

While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

Copyright 2020-2022 Crunchy Data Solutions, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
pgnodemx-2.0.1/Makefile
ifdef USE_PGXS
PG_CONFIG = pg_config
datadir := $(shell $(PG_CONFIG) --sharedir)
else
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
endif

REGRESS = pgnodemx_regress
MODULE_big = pgnodemx
OBJS = pgnodemx.o cgroup.o envutils.o fileutils.o genutils.o kdapi.o parseutils.o procfunc.o
PG_CPPFLAGS = -I$(libpq_srcdir)

PATH_TO_FILE = $(datadir)/extension/pg_proctab.control

ifeq ($(shell test -e $(PATH_TO_FILE) && echo -n yes),yes)
EXTENSION = pgnodemx pg_proctab--0.0.10-compat
else
EXTENSION = pgnodemx pg_proctab--0.0.10-compat pg_proctab
endif

DATA = pgnodemx--1.0--1.1.sql pgnodemx--1.1--1.2.sql pgnodemx--1.2--1.3.sql pgnodemx--1.3--1.4.sql pgnodemx--1.4--2.0.sql pgnodemx--1.7--2.0.sql pgnodemx--2.0.sql pg_proctab--0.0.10-compat.sql

GHASH := $(shell git rev-parse --short HEAD)

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/pgnodemx
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif

ifeq ($(strip $(VSTR)),)
ifneq ($(strip $(GHASH)),)
override CPPFLAGS += -DGIT_HASH=\"$(GHASH)\"
else
override CPPFLAGS += -DGIT_HASH=\"none\"
endif
else
override CPPFLAGS += -DGIT_HASH=\"$(VSTR)\"
endif
pgnodemx-2.0.1/README.md
# pgnodemx

## Overview

SQL functions that allow management and capture of node OS metrics from PostgreSQL

## Security

Executing role must have been granted pg_monitor membership.

## cgroup Related Functions

For detailed information about the various virtual files available on the cgroup file system, see:

* cgroup v1: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/index.html
* cgroup v2: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

### cgroup v2 Notes

cgroup v2 is the default on RHEL/Rocky 9+, Debian 11 (Bullseye)+, and Ubuntu 22.04+. On older releases (RHEL/Rocky 8, Debian 10 (Buster), Ubuntu 20.04) it must be enabled by adding `systemd.unified_cgroup_hierarchy=1` to the kernel boot parameters.

If no controllers are delegated to the PostgreSQL cgroup (which can occur when running in an unprivileged systemd user session), pgnodemx will log a warning and disable cgroup support rather than fail to load. See the [systemd cgroup delegation documentation](https://systemd.io/CGROUP_DELEGATION/) for details.

Note that tests require a fully functional cgroup v2 environment with controllers delegated.

### General Access Functions

cgroup virtual files fall into (at least) the following general categories, each with a generic SQL access function:

* BIGINT single line scalar values - ```SELECT cgroup_scalar_bigint(filename);```
  * cgroup v1 examples: blkio.leaf_weight, blkio.weight, cpuacct.usage, cpuacct.usage_percpu, cpuacct.usage_percpu_sys, cpuacct.usage_percpu_user, cpuacct.usage_sys, cpuacct.usage_user, cpu.cfs_period_us, cpu.cfs_quota_us, cpu.rt_period_us, cpu.rt_runtime_us, cpu.shares, memory.failcnt, memory.kmem.failcnt, memory.kmem.limit_in_bytes, memory.kmem.max_usage_in_bytes, memory.kmem.tcp.failcnt, memory.kmem.tcp.limit_in_bytes, memory.kmem.tcp.max_usage_in_bytes, memory.kmem.tcp.usage_in_bytes, memory.kmem.usage_in_bytes, memory.limit_in_bytes, memory.max_usage_in_bytes, memory.memsw.failcnt, memory.memsw.limit_in_bytes, memory.memsw.max_usage_in_bytes, memory.memsw.usage_in_bytes, memory.move_charge_at_immigrate, memory.soft_limit_in_bytes, memory.usage_in_bytes, net_cls.classid, net_prio.prioidx
  * cgroup v2 examples: cgroup.freeze, cgroup.max.depth, cgroup.max.descendants, cpu.weight, cpu.weight.nice, memory.current, memory.high, memory.low, memory.max, memory.min, memory.oom.group, memory.swap.current, memory.swap.max, pids.current, pids.max
* FLOAT8 single line scalar values - ```SELECT cgroup_scalar_float8(filename);```
  * cgroup v1 examples: (none known)
  * cgroup v2 examples: cpu.uclamp.max, cpu.uclamp.min
* TEXT single line scalar values - ```SELECT cgroup_scalar_text(filename);```
  * cgroup v1 examples: (none known)
  * cgroup v2 examples: cgroup.type
* SETOF(BIGINT) multiline scalar values - ```SELECT * FROM cgroup_setof_bigint(filename);```
  * cgroup v1 examples: cgroup.procs
  * cgroup v2 examples: cgroup.procs, cgroup.threads
* SETOF(TEXT) multiline scalar values - ```SELECT * FROM cgroup_setof_text(filename);```
  * cgroup v1 examples: (none known)
  * cgroup v2 examples: (none known)
* ARRAY[BIGINT] space separated values - ```SELECT cgroup_array_bigint(filename);```
  * cgroup v1 examples: (none known)
  * cgroup v2 examples: cpu.max
* ARRAY[TEXT] space separated values - ```SELECT cgroup_array_text(filename);```
  * cgroup v1 examples: cpuacct.usage_all (sort of)
  * cgroup v2 examples: cgroup.controllers, cgroup.subtree_control
* SETOF(TEXT, BIGINT) flat keyed - ```SELECT * FROM cgroup_setof_kv(filename);```
  * cgroup v1 examples: cpuacct.stat, cpu.stat, memory.oom_control, memory.stat, net_prio.ifpriomap, blkio.io_merged, blkio.io_merged_recursive, blkio.io_queued, blkio.io_queued_recursive, blkio.io_service_bytes, blkio.io_service_bytes_recursive, blkio.io_serviced, blkio.io_serviced_recursive, blkio.io_service_time, blkio.io_service_time_recursive, blkio.io_wait_time, blkio.io_wait_time_recursive
  * cgroup v2 examples: cgroup.events, cgroup.stat, cpu.stat, io.pressure, io.weight, memory.events, memory.events.local, memory.stat, memory.swap.events, pids.events
* SETOF(TEXT, TEXT, BIGINT) key/subkey/value space separated - ```SELECT * FROM cgroup_setof_ksv(filename);```
  * cgroup v1 examples: blkio.throttle.io_service_bytes, blkio.throttle.io_serviced
  * cgroup v2 examples: (none known)
* SETOF(TEXT, TEXT, FLOAT8) nested keyed - ```SELECT * FROM cgroup_setof_nkv(filename);```
  * cgroup v1 examples: (none known)
  * cgroup v2 examples: memory.pressure, cpu.pressure, io.max, io.stat

In each case, the filename must contain a dot; the part before the first dot is taken as the controller name, e.g. ```memory.stat```.

### Get status of cgroup support

```
SELECT current_setting('pgnodemx.cgroup_enabled');
```

* Returns boolean result ("on"/"off").
* This value may be explicitly set in postgresql.conf.
* However, the extension will disable it at runtime if the location pointed to by pgnodemx.cgrouproot does not exist or is not a valid cgroup v1 or v2 mount.

### Get current cgroup mode

```
SELECT cgroup_mode();
```

* Returns the current cgroup mode. Possible values are "legacy", "unified", "hybrid", and "disabled". These correspond to cgroup v1, cgroup v2, mixed, and disabled, respectively.
* Currently "hybrid" mode is not supported; it might be in the future.

### Determine if Running Containerized

```
SELECT current_setting('pgnodemx.containerized');
```

* Returns boolean result ("on"/"off"). The extension attempts to heuristically determine whether PostgreSQL is running under a container, but this value may be explicitly set in postgresql.conf to override the heuristically determined value. The value of this setting influences the cgroup paths which are used to read the cgroup controller files.

### Get cgroup Paths

```
SELECT controller, path FROM cgroup_path();
```

* Returns the path to each supported cgroup controller.

### Get cgroup process count

```
SELECT cgroup_process_count();
```

* Returns the number of processes assigned to the cgroup.
* For cgroup v1, based on the "memory" controller cgroup.procs file. For cgroup v2, based on the unified cgroup.procs file.

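The process count can be cross-checked from the shell. The following is a minimal sketch and not part of pgnodemx: `count_cgroup_procs` is a hypothetical helper, and the path in the usage comment assumes a cgroup v2 host with the default `/sys/fs/cgroup` mount. It counts distinct PIDs because cgroup.procs is not guaranteed to be sorted or free of duplicates; that the SQL function reports exactly the same number is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of pgnodemx: count the distinct PIDs listed
# in a cgroup.procs-style file (one PID per line, possibly unsorted or
# repeated).
count_cgroup_procs() {
    local procs_file="$1"
    # Sort numerically, drop duplicate PIDs, then count the remaining lines.
    sort -n -u "$procs_file" | awk 'END { print NR }'
}

# Example call (the path is an assumption; adjust it to your cgroup mount):
# count_cgroup_procs /sys/fs/cgroup/cgroup.procs
```

Comparing this value with `SELECT cgroup_process_count();` is a quick way to confirm that the extension is reading the cgroup you expect.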
## Environment Variable Related Functions

### Get Environment Variable as TEXT

```
SELECT envvar_text('PGDATA');
```

* Returns the value of the requested environment variable as TEXT

### Get Environment Variable as BIGINT

```
SELECT envvar_bigint('PGPORT');
```

* Returns the value of the requested environment variable as BIGINT

## ```/proc``` Related Functions

For more detailed information about the /proc file system virtual files, please see: https://www.kernel.org/doc/html/latest/filesystems/proc.html

### Get "/proc/diskstats" as a virtual table

```
SELECT * FROM proc_diskstats();
```

### Get "/proc/self/mountinfo" as a virtual table

```
SELECT * FROM proc_mountinfo();
```

### Get "/proc/meminfo" as a virtual table

```
SELECT * FROM proc_meminfo();
```

### Get "/proc/self/net/dev" as a virtual table

```
SELECT * FROM proc_network_stats();
```

### Get "/proc/\<pid\>/io" for all PostgreSQL processes as a virtual table

```
SELECT * FROM proc_pid_io();
```

### Get the full command line, uid, and username for all PostgreSQL processes as a virtual table

```
SELECT * FROM proc_pid_cmdline();
```

### Get "/proc/\<pid\>/stat" for all PostgreSQL processes as a virtual table

```
SELECT * FROM proc_pid_stat();
```

### Get first line of "/proc/stat" as a virtual table

```
SELECT * FROM proc_cputime();
```

### Get first line of "/proc/loadavg" as a virtual table

```
SELECT * FROM proc_loadavg();
```

## pg_proctab Compatibility Functions for use with pg_top

Five functions are provided in a separate extension that matches the SQL interface presented by the pg_proctab extension.

```
CREATE EXTENSION pg_proctab VERSION "0.0.10-compat";
SELECT * FROM pg_cputime();
SELECT * FROM pg_loadavg();
SELECT * FROM pg_memusage();
SELECT * FROM pg_diskusage();
SELECT * FROM pg_proctab();
```

These functions are not installed by default. To install them, create the pg_proctab extension at VERSION "0.0.10-compat" after installing the pgnodemx extension.

## System Information Related Functions

### Get file system information as a virtual table

```
SELECT * FROM fsinfo(path text);
```

* Returns major_number, minor_number, type, block_size, blocks, total_bytes, free_blocks, free_bytes, available_blocks, available_bytes, total_file_nodes, free_file_nodes, and mount_flags for the file system on which ```path``` is mounted. Note: Some filesystems can return unexpected values (like MAX_UINT64) if these numbers cannot be determined.

### Get current FIPS mode

```
SELECT fips_mode();
```

* Returns TRUE if openssl is currently running in FIPS mode, otherwise FALSE.

### Get openssl version string

```
SELECT openssl_version();
```

### Get source C library path for a function symbol

```
SELECT symbol_filename(sym_name text);
```

* Returns the C library from which the C function sym_name comes. Returns NULL on any error.

### Convert number of kernel memory pages to bytes

```
SELECT kpages_to_bytes(num_k_pages numeric);
```

## Kubernetes DownwardAPI Related Functions

For more detailed information about the Kubernetes DownwardAPI please see: https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/

### Get status of kdapi_enabled

```
SELECT current_setting('pgnodemx.kdapi_enabled');
```

* Returns boolean result ("on"/"off").
* This value may be explicitly set in postgresql.conf.
* However, the extension will disable it at runtime if the location pointed to by pgnodemx.kdapi_path does not exist.

### Access "key equals quoted value" files

```
SELECT * FROM kdapi_setof_kv('filename');
```

### Get scalar BIGINT from file

```
SELECT kdapi_scalar_bigint('filename');
```

## General Information Functions

### Get pgnodemx version information

```
SELECT pgnodemx_version();
```

* If the VSTR environment variable is set at compile time, returns that value
* Otherwise returns the value of the short git hash
* If not compiling from the git repository and VSTR is unset, returns "none"

### Get currently running PostgreSQL executable path

```
SELECT exec_path();
```

### Get uid, username, gid, groupname, and filemode for a file

```
SELECT * FROM stat_file(filename);
```

## Configuration

* Add pgnodemx to shared_preload_libraries in postgresql.conf.

```
shared_preload_libraries = 'pgnodemx'
```

* The following custom parameters may be set. The values shown are defaults. If the default values work for you, there is no need to add these to ```postgresql.conf```.

```
# enable or disable the cgroup facility
pgnodemx.cgroup_enabled = on
# force use of "containerized" assumptions for cgroup file paths
pgnodemx.containerized = off
# specify location of cgroup mount
pgnodemx.cgrouproot = '/sys/fs/cgroup'
# enable or disable the Kubernetes DownwardAPI facility
pgnodemx.kdapi_enabled = on
# specify location of Kubernetes DownwardAPI files
pgnodemx.kdapi_path = '/etc/podinfo'
```

Notes:

* If pgnodemx.cgroup_enabled is defined in ```postgresql.conf```, and set to ```off``` (or ```false```), then all cgroup* functions will return NULL, or zero rows, except cgroup_mode() which will return "disabled".
* If ```pgnodemx.containerized``` is defined in ```postgresql.conf```, that value will override pgnodemx heuristics. When not specified, pgnodemx heuristics will determine if the value should be ```on``` or ```off``` at runtime.
* If the location specified by ```pgnodemx.cgrouproot```, default or as set in ```postgresql.conf```, is not accessible (does not exist, or otherwise causes an error when accessed), then pgnodemx.cgroup_enabled is forced to ```off``` at runtime and all cgroup* functions will return NULL, or zero rows, except cgroup_mode() which will return "disabled".
* If the location specified by ```pgnodemx.kdapi_path```, default or as set in ```postgresql.conf```, is not accessible (does not exist, or otherwise causes an error when accessed), then pgnodemx.kdapi_enabled is forced to ```off``` at runtime and all kdapi* functions will return NULL, or zero rows.

## Installation

### Compatibility

* PostgreSQL version 10 or newer is required.

### Compile and Install

Clone the PostgreSQL repository:

```bash
$> git clone https://github.com/postgres/postgres.git
```

Check out a release branch (REL_12_STABLE, for example) from inside the clone:

```bash
$> cd postgres
$> git checkout REL_12_STABLE
```

Make PostgreSQL:

```bash
$> ./configure
$> make install -s
```

Change to the contrib directory:

```bash
$> cd contrib
```

Clone the ```pgnodemx``` extension:

```bash
$> git clone https://github.com/crunchydata/pgnodemx
```

Change to the ```pgnodemx``` directory:

```bash
$> cd pgnodemx
```

Build ```pgnodemx```:

```bash
$> make
```

Install ```pgnodemx```:

```bash
$> make install
```

#### Using PGXS

If an instance of PostgreSQL is already installed, then PGXS can be utilized to build and install ```pgnodemx```. Ensure that the PostgreSQL binaries are available via the ```$PATH``` environment variable, then use the following commands.

```bash
$> make USE_PGXS=1
$> make USE_PGXS=1 install
```

### Configure

The following bash commands should configure your system to utilize pgnodemx. Replace all paths as appropriate. It may be prudent to visually inspect the files afterward to ensure the changes took place.

###### Initialize PostgreSQL (if needed):

```bash
$> initdb -D /path/to/data/directory
```

###### Create Target Database (if needed):

```bash
$> createdb
```

###### Install ```pgnodemx``` functions:

Edit postgresql.conf and add ```pgnodemx``` to the shared_preload_libraries line, and change custom settings as mentioned above.

Finally, restart PostgreSQL (method may vary):

```
$> service postgresql restart
```

Install the extension into your database:

```bash
psql
CREATE EXTENSION pgnodemx;
```

## TODO

* Map more ```/proc``` files to virtual tables
* Add support for "hybrid" cgroup mode

## cgroup v2 Subtree Management (Beta)

**Beta feature**: the API and configuration may change in future releases.

Moves each backend into a dedicated cgroup v2 subtree at authentication time, enabling per-database or per-role resource isolation. Requires cgroup v2 and `pgnodemx.subtree_enabled = on`.

```
SELECT set_subtree(subtree_name text);
```

* Moves the calling backend into the named subtree under the PostgreSQL cgroup.
* Execute privilege is revoked from PUBLIC; grant explicitly as needed.
* Per-database and per-role subtrees are assigned automatically via `pgnodemx.databases_with_subtrees` / `pgnodemx.roles_with_subtrees` and companion GUCs `database__session.subtree` / `role__session.subtree`. Role settings override database settings.
* The `default/` subtree is created at startup; named subtrees are created on first use.
* cgroup v2 only. If initialization fails at startup, pgnodemx warns and disables subtree support rather than preventing the server from starting.

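The precedence rule in the list above (role settings override database settings) can be pictured with a small shell function. This is purely illustrative: `resolve_subtree` and its arguments are hypothetical, pgnodemx performs the real lookup internally, and falling back to the default subtree when neither setting is present is an assumption based on the existence of `pgnodemx.default_subtree`.

```bash
#!/usr/bin/env bash
# Hypothetical illustration only; pgnodemx resolves the subtree internally.
# Arguments: role-level subtree, database-level subtree, default subtree name.
resolve_subtree() {
    local role_subtree="$1"
    local db_subtree="$2"
    local default_subtree="${3:-default}"

    if [ -n "$role_subtree" ]; then
        # A role-level setting wins over everything else.
        printf '%s\n' "$role_subtree"
    elif [ -n "$db_subtree" ]; then
        # Otherwise use the database-level setting, if any.
        printf '%s\n' "$db_subtree"
    else
        # Assumed fallback: the default subtree.
        printf '%s\n' "$default_subtree"
    fi
}
```

For example, `resolve_subtree reporting analytics` picks the role-level name, while `resolve_subtree "" analytics` falls through to the database-level one.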
### Configuration

```
pgnodemx.subtree_enabled = off                               # requires restart
pgnodemx.default_subtree = 'default'                         # requires restart
pgnodemx.delegated_controllers = '+cpu +cpuset +memory +io'  # requires restart
pgnodemx.default_subtree_views_parent_path = on              # per-session
pgnodemx.databases_with_subtrees = ''                        # reloadable
pgnodemx.roles_with_subtrees = ''                            # reloadable
```

### Running subtree regression tests

`make installcheck REGRESS=pgnodemx_subtree_regress` requires PostgreSQL to be running inside a cgroup v2 cgroup owned by the `postgres` OS user, with controllers enabled in `cgroup.subtree_control` and test subtree directories pre-created. On systemd hosts use `Delegate=yes` in the service unit. On GitHub Actions the script `ci/setup-subtree-cgroups-github.sh` handles setup automatically; it is not intended for general use.
pgnodemx-2.0.1/cgroup.c
/*
 * cgroup.c
 *
 * Functions specific to capture and manipulation of cgroup virtual files
 *
 * Joe Conway
 *
 * This code is released under the PostgreSQL license.
 *
 * Portions Copyright 2020-2022 Crunchy Data Solutions, Inc.
 * Portions Copyright 2025, PostgreSQL Global Development Group
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written
 * agreement is hereby granted, provided that the above copyright notice
 * and this paragraph and the following two paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL CRUNCHY DATA SOLUTIONS, INC. BE LIABLE TO ANY PARTY
 * FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES,
 * INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE CRUNCHY DATA SOLUTIONS, INC. HAS BEEN ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * THE CRUNCHY DATA SOLUTIONS, INC.
 * SPECIFICALLY DISCLAIMS ANY WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
 * ON AN "AS IS" BASIS, AND THE CRUNCHY DATA SOLUTIONS, INC. HAS NO
 * OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
 * MODIFICATIONS.
 */

#include "postgres.h"

#include
#include
#ifndef CGROUP2_SUPER_MAGIC
#define CGROUP2_SUPER_MAGIC 0x63677270
#endif
#include
#include
#include

#if PG_VERSION_NUM >= 110000
#include "catalog/pg_type_d.h"
#else
#include "catalog/pg_type.h"
#endif
#include "fmgr.h"
#if PG_VERSION_NUM >= 130000
#include "lib/qunique.h"
#else
/* did not exist prior to pg13; use local copy */
#include "qunique.h"
#endif
#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "utils/builtins.h"
#include "utils/guc_tables.h"
#if PG_VERSION_NUM < 150000
#include "utils/int8.h"
#endif
#include "utils/memutils.h"
#include "utils/varlena.h"

#include "fileutils.h"
#include "genutils.h"
#include "parseutils.h"
#include "cgroup.h"

#define DEFCONTROLLER "memory"

/* static functions */
static void create_default_cgpath(char *str, int curlen);
static void init_or_reset_cgpath(void);
static StringInfo candidate_controller_path(char *controller, char *r);
static StringInfo check_and_fix_controller_path(char *controller, char *r);
static int set_subtree(char *subtree);
static char *get_subtree_cf(char *subtree, char *cf);
static int set_subtree_controls(char *subtree);
static int write_cgroup_file(char *fullpath, char *value);

/* exported vars */
bool containerized = false;
char *cgrouproot = NULL;
bool cgroup_enabled = true;
bool subtree_enabled = false;
char *default_subtree = NULL;
bool default_views_parent = true;
char *delegated_controllers = NULL;
char *cgmode = NULL;
kvpairs *cgpath = NULL;
char *databases_subtree_string = NULL;
char *roles_subtree_string = NULL;
char *delegated_options[] = {
    "memory.high",
    "memory.low",
    "cpu.max",
    "io.max"
};
char
*delegated_options_values[NUM_DELEGATED_OPTIONS];

/* static vars */
static char *pg_cgroot = NULL;
static bool subtree_inited = false;
static bool current_subtree_is_new = false;

/*
 * Take input filename from caller, make sure it is acceptable
 * (not absolute, no relative parent references, caller belongs
 * to correct role), and concatenate it with the path to the
 * related controller in the cgroup filesystem. The returned
 * value is a "fully qualified" path to the file of interest
 * for the purposes of cgroup virtual files.
 */
char *
get_fq_cgroup_path(FunctionCallInfo fcinfo)
{
    StringInfo ftr = makeStringInfo();
    char *fname = convert_and_check_filename(PG_GETARG_TEXT_PP(0), false);
    char *p = strchr(fname, '.');
    Size len;
    char *controller;

    /* report the offending filename, not the /proc cgroup file */
    if (!p)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("pgnodemx: missing \".\" in filename %s", fname)));

    if (is_cgroup_v2 && subtree_enabled && default_views_parent)
    {
        size_t deflen = strlen(default_subtree);
        size_t rawlen;
        char *rawstr;

        /* in cgroup v2 there should only be one entry */
        rawstr = read_one_nlsv(PROC_CGROUP_FILE);
        rawlen = strlen(rawstr);

        /*
         * Determine if we are in the default subtree by comparing the tail
         * of the cgroup path with the default subtree name. Only compare
         * when the path is at least as long as the name, so the pointer
         * arithmetic cannot step before the start of rawstr.
         */
        if (rawlen >= deflen &&
            strcmp(rawstr + rawlen - deflen, default_subtree) == 0)
        {
            /*
             * If we are in the default subtree with default_views_parent true,
             * we must have a unified controller hierarchy. That in turn means
             * we can simply append fname to pg_cgroot and be done with it.
             */
            appendStringInfo(ftr, "%s/%s", pg_cgroot, fname);
        }
        else
        {
            /* not using default subtree */
            len = (p - fname);
            controller = pnstrdup(fname, len);
            appendStringInfo(ftr, "%s/%s", get_cgpath_value(controller), fname);
        }
    }
    else
    {
        /* Fastpath the case where we are not using subtrees */
        len = (p - fname);
        controller = pnstrdup(fname, len);
        appendStringInfo(ftr, "%s/%s", get_cgpath_value(controller), fname);
    }

    return pstrdup(ftr->data);
}

/*
 * Find out all the pids in a cgroup.
* * In cgroup v2 (at least) cgroup.procs is not sorted or guaranteed unique. * Remedy that. *pids is set to point to a palloc'd array containing * distinct pids in sorted order. The length of the array is the * function result. Cribbed from aclmembers. */ int cgmembers(int64 **pids) { int64 *list; int i; StringInfo ftr = makeStringInfo(); int nlines; char **lines; appendStringInfo(ftr, "%s/%s", get_cgpath_value("cgroup"), "cgroup.procs"); lines = read_nlsv(ftr->data, &nlines); if (nlines == 0) { /* * This should never happen, by definition. If it does * die horribly... */ ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("pgnodemx: no cgroup procs found in file %s", ftr->data))); } /* Allocate the worst-case space requirement */ list = (int64 *) palloc(nlines * sizeof(int64)); /* * Walk the string array collecting PIDs. */ for (i = 0; i < nlines; i++) { bool success = false; int64 result; success = scanint8(lines[i], true, &result); if (!success) ereport(ERROR, (errcode_for_file_access(), errmsg("contents not an integer, file \"%s\"", ftr->data))); list[i] = result; } /* Sort the array */ qsort(list, nlines, sizeof(int64), int64_cmp); /* * We could repalloc the array down to minimum size, but it's hardly worth * it since it's only transient memory. */ *pids = list; /* Remove duplicates from the array, returns new size */ return qunique(list, nlines, sizeof(int64), int64_cmp); } /* * Determine whether running inside a container. * * Of particular interest to us is whether our cgroup vfs has been mounted * at /sys/fs/cgroup for us. Inside a container that is what we expect, * but outside of a container it will be where PROC_CGROUP_FILE tells * us to find it. */ void set_containerized(void) { /* * If containerized was explicitly set in postgresql.conf, allow that * value to preside. 
     */
    struct config_generic *record;

    record = FIND_OPTION("pgnodemx.containerized");
    if (record->source == PGC_S_FILE)
        return;

    /*
     * Check to see if path referenced in PROC_CGROUP_FILE exists.
     * If it does, we are presumably not in a container, else we are.
     * In either case, the important distinction is whether we will
     * find the controller files in that location. If the location
     * does not exist the files are found under cgrouproot directly.
     */
    if (is_cgroup_v1 || is_cgroup_v2)
    {
        StringInfo  str = makeStringInfo();

        /* cgroup v1 and v2 will have differences we need to account for */
        if (is_cgroup_v1)
        {
            int     nlines;
            char  **lines = read_nlsv(PROC_CGROUP_FILE, &nlines);

            if (nlines > 0)
            {
                int     i;

                for (i = 0; i < nlines; ++i)
                {
                    /* use the DEFCONTROLLER controller path to test with */
                    char   *line = lines[i];
                    char   *p = strchr(line, ':');

                    /* skip malformed lines that lack a ":" separator */
                    if (p == NULL)
                        continue;

                    /* advance past the colon */
                    p += 1;

                    if (strncmp(p, DEFCONTROLLER, strlen(DEFCONTROLLER)) == 0)
                    {
                        p = strchr(p, ':');
                        /* a line without a second ":" is malformed; stop looking */
                        if (p == NULL)
                            break;
                        /* advance past the colon and "/" */
                        p += 2;

                        appendStringInfo(str, "%s/%s/%s", cgrouproot, DEFCONTROLLER, p);
                        break;
                    }
                }
            }
            else
                ereport(ERROR,
                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                         errmsg("pgnodemx: no cgroup paths found in file %s", PROC_CGROUP_FILE)));

            if (access(str->data, F_OK) != -1)
                containerized = false;
            else
                containerized = true;
        }
        else if (is_cgroup_v2)
        {
            char   *rawstr;

            /* in cgroup v2 there should only be one entry */
            rawstr = read_one_nlsv(PROC_CGROUP_FILE);
            appendStringInfo(str, "%s/%s", cgrouproot, (rawstr + 4));
        }

        if (access(str->data, F_OK) != -1)
            containerized = false;
        else
            containerized = true;

        return;
    }
    else
    {
        /* hybrid mode; means not in a container */
        containerized = false;
        return;
    }
}

/*
 * Determine whether running with cgroup v1, v2, or systemd hybrid mode
 */
bool
set_cgmode(void)
{
    /*
     * From: https://systemd.io/CGROUP_DELEGATION/
     *
     * To detect which of three modes is currently used, use statfs()
     * on /sys/fs/cgroup/.
If it reports CGROUP2_SUPER_MAGIC in its * .f_type field, then you are in unified mode. If it reports * TMPFS_MAGIC then you are either in legacy or hybrid mode. To * distinguish these two cases, run statfs() again on * /sys/fs/cgroup/unified/. If that succeeds and reports * CGROUP2_SUPER_MAGIC you are in hybrid mode, otherwise not. */ struct statfs buf; int ret; /* * If requested, directly set cgmode to disabled before * doing anything else. */ if (!cgroup_enabled) { cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_DISABLED); return false; } ret = statfs(cgrouproot, &buf); if (ret == -1) { /* * If we have an error trying to stat cgrouproot, there is not * much else we can do besides disabling cgroup access. */ ereport(WARNING, (errcode_for_file_access(), errmsg("pgnodemx: statfs error on cgroup mount %s: %m", cgrouproot), errdetail("disabling cgroup virtual file system access"))); cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_DISABLED); return false; } if (buf.f_type == CGROUP2_SUPER_MAGIC) /* cgroup v2 */ { char *ftr = PROC_CGROUP_FILE; int nlines; /* * From what I have read, this should not ever happen. * However it was reported from the field, so apparently * it *can* happen. * * In any case, it seems to indicate hybrid mode is in effect. 
*/ read_nlsv(ftr, &nlines); if (nlines != 1) { cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_HYBRID); return false; } cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_V2); return true; } else if (buf.f_type == TMPFS_MAGIC) { StringInfo str = makeStringInfo(); appendStringInfo(str, "%s/%s", cgrouproot, "unified"); ret = statfs(str->data, &buf); if (ret == 0 && buf.f_type == CGROUP2_SUPER_MAGIC) /* hybrid mode */ { cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_HYBRID); return false; } else /* cgroup v1 */ { cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_V1); return true; } } else { /* * If cgrouproot is not actually a cgroup mount, there is not * much else we can do besides disabling cgroup access. */ ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("pgnodemx: unexpected mount type on cgroup root %s", cgrouproot), errdetail("disabling cgroup virtual file system access"))); cgmode = MemoryContextStrdup(TopMemoryContext, CGROUP_DISABLED); return false; } } /* * Expand cgpath by one element and populate with a default * path. str is the path to use for the default and curlen * is the pre-expanded number of kv pairs. 
 */
static void
create_default_cgpath(char *str, int curlen)
{
    /* add room */
    cgpath->nkvp = curlen + 1;
    cgpath->keys = (char **) repalloc(cgpath->keys, cgpath->nkvp * sizeof(char *));
    cgpath->values = (char **) repalloc(cgpath->values, cgpath->nkvp * sizeof(char *));

    /* create the default record */
    cgpath->keys[cgpath->nkvp - 1] = MemoryContextStrdup(TopMemoryContext, "cgroup");
    if (str != NULL)
        cgpath->values[cgpath->nkvp - 1] = MemoryContextStrdup(TopMemoryContext, str);
    else
        cgpath->values[cgpath->nkvp - 1] = MemoryContextStrdup(TopMemoryContext, "Default_Controller_Not_Found");
}

static void
init_or_reset_cgpath(void)
{
    if (cgpath == NULL)
    {
        /* initialize in TopMemoryContext */
        cgpath = (kvpairs *) MemoryContextAlloc(TopMemoryContext, sizeof(kvpairs));
        cgpath->nkvp = 0;
        cgpath->keys = (char **) MemoryContextAlloc(TopMemoryContext, 0);
        cgpath->values = (char **) MemoryContextAlloc(TopMemoryContext, 0);
    }
    else
    {
        int     i;

        /* deep clear any existing info */
        for (i = 0; i < cgpath->nkvp; ++i)
        {
            if (cgpath->keys[i])
                pfree(cgpath->keys[i]);
            if (cgpath->values[i])
                pfree(cgpath->values[i]);
        }

        if (cgpath->keys)
            cgpath->keys = (char **) repalloc(cgpath->keys, 0);
        if (cgpath->values)
            cgpath->values = (char **) repalloc(cgpath->values, 0);
        cgpath->nkvp = 0;
    }
}

/*
 * Take an array of int, and copy it.
 */
static int *
intarr_copy(int *oldarr, size_t oldsize)
{
    int    *newarr;
    int     sizebytes = oldsize * sizeof(int);

    Assert(oldarr != NULL);
    Assert(sizebytes != 0);

    newarr = (int *) palloc(sizebytes);
    memcpy(newarr, oldarr, sizebytes);

    return newarr;
}

static void
swap(int *arr, int a, int b)
{
    int     t = arr[a];

    arr[a] = arr[b];
    arr[b] = t;
}

/*
 * Generate permutations of origarr indexes, and return them as an array
 * of integer arrays. Each array represents the indexes for a different
 * permutation. Uses Heap's algorithm.
* See https://en.wikipedia.org/wiki/Heap%27s_algorithm */ static void heap_permute(int *origarr, size_t origarrsize, size_t level, int **arrofpermarr, int *nrow) { int i; if (level == 1) { /* * We have recursed to the end of the original array of indexes, * so attach our permutation to the array of arrays and return it. */ arrofpermarr[*nrow] = intarr_copy(origarr, origarrsize); ++(*nrow); } else { /* * Generate permutations with levelth unaltered * Initially level == length(origarr) == origarrsize */ heap_permute(origarr, origarrsize, level - 1, arrofpermarr, nrow); for (i = 0; i < level - 1; ++i) { /* Swap choice dependent on parity of level (even or odd) */ if (level % 2 == 0) { /* * If level is even, swap ith and * (level-1)th i.e (last) element */ swap(origarr, i, level-1); } else { /* * If level is odd, swap 0th i.e (first) and (level-1)th * i.e (last) element */ swap(origarr, 0, level-1); } heap_permute(origarr, origarrsize, level - 1, arrofpermarr, nrow); } } } /* * Accept a string list (comma delimited list of items) * and return an array of strings representing all of the * different permutation of the original string list. */ #define MAX_PERM_ARRLEN 10 static char *** get_list_permutations(char *controller, int ncol, int *nrow) { char *rawstring = pstrdup(controller); List *origlist = NIL; ListCell *l; int *origarr = NULL; char **origarr_str = NULL; size_t origarrsize = 0; int **arrofpermarr = NULL; int i; char ***values; int cntr; int fact = 1; StringInfo str = makeStringInfo(); /* * If the controller name includes one or more ",", we need * to check all orderings to see which is the actual path. 
* * Parse the list into individual tokens */ if (!SplitIdentifierString(rawstring, ',', &origlist)) { elog(WARNING, "failed to parse controller string: %s", controller); return NULL; } origarrsize = list_length(origlist); if (origarrsize > MAX_PERM_ARRLEN) { elog(WARNING, "too many elements in controller string: %s", controller); return NULL; } origarr_str = (char **) palloc(origarrsize * sizeof(char *)); i = 0; foreach(l, origlist) { origarr_str[i] = pstrdup((char *) lfirst(l)); ++i; } origarr = (int *) palloc(origarrsize * sizeof(int)); for (i = 0; i < origarrsize; ++i) origarr[i] = i; /* precalculate how many permutations we should get back */ for (cntr = 1; cntr <= origarrsize; cntr++) fact = fact * cntr; /* make space for the permutation arrays */ arrofpermarr = (int **) palloc(fact * sizeof(int *)); /* get list of permutation indexes */ heap_permute(origarr, origarrsize, origarrsize, arrofpermarr, nrow); if (*nrow != fact) elog(WARNING, "expected %d permutations, got %d", fact, *nrow); /* make space for the return tuples */ values = (char ***) palloc((*nrow) * sizeof(char **)); /* map the original list back to the permuted indexes */ for (i = 0; i < (*nrow); ++i) { int *pidx = arrofpermarr[i]; int j; resetStringInfo(str); for(j = 0; j < origarrsize; ++j) { char *tok = origarr_str[pidx[j]]; if (j == 0) appendStringInfo(str, "%s", tok); else appendStringInfo(str, ",%s", tok); } values[i] = (char **) palloc(ncol * sizeof(char *)); values[i][0] = pstrdup(str->data); pfree(arrofpermarr[i]); } pfree(arrofpermarr); return values; } /* * Create candidate path based on controller string taking into account * whether we are "containerized" or not. 
 */
static StringInfo
candidate_controller_path(char *controller, char *r)
{
    StringInfo  str = makeStringInfo();

    if (!containerized)
    {
        /*
         * not containerized: controller files are in path contained
         * in PROC_CGROUP_FILE concatenated to "<cgrouproot>/<controller>/"
         */
        appendStringInfo(str, "%s/%s/%s", cgrouproot, controller, r);
    }
    else
    {
        /*
         * containerized: controller files are in path contained
         * in "<cgrouproot>/<controller>/" directly
         */
        appendStringInfo(str, "%s/%s", cgrouproot, controller);
    }

    return str;
}

/*
 * Attempt to determine and return a valid path for a cgroup controller.
 *
 * If no directories are found, return "Controller_Not_Found" as the
 * path. If we were to raise an ERROR it would prevent Postgres from starting
 * since this extension is preloaded, which seems less friendly than causing
 * later queries to generate errors. For example:
 *
 *   could not open file "Controller_Not_Found/cpuacct.usage"
 *   for reading: No such file or directory
 *
 * At least that would clue us in that something went wrong without causing
 * an outage of postgres itself.
 */
static StringInfo
check_and_fix_controller_path(char *controller, char *r)
{
    StringInfo  str = candidate_controller_path(controller, r);

    if (strchr(controller, ',') == NULL)
    {
        /*
         * The controller name does not include "," and is therefore
         * a single controller.
         */

        /*
         * Should not happen (I think), but if the directory does
         * not exist, mark it as such for debugging purposes.
         * But avoid throwing an error, which would prevent Postgres
         * from starting up entirely.
         */
        if (access(str->data, F_OK) != 0)
        {
            resetStringInfo(str);
            appendStringInfoString(str, "Controller_Not_Found");
        }

        return str;
    }
    else
    {
        /*
         * The controller name includes "," and is therefore a list
         * of controllers. It turns out that the list ordering in
         * /proc/self/cgroup might not match the list ordering used
         * for the cgroupfs in some circumstances. But first check the
         * proposed path based on /proc/self/cgroup to see if it
         * actually exists. If so, return that.
         */
        if (access(str->data, F_OK) == 0)
            return str;
        else
        {
            /* if not, try the alternative orderings */
            char     ***values;
            int         nrow = 0;
            int         ncol = 1;
            int         i;

            values = get_list_permutations(controller, ncol, &nrow);
            for (i = 0; i < nrow; ++i)
            {
                char   *pcontroller = values[i][0];

                resetStringInfo(str);
                str = candidate_controller_path(pcontroller, r);
                if (access(str->data, F_OK) == 0)
                    return str;
            }

            /* none of the candidates were valid */
            resetStringInfo(str);
            appendStringInfoString(str, "Controller_Not_Found");
            return str;
        }
    }
}

/*
CREATE FUNCTION permute_list(TEXT)
RETURNS SETOF TEXT
AS '$libdir/pgnodemx', 'pgnodemx_permute_list'
LANGUAGE C STABLE STRICT;
*/

/* function return signatures */
Oid cg_text_sig[] = {TEXTOID};

/* debug function */
PG_FUNCTION_INFO_V1(pgnodemx_permute_list);
Datum
pgnodemx_permute_list(PG_FUNCTION_ARGS)
{
    char       *controller = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char     ***values;
    int         nrow = 0;
    int         ncol = 1;

    values = get_list_permutations(controller, ncol, &nrow);

    return form_srf(fcinfo, values, nrow, ncol, cg_text_sig);
}

void
set_cgpath(void)
{
    char   *ftr = PROC_CGROUP_FILE;

    init_or_reset_cgpath();

    /* obtain a list of cgroup controllers */
    if (is_cgroup_v1)
    {
        /*
         * In cgroup v1 the active controllers for the
         * cgroup are listed in PROC_CGROUP_FILE. We will
         * need to read these whether "containerized" or not,
         * in order to get a complete list of controllers
         * available.
         */
        int         nlines;
        char      **lines;
        StringInfo  str;
        int         i;
        char       *defpath = NULL;

        lines = read_nlsv(ftr, &nlines);
        if (nlines == 0)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("pgnodemx: no cgroup paths found in file %s", ftr)));

        cgpath->nkvp = nlines;
        cgpath->keys = (char **) repalloc(cgpath->keys, cgpath->nkvp * sizeof(char *));
        cgpath->values = (char **) repalloc(cgpath->values, cgpath->nkvp * sizeof(char *));

        for (i = 0; i < nlines; ++i)
        {
            /*
             * The lines in PROC_CGROUP_FILE look like:
             *     #:<controller>:/<path>
             * e.g.
2:memory:/foo/bar
             * Sometimes the controller part is further divided
             * into key-value, e.g. "name=systemd" in which case
             * "systemd" actually corresponds to the directory name.
             *
             * Sometimes the controller part is further divided
             * into a list of controllers, e.g. "cpu,cpuacct" in which case
             * the directory name might be based on either ordering of
             * "cpu" and "cpuacct". In this case more work is required to
             * discover the actual path in use.
             */
            char   *line = lines[i];
            char   *p = strchr(line, ':');
            char   *r;
            char   *q;
            Size    len;
            char   *controller;

            if (p)
                p += 1;
            else
                ereport(ERROR,
                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                         errmsg("pgnodemx: malformed cgroup path found in file %s", ftr)));

            r = strchr(p, ':');
            /* advance past the ":" and also the "/" */
            if (r)
                r += 2;
            else
                ereport(ERROR,
                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                         errmsg("pgnodemx: malformed cgroup path found in file %s", ftr)));

            /* the controller name sits between the two colons */
            len = ((r - p) - 2);
            controller = pnstrdup(p, len);

            /* for "key=value" controllers, the value is the directory name */
            q = strchr(controller, '=');
            if (q)
                controller = q + 1;

            /* get valid path to controller */
            str = check_and_fix_controller_path(controller, r);

            cgpath->keys[i] = MemoryContextStrdup(TopMemoryContext, controller);
            cgpath->values[i] = MemoryContextStrdup(TopMemoryContext, str->data);

            if (strcasecmp(controller, DEFCONTROLLER) == 0)
                defpath = cgpath->values[i];
        }

        create_default_cgpath(defpath, nlines);
    }
    else if (is_cgroup_v2)
    {
        /*
         * In v2 the active controllers for the
         * cgroup are listed in cgroup.controllers
         */
        StringInfo  fname = makeStringInfo();
        StringInfo  str = makeStringInfo();
        int         nvals;
        int         nlines;
        char      **controllers;
        char       *rawstr;
        char       *defpath = NULL;
        int         i;

        /* read PROC_CGROUP_FILE, which for v2 has one line */
        rawstr = read_one_nlsv(ftr);

        if (!containerized)
        {
            /*
             * not containerized: controller files are in path contained
             * in PROC_CGROUP_FILE
             *
             * cgroup v2 PROC_CGROUP_FILE has one line
             * that always starts "0::/", so skip that
             * in order to get the relative path to the
             * unified set of cgroup controllers
             */
appendStringInfo(str, "%s/%s", cgrouproot, (rawstr + 4)); defpath = str->data; } else { /* containerized: controller files in cgrouproot directly */ defpath = cgrouproot; } /* * In cgroup v2 all the controllers are in the * same cgroup dir, but we need to determine which * controllers are present in the current cgroup. * It is simpler to just repeat the same path for * each controller in order to maintain consistency * with the cgroup v1 case. */ appendStringInfo(fname, "%s/%s", defpath, "cgroup.controllers"); read_nlsv(fname->data, &nlines); if (nlines == 0) { ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("pgnodemx: no cgroup v2 controllers available in %s, disabling cgroup support", fname->data))); cgroup_enabled = false; return; } controllers = parse_space_sep_val_file(fname->data, &nvals); cgpath->nkvp = nvals; cgpath->keys = (char **) repalloc(cgpath->keys, cgpath->nkvp * sizeof(char *)); cgpath->values = (char **) repalloc(cgpath->values, cgpath->nkvp * sizeof(char *)); for (i = 0; i < cgpath->nkvp; ++i) { cgpath->keys[i] = MemoryContextStrdup(TopMemoryContext, controllers[i]); cgpath->values[i] = MemoryContextStrdup(TopMemoryContext, defpath); } create_default_cgpath(defpath, nvals); } else /* unsupported */ { ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("pgnodemx: unsupported cgroup configuration"))); } } /* * Look up the cgroup path by controller name * Since this should never be a long list, just * do brute force lookup. */ char * get_cgpath_value(char *key) { int i; for (i = 0; i < cgpath->nkvp; ++i) { char *p; char *controller = cgpath->keys[i]; char *path = cgpath->values[i]; /* * If controller name cgpath->keys[i] includes ",", * split into multiple subkeys and check each one. */ p = strchr(controller, ','); if (!p) { /* no subkeys, just do it */ if (strcmp(controller, key) == 0) return pstrdup(path); } else { /* * Multiple subkeys. Check each one, but first get a * copy we can mutate. 
*/ char *buf = pstrdup(controller); char *token; char *lstate; for (token = strtok_r(buf, ",", &lstate); token; token = strtok_r(NULL, ",", &lstate)) { if (strcmp(token, key) == 0) return pstrdup(path); } } } /* bad request if not found */ ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("failed to find controller %s", key))); /* unreached */ return "unknown"; } #define NUM_BUSY_RETRIES 10000 /* * Set the cgroup v2 subtree to the requested subtree. * * If necessary create the default subtree and move all postgres pids * there first. * * Also create the target subtree if it does not exist. * * Return false on error and let the caller decide what to do * rather than throwing an ERROR (or FATAL) here. */ bool set_cgroup_subtree(char *subtree) { int ret; char *subtree_control; int i = 0; /* * fastpath exit if cgroup mode is not v2 or * not subtree enabled */ if (!is_cgroup_v2 || !subtree_enabled) return false; /* * Create requested subtree if it does not exist, * and place ourselves into it. */ ret = set_subtree(subtree); if (ret != 0) return false; /* * We need to write delegated controllers to cgroup.subtree_control * in the parent dir. * Note: it would be better to do this only once globally, * but there was a timing issue when trying to do it during * postmaster init, and it should not hurt anything to write * the same list of delegated controllers once for each session. * This depends on delegated controllers GUC remaining PGC_POSTMASTER. */ subtree_control = psprintf("%s/%s", pg_cgroot, "cgroup.subtree_control"); /* need to retry for EBUSY */ do { i++; ret = write_cgroup_file(subtree_control, delegated_controllers); if (ret == 0) break; if (ret != EBUSY || i > NUM_BUSY_RETRIES) return false; elog(WARNING, "Control file %s busy, retrying", subtree_control); usleep((useconds_t) 1000); } while (ret == EBUSY); /* * if this subtree was just created, we need to set any * default parameters associated with the subtree * (e.g. 
cpu limit, memory high/low, etc.). The check for * just created happens in set_subtree_controls() */ ret = set_subtree_controls(subtree); if (ret != 0) return false; /* reset the cgpath now that our subtree is set */ set_cgpath(); return true; } /* * Create default subtree if it does not exist, * and move postmaster pid there. */ bool init_default_subtree(void) { char *ftr = PROC_CGROUP_FILE; char *rawstr; struct stat stat_struct; char *default_subtree_path; char *subtree_procs; char *value; int ret; /* only init once */ if (subtree_inited) return true; /* * Before doing anything, capture the original * root of the Postgres cgroup path */ /* read PROC_CGROUP_FILE, which for v2 has one line */ rawstr = read_one_nlsv(ftr); /* * cgroup v2 PROC_CGROUP_FILE has one line * that always starts "0::/", so skip that * in order to get the relative path to the * unified set of cgroup controllers */ pg_cgroot = MemoryContextStrdup(TopMemoryContext, psprintf("%s/%s", cgrouproot, (rawstr + 4))); default_subtree_path = psprintf("%s/%s", pg_cgroot, default_subtree); /* If the default subtree is not found, create it */ if (stat(default_subtree_path, &stat_struct) < 0) mkdir(default_subtree_path, 0700); /* Now move postmaster proc by appending to cgroup.procs in the subtree */ subtree_procs = psprintf("%s/%s", default_subtree_path, "cgroup.procs"); value = psprintf("%d", MyProcPid); ret = write_cgroup_file(subtree_procs, value); if (ret != 0) return false; subtree_inited = true; return true; } /* * Create default subtree if it does not exist, * and move all pids there. 
*/ static int set_subtree(char *subtree) { struct stat stat_struct; char *subtree_path; char *subtree_procs; char *value; /* setup relevant paths */ subtree_path = psprintf("%s/%s", pg_cgroot, subtree); subtree_procs = psprintf("%s/%s", subtree_path, "cgroup.procs"); /* if not found, create it */ if (stat(subtree_path, &stat_struct) < 0) { mkdir(subtree_path, 0700); current_subtree_is_new = true; } else current_subtree_is_new = false; value = psprintf("%d", MyProcPid); return write_cgroup_file(subtree_procs, value); } static char* get_subtree_cf(char *subtree, char *cf) { return psprintf("%s/%s/%s", pg_cgroot, subtree, cf); } /* * Set any requested parameters associated with the subtree * (e.g. cpu limit, memory high/low, etc.) */ static int set_subtree_controls(char *subtree) { int i; /* * if this subtree was just created, we need to set any * default parameters associated with the subtree * (e.g. cpu limit, memory high/low, etc.). If it is not new * then we should not mess with whatever had already been set. 
     */
    if (current_subtree_is_new)
    {
        for (i = 0; i < NUM_DELEGATED_OPTIONS; ++i)
        {
            char   *value = delegated_options_values[i];

            /* skip if not set */
            if (value)
            {
                char   *subtree_cf;
                int     ret;

                subtree_cf = get_subtree_cf(subtree, delegated_options[i]);
                ret = write_cgroup_file(subtree_cf, value);
                if (ret != 0)
                    return ret;
            }
        }
    }
    else
        ereport(DEBUG1,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot set controls for pre-existing subtree %s", subtree)));

    return 0;
}

/*
 * Write value to the cgroup file at fullpath. Returns 0 on success;
 * on failure emits a WARNING and returns the errno of the failed call.
 */
static int
write_cgroup_file(char *fullpath, char *value)
{
    FILE   *fp = NULL;
    int     ret = 0;

    fp = fopen(fullpath, "we");
    if (!fp)
    {
        ret = errno;
        goto out;
    }

    ret = fprintf(fp, "%s", value);
    if (ret < 0)
    {
        ret = errno;
        goto out;
    }

    ret = fflush(fp);
    if (ret)
    {
        ret = errno;
        goto out;
    }

    fclose(fp);

    return 0;

out:
    ereport(WARNING,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("failed to write \"%s\" to \"%s\" with errno %d", value, fullpath, ret)));

    if (fp)
        fclose(fp);

    return ret;
}

PG_FUNCTION_INFO_V1(pgnodemx_set_one_control);
Datum
pgnodemx_set_one_control(PG_FUNCTION_ARGS)
{
    char   *fqpath;
    char   *cname = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char   *cvalue = text_to_cstring(PG_GETARG_TEXT_PP(1));
    int     i;

    if (!cgroup_enabled)
        PG_RETURN_NULL();

    if (!current_subtree_is_new)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot set control for pre-existing subtree")));

    /* check it is in allowed list */
    for (i = 0; i < NUM_DELEGATED_OPTIONS; ++i)
    {
        if (strcmp(cname, delegated_options[i]) == 0)
        {
            int     ret;

            fqpath = get_fq_cgroup_path(fcinfo);
            ret = write_cgroup_file(fqpath, cvalue);
            if (ret != 0)
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                         errmsg("control set failed")));

            PG_RETURN_TEXT_P(cstring_to_text("OK"));
        }
    }

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("failed to find control %s in allow list", cname)));
}
pgnodemx-2.0.1/cgroup.h000066400000000000000000000051201515161540100150000ustar00rootroot00000000000000
/*
 * cgroup.h
 *
 * Functions specific to capture and
manipulation of cgroup virtual files * * Joe Conway * * This code is released under the PostgreSQL license. * * Portions Copyright 2020-2022 Crunchy Data Solutions, Inc. * Portions Copyright 2025, PostgreSQL Global Development Group * * Permission to use, copy, modify, and distribute this software and its * documentation for any purpose, without fee, and without a written * agreement is hereby granted, provided that the above copyright notice * and this paragraph and the following two paragraphs appear in all copies. * * IN NO EVENT SHALL CRUNCHY DATA SOLUTIONS, INC. BE LIABLE TO ANY PARTY * FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, * INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS * DOCUMENTATION, EVEN IF THE CRUNCHY DATA SOLUTIONS, INC. HAS BEEN ADVISED * OF THE POSSIBILITY OF SUCH DAMAGE. * * THE CRUNCHY DATA SOLUTIONS, INC. SPECIFICALLY DISCLAIMS ANY WARRANTIES, * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS * ON AN "AS IS" BASIS, AND THE CRUNCHY DATA SOLUTIONS, INC. HAS NO * OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR * MODIFICATIONS. 
*/ #ifndef CGROUP_H #define CGROUP_H #include "fmgr.h" #include "parseutils.h" #define PROC_CGROUP_FILE "/proc/self/cgroup" #define CGROUP_V1 "legacy" #define CGROUP_V2 "unified" #define CGROUP_HYBRID "hybrid" #define CGROUP_DISABLED "disabled" #define is_cgroup_v1 (strcmp(cgmode, CGROUP_V1) == 0) #define is_cgroup_v2 (strcmp(cgmode, CGROUP_V2) == 0) #define is_cgroup_hy (strcmp(cgmode, CGROUP_HYBRID) == 0) extern bool set_cgmode(void); extern void set_containerized(void); extern void set_cgpath(void); extern int cgmembers(int64 **pids); extern char *get_cgpath_value(char *key); extern char *get_fq_cgroup_path(FunctionCallInfo fcinfo); extern bool set_cgroup_subtree(char *subtree); extern bool init_default_subtree(void); /* exported globals */ extern char *cgmode; extern kvpairs *cgpath; extern char *cgrouproot; extern bool containerized; extern bool cgroup_enabled; extern bool subtree_enabled; extern char *default_subtree; extern bool default_views_parent; extern char *delegated_controllers; extern char *databases_subtree_string; extern char *roles_subtree_string; extern char *session_subtree; /* keep in sync with delegated_options[] */ #define NUM_DELEGATED_OPTIONS 4 extern char *delegated_options[]; extern char *delegated_options_values[]; #endif /* CGROUP_H */ pgnodemx-2.0.1/ci/000077500000000000000000000000001515161540100137255ustar00rootroot00000000000000pgnodemx-2.0.1/ci/setup-subtree-cgroups-github.sh000077500000000000000000000127501515161540100220400ustar00rootroot00000000000000#!/bin/bash # # ci/setup-subtree-cgroups-github.sh -- cgroup v2 subtree setup for GitHub Actions # # Usage: setup-subtree-cgroups-github.sh # # This script is SPECIFIC to the GitHub Actions environment (pgxn/pgxn-tools # container, --privileged Docker, cgroupns private). It is NOT a general # setup script. See "Running subtree regression tests" in README.md for the # general prerequisites that any environment must satisfy. 
#
# Why this script exists
# ----------------------
# GitHub Actions runs jobs in Docker containers. Docker's default cgroupns
# mode ("private") makes the container see "/" as its cgroup root, but on the
# host that root is actually a non-root cgroup deep in the host hierarchy.
# The cgroup v2 no-internal-process constraint (NIPC) therefore applies: the
# kernel refuses writes to cgroup.subtree_control on any cgroup that already
# has processes directly inside it, even for root.
#
# Three steps are required to work around this before starting PostgreSQL:
#
#   1. Empty the container-root cgroup by moving all current processes into
#      an init/ sub-cgroup, so the root cgroup itself has no direct members.
#
#   2. Enable the desired controllers on the now-empty root cgroup by writing
#      to /sys/fs/cgroup/cgroup.subtree_control.
#
#   3. Create a postgres/ cgroup, delegate it to the postgres OS user, and
#      start PostgreSQL inside it via a one-shot subshell. The subshell
#      writes its own PID into postgres/cgroup.procs before exec-ing
#      pg_ctlcluster, so only the PostgreSQL postmaster (and its children)
#      end up in the postgres/ cgroup. init_default_subtree() then moves
#      the postmaster into postgres/default/, leaving postgres/ empty so
#      that subsequent auth-hook calls can write postgres/cgroup.subtree_control
#      without hitting NIPC.
#
# On a real machine or in a systemd-managed container the right approach is
# entirely different: use systemd's cgroup delegation (Delegate=yes in the
# PostgreSQL service unit, or systemd-run --scope) so that systemd hands a
# proper sub-cgroup to the postmaster at startup.
#
# IMPORTANT: this script must be invoked with "exec bash