| 1 | +# Understanding the cluster_id Parameter |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The `cluster_id` parameter is a critical configuration setting when using the AWS Advanced Python Wrapper to **connect to multiple database clusters within a single application**. This parameter serves as a unique identifier that enables the wrapper to maintain separate caches and state for each distinct database cluster your application connects to. |
| 6 | + |
| 7 | +## What is a Cluster? |
| 8 | + |
| 9 | +Understanding what constitutes a cluster is crucial for correctly setting the `cluster_id` parameter. In the context of the AWS Advanced Python Wrapper, a **cluster** is a logical grouping of database instances that should share the same topology cache and monitoring services. |
| 10 | + |
| 11 | +A cluster consists of one writer instance (primary) and zero or more reader instances (replicas). Together these make up the shared topology that the wrapper tracks, and they are the instances the wrapper can reconnect to when a failover is detected. |
| 12 | + |
| 13 | +### Examples of Clusters |
| 14 | + |
| 15 | +- Aurora DB Cluster (one writer + multiple readers) |
| 16 | +- RDS Multi-AZ DB Cluster (one writer + two readers) |
| 17 | +- Aurora Global Database (when you supply a global database endpoint, the wrapper treats the whole global database as a single cluster) |
| 18 | + |
| 19 | +> **Rule of thumb:** If the wrapper should track separate topology information and perform independent failover operations, use different `cluster_id` values. If instances share the same topology and failover domain, use the same `cluster_id`. |
| 20 | + |
| 21 | +## Why cluster_id is Important |
| 22 | + |
| 23 | +The AWS Advanced Python Wrapper uses the `cluster_id` as a **key for internal caching mechanisms** to optimize performance and maintain cluster-specific state. Without proper `cluster_id` configuration, your application may experience: |
| 24 | + |
| 25 | +- Cache collisions between different clusters |
| 26 | +- Incorrect topology information |
| 27 | +- Degraded performance due to cache invalidation |
| 28 | + |
| 29 | +## Why Not Use AWS DB Cluster Identifiers? |
| 30 | + |
| 31 | +The wrapper cannot always infer the DB cluster identifier from a connection's host, because host information can take many forms: |
| 32 | + |
| 33 | +- **IP Address Connections:** `10.0.1.50` ← No cluster info! |
| 34 | +- **Custom Domain Names:** `db.mycompany.com` ← Custom domain |
| 35 | +- **Custom Endpoints:** `my-custom-endpoint.cluster-custom-abc.us-east-1.rds.amazonaws.com` ← Custom endpoint |
| 36 | +- **Proxy Connections:** `my-proxy.proxy-abc.us-east-1.rds.amazonaws.com` ← Proxy, not actual cluster |
| 37 | + |
| 38 | +In fact, all of these could reference the exact same cluster. Therefore, because the wrapper cannot reliably parse cluster information from all connection types, **it is up to the user to explicitly provide the `cluster_id`**. |
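To make the point above concrete, here is a minimal sketch of what a naive host parser could and could not recover. It is an illustration only, not the wrapper's actual logic: `guess_cluster_name` is a hypothetical helper, and the regular expression assumes the common Aurora endpoint shapes (`.cluster-`, `.cluster-ro-`, `.cluster-custom-`).

```python
import re
from typing import Optional

# Assumed endpoint shapes (illustrative, not exhaustive):
#   <name>.cluster-<id>.<region>.rds.amazonaws.com         writer endpoint
#   <name>.cluster-ro-<id>.<region>.rds.amazonaws.com      reader endpoint
#   <name>.cluster-custom-<id>.<region>.rds.amazonaws.com  custom endpoint
_CLUSTER_ENDPOINT = re.compile(
    r"^(?P<name>[^.]+)\.cluster-(?P<kind>ro-|custom-)?(?P<id>[a-z0-9]+)"
    r"\.(?P<region>[a-z0-9-]+)\.rds\.amazonaws\.com$"
)


def guess_cluster_name(host: str) -> Optional[str]:
    """Hypothetical helper: return a cluster name only when the host reveals one."""
    match = _CLUSTER_ENDPOINT.match(host)
    if match is None:
        # IP addresses, custom domains, and proxy endpoints carry no cluster name.
        return None
    if match.group("kind") == "custom-":
        # The first label of a custom endpoint is the endpoint name, not the cluster name.
        return None
    return match.group("name")


hosts = [
    "my-cluster.cluster-abc.us-east-1.rds.amazonaws.com",
    "10.0.1.50",
    "db.mycompany.com",
    "my-custom-endpoint.cluster-custom-abc.us-east-1.rds.amazonaws.com",
    "my-proxy.proxy-abc.us-east-1.rds.amazonaws.com",
]
for host in hosts:
    print(f"{host} -> {guess_cluster_name(host)}")
```

Only the first host yields a name; the other four could point at the same cluster yet reveal nothing about it, which is why an explicit `cluster_id` is the reliable option.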
| 39 | + |
| 40 | +## How cluster_id is Used Internally |
| 41 | + |
| 42 | +The wrapper uses `cluster_id` as a cache key for topology information and monitoring services. This lets multiple connections to the same cluster share cached data and avoid redundant database metadata queries. |
| 43 | + |
| 44 | +### Example: Single Cluster with Multiple Connections |
| 45 | + |
| 46 | +The following diagram shows how connections with the same `cluster_id` share cached resources: |
| 47 | + |
| 48 | + |
| 49 | + |
| 50 | +**Key Points:** |
| 51 | +- Three connections use different connection strings (custom endpoint, IP address, cluster endpoint) but all specify **`cluster_id="foo"`** |
| 52 | +- All three connections share the same Topology Cache and Monitor Threads in the wrapper |
| 53 | +- The Topology Cache stores a key-value mapping where `"foo"` maps to `["instance-1", "instance-2", "instance-3"]` |
| 54 | +- Despite different connection URLs, all connections monitor and query the same physical database cluster |
| 55 | + |
| 56 | +**The Impact:** |
| 57 | +Shared resources eliminate redundant topology queries and reduce monitoring overhead. |
| 58 | + |
| 59 | +### Example: Multiple Clusters with Separate Cache Isolation |
| 60 | + |
| 61 | +The following diagram shows how different `cluster_id` values maintain separate caches for different clusters. |
| 62 | + |
| 63 | + |
| 64 | + |
| 65 | +**Key Points:** |
| 66 | +- Connections 1 and 3 use **`cluster_id="foo"`** and share the same cache entries |
| 67 | +- Connection 2 uses **`cluster_id="bar"`** and has completely separate cache entries |
| 68 | +- Each `cluster_id` acts as a key in the cache dictionary |
| 69 | +- Topology Cache maintains separate entries: `"foo"` → `[instance-1, instance-2, instance-3]` and `"bar"` → `[instance-4, instance-5]` |
| 70 | +- Monitor Cache maintains separate monitor threads for each cluster |
| 71 | +- Monitors poll their respective database clusters and update the corresponding topology cache entries |
| 72 | + |
| 73 | +**The Impact:** |
| 74 | +This isolation prevents cache collisions and ensures correct failover behavior for each cluster. |
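The sharing and isolation described in the two examples above can be sketched with a plain dictionary keyed by `cluster_id`. This is a simplified model under stated assumptions, not the wrapper's implementation: `TopologyCache`, its `get` method, and the `fetch` callbacks are hypothetical names that stand in for the internal cache and for a topology query against the database.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TopologyCache:
    """Hypothetical stand-in for a topology cache keyed by cluster_id."""
    entries: dict[str, list[str]] = field(default_factory=dict)
    queries: int = 0  # number of simulated topology queries sent to a database

    def get(self, cluster_id: str, fetch: Callable[[], list[str]]) -> list[str]:
        # Query the database only when this cluster_id has no cached entry yet.
        if cluster_id not in self.entries:
            self.queries += 1
            self.entries[cluster_id] = fetch()
        return self.entries[cluster_id]


cache = TopologyCache()

# Connections 1 and 3 point at the same cluster and use cluster_id="foo".
topology_1 = cache.get("foo", lambda: ["instance-1", "instance-2", "instance-3"])
# Connection 2 points at a different cluster and uses cluster_id="bar".
topology_2 = cache.get("bar", lambda: ["instance-4", "instance-5"])
# Connection 3 reuses the cached "foo" entry, so no new query is issued.
topology_3 = cache.get("foo", lambda: ["instance-1", "instance-2", "instance-3"])

print(cache.queries)             # two queries for three connections
print(topology_1 is topology_3)  # same cached object is shared

# If a connection to the second cluster wrongly reused "foo", it would be
# handed the first cluster's topology instead of its own.
wrong_topology = cache.get("foo", lambda: ["instance-4", "instance-5"])
print(wrong_topology)
```

In this model, a shared key saves one query for the third connection, while a wrongly shared key silently returns another cluster's instances, which is the collision the warnings later in this document describe.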
| 75 | + |
| 76 | +## When to Specify cluster_id |
| 77 | + |
| 78 | +### Required: Multiple Clusters in One Application |
| 79 | + |
| 80 | +You **must** specify a unique `cluster_id` for every DB cluster when your application connects to multiple database clusters: |
| 81 | + |
| 82 | +```python |
| 83 | +from aws_advanced_python_wrapper import AwsWrapperConnection |
| 84 | +from psycopg import Connection |
| 85 | + |
| 86 | +# Source cluster connection |
| 87 | +with AwsWrapperConnection.connect( |
| 88 | +    Connection.connect, |
| 89 | +    "host=source-db.us-east-1.rds.amazonaws.com dbname=mydb user=admin password=pwd", |
| 90 | +    cluster_id="source-cluster", |
| 91 | +    autocommit=True |
| 92 | +) as source_conn: |
| 93 | +    source_cursor = source_conn.cursor() |
| 94 | +    source_cursor.execute("SELECT * FROM users") |
| 95 | +    rows = source_cursor.fetchall() |
| 96 | + |
| 97 | +# Destination cluster connection - different cluster_id! |
| 98 | +with AwsWrapperConnection.connect( |
| 99 | +    Connection.connect, |
| 100 | +    "host=dest-db.us-west-2.rds.amazonaws.com dbname=mydb user=admin password=pwd", |
| 101 | +    cluster_id="destination-cluster", |
| 102 | +    autocommit=True |
| 103 | +) as dest_conn: |
| 104 | +    dest_cursor = dest_conn.cursor() |
| 105 | +    # ... migration logic |
| 106 | + |
| 107 | +# Connecting to source-db via IP - same cluster_id as source_conn |
| 108 | +with AwsWrapperConnection.connect( |
| 109 | +    Connection.connect, |
| 110 | +    "host=10.0.0.1 dbname=mydb user=admin password=pwd", |
| 111 | +    cluster_id="source-cluster", |
| 112 | +    autocommit=True |
| 113 | +) as source_ip_conn: |
| 114 | +    pass |
| 115 | +``` |
| 116 | + |
| 117 | +### Optional: Single Cluster Applications |
| 118 | + |
| 119 | +If your application only connects to one cluster, you can omit `cluster_id` (defaults to `"1"`): |
| 120 | + |
| 121 | +```python |
| 122 | +from aws_advanced_python_wrapper import AwsWrapperConnection |
| 123 | +from psycopg import Connection |
| 124 | + |
| 125 | +# Single cluster - cluster_id defaults to "1" |
| 126 | +with AwsWrapperConnection.connect( |
| 127 | +    Connection.connect, |
| 128 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=mydb user=admin password=pwd", |
| 129 | +    autocommit=True |
| 130 | +) as conn: |
| 131 | +    cursor = conn.cursor() |
| 132 | +    cursor.execute("SELECT 1") |
| 133 | +``` |
| 134 | + |
| 135 | +The same applies when you open multiple connections to that one cluster using different host information: |
| 136 | + |
| 137 | +```python |
| 138 | +# cluster_id defaults to "1" |
| 139 | +with AwsWrapperConnection.connect( |
| 140 | +    Connection.connect, |
| 141 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=mydb user=admin password=pwd", |
| 142 | +    autocommit=True |
| 143 | +) as url_conn: |
| 144 | +    pass |
| 145 | + |
| 146 | +# "10.0.0.1" -> IP address of my-cluster. Same cluster, so default cluster_id "1" is fine. |
| 147 | +with AwsWrapperConnection.connect( |
| 148 | +    Connection.connect, |
| 149 | +    "host=10.0.0.1 dbname=mydb user=admin password=pwd", |
| 150 | +    autocommit=True |
| 151 | +) as ip_conn: |
| 152 | +    pass |
| 153 | +``` |
| 154 | + |
| 155 | +## Critical Warnings |
| 156 | + |
| 157 | +### NEVER Share cluster_id Between Different Clusters |
| 158 | + |
| 159 | +Using the same `cluster_id` for different database clusters will cause serious issues: |
| 160 | + |
| 161 | +```python |
| 162 | +# ❌ WRONG - Same cluster_id for different clusters |
| 163 | +source_conn = AwsWrapperConnection.connect( |
| 164 | +    Connection.connect, |
| 165 | +    "host=source-db.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 166 | +    cluster_id="shared-id" # ← BAD! |
| 167 | +) |
| 168 | + |
| 169 | +dest_conn = AwsWrapperConnection.connect( |
| 170 | +    Connection.connect, |
| 171 | +    "host=dest-db.us-west-2.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 172 | +    cluster_id="shared-id" # ← BAD! Same ID for different cluster |
| 173 | +) |
| 174 | +``` |
| 175 | + |
| 176 | +**Problems this causes:** |
| 177 | +- Topology cache collision (dest-db's topology could overwrite source-db's) |
| 178 | +- Incorrect failover behavior (the wrapper may try to fail over to the wrong cluster) |
| 179 | +- Monitor conflicts (a single monitor instance shared by both clusters leads to undefined results) |
| 180 | + |
| 181 | +**Correct approach:** |
| 182 | +```python |
| 183 | +# ✅ CORRECT - Unique cluster_id for each cluster |
| 184 | +source_conn = AwsWrapperConnection.connect( |
| 185 | +    Connection.connect, |
| 186 | +    "host=source-db.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 187 | +    cluster_id="source-cluster" |
| 188 | +) |
| 189 | + |
| 190 | +dest_conn = AwsWrapperConnection.connect( |
| 191 | +    Connection.connect, |
| 192 | +    "host=dest-db.us-west-2.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 193 | +    cluster_id="destination-cluster" |
| 194 | +) |
| 195 | +``` |
| 196 | + |
| 197 | +### Always Use the Same cluster_id for the Same Cluster |
| 198 | + |
| 199 | +Using different `cluster_id` values for the same cluster reduces efficiency: |
| 200 | + |
| 201 | +```python |
| 202 | +# ⚠️ SUBOPTIMAL - Different cluster_ids for same cluster |
| 203 | +conn1 = AwsWrapperConnection.connect( |
| 204 | +    Connection.connect, |
| 205 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 206 | +    cluster_id="my-cluster-1" |
| 207 | +) |
| 208 | + |
| 209 | +conn2 = AwsWrapperConnection.connect( |
| 210 | +    Connection.connect, |
| 211 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 212 | +    cluster_id="my-cluster-2" # Different ID for same cluster |
| 213 | +) |
| 214 | +``` |
| 215 | + |
| 216 | +**Problems this causes:** |
| 217 | +- Duplication of caches |
| 218 | +- Multiple monitoring threads for the same cluster |
| 219 | + |
| 220 | +**Best practice:** |
| 221 | +```python |
| 222 | +# ✅ BEST - Same cluster_id for same cluster |
| 223 | +CLUSTER_ID = "my-cluster" |
| 224 | + |
| 225 | +conn1 = AwsWrapperConnection.connect( |
| 226 | +    Connection.connect, |
| 227 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 228 | +    cluster_id=CLUSTER_ID |
| 229 | +) |
| 230 | + |
| 231 | +conn2 = AwsWrapperConnection.connect( |
| 232 | +    Connection.connect, |
| 233 | +    "host=my-cluster.us-east-1.rds.amazonaws.com dbname=db user=admin password=pwd", |
| 234 | +    cluster_id=CLUSTER_ID # Shared cache and resources |
| 235 | +) |
| 236 | +``` |
| 237 | + |
| 238 | +## Summary |
| 239 | + |
| 240 | +The `cluster_id` parameter is essential for applications connecting to multiple database clusters. It serves as a cache key for topology information and monitoring services. Always use unique `cluster_id` values for different clusters, and consistent values for the same cluster to maximize performance and avoid conflicts. |