patroni.ha module

class patroni.ha.Failsafe(dcs: patroni.dcs.AbstractDCS)

Bases: object

is_active() → bool

Used to report via the REST API whether failsafe mode was activated.

On the primary, self._last_update is set from the set_is_active() method, so this always returns the correct value.

On replicas, self._last_update is set at the moment the primary performs POST /failsafe REST API calls. As a side effect, replicas may show failsafe_is_active values that differ from the primary's.
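For illustration, the time-based check this implies can be sketched as follows; attribute names and the TTL handling here are assumptions, not Patroni's actual code:

    import time

    class FailsafeSketch:
        # Illustrative sketch only; Patroni's real Failsafe takes a DCS handle
        # and derives the freshness window from the cluster TTL.
        def __init__(self, ttl: float) -> None:
            self._ttl = ttl
            self._last_update = 0.0

        def set_is_active(self, value: float) -> None:
            # On the primary: record when failsafe mode was last activated.
            self._last_update = value

        def is_active(self) -> bool:
            # Active while the last update (set locally on the primary, or by
            # the primary's POST /failsafe call on replicas) is still fresh.
            return self._last_update + self._ttl > time.time()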

property leader
set_is_active(value: float) → None
update(data: Dict[str, Any]) → None
update_cluster(cluster: patroni.dcs.Cluster) → patroni.dcs.Cluster
class patroni.ha.Ha(patroni: patroni.__main__.Patroni)

Bases: object

acquire_lock() → bool
bootstrap() → str
bootstrap_standby_leader() → Optional[bool]

If we find the ‘standby’ key in the configuration, we need to bootstrap not a real primary, but a ‘standby leader’ that will take a base backup from a remote member and start following it.

call_failsafe_member(data: Dict[str, Any], member: patroni.dcs.Member) → bool
cancel_initialization() → None
check_failsafe_topology() → bool
check_timeline() → bool
Returns

True if should check whether the timeline is latest during the leader race.

clone(clone_member: Optional[Union[patroni.dcs.Leader, patroni.dcs.Member]] = None, msg: str = '(without leader)') → Optional[bool]
delete_future_restart() → bool
demote(mode: str) → Optional[bool]

Demote PostgreSQL running as primary.

Parameters

mode – One of offline, graceful, immediate or immediate-nolock:

  • offline is used when the connection to the DCS is not available.

  • graceful is used when failing over to another node due to a user request. May only be called async.

  • immediate is used when we determine that we are not suitable for primary and want to fail over quickly without regard for data durability. May only be called synchronously.

  • immediate-nolock is used when we find out that we have lost the leader lock and need to bring down PostgreSQL as quickly as possible without regard for data durability. May only be called synchronously.
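As a quick illustration, the calling constraints above could be summarized like this (a hypothetical helper, not part of Patroni):

    # Hypothetical summary of the demote-mode constraints described above.
    ASYNC_ONLY = {'graceful'}
    SYNC_ONLY = {'immediate', 'immediate-nolock'}

    def check_demote_mode(mode: str, running_async: bool) -> None:
        # Enforce the "may only be called async/synchronously" rules.
        if mode in ASYNC_ONLY and not running_async:
            raise RuntimeError(f'{mode} demote may only be called async')
        if mode in SYNC_ONLY and running_async:
            raise RuntimeError(f'{mode} demote may only be called synchronously')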

enforce_follow_remote_member(message: str) → str
enforce_primary_role(message: str, promote_message: str) → str

Ensure the node that has won the race for the leader key meets the criteria for promoting its PG server to the ‘primary’ role.

evaluate_scheduled_restart() → Optional[str]
failsafe_is_active() → bool
fetch_node_status(member: patroni.dcs.Member) → patroni.ha._MemberStatus

Performs an HTTP GET request on member.api_url and fetches the member's status.

Returns

_MemberStatus object
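In spirit, the fetch looks like the following sketch (using urllib directly; Patroni's actual implementation and the _MemberStatus wrapper differ):

    import json
    from urllib.request import urlopen

    def fetch_status_sketch(api_url: str, timeout: float = 2.0) -> dict:
        # Issue a GET against the member's REST API (member.api_url) and
        # decode the JSON status body; errors would mark the member unknown.
        with urlopen(api_url, timeout=timeout) as resp:
            return json.loads(resp.read().decode('utf-8'))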

fetch_nodes_statuses(members: List[patroni.dcs.Member]) → List[patroni.ha._MemberStatus]
follow(demote_reason: str, follow_reason: str, refresh: bool = True) → str
future_restart_scheduled() → Dict[str, Any]
get_effective_tags() → Dict[str, Any]

Return configuration tags merged with dynamically applied tags.

get_failover_candidates(check_sync: bool = False) → List[patroni.dcs.Member]

Return list of candidates for either manual or automatic failover.

Mainly used to later be passed to Ha.is_failover_possible().

Parameters

check_sync – if True, also check against the sync key members

Returns

a list of Member objects, or an empty list if no candidate is available
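A hypothetical usage, assuming ha is an Ha instance, might combine the two methods like this:

    # Hypothetical: gather candidates, then check whether failover could succeed.
    candidates = ha.get_failover_candidates(check_sync=ha.is_synchronous_mode())
    if ha.is_failover_possible(candidates):
        ...  # e.g. proceed with releasing the leader key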

get_remote_member(member: Optional[Union[patroni.dcs.Leader, patroni.dcs.Member]] = None) → patroni.dcs.RemoteMember

Get remote member node to stream from.

In the case of a standby cluster this tells us which remote member to stream from. The config can be either the Patroni config or cluster.config.data.
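For context, a minimal standby_cluster section (shown here as a Python dict; real configurations may carry more keys) identifies the remote member to stream from:

    # Hypothetical minimal standby_cluster configuration, for illustration:
    standby_cluster = {
        'host': '10.0.0.42',  # remote member to take a base backup from and stream from
        'port': 5432,
    }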

handle_long_action_in_progress() → str

Figure out what to do with the task AsyncExecutor is performing.

handle_starting_instance() → Optional[str]

Starting up PostgreSQL may take a long time. If we are the leader, we may want to fail over to another node while it is still starting.

has_lock(info: bool = True) → bool
is_failover_possible(members: List[patroni.dcs.Member], check_synchronous: Optional[bool] = True, cluster_lsn: Optional[int] = 0) → bool

Checks whether one of the members from the list can possibly win the leader race.

Parameters
  • members – list of members to check

  • check_synchronous – if True, consider only members that are known to be listed in the /sync key when synchronous replication is enabled.

  • cluster_lsn – used to calculate replication lag and exclude a member if it is lagging.

Returns

True if there are members eligible to be the new leader
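A self-contained sketch of the eligibility filtering described above; shapes and names are assumptions, not Patroni's internals:

    from typing import Iterable, Optional, Set, Tuple

    def failover_possible_sketch(statuses: Iterable[Tuple[str, int]],
                                 sync_members: Optional[Set[str]],
                                 cluster_lsn: int, max_lag: int) -> bool:
        # statuses: (member name, wal position) pairs, e.g. collected via
        # fetch_nodes_statuses(); a hypothetical simplified shape.
        for name, wal_position in statuses:
            if sync_members is not None and name not in sync_members:
                continue  # check_synchronous: only /sync key members qualify
            if cluster_lsn and cluster_lsn - wal_position > max_lag:
                continue  # excluded: member is lagging too far behind
            return True   # at least one member could win the leader race
        return False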

is_failsafe_mode() → bool
Returns

True if failsafe_mode is enabled in global configuration.

is_healthiest_node() → bool

Performs a series of checks to determine that the current node is the best candidate.

If a manual failover/switchover is requested, it calls the manual_failover_process_no_leader() method.

Returns

True if the current node is among the best candidates to become the new leader.

is_lagging(wal_position: int) → bool

Checks whether a node with the given WAL position should consider itself unhealthy to be promoted, due to replication lag.

Parameters

wal_position – Current WAL position.

Returns

True when the node is lagging
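A minimal sketch of that check, assuming the threshold comes from the maximum_lag_on_failover setting:

    def is_lagging_sketch(wal_position: int, cluster_last_lsn: int,
                          maximum_lag_on_failover: int) -> bool:
        # Unhealthy for promotion when more than the allowed number of bytes
        # behind the last known cluster LSN.
        return cluster_last_lsn - wal_position > maximum_lag_on_failover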

is_leader() → bool
is_paused() → bool
Returns

True if in maintenance mode.

is_standby_cluster() → bool
Returns

True if global configuration has a valid “standby_cluster” section.

is_sync_standby(cluster: patroni.dcs.Cluster) → bool
Returns

True if the current node is a synchronous standby.

is_synchronous_mode() → bool
Returns

True if synchronous replication is requested.

load_cluster_from_dcs() → None
manual_failover_process_no_leader() → Optional[bool]

Handles manual failover/switchover when the old leader already stepped down.

Returns

  • True if the current node is the best candidate to become the new leader

  • None if the current node is running as a primary and requested candidate doesn’t exist

notify_citus_coordinator(event: str) → None
post_bootstrap() → str
post_recover() → Optional[str]
primary_stop_timeout() → Optional[int]
Returns

“primary_stop_timeout” from the global configuration or None when not in synchronous mode.

process_healthy_cluster() → str
process_manual_failover_from_leader() → Optional[str]

Checks if manual failover is requested and takes action if appropriate.

Cleans up failover key if failover conditions are not matched.

Returns

action message if demote was initiated, None if no action was taken

process_sync_replication() → None

Process synchronous standby behavior.

Synchronous standbys are registered in two places: postgresql.conf and the DCS. The order of updating them matters. The invariant to maintain is that if a node is primary and sync_standby is set in the DCS, then that node must have synchronous_standby_names set to that value. Put more simply: when adding, first set it in postgresql.conf and then in the DCS; when removing, first remove it from the DCS and then from postgresql.conf. This ensures we only consider promoting standbys that were guaranteed to be replicating synchronously.
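The ordering invariant can be sketched as follows (the helpers are hypothetical stand-ins, not Patroni's API):

    def update_postgresql_conf(value): ...  # hypothetical: set synchronous_standby_names
    def update_dcs_sync_key(value): ...     # hypothetical: write/clear the /sync key

    def add_sync_standby(name):
        update_postgresql_conf(name)  # 1. postgresql.conf first
        update_dcs_sync_key(name)     # 2. then the DCS

    def remove_sync_standby(name):
        update_dcs_sync_key(None)     # 1. remove from the DCS first
        update_postgresql_conf(None)  # 2. then postgresql.conf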

process_unhealthy_cluster() → str

Handle the case when the cluster has no leader key.

recover() → str

Handle the case when postgres isn’t running.

Depending on the state of Patroni, DCS cluster view, and pg_controldata the following could happen:

  • if primary_start_timeout is 0 and this node owns the leader lock, the lock will be voluntarily released if there are healthy replicas to take it over.

  • if postgres was running as a primary and this node owns the leader lock, postgres is started as primary.

  • crash recovery in single-user mode is executed in the following cases:

    • postgres was running as a primary and wasn’t shut down cleanly, and there is no leader in DCS

    • postgres was running as a replica and wasn’t shut down in recovery (cleanly), and we need to run pg_rewind to join back to the cluster

  • pg_rewind is executed if it is necessary, or, optionally, the data directory could be removed if it is allowed by configuration.

  • after crash recovery and/or pg_rewind are executed, postgres is started in recovery.

Returns

action message, describing what was performed.
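The cases above can be condensed into a hypothetical decision sketch (heavily simplified, not Patroni's actual control flow):

    from dataclasses import dataclass

    @dataclass
    class NodeState:  # hypothetical condensed state, for illustration only
        primary_start_timeout: int
        has_leader_lock: bool
        healthy_replicas: bool
        was_primary: bool
        clean_shutdown: bool

    def recover_sketch(s: NodeState) -> str:
        # Mirrors the bullets above, heavily simplified.
        if s.primary_start_timeout == 0 and s.has_leader_lock and s.healthy_replicas:
            return 'released leader key voluntarily'
        if not s.clean_shutdown:
            pass  # first: crash recovery in single-user mode and/or pg_rewind
        if s.was_primary and s.has_leader_lock:
            return 'starting primary'
        return 'starting postgres in recovery'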

reinitialize(force: bool = False) → Optional[str]
release_leader_key_voluntarily(last_lsn: Optional[int] = None)None
restart(restart_data: Dict[str, Any], run_async: bool = False) → Tuple[bool, str]

Conditional and unconditional restart.

restart_matches(role: Optional[str], postgres_version: Optional[str], pending_restart: bool) → bool
restart_scheduled() → bool
run_cycle() → str
schedule_future_restart(restart_data: Dict[str, Any]) → bool
set_is_leader(value: bool) → None
set_start_timeout(value: Optional[int]) → None

Sets the timeout for starting as primary before the node becomes eligible for failover.

Must be called when async_executor is busy or in the main thread.

should_run_scheduled_action(action_name: str, scheduled_at: Optional[datetime.datetime], cleanup_fn: Callable[[], Any]) → bool
shutdown() → None
static sysid_valid(sysid: Optional[str]) → bool
touch_member() → bool
update_cluster_history() → None
update_failsafe(data: Dict[str, Any]) → Optional[str]
update_lock(write_leader_optime: bool = False) → bool
wakeup() → None

Trigger the next run of HA loop if there is no “active” leader watch request in progress.

This usually happens on the leader, or when the node is running an async action.

watch(timeout: float) → bool
while_not_sync_standby(func: Callable[[], Any]) → Any

Runs specified action while trying to make sure that the node is not assigned synchronous standby status.

Tags us as not allowed to be a sync standby, since we are going to go away. If we currently are the sync standby, wait for the leader to notice and pick an alternative; if the leader changes or goes away, we are also free.

If the connection to DCS fails we run the action anyway, as this is only a hint.

There is a small race window where this function runs between a primary picking us as the sync standby and publishing it to the DCS. As the window is rather tiny, and the consequence is merely holding up commits for one cycle period, we don’t worry about it here.
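The pattern can be sketched as follows (helper names are hypothetical stand-ins for Patroni internals):

    def set_nosync_tag(value): ...                 # hypothetical: advertise the hint
    def wait_until_not_sync_or_leader_gone(): ...  # hypothetical: observe the leader

    def while_not_sync_standby_sketch(func):
        try:
            set_nosync_tag(True)   # ask not to be picked as sync standby
            wait_until_not_sync_or_leader_gone()
        except Exception:
            pass                   # DCS failure: run the action anyway, it is only a hint
        try:
            return func()          # run the action (e.g. a restart)
        finally:
            set_nosync_tag(False)  # become eligible for sync standby status again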