chyli. : The Fuwa Fuwa-est of all time logo

chyli. : The Fuwa Fuwa-est of all time

We boutta Rick the Matrix mate!

/ EDIT :

Oi mate, I do hosting a small Matrix instance using Synapse for a small group of frens, and we just stump opon a big fahking Federation issues and its fahking annoying so i do made this blog as a reassemble of the fix we walking through as much as possible

So, this blog has 2 parts, for 2 different problem set as follows:


auth_error in significant event causing Synapse to reject any new Federated Messages

Before we getting into anything deeper, Let’s see how Matrix Protocol validate events first.

PDUs

Each room event is contained in a PDU(Persistent Data Unit), and EACH PDU CONTAINS EXACTLY ONE EVENT. And with prev_events that reference to previous events thus creating a “Directed Acyclic Graph”(DAG) Linking all events in the room1.

Each PDU is signed with originating server’s private key before getting wrapped inside a transaction sent to each homeserver2. And in order to validate the PDU that was sent to our Homeserver, it has to pass the following mechanism as defined in Matrix Specification v1.183 :

  • [v1.16++] Is a valid event? otherwise it is dropped. For event to be valid, it has to comply with Format of the Room version
  • Passes signature checks? dropped
  • Passes hash checks? redacted before being processed further
  • Passes authorization rules based on the event’s auth events? rejected
  • Passes authorization rules based on the state before the event? rejected
  • Passes authorization rules based on the current state of the room? “soft failed”
  • [v1.18++] Is validated by the Policy Server? if the room is using a Policy Server, otherwise it is “soft failed”

Authorization rules

Now that PDU has passed the validation, by protocol, it still requires an additional check, Authorization rules, these are to determine whether or not an event is authorized. For our scope, we will focus on the auth_events field of the PDU, that references a set of events that give sender permission to send the event. Except for m.room.create event, the auth_events typically include following room state4:

Rejection

if event is rejected5:

Despite being rejected, its block data and event should still kept6 in the database


Now that’s all we should know to be able to identify the clause, and fix of the problem that going to be describe next.

So…. It all start with a single m.room.power_levels event from one of our mod, somehow get rejected with auth_error by my Homeserver. And i started to missing majority of new messages in that group chat, except some homeserver that i still able to communicate with, despite that i’m still able to send messages/attachment to federation.

Investigation

WARNING - _process_incoming_pdus_in_room_inner-311-random_event - While checking auth of '<'FrozenEventV3 event_id=random_event, type=m.room.message, state_key=None, outlier=False> against auth_events: 403: During auth for event random_event: found rejected event power_levels_event in the state

These are samples from hundreds of error logs that occur like a chain. But all error events lead to the same power_levels_event, so we know what we should look into now

Before we get into the Synapse’s database. The table below represents events table schema in the database, which was taken from the SQLite 7 schema file. While there are 2 types of schema files available, SQLite and PostgreSQL, there shouldn’t be any difference in term of functionality. View full schema Here

Matrix Synapse events schema

As of Synapse version v1.151.0: SQLite

NAME TYPE CONSTRAINTS / DEFAULTS
stream_orderingINTEGERPRIMARY KEY
topological_orderingBIGINTNOT NULL
event_idTEXTNOT NULL, UNIQUE
typeTEXTNOT NULL
room_idTEXTNOT NULL
contentTEXT
unrecognized_keysTEXT
processedBOOLNOT NULL
outlierBOOLNOT NULL
depthBIGINTNOT NULL, DEFAULT 0
origin_server_tsBIGINT
received_tsBIGINT
senderTEXT
contains_urlBOOLEAN
instance_nameTEXT
state_keyTEXTDEFAULT NULL
rejection_reasonTEXTDEFAULT NULL

The initial auth_error:

Figure 1: The initial auth_error event that causing auth_error chain. Image via events table in Synapse database

Figure 1: The initial auth_error event that causing auth_error chain. Image via events table in Synapse database

  1. As marked in an image, the initial m.room.power_levels event by room admin, and somehow accepted by other homeserver but not ours

  2. As a result, new federated event now reference to rejected m.room.power_levels event also getting rejected too

We also learned that rejected event are only indicated by rejection_reason column and in rejections table

Data inside rejected event:

Figure 2: Sample json block from one of event in auth_error chain. Image via json column in event_json table in Synapse database

Figure 2: Sample json block from one of event in auth_error chain. Image via json column in event_json table in Synapse database

This explains why most of the new federated events are getting rejected. This is due to how the Matrix protocol works as defined in an earlier state of the blog.

Not just that, this also indicated that the event and event data are still being kept in the database even if the event itself is rejected. This paves the way for the possibility of recovering event states from the so-called Tashbin!

Despite all that, there are still 2 edge cases for this theory, one is matrix.org event still being accepted together with 1 independent homeserver. Another event sent from our homeserver is still being accepted in federation in spite of the auth_error chain going on.

  1. How matrix.org and 1 independent homeserver event are accepted?

    With love to the power of federation and decentralization of the Matrix Protocol, matrix.org instance still references the globally accepted m.room.power_levels. While the independent homeserver is using Tuwunel and still reference to the initial m.room.power_levels event of the room created years ago, which can neither confirm whether these are expected or anti-rejection event mechanism. But as a result, all events from those 2 homeservers are accepted nevertheless

  2. How event sent from our homeserver still being globally accept during the auth_error chain

    Also, thanks to the power of the federation, rejected state events are not getting updated to the current one, so the event is still being sent using the old m.room.power_levels. That made the new event outgoing from our homeserver still being accept globally despite incoming not being!

Database Doctoring

Since there aren’t any documented ways to fix auth_error, or it wasn’t intended to be fixed. Whatever, in this final part, I’ll walk through my steps of how i recovered the rejected event, for it to finally be exposed to the client in chat history and to prevent newer federate event from getting rejected again! And as there aren’t any documented fix, we are going to do it the old school way. And that’s direct SQL manipulation! 😀. Our plan is to nullify any reference to rejection in the events table and remove the entry in the rejection table, as if nothing ever happened.

Starting from this point are gonna be dangerous and could bring irreversible damage to your database IF NOT BACKUP BEFORE DOING ANYTHING, and also don’t believe everything I said, as it might work for me, but it may not work for YOU!

⚠️YOU HAVE BEEN WARNED⚠️

NOTE: Parts of this guide is using function that is exclusive to SQLITE(only effected the strftime)

1. Check for auth_error in our room from dates before incident

First, we start by checking for auth_error event within our database. I decided to check from dates before incident to be able to see bigger visual of the event states, like the actual initial event of the error chains

We did this because in our case, m.room.power_levels was changed more than once after initial error event, and with more and more times we let auth_error events happen under the radar, more complex clean up operation gonna be.

Likewise Synapse error log only referred to current m.room.power_levels events in console, it is more reliable if we look into the error chain from bigger pictues instead

Now before going through with step 2, don't forget to pick your target event_id which we will perform a forcefully removal of rejection status, by start from there

SELECT *
FROM events
WHERE room_id = 'room_id'
  AND rejection_reason = 'auth_error'
  AND origin_server_ts > strftime('%s', 'start_date') * 1000;

I leave it here just in case you want to looking into specific event_id

SELECT *
FROM events
WHERE room_id = 'room_id'
  AND event_id = 'event_id';

2. Check for rejections in rejections table

Not just in events table, Synapse also keep rejection event within rejections table! Which idk why, but it does contained almost exactly same data as in events table but with an additional last_check column

This gonna be ridiculous step for SQL wizard outhere, but since there is no room_id in the rejections table and i don't want to go that deep in coding for this task since it a small instance and i trusted my mate, so what we are planning to do is just gonna remove all rejection altogether starting from our event_id. But before going through with that in the next step. I'd like to get a glimpse of what we gonna remove first 😂

NOTE: it is possible to just limit destruction to a single room either by SQL query tweaking or Programming, but it depends on the seriousness on the instance :)
SELECT * FROM rejections WHERE event_id = 'event_id';

SELECT *
FROM rejections
WHERE rowid > (
    SELECT MIN(rowid)
    FROM rejections
    WHERE event_id LIKE 'event_id%'
)
ORDER BY rowid;

3. set rejection_reason to NULL in events table

as stated earlier, one of the way to get missed federated event shown up and not just new event, is to erase any traces of auth_error chain in the room entirely

and by setting rejection_reason to NULL for the event, theoretically should make events expose to client and is no longer deem as invalid events

UPDATE events
SET rejection_reason = NULL
WHERE room_id = 'room_id'
  AND rejection_reason = 'auth_error'
  AND origin_server_ts > strftime('%s', 'start_date') * 1000;

4. Delete rejected events from rejections table

While old federated messages might starts to shown up, as a result from previous step. But rejections table still has entries for m.room.power_levels event, that made new federated event still get reject as auth_error

DELETE FROM rejections WHERE event_id = 'event_id';

DELETE FROM rejections
WHERE rowid > (
    SELECT MIN(rowid)
    FROM rejections
    WHERE event_id LIKE 'event_id%'
);

In closing, I’d like to apologize for not being able to determine the actual cause of the initial error, as several days had passed before the investigation began. Additionally, due to Docker Compose’s default behavior, logs are overwritten on each restart—and this instance had already been restarted multiple times beforehand T_T.

I promise to handle it better next time if this rare issue occurs again

But, being able to recover the rejected events and successfully accept federated events again is still a win 🙂



Migrate old events from old homeserver to the new one

I had 2 Synapse homeserver, one in domain A, another in domain B, both of them in the same room, but in different timeline.

I want to ditch domain A, but i’d lost ~100k of events prior to me joining the room from domain B, and I don’t want that

The concept for Room DAG in Synapse is that Events are normally sorted by both topological_ordering and stream_ordering where stream_ordering == depth. Also stream_ordering is auto increment of 1, while backfill is decrement of 1 after -18

And events schema tell us exactly what we needed. So waste no more times, our migration plans is,

In our implementation, the database is SQLite. The code for this implementation can be seen here, also FULL DISCLOSURE, it was written by Claude under my guidance. And for full Transparency, prompt is here9.

Figure 3: The script final stats. Image via sync script

Figure 3: The script final stats. Image via sync script

Now, for the record, over ~142k events were successfully imported into the new database. And while I cannot provide any room images with respect to members’ privacy. But I’d say chat logs dating back to who knows when are safe in the new database and are able to be accessed from Element! Trust me bro


  1. https://spec.matrix.org/v1.18/server-server-api/#pdus ↩︎

  2. https://spec.matrix.org/v1.18/server-server-api/

    Like email, it is the responsibility of the originating server of a PDU to deliver that event to its recipient servers. However PDUs are signed using the originating server’s private key so that it is possible to deliver them through third-party servers.

     ↩︎
  3. https://spec.matrix.org/v1.18/server-server-api/#checks-performed-on-receipt-of-a-pdu ↩︎

  4. https://spec.matrix.org/v1.18/server-server-api/#auth-events-selection ↩︎

  5. https://spec.matrix.org/v1.18/server-server-api/#rejection ↩︎

  6. https://spec.matrix.org/v1.18/server-server-api/#rejection

    Info:

    This means that events may be included in the room DAG even though they should be rejected.

     ↩︎
  7. https://github.com/element-hq/synapse/blob/v1.151.0/synapse/storage/schema/main/full_schemas/72/full.sql.sqlite ↩︎

  8. https://element-hq.github.io/synapse/latest/development/room-dag-concepts.html?highlight=backfill#depth-and-stream-ordering ↩︎

  9. Claude prompt

    Write scripts to compare two SQLite3 databases that should have the same schema (Matrix Synapse). Then filter the process to a specific room. In the `events` table, the `event_id` column refers to the events we may need, and the events are ordered by `topological_ordering` with the condition `WHERE room_id = "ourroomid"`. What we need to do is compare `dbinput` with `dbtarget`. For each `topological_ordering`, check whether it exists in `dbtarget`. If it does not, copy that event into the target database. **Important:** When copying, do **not** copy the original `stream_ordering`. Instead, assign a new `stream_ordering` value by incrementing (`+1`) from the current highest index in the `events` table of the target database. Then, copy the corresponding row from `event_json` where the `event_id` matches. Repeat this process for all `event_id` values across every relevant table, which those are `event_json` `event_edges` `event_auth` `state_events` `event_to_state_groups` `event_search`
    
     ↩︎

#Blog #Migration #Matrix