this post was submitted on 27 Aug 2023
53 points (94.9% liked)

Asklemmy

I can understand patch updates, but what else are the devs doing?

top 12 comments
[–] [email protected] 46 points 10 months ago

They could be upgrading hosting infrastructure - sometimes this requires servers to be shut down or restarted. They might also be applying database changes such as migrating data from one server to another, or updating the structure of the database to improve performance or support new features.

Honestly, there are quite a number of reasons for planned downtime.

Unplanned downtime is a different story. Usually that's because something unexpected went wrong and there will be engineers trying to get things back up and running ASAP.

[–] [email protected] 19 points 10 months ago (1 children)

They interrogate the player characters 1 by 1 and question if their human has any suspicious activities.

[–] [email protected] 1 points 10 months ago

This is great

[–] [email protected] 12 points 10 months ago* (last edited 10 months ago) (1 children)

Over a decade ago, I worked in a big tech company that had a scheduled downtime on one Saturday a month. That was for database schema changes.

When you're changing the structure of how you keep track of customer data, you need to make sure that no customers are making changes at that same time. So you take the whole customer-facing service down for a little while, make the schema changes, test them, and then bring the customer-facing service back up. Ideally this takes a few minutes ... but you're prepared for it to take hours.
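A rough sketch of that sequence, not what that company actually ran: `Service`, `run_maintenance`, the `customers` table and the `email` column are all made up for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


class Service:
    """Hypothetical customer-facing service backed by a single database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        self.maintenance = False

    def add_customer(self, name):
        # Customer writes are refused while the maintenance window is open.
        if self.maintenance:
            raise RuntimeError("down for scheduled maintenance")
        with self.db:
            self.db.execute("INSERT INTO customers (name) VALUES (?)", (name,))


def run_maintenance(service):
    # 1. Take the customer-facing side down so nobody writes mid-change.
    service.maintenance = True
    # 2. Make the schema change.
    with service.db:
        service.db.execute("ALTER TABLE customers ADD COLUMN email TEXT")
    # 3. Test it. If this fails, the exception propagates and the service
    #    stays down until someone fixes it.
    columns = [row[1] for row in service.db.execute("PRAGMA table_info(customers)")]
    if "email" not in columns:
        raise RuntimeError("schema change did not apply")
    # 4. Bring the service back up.
    service.maintenance = False
```

The only point is the ordering: block writes, change, verify, unblock. The "prepared for it to take hours" part is step 3 going badly.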

As the technology improved, and as the developers learned better how to make changes to the system without requiring deep interventions, long downtime for schema changes became less necessary ... for that particular business.

Every tech company pretty much has to learn how to do these sorts of changes for themselves, though.

[–] [email protected] 1 points 10 months ago* (last edited 10 months ago)

This is the most informed answer in this thread. It really does come down to schema changes. There are even ways to avoid downtime during schema changes, but it's often complicated. For example, you don't see YouTube go offline for schema changes, because they're willing to make this effort and investment, even for very large databases.

Lots of other database tasks can happen while remaining online. For backups, use a read-only connection. For upgrades, you should have a distributed and scaled database, so take them down in sections during upgrades. For "cleaning up," you can do vacuum operations on part of your database while it's live. Etc etc.
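To make "take them down in sections" concrete, here is a toy model of a rolling upgrade. It assumes a cluster of interchangeable nodes; `Node` and `rolling_upgrade` are hypothetical names, and a real upgrade would also drain connections and run health checks between batches.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: int
    online: bool = True


def rolling_upgrade(nodes, new_version, batch_size=1):
    """Upgrade nodes one batch at a time.

    Returns the lowest number of nodes that were online at any point,
    which is the capacity users had during the upgrade.
    """
    min_online = sum(n.online for n in nodes)
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            node.online = False  # take only this section down
        min_online = min(min_online, sum(n.online for n in nodes))
        for node in batch:
            node.version = new_version
            node.online = True  # bring it back before touching the next section
    return min_online
```

With four nodes and `batch_size=1`, three nodes are serving traffic at every moment of the upgrade, so the service as a whole never goes offline.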

Ultimately, there is almost never a technical reason why a database has to go offline. It's a matter of devotion to the stability and uptime of your infra. Toss enough engineering hours at a database problem and you can pretty much have 100% uptime in the scope of maintenance (not incidents, of course). But even with incidents, there are fail-over plans, replicas, and a ton of other things you can do to stay online. Instead of downtime, you have degraded performance that the users may not even notice.
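A minimal sketch of the fail-over idea, assuming a primary plus read replicas. The `Database` class and `read_with_failover` are invented for this example and do not correspond to any particular database's client API.

```python
class Database:
    """Hypothetical database endpoint holding a copy of the data."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy

    def read(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data[key]


def read_with_failover(key, primary, replicas):
    """Try the primary first, then each replica in order.

    Returns (value, name_of_server_that_answered).
    """
    for db in [primary, *replicas]:
        try:
            return db.read(key), db.name
        except ConnectionError:
            continue  # this server is down, fall through to the next one
    raise ConnectionError("no database available")
```

If the primary is down, the read is answered by a replica. That is the "degraded performance instead of downtime" case: possibly slower or slightly stale, but still online.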

[–] [email protected] 11 points 10 months ago

Not just database migrations as others have mentioned, but database state. Databases can accumulate a lot of dead data because of how transactions and locks work. Cleaning that up can block usage of the database for a short time. It's easiest to do this periodically if there's downtime.
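A small illustration of the cleanup step, using SQLite purely as a stand-in (an assumption on my part; the dead data described above is more typical of server databases, and the mechanics differ per engine). Deleted rows leave free pages inside the file until `VACUUM` rebuilds it. The `events` table and the function names are made up.

```python
import os
import sqlite3
import tempfile


def free_pages(conn):
    # Number of unused pages still sitting inside the database file.
    return conn.execute("PRAGMA freelist_count").fetchone()[0]


def demo_cleanup(db_path, rows=2000):
    # isolation_level=None means autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    # Make sure freed pages are not reclaimed automatically.
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)",
        [("x" * 500,) for _ in range(rows)],
    )
    conn.execute("COMMIT")
    # Deleting most rows leaves their pages behind as dead space.
    conn.execute("DELETE FROM events WHERE id > 100")
    before = free_pages(conn)
    # VACUUM rewrites the whole database; writers are locked out while it runs.
    conn.execute("VACUUM")
    after = free_pages(conn)
    conn.close()
    return before, after


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        return demo_cleanup(os.path.join(tmp, "demo.db"))
```

`run_demo()` returns the free-page count before and after the cleanup. The rewrite is the part that is simpler to do while nobody is using the database.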

[–] [email protected] 9 points 10 months ago

Database schemas can be updated, new services and special functionalities can be first activated and afterwards tested with specific accounts, among a myriad of other things, depending on the game and the update.
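"Activated first, then tested with specific accounts" is often done with some kind of feature flag. A hedged sketch, with the `FeatureFlags` class, the `new_raid` feature and the account names all invented for illustration:

```python
class FeatureFlags:
    """Hypothetical flag store: features can be on for testers only, or for everyone."""

    def __init__(self):
        self.released = set()  # features enabled for all accounts
        self.testers = {}      # feature name -> set of account ids allowed early

    def enable_for(self, feature, account_id):
        self.testers.setdefault(feature, set()).add(account_id)

    def release(self, feature):
        self.released.add(feature)

    def is_enabled(self, feature, account_id):
        return feature in self.released or account_id in self.testers.get(feature, set())


flags = FeatureFlags()
flags.enable_for("new_raid", "qa_account_1")        # activated for a test account only
tester_sees_it = flags.is_enabled("new_raid", "qa_account_1")
player_sees_it = flags.is_enabled("new_raid", "regular_player")
flags.release("new_raid")                           # after testing, open it to everyone
```

Before the release, only the test account gets the new functionality; afterwards every account does.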

[–] [email protected] 7 points 10 months ago

databases are weirdly mechanical in that you have to shut them off now and then to sort of straighten out the rows and columns, and chuck out abandoned or corrupted files.. maybe add some grease in the form of optimizations and then fire it back up so users can get it all messy again inside.. mostly because they're all written just well enough to function..

[–] [email protected] 5 points 10 months ago

The servers run on regular operating systems. They might wish to back up the storage (and databases), update the OS, or update their game server software, all of which is a lot easier if the service is stopped.

[–] [email protected] 4 points 10 months ago

To add to others' posts: it can be a huge variety of things that risk making the service unstable or unresponsive, and in the worst case could corrupt data in flight.

Customers view scheduled maintenance as a minor inconvenience, an unplanned outage as an annoyance, and loss of data as a dealbreaker.

So any time there was a chance that what we needed to do would limit functionality, or otherwise make the system unstable, it was best to take the system offline for scheduled maintenance.

[–] [email protected] 4 points 10 months ago

Maintenance.

[–] [email protected] 2 points 10 months ago

I'm just picturing Lakitu as a gaming systems operator picking the server up out of the lake and putting it back on the track facing the right direction.

Lots of other folks covered a lot of the details, but I just want to point out that the answer really depends on the architecture of the system that the game runs inside of. Cloud native architecture treats component failure as an expected characteristic and plans for it, which means it avoids unplanned outages better than legacy architectures. Even then, sometimes you end up with a component where you have a choice: invest thousands of engineering hours to avoid downtime during an upgrade, or just tolerate a few minutes of downtime and spend those thousands of engineering hours on something with a better ROI.