Yorhel's VNDB Development ramblings

#1 by yorhel
2019-09-19 at 09:44
< report >This is just a thread where I can publicly ramble about the things I'm working on and thinking about. Feel free to ignore or to engage, whatever floats your boat.

I increased the font size and the size of a few boxes a few days ago. Last time I did that I ended up breaking a few things, but I only got feedback from one person this time, so it doesn't look like many people noticed. Stealthy changes are good changes, I hope.

I'm currently working on rewriting parts of the backend code with the hope that database access and forms will be more reliable and easily maintainable. I also plan to rewrite most of the Javascript into Elm in the process. This rewrite will be done gradually and is expected to take a while, but if all goes right, you shouldn't notice a thing (read: I will be breaking a lot). The repo readme has more information on the state of the codebase, for anyone curious.

Also thinking of swapping the "title" and "original" fields, so that "title" will hold the title in whatever script, and then replacing the "original" field with a "latin" field for the (optional) romanization. This was already part of the (not yet implemented) VN title proposal in t12465, but I think it'd make sense to do that for all entries.
#2 by roadi
2019-09-19 at 10:56
< report >
I also plan to rewrite most of the Javascript into Elm in the process.

Have you considered RIIR? :P

Jokes aside, rewriting (some of) the Perl parts in Rust would (hopefully) ease contributing, as there's less fear of actually breaking something in the process.
Dynamic languages like Perl are especially notorious in this regard, with implicit contracts sprinkled here and there.
#3 by yorhel
2019-09-19 at 11:13
< report >
Have you considered RIIR?
You may be kidding, but yes, I have. :)

I evaluated a bunch of different languages and frameworks some time ago, primarily Haskell and OCaml, for the same reasons you listed. Rust wasn't web enough yet at the time, but that's improving (I *really* like the markup crate, for example, and have great hopes for projects such as sauron).

But as much as I like static typing, I've found that I'm at least 10x as productive in dynamic languages, and my bug rate isn't even that bad. It'd take at least a year of full-time effort for me to rewrite VNDB in Rust - VNDB is just too large. That 10x factor isn't something that outside contributors are going to be able to compensate for and it's not like it's impossible to contribute to a Perl codebase, so long as it's clean and readable (which is what I'm hoping to improve - feedback is always welcome of course).
#4 by roadi
2019-09-19 at 15:52
< report >
It'd take at least a year of full-time effort for me to rewrite VNDB in Rust
I was afraid that might be the case.

Do you see incremental/partial rewrite as a nonviable option?
Hypothetically, if you did, what parts would you rewrite first, say, in Rust?

I've found that I'm at least 10x as productive in dynamic languages
Does that include refactoring/additions/modifications?

I'm not really an expert in Rust (yet :P), but what I've gathered thus far is that even extensive refactoring is somewhat of a pleasure to carry out.
Ironing out the structure might take some trying with the move/borrow semantics, but this grew on me quite quickly.
This is hardly so with e.g. Python, where things tend to blow up with even the slightest modification to some 'clever' code, and only after having hit that one corner case that should not even have happened in the first place.

But yes, sorry about the evangelism. :D
#5 by yorhel
2019-09-19 at 16:17
< report >
Do you see incremental/partial rewrite as a nonviable option?
That would involve either a lot of duplicated code or a lot of glue code to make Perl and Rust interact with each other. The latter is possible - I've done that with manned.org - but for non-trivial interactions the amount of glue code is easily going to eclipse any benefit you might gain.

Does that include refactoring/additions/modifications?
Yes, VNDB is not a complex project, it is merely a large project. There are a few shared modules that are used across the codebase and making changes there can be a bit tricky (though I've done that often enough without any problems), but most of the code consists of generating the individual pages and the code for different pages have very little to do with each other.

That's not to say some compile-time code checks can't help in certain areas, they definitely can (especially when making changes to the database schema, finding all the affected code can be quite a challenge!), but I wouldn't say the tradeoffs make a lot of sense for a project like VNDB.

Of course, if you feel like proving me wrong and you have a LOT of time on your hands, by all means start a VNDB rewrite in Rust. I'd be happy to throw away my Perl code and use that if it ticks all the boxes. :)
#6 by roadi
2019-09-19 at 17:07
< report >
by all means start a VNDB rewrite in Rust
At the moment, I'm completely Perl-illiterate and web applications are certainly not my forte.
Then again, DBs are much fun, so I'll probably have a look at this at some point(TM).

Do you think I should use v 2/-rw or v 3 as the basis, or _improvise_?
Last modified on 2019-09-19 at 17:08
#7 by yorhel
2019-09-19 at 17:16
< report >If you don't want to focus too much on the front-end, I'd recommend reusing the v2 HTML+CSS. As for the rest: best to improvise, I've no clue whether either structure is a good approach for Rust.

...but if you want a fun Rust project with a higher chance of success, I just thought of something else: People have been asking/complaining for a HTTP-based API for a while, and that could be implemented as a completely independent project from the rest of VNDB. Rust is even a pretty good candidate for it. Downside, though: We're going to have to think pretty hard about what that API is going to look like. I have a random assortment of requirements and ideas but there are a lot of details to work out.
#8 by roadi
2019-09-19 at 17:36
< report >Are we talking about a de/serialization API, e.g., GET/POST me some JSON?
If so, the de/serialization part would seem quite trivial to implement with serde.
#9 by yorhel
2019-09-19 at 18:54
< report >Uhm, yes, serde is pretty easy. There's quite a bit more to an API, but I'll let you figure that out. :3
#10 by roadi
2019-09-19 at 19:45
< report >
There's quite a bit more to an API
Yes, quite obviously.

I'd like to avoid a guessing game, so if you can give some requirements, that would be great.
Is this API supposed to cover everything the website allows, or perhaps some specific subset (e.g., read-only access)?

A simple query interface would seem quite redundant, but I guess that's where to start from.
#11 by yorhel
2019-09-20 at 07:01
< report >Continued in t3599.218
#12 by yorhel
2019-09-26 at 05:59
< report >Just had another little idea: how about using linear interpolation in the vote graphs? E.g. when someone votes a 9.4, that would count as 0.6 points for the '9' bar and 0.4 for the '10' bar. That might give a more accurate distribution of the votes, but has the downside of being a bit less intuitive - I mean, there would be fractional vote counts...

I came up with this idea while thinking up a way to cache the vote graphs in the database - they're pretty slow to generate for popular VNs. I could cache the graph counts as they are now, but such a cache couldn't be used to calculate the raw average. So I'd either have to cache that separately or... use linear interpolation, which has the nice side effect that if you take the weighted sum and divide by the number of votes, you do get the correct average.
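The interpolation idea above can be sketched in a few lines; a minimal illustration (the function names and the 1–10 bar range are assumptions for the sketch, not VNDB's actual code):

```python
def vote_bars(votes, lo=1, hi=10):
    """Distribute fractional votes over integer bars by linear interpolation.

    A vote of 9.4 contributes 0.6 to the '9' bar and 0.4 to the '10' bar,
    so the cached bars can also reproduce the raw average.
    """
    bars = {b: 0.0 for b in range(lo, hi + 1)}
    for v in votes:
        base = int(v)        # lower bar
        frac = v - base      # weight that spills over to the upper bar
        if base >= hi:       # a perfect 10 lands entirely on the top bar
            bars[hi] += 1.0
        else:
            bars[base] += 1.0 - frac
            bars[base + 1] += frac
    return bars

def average_from_bars(bars, n):
    # The weighted sum over the bars divided by the vote count
    # gives back the exact raw average.
    return sum(b * w for b, w in bars.items()) / n

votes = [9.4, 7.0, 8.5]
bars = vote_bars(votes)
```

With those three votes, the '9' bar ends up at 1.1 (0.6 from the 9.4 plus 0.5 from the 8.5) and the weighted average comes out to exactly (9.4 + 7.0 + 8.5) / 3 = 8.3, which is the cache-friendliness being described.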
#13 by yorhel
2019-09-27 at 09:38
< report >Some minor changes: I just (finally) added the Malay language and I changed the filter configuration for history pages a bit. It doesn't look quite as messy as before (I hope), you can change multiple filters without reloading the page and you can now enable/disable individual db entry types. Unfortunately, it does seem to require more clicks in many cases.
#14 by yorhel
2019-10-03 at 12:57
< report >Bah, I'm breaking stuff again. >.>

Sorting the user list by vote count resulted in an extraterrestrial server error for a bit, fixed now.

If you had the default spoiler level set to "Show all spoilers" in your profile, then that setting didn't apply correctly. Also fixed now.

And I have some more changes coming up, let's see how much more I can break!
#15 by yorhel
2019-10-07 at 10:25
< report >While I was fixing t12750.9 I also made a few other changes. The scores and spoiler flag of parent tags/traits could have changed a bit, but should not be significant.

What *is* significant is that VNs and chars now immediately show up on the tag/trait pages, so you don't have to wait for the cache to update. Unfortunately, I wasn't able to also fix the updating of the VNs/chars counters on the tag/trait pages, so they may still lag behind. I really don't like it when counters are inconsistent with displayed data, so I'll see if I can improve that somehow...

Any changes to the tag or trait entries themselves may still need up to 24 hours to propagate. I'll see if I can reduce that time a bit, but I can't entirely eliminate it. :(
#16 by yorhel
2019-10-14 at 16:36
< report >While I converted the user list to the new coding style, I also updated the pagination to allow skipping pages and to go directly to the last page. I'd like to apply this change to all listings when performance allows for it. A "result count" feature has been requested often enough, and this gets pretty close.

I also tried to apply this to the edit history listings when I was converting those, but it was far too slow and I couldn't find a way to optimize it. :(
#17 by skorpiondeath
2019-10-14 at 18:00
< report >
when performance allows for it
What pagination performance issue do you have, specifically?
#18 by yorhel
2019-10-14 at 18:22
< report >There are two problems with implementing that kind of pagination:
- You need to count the total number of results. A simple SQL count(*) takes over a second when there are 700k rows (that's how many database edits we have). In the case of the main "recent changes" listing I could probably maintain a few caches to get a faster count, but that'll complicate the code quite a bit.
- Even with that fixed, fetching the result set of the final page (i.e. an SQL OFFSET clause) is also going to be very slow. I don't even know how to fix that.

On the upside, we only have like 25k visual novels in the DB, so I'm more optimistic about this working out for the VN listings.
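The two-query scheme described above (one count(*) for the page links, one LIMIT/OFFSET fetch for the rows) looks roughly like this; a sketch using Python's stdlib sqlite3 in place of VNDB's actual PostgreSQL/Perl code, with an invented `changes` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE changes (id INTEGER PRIMARY KEY, requester INTEGER)")
con.executemany("INSERT INTO changes (requester) VALUES (?)",
                [(i % 5,) for i in range(250)])

PER_PAGE = 100

def fetch_page(page):
    # Query 1: total row count, needed to render "page X of Y" links
    # and to jump straight to the last page.
    (total,) = con.execute("SELECT count(*) FROM changes").fetchone()
    last_page = (total + PER_PAGE - 1) // PER_PAGE
    # Query 2: the actual rows. The OFFSET grows with the page number,
    # which is why fetching the final page of a 700k-row table is slow:
    # the engine still has to walk past all the skipped rows.
    rows = con.execute(
        "SELECT id FROM changes ORDER BY id DESC LIMIT ? OFFSET ?",
        (PER_PAGE, (page - 1) * PER_PAGE),
    ).fetchall()
    return total, last_page, rows

total, last_page, rows = fetch_page(3)
```

With 250 rows and 100 per page, page 3 is the last page and holds the remaining 50 rows, starting at id 50 when sorted descending.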
#19 by skorpiondeath
2019-10-14 at 19:05
< report >Here a simple test: link

For example tables like tags_vn where you have lot of records will benefit from a single indexed serial key.Last modified on 2019-10-14 at 19:09
#20 by yorhel
2019-10-14 at 19:23
< report >Hmm, caching a row number like that will work, but only if no filters are applied. Using that approach with filters probably requires a separate row number column for every possible combination of filters. :/
#21 by skorpiondeath
2019-10-14 at 19:34
< report >You need to use dense_rank in case of strange combinations, because row_number doesn't work in difficult cases. I'll write a query later if you send me an example of a hard filtered query.
Last modified on 2019-10-14 at 19:35
#22 by skorpiondeath
2019-10-14 at 20:22
< report >I just added a dense_rank() example link (Example 4). It's not that different; the main difficulty is to find the group of fields to order and paginate by.

Since I cannot read information_schema, I can't find info on primary keys; they're usually a good starting point for making groups. It's not a necessity, but it's usually good to have a serial (identity) primary key on every table, so you can have a clustered index on it.
Last modified on 2019-10-14 at 20:22
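To illustrate the row_number vs dense_rank distinction being discussed: row_number() hands out distinct numbers even to rows whose ordering key ties, while dense_rank() gives tied rows the same number, so a rank-based page boundary never splits a group of equal keys. A small sketch via Python's stdlib sqlite3 (window functions need SQLite 3.25+; the table and values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tags_vn (tag INTEGER, vid INTEGER)")
con.executemany("INSERT INTO tags_vn VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 12), (3, 13)])

# Two rows share tag=1: row_number() numbers them 1 and 2 in some
# arbitrary order, while dense_rank() assigns both the same rank 1
# and continues with 2, 3 for the remaining tags, without gaps.
rows = con.execute("""
    SELECT tag,
           row_number() OVER (ORDER BY tag) AS rn,
           dense_rank() OVER (ORDER BY tag) AS dr
    FROM tags_vn
    ORDER BY tag, vid
""").fetchall()
```

Here the dense_rank column comes out as 1, 1, 2, 3, which is what makes it usable as a precomputed, filterable page key for grouped orderings.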
#23 by yorhel
2019-10-15 at 10:34
< report >Meh. I was writing a long response with all sorts of benchmarks, but then my laptop ran out of memory and Firefox crashed. I guess that's what I get for using a cheap and old laptop. Anyway, so no benchmarks, but a summary.

Fortunately, I don't need the grouping stuff, all I really need is fast count(*) and limit/offset. I'm not really sure what you intend to do with the CTEs - I can't really imagine that materializing the entire result set in memory or on disk and then querying over it is going to be very performant with 500k+ rows. So that's when I started benchmarking various approaches for the "recent changes" listing and drew the following conclusions:

For unfiltered (i.e. no WHERE clause) listings:
- SELECT count(*) FROM changes; is actually fast enough (~25ms)
- SELECT * FROM changes ORDER BY id DESC LIMIT 100 OFFSET $large; is also actually fast enough (~150ms, somewhat to my surprise).
- Attempting to combine the above queries by adding a 'count(*) over () AS number_of_rows' is slower than running the two queries separately.
- Materializing the result set in a CTE and querying over that is slower than the above.

Adding a WHERE clause slows all of those queries down, but not all filters are equally bad. The "Hide automated edits" filter, which is literally 'requester <> 1', is fast enough, but the "Only deleted" filter, which does some subquerying, slows some of those queries down to >1s. That's definitely not fast enough for casual browsing (nor for the server load, really, but those listings aren't all *that* popular anyway). I've managed to get away with those slow WHERE clauses because apparently they're fast enough when you only fetch the first few pages, but attempting to fetch the entire result set is where things get really slow (not very surprising, actually).

So, my conclusion: Full and proper pagination is actually quite doable, but it needs faster WHERE clauses, which in turn will involve more caching and more indices.
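The "combine both queries" variant mentioned in the benchmarks above is a window-function trick: every returned row carries the full result-set count as an extra column, since window functions are evaluated before LIMIT/OFFSET applies. A sketch with Python's stdlib sqlite3 (invented `changes` table; the benchmark result that this is slower than two separate queries is specific to the thread's PostgreSQL setup):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE changes (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO changes VALUES (?)",
                [(i,) for i in range(1, 301)])

# count(*) OVER () is computed over the whole result set before the
# LIMIT/OFFSET is applied, so each of the 5 returned rows reports the
# full total of 300 - one round-trip instead of two.
rows = con.execute("""
    SELECT id, count(*) OVER () AS number_of_rows
    FROM changes
    ORDER BY id DESC
    LIMIT 5 OFFSET 10
""").fetchall()
```

Sorted descending, offset 10 starts at id 290, and every row's second column is the total count 300.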
#24 by skorpiondeath
2019-10-15 at 14:28
< report >Ok I'll start by saying that probably you already know all of this, I'm just sharing my point of view.

I found information schema but I cannot find primary keys and indexes. The only reason i can come up with a single count(*) over 900k records taking > 1s is a missing clustered index, but a single count(*) should take much less. I just linked (link) a query I did over information_schema is that correct?
Obviously if you miss a primary key then data are saved in an unordered structure called a heap which is slower.

From my understanding CTE should be equivalent to a normal query in term of performance probably they are slower becuase my solution called CTE multiple times thus making the overall performance worse.
As a note a collegue of mine state that CTE are a bit slower (I don't share is opinion but anyway :P) so if you can do it without it is better, but they are not saved in ram or disk, I think only the execution plan is saved so it's just a way to reference query among themselves. If spooling on disk happens (and you need a tool with execution plan to see) then usually the problem in in the query itself.

Adding a WHERE clause slows all of those queries down
That depends on whether the filter is over indexed fields or not.
Last modified on 2019-10-15 at 14:53
#25 by yorhel
2019-10-15 at 15:03
< report >I'm not too familiar with the information_schema, but the schema available in that query DB is generated with the import.sql script included in the DB dumps. It doesn't add any indices apart from a few primary keys, so it's not a super great way to measure query performance. The actual database, defined in /util/sql/ has a few more tables, indices and caches.

CTEs are, in the version of PostgreSQL that I'm using, indeed often a bit slower than nested queries. This is because Postgres will first run the CTE query in its entirety, save the results to RAM or disk, and then run the other query. This has been improved in PostgreSQL 12, but only if you refer to the CTE at most once...

You can actually inspect the query plan in SQLPad, just put an `EXPLAIN ANALYZE` before the query. Unfortunately, SQLPad doesn't render the query plan very nicely. :(