Opening Up

#1 by yorhel
2019-07-22 at 08:21
There have been many requests for VNDB data over the years, and that's why we have a few limited database dumps and an API. Despite being limited in scope, those exports have already been used for many awesome projects, including VNStat, VNDB Android and various data-oriented research projects. Still, there is value in providing more. After mulling it over for a while, I've decided to provide dumps of the complete database. Or almost complete, at least: forum posts, change histories and non-public user information are excluded.

Info and download.

Most of the data (but see licensing for details) is available under the Open Database License, which allows for free usage and sharing of the data, provided that any projects using this data will also share *their* data and keep it open.

Hopefully these dumps also provide an answer to the sustainability of VNDB: If something happens to me or if I end up doing something spectacularly stupid to the site, this will give the community the opportunity to fork the site and start anew - with most of the data still intact. Of course, I'll try to make sure that will never be necessary.

As a reminder: *PLEASE* do not crawl this site or scrape the HTML in other ways. If these dumps still don't provide what you need, let me know about your use case and I'm sure we can work something out.


Finally, I also had the idea (and in fact, an implementation as well) to provide weekly updated tarballs of all images referenced from the database dumps. Unfortunately, I'm convinced that my poor server isn't going to be happy with serving a 20G file to some hundreds of people, so we'll need to work out a better distribution channel. If you're an experienced sysadmin with a fair bit of bandwidth to spare, please get in touch. If that doesn't work out, we'll probably have to play around with Bittorrent. But scripting torrents is kind of annoying...

Update 2019-08-04: Images are available now, see #9 and d14#6.

Last modified on 2019-08-04 at 07:44
#2 by kurothing
2019-07-22 at 11:01
I would love to see a more... common file type used for the database dump...

As it stands, as primarily a Windows user, there isn't really a nice way to open ".zst" files.

Sure, there is the command-line app released by Facebook (no thanks) and the 7zip fork that requires killing explorer.exe / restarting Windows to completely uninstall... But wouldn't a gzipped tar like the other files be fine?

That's my only request in regards to the file. While the actual format is a bit weird (a tab-separated, newline-delimited file... seems a bit out there; maybe for a few hundred KB extra a series of JSON arrays would have been easier?), I'm honestly happy to see that in the unfortunate event that something goes horribly wrong, the time and effort the community has put into VNDB would not be wasted.

And reading that back, it sounds really mean! I'd be perfectly happy for VNDB, and you yourself, Yorhel, to continue to operate as-is! Especially with all the free time you have dedicated to this project, I couldn't be more grateful.
#3 by yorhel
2019-07-22 at 11:11
7zip fork
Wait, main 7zip doesn't support .zst yet? That'd be pretty damning, and I understand your frustration then. That said, gzip turned out to be a bottleneck in creating the dump; zstd does it twice as fast *and* saves an additional 8MB. Here's hoping it'll get more mainstream soon.

A tab-separated, newline-delimited file
It's the PostgreSQL COPY format; it provides easy exporting and importing (more so than JSON) and isn't terribly uncommon (in fact, I pretty much copied the format of the MusicBrainz dumps).
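For those consuming the dumps outside of Postgres: the COPY text format is just tab-separated lines with \N for NULL. A minimal reading sketch in Python (the file layout and function name here are assumptions for illustration, and the naive split ignores COPY's backslash escapes for embedded tabs/newlines):

```python
# Hypothetical layout: one data file per table plus a separate header
# file listing the tab-separated column names, as the dumps ship them.
def read_dump_table(data_path, header_path):
    with open(header_path, encoding="utf-8") as f:
        columns = [c.strip() for c in f.read().split("\t")]
    rows = []
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            values = line.rstrip("\n").split("\t")
            # COPY's text format writes NULL as \N
            values = [None if v == r"\N" else v for v in values]
            rows.append(dict(zip(columns, values)))
    return rows
```

For real imports, the import.sql script mentioned above is the supported route; this is only for quick one-off processing.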
#4 by hi117
2019-07-22 at 11:17
I have a little bit of free time that I could spend on something like this. As far as experience goes, you can check my other posts on Reddit (u/hi117). I currently run most of the public-facing web infrastructure at Malwarebytes.

I have a few ideas that could make this pretty easy; let me know if you're interested.
#5 by yorhel
2019-07-22 at 12:17
@hi117: Mail me at contact@vndb.org; I'm interested in hearing those ideas!
#6 by roadi
2019-07-22 at 16:36
Thank you, this is most welcome.

Now, if you'll excuse me, I'll go do some perversely convoluted queries just for the sake of it. :P
#7 by rampaa
2019-07-22 at 19:37
Alternatively, a script is provided to load the data into a PostgreSQL database for easy querying. See import.sql for options and usage information
I couldn't find "import.sql" in "vndb-db-latest.tar.zst".
#8 by yorhel
2019-07-22 at 19:44
I couldn't find "import.sql" in "vndb-db-latest.tar.zst".
Crap, my bad. That file got lost when I changed some tar args. Here's the script: link
It'll be in the next dump, too.
#9 by yorhel
2019-08-04 at 06:09
The images are up: d14#6

Decided to go with an rsync server. If the number of downloads of the image database is anywhere near that of the main database dumps, the server ought to be able to handle the load just fine. It's currently limited to 500 KiB/s and 10 concurrent connections, so downloading an initial copy may take a full day, and the server may not always have free connections during the first few weeks. I expect to relax these limits as the initial surge of downloads dies out.
#10 by yorhel
2021-02-25 at 16:38
I don't intend to announce every breaking change to the public database schema, as I never promised it to be a stable format in the first place, but here's a heads-up on an upcoming change that will likely affect everyone using the dumps.

I'm working on converting all database entry identifiers that are currently stored as an 'integer' type to the special 'vndbid' type. This type has already been in use for a while for image identifiers and the import.sql script has instructions for using it. What this changes in practice is that some queries will break, for example:

-- This will throw an error:
SELECT * FROM vn WHERE id = 17;
-- Use this instead:
SELECT * FROM vn WHERE id = 'v17';

-- This will result in a bad URL:
SELECT 'https://vndb.org/v'||id FROM vn;
-- Remove the 'v' to make it work:
SELECT 'https://vndb.org/'||id FROM vn;

Queries that don't assume that identifiers are integers will work as before. Joins and subqueries aren't affected; they'll join just fine on the new type. Additionally, if you import the database dump without also importing the 'vndbid' extension, identifiers will be treated as strings, and that will affect sorting with ORDER BY. The query database has the extension installed, of course.
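To illustrate that sorting caveat: as plain strings, these ids sort lexically rather than numerically. A quick client-side workaround in Python (the helper name is made up for this sketch):

```python
import re

# Hypothetical helper: sort vndbid-style strings ("v17", "v100")
# numerically by splitting the letter prefix from the number.
def vndbid_key(vid):
    m = re.fullmatch(r"([a-z]+)([0-9]+)", vid)
    return (m.group(1), int(m.group(2)))

ids = ["v100", "v17", "v2"]
print(sorted(ids))                  # plain string sort: ['v100', 'v17', 'v2']
print(sorted(ids, key=vndbid_key))  # numeric sort: ['v2', 'v17', 'v100']
```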

These changes won't affect the site, API or dumps other than the "near-complete" one in any way. Unfortunately, this will still break a lot of saved queries. :(

Rollout plan: Identifiers for VNs, releases, characters, producers, staff and users will be changed in one go in the next few days/weeks or so (sorry, I suck at time estimates). Tag & trait identifiers will take a bit longer, I'll probably convert those when adding edit history support.

(Oh and I also changed the column order for many tables yesterday, but you shouldn't be relying on that in the first place)
#11 by some-guy
2021-03-01 at 12:00
Thanks for the heads-up - I do have a question regarding your remark, though:
"(Oh and I also changed the column order for many tables yesterday, but you shouldn't be relying on that in the first place)"

What would be the suggested way in that case? I have something set up that gets the VNDB dump once per week and updates user vote content based on that. Due to the changes, I was now getting a date instead of the vote, which obviously wasn't too good for my database :D.
As the format of the dump doesn't seem to be anything that can directly create a MySQL database (at least I'm not aware how), I'm not quite sure what a more stable way to do that would be. I'm thinking about writing a parser that goes through the header to create a table, and then filling the table by iterating through the non-header file and cross-checking the column names with the header. Though in that case I still don't know what format the content has, so I can basically only assign "text" or other large placeholder types. Could you suggest a better approach, or am I missing some easy way to fill my own database with that dump?
#12 by yorhel
2021-03-01 at 12:51
Regarding the vndbid conversion: I applied the changes to the database this morning. Tomorrow's database dumps will be in the new format.

I'm thinking about writing a parser that goes through the header [..]
If you're not using the import.sql script, that's the way to go. It's why I created those header files in the first place. :)
(I wanted to put the header in the same file, as is fairly common with CSV, but Postgres doesn't support that)

I don't know if MySQL has an option somewhere to read PostgreSQL's COPY format files; if it does, then all you need is a conversion script for the CREATE TABLE commands in import.sql. They might already be largely compatible. All the column, order and type information is in there.
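As a rough illustration of that conversion step, here's a toy sketch in Python that rewrites a few PostgreSQL column types to MySQL-ish equivalents. The type map is illustrative and far from complete, and storing 'vndbid' as a short string on the MySQL side is an assumption, not part of import.sql:

```python
import re

# Toy mapping of some PostgreSQL types to rough MySQL equivalents.
TYPE_MAP = {
    "vndbid": "VARCHAR(16)",   # assumption: keep vndbid ids as short strings
    "boolean": "TINYINT(1)",
    "text": "TEXT",
    "smallint": "SMALLINT",
    "integer": "INT",
}

def pg_to_mysql(create_sql):
    # Replace whole-word type names inside a CREATE TABLE statement.
    pattern = re.compile(r"\b(" + "|".join(TYPE_MAP) + r")\b")
    return pattern.sub(lambda m: TYPE_MAP[m.group(1)], create_sql)

print(pg_to_mysql("CREATE TABLE vn (id vndbid, olang text)"))
```

A real conversion would also have to handle defaults, arrays and custom enum types, so treat this only as a starting point.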

A more heavyweight approach is to import into PostgreSQL and then use tooling to convert that to MySQL. pg_dump has an option to create 'INSERT INTO' statements that are largely compatible with other databases, but I don't know how well that holds up in practice.
#13 by some-guy
2021-03-01 at 13:22
Wow, I didn't even notice this file when extracting and only saw the db folder :S. That definitely looks like a more stable way, though as you mentioned, converting this to a MySQL format might give me an additional challenge. I guess for now I'll just go through the header files first and save the index of the columns I'm looking for, instead of hardcoding them, as a quick fix.

Tomorrow's database dumps will be in the new format.

Good to know, thanks!
#14 by some-guy
2021-03-01 at 14:24
I have one additional question:
It seems I am not able to fetch votes anymore using the default API - I have a Python method that does the following:
def getUserVotes(socket, userid, vnid):
    # get ulist = command; basic = flag (enough for the vote), \x04 is end byte
    request = "get ulist basic (uid = " + str(userid) + " and vn = " + str(vnid) + ")\x04"
    logging.debug("request: " + request)
    socket.send(request.encode("utf-8"))
    logging.debug("Sending get votes")
    data = socket.recv(BUFFER_SIZE)
    # read until we have the full message
    while not data.endswith(b'\x04'):
        newData = socket.recv(BUFFER_SIZE)
        data += newData
    decodedData = data.decode("utf-8")
    # get rid of VNDB formatting: substring of outermost json object
    jsonPart = decodedData[decodedData.index("{"):decodedData.rfind("}")+1]
    parsedJson = json.loads(jsonPart)
    logging.debug(parsedJson)
    # Answer will have an array "items" of which we get entry 0, then fetch the vote
    if parsedJson.get('fullwait', 'empty') != 'empty':
        logging.debug("Reached limit - waiting...")
        time.sleep(int(parsedJson.get('fullwait')) + 1)
        return getUserVotes(socket, userid, vnid)
    items = parsedJson['items']
    vote = -1
    if len(items) > 0:
        vote = items[0]['vote']
    if vote is not None and vote > 0:
        logging.debug("Got vote: " + str(vote))
    return vote

Using this like before, I get:
request: get ulist basic (uid = 143413 and vn = 21668)
Sending get votes
{'msg': 'SQL error', 'id': 'internal'}

Changing the input to vn = vXXXX, I get:
{'msg': 'Invalid JSON value in filter expression: malformed JSON string, neither tag, array, object, number, string or atom', 'id': 'parse'}

And finally putting the vXXXX in quotes, I get:
{'id': 'filter', 'field': 'vn', 'op': '=', 'value': 'v21668', 'msg': 'Wrong field/operator/expression type combination'}

Is this a bug with the changes, or did these requests change after all and the documentation is not up-to-date?

Last modified on 2021-03-01 at 14:25
#15 by yorhel
2021-03-01 at 15:44
"internal" errors are always a bug, and yup, looks like I forgot to update the "vn" filter for list commands with the recent database changes. Should be fixed now.
(Not the first bug I introduced today... unsurprisingly, considering the scope of the change >.>)
#16 by some-guy
2021-03-01 at 15:52
Awesome, looks good! I will continue testing in a few hours and let you know in case I stumble upon other issues with other requests. Thanks!
#17 by some-guy
2021-03-01 at 19:09
Looks like all requests work for me now, just FYI :).
#18 by yorhel
2021-03-01 at 20:28
Thanks, much appreciated. I don't have automated testing for the API so it's an area where bugs can creep in pretty easily.
#19 by wilk220
2021-03-04 at 03:55
Might be related: get staff works with id = 1 but fails with id = [1], and with arrays of ids in general:
{"id": "filter", "field": "id", "op": "=", "value": ["1"], "msg": "Integer out of range"}
#20 by yorhel
2021-03-04 at 06:35
Fixed.
#21 by wilk220
2021-03-04 at 22:23
It's still failing, now with {"msg": "SQL error", "id": "internal"}, when an array with more than one ID is passed.
#22 by yorhel
2021-03-05 at 06:17
Fixed. For real now...
#23 by some-guy
2021-03-05 at 06:49
Looks good to me now,
get staff basic (id = [1,2])
gives two results.

Bit of an elephant in the room, but also a warning for others using the db dump: keep in mind that not only the "id" field has this change; references to other ids in other tables do too. I forgot about those :S. I am now using something like this, which is still relatively failsafe if you accidentally pass non-identifiers to it:
def getId(entry):
    match = re.search('^[a-z][0-9]+$', entry)
    if match != None:
        return entry[1:]
    else:
        return entry
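For reference, the helper behaves like this on the two kinds of input (redefined here so the snippet runs on its own):

```python
import re

# some-guy's helper from above, repeated to make the example self-contained.
def getId(entry):
    match = re.search('^[a-z][0-9]+$', entry)
    if match != None:
        return entry[1:]
    else:
        return entry

print(getId('v21668'))  # prefixed id: the 'v' is stripped -> '21668'
print(getId('21668'))   # no prefix matched: passed through unchanged -> '21668'
```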

yorhel, will these identifiers eventually also find their way into the API or is that not planned? Maybe having the updated format when requesting at a different port or something? I find it a bit odd that identifiers are now in different formats depending on where you request them. Might be reasonable for developers to start changing their own databases working with vndb to this format as well.

Last modified on 2021-03-05 at 07:14
#24 by yorhel
2021-03-05 at 06:55
yorhel, will these identifiers eventually also find their way into the API or is that not planned?
I do intend to eventually convert every use of identifiers to the new format, but that may take a while. I'm not sure yet if I'll add such an option to the existing API, or instead provide a new HTTP-based API and deprecate the old one.
("deprecation" meaning "will continue to work for another decade", seeing how projects tend to be slow to update...)
#25 by micah686
2021-03-05 at 07:11
What do you plan to have different about the new HTTP-based API?