Openclipart SVG Restoration Update

The Openclipart SVG collection finished processing last night, after six straight days of running. To be clear about what I mean by processing: I already had all the files, but I ran them through checks to make sure they were valid SVG files with no errors, and then I minified them for use on the website. That means the SVG files you can download here are not the original files; they have been minified to save space. All the original files, including whatever metadata they contain, are available on Google Drive. Note: This data has been removed to save on space and $$$.

Here are the stats on the files for those who might want to know.

Good Files

Original Files: 157,692

Size: 82.5GB

49,856 had metadata of some kind: Title, Description, Author, Date, or Tags. Some had all of these attributes; some had only one.
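As a rough illustration of what reading that embedded metadata can look like, here is a minimal sketch. It assumes the metadata sits in the Inkscape-style RDF/Dublin Core block (`cc:Work` inside `<metadata>`), which is common in SVG files but is not confirmed by anything above; the `extract_metadata` helper and the choice of namespaces are hypothetical, and files using an older Creative Commons namespace would need an extra entry.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly seen in SVG metadata blocks (an assumed layout).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://creativecommons.org/ns#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def _text(elem):
    """Return the stripped text of an element and its children, or None if empty."""
    if elem is None:
        return None
    text = "".join(elem.itertext()).strip()
    return text or None


def extract_metadata(svg_source):
    """Hypothetical helper: collect whichever metadata attributes are present."""
    root = ET.fromstring(svg_source)
    work = root.find(".//cc:Work", NS)
    if work is None:
        return {}
    found = {
        "title": _text(work.find("dc:title", NS)),
        "description": _text(work.find("dc:description", NS)),
        "author": _text(work.find("dc:creator", NS)),
        "date": _text(work.find("dc:date", NS)),
    }
    # Tags are assumed to be stored as an rdf:Bag of rdf:li items under dc:subject.
    tags = [
        li.text.strip()
        for li in work.findall("dc:subject/rdf:Bag/rdf:li", NS)
        if li.text and li.text.strip()
    ]
    if tags:
        found["tags"] = tags
    # Keep only the attributes that actually had a value.
    return {key: value for key, value in found.items() if value}
```

A file with no `cc:Work` block returns an empty dictionary, which would put it in the "no metadata" group.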

107,836 had no metadata in the SVG file. However, it is probably possible to create a title from the filename for 87,230 of those files.

That is a total of 137,086 files (49,856 with embedded metadata plus 87,230 with usable filenames) for which it is probably possible to recover fairly accurate titles.
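Creating a title from a filename could be sketched as follows. The filename pattern used here (an optional numeric ID, then words separated by hyphens or underscores, such as `12345-red-apple.svg`) is an assumption for illustration, and `title_from_filename` is a hypothetical helper, not the code used for the site.

```python
import re
from pathlib import Path


def title_from_filename(filename):
    """Hypothetical helper: guess (numeric_id, title) from a name like '12345-red-apple.svg'."""
    stem = Path(filename).stem
    # Split off a leading run of digits, if there is one.
    match = re.match(r"^(\d+)[-_ ]+(.*)$", stem)
    if match:
        clip_id, rest = int(match.group(1)), match.group(2)
    else:
        clip_id, rest = None, stem
    words = [word for word in re.split(r"[-_\s]+", rest) if word]
    # Names with no letters at all are treated as not descriptive enough.
    if not any(char.isalpha() for word in words for char in word):
        return clip_id, None
    return clip_id, " ".join(words).title()
```

With this sketch, a purely numeric name yields no title, which is one way a file could fall outside the group with recoverable titles.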

Bad Files

Bad Files: 718

Size: 6.03GB

These files failed for any number of reasons. Some failed XML checks, some files were just bad. They might still be recoverable, but I will not look at them again for a while.
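For readers wondering what an "XML check" might involve, here is a minimal sketch using only the Python standard library. It is not the check that was actually run on the collection; it only tests that a file parses as XML and has an `<svg>` root element, and `check_svg` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def check_svg(svg_bytes):
    """Hypothetical check: return (ok, reason) for the raw bytes of one file."""
    try:
        root = ET.fromstring(svg_bytes)
    except ET.ParseError as exc:
        # Malformed XML: unclosed tags, bad entities, truncated files, and so on.
        return False, f"XML parse error: {exc}"
    # Accept an <svg> root with or without the SVG namespace declared.
    if root.tag not in (f"{{{SVG_NS}}}svg", "svg"):
        return False, f"root element is {root.tag!r}, not <svg>"
    return True, "ok"
```

A file that passes this check is only well-formed XML with the right root; it can still contain invalid SVG content, so a check like this would catch some bad files but not all of them.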

Website SVG Files

Files: 157,692

Size: 80.1GB

Minifying the SVG files lets the website save some space on storing and hosting them. Minification was done using SVG Sanitizer, a fantastic project BTW. Running the files through SVG Sanitizer is also how many of the bad files were identified and moved to a lower priority.

While 2.4GB of space savings (about 3% of the original 82.5GB) might not seem like a great deal, every little bit helps.

Now that all the files are done processing, I will continue adding them to the site so they can be searched using the Search API. I have also been testing a POST API for adding new files to the site, but it is still in the early stages and not ready for the live site yet.

0 thoughts on “Openclipart SVG Restoration Update”

  • Will you be releasing the source code and DB, or is this just going to be another example of a site controlled by a single person? Also, I see freesvg.org have already released all of the images they have retrieved from OCAL; will you be doing the same? I see you mentioned clipartzero, which appears to share the same values as yourself. Is there a reason why you weren’t able to collaborate with the developer of that site?

    • The source code to what? The website? Or to the SVG data? I’ll answer both.

      This site uses ClassicPress, which is a fork of WordPress and is released under the GPL. The clipart posts themselves are a custom post type that uses custom fields to hold the data for each clipart SVG. A Google search on how to create a custom post type and custom fields will show you how I did it, and there are also numerous plugins that will do that as well. I don’t really feel the need to release that since it is pretty easy to Google and find.

      As for the SVG files, they are being released on Google Drive, similar to what FreeSVG is doing. These are the original files I have and not the minified ones. I am releasing the originals because they (or at least 49,856 of them) contain metadata in the SVG XML itself describing what the SVG file is, who created it, when it was created, and its tags. Of the files that did not contain metadata, 87,230 have fairly descriptive filenames that describe what the file is about, and some include the creator as well.

      By releasing the original files, the data about those files is being released, as much as there is. The files are the database, and anyone with the files from Google Drive can read the filenames and/or the XML metadata and recreate the data about the files. That is a long answer to “am I releasing the DB?” Yes, because the DB comes from the files. There is one additional thing I added to the original files: I included the original Openclipart IDs at the beginning of the filenames, which no one else seems to have done or thought about.

      I like the Clipart Zero project, but I felt I could not ask someone else to take on some specific needs for me. Specifically, I needed an API that had the Openclipart IDs and, as much as possible, the creators of the clipart and the original tags.

      I think in this case it will not hurt the clipart community to have multiple versions of the data available. One thing we have all learned is that we cannot trust one site to have and hold all the data so actions need to speak louder than words. My actions are to make this data all available and let the community have it, all the previous files and any new files that I collect. The side benefit for me is I get to use those files for my own projects.

  • By source code I mean the whole code structure of the site, including your changes, accessible to all so that people can further develop or fork it. And no, the database I am referring to is the one you are using to index the files in ClassicPress, the one which is also used for tagging, comments, and any additional site content or functionality (MySQL, MariaDB, PostgreSQL). Does that information make it back into the original files, or is it held in your database, only available to those using your site? It seems you are taking what you need from the original files and then they are abandoned. Will you be adding the metadata you are gathering, like additional tags for instance, and placing it back into the files so it is useful to future developers and generations, or is this information strictly yours to keep?

    I know nothing of Clipart Zero so cannot comment, and I do not know much about the Openclipart IDs either. I do know I won’t be searching by the IDs though, because if I had that information then I would probably also have the files someplace.

    Your site seems like a good project, but I would have preferred to see an open source project where anyone can contribute. From what I can gather, Openclipart died at the hand of one individual, so I had hoped lessons were learned from this.

    It’s good to see you made a copy of the original files though, even if they are of no further use to you after you have taken what you need from them. I only hope they remain available to everyone.

    Thanks a lot for sharing.

    • OK, let’s deal with this in sections, because I think there is confusion between the ClassicPress database and the SVG data.

      By source code I mean the whole code structure of the site, including your changes, accessible to all so that people can further develop or fork it. And no, the database I am referring to is the one you are using to index the files in ClassicPress, the one which is also used for tagging, comments, and any additional site content or functionality (MySQL, MariaDB, PostgreSQL).

      No, I will not release the ClassicPress database and code in their entirety, for several reasons. First, the code that powers the site is already GPL and open source, so it is already released. The additions I made are all easily found on Google: search for “creating a custom post type” and “create custom REST route” and you will find numerous examples. Second, the entire ClassicPress database would contain potentially personal information, including usernames and IP addresses. Third, in the format it is in, it would be essentially useless to most people: it would be a ClassicPress (WordPress) SQL file, useless unless it was being restored to another ClassicPress (WordPress) website.

      Does that information make it back into the original files, or is it held in your database, only available to those using your site? It seems you are taking what you need from the original files and then they are abandoned. Will you be adding the metadata you are gathering, like additional tags for instance, and placing it back into the files so it is useful to future developers and generations, or is this information strictly yours to keep?

      Does any new data make it back to the original files? No, that would be too resource intensive at this point. However, creating a CSV file of the clipart data would be possible. It could contain any of the relevant information on the clipart files, and once the site allows editing of the data, that new data would then be available via the API and the CSV file. This could be a ways off, though, considering it takes time to import the clipart. I make no claims on the data about the clipart. It was part of the files and, as far as I am concerned, released under Creative Commons along with the files. All I did was read the files, so I have no problem releasing it.
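A CSV export along those lines might look something like the sketch below. The column names, the `|` separator for tags, and the `write_clipart_csv` helper are all assumptions made for illustration; no such file or format exists yet.

```python
import csv
import io

# Assumed column layout; the real export could use different fields.
FIELDS = ["openclipart_id", "filename", "title", "author", "date", "tags"]


def write_clipart_csv(records, out):
    """Hypothetical export: write one row per clipart record to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        # Missing attributes become empty cells rather than errors.
        row = {field: record.get(field, "") for field in FIELDS}
        # Keep the tag list in a single cell, separated by "|" (an arbitrary choice).
        row["tags"] = "|".join(record.get("tags", []))
        writer.writerow(row)


# Usage with one made-up record, written to an in-memory buffer.
buffer = io.StringIO()
write_clipart_csv(
    [{"openclipart_id": 12345, "filename": "12345-red-apple.svg",
      "title": "Red Apple", "tags": ["fruit", "apple"]}],
    buffer,
)
print(buffer.getvalue())
```

Writing rows from plain dictionaries keeps the export independent of the website database, so it could be regenerated whenever the clipart data changes.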

      Your site seems like a good project, but I would have preferred to see an open source project where anyone can contribute. From what I can gather, Openclipart died at the hand of one individual, so I had hoped lessons were learned from this.

      Rome was not built in a day. Things take time to build and create. The first step is to get the files back out there and available to all.
