We had a news site with tens of thousands of articles. They date back nearly two decades and, to the readers of the site, are an invaluable research archive.
We were having performance problems across the site. The Umbraco cache file was huge, and each publish would rewrite the cache to disk. In theory the Umbraco cache file is flushed to disk asynchronously from time to time, but you can’t control the interval, and if you have a long-running publish and slow storage (e.g. Azure Web Apps) then the whole cache file gets written on every publish.
We also suspected some horrible LINQ queries were in play, the kind that load thousands of content items into memory and traverse all of them. (See https://our.umbraco.org/documentation/reference/Common-Pitfalls/)
Then we thought: do we need all of this content in the memory cache? Most articles from the 90s are accessed via search once every couple of months, so there is no need to hold them in cache. All we really want in the cache are the few dozen nodes that make up the page furniture of the site.
By removing the news articles from the site, we immediately saw a big performance gain, and our Umbraco cache file shrank to less than 3MB.
Next we put all of the news articles into Azure Search using our Umbraco plugin. (See https://github.com/darrenferguson/UmbracoAzureSearch)
We then wrote an event handler that runs when a document is unpublished. It checks whether the document type has an Archive property and, if it does, writes the content XML to a database table.
The gist here shows how you can turn Umbraco content into XML: https://gist.github.com/darrenferguson/ec5e4fe681e6403ab35e63abcfb1aa02
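If you just want the shape of the idea without reading the C#, here is a rough, language-neutral sketch in Python. It is not the code from the gist and it uses no Umbraco API: a plain dict stands in for a document, SQLite stands in for the archive table, and the `archive` property alias, the table layout and every function name are assumptions made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

ARCHIVE_FLAG = "archive"  # hypothetical property alias marking archivable content


def open_archive(path=":memory:"):
    """Create (if needed) a table keyed by the URL the content was published at."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive (url TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    return conn


def to_xml(doc):
    """Serialise a content item (a plain dict here) to an XML string."""
    node = ET.Element(doc["type"], {"id": str(doc["id"]), "nodeName": doc["name"]})
    for alias, value in doc["properties"].items():
        ET.SubElement(node, alias).text = str(value)
    return ET.tostring(node, encoding="unicode")


def on_unpublished(conn, doc):
    """Unpublish handler: archive only documents that carry the archive property."""
    if ARCHIVE_FLAG not in doc["properties"]:
        return False  # ordinary unpublish, nothing to store
    conn.execute(
        "INSERT OR REPLACE INTO archive (url, xml) VALUES (?, ?)",
        (doc["url"], to_xml(doc)),
    )
    conn.commit()
    return True
```

Keying the table on the URL is a choice made for this sketch: it means the later lookup by URL is a single primary-key read.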
Once the content is in the Archive table, it is no longer available at its URL, because we unpublish it to remove it from the content cache. It is still available via the Examine internal index or, in our case, the Azure Search index.
To make the archived item available at its original URL when it isn’t in the Umbraco cache, we use an IContentFinder that searches the archive database table by URL and returns the stored XML (it could also be looked up from Azure Search or the Examine internal index).
The idea here is that it’s OK for archived content to take slightly longer to load via a database request, as it is accessed so infrequently, and the benefit of a smaller cache file for the frequently accessed recent content is worth the trade-off.
The gist for the IContentFinder is here: https://gist.github.com/darrenferguson/f613a42f86155bfefe832f2749b6d59a
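Again as a hedged sketch rather than the gist itself, the lookup order can be pictured like this in Python. A dict stands in for the Umbraco content cache and SQLite for the archive table; `normalise`, `find_in_archive` and `resolve` are hypothetical names, and the URL normalisation rule is an assumption (your stored URLs would need to follow the same rule).

```python
import sqlite3
import xml.etree.ElementTree as ET


def normalise(url):
    """Assumed rule: drop the query string and trailing slash, lower-case the rest."""
    return url.split("?")[0].rstrip("/").lower() or "/"


def find_in_archive(conn, url):
    """Fallback finder: one primary-key read against the archive table."""
    row = conn.execute("SELECT xml FROM archive WHERE url = ?", (url,)).fetchone()
    return ET.fromstring(row[0]) if row else None


def resolve(url, cache, conn):
    """Try the in-memory cache first, then the archive; None means a 404."""
    url = normalise(url)
    finders = (cache.get, lambda u: find_in_archive(conn, u))
    for finder in finders:
        content = finder(url)
        # Compare with None explicitly: an XML element without children is falsy.
        if content is not None:
            return content
    return None
```

Only requests that miss the cache ever reach the database, so recent content pays nothing for the archive's existence.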
The last piece of the jigsaw was a little bit tricky and quite hacky, because some of the internals of Umbraco weren’t open for extension. The Umbraco experts amongst you will have noticed that we return an implementation of IPublishedContent called MyPublishedContent. This allows you to work with the archived content in a View/Template as you would with any other piece of Umbraco content.
The gist for MyPublishedContent is here: https://gist.github.com/darrenferguson/0566dfba66a4a85ace413dbb553fbf04
MyPublishedContent is almost a complete copy of XmlPublishedContent in the Umbraco core. We couldn’t extend that class because it is internal.
We modified the implementation of the Parent node accessor, and we can’t (easily) support children of archived content, but we figured that most types of content that would get archived live at the bottom of the content tree (e.g. news articles!).
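To make that trade-off concrete, here is one way such a wrapper could behave, sketched in Python rather than taken from the gist. It assumes the stored XML carries a `parentID` attribute and that the parent is still published, so it can be fetched by id; `ArchivedContent` and `get_published_by_id` are hypothetical names.

```python
import xml.etree.ElementTree as ET


class ArchivedContent:
    """Read-only view over archived XML that behaves like a leaf node."""

    def __init__(self, xml_text, get_published_by_id):
        self._node = ET.fromstring(xml_text)
        self._get_published_by_id = get_published_by_id  # e.g. a cache lookup
        self._parent = None
        self._parent_loaded = False

    @property
    def name(self):
        return self._node.get("nodeName")

    def value(self, alias):
        """Property value by alias, or None if the property is absent."""
        child = self._node.find(alias)
        return None if child is None else child.text

    @property
    def parent(self):
        # Resolved lazily, and only once, from the still-published tree.
        if not self._parent_loaded:
            parent_id = self._node.get("parentID")
            if parent_id:
                self._parent = self._get_published_by_id(int(parent_id))
            self._parent_loaded = True
        return self._parent

    @property
    def children(self):
        # Archived items are treated as leaves, so there is nothing to walk.
        return ()
```

A template can then read `item.value("headline")` or walk up to `item.parent` for breadcrumbs, while anything that iterates children simply sees an empty sequence.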
What is the point of all of this?
An Umbraco site with lots of infrequently accessed content can now start up really quickly, as it can build a cache using only what is essential for rendering page layouts. Infrequently accessed content is loaded from the database on demand.
What needs to be done in the core to support this?
To allow Umbraco to start up faster, the Examine indexes should be abstracted so you can use Azure Search or another provider that can index independently of the web app. Currently each instance in a set of load-balanced Umbraco instances needs to create its own Examine index.
XmlPublishedContent could be changed to not be internal - so the changes that we made can be implemented without duplicating code.
The notion of archived could be reflected in the Umbraco UI alongside the current statuses of published/unpublished, so that content could be set to archived in the back office and no longer live in the memory cache.
For example, to an editor, archived content appears unpublished. It is difficult to tell the difference between the two states, and of course Link To Document will say ‘This Item is not in the published cache’. In user testing we found it was the absence of this link that caused the most problems for editors, so we added a custom property editor.
It displays the link to where the archived page is still available on the site, along with a collapsible note to explain what on earth is going on.
We also had a featured news article picker that won’t allow an archived news article to be picked (as it is unpublished), but we figured that if an article is to be ‘featured’ it should be published and not archived anyway! It’s these kinds of issues that having the concept of ‘archived’ content in the core, baked into the UI, would solve, making the experience of serving content from an archive seamless to the editor.