Of course, in most cases, using the latest version is likely to be the best choice. But for some technical reasons (e.g. CPU/memory limitations of your current production infrastructure), you might have to migrate as soon as possible to the currently available version (which is 3.1). The purpose of this post is not to do a comparative study of 3.1 and 3.2, but you should be aware that even between two minor releases there can be significant evolutions (the Alfresco developers are working very hard). For instance, changes you make through the JMX console (e.g. changing any server configuration parameter) can be done at runtime with both 3.1 and 3.2, but those changes are persistent only in 3.2. So if you need more advanced administration capabilities (such as sub-systems, JMX, LDAP synchronization, etc.), 3.2 is clearly the best target. Please note, however, that most of the major clustering features and options are already included in version 3.0, and that there are no significant changes between 3.0 and 3.2 regarding clustering. The next major evolution (sharing Lucene indexes between cluster nodes) is planned for 3.3 (or even 4.0) only.

2 Work with the Alfresco service team at the very beginning of the study: To start our study, we involved the Alfresco service team. This is clearly the best advice I can give you for your own project. Indeed, even after a single consulting day (or two), you will not only clarify the main Alfresco cluster principles, but also (and this is more important) learn that some of the most common documentation on this topic does not necessarily reflect the limitations of clustering or HA methodology in a real-life context. I am thinking especially of the Replicating Content Store (RCS), which initially seems to be the most common way to replicate your documents between cluster nodes (assuming you need two content stores for failover), but which is actually a very limited process.
In our case, we plan to build a failover site, which must be a mirror of the live site in terms of data (content store and metadata), to avoid losing data as much as possible when switching to the failover site is required. At the beginning of the project we thought that using the Replicating Content Store could be a good option to perform the document (content store) synchronization. But the problem is that the RCS component does not ensure that the failover site will contain the exact same documents. For instance, if the failover site is unavailable (e.g. during a maintenance operation), then the RCS will not be able to re-run the transactions which were not executed during the downtime period. This is not the only limitation of the RCS, so clearly it is not well suited to our context. The service team clearly told us that the RCS is usually not used by customers for document replication purposes. Instead, a tool like rsync is usually recommended by the Alfresco editor. So we will study whether rsync can be a good fit for our datacenter infrastructure. But, with this advice at the very beginning of the project, we avoided a big design mistake. So, once again, involving the Alfresco service team is a good investment!

3 Define and validate the SLA with your functional team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective): The main technical constraints that will drive the design of your HA infrastructure are the RTO and the RPO. Basically, the RTO is the time required to re-activate the service in case of a major crash of your production system. So let's assume your Alfresco server crashed for any reason (unexpected power shutdown, corruption of the database, etc.); you usually need a DRP (Disaster Recovery Plan) to be able to restore the system as soon as possible. In our case, we will use a failover site (in another geographical location) and we will switch the users to this new system. Our functional team has specified that our RTO must not be more than 2 hours, which means we must be able to start the failover Alfresco instance and switch the load-balancer in less than 2 hours.

Once your failover site is started, the next question is how much data has been lost during the crash and the switch operation. This is what the RPO measures, and it usually depends on the frequency of your backups. In our case, our plan is to synchronize the documents content store every 30 minutes (using rsync), and the database every 60 minutes (using the Oracle archive log feature). That means we should not lose more than 60 minutes of data creation time. Choosing an RTO of 2 hours and an RPO of 1 hour is already a challenge, but I think this is something technically (and humanly) feasible. Depending on the criticality of your documents, this might of course not be enough, or not acceptable for your end-users. But you should know that the higher your expectations are, the bigger the infrastructure cost will be. For instance, being able to reduce the RTO to less than 30 minutes might require a monitoring solution coupled with advanced mechanisms to automatically start your Alfresco instance (and database) on the failover site. Such solutions already exist (like Veritas Storage Foundation HA), but their price is usually significant. We studied this kind of very advanced solution, but we finally decided (mainly for cost reasons) to start with a pragmatic and simpler HA solution. To give you an overview of what our Alfresco HA infrastructure will look like, here is a logical diagram:
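To make the 30/60-minute replication schedule concrete, here is a minimal crontab sketch of what the two synchronization jobs could look like. The paths, the `standby-site` hostname, and the archive log location are hypothetical placeholders, not taken from our actual setup:

```shell
# Hypothetical crontab entries for the failover-site synchronization.
# All paths and hostnames are illustrative only.

# Every 30 minutes: mirror the Alfresco content store to the standby site.
# -a preserves permissions and timestamps; --delete removes files that no
# longer exist on the live site, so both content stores stay identical.
*/30 * * * * rsync -a --delete /opt/alfresco/alf_data/ standby-site:/opt/alfresco/alf_data/

# Every 60 minutes: ship the Oracle archived redo logs to the standby site,
# so the standby database can be rolled forward from them.
0 * * * * rsync -a /u01/oracle/archivelog/ standby-site:/u01/oracle/archivelog/
```

With this schedule, the worst case is a crash just before the hourly database job runs, which is what bounds the RPO at roughly 60 minutes.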
The final design is still not completely validated, but basically here are the main concepts that have been selected so far:

Failover (YES): 2 distinct Alfresco instances located in 2 distinct data-centers, geographically separated. Site A is the live site exposed to end-users. Site B is used only for failover (only in case of a crash of Site A).

Clustering (YES/NO): yes, because we will simulate a 2-node cluster. But this will be more of a hot-standby system than a real cluster. Here the standby instance is mainly used for backup purposes (and of course failover). The failover site will be in read-only mode and it will not be synchronized in real time (the data are replicated only every hour).

Data replication (YES): rsync will be used for file synchronization. The archive log export/import mechanism (Oracle) will be used to replicate the database metadata. One very important constraint is that the content store replication must be performed before the DB replication (I will detail that later on).

Lucene indexes management: even with version 3.2 of Alfresco, each node of the cluster needs to have a local copy of the indexes. Site A indexes are updated in real time during each transaction. On Site B, we might need to enable the Lucene index tracking process to make sure the Lucene indexes are up to date.

Load-balancing (NO): we have decided not to address load-balancing in the first release of the project. The main difficulty is that load-balancing requires the 2 nodes of the cluster to be in sync in real time, which can be complex to achieve (especially if you consider the limitations of the RCS component). Also, managing CIFS and load-balancing is something we did not have time to study for the first release.

I hope you now have a good overview of the infrastructure we are currently trying to set up in our case. In a next post, I will continue to describe the following recommendations:

4 Estimate the number of concurrent users you plan to have.

5 Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy).

6 Study carefully the backup and restore strategy (DRP). Also define whether you want to be able to restore a single file.

7 Involve your internal infrastructure team (and the Alfresco service team) to finalize the design of your infrastructure.
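As a closing illustration of the replication ordering constraint (content store first, then database), here is a minimal local simulation. Directories under /tmp stand in for the live and standby sites, and `cp -a` stands in for rsync so the sketch runs anywhere; every path and filename is a made-up placeholder. The point is that if the database were replicated first and the run then failed, the standby metadata could reference documents that do not exist yet in the standby content store:

```shell
#!/bin/sh
# Minimal local simulation of the "content store first, then database"
# ordering. /tmp directories stand in for the two sites; cp -a stands
# in for rsync. All names are illustrative placeholders.
set -e  # abort the whole run if any step fails

LIVE=/tmp/ha_demo/live
STANDBY=/tmp/ha_demo/standby
mkdir -p "$LIVE/alf_data" "$LIVE/db" "$STANDBY"

# A document and the DB row that references it, created on the live site.
echo "contract.pdf bytes" > "$LIVE/alf_data/contract.pdf"
echo "node 42 -> alf_data/contract.pdf" > "$LIVE/db/metadata.txt"

# Step 1: replicate the content store FIRST.
cp -a "$LIVE/alf_data" "$STANDBY/"

# Step 2: replicate the database only after step 1 succeeded, so the
# standby metadata never references a document that is not there yet.
cp -a "$LIVE/db" "$STANDBY/"

echo "standby sync done"
```

Because of `set -e`, a failure in the content store step stops the script before the database step, so the standby database can never get ahead of the standby content store.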