How to design a server room without regrets
My name is Anatoli Yakhontov. I am in charge of the technical department at X NET, and for the last 10 years I have been restoring our clients’ infrastructure to working order. I work with server rooms and data centers that were built either on a minimal budget with numerous errors, or on a huge budget with all-round incompetence.
In this article, I will tell you how not to repeat other people's mistakes and I will also share my experience:
- What issues are often left out when building server rooms, and what consequences this may bring.
- Why you need to spend nine months designing your data center and three months building it.
- How to set up a reliable data center without wasting money.
- When power supply redundancy is not needed.
- Things people tend to forget when building the supervision system.
Cooling system: how to design and what room features to consider.
Power consumption and communication channels: how to calculate the redundancy level and when no redundancy is needed. How the passion for fishing led to an accident in the data center.
Waterproofing: why it is always necessary to provide a drainage system and whether the room can be isolated from water.
Physical security: an important detail that is often forgotten.
Firefighting: is it always worth extinguishing the fire, why a powder fire extinguishing system is not suitable for a server room, and how negligent design makes fire extinguishing useless.
Ventilation and climate in the server room: the harm of dry and dusty air.
Dispatching: why it is not enough to install sensors on the equipment.
I witnessed the opening of a server room at a large university. A new building had been built for the purpose. Everything seemed perfect: a lot of money was invested in the server room and a lot of equipment was purchased. At the opening ceremony, the ribbon was cut and the IT innovations were explained. The commission solemnly moved from room to room. By the time we got to the floor with the server room, we had made a number of interesting discoveries. It turned out that the cooling system consisted of two wall-mounted home class air conditioners with a heat removal capacity of 1.5 kilowatts each. The heat output of the equipment was much higher than the air conditioners could remove. Most importantly, the condensate from the indoor units was drained into a plastic bucket. This was done because the server room was in the center of the building, and it was not possible to drain the water out of the building using makeshift solutions.
Further investigation showed that the task of building the server room had not been set properly either: no technical parameters of the server room were specified. That is why the builders sized the air-conditioning system based on the area of the room and the assumption that four people would be sitting in it. The design mentioned that the system should be redundant, so two air conditioners were installed instead of one. Why was no drainage system provided? Because the terms of reference did not include one.
In small data centers, heat removal systems are often made from what is at hand. This leads to painful consequences.
All electric power supplied to the computing equipment is, one way or another, converted into heat. The air in the server room heats up, and this heat must be removed so that the equipment does not break down.
How to design a cooling system
1. Calculate the specific heat dissipation capacity
However difficult it may seem, calculate how much heat your equipment generates. I recommend doing the calculations not in kilowatts but in British thermal units per hour (BTU/h), because the majority of large vendors state heat output in BTU.
After that, pay attention to the recommendations of ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers). It is a professional community of heating, ventilation and cooling systems designers who have had experience in "refrigerators" since 1894, when the ice was still carried under blankets on carts.
2. Double the calculated heat output
I recommend a heat removal system with a double safety margin. You should not make the system work at the limit: this is the principle of fault tolerance. A double margin is reasonable for several reasons:
- Make a correction for an error in calculations.
- Enable future scaling: the amount of equipment in the server room may increase over time, and a more powerful cooling system will be required.
- Consider local overheating points. Depending on the configuration of the room and the location of equipment in it, local overheating points may occur in the server room. These are the places where the cooled air does not get, even if there is a very powerful air conditioner in the room. For example, they often appear in the back of a cabinet. Local overheating points can be calculated by thermal modeling, but this is quite an expensive method, which does not always lead to the right result. It is often cheaper to provide a greater air-conditioning capacity.
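Steps 1 and 2 can be sketched as a short calculation. This is a minimal sketch, assuming that practically all electrical power drawn by the equipment ends up as heat and using the standard conversion of about 3.412 BTU/h per watt. The equipment list and the function names are hypothetical; if the vendor datasheet already states heat output in BTU/h, use that figure instead of converting from watts.

```python
# Conversion factor: 1 watt is about 3.412 BTU per hour.
BTU_PER_HOUR_PER_WATT = 3.412


def heat_load_btu_h(power_draw_watts):
    """Sum equipment power draw (W) and convert it to heat output in BTU/h."""
    return sum(power_draw_watts) * BTU_PER_HOUR_PER_WATT


def required_cooling_btu_h(power_draw_watts, safety_factor=2.0):
    """Apply the double safety margin recommended in step 2."""
    return heat_load_btu_h(power_draw_watts) * safety_factor


# Hypothetical inventory: power draw of each device in watts.
equipment_watts = [750, 750, 500, 350, 150]

load = heat_load_btu_h(equipment_watts)
target = required_cooling_btu_h(equipment_watts)
print(f"Heat load: {load:.0f} BTU/h, cooling to provision: {target:.0f} BTU/h")
```

The safety factor is kept as a parameter so that the same calculation can be rerun when the equipment list grows.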
3. Purchase the cooling system
Once the required heat removal capacity has been calculated, you can start looking for cooling equipment. The principles of thermodynamics are universal, and the operating principle for most cooling systems is identical. Still, there are many solutions on the market:
- Classic freon cooling;
- Free cooling systems that cool equipment almost exclusively using the environment;
- Chiller cooling in which a liquid (e.g. water or ethylene glycol) serves as the medium for heat transport.
The choice of cooling system is influenced by a large number of details: from the climatic zone, where the server room is located, to the features of the building and the specific room in it.
The head office of the company where I work is in Central Kazakhstan. The weather remains cool for 8 months of the year, so we often install free cooling systems. In our conditions they cope well, and they usually pay off at a heat removal capacity of 40 kW or more. In cold regions, free cooling can be used almost anywhere, and in warmer regions it can still be used after careful calculation.
In Finland, there is a data center with cooling water coming from the river via a pipe. The water temperature in the river is almost always about 2 degrees. The speed of the river is quite high, and there are several hydroelectric power plants along the stream. They constantly provide heat load, so the river does not freeze. This cooling system costs the owners almost nothing — they pay only for the electricity for the pump that delivers the water.
Once we analyzed a case where a customer installed a container data center in an open area and connected the mains power supply and networks. All that remained was to install a cooling system on the roof. This was done, but the company forgot that a metal container is exposed to direct sunlight, which brings in much more heat than all the equipment inside. They had to erect a mini hangar around the container, which was already connected to the networks, to protect the equipment from the sun.
Locating the server room in a building
It is important not only where the building is located, but also in what part of it the server room is placed.
Let us consider a case where the server room was deep inside a large building. The engineers determined that a freon line had to be routed from the external wall to the equipment. Its length was 56 meters, taking into account all the turns along the way. At the end of the line there was a powerful air conditioner and an additional receiver. At first, everything worked well, but after three seasons the compressor failed. It looked like the fault of the equipment manufacturer, but the real problem was in the design and construction. Nobody had calculated the slope of the freon line needed for the oil dissolved in the freon to flow back to the compressor and lubricate it. As a result, the compressor eventually failed.
Such details may be discussed only in the context of a particular site. Therefore, it is better to entrust the choice of heat removal system to professionals. They will choose a solution suitable for the climate zone, the specific building, premises in it, and the equipment capacity.
For a small server room with 5-6 kW of power, it makes no sense to build a large cooling system, which also consumes utilities during operation. It is enough to supply quality home class air conditioners and provide them with a way to drain the condensate (at least in the domestic sewage system), so that you do not have to run around with a bucket.
What to remember about designing the cooling system in a server room
- Calculate the heat dissipation of the equipment, and do not size cooling by the room area or other parameters.
- Provide a double margin in the heat removal system. It insures you against calculation errors and gives a reserve in case of an accident or expansion of the server fleet.
- When choosing a cooling system, take into account the climatic zone and peculiarities of the room. In complicated cases, it is better to seek help from professionals.
- For a small server room you can use home class air conditioners and provide a system of condensate removal.
2. Power consumption and communication channels
Setting up power supply in the server room is not a complicated matter: all you need is to select an uninterruptible power supply system and calculate the required duration of autonomous power supply in minutes.
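The runtime calculation mentioned above can be sketched as follows. This is a rough estimate only, assuming a constant load and a fixed inverter efficiency (the 0.9 default is a hypothetical placeholder). Real batteries deliver less energy at high discharge rates and as they age, so the vendor's runtime charts should take precedence. The function name and the example figures are illustrative.

```python
def required_battery_wh(load_watts, runtime_minutes, inverter_efficiency=0.9):
    """Estimate usable battery energy (Wh) needed to carry a load for a given time."""
    if load_watts < 0 or runtime_minutes < 0:
        raise ValueError("load and runtime must be non-negative")
    if not 0 < inverter_efficiency <= 1:
        raise ValueError("inverter_efficiency must be in (0, 1]")
    hours = runtime_minutes / 60
    # Losses in the inverter mean the battery must hold more than the load consumes.
    return load_watts * hours / inverter_efficiency


# Hypothetical example: a 3 kW server room that must run for 15 minutes on batteries.
print(f"{required_battery_wh(3000, 15):.0f} Wh of usable battery energy")
```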
Main solutions for power supply redundancy:
- a diesel or a gasoline power generator,
- a natural gas power generator,
- a second feed line from an additional substation.
The choice of option depends on the particular location. A lot has been written on this subject, and no significant issues are normally encountered. The thing to remember is that any power line may fail.
How to calculate the degree of redundancy for utilities
One of our clients was a supermarket chain. We were arguing with their local management about whether a sufficient level of redundancy was included into the design. To understand this, you need to answer the question: how long can a business survive without its server capacity?
It turned out that the supermarket had local cash registers, so for some time the goods could be sold without communication with the server. I asked: “Will you be able to arrange cash collection at the end of the day?” The answer was: “We can order the service by phone”. Meanwhile, when the cash register shift ended, more problems would appear: you needed to upload the balance of goods to the server.
Therefore, the business could exist without server capacity for one day. On this basis, we calculated the volume of the fuel tank for the diesel generator and other redundancy systems. Even if the data center were idle for half a day, employees would be able to work until the evening. And toward the night you could refuel and start the generator, connect to the server and close the cash register shift.
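The fuel tank sizing described above reduces to simple arithmetic. Here is a minimal sketch, assuming the generator's hourly fuel consumption is known from its datasheet. The 24-hour target, the 10 L/h consumption, the 20% reserve and the function name are hypothetical placeholders, not values from the project.

```python
def fuel_tank_liters(autonomy_hours, consumption_l_per_hour, reserve_factor=1.2):
    """Return the tank volume (L) needed to run a generator for the given time.

    reserve_factor adds headroom for unusable fuel at the bottom of the tank
    and for delays in refueling.
    """
    if autonomy_hours < 0 or consumption_l_per_hour < 0:
        raise ValueError("autonomy and consumption must be non-negative")
    if reserve_factor < 1:
        raise ValueError("reserve_factor must be at least 1")
    return autonomy_hours * consumption_l_per_hour * reserve_factor


# Hypothetical example: cover 24 hours with a generator burning 10 L/h.
print(f"Tank volume: {fuel_tank_liters(24, 10):.0f} L")
```

The autonomy target is the business answer to "how long can we survive without the servers", so it should come from the kind of conversation described above, not from the generator catalog.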
If a business can live without its data center for a while, the data center does not need redundancy.
Can you safeguard yourself against cable damage caused by excavation work?
At the design stage, it is impossible to foresee every way in which a cable may be damaged.
We were building a server room near a large court building. Our electric cable ran through its territory. The court staff convinced us that the site was well guarded and we were protected from problems with damaged utility lines. Therefore, we manually buried the cable to a shallow depth. After a while, the cable failed. It turned out that the court guards were fond of fishing and had been digging worms on the territory for several years. At some point, the cable was cut with a spade.
In one of our projects, we discussed the option of protecting the utility lines with the help of concrete U-blocks that are used to cover sewerage systems. Nevertheless, this solution requires more money and time: more approvals, machinery and labor are needed.
When do you need a raised floor?
A good raised floor increases the price of the data center project, because it is made of high-quality non-combustible materials. A raised floor is needed for underfloor cooling, for mobile cabinets, or when you have a large number of communication lines.
There are cooling systems in which the air conditioners blow into the space under the floor and create excess pressure there, so that cold air is released at chosen spots. In this case, a raised floor is used.
In addition, raised floors are required when you use mobile cabinets. This is convenient because you can remove the tiles of the raised floor in any place, install a grille instead and provide release of cold air.
However, this system has its own “buts”. For example, it makes no sense to put ventilation grilles in the floor within three meters of the air conditioner. This creates three meters of empty space in the data center where you cannot place equipment. I have seen a couple of cases where racks were placed very close to the air conditioners. The equipment overheated because the air under the floor moved so fast that it flew past the grille instead of rising through it.
Large number of utilities
With the help of a raised floor, you can easily route complex communications in the server room. For example, if you use water fan coils for cooling, you can bring water to them through a low raised floor. Then, in case of an accident, the water will not damage the equipment. However, raising the floor higher than 30 centimeters does not make sense.
To protect the equipment from water, you can install moisture protection — for example, metal cases. But such barriers may be expensive. It is cheaper to arrange drainage in the server room: a small pit, a floor with a 3% slope and a drainage pump with a hose to the nearest sewer. Provide the pump with a float and include system maintenance in the scope of planned works. At least once a quarter, technical support should check that everything works.
The data center of one of our clients had a rather complicated air cooling system — air-freon/freon-glycol/glycol-water which was supplied to the fan coils of the server room. The water system had large three-ton storage tanks with water. They could cool the equipment for another 40 minutes if the freon system failed.
Two walls away from the data center, there was a janitor's room with a washstand and a sewer connection. The cleaners kept their equipment in it. The sewerage system was shared, and the drainage pipes from the data center ran through the janitor's room.
Once, the engineers who were servicing this data center said: “The filters are clogged with some unidentifiable dirt, it seems there is algae growth in the storage tanks. You should have them rinsed”. We drained three tons of water, and for a while everything was fine. Then, all of a sudden, we heard shouts: “Help, the sewer is overflowing!” The data center floor was flooding from the hallway side, yet we could see no sign that our sewer pipes were leaking. It turned out that the sewer joint beneath the washstand in the janitor's room had ruptured under the pressure of our drainage. Everything in their room was flooded, and the water flowed across two walls into our room.
4. Physical security of the data center
Physical security is a simple thing, but it is important not to forget about it. There should be doors and locks in the rooms. Keys should not be given to random people. Once we were in a data center where one of our clients was renting rack space. An old lady at the entrance gave us the electronic key to the access control system. There was only one key for all visitors, so one could only use the logbook to see which of the guests visited the server room. In turn, we could have written anything in the entry logbook.
We came to replace some equipment. I noticed the cameras that were installed between the rows and were recording everything that was happening. I went downstairs to the lady's desk: she had a monitor and was probably supposed to be watching it. But she did not seem interested, so we took a relaxed walk around the racks of other clients and checked out our competitors’ equipment.
Firefighting
To arrange a fire extinguishing system, you need to answer the following questions:
- Do you need to extinguish something that has already caught fire?
- How and where to detect smoke?
- Where are the most probable points of ignition?
Do you need to extinguish something that has already caught fire?
If you are not a hosting provider and do not run a huge data center, it does not always make sense to install a fire extinguishing system: installing sensors will often be enough.
When the server room is small and fault tolerance is ensured at the software or cluster level, the loss of one rack may not be worth the fire extinguishing system.
Where are the most probable points of ignition?
Probable points of ignition may be indoors or outdoors.
In the inner rooms, fires occur quite rarely — I have not encountered any cases when a server in the data center caught fire. Much more often fire occurs in rooms with uninterruptible power supply; or batteries or heaters in the next room start emitting smoke.
When we design data centers, we more often think about what is around them: whether fire can come from outside. That is why it often makes more sense to provide a fireproof perimeter than to extinguish fire inside the server room.
The classic solution for fire extinguishing is an inert gas release system. The gas displaces oxygen from the room and stops the fire. However, if the fire starts outside, inert gas will not protect you: once the wall of the data center burns through and collapses, fresh oxygen will enter the room.
Fire barriers are the most effective method of fire protection. In large data centers, server rooms are divided into two parts by a very thick fireproof wall. If one half of the hall burns down, at least the other half remains intact.
In one client's project, however, the roof of the building included some wooden elements. So we raised a rather reasonable question: “Guys, do you really need a wall here? If the roof catches fire, it will make no sense”. Fortunately, we were able to persuade the client. Therefore, it is important to pay close attention to the design, rather than just use a formal approach. If the data center design has a Firefighting and Fire Alarm section, it does not yet mean that nothing will burn down.
The fire extinguishing system should be simple
Once our engineer was performing preventive maintenance. He retrieved the documentation for the fire extinguishing system and reviewed its intended design. After that, he disconnected the actuators and the gas cylinder and tested the system. But the documentation did not specify that the fire extinguishing system de-energizes the entire server room when triggered. As a result, all activity stopped.
Why it is important to design the data center carefully and pay attention to details
I have often seen the same mistake in clients’ data centers. It was caused by inattentive design. Imagine a diesel generator that works in a noise-proof enclosure: only the exhaust and the radiator grill are exposed. A powder fire extinguishing system is placed above the generator. In case of fire, the flame will remain inside the enclosure. The powder released by the system will neither extinguish the fire nor eliminate the source of ignition inside the enclosure. It is this lack of attention to detail that causes technological failures.
Powder fire extinguishing systems cannot be used in server rooms. In case of fire, the powder will indeed stop the combustion reaction, but the equipment will fail anyway: servers will suck the powder in, and it will settle on the fans and components.
How to make the fire extinguishing system simpler
The simpler a fire extinguishing system is, the more efficient and easy to maintain it is. An example of such a system is the special STEG tiles, which are installed in a rack with servers. If the room heats up, the tiles release a special gas that extinguishes the fire. This system does not require any fire sensors.
What to remember about the fire extinguishing system
- Sometimes it is not economical to install a fire extinguishing system in the server room.
- The probable points of ignition are usually outside of the server room, rather than inside it.
- Fire barriers are the most effective method of fire protection.
- Powder fire extinguishing systems cannot be used in a server room — powder gets into the equipment and damages it.
Ventilation and climate control in the server room
Although the server room is not permanently staffed, it needs good ventilation and removal of dust, which accumulates with each employee visit. Air preparation requires maintenance and constant filter replacement. If there are only 6 racks in the server room, it makes no sense to spend money on ventilation.
Excessively dry air can cause a static step discharge. This is a static discharge that occurs when a human step produces a potential difference. It is impossible to make everyone walk around wearing antistatic bracelets connected to the grounding bus, so you need to take care of humidity in the server room.
Dispatching
There are three ways to approach the issue of monitoring and supervision:
1. Proceed without a monitoring and supervision system
Proceeding without a supervision system may look like a workable option. In practice, accidents still happen, but without supervision they become fatal.
2. Using sensors installed on the equipment
Various sensors installed on equipment, such as servers, switches or PDUs, may be used for supervision.
There are data center supervision systems, which are based on built-in sensors. On a modern server, as a rule, there are 2-3 temperature sensors at the inlet, on the CPU, on the power supply and on the outlet: you can view the status of equipment and climate in the server room.
Nevertheless, the internal sensors cannot tell you whether someone is working with the equipment right now, whether the airflow near the server is blocked, and other subtle points. For example, perhaps an engineer is standing in front of the rack right now, or the walls of the cabinet are open, or it is disassembled altogether.
3. Using a supervision solution by third-party vendors
A specialized solution, either from a third-party vendor or assembled in-house on controllers, allows you to install separate airflow, temperature, humidity and other sensors. In this case, you can connect specialized software for monitoring the environment in the data center.
Write a policy on how to respond to the problem
The main difficulty of supervision is not to detect the problem, but to solve it. Therefore, it is important to prescribe in advance what to do when a critical situation occurs.
If an operator sees that the rack temperature sensor suddenly starts to show 32 degrees, they must know exactly whether to call other specialists or whether no action is required.
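Such a policy can be written down as data, so that the operator never has to improvise. This is a minimal sketch: the thresholds, the wording of the actions and the names `Rule`, `RACK_TEMPERATURE_POLICY` and `action_for` are hypothetical, and a real policy would be agreed with the engineers responsible for cooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    min_celsius: float  # the rule applies at or above this temperature
    action: str         # what the operator must do


# Hypothetical thresholds, ordered from most to least severe.
RACK_TEMPERATURE_POLICY = [
    Rule(32.0, "Call the on-duty cooling engineer and open an incident"),
    Rule(27.0, "Check airflow and neighbouring sensors, recheck in 15 minutes"),
]


def action_for(temperature_celsius):
    """Return the prescribed action for a rack temperature reading."""
    for rule in RACK_TEMPERATURE_POLICY:
        if temperature_celsius >= rule.min_celsius:
            return rule.action
    return "No action required"


for reading in (24.0, 28.5, 32.0):
    print(f"{reading:.1f} C: {action_for(reading)}")
```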
How to build dependency lines
One of our clients had a PDU that failed. Panic started immediately, because a PDU is the power distribution unit that feeds the servers. Engineers were urgently called to the data center. It turned out that the PDU was placed in an empty rack and no servers were connected to it.
One of the approaches to supervision that helps reduce anxiety is to build dependency lines. In such a system, if one of the sensors deviates from the normal values, you can check all the dependent sensors in the line and make a decision.
In case of a PDU, you may build a dependency line in this way:
- Sensors record the current and voltage on the PDU;
- Readings are taken from the power supplies on all servers that are connected to the PDU;
- Server temperature is measured;
- The on/off state of the dependent servers is indicated.
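The dependency line above can be sketched as a small data model. This is a minimal illustration, not the logic of any particular monitoring product: the class names, the voltage and temperature thresholds and the severity labels are all hypothetical. As a simplification, a server that is off is treated as down, although it may have been switched off on purpose.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    powered_on: bool       # on/off state of the dependent server
    psu_input_ok: bool     # reading from the server's power supply
    inlet_celsius: float   # server temperature


@dataclass
class Pdu:
    name: str
    voltage: float
    current_amps: float    # recorded for the report, not used in the decision
    servers: list = field(default_factory=list)


def assess_pdu(pdu, min_voltage=200.0, max_inlet_celsius=32.0):
    """Walk the dependency line and return (severity, affected server names)."""
    pdu_faulty = pdu.voltage < min_voltage
    down = [s.name for s in pdu.servers if not s.powered_on]
    degraded = [
        s.name
        for s in pdu.servers
        if s.powered_on
        and (not s.psu_input_ok or s.inlet_celsius > max_inlet_celsius)
    ]
    if pdu_faulty and down:
        return "critical", down
    if degraded:
        return "warning", degraded
    if pdu_faulty:
        return "info", []  # the PDU failed, but nothing depends on it
    return "ok", []


empty_rack_pdu = Pdu("pdu-empty-rack", voltage=0.0, current_amps=0.0)
print(assess_pdu(empty_rack_pdu))
```

With a check like this, a failed PDU with no dependent servers is reported as a low-priority event rather than an emergency.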
Automate management of your data center
You cannot secure the server room from human errors, but it is possible to exclude people from all possible processes. Even where automation is more expensive than human work, it is worth automating.
Why it is important to manage employee access to the system and record their actions
Most problems in data centers are caused by human error: for example, someone accessed the system and broke it.
Divide the data center into systems: utility infrastructure, physical infrastructure for power, cooling and ventilation, network, and server rooms. Each group of employees should be responsible for their own area and should not have access to adjacent systems.
To understand the causes of an accident, you need to have information from both the network and the utility infrastructure. All users and their actions should be visible in the monitoring system: for example, operations on the server management controller, entries in the access control system (ACS), and entrances to the room. Then there will be no situation where employees shift the responsibility for problems onto each other.
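The two ideas above, separating access by system and recording every action, can be sketched together. This is a minimal in-memory illustration: the group names, the system names and the `perform` function are hypothetical, and a real deployment would rely on the access control and logging features of the monitoring platform instead of a Python list.

```python
from datetime import datetime, timezone

# Hypothetical mapping of employee groups to the systems they may operate.
PERMISSIONS = {
    "power_team": {"power"},
    "cooling_team": {"cooling", "ventilation"},
    "network_team": {"network"},
}

# Every attempt is recorded, whether it was allowed or not.
audit_log = []


def perform(user, group, system, action):
    """Check whether the group may touch the system and log the attempt."""
    allowed = system in PERMISSIONS.get(group, set())
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "group": group,
            "system": system,
            "action": action,
            "allowed": allowed,
        }
    )
    return allowed


perform("engineer_a", "network_team", "network", "reboot switch")
perform("engineer_a", "network_team", "cooling", "change setpoint")

for entry in audit_log:
    print(entry["user"], entry["system"], entry["action"], entry["allowed"])
```

Denied attempts are logged as well, because they are often the most useful entries when reconstructing an accident.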
8. The main secret to building a data center: spend nine months designing the data center and three months building it
Most server room designers copy the same designs — they are engaged in stock design. It is very rare that design is approached as a toolkit for a builder, engineer, and integrator.
The attempt to assign all stages of data center design to different parties is doomed to failure from the beginning. To build a reliable data center, you can:
- Use a company specializing in building data centers.
- Use consultants. For instance, we are often approached for expert review by design organizations.
Once my rather peculiar university lecturer explained to me the main principle of building a data center: spend nine months designing it and three months building it.
Rushing to finish the design too early may lead to absurd errors like some of those I have described. Fixing these can be costly and time-consuming.
Manage your data center with DCImanager
To make sure your infrastructure runs without failures, we recommend using the DCImanager platform.
It monitors the status of the data center: collects metrics on power consumption, temperature, traffic, proper operation of the infrastructure, and alerts about any problems.
It manages equipment: servers, network devices, PDUs and other equipment.
It manages IT assets based on an ITAM system, from purchase planning to decommissioning. DCImanager controls the filling of racks, keeps an inventory of equipment and tracks the status of the address space.