Gluster infrastructure

This website is meant to serve as the source of truth about the Gluster project’s infrastructure. It describes our tools and general operational procedures.

Infra overview

We maintain several tools and systems for the project. This section describes our tools and any quirks in our setup. The detailed configuration for each tool can be found in our Ansible repository at https://github.com/gluster/gluster.org_ansible_configuration

Ansible

Ansible is the configuration management tool used for the infra.

The main reasons to use it are familiarity within the community, ease of use, and synergy with other projects using it.

Setup

The current setup uses a trusted server that does deployments based on git commits. The setup details are documented at https://github.com/OSAS/ansible-role-ansible_bastion.

The trusted server is ant-queen.int.rht.gluster.org, and it requires jumping through a bastion; any of the current hypervisors will do the trick for now.

Running a playbook or an ansible ad-hoc command

For security reasons, ansible should be run as a specific user on ant-queen.int.rht.gluster.org. The specific user is set in order to restrict usage of the ssh keys and/or salt bus access. There is ongoing work to let people in a specific unix group do all the work with sudo, but this is not finished yet.

So to run anything, just connect as root, then use su to switch to ansible_admin:

su - ansible_admin

From here, ansible and ansible-playbook can be run directly:

ansible all -m ping
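
A playbook run works the same way; for example, assuming the playbooks are deployed under /etc/ansible/playbooks (as used later in this document) and limiting the run to a hypothetical group:

ansible-playbook /etc/ansible/playbooks/deploy_base.yml -l some_group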

Pushing a change

The setup uses 2 git repositories. The public repository is automatically synced to github on push. Pushing a commit also triggers a deployment to apply changes right away.

To push a change, start by cloning the repository:

git clone https://github.com/gluster/gluster.org_ansible_configuration.git ansible_gluster_public
cd ansible_gluster_public
git remote set-url --push origin ssh://ant-queen.int.rht.gluster.org/srv/git_repos/public

Then modify and push to the same repository. If you are in the admins group, you will be able to push. If not, you can send the patch to the gluster-infra mailing list with git send-email.
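
As a sketch, sending the last commit as a patch could look like this (the list address is assumed to be gluster-infra@gluster.org; adapt as needed):

git send-email --to=gluster-infra@gluster.org -1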

Fetching a PR from github

If anyone opens a PR on github, you can merge it (after proper review) using the following process. This assumes that the repository is set up as explained in the previous paragraph.

You can then use the following commands to fetch and merge one specific PR (for example, PR 4):

git fetch origin pull/4/head:pr_4
git checkout master
git cherry-pick pr_4
git push

Running ansible from an admin workstation

In order to run a command on all the servers (or a subset), you can use the following command from a git checkout, provided you have root access:

ansible all -i hosts -l some_group -m ping

You can also pass -u to connect as a specific user, and -b with -K to use sudo.
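
For example, to connect as your own user and escalate with sudo (the login is a placeholder):

ansible all -i hosts -l some_group -m ping -u yourlogin -b -K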

DNS

DNS is currently hosted by Red Hat IT, on Red Hat hosted servers. While this is less than ideal from an openness point of view, we currently do not have the resources required to replace it, and given how critical DNS is, changing it is not a priority.

To modify it, someone with access to the internal dns-maps server needs to send the commit.

The people who currently have access are:

  • Michael Scherer
  • Nigel Babu

If needed, RH IT can also do a modification and/or grant access.

Naming conventions

In order to be able to use specific domain name matching with ansible, the following naming convention is used.

For IPs hosted in the Rackspace cloud, we use *.rax.gluster.org.

For IPs in the Community cage, we use *.rht.gluster.org.

If the IP is internal and on VLAN 401 (Common, also called Internal), the domain should be *.int.rht.gluster.org.

If the IP is on VLAN 400 (Management, also called Admin), the domain should be *.adm.rht.gluster.org.

Historical servers may use *.rack.gluster.org and *.cloud.gluster.org, but newer servers no longer use these. We try to keep the *.gluster.org naming for externally visible names. Ideally, these should only be CNAMEs pointing at the real server names. This is done to permit potential migrations or changes without impacting external users.

Managing networking in the community infrastructure cage

Current setup is composed of 4 servers, each with 4 ethernet interfaces.

For various historical reasons, all of them are directly connected to the internet using the public IP range assigned to the Community Infrastructure Cage, but we plan to change that.

VLAN description

Switches and routers are currently managed by RH IT, so any VLAN change has to be requested through their ticket system by someone working at RH. Please contact the gluster-infra list for any change; the admins will take care of it.

The network setup is composed of a few VLANs, each dedicated to a separate usage.

The first vlan is the public vlan (ID 190). It is used for public internet access, on 8.43.85.129 to 8.43.85.190, and is shared with other tenants in the cage.

The 2nd vlan is the “management” vlan (ID 400), sometimes also called the “admin vlan”. Each hypervisor is connected there, as well as the IPMI or remote admin cards of the physical servers. Barring exceptions, no VM should ever be connected here, for security reasons.

The 3rd vlan is the “common” vlan (ID 401), also called the “internal vlan”. It is used to connect internal servers with non-routable IPs. Servers here have internet access behind a NAT, but we are moving toward tightening access with a dedicated firewall managed by us.

IP range assignation

VLAN   Name         IP Range         Gateway
190    Public       8.43.85.128/26   8.43.85.254
400    Management   172.24.0.0/24    172.24.0.254
401    Common       172.24.1.0/24    172.24.1.181

DNS Servers

Any DNS server can work. Historically, we have used either the one provided by the hoster (Rackspace), or Google Public DNS (8.8.8.8, 8.8.4.4).

There is a plan to isolate the internal network by forcing the use of a specific DNS resolver, since this can be useful for filtering and forensics later, but nothing is set up yet.

Firewall setup

A set of 2 redundant firewalls, masa.rht.gluster.org and mune.rht.gluster.org, has been set up. The setup is highly available, with one of the 2 servers being the active one, using a shared floating IP on both sides of the firewall.

The firewalls are connected to the public and common vlans, and for now only serve to do NAT for the servers on the common vlan. Thanks to HA, the servers can be rebooted one at a time during the day, the other one picking up the IP and the routing automatically.

To share connection state, the 2 firewalls are connected through a 3rd interface on an isolated vlan numbered 3998.

The shared IP can be obtained by resolving the name masamune.$domain for the internal and external domains (int.rht.gluster.org and rht.gluster.org).
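
For example, a quick way to check which address currently holds the floating IP, from a host that can resolve the internal domain:

dig +short masamune.int.rht.gluster.org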

There is no filtering for now, but lockdown is planned later.

Planet

http://planet.gluster.org is the blog aggregator of the gluster community.

This is based on middleman, a ruby framework to build static websites.

Setup

The website is served from the main webserver, supercolony.gluster.org. The planet is built on webbuilder.int.rht.gluster.org on a regular basis specified in the configuration; see https://github.com/gluster/gluster.org_ansible_configuration/blob/master/playbooks/deploy_webbuilder.yml.

To debug the build, please take a look at the documentation of the ansible role, at https://github.com/OSAS/ansible-role-web_builder.

Gerrit

We use Gerrit as the primary place for hosting code and conducting code reviews. It is hosted on myrmicinae.rht.gluster.org in RDU.

Plugins

When upgrading, remember that the plugin versions should match the version of Gerrit. We have the following plugins installed on our instance (apart from the default ones):

Replication

Two repositories from our instance are replicated onto Github:

  • glusterfs
  • glusterfs-specs

Authentication Issues

In June 2016, we upgraded Gerrit from 2.9 to 2.12.2. This caused login issues for a lot of users. In older versions of Gerrit, users could set a username after login. In newer versions, not only did this ability go away, Gerrit required that your Gerrit username must match your Github username. It would throw an error if your Gerrit username and Github username were not the same. To fix this, we changed everyone’s Gerrit username to match their Github username. This cleared up the issues for everyone who had trouble back then.

In recent months, we’ve noticed that when some of the previously active users tried to sign in, they’d run into trouble that looked very much like the old problems. The easiest solution has been to find their old account and remove the entries from the accounts_external_ids table. Changing the username entries on Gerrit does not seem to work anymore. I recommend attempting this on staging first and confirming all is well before trying on production.

Gerrit has completely switched to NoteDB for User information, so any hacks we had that needed PostgreSQL queries will not work anymore.

Jenkins

We use Jenkins to run smoke tests, regression tests, build the release tarball, and build RPMs. It is hosted on myrmicinae.rht.gluster.org in RDU. We track the continuous updates on the LTS channel provided by Jenkins.

Plugins

The plugins of note are the following:

  • Gerrit trigger
  • Job configuration history plugin
  • Audit trail
  • Git plugin

Access

Access to Jenkins is via Github. Admins are added directly in the Jenkins UI based on their Github username. To add an admin, they first need to log in to Jenkins with Github so that their user is created.

Gluster release managers and some developers have write access to selected parts of the Jenkins user interface through the jenkins-admins team on Github. Unless someone is a part of the Gluster CI team, this is the highest level of access they will be granted.

Managing Jenkins jobs

Jenkins jobs are managed with Jenkins Job Builder. Editing existing jobs or adding new jobs is done by submitting patches to the build-jobs repository. A Jenkins job called jenkins-update runs jenkins-jobs update when a patch is merged via Gerrit into the repo. This will not run on a direct push to master, however.
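
As an illustration only, a minimal Jenkins Job Builder definition in the build-jobs repository looks roughly like this (the job name, node label, and build step are hypothetical):

- job:
    name: example-smoke-test
    node: centos7
    builders:
      - shell: |
          # placeholder build step
          echo "running smoke tests"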

Upgrade Jenkins

When there’s a Jenkins security update or an LTS release, we try to install it as soon as possible. Here are the steps to upgrade Jenkins:

  1. Send an email to gluster-infra@ and gluster-devel@ announcing the outage window.
  2. Prepare Jenkins for shutdown. Jenkins -> Manage Jenkins -> Prepare for Shutdown.
  3. Send an email to gluster-infra@ and gluster-devel@ that Jenkins is now in quiet mode.
  4. Cancel any jobs that will not finish in the next 30 mins.
  5. Upgrade all plugins. Jenkins -> Manage Jenkins -> Manage Plugins.
  6. Once all currently running jobs are complete, ssh into build.gluster.org and run yum -y update.
  7. Send a follow up email to gluster-infra@ and gluster-devel@ announcing upgrade status.
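
Afterwards, one quick way to confirm the version that is running (assuming Jenkins answers locally on port 8080):

curl -sI http://localhost:8080/ | grep -i x-jenkins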

Munin

Munin is our primary graphing solution. We used it mostly because it is packaged and easy to deploy.

Deployment

Munin was deployed using ansible, with the ansible-role-munin role (https://gitlab.com/osas/ansible-role-munin.git). It should be deployed on all ansible-managed systems, which at the time of writing of this documentation means all FreeBSD and Linux systems except gerrit and jenkins.

Procedures

This document is meant to serve as a guide for our standard procedures in case of emergencies, errors, upgrades, or other maintenance tasks.

Adding and restoring backups

Backups are done with rdiff-backup, on backups.int.rht.gluster.org. The backup server connects to the remote server over ssh, so there must be a way for it to access the host. The VM is currently running on formicary.rht.gluster.org, and is not accessible from the internet.

Adding a directory to be backed up

To add a new directory, the file playbooks/deploy_backups.yml must be modified. The ‘backups’ role takes a list as an argument, with the directory, the size to be set aside for backups, the maximum age of the files to keep, and the server and user to connect to over ssh.

For example, to keep 20 days of backups of /etc from server.example.org, assuming this will take 30G and connecting as root, the entry would look like this:

- directory: /etc
  age_max: "20D"
  size: "30G"
  server: "server.example.org"
  user: root

Make sure that the data to be copied is in a format suitable for restoration (for example, an SQL dump rather than live database files).
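
For example, a hypothetical pre-backup dump on the backed-up host (database name and target path are placeholders), so that rdiff-backup copies a restorable file rather than live database files:

pg_dump mydatabase | gzip > /var/lib/backup-dumps/mydatabase.sql.gz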

The role will automatically install an ssh key, restrict its usage, and create a logical volume of the specified size on the backups server.

Restoring files from backups

To restore a file, just copy it from the server to the target.

If the file is no longer present, it can be restored using rdiff-backup. Refer to the man pages for usage.
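
As a sketch, assuming the backups for server.example.org live under /srv/backups/server.example.org on the backup server (the exact path depends on the role configuration), restoring /etc as it was 10 days ago could look like this:

rdiff-backup --restore-as-of 10D /srv/backups/server.example.org/etc /tmp/etc_restore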

Adding a new VM using ansible

VM creation is fully automated using a custom role that will create the VM, using virt-install.

Step 1: Adding the VM to the DNS

For the time being, DNS is managed by RH IT, but they did delegate the modification to members of the community.

So the first step is to decide on the IP address to use, and add it to the reverse DNS and the forward zone in the internal dns-maps repository.

Please refer to the networking documentation to choose an IP, or contact an admin.

Step 2: Creating the VM

In order to create the VM, we need to have:

  • an IP address and network setup
  • the distribution to be used (with a version)
  • the number of CPUs (optional, 1 by default)
  • the size of the main disk
  • the size of a data disk (optional)
  • the size of the RAM
  • the hostname
  • the hypervisor where it should be installed

For the hypervisors, we install Jenkins builder VMs on haplometrosis and pleometrosis, and infrastructure VMs on myrmicinae and formicary. If installing redundant services, please make sure to install them on 2 different hypervisors.

For network parameters, please refer to the networking documentation to get the right gateway and ip.

By convention, we try to place each VM installation in a separate playbook whose name ends with ‘_vm.yml’.

The bridge parameter indicates the primary network to which the VM will be connected. For historical reasons, the bridge names are not consistent across the whole set of servers, so this has been abstracted into variables defined in host_vars (such as host_vars/myrmicinae.rht.gluster.org/network_interfaces.yml).

To connect to the external public facing network (VLAN 190), you need to use the bridge defined by bridge_public.

To connect to the internal private network (VLAN 401), you need to use the bridge defined by bridge_common.

Then, a deploy playbook must be added to the ansible repository in the ‘playbooks/’ directory, using the guest_virt_install role. For an example, please see playbooks/deploy_gerrit_vm.yml, and see the documentation of the role for more explanation of the arguments, at https://gitlab.com/osas/ansible-role-guest_virt_install

Once the playbook is written, it can be pushed to the bastion, which triggers the deployment automatically. It may take 10 to 20 minutes for the run to finish.

Step 3: Adding the VM to the inventory

Once the VM is properly installed, it can be added to the inventory file (the file ‘hosts’ in the ansible public repository).

The VM must also be added to either the ‘community_cage’ group, for public facing servers, or to the ‘int_rht_gluster_org’ group if plugged into the internal network.
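
For example, assuming a new internal VM named newvm.int.rht.gluster.org (a placeholder name), the hosts file entry would look like this:

[int_rht_gluster_org]
newvm.int.rht.gluster.org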

Upon sending the commit to the bastion server, ansible will take care of triggering any role that needs to be deployed there, and of adding the ssh key to known_hosts if needed.

Resizing a partition in a VM

VMs installed by our automation role use LVM-backed storage, permitting easier resizing. The process is not automated yet, partly because it is infrequent enough not to warrant automation right now.

This can be done online if LVM is used all the way, which is the case right now.

Step 1: Resize the logical volume on the hypervisor

On the hypervisor (host.example.org), we have to resize the logical volume. Each VM has 2 different volumes, one for the system and one for the data. The system volume is supposed to be wiped on reinstall, and the data volume is not. Let’s assume we are resizing the data volume, adding 50G to a 100G disk. This operation has to happen on the hypervisor, for a VM called vm.example.org:

lvextend -L 150G /dev/mapper/host-guest--vm.example.org_data

Step 2: Ask QEMU to refresh the disk

The operation needs the new size of the block device, so we reuse it:

virsh blockresize vm.example.org /dev/mapper/host-guest--vm.example.org_data 150G

Step 3: Resize the disk in the VM

First, we have to resize the physical volume on vm.example.org. The data partition is usually /dev/vdb, but it depends on the VM. The pvresize command is safe to run in all cases:

pvresize -v /dev/vdb

Then the logical volume needs to be resized, either to take all the remaining free space with -l +100%FREE, or to a specific size with -L:

lvresize -l +100%FREE /dev/mapper/guest-data

And finally, grow the filesystem. This step depends on the filesystem used. For example, for xfs, the command would be:

xfs_growfs /dev/mapper/guest-data

For ext4, it would be:

resize2fs /dev/mapper/guest-data

Step 4: Reflect the change in the playbook

If any change is made, it should be reflected in the playbooks. There is usually something to change in the hypervisor playbook and, depending on the VM, in the playbook that deploys the VM or any filesystem deployed there.

Removing a VM

VM removal is not an automated process, due to the risk involved, the use of systems outside of ansible’s reach, and how infrequently it is needed.

Step 0: Do backups and verify the VM is not needed

In order to avoid errors, we should always verify that the VM is really not needed, and make sure we have backups of whatever is needed on that VM. Backups should be done automatically, but the coverage is not perfect yet, so it is better to double check.

Step 1: Shutting down the VM

In order to verify nothing breaks, the VM should be shut down first, and we should wait to verify nothing broke. Depending on the VM, we can wait a few hours to a few days.

Step 2: Remove the VM from ansible

The VM needs to be removed from the ansible hosts file. Edit the ‘hosts’ file and remove the VM from all groups where it appears.

Step 3: Remove it from all playbooks

After a VM removal, if a group is empty, it should be removed, and the corresponding playbooks must also be removed. Since everything is in git, this can always be reinstated later.

Step 4: Clean the DNS

For now, the DNS is managed by RH IT. Please contact an admin to have them modify the internal dns-maps repository. Both the gluster.org zone and the reverse zone must be modified.

Step 5: Clean up the ssh key from ansible bastion

To avoid issues in the future if the system is reinstalled, the ssh key needs to be removed from the ssh known_hosts file on the ansible bastion.

To do that, connect on the bastion as a regular user:

ssh ant-queen.int.rht.gluster.org

Then use clean_ssh_public_keys.py to clean it:

sudo -u ansible_admin /usr/local/bin/clean_ssh_public_keys.py oldvm.example.com

Step 6: Clean up of the other systems

Depending on the system, several other things need to be cleaned up.

For example, the definition of the host needs to be removed from FreeIPA. Backups need to be removed if applicable, and the same goes for the logs.

We currently do not have documentation for that, so please contact admins.

Step 7: Removal of the VM

Finally, once we are sure that we will never need the VM or any data on it again, we can use virsh on the hypervisor:

virsh undefine  --remove-all-storage oldvm.example.com

Beware, this will also remove the 2nd drive and any attached drive on the VM. Please be cautious.
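
Before running the undefine, it can help to list the block devices that would be destroyed; this check is not part of the original procedure, just a precaution:

virsh domblklist oldvm.example.com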

Installation of a new server

All new servers must be put into the pool of ansible-managed servers. For that, once the server is installed and an admin has ssh access, the following procedure is used to let ansible manage it.

Step 1: Openssh

Disable root login with a password as soon as possible. Ansible will take care of that, but better be sure.

For that, make sure that the sshd configuration contains:

PermitRootLogin without-password
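
After editing the configuration, a quick way to validate and apply it (the reload command varies per OS; systemctl is assumed here for Linux hosts):

sshd -t && systemctl reload sshd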

Step 2: Upgrade

Upon installation, make sure that the server is upgraded with the latest security patches. The exact syntax depends on the OS; refer to the documentation of the system. Do not forget to reboot afterwards.
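
For example, rough sketches for an EL-based system and for FreeBSD (adapt to the actual OS):

yum -y update && reboot
freebsd-update fetch install && pkg upgrade -y && reboot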

Step 3: Install the ssh keys

On the ansible bastion, locate the public key used by Ansible for deployment. It should be in ~ansible_admin/.ssh/id_rsa.pub.

Place it on the new server in /root/.ssh/authorized_keys:

ssh server.example.org mkdir -p /root/.ssh
ssh server.example.org chmod 700 /root/.ssh
ssh ant-queen.int.rht.gluster.org cat ~ansible_admin/.ssh/id_rsa.pub | ssh server.example.org tee /root/.ssh/authorized_keys

Test connectivity on the bastion with:

su - ansible_admin
ansible all -i 'server.example.org,' -m ping

It should answer something like this:

server.example.org | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

Then basic configuration can be deployed with:

su - ansible_admin
ansible-playbook -i 'server.example.org,' /etc/ansible/playbooks/deploy_base.yml

Step 4: Add it to ansible inventory

To add the server to ansible, you need to have a checkout of the ansible repo; refer to the ansible documentation for instructions.

Then you just need to add it to the hosts file, in the appropriate group, and push (and/or make a PR).

Connecting to the admin remote interface in the community cage

The 4 servers in the DC are:

  • pleometrosis.rht.gluster.org
  • haplometrosis.rht.gluster.org
  • myrmicinae.rht.gluster.org
  • formicary.rht.gluster.org

Each of them is connected on the admin network.

Here is a summary of the IP addresses:

Server          IP            Admin interface
haplometrosis   172.24.0.11   172.24.0.1
pleometrosis    172.24.0.12   172.24.0.2
myrmicinae      172.24.0.13   172.24.0.3
formicary       172.24.0.14   172.24.0.4

Step 1: Setting up a tunnel with ssh

Depending on the operation, you will need to connect to one server in order to open an ssh tunnel to the others.

If you want to connect to pleometrosis admin interface, use this:

ssh -N -L 5900:172.24.0.2:5900 -L 8080:172.24.0.2:8080 haplometrosis.rht.gluster.org

If you want to connect to haplometrosis admin interface, then use this:

ssh -N -L 5900:172.24.0.1:5900 -L 8080:172.24.0.1:8080 pleometrosis.rht.gluster.org

If you want to connect to myrmicinae admin interface, then use this:

ssh -N -L 5900:172.24.0.3:5900 -L 8080:172.24.0.3:8080 -L 8443:172.24.0.3:8443 pleometrosis.rht.gluster.org

If you want to connect to formicary admin interface, then use this:

ssh -N -L 5900:172.24.0.4:5900 -L 8080:172.24.0.4:8080 -L 8443:172.24.0.4:8443 pleometrosis.rht.gluster.org

Step 2: Connecting to the web interface

Just use a browser and go to http://127.0.0.1:8080/

Connect using the login/password provided by the infra team.

iDRAC (for Dell servers) will redirect you to an https connection with a self-signed certificate. Adding the SSL certificate key and/or a proper certificate is planned for later.

Step 3: Connecting to VNC

Supermicro and Dell use a Java applet for that. Make sure that IcedTea is installed and properly configured on your system.

On the web interface, you can click on the preview of the console to start the VNC interface. If the browser downloads a launch.jnlp file, you have to open it with /usr/bin/javaws.

Adding a bridge on a hypervisor

The current nmcli ansible module (as of Ansible 2.4) does not work and does not implement any bridge support, and fixing it would require a rather complete rewrite.

So we can’t really automate this part.

Step 1: Creating the bridge with nmcli

Assume we want to add a bridge virbr2 to a hypervisor, for a specific vlan connected on em2.

The following commands can be used:

$ nmcli con add type bridge con-name virbr2 ifname virbr2
$ nmcli con add type ethernet con-name em2 ifname em2 master virbr2
$ nmcli con modify virbr2 bridge.stp no
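
Depending on the autoconnect settings, the connections may need to be activated explicitly afterwards:

$ nmcli con up virbr2
$ nmcli con up em2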

Step 2: Complete the ansible configuration

In order not to hardcode any interface, we use a variable for each bridge connected to a specific VLAN. They are meant to be used by the role that installs VMs.

See the file network_interfaces.yml in each hypervisor host_vars directory in the ansible repository for the format used.
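
As an illustration, such a file could contain something like this (the bridge device names are hypothetical; the variable names match those used above):

bridge_public: virbr1
bridge_common: virbr2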

Finding network information in the Community cage

To find which port is connected to which VLAN, tshark and LLDP announcements can be used.

Step 1: Connect to the hypervisor as root

All hypervisors are connected to the public internet for now, so ssh as root is sufficient:

$ ssh root@server.example.com

Step 2: Run tcpdump to capture traffic

The switches announce information using the Link Layer Discovery Protocol, which can be printed using a dedicated tool or tshark, the command-line version of wireshark.

However, on EL7, tshark comes with wireshark, which pulls in a complete GTK and X11 stack, and we want to avoid that. It is also safer not to run tshark as root, due to the complexity of the parser and past security issues.

So the recommended approach is to dump the traffic with tcpdump, then look at it locally with tshark.

To find, for example, which vlan is on virbr2, the command would be:

# tcpdump -i virbr2 -w /tmp/dump not ip and not arp and not stp

After enough time (something like 30 seconds to 1 minute), stop tcpdump with Ctrl-C.

Step 3: Dissect the dump with tshark

Copy the file using scp:

$ scp root@server.example.com:/tmp/dump /tmp/dump

Verify with tshark the content:

$ tshark -r /tmp/dump -Y lldp
24          11 JuniperN_fe:f4:30 -> LLDP_Multicast LLDP 378 Chassis Id = 5c:43:27:ef:f8:04 Port Id = ge-1/0/43 TTL = 120 System Name = sw02-access-r3-06-b01

Here, we can see that the switch to which that interface is connected is ‘sw02-access-r3-06-b01’, and the port is ‘ge-1/0/43’.

Once the information is found, do not forget to remove the dump:

$ rm -f /tmp/dump
$ ssh root@server.example.com rm -f /tmp/dump

Requesting a Loaner Machine

If you wish to debug a test failure on our test environment, you can request a loaner build node.

How to Request

  • File a bug asking for a loaner machine.
  • Mention the operating system and version (for example: NetBSD 7).
  • Attach your SSH public key to this bug. You will be granted access via SSH keys.
  • Once the machine is setup, you can run your tests. Before returning the machine, please delete any temporary files or folders you use for testing.
  • After you have finished testing, please comment on the bug. The machine will be returned to the pool in 7 days unless you specify that you want it for more time.

How to Handle a Loaner Request

  • Find an idle machine and disable it from Jenkins. Mention the bug number in the reason.
  • Add the requester’s SSH key on the machine as the Jenkins user. Do not close the bug.
  • When the machine is returned, remove the requester’s SSH keys from the Jenkins user and delete any temporary files placed for testing.
  • Return the machine to Jenkins pool.
  • Close the bug after the machine is returned to the Jenkins pool.

Adding an email alias

Email aliases are managed by ansible, in the private repository.

Step 1: Clone the private repository

Since we want to keep people’s emails private, for spam and privacy reasons, the list of aliases is stored in a private git repository:

git clone ssh://ant-queen.int.rht.gluster.org/srv/git_repos/private gluster_private

Step 2: Add the alias

The aliases are stored as a yaml hash in host_vars/supercolony.gluster.org/email_aliases.yml. The format is quite straightforward: the key is the name of the alias, and the value is a list of emails to which the alias is redirected.

For example:

mail_aliases:
    root:
    - michael
    - nigel@example.com

This redirects root@ to michael@ (which is later redirected to another email) and to nigel@example.com.

Step 3: Trigger a deploy

For technical reasons linked to the use of a second repository, ansible does not deploy automatically on commits to that repository. So we either have to wait for an automated run of ansible, or trigger it manually with:

su - ansible_admin
ansible-playbook /etc/ansible/playbooks/deploy_postfix.yml

Adding a new mailing list

In order to add a list, you need to modify the file playbooks/deploy_postfix.yml in the ansible repository. The variable mailing_lists contains the lists, and you just need to add the name of the list there.
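
As a sketch, the variable would end up looking like this (the existing names are shown as examples, new-list being the one added):

mailing_lists:
  - gluster-devel
  - gluster-infra
  - new-list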

Once the list is created (i.e., after the push), people in the list-admin alias will get the password by email and need to forward it to the owner of the list so they can configure it.

Banning and unbanning email on all lists

In September 2017, the gluster infrastructure faced an increase in spam, with people using a specific pattern to subscribe to the lists. In order to fight back, we have deployed several countermeasures, one being a command to modify the ban list on every mailing list.

The commands need to be executed as root on the lists.gluster.org server:

$ ssh root@lists.gluster.org

Banning an address

A regexp can be used for banning emails. For example, to ban all users with an email like spammer01@example.org or spammer02@example.org, the command would be:

$ /usr/lib/mailman/bin/withlist -a -r ban '^spammer.*@example.org$'

Seeing the current list of bans

To see the current list of bans on all lists, the command is:

$ /usr/lib/mailman/bin/withlist -a -r list_bans

Unbanning an address

To unban massively, the unban command works in a similar way to the ban one:

$ /usr/lib/mailman/bin/withlist -a -r unban '^spammer.*@example.org$'

Gerrit Upgrade

Upgrade Steps

  • Send email to gluster-devel, gluster-infra, and automated-testing announcing the start of the outage window.

  • Put up an outage page in Apache. Modify the apache config with this line:

    ErrorDocument 500 "Sorry, our script crashed. Oh dear"
    
  • Stop the port 22 port forward to prevent accidental pushes:

    systemctl stop xinetd
    
  • Stop gerrit:

    su - review
    cd review.gluster.org
    bin/gerrit.sh stop
    
  • Back up Gerrit DB and files into ~/review inside the review user’s home directory.

  • Download the latest version of gerrit into /review/review.gluster.org

  • Run java -jar gerrit-x.x.x.war init (press enter for most prompts).

  • Make sure you upgrade all the prompted plugins

  • Upgrade Github auth plugin to latest available on ci.gerritforge.com

  • Upgrade events-log to latest available.

  • There is a good chance you need to do a full re-index:

    java -jar /review/review.gluster.org/bin/gerrit.war reindex
    
  • Run bin/gerrit.sh run to confirm that there are no errors.

  • Ctrl + C the terminal and start Gerrit from systemd.

  • Start the SSH proxy via xinetd

  • Restart Apache without the error page

Post Upgrade Testing

The following are the minimum pieces to verify after a Gerrit upgrade:

  • Use the UI to browse changes. Verify diffs are visible. Problems here usually mean that there’s an Apache configuration issue.
  • Logout and login. Any problems here usually point to a Github library mismatch.
  • Verify that all the user names are visible and not just account numbers.
  • Push a test review and verify that the push goes through.
  • Verify that Jenkins voting works as expected for this change.
  • Verify that git protocol clones work (a quick check is sketched after this list).
  • Verify that git.gluster.org works as expected.
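
A minimal sketch of the last two checks (the anonymous clone URLs are assumptions based on the replicated repositories listed earlier):

git ls-remote git://review.gluster.org/glusterfs
git ls-remote https://git.gluster.org/glusterfs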

Emergencies

Nobody likes it when enough attention isn’t paid to an urgent infra request. We don’t like it when our days are constantly interrupted either. This document is intended to serve as a guide to define what is an emergency and who to contact in case of emergency.

What is an emergency

Any downtime of an essential service is a “code red”: a situation which is currently preventing our users from using services we are reasonably expected to provide, or which is stopping our developers from working. The following services are considered essential:

  • review.gluster.org
  • build.gluster.org
  • download.gluster.org

Please file a bug and get in touch with the corresponding point of contact for the respective service immediately. In a code red situation, all work will be paused until the issue is resolved.

What is not an emergency

Issues which affect developers/users but can be mitigated trivially, like a build node that’s acting up. It can be deactivated by multiple team members.

Contributing

Gluster Infrastructure is open to code contributions. Almost all of our infrastructure is handled with code. If you cannot contribute to a particular piece of infrastructure in our code repos, that’s a bug. Please talk to us.

The primary place for most contributions is our ansible repository on Github. The infrastructure documentation is also maintained on Github and is open to contribution.

The CI system is also open to contribution, in terms of new jobs and maintaining the existing scripts. Jenkins jobs are managed by Jenkins Job Builder. The test scripts that actually power our tests are in a different repository on Github and are open for contribution too.

Contact

The gluster-infra list is a good place to get up-to-date information and coordinate with the Infrastructure team. We’re also on irc.freenode.net on #gluster and #gluster-devel.

Infrastructure Responsibility Matrix

This section aims to define ownership and points of contact for issues.

Gluster Sysadmin

Michael Scherer

Gluster CI Engineer

Nigel Babu

Infrastructure Responsibility
Product                     Owner
Gerrit                      Gluster CI Engineer
Jenkins                     Gluster CI Engineer
Build Nodes                 Gluster CI Engineer
supercolony.gluster.org     Sysadmin
webbuilder.gluster.org      Sysadmin
munin.gluster.org           Sysadmin
download01.gluster.org      Sysadmin
freeipa01.gluster.org       Sysadmin
syslog01.gluster.org        Sysadmin
Infra Security Issues       Sysadmin and Gluster CI Engineer

Note: Nigel is only responsible for the Gerrit software and Jenkins software. The uptime, OS updates, and OS upgrades are all owned by the Sysadmin team.

Who to Contact for What

Point of Contact
Situation                                          Contact
Gerrit/Jenkins Outage                              Sysadmin Team
Gerrit/Jenkins Issues (not an outage)              Gluster CI Engineer
Access to machines                                 Sysadmin Team
Access to build nodes (previously called slaves)   Gluster CI Engineer
Infra-related build failures                       Gluster CI Engineer
Infra Security Issues                              Sysadmin and Gluster CI Engineer
