Remove Atlassian Stash from an Ubuntu system – CommandDump

Remove Atlassian Stash from an Ubuntu system – CommandDump

To remove atlassian stash from an Ubuntu system (in my case I needed a clean clone of a system similar to a system we Atlassian Stash on)

This assumes that you are using the default install and home locations ,  you may have to change the paths for your system (be careful,  you dont want to accidentally do this if you need the information)

sudo service stop atlstash
sudo rm /var/atlassian/stash  -rf
sudo rm /opt/atlassian/stash -rf
sudo update-rc.d -f atlstash remove 
rm /etc/init.d/atlstash 

Postfix Configuration: The argument against reject_unverified_sender

Postfix Configuration: The argument against reject_unverified_sender

When configuring a mail server there techniques available which can help you to reduce spam.

A popular spam reduction technique is to ‘greylist’ emails.  This method temporarily rejects an email if the server has never seen a prior attempt to send the same combination of From Email, To Email and From IP address ,  legitimate emails are tried again after a few moments and the email goes through no problem.

Another option in the postfix system which can be used to reduce spam is the ‘reject_unverified_sender’ option.   As postfix confirming that the ‘Sender’ is a valid user,  a connection is made to the server associated with the Senders domain (by MX record).   It goes through the email sending process far enough to find out if the server accepts or rejects the Sender email (RCPT TO: sender@email.com).

While it seems like a good idea to confirm that the email address that is sending email to us is valid,   if the sender’s server has greylisting on their domain they would reject the ‘verification’ connection,  which would then ‘reject’ the attempted message.

For this reason,   we choose not to globally install postfix servers with reject_unverified_sender.

————————–

There is some argument though that this does not cause a permanent problem,    because when the reject_unverified_sender_code is set to 450.  Because the rejection of the email will only be temporary and when the email is attempted again,  the sender verification should pass the grey listing.

However,  this is not good enough for me because there are other reasons that the sender verification could fail.  Such as the fact that the server does not accept the MAIL FROM email from the verifying server.  This could be because the doublebounce@ email address used for verification is not accepted by the server for other reasons such as the fact that THEY may be doing some sender verification,  which would fail when an email is attempted to doublebounce,  additionally the verifying server could end up getting a bad ‘reputation’ by only checking to see if email addresses are deliverable,  but then never actually delivering a mail message.

For these reasons,  I recommend skipping this setting and perhaps only using the reject_unknown_sender_domain which just checks whether there is a valid MX record for a domain.

While postfix is powerful with a plethora of options, not all of them immediately helpful.

Recovering / Resyncing a distributed DRBD dual primary Split Brain – [servera] has a different data from [serverb]

Recovering / Resyncing  a distributed DRBD dual primary Split Brain – [servera] has a different data from [serverb]

A client had a pair of servers running drbd in order to keep a large file system syncronized and highly available.  However at some point in time the drbd failed and the two servers got out of sync and it went unnoticed for long enough,  that new files were written on both ‘servera’ and on ‘serverb’.

At this point both servers believe that they are the primary,  and the servers are running in what you call a ‘Split Brain’

To determine that split brain has happened you can run several commands.  In our scenario we have two servers servera and serverb

servera#drbd-overview
 0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%
serverb#drbd-overview
 0:r0/0  StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 757G 58%

From the output above (color added) we can see that servera knows that it is in StandAlone mode,   the server realizes that it can not connect.  We can research the logs and we can find out why it things it is in  StandAlone d.  To do this we grep the syslog.

serverb#grep split /var/log/syslog
Nov 2 10:15:26 serverb kernel: [41853948.860147] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Nov 2 10:15:26 serverb kernel: [41853948.862910] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Nov 2 10:15:26 serverb kernel: [41853948.862934] block drbd0: Split-Brain detected but unresolved, dropping connection! 
Nov 2 10:15:26 serverb kernel: [41853948.862950] block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Nov 2 10:15:26 serverb kernel: [41853948.865829] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)

This set of log entries lets  us know that when serverb attempted to connect to servera,  it detected a situation where both file systems had been written to,  so it could no longer synchronize.  it made these entries and put itself into Standalone mode.

servera on the other hand says that it is waiting for a connection WFConnection.

The next step is to determine which of the two servers has the ‘master’ set of data.  This set of data will sync OVER THE TOP of the other server.

In our client’s case we had to do some investigation in order to determine what differences there were on the two servers.

After some discovery we realized that in our case serverb had the most up to date information, except in the case of one directory,  we simply copied that data from servera to serverb,  and then serverb was ready to become our primary.  In the terminology of DRBD,  servera is our ‘split-brain victim’ and serverb is our ‘splitbrain survivor’  we will need to run a set of commands which

  1. ensures the status of the victim to ‘Standalone’ (currently it is ‘WFConnection’)
  2. umount the drive on the victim(servera) so that the filesystem is no longer accessible
  3. sets the victim to be ‘secondary’ server,   this will allow us to sync from the survivor to victim KNOWING the direction the data will go.
  4. start the victim (servera) and let the let the ‘split brain detector’ know that it is okay to overwrite the data on the victim(servera) with the data on the survivor (serverb)
  5. start the survivor(serverb)  (if the serverb server was in WFConnection mode, it would not need to be started,  however ours was in StandAlone mode so it will need to be restarted)

At first we were concerned that we would have to resync 1.2 TB of data,  however we read here  that

The split brain victim is not subjected to a full device synchronization. Instead, it has its local modifications rolled back, and any modifications made on the split brain survivor propagate to the victim.

The client runs a dual primary,  however as we rebuild the synced pair, we need to ensure that the ‘victim’  is rebuilt from the survivor,  so we move the victim from a primary, to a secondary.   And it seems that we are unable to mount a drive (using our ocfs2 filesystem) while it is a secondary.  So we had to ‘umount’ the drive,  and we were unable to remount it while it is a secondary.  In a future test (in which restoring data redundancy primary / primary is less critical),  we will find out whether we are able to keep the primary/primary status while we are rebuilding from a split brain.

While the drbd-overview tool shows all of the ‘resources’ we are required to use a third parameter specifying the ‘resource’ to operate on .  If you have more than one drbd resource defined you will need to identify which resource you are working with.   You can look in your /etc/drbd.conf file or in your /etc/drbd.d/disk.res (your file may be named differently).  The file has the form of

resource r0 { 
....................
}

where r0 is your resource name,   you can also see this buried in your output of drbd-overview

servera# drbd-overview
0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%

So we ran the following commands on servera to prepare it as the victim

servera# drbd-overview #check the starting status of the victim
 0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ocfs2 1.8T 1001G 799G 56%
serverb# drbd-overview #check the starting status of the survivor
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 760G 58%

From this above we can see that serverb has 58% usage and 760GB free, were server a has 56% usage and 799GB free.
Based on what I know about the difference between servera and serverb, this helps me to confirm that serverb has more data and is the ‘survivor’

servera# drbdadm disconnect r0 # 1. ensures the victim is standalone 
servera# drbd-overview #confirm it is now StandAlone 
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1001G 799G 56%
servera# umount /data # 2. we can not mount the secondary drive with read write
servera# drbdadm secondary r0 # 3. ensures the victim is the secondary 
servera# drbd-overview #confirm it is now secondary 
  0:r0/0 StandAlone Secondary/Unknown UpToDate/DUnknown r-----
servera# drbdadm connect --discard-my-data r0 # 4. start / connect the victim up again knowing that its data should be overwritten with a primary 
servera# drbd-overview #confirm the status and that it it is now connected [WFConnection]
  0:r0/0 WFConnection Secondary/Unknown UpToDate/DUnknown C r-----


I also checked the logs to confirm the status change

servera#grep drbd /var/log/syslog|tail -4
Nov 4 05:14:03 servera kernel: [278068.555213] drbd r0: conn( StandAlone -> Unconnected )
Nov 4 05:14:03 servera kernel: [278068.555247] drbd r0: Starting receiver thread (from drbd_w_r0 [19105])
Nov 4 05:14:03 servera kernel: [278068.555331] drbd r0: receiver (re)started
Nov 4 05:14:03 servera kernel: [278068.555364] drbd r0: conn( Unconnected -> WFConnection )

Next we simply have to run this command on serverb to let it know that it can connect as the survivor (like I mentioned above,  if the survivor was in WFConnection mode,  it would automatically reconnect,  however we were in StandAlone mode)

serverb# drbd-overview #check one more time that serverb is not yet connected
 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ocfs2 1.8T 1.1T 760G 58%
serverb# drbdadm connect r0 # 5. start the surviving server to ensure that it reconnects
serverb# drbd-overview #confirm serverb and servera are communicating again 
 0:r0/0 SyncSource Primary/Secondary UpToDate/Inconsistent C r----- /data ocfs2 1.8T 1.1T 760G 58%
 [>....................] sync'ed: 0.1% (477832/478292)M
servera# drbd-overview #check that servera confirms what serverb says about communicating again 
 0:r0/0 SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
 [>....................] sync'ed: 0.3% (477236/478292)M

Another way to confirm that the resync started happening is to check the logs

servera# grep drbd /var/log/syslog|grep resync
Nov 4 05:18:09 servera kernel: [278314.571951] block drbd0: Began resync as SyncTarget (will sync 489771348 KB [122442837 bits set]).
serverb# grep drbd /var/log/syslog|grep resync
Nov 4 05:18:09 serverb kernel: [42008909.652451] block drbd0: Began resync as SyncSource (will sync 489771348 KB [122442837 bits set]).


Finally,   we simply run a command to promote servera to be a primary again,  and then both servers will be writable

servera#drbdadm primary r0
servera# drbd-overview
 0:r0/0 Connected Primary/Primary UpToDate/UpToDate C r-----
servera# mount /data #remount the data drive we unmounted previously

 

 

Now that we ‘started’ recovering from the split-brain issue we just have to watch the two servers to confirm once they have fully recovered.  once that is complete we will put in place log watchers and FileSystem tests to send out a notification to the system administrator if it should happen again.

Debugging Multiple MySQL query in a stored procedure – unusually long runtime

Debugging Multiple MySQL query in a stored procedure – unusually long runtime

I recently was asked to look into an issue where a client had recently moved a mysql database from one server to another server,   and the run time of a stored procedure (which internally ran 2000+ queries ) was running 90 seconds,  where it had previously only taken 0.6 seconds.

What follows is an ongoing journal of the debugging, trouble shooting and discovery as we walked through the process of addressing issues.  Along the way we had lots of small wins,  and at the end we had one major win which basically allowed us to undo all of the other changes.

The result is a documented discussion of many different methods to trouble shoot issues,  as well as some solutions to different issues.   I find these type of blogs to be useful to me in the future as they help  me to shortcut addressing different issues where I don’t have to come up similar troubleshooting steps again.

Here is a list of commands I ran that help me evaluate the overall server

  • top – to determine whether there was significant load and adequate memory free (in this case there was a load average hovering between .2 and 1 and there was adequate memory)
  • mysqladmin variables|grep datadir – the location of the MySQL datadirectory
  • mysqladmin variables|grep storage – to find out what storage engine is being used (innodb or myisam)
  • mysqladmin variables|grep binlog – to find out what synchronization or replication the server might be doing
  • df -h – to determine what disks are mounted and used to store you mysql instance
  • cat /proc/mdstat – to find out if there are any software raid instances
  • du -h /var/lib/mysql – to show the total size of the databases
  • ls -lt /var/lib/mysql/*/* – to find out the size of tables and which tables and logs are update most often (in my case there were no large tables being updates during the last few days)
  • echo ‘desc tblname’|mysql databasename – to ensure there is an index on the column I am using to update with
  • dd if=/dev/zero of=/tmp/test bs=1048576 count=2048 – to check the write speed (I have had to run this command multiple times before in order to get accurate disk write speeds because of File System caching)
  • dd if=test of=/dev/null bs=1048576 – to check the read speed
  • iostat -x 1  – to watch the disc write speed and usage (run this in the other window to watch dd run) -x gives more table based output
  • echo “show create procedure procedurename” |mysql databasename – to evaluate any stored procedures that are causing the issues
  • wc -l /var/lib/mysql/mysql-bin.* – find out how big the bin logs are (if any) to help determine what kind of mysql overhead is required to keep the bin logs up to date.

I created two connections to the server so I could evaluate what was happening as I executed the 180 second run time of the query.

  • Based on the output of the dd commands the write speed on the /var/lib/mysql folder is about double (60+ MB/s) than thta of the /home dir.  They are separate partitions however we are not doing any software raid,  so I assume some sort of a Hardware Raid with striping.

I additionally started to dig into the stored procedure that started having the issue,  adding debugging statements, commenting out capabilities to evaluate the time each takes.

  • As I ran the stored procedure,  I also ran wc -l /var/lib/mysql/mysql-bin.000current and found that the binlog was growing rapidly during the running of the stored procedure.   It turns out that the binlog file was only growing by less than 100 K during the time that 100s of queries were running,
  • Next I evaluated the contents of the binlogs and it turns out that it is actually adding 1000s of lines to the bin logs,  multiple every time that one of the ‘update’ queries inside of the stored procedure ran.

Now I recognize that somehow,  inside of the stored procedure,  the system is syncing the results of an update to the disk before it is done.

  • The server it am running on is 5.5.xx so there are not any advanced binlog write delay features like in 5.7 so I had to come up with another solution.

By using a Start Transaction and Commit,  I could avoid writing / committing the transaction to disk until the very end of stored procedure,  so I added this to the begging and end.

This was IT!  by using transactions I avoided having to evaluate and reconfigure the server (which is running a master slave configuration for backup) and I was able to reduce SP run times from

180 seconds to sub seconds.

One important thing to note about the amount of time that it takes to execute – Each update statement that it runs (more than 2000) do not actually end up making any changes to the database.  Perhaps this is much easier for MySQL to write to the binlog after the transaction is complete,  because the server has the ability to say that the NET CHANGE is …. nothing.

 

Digging Deeper

It turns out that even though I was able to find a way to make the multiple sql statements go faster on the new server.  We still have other uses of the same SQL Statements which require a faster run time.  (we are talking about an execution time of a single query which takes .15 seconds on the new server and .01 seconds on the old server.   Some additional commands that I ran (on both the old and new servers) in order to figure out differences and why it may be running a bit slower on the new server.

  • tune2fs -l /dev/mapper/DEVICE |grep -i block
  • mysql.databasename> stop slave; – tried this temporarily to determine whether the running replication is causing a delay – this did not stop the writes to the bin log.
  • mysqladmin variables|grep innodb_buffer_pool_size – confirms that the buffer is big enough to fit the entire set of tables into the server memory
  • mysql.databasename>  SET sql_log_bin = 0; – this stops the current connection from logging to the bin log – this can help to evaluate the amount of time that it takes to run the update event when the bin file is not being updated
  • mysql.databasename>  SET sql_log_bin = 1; — reset this to 1 so that future updates are bin logged.

I noticed with this testing that the sql_log_bin = 0  reduced the update time by 50%!   I have two possible assumptions from this,  which may note be mutually exclusive

  • That the disk seek and write speed is slower on the new server to the point that the additional work that has to be done by writing to the bin file is slowing the server down.
  • That the server is simply computationally slower and CPU work is simply slower,  so this additional work takes longer.

Related to the sql_log_bin is a varialbe I found which seems to make a difference (from a run time of 0.15 second to 0.10) is sync_binlog.  I updated this to set it to 0 and this apparently makes it so that MySQL does not have to wait until the binlog is synced to the file system before it returns.

Some notes about what I feel comfortable ruling out:

  • There is adequate memory,  and memory is not being swapped around by this very mild update.  Infact,  I can run the exact same update statement a dozen times,  to where there is actually NO change happening to the database.
  • When an update is made to the database,  no change is being immediately pushed to a slave.  the only change that happens is a write to the binlog file

So since no change is happening to the database,  i wonder,  what is causing the overhead?  Why would turning off the  sql_log_bin (0) help anything if we are not actually logging a change?

I decided to look into the binlog to see what was being tracked,  even though there is no real need to do anything on the slave server.

  • mysqlbinlog mysql-bin.0000CURRENT|grep MYTABLENAME

The result showed me a plethora of updates to the same table , even though the change did not make a difference to the underlying database.  With more research I found that binlog_format accepts either ‘STATEMENT’ or ‘ROW’.   It seems that if you use ‘ROW’  the actual database changes are somehow appended to the log file,  instead of the statements which generate the change.

  • mysql> SET binlog_format = ‘ROW’;
  • mysql> SET binlog_format = ‘STATEMENT’;

I immediately noticed a change in the amount of time it took to run the statements.  However the time it took is smaller ONLY when there is NOT an update to the database.

To dig deeper here and find out which of the two methods is better, we will need to do a more indepth analysis of how many of the updates actually manipulate the database,   and how many updates do not.   At this point,    I am not yet comfortable changing the binlog method from statement to row permanently because I am not comfortable that the log files will be properly replicated.   I would want to be able to have a more robust test setup where I am not only evaluating a single statement,  but rather a group of them so that I could determine what method of updating is more optimal.

Version Differences

As we look into the difference between the update statement on the new server and the old server,  we can quickly tell that there is an issue which how quickly things are writing to the disk,  when we remove some of the features which write to the disk,  we quickly find out that the queries operate more quickly.  To break this down to a something we can verify,  we stopped all other services which might write to the disk and we ran on a second screen

  • iostat 1 – this shows the statistics of how many writes are happening within one second

We found that there were no writes until we ran an update statement

  • update TABLENAME set column=’xxxx’ where id = 10;

This updates a single rows,  and we were able to see the number of bytes that it wrote to the disk.  We could run the query multiple times and we could see it write to the disk multiple times.  However,  when we run the same test on the original server,  it would write to the disk the first time the update statement ran,  but not the second, or subsequent time.   Logically this makes sense,  the row / column does not need to be updated the second time because the column already = ‘xxxx’ for id = 10.

With this in mind,  we have a  test that I came up with we can do to confirm that our different servers behave differently to the same query.  Since we are running innodb we can check to see some stats about how many times innodb has written something to the disk.  and we can check this to see whether it increments as we make updates

Old Server – MySQL 5.0

  • echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    29744505 OS file reads, 10036031 OS file writes, 6213811 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’xxxx‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    29744505 OS file reads, 10036033 OS file writes, 6213813 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’xxxx‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    29744505 OS file reads, 10036033 OS file writes, 6213813 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’1234‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    29744505 OS file reads, 10036035 OS file writes, 6213815 OS fsyncs

Note that the first time that value is set the number of writes to the OS increments,  but the SECOND time it does not,  because there is no actual update to the data

However,  If I run the exact same set of updates on the new server,  the number of file writes increments EACH time that an update statement is run,  regardless of whether the update statement represents a change in the data

New Server – MySQL 5.5

  • echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    28371 OS file reads, 27188 OS file writes, 9678 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’xxxx‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    28371 OS file reads, 27190 OS file writes, 9680 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’xxxx‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    28371 OS file reads, 27192 OS file writes, 9682 OS fsyncs
  • [root@auth1 ~]# echo ” update TABLENAME set column=’1234‘ where id = 10;”|mysql databasename
  • [root@auth1 ~]# echo “show engine innodb status”|mysql databasename|sed -e’s/\\n/\n/g’|grep ‘OS file’
    28371 OS file reads, 27194 OS file writes, 9684 OS fsyncs

Basically this means that unless we can find out why the update is writing each time,  we are not goig to be able to reduce the query time on the new server.  It seems like this issue is either

  1. Built into the new version innodb engine
  2. Built into the new version of MySQL
  3. Either way,  it could be an option configurable via my.cnf

As an aside this means that we could potentially write sql which DOES NOT select a row if it does not need to update it.Forexample,  note that the following SQL statement has somewhat of a redundancy in it,  but it results in NOT touching a row in the way that we are seeing the row be touched in the New Server

  • update TABLENAME set column=’xxxx‘ where id = 10 and column <> ‘xxxx’

I started looking into a couple of innodb variables to try to find some way of addressing this.  This article seemed to help and led me to the ultimate solution – 8.5.8 Optimizing InnoDB Disk I/O

The article suggested innodb_flush_method parameter to O_DSYNC.   I updated this and immediately my queries that were running at 50 MS went down to 20 MS,  consistanly.   In thinking about this,  we potentially have a small possibility of data loss between the time that the innodb receives the update and when it is flushed to disk.   In our case with the type of data we are tracking,  that is acceptable.   Elsewhere on the page,  I found another variable which seemed to go one more step. innodb_flush_log_at_trx_commit.  This setting intuitively explains that the log must be flushed to the drive each time that the transaction is commited.   The option did not even exist in 5.0 so I am unsure of what the default behavior was,  however when setting innodb_flush_log_at_trx_commit to 0,   I was able to get the query time down to 0.00.

Basically this means that,  after an update query is run, innodb is allowed to return immediately before the change is flushed and stored to disk.    This addresses our concerns completely.

Run this command to see mysql variables that control how things are flushed.

  • mysqladmin variables|grep flush

So drastic was the change I made using the innodb_flush_log_at_trx_commit variable set to 0,  I decided to undo some of the other testing I had done,  to see how performance was without them.

  1. When I restore innodb_flush_method to default (NULL),  the time to update stayed at 0.00
  2. The OS file writes DO NOT always increment immediately,  however they will increment after a second or two.  I could confirm this by running a statement to look at the number of srv_master_thread log flush and writes: in the innodb status, immediately before and immediately after an update statement and they would be the same,  however if I wait a couple of seconds,  they would increment.
    1. echo “show engine innodb status”|mysql radius|sed -e’s/\\n/\n/g’|grep srv_master_thread
  3. I updated restored sync_binlog from 0 back to 1 in the /etc/my.cnf.  to determine whether syncing the binlog makes a difference.   This is where I ran into a bit of discrepancy.  I have actualy been testing with two separate servers this whole time,  one is under load,  and one is not under load.  until now everything has been the same,  however changing the sync_binlog variable affects the performance differently on the two servers.
    1. On the server WITHOUT load,  the sync_binlog can be 1,  and the server still has an update time of 0.00
    2. On the server WITH load,  the sync_binlog must be 0 to keep the update time to 0.00
      This makes some  sense as a server that does not have much load on it,  may have more File System caching available or other resources available in order to be able to quickly respond to a single infrequent disk write.  This is likely a similar principle to the concept where we must run dd a couple of times in order to get past the FS cache.

 

Complex Restore of Postgres data where multiples rows are deleted from a single row.

Complex Restore of Postgres data where multiples rows are deleted from a single row.

We ran into an issue with a client database where an application deleted upwards of 360 K records. In this process we found we needed to restore a database from several hours ago but we were unable to do a full system restore,  so here is the recepie for how to resolve

  1. restore a database gziped file to a ‘sandbox’ database
    #gunzip -c database_back.1014|psql -Uuser database_restore_20150909
  2. alter the public schema to rename it
    #echo ‘alter schema public rename to ‘restore_09122015’|psql -Uuser database_restore_20150909
  3. dump the database and load to the live db in the new schema
    #pg_dump -Uuser database _restore_20150909|psql -Uuser livedb
  4. create insert statements which will insert all missing records from the new schema into the live (public) schema
    #awk ‘/tbl/{gsub(“tbl”,””); print “insert into tbl”$1″ select r.* from restore_09122015.tbl”$1″ r left join tbl”$1” n using(“$1″id) where n is null; “}’ tbllist.save | psql –U user livedb.

Note that this relies on your database being in the same consistent naming format tblshortname.shortnameid.

 

 

Find out which PHP packages are installed on ubuntu / debian

Find out which PHP packages are installed on ubuntu / debian

As we have moved or upgraded sites from one server to another,  sometimes we have needed to know which PHP5 dependencies were installed on one server servera,  so that we could make sure those same dependencies were met on another server serverb

To do this we can run a simply command line tool on server a

servera# echo `dpkg -l|awk '$1 ~ /ii/ && $2 ~ /php5/{print $2}'`
libapache2-mod-php5 php5 php5-cli php5-common php5-curl php5-gd php5-mcrypt php5-mysql php5-pgsql 

and then we copy the contents of the output and past it after the apt-get install command on serverb

serverb# apt-get install libapache2-mod-php5 php5 php5-cli php5-common php5-curl php5-gd php5-mcrypt php5-mysql php5-pgsql 

Dont forget to reload apache as some packages do not reload it automatically

serverb# /etc/init.d/apache2 reload

Proftpd: FTP Permission denied – Even after confirming correct AuthUserFile user and path permissions

Proftpd: FTP Permission denied – Even after confirming correct AuthUserFile user and path permissions

On A VPS – even though we hav the correct user setup in the proftpd.conf file

User www-data
Group www-data

and I have virtual users setup in proftpd.conf

AuthUserFile /etc/proftpd/ftpd.passwd

and the user I am logging in setup with a special home page in my ftpd.passwd

myusernamexxxx:$MYENCRYPTEDPASSXXXX:33:33::/data/webs/mydomain.com/htdocs:/bin/bash

 

and the permssions on my directory were correct

#cd /data/webs/mydomain.com
#ls 
drwxr-xr-x 2 www-data www-data 4096 Jun 27 14:04 htdocs

So i started up proftpd in my console

proftpd -nd10

and  tried to save an temp.txt file via ftp and I got alot of log files,  but this is the one that tipped me to the problem

2015-07-01 06:50:27,117 myserver proftpd[11396] 127.0.1.1 (x.x.x.x[x.x.x.x]): in dir_check_full(): path = '/temp.txt', fullpath = '/var/www/temp.txt'.

So it appears that proftpd,  while reading the ftpd.passwd to find the username and password,   was not reading the home page.

When I checked the permissions on /var/www it turns out that directory was owned by root,  so I changed it (the directory is empty unused,  i just wanted to confirm whether

#cd /var
#ls
drwxr-xr-x 2 root root 4096 Jun 27 14:04 www
#chown www-data.www-data /var/www

Then when I tried to save a junk.txt file,  it was saved correctly

2015-07-01 07:44:33,024 myserver proftpd[20581] 127.0.1.1 (x.x.x.x[x.x.x.x]): Transfer completed: 0 bytes in 0.03 seconds

This confirms it,   the home directory in the ftpd.passwd file is not being used correctly,  so i have to find out why note.  So I started to dig back through the output and found this line

unable to chdir to /data/webs/mydomain.com/htdocs (No such file or directory), defaulting to chroot directory /var/www

Ultimately I found that the DefaultRoot variable was not supposed to be overridden by the /home directory,   it turns out that the DefaultRoot works WITH the /home directory in the ftpd.passwd file to determine where to place the user.  So the fix for this issue was to simply replace the DefaultRoot /var/www with

DefaultRoot ~

This tells proftpd to use what ever home directory is in the ftpd.passwd file.

Ubuntu Server Time Resetting

Ubuntu Server Time Resetting

We have a server that was having trouble resetting the date on the server to todays date and time,  in the year 2020.  It appeared that the problem happened randomly and in some cases it would happen and then go away.  Here are some of the steps I went through to debug this.

My server has a daily 1:01 AM cronjob to the the date from another local server (to keep all of our servers in sync)

This command syncs the date with that server.

/usr/sbin/ntpdate -v my.localsever.com

Anytime I noticed the date off at 2020, when i would run this command and it would properly reset to the correct time,  so it seems it has to be coming from somewhere other than the my.localserver.com

 

So I decide to try to pinpoint when this happened.  Do to this I started a cron log,  which dumps the date,  every 30 seconds into a file, so I could look at that file to find out when the dates change

 /bin/date >> /tmp/bin.date.log

Now,  next time it happens I will have a history of the minute during which the issue happens and perhaps I can tie it to some process I have running.

Compare the packages (deb / apache) on two debian/ubuntu servers

Compare the packages (deb / apache) on two debian/ubuntu servers

Debian / Ubuntu

I worked up this command and I don’t want to lose it

#diff <(dpkg -l|awk '/ii /{print $2}') <(ssh 111.222.33.44 "dpkg -l"|awk '/ii /{print $2}')|grep '>'|sed -e 's/>//'

This command shows a list of all of the packages installed on 111.222.33.44 that are not installed on the current machine

To make this work for you,  just update the ssh 111.222.33.44 command to point to the server you want to compare it with.

I used this command to actually create my apt-get install command

#apt-get install `diff <(dpkg -l|awk '/ii /{print $2}') <(ssh 111.222.33.44 "dpkg -l"|awk '/ii /{print $2}')|grep '>'|sed -e 's/>//'`

Just be careful that you have the same Linux kernels etc,  or you may be installing more than you expect

Apache

The same thing can be done to see if we have the same Apache modeuls enabled on both machines

diff <(a2query -m|awk '{print $1}'|sort) <(ssh 111.222.33.44 a2query -m|awk '{print $1}'|sort)

This will show you which modules are / are not enabled on the different machines

 

DPM Error 337: Manually Set Name of Allowed DPM Server on Client

DPM Error 337: Manually Set Name of Allowed DPM Server on Client

When installing a Protection Agent from DPM 2007 on a server on a Windows 2003 Domain I ran into a wacky problem where the DPM Administration Console errored out during Install.

A server errored out with a 337 error.

You cannot install the protection agent on MYSERVER because access to the computer has been denied.

The Microsoft Documentation says that the recommended action is “Do the following to troubleshoot this issue: ”

1) If another DPM server is currently protecting MYSERVER use that DPM server to uninstall the protection agent from MYSERVER. Then, use this DPM server to install the protection agent on MYSERVER.

2) Verify that the time on the DPM server and the selected computer is synchronized with the domain controller. At a command prompt, type “net time /set” to synchronize the time with the domain controller.

3) If the computer is a domain controller, verify that the primary domain controller (the PDC Emulator) is running Windows Server 2003 with Service Pack 1 (SP1), and that Active Directory has completed replication between the domain controllers since the Windows Server 2003 SP1 installation.

So I start debugging with item 1.  Since I am not the first Administrator to work on this installation AND there is another DPM server located elsewhere on the network which was recenctly installed,  it seems possible that something odd was done with this computer wpointed to another DPM Server.

First I try uninstalling the client locally on the system and I log into the server to remove the program using Add/Remove Programs,  however the Add/Remove Programs listing is completely blank.  I do a bunch of research on this and eventually  I determine that the Unistall information must somehow be corrupt,  the system needs a OS restore or something.  I cant afford this and will plan to do this another day.   I start browsing around in the Program Files directory for the DPM Client to try and find an uninstall program or something and I happen upon an executable that looks interesting.  I open a command prompt and run the program to find out what it does.

#> cd C:Program FilesMicrosoft Data Protection ManagerDPMbin
#> setDPMServer.exe /?

This shows me how to run this program

#> SetDPMServer -dpmservername < servername>

So I run it with the name of my DPM server

#> SetDPMServer -dpmservername MYSERVER

And it clearly does something giving me the following information.

Configuring dpm server settings and firewall settings for dpm server =[MYSERVER]
Configuring dpm server settings and firewall settings for dpm server =[W2K.DOMAIN.COMMYSERVER]
Configuration completed successfully!!!

Now I head back over to my DPM server and run the install again from the Management Tab of the DPM 2007 Administrator Console.  And would you beleive it?  It installed perfectly.  So there are two things I need to get in order to address the other issues on my network (find out why Add / Remove Programs is not coming up,  find out if anyone is trying to protect some of the computers using another DPM server on the network).
The thing I never found out was if it could have been items 2 or 3 where there is a time sync problem between the AD controller and the server.  It might have been better to start with the easy fixes described in those steps since that would have been easier and would have ruled it in our out.

Until then though this will work; Mystery Solved!

Call Now Button(208) 344-1115

SIGN UP TO
GET OUR 
FREE
 APP BLUEPRINT

Join our email list

and get your free whitepaper