AWS Auto Scaling Group – CodeDeploy Challenges

First, here is my setup:

  • A single development / test server in the AWS cloud, backed by a separate Git repository.
  • When code is completed in the development environment it is committed to the development branch (using whichever branching scheme best fits the project).
  • At the same time the code is merged to the test branch, and the code is available for client testing on the ‘test_stage’ site if they would like.
  • Then, on an as-needed basis, the code in the test branch (on the test_stage server) is deployed to AWS using their CodeDeploy API:
    • git archive test -> deploy.zip
    • upload the file to an S3 bucket (s3cmd)
    • register the zip file as a revision using the AWS Register Revision API call
  • This creates a revision that can be deployed to any deployment group.
  • I set up two deployment groups in my AWS account, test and live.
  • When the client is ready, I run a script which deploys the zipped-up revision to the test server, where they are able to look at it and approve.
  • Then I use the same method, but deploy it to the www (live) deployment group instead.

(The complexities of setting this up go deeper than I will in this article, but for future reference, all of this deployment logic is stored in our deploy.php file. A simplified shell sketch of the flow is shown below.)
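For illustration, here is a rough command-line sketch of the archive / upload / register / deploy steps described above. Our actual version lives in deploy.php and calls the API from code; the application name, bucket name, and group names below are placeholders, not our real values.

# archive the test branch and upload it to S3
git archive --format=zip -o deploy.zip test
s3cmd put deploy.zip s3://my-deploy-bucket/deploy.zip

# register the zip as a CodeDeploy revision
aws deploy register-application-revision \
    --application-name my-app \
    --s3-location bucket=my-deploy-bucket,key=deploy.zip,bundleType=zip

# later, when the client is ready, push the registered revision to a group
aws deploy create-deployment \
    --application-name my-app \
    --deployment-group-name test \
    --s3-location bucket=my-deploy-bucket,key=deploy.zip,bundleType=zip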

A couple of tricks “they” don’t tell you.

  • Errors can be difficult to debug – if you configure the CodeDeploy agent to do more verbose logging, it can help you determine what some of the errors were.
    • update /etc/codedeploy-agent/conf/codedeployagent.yml and set verbose to true.
    • restart the service: /etc/init.d/codedeploy-agent restart (it can take several minutes to restart, this is normal)
    • tail the log file to watch a deployment in real time, or investigate it after the fact (tail /var/log/aws/codedeploy-agent/codedeploy-agent.log)
  • Deploying a revision to servers while they may be going through some termination instability is likely to cause your deployment to fail when one of your servers terminates.
    • To prevent this, update the Auto Scaling plan so the minimum and maximum are the same number of servers, and do not deploy while the group is under load; scaling events during the 10 – 15 minutes (up to 2 hours) a deployment can take will cause errors. (A sketch of pinning the group size from the CLI follows this list.)
    • Depending on the load on your servers, your deployment can take a lot of CPU, which can trigger an Auto Scaling alarm and spin up new instances or send you an email. There is no single correct way to deal with this, but it is a good idea to know about it before you deploy.
    • Finally, the item that prompted me to write this: it appears that when you attempt to deploy a revision to an Auto Scaling group, it can cause some failures.
      • The obvious one is that the deployment will fail if it is attempted while a server is shutting down.
      • However, it also seems that if you have decided to upgrade your AMI and your Launch Configuration, a deployment will fail. For me, it actually caused a key failure on login as well (this could have been because of multiple server terminations and then another server taking over the IPs within a few minutes). Anyway, much caution around these things.
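As an illustration of the “pin the group size” advice above, here is a hedged sketch using the AWS CLI; the group name my-asg and the size of 2 are placeholders, not values from my setup.

# hold the group at a fixed size so no instances launch or terminate mid-deployment
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --min-size 2 --max-size 2 --desired-capacity 2
# deploy, then restore the original min/max once the deployment has finished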

 

UPDATE:

Well, the problem was actually that the ‘afterinstall.sh’ script was cleaning up the /opt/codedeploy-agent/deployment-root/ directory (so we didn’t run out of space after a couple dozen deployments), but in doing so it was also removing the appspec.yml file.

 

So I updated the command that runs in the AfterInstall hook to be:

 /usr/bin/find /opt/codedeploy-agent/deployment-root/ -mindepth 2 -mtime +1 -not -path '*deployment-instruction*' -delete
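For context, here is a minimal sketch of what that hook script might look like with the corrected cleanup. The script name afterinstall.sh comes from the update above; the rest is an assumption about how it is wired up.

#!/bin/bash
# afterinstall.sh – run by the CodeDeploy AfterInstall lifecycle event.
# Remove deployment archives older than a day so the disk does not fill up,
# but leave the deployment-instructions data the agent still needs.
/usr/bin/find /opt/codedeploy-agent/deployment-root/ -mindepth 2 -mtime +1 \
    -not -path '*deployment-instruction*' -delete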

Debugging CodeDeployment on AWS

This article is being written well after I had already installed the CodeDeploy agent on an Ubuntu server, created an AMI out of it, and set it up as an auto-launch server from a Scaling Group.

I am documenting the process I went through to dig into the error a bit more; this helps to identify and remember where the log files are and how to get additional information, even if the issue is never the same again.

Issues with running the CodeDeploy agent showed up as an error in the log:

more /var/log/aws/codedeploy-agent/codedeploy-agent.log
 put_host_command_complete(command_status:"Failed",diagnostics:{format:"JSON",payload:"{"error_code":5,"script_name":"","message":"not opened for reading","log":""}"

I decided this was not enough information to troubleshoot the error, so I had to dig in and find a way to make it more verbose. I found this file; just update verbose from false to true:

vi /etc/codedeploy-agent/conf/codedeployagent.yml
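If you would rather script it than edit by hand, something like the following should flip the flag. This assumes the file uses the :verbose: key, which is what I recall the agent config looking like; check your copy first.

sed -i 's/:verbose: false/:verbose: true/' /etc/codedeploy-agent/conf/codedeployagent.yml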

Then restart the codedeploy-agent

/etc/init.d/codedeploy-agent restart

This can take quite a while, since it does quite a bit of background installing and checking for duplicate processes, but once it is complete you can check that the process is running again:

ps ax|grep codedeploy

Once this is running in verbose mode,   monitor the log

tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log

and re-run the deployment to view the results in the log. The most useful thing for me was to grep for the folder that the more specific error information was written to:

grep -i "Creating deployment" /var/log/aws/codedeploy-agent/codedeployment-agent.log

This showed me the folder that all of the code WAS going to be extracted to. Since there was an error, the system actually dumped the contents of the error into a file called bundle.tar in the folder it would have extracted to.

cat /opt/codedeploy-agent/deployment-root/7ddce865-0611-45f0-bf74-459fcf806f23/d-YK4NWBJD7/bundle.tar

This returned an error from S3 showing that CodeDeploy was having trouble downloading from S3, so I had to add permission to the instance's policy to download from S3 buckets as well.
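For reference, here is a hedged sketch of how that S3 read access might be granted from the CLI; the role name, policy name, and bucket are placeholders, and this is not the exact policy I used.

# attach an inline policy allowing the instance role to read the deploy bucket
aws iam put-role-policy \
    --role-name MyRoleName \
    --policy-name codedeploy-s3-read \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:Get*", "s3:List*"],
        "Resource": ["arn:aws:s3:::my-deploy-bucket", "arn:aws:s3:::my-deploy-bucket/*"]
      }]
    }'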

 

 

InstanceAgent::CodeDeployPlugin::CommandPoller: Missing credentials – Debug / My Fix

I created a Policy and Launch Configuration according to this documentation

http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-service-role.html

However, I still received this error:

InstanceAgent::CodeDeployPlugin::CommandPoller: Missing credentials  - please check if this instance was started with an IAM instance profile

The problem lay somewhere in the configuration of how I set up and launched the server, so I set about exploring the server to evaluate and confirm the ‘missing credentials’.

Using these checks I would be able to confirm the reason behind these errors (and therefore the problem with actually deploying code):

First, run this command to check which roles were ‘requested’:

echo `wget http://169.254.169.254/latest/meta-data/iam/security-credentials/ -O - -q `

This command will return the list of roles that were given to your server, for example:

MyRoleName

Then append the returned role name to the end of the URL to request that role's credentials; in my case the error below was displayed.
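The full request looks like this (MyRoleName being whatever the first command returned):

echo `wget http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRoleName -O - -q `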

{
 "Code" : "AssumeRoleUnauthorizedAccess",
 "Message" : "EC2 cannot assume the role MyRoleName. Please see documentation at http://docs.amazonwebservices.com/IAM/latest/UserGuide/RolesTroubleshooting.html.", 
 "LastUpdated" : "2015-02-17T04:38:25Z"
}

Basically, the EC2 instance was unable to assume the role MyRoleName. This could be due to the permissions of the development account that was used to start the Scaling Group which launched the EC2 instance. To test this:

  • I logged in with an administrator account
  • recreated the scaling group, identical to the original
  • logged in to the server
  • ran the echo `wget http://169.254.169.254/latest/meta-data/iam/security-credentials/ -O - -q ` command again

This didn’t change anything, so I thought I would look into the specifics of why we have a role that should be able to be assumed, but a message which explains that EC2 cannot assume the role.

So I looked back at the article http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-service-role.html and found somewhat of an anomaly. The article suggests building a trust relationship which gives the ‘codedeploy.us-west-2.amazonaws.com’ service the ability to use this policy. However, that does not jive with the messages I see in the logs saying that EC2 is not able to assume the role. So I opened the trust relationship under the Role and added the ec2.amazonaws.com line shown below:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "codedeploy.us-east-1.amazonaws.com",
          "codedeploy.us-west-2.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Lo and behold, it worked. Now when I run the following I get a success message:

echo `wget http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRoleName -O - -q `

So, it turns out the problem is that the article is either incorrect, or was written to give the CodeDeploy service the ability to work on EC2 but not to give the EC2 servers access to the CodeDeploy service. (The codedeploy.us-east-1 service is also required in order to give the deployment group the IAM role.)
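If you prefer the CLI to the console, something like this should apply the same trust policy; it assumes the JSON above has been saved locally as trust.json, and the file name is a placeholder.

# replace the role's trust (assume-role) policy with the corrected document
aws iam update-assume-role-policy \
    --role-name MyRoleName \
    --policy-document file://trust.json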

While it was difficult to find the solution, the troubleshooting steps above are useful to help identify related or other issues. I hope you find some use from these tools.

 

Michael Blood

 

 
