I've tried three times, and I can reproduce the problem every time. I create a few volumes (let's call them A) and a few snapshots of A (let's call them B). I attach the volumes, install services, and run chroot, and everything is sweet compared with the AoE setup I used before. The problem occurs after I detach all volumes properly, terminate all instances properly, stop the storage controller service properly, and then reboot the machine: every A and B is lost after that.
Furthermore, in my case the storage controller became unusable after the reboot. No matter what I tried, it couldn't create volumes, take snapshots, or attach volumes anymore.
The first time, I reinstalled the whole cloud (I have 12 machines running the cloud now) to solve the problem.
The second time, I reinstalled the CLC, CC, and SC, because reinstalling only the SC didn't solve the problem.
This time I was lucky: all the volume files are still in /var/lib/eucalyptus/volumes and I can see them. But still, those volumes became unusable; they can't be attached to any instance.
I then observed a few interesting things about this problem:
- A can't be attached to any instance
- Can create volumes from B (so maybe I could take backups using snapshots and then recover from them)
- But volumes recovered from snapshots still can't be attached to any instance
- Can't create a snapshot from A
- Can create new volumes
- And they can be successfully attached to instances
So the problem became more interesting: I can create new volumes, but I can't use the old ones. What the hell?
Log files:
cloud-error.log on storage controller:
---------------------After it rebooted-------------------------
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop6 error: loop: can't get info on device /dev/loop6: No such device or address
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop1 error: loop: can't get info on device /dev/loop1: No such device or address
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop4 error: loop: can't get info on device /dev/loop4: No such device or address
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop3 error: loop: can't get info on device /dev/loop3: No such device or address
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop2 error:
18:39:20 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop0 error: loop: can't get info on device /dev/loop0: No such device or address
18:40:31 ERROR [SystemUtil:pool-9-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup -f error: losetup: could not find any free loop device
18:40:31 ERROR [SystemUtil:pool-9-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup -f error: losetup: could not find any free loop device
18:40:33 ERROR [BlockStorage:pool-9-thread-1] com.eucalyptus.util.EucalyptusCloudException: Could not create loopback device for //var/lib/eucalyptus/volumes/vol-5A33062FIl85xWTW. Please check the max loop value and permissions
---------------------After it rebooted-------------------------
Okay, I'm sorry, I forgot to increase the number of loop devices. I have increased it, and now I have 256 loop devices in /dev. Is that enough?
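As a rough way to sanity-check the loop device count, the sketch below counts the loopN nodes in a directory. The function name count_loop_nodes is made up for this example. Whether 256 is enough is not something this thread settles; the "Could not create loopback device for ...volumes/vol-..." message suggests that each file-backed volume needs its own loop device, so the count would have to exceed the number of volumes and snapshots.

```shell
#!/bin/sh
# Hypothetical helper: count loop device nodes (loop0, loop1, ...) in a
# directory. Defaults to /dev when no directory is given.
count_loop_nodes() {
    ls "${1:-/dev}" | grep -c '^loop[0-9][0-9]*$'
}

# Usage (as root, on the storage controller):
#   count_loop_nodes        # how many loop nodes exist in /dev
#   losetup -a              # which loop devices are currently bound to files
```

Comparing the two numbers before and after a reboot would show whether the SC is running out of free loop devices or whether the old bindings simply did not come back.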
--------------------Tried to create snapshot from old volumes----------------------------
09:43:10 ERROR [SystemUtil:pool-8-thread-5] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap tgtadm --lld iscsi --op show --mode target --tid 13 error: tgtadm: can't find the target
09:54:40 ERROR [ISCSIManager:pool-8-thread-2] Unable to delete target: 13
09:55:18 ERROR [BlockStorage:pool-9-thread-1] com.eucalyptus.util.EucalyptusCloudException: Unable to find snapshot: snap-5F0B0656
09:55:19 ERROR [SystemUtil:pool-9-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap dd if=/dev/vg-mJ38iQ../lv-snap-Odyjxw.. of=//var/lib/eucalyptus/volumes/snap-5E9D064B bs=1M error: dd: opening `/dev/vg-mJ38iQ../lv-snap-Odyjxw..': No such file or directory
09:55:19 ERROR [BlockStorage:pool-9-thread-1] com.eucalyptus.util.EucalyptusCloudException: /var/lib/eucalyptus/volumes/snap-5E9D064B (No such file or directory)
09:57:18 ERROR [BlockStorage:pool-9-thread-1] com.eucalyptus.util.EucalyptusCloudException: /var/lib/eucalyptus/volumes/snap-5EDF064F (No such file or directory)
--------------------Tried to create snapshot from old volumes----------------------------
The CLC and CC show nothing in their logs.
--------------------Tried to attach an old volume----------------------------
This is nc.log on one of my node controllers:
[Fri Sep 3 10:03:02 2010][022544][EUCAINFO ] doAttachVolume() invoked (id=i-538309F7 vol=vol-59870629 remote=//,10.11.25.43,iqn.2009-06.com.eucalyptus.NewACluster:store14,VPqZgEav+fASsHW+HRmK1UfIcx3kSfDR/r+/UjyRoeCsFr1kq0LdfGHEBNowHs9zxLP/Z8nkREkG/m3hr2Oi8YoUQjjlXClVGWZashDvQx3KnEAVAL9zIlrpuzQXVR/ceqJx92GejmkYCSO1u/La/35954zFX288O+ZV8K4gp/diJLbFSnYcArwQBpFX/iJdvIrLxX8KwZ4lefa9S20HlTUqcJxEqkdLoyCLW15x1TtQSijDntsP539UwSQ9jGUwgeb/KMwzTmDOz0R2es/Npeq3vOUW7SzlnWc3iqDFcQlWbzphFB7JmHTFVA/zt0cthpoFn8pmtBy+ka6qqiE1mw== local=/dev/sdp)
[Fri Sep 3 10:03:02 2010][022544][EUCAINFO ] connect_iscsi_target invoked (dev_string=//,10.11.25.43,iqn.2009-06.com.eucalyptus.NewACluster:store14,VPqZgEav+fASsHW+HRmK1UfIcx3kSfDR/r+/UjyRoeCsFr1kq0LdfGHEBNowHs9zxLP/Z8nkREkG/m3hr2Oi8YoUQjjlXClVGWZashDvQx3KnEAVAL9zIlrpuzQXVR/ceqJx92GejmkYCSO1u/La/35954zFX288O+ZV8K4gp/diJLbFSnYcArwQBpFX/iJdvIrLxX8KwZ4lefa9S20HlTUqcJxEqkdLoyCLW15x1TtQSijDntsP539UwSQ9jGUwgeb/KMwzTmDOz0R2es/Npeq3vOUW7SzlnWc3iqDFcQlWbzphFB7JmHTFVA/zt0cthpoFn8pmtBy+ka6qqiE1mw==)
[Fri Sep 3 10:03:02 2010][022544][EUCADEBUG ] system_output(): [//usr/lib/eucalyptus/euca_rootwrap //usr/share/eucalyptus/connect_iscsitarget.pl //,10.11.25.43,iqn.2009-06.com.eucalyptus.NewACluster:store14,VPqZgEav+fASsHW+HRmK1UfIcx3kSfDR/r+/UjyRoeCsFr1kq0LdfGHEBNowHs9zxLP/Z8nkREkG/m3hr2Oi8YoUQjjlXClVGWZashDvQx3KnEAVAL9zIlrpuzQXVR/ceqJx92GejmkYCSO1u/La/35954zFX288O+ZV8K4gp/diJLbFSnYcArwQBpFX/iJdvIrLxX8KwZ4lefa9S20HlTUqcJxEqkdLoyCLW15x1TtQSijDntsP539UwSQ9jGUwgeb/KMwzTmDOz0R2es/Npeq3vOUW7SzlnWc3iqDFcQlWbzphFB7JmHTFVA/zt0cthpoFn8pmtBy+ka6qqiE1mw==]
[Fri Sep 3 10:03:05 2010][022544][EUCAERROR ] ERROR: connect_iscsi_target failed
[Fri Sep 3 10:03:05 2010][022544][EUCAERROR ] AttachVolume(): failed to connect to iscsi target
[Fri Sep 3 10:03:05 2010][022544][EUCAERROR ] ERROR: doAttachVolume() failed error=1
--------------------Tried to attach an old volume----------------------------
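The attach request in nc.log already names the storage controller address and the iSCSI target the node is asked to log in to. A small sketch for pulling those fields out of the log follows; the function name attach_requests is invented here, and the field layout is inferred only from the log lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: for every "doAttachVolume() invoked" line in an nc.log,
# print "<instance> <volume> <SC address> <iSCSI target name>".
# Assumes the remote= field looks like "//,<ip>,<iqn>,<encrypted blob>",
# as in the log excerpts in this thread.
attach_requests() {
    sed -n 's|.*doAttachVolume() invoked (id=\([^ ]*\) vol=\([^ ]*\) remote=//,\([^,]*\),\([^,]*\),.*|\1 \2 \3 \4|p' "$1"
}

# Usage:
#   attach_requests /var/log/eucalyptus/nc.log
```

With the address and target name in hand, one could check from the node whether the SC still exports that target (for example with iscsiadm -m discovery -t sendtargets -p <address>) and, on the SC, list what is exported with tgtadm --lld iscsi --op show --mode target. If the target is missing after the reboot, the failure is on the SC side rather than on the node.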
I'm now wondering: if the storage controller has problems even when I reboot it cleanly every time, what happens when we hit a blackout, a disk failure, and so on? Why is the storage controller so vulnerable?
Maybe some of my operations were wrong. If so, please tell me how to solve this problem; I don't want to lose my data every time I restart the storage controller.
Thank you very much for your precious time. I highly appreciate any help.
Regards
BestWC
I found out that after rebooting the storage controller, loop0 - loop7 have problems, but I think loop8 - loop255 are OK.
10:41:45 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop5 error: loop: can't get info on device /dev/loop5: No such device or address
10:41:45 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop2 error: loop: can't get info on device /dev/loop2: No such device or address
10:41:45 ERROR [SystemUtil:Thread-16] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop0 error: loop: can't get info on device /dev/loop0: No such device or address
10:41:45 ERROR [ISCSIManager:Thread-16] com.eucalyptus.util.EucalyptusCloudException: tgtadm: invalid request
10:41:46 ERROR [ISCSIManager:Thread-16] com.eucalyptus.util.EucalyptusCloudException: tgtadm: invalid request
It seems the old volumes that were originally set up on loop0-7 lost some files. Do I need to manually run losetup -d on the loop devices every time I restart?
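To see exactly which loop devices the SC complained about, the losetup errors can be pulled out of cloud-error.log. This is only a sketch: the function name failed_loop_devices is invented, and it relies on the "losetup /dev/loopN error:" wording seen in the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: list each loop device for which the SC logged a
# failed "losetup /dev/loopN" call, one per line, without duplicates.
failed_loop_devices() {
    sed -n 's|.*losetup \(/dev/loop[0-9][0-9]*\) error:.*|\1|p' "$1" | sort -u
}

# Usage:
#   failed_loop_devices /var/log/eucalyptus/cloud-error.log
#   losetup -a    # compare with what is actually bound right now
```

Nothing in this thread confirms that a manual losetup -d is required before a restart; the comparison only shows whether the devices named in the errors are unbound after the reboot.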
Sorry, I can't really tell from your post how you performed the upgrade. Was this from source or packages? If you installed from packages, it should have retained AoE as the backing store, so I'm assuming not? Or did you change DISABLE_ISCSI to "N" after installation and then reboot the SC?
Can you provide us with step by step instructions on how you got into the state you did so we can try to reproduce your issue?
Sorry. I'm using CentOS 5.5 and installed Eucalyptus 2.0 from packages. The iSCSI setting was at its default value after installation, so I didn't change it.
The storage is okay now: I deleted all volumes and created new ones, and they can be attached. I think some of the volumes were broken. But still, I lost data. I'll keep trying to figure this out.
Thanks for your reply.
Did you upgrade using the upgrade instructions? http://open.eucalyptus.com/wiki/EucalyptusUpgrade_v2.0
Trying to get more info so we can determine if this is a reproducible problem.
Hi There
We are having exactly the same problem as described above with a clean install of Eucalyptus 2.0
Our setup is as follows:
A single cluster with
* Cloud Controller, SC, Walrus and CC on the same machine
* 4 Node controllers
Eucalyptus is fully functioning, and we can create, attach and snapshot volumes correctly. However if we reboot cleanly every machine in the cloud and bring it back up again cleanly, any Volumes which we created before the reboot cannot be attached to any instances. We are still able to create and attach new volumes and the old volumes do appear in HybridFox as volumes but any attempt to attach or snapshot them fails.
I have tested both with ISCSI enabled and disabled and this has no effect.
The errors we get back are very generic. On the Cloud Controller in cloud-error.log we see:
22:07:30 ERROR [QueuedEventCallback:New I/O client worker #2-26] Caught exception in asynchronous response handler.
com.eucalyptus.util.EucalyptusClusterException: [AttachVolumeResponseType attachedVolume=AttachedVolume null null null null null Fri Sep 03 22:07:30 BST 2010 correlationId=null userId=admin effectiveUserId=null _return=false statusMessage=ERROR]
at com.eucalyptus.cluster.callback.QueuedEventCallback.messageReceived(QueuedEventCallback.java:209)
at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:105)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:567)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:803)
at com.eucalyptus.ws.handlers.MessageStackHandler.handleUpstream(MessageStackHandler.java:134)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:567)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:803)
at com.eucalyptus.ws.handlers.MessageStackHandler.handleUpstream(MessageStackHandler.java:134)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:567)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:803)
at com.eucalyptus.ws.handlers.MessageStackHandler.handleUpstream(MessageStackHandler.java:134)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:567)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:803)
at com.eucalyptus.ws.handlers.MessageStackHandler.handleUpstream(MessageStackHandler.java:134)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:567)
at org.jboss.n
And on the Node controller in nc.log we get:
[Thu Sep 2 22:08:19 2010][003979][EUCAINFO ] doAttachVolume() invoked (id=i-37EB065F vol=vol-5F4F0657 remote=//,192.168.123.125,iqn.2009-06.com.eucalyptus.oxilcloudservice:store8,R9NCWVLP7kQ8n27kus6uMeaqk4K+HHN6XX2JpjfFQ5ruEMu0wIlQPZEQxkR1vDVohzgJEXJU2CqxfnQ6CadpPIHT74iTU0POzhjPw9ICv1OIZBoe6Ubpvbl/IA7MTr8JDiYMsy842zatmbJBD5EYdPESM5yca8HzOk7UkQnPeylFUKZItBN2thP3g3gx+6QE8g9ub5W+dOdPgbTj82AO7o8YT7nwosxUv9t25pvUGgO3/ZPiH8maDkiq8yYjPqaNVxJXYYdrwns2JhWLlPsBbMrBQ5MyGsMp89VanbPhuRXgOyi5H+zWNjAnVfKn5qJCl09CBO/rvZuy7KUCzhcTsw== local=/dev/sdc)
[Thu Sep 2 22:08:19 2010][003979][EUCAINFO ] connect_iscsi_target invoked (dev_string=//,192.168.123.125,iqn.2009-06.com.eucalyptus.oxilcloudservice:store8,R9NCWVLP7kQ8n27kus6uMeaqk4K+HHN6XX2JpjfFQ5ruEMu0wIlQPZEQxkR1vDVohzgJEXJU2CqxfnQ6CadpPIHT74iTU0POzhjPw9ICv1OIZBoe6Ubpvbl/IA7MTr8JDiYMsy842zatmbJBD5EYdPESM5yca8HzOk7UkQnPeylFUKZItBN2thP3g3gx+6QE8g9ub5W+dOdPgbTj82AO7o8YT7nwosxUv9t25pvUGgO3/ZPiH8maDkiq8yYjPqaNVxJXYYdrwns2JhWLlPsBbMrBQ5MyGsMp89VanbPhuRXgOyi5H+zWNjAnVfKn5qJCl09CBO/rvZuy7KUCzhcTsw==)
[Thu Sep 2 22:08:19 2010][003979][EUCADEBUG ] system_output(): [//usr/lib/eucalyptus/euca_rootwrap //usr/share/eucalyptus/connect_iscsitarget.pl //,192.168.123.125,iqn.2009-06.com.eucalyptus.oxilcloudservice:store8,R9NCWVLP7kQ8n27kus6uMeaqk4K+HHN6XX2JpjfFQ5ruEMu0wIlQPZEQxkR1vDVohzgJEXJU2CqxfnQ6CadpPIHT74iTU0POzhjPw9ICv1OIZBoe6Ubpvbl/IA7MTr8JDiYMsy842zatmbJBD5EYdPESM5yca8HzOk7UkQnPeylFUKZItBN2thP3g3gx+6QE8g9ub5W+dOdPgbTj82AO7o8YT7nwosxUv9t25pvUGgO3/ZPiH8maDkiq8yYjPqaNVxJXYYdrwns2JhWLlPsBbMrBQ5MyGsMp89VanbPhuRXgOyi5H+zWNjAnVfKn5qJCl09CBO/rvZuy7KUCzhcTsw==]
[Thu Sep 2 22:08:24 2010][003979][EUCAERROR ] ERROR: connect_iscsi_target failed
[Thu Sep 2 22:08:24 2010][003979][EUCAERROR ] AttachVolume(): failed to connect to iscsi target
[Thu Sep 2 22:08:24 2010][003979][EUCAERROR ] ERROR: doAttachVolume() failed error=1
Can anybody shed any light on this problem? Obviously it is a huge showstopper if we cannot retrieve all of our persistent data after a power outage or planned shutdown!
Thanks a lot
Mark
Can you post the part of your cloud-output.log and cloud-error.log right after you start the front end after a reboot? There should be more information in there to debug your problem. The Storage Controller attempts to recover old volumes after a restart. For some reason it is failing to do so on your setup. The logs should help troubleshoot the problem. Also, it would help if you indicated the distro and how you installed Eucalyptus.
CentOS 5.5
Eucalyptus 2.0.0 Community (Xen) via RPMs
One CLC/CC/SC
Four NC
In the cloud error log we get on startup
With ISCSI disabled:
11:31:22 ERROR [SystemUtil:main] java.io.IOException: Cannot run program "bttrack": java.io.IOException: error=13, Permission denied
11:31:24 ERROR [ISCSIManager:Thread-15] Unable to unbind tid: 8
11:31:26 ERROR [OverlayManager:Thread-15] com.eucalyptus.util.EucalyptusCloudException: Could not export AoE device /dev/vg-cSWzAw../lv-chAKhA.. StorageInfo.getStorageInfo().getStorageInterface(): eth0 pid: 5174 returnValue:
11:31:27 ERROR [OverlayManager:Thread-15] com.eucalyptus.util.EucalyptusCloudException: Could not export AoE device /dev/vg-GYqVEw../lv-t9gbXQ.. StorageInfo.getStorageInfo().getStorageInterface(): eth0 pid: 5215 returnValue:
and with ISCSI enabled:
11:33:45 ERROR [SystemUtil:main] java.io.IOException: Cannot run program "bttrack": java.io.IOException: error=13, Permission denied
11:33:47 ERROR [SystemUtil:Thread-21] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap tgtadm --lld iscsi --op show --mode target --tid 11 error: tgtadm: can't find the target
11:33:49 ERROR [ISCSIManager:Thread-21] com.eucalyptus.util.EucalyptusCloudException: tgtadm: invalid request
11:33:51 ERROR [ISCSIManager:Thread-21] com.eucalyptus.util.EucalyptusCloudException: tgtadm: invalid request
11:33:51 ERROR [ISCSIManager:Thread-21] com.eucalyptus.util.EucalyptusCloudException: tgtadm: invalid request
I am unable to reproduce this issue.
I installed the Eucalyptus front end components on a fresh CentOS 5.5 installation from the 2.0 RPMs, created a number of volumes, and then rebooted the host and restarted eucalyptus-cloud. I am able to use my volumes successfully.
Can you provide some more info so we can reproduce this issue? Are you able to reproduce this issue from scratch? Can you write down the exact steps you took? It is not beneficial to toggle the DISABLE_ISCSI flag before ensuring that the volumes work with the current setting.
Looking forward to getting more info/details from you.
neil
Dear Neil,
We have a Sun v40z as CLC, CC, SC and three Sun v20z as NCs. They are older machines (multiple single core Opterons thus the requirement to use Xen).
After configuring the initial 'cloud' following the steps in the Eucalyptus Administrator's Guide (2.0), we are using Hybridfox to create instances, block storage volumes, etc., and then connect them.
If you attach the block storage to an instance and then pull the plug (which did happen, hence this particular thread), the block storage is orphaned. The instances, quite expectedly, have died, and new instances cannot attach to the old storage.
We have done bare metal rebuilds to remove artifacts of our tinkering and have easily reproduced the issue by a) simply halting all the servers in turn CLC to last NC; b) rebooting just the CLC/CC/SC server (leaving the NCs on); and, c) pulling the plug at the wall (already mentioned).
Best wishes,
Philip
Sorry, I am still confused. When you say pull the plug, did you pull the plug on the node and you were no longer able to attach volumes to another instance on a different node (that shouldn't happen since the volumes are marked available as soon as an instance terminates)? Or did you pull the plug on the storage controller, then start it back up and you were unable to use the volume (that I believe because the NC will report that the volume is still attached)?
Can you be more specific and provide *exact* steps from a clean installation so we can reproduce the issue?
thanks
neil
Dear Neil,
My colleague actually pulled the plug out of the CLC, CC, SC box. We later simulated it by cutting the power at the PDU for all of the boxes at once.
I have bought a VT capable box so I am going to try UEC now...
Thank you for trying,
Philip
The issue (or at least an issue) here is that the logical volumes backing iSCSI-based EBS volumes are not activated (lvchange -ay) before the iSCSI target setup (tgtadm) is attempted, so creating the LUN that points to the LV fails.
In OverlayManager, the exportVolume() method calls enableLogicalVolume() in the section that handles AoE volumes, but it does not do so in the section that handles iSCSI volumes.
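If that diagnosis is right, the LVs behind the old volumes should show up as inactive after a reboot. The sketch below turns lvscan output into the matching lvchange -ay commands and only prints them, so they can be reviewed before anything is run. The function name activation_commands is invented, and the lvscan line layout (a status word, an optional Original/Snapshot tag, then the quoted device path) is an assumption about the LVM2 tools on these hosts.

```shell
#!/bin/sh
# Hypothetical helper: read `lvscan` output on stdin and print one
# "lvchange -ay <device>" line for every logical volume reported as inactive.
# It prints the commands only; it does not activate anything itself.
activation_commands() {
    sed -n "s|^ *inactive[^']*'\([^']*\)'.*|lvchange -ay \1|p"
}

# Usage (as root, on the storage controller):
#   lvscan | activation_commands          # review the list
#   lvscan | activation_commands | sh     # then run it
```

This is a manual workaround sketch, not a confirmed fix. Activating the LVs does not by itself recreate the iSCSI targets, so the SC would presumably still have to re-export the volumes afterwards.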
Hi guys, I think that I am having the same problem here. I have a CLC-SC-CC-Walrus box and an NC box. They are configured for multi-clustering and the CLC machine has 2 NICs. Both boxes were based on UEC 10.04 LTS, but I was having some stability problems concerning the SC.
So I decided to upgrade both boxes to 10.10 Maverick and Eucalyptus 2.0, also to test how the upgrade goes. Unfortunately, the upgrade did not go so well. The NC went fine, but the CLC machine had serious IRQ management problems, which made the box extremely slow at startup. Only after restarting Eucalyptus (which could take 1 hour) did the machine become responsive again, including Eucalyptus itself. Really weird.
Yesterday I upgraded the kernel with the most recent release (must be from this week) though and it seems to have solved this IRQ problem at least. So now the box is responding properly again. I can start VI's again now, I can login to the instance, I can create new volumes and attach them.. but:
- Previous volumes cannot be attached. I get the following errors:
# tail -f /var/log/eucalyptus/cc.log | grep ERROR
[Thu Nov 25 15:18:32 2010][010580][EUCAERROR ] ERROR: AttachVolume returned an error
[Thu Nov 25 15:18:32 2010] ERROR: doAttachVolume() returned FAIL
# tail -f /var/log/eucalyptus/cloud-output.log | grep ERROR
15:18:32 DEBUG 112 VolumeAttachCallback | com.eucalyptus.util.EucalyptusClusterException: [AttachVolumeResponseType attachedVolume=AttachedVolume null null null null null Thu Nov 25 15:18:32 CET 2010 correlationId=null userId=testUser effectiveUserId=null _return=false statusMessage=ERROR]
15:18:32 ERROR 271 QueuedEventCallback | Caught exception in asynchronous response handler.
# tail -f /var/log/eucalyptus/nc.log | grep ERR
[Thu Nov 25 15:18:27 2010][002259][EUCAERROR ] ERROR: connect_iscsi_target failed
[Thu Nov 25 15:18:27 2010][002259][EUCAERROR ] AttachVolume(): failed to connect to iscsi target
[Thu Nov 25 15:18:27 2010][002259][EUCAERROR ] ERROR: doAttachVolume() failed error=1
- If I create a new volume, it gets created, but I get the following error:
# tail -f /var/log/eucalyptus/cloud-output.log | grep ERROR
14:42:17 ERROR 94 SystemUtil | com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap tgtadm --lld iscsi --op show --mode target --tid 53 error: tgtadm: can't find the target
- If I try to attach the above volume, then I don't get any errors at all. The volume gets attached according to the CC, but connectivity to the instance is completely disrupted. I cannot log in anymore unless I reboot the instance, but then I cannot see the attached volume. Hybridfox still shows the volume as attached, though, and if I try to detach it, I get the following errors:
# tail -f /var/log/eucalyptus/cc.log | grep ERROR
[Thu Nov 25 15:22:57 2010][012079][EUCAERROR ] ERROR: DetachVolume returned an error
[Thu Nov 25 15:22:57 2010] ERROR: doDetachVolume() returned FAIL
# tail -f /var/log/eucalyptus/cloud-output.log | grep ERROR
15:22:57 DEBUG 99 VolumeDetachCallback | com.eucalyptus.util.EucalyptusClusterException: [DetachVolumeResponseType detachedVolume=AttachedVolume null null null null null Thu Nov 25 15:22:57 CET 2010 correlationId=null userId=testUser effectiveUserId=null _return=false statusMessage=ERROR]
15:22:57 ERROR 271 QueuedEventCallback | Caught exception in asynchronous response handler.
# tail -f /var/log/eucalyptus/nc.log | grep ERR
[Thu Nov 25 15:21:53 2010][002259][EUCAERROR ] libvirt: internal error '//,PUB.LIC.ADD.RESS,iqn.2009-06.com.eucalyptus.Cluster-CC01:store37,eJlnlcel1kdY++GE8l9AkOQsRPyWAcbVLl2q4F28e27maX1P5VsldPCFDogD2j9vrnP/h4SKkYMhrDCPp9yhZpoqrE7SolZgiKSDiJJTPwW22rl95TjQ440Q6qXN9HxptqPoezTtahUaM+fKT87dmkP0nmLNUMCHHTeg5Jaye3xUzJNHS3Vm3c11BWV0iDgNj2kLkiP8ZjQwlnJ7W/7aU72FbxwTEPhQXKWr3jpwDksaB2950gvTF2JgY9UjvQF2d/wIXiDOUANIBFYi138I87XLUttf6afccwW42+KYsRANc1z870wk9HYPqBfNK2kk2bHA+RDNM5RN0sFovUf4YQ==' does not exist (code=1)
[Thu Nov 25 15:21:53 2010][002259][EUCAERROR ] virDomainAttachDevice() failed (err=-1) XML={driver name='phy'/}{source dev='//,PUB.LIC.ADD.RESS,iqn.2009-06.com.eucalyptus.Cluster-CC01:store37,eJlnlcel1kdY++GE8l9AkOQsRPyWAcbVLl2q4F28e27maX1P5VsldPCFDogD2j9vrnP/h4SKkYMhrDCPp9yhZpoqrE7SolZgiKSDiJJTPwW22rl95TjQ440Q6qXN9HxptqPoezTtahUaM+fKT87dmkP0nmLNUMCHHTeg5Jaye3xUzJNHS3Vm3c11BWV0iDgNj2kLkiP8ZjQwlnJ7W/7aU72FbxwTEPhQXKWr3jpwDksaB2950gvTF2JgY9UjvQF2d/wIXiDOUANIBFYi138I87XLUttf6afccwW42+KYsRANc1z870wk9HYPqBfNK2kk2bHA+RDNM5RN0sFovUf4YQ=='/}{target dev='vdb'/}
[Thu Nov 25 15:22:52 2010][002259][EUCAERROR ] libvirt: operation failed: disk vdb not found (code=9)
[Thu Nov 25 15:22:52 2010][002259][EUCAERROR ] virDomainDetachDevice() failed (err=-1) XML= {disk type='block'}
[Thu Nov 25 15:22:52 2010][002259][EUCAERROR ] ERROR: doDetachVolume() failed error=1
I don't know if the Maverick upgrade was supposed to work smoothly, but it didn't in my case. I hope the above helps to understand more about this important issue and sheds some light on the AoE/iSCSI world.
I am looking forward to solving the problem myself as well, but I already suspect that I will have to reinstall everything from scratch. Any advice would be highly appreciated.
Thank you very much for your help!
Best regards!
TritoLux
I reinstalled a Maverick cloud from scratch with the same multi-cluster config as above. I was able to reproduce the IRQ problem. It happens only on one specific server and only with kernel 2.6.35-22. After I performed a dist-upgrade to kernel 2.6.35-23, the slow IRQ response was solved again. It only occurred on the CLC, though, not on the NC. So it is highly likely hardware/kernel related.
I was also able to reproduce the connectivity loss problem after attaching a volume. It only happens when iSCSI is active in combination with VIRTIO_NET set to 1. If I disable VIRTIO_NET, then the connectivity doesn't get lost anymore when I attach a volume. It doesn't seem to make any difference if VIRTIO_DISK or VIRTIO_ROOT are active or not.
Losing the ability to attach previously created volumes after a CLC-CC-SC reboot is a persistent problem as well, even after a fresh reinstall.
My conclusion is that the problems reported in my previous post were not due to a failed upgrade, but they are more related to a kernel-maverick-iSCSI-eucalyptus 2.0 combination.
I will have to fall back to AoE now and hope that it goes better, unless somebody would kindly provide me with a solution to above problems.
Kind regards,
TritoLux
Sorry if I bump this post so early again. I am trying to disable iSCSI and use AoE, but it seems that whatever I do with the config, iSCSI just stays active.
I set DISABLE_ISCSI="Y" on the CLC-SC-CC box and on the NC. I also tried disabling all VIRTIO options, and of course I tried to cleanrestart all controllers several times and rebooted all servers several times as well. When I try to attach a volume now, the CC tells me that everything is fine: no errors are reported and the volumes get attached, both old and new, but no new devices/volumes appear on the instance.
aoe-stat and aoe-discover on the node return empty results, but when I attach a volume, the NC reports the following line:
# tail -f /var/log/syslog | grep iscsi
Dec 2 02:43:53 dC-CC01-NC01 iscsid: connection3:0 is operational now
It seems that iSCSI did not get disabled at all.
Am I missing anything? Any idea how to use pure AoE again?
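One thing worth ruling out is that the setting differs between hosts or is overridden further down the file. The sketch below reads the effective value of a key from a shell-style config file, taking the last assignment as the one that wins. The function name conf_value is invented, and /etc/eucalyptus/eucalyptus.conf is assumed to be the config location on these installs.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a KEY="value" style config
# file. If the key is assigned more than once, the last assignment is used.
# Commented-out lines (starting with '#') are ignored by the anchored pattern.
conf_value() {
    grep "^$1=" "$2" | tail -n 1 | cut -d= -f2- | tr -d '"'
}

# Usage (run on the CLC-SC-CC box and on every NC, then compare):
#   conf_value DISABLE_ISCSI /etc/eucalyptus/eucalyptus.conf
```

A matching "Y" everywhere would at least confirm the configuration side, leaving the behaviour of already-created volumes as the open question.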
Kind regards,
TritoLux
It seems that once a volume is created as AoE or iSCSI, it remains that way for ever. Did you try creating a new volume after disabling iSCSI?
I have actually successfully transitioned some volumes in the other direction (iSCSI to AoE), but it required hacking of the database - not recommended for the faint-of-heart...
Thanks dseven for your answer. Yes, I did try to create new volumes as well, of course. But the volumes that get created now do not appear in the instance (I checked both sd* and vd*), even though I receive no error messages or warnings in cc.log or nc.log. The only difference is that, after several clean restarts and server reboots, I finally get no iSCSI connection created (on the NC), but no AoE device gets created either, as aoe-stat is still empty. According to the CC, a new volume gets properly attached, but it seems that the NC doesn't know about it.
I still get this error on the CC though:
# tail -f /var/log/eucalyptus/cloud-error.log
11:30:22 ERROR [SystemUtil:pool-8-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap tgtadm --lld iscsi --op show --mode target --tid 4 error: tgtadm: can't find the target
I also increased the number of loop devices to 255 just to be sure.
At the moment the situation that I am experiencing is the following:
iSCSI: seems to work fine, but the CC-SC box must not be rebooted; otherwise none of the existing volumes can be attached anymore.
AoE: existing volumes cannot get attached. New volumes get attached but they do not appear within the instance.
Basically, I can't use EBS at the moment.
Any further help would be appreciated.
Best regards,
TritoLux
TritoLux says...
I still get this error on the CC though:
# tail -f /var/log/eucalyptus/cloud-error.log
11:30:22 ERROR [SystemUtil:pool-8-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap tgtadm --lld iscsi --op show --mode target --tid 4 error: tgtadm: can't find the target
That is not an error, from what I can tell. The SC checks if the TID is available before it attempts to grab/allocate it. The "error" is really the SC checking the status, but I agree it should be reported more "nicely."
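Taking that reading at face value, the probe lines can be filtered out when scanning the log so that the remaining errors stand out. This is a sketch only: the function name real_errors is invented, and the pattern matches just the "--op show --mode target ... can't find the target" wording quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the ERROR lines of a cloud-error.log, leaving out
# the tgtadm "show target" probes that fail because the TID is still unused.
real_errors() {
    grep ' ERROR ' "$1" |
        grep -v -e "--op show --mode target --tid [0-9]* error: tgtadm: can't find the target"
}

# Usage:
#   real_errors /var/log/eucalyptus/cloud-error.log
```

Lines such as "tgtadm: invalid request" or "Unable to delete target" are left in, since nothing here says those are harmless.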
Did you check nc.log for errors? If the NC thinks that the volume is attached (as indicated in nc.log), then the issue is in your image (do you have acpiphp loaded for instance?) Rebooting/turning services on and off is unlikely to help you get to the bottom of this...it will only cause more confusion :)
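For the acpiphp question, a quick check inside the running instance is to look for the module in /proc/modules. The sketch below does that; the function name has_module is invented, and the optional second argument exists only so the check can be pointed at a saved copy of the module list.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the named kernel module is
# listed in a /proc/modules style file, fail (exit status 1) otherwise.
# $1 = module name, $2 = file to read (defaults to /proc/modules).
has_module() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage (inside the instance, as root):
#   has_module acpiphp || modprobe acpiphp
```

A kernel with PCI hotplug support built in will not list acpiphp as a module, so a negative result here is a hint rather than proof that hotplug is unavailable.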
Hi Neil, thank you for your response.
You may not consider it an "error", but we still cannot use EBS in a production environment, neither with iSCSI nor with AoE, due to the problems I described above.
Since this is preventing me from deploying my cloud, I would really appreciate some further help to find a solution asap.
I can provide you with any details you may ask if what I posted is not enough.
Thank you very much in advance!
Best regards,
TritoLux
p.s.:
I would appreciate understanding what other options may be available in terms of technical consultancy. I need to have a stable managed-novlan cloud running before the 10th of December, and I would not mind having an in-depth (paid if necessary) second opinion on the configuration we have chosen. We would like to understand whether the problems we are having are due to faults in my configuration or to software bugs. Thank you again.
Neil, thanks for adding up to your comment. You can find more logs about my problem in my post from Thu, 11/25/2010 - 16:47. I am only using images from the UEC Store actually. What I noticed though is that if I start an instance then it never comes up correctly the first time, and I always have to reboot it before I can use it properly.
I have been installing clouds for 9 months by now, with different configs and different host OS. I always follow the instructions you provide us with step by step. But this is the first 2.0 cloud I install and I am having serious EBS problems. In what controller box should I check if acpiphp is loaded? I had to clean reload services due to config changes, which were necessary for testing. I hope you will help me to shed some light on my issue, please.
Regards, TritoLux
p.s. If you are referring to the NC log reported in my post from Thu, 12/02/2010 - 04:24, then that log shows that the NC responds to an iSCSI request, even though I had disabled iSCSI completely in all configs. However, if I restart the network interfaces, I see that iSCSI is still active. After a couple of reboots of the node, the logs no longer report any iSCSI messages, but AoE is also not active, as I already said above. That's why no volume ever gets attached. I would like to use EBS with either AoE or iSCSI, whichever works first and well enough for a production environment.
Here I post some more details about my EBS problem. What follows only concerns troubleshooting details regarding a cloud config based on AoE. The original iSCSI setting was disabled since it came up with different problems (everything works fine until the SC is rebooted, as the original post states).
First of all, I found out that the startup problem, where an image needs to be rebooted after its first start, occurs with the 10.04 64-bit image from the UEC Store but not with the 9.10 64-bit image, which runs fine immediately.
However, these EBS/AoE issues occur with both images in the same way. I hope that what follows will help to better understand the problem:
NODE CONTROLLER
config
EUCALYPTUS="/"
EUCA_USER="eucalyptus"
DISABLE_DNS="Y"
CLOUD_OPTS="-Xmx512m"
DISABLE_EBS="N"
DISABLE_ISCSI="Y"
ENABLE_WS_SECURITY="Y"
LOGLEVEL="DEBUG"
VNET_PUBINTERFACE="br1"
VNET_PRIVINTERFACE="br1"
VNET_MODE="MANAGED-NOVLAN"
CC_PORT="8774"
SCHEDPOLICY="ROUNDROBIN"
POWER_IDLETHRESH="300"
POWER_WAKETHRESH="300"
NC_SERVICE="axis2/services/EucalyptusNC"
VNET_DHCPDAEMON="/usr/sbin/dhcpd3"
VNET_DHCPUSER="dhcpd"
DISABLE_TUNNELLING="N"
NODES=""
VNET_ADDRSPERNET="32"
NC_PORT="8775"
HYPERVISOR="kvm"
MANUAL_INSTANCES_CLEANUP=0
VNET_BRIDGE="br1"
INSTANCE_PATH="/var/lib/eucalyptus/instances/"
USE_VIRTIO_NET="0"
USE_VIRTIO_DISK="0"
USE_VIRTIO_ROOT="0"
MAX_CORES="16"
nc.log
When I attach a freshly created volume, I get the following in the nc.log:
[Fri Dec 3 23:35:44 2010][001818][EUCAINFO ] doAttachVolume() invoked (id=i-3F150777 vol=vol-5902061A remote=//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ== local=/dev/sdb)
[Fri Dec 3 23:35:44 2010][001818][EUCAINFO ] connect_iscsi_target invoked (dev_string=//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==)
[Fri Dec 3 23:35:44 2010][001818][EUCADEBUG ] system_output(): [//usr/lib/eucalyptus/euca_rootwrap //usr/share/eucalyptus/connect_iscsitarget.pl //,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==]
[Fri Dec 3 23:35:45 2010][001818][EUCAINFO ] Attached device: /dev/sdb
[Fri Dec 3 23:35:46 2010][001818][EUCAINFO ] attached //,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ== to sdb in domain i-3F150777
When I boot another instance (Ubuntu 9.10 64-bit), I lose connectivity with the first one (Ubuntu 10.04 64-bit), but this does not happen if I reverse the booting order. In order to restore connectivity, I have to reboot the 10.04 instance (volume still attached), and I get the following errors:
[Sat Dec 4 00:23:50 2010][001818][EUCAINFO ] doRebootInstance() invoked (id=i-3F150777)
[Sat Dec 4 00:23:52 2010][001818][EUCAERROR ] libvirt: internal error '//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==' does not exist (code=1)
[Sat Dec 4 00:23:52 2010][001818][EUCAERROR ] virDomainAttachDevice() failed (err=-1) XML=
Afterwards, detaching the faulty volume gives me the following (the volume then remains attached):
[Sat Dec 4 00:29:01 2010][001818][EUCAINFO ] doDetachVolume() invoked (id=i-3F150777 vol=vol-5902061A remote=//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ== local=/dev/sdb force=0)
[Sat Dec 4 00:29:01 2010][001818][EUCAINFO ] get_iscsi_target invoked (dev_string=//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==)
[Sat Dec 4 00:29:01 2010][001818][EUCADEBUG ] system_output(): [//usr/lib/eucalyptus/euca_rootwrap //usr/share/eucalyptus/get_iscsitarget.pl //,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==]
[Sat Dec 4 00:29:01 2010][001818][EUCAINFO ] Device: /dev/sdb
[Sat Dec 4 00:29:02 2010][001818][EUCAERROR ] libvirt: operation failed: disk sdb not found (code=9)
[Sat Dec 4 00:29:02 2010][001818][EUCAERROR ] virDomainDetachDevice() failed (err=-1) XML=
[Sat Dec 4 00:29:02 2010][001818][EUCAINFO ] disconnect_iscsi_target invoked (dev_string=//,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==)
[Sat Dec 4 00:29:02 2010][001818][EUCAINFO ] vrun(): [//usr/lib/eucalyptus/euca_rootwrap //usr/share/eucalyptus/disconnect_iscsitarget.pl //,10.8.0.199,iqn.2009-06.com.eucalyptus.dCluster-CC01:store5,Ou87C9jm823WAohstp4ZXpfM/W6HOX2liFYvgqCTQdbbXe+wTtyMShZijEy0hMmBpc/mN482h8Rc8Bf1gNB5a/KL/9QF591FbFTh14wL+vzu7jcTAkoMzZUYncny6BPeB9vFlOCUX7RtveFxWIFGDmzV6AiYq6bNqOaTkmSyTUkvKEmMJwa4Zn4NrqyxSmVfLRiCGQv9fWo3jnoN7b8revNakfeRsd6KMVa0oJy6oPA6CBcGx/P4D+67iA367sJ/7fqvU2dEXPdy1i/cfzFB4OSka9MmpW5I3IeT2Lhzr3R0eoNu232QRgiuCK1hJcuTzZjSAh8Lv+Vhzh/x0QmyPQ==]
[Sat Dec 4 00:29:03 2010][001818][EUCAERROR ] ERROR: doDetachVolume() failed error=1
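The long remote device strings make these nc.log excerpts hard to scan. Below is a minimal sketch, assuming the log format shown in this post, that prints only the EUCAERROR lines with the long encoded blob shortened, and reports whether the NC is still being handed iSCSI targets (an `iqn.` name in the remote string). The function names are made up for illustration and are not part of Eucalyptus.

```shell
# Hypothetical helpers; both only read a log passed on stdin.

# Print EUCAERROR lines, cutting any token longer than 60 characters
# (the encoded blob inside the remote device string) down to 60 plus "...".
nc_errors() {
    grep 'EUCAERROR' | sed 's/\([^ ,]\{60\}\)[^ ,]*/\1.../g'
}

# Succeed if any request in the log carries an iSCSI qualified name
# in its remote= device string.
saw_iscsi_target() {
    grep -q 'remote=[^ ]*iqn\.'
}
```

Possible usage on the NC, assuming the log lives at /var/log/eucalyptus/nc.log: `nc_errors < /var/log/eucalyptus/nc.log` and `saw_iscsi_target < /var/log/eucalyptus/nc.log && echo "NC is still receiving iSCSI targets"`.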
# la -Al /dev/etherd/
total 0
c-w--w---- 1 root disk 152, 3 2010-12-03 23:08 discover
cr--r----- 1 root disk 152, 2 2010-12-03 23:08 err
c-w--w---- 1 root disk 152, 6 2010-12-03 23:08 flush
c-w--w---- 1 root disk 152, 4 2010-12-03 23:08 interfaces
c-w--w---- 1 root disk 152, 5 2010-12-03 23:08 revalidate
aoe-discover and aoe-stat show empty records, even though a volume is attached to an instance, according to the CC:
$ euca-describe-volumes
VOLUME vol-5F2A064E 1 dCluster-CC01 in-use 2010-12-03T23:56:23.536Z
ATTACHMENT vol-5F2A064E i-4CFF0912 /dev/sdb 2010-12-03T23:56:39.856Z
VOLUME vol-537B05F4 5 dCluster-CC01 available 2010-12-01T13:56:53.495Z
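To make the mismatch easier to spot, here is a small sketch that pulls the in-use volume IDs out of the euca-describe-volumes output, so they can be compared by hand with what aoe-stat shows on the node. It assumes the whitespace-separated layout shown above; `in_use_volumes` is a made-up name.

```shell
# Hypothetical helper: read euca-describe-volumes output on stdin and
# print the ID of every volume whose VOLUME line carries the state "in-use".
in_use_volumes() {
    awk '$1 == "VOLUME" {
        for (i = 3; i <= NF; i++)
            if ($i == "in-use") { print $2; break }
    }'
}
```

Possible usage: `euca-describe-volumes | in_use_volumes`. If this prints volume IDs while aoe-stat on the NC prints nothing, the cloud believes a volume is attached that the node cannot see over AoE.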
CLUSTER CONTROLLER
config
EUCALYPTUS="/"
EUCA_USER="eucalyptus"
DISABLE_DNS="Y"
CLOUD_OPTS="-Xmx512m"
DISABLE_EBS="N"
DISABLE_ISCSI="Y"
ENABLE_WS_SECURITY="Y"
LOGLEVEL="DEBUG"
VNET_PUBINTERFACE="br0"
VNET_PRIVINTERFACE="br1"
VNET_MODE="MANAGED-NOVLAN"
CC_PORT="8774"
SCHEDPOLICY="ROUNDROBIN"
POWER_IDLETHRESH="300"
POWER_WAKETHRESH="300"
NC_SERVICE="axis2/services/EucalyptusNC"
VNET_DHCPDAEMON="/usr/sbin/dhcpd3"
VNET_DHCPUSER="dhcpd"
DISABLE_TUNNELLING="N"
NODES=""
VNET_ADDRSPERNET="32"
NC_PORT="8775"
HYPERVISOR="kvm"
MANUAL_INSTANCES_CLEANUP=0
VNET_BRIDGE="br1"
INSTANCE_PATH="/var/lib/eucalyptus/instances/"
USE_VIRTIO_NET="0"
USE_VIRTIO_DISK="0"
USE_VIRTIO_ROOT="0"
cc.log
[Fri Dec 3 23:35:11 2010][016896][EUCAERROR ] ERROR: AttachVolume returned an error
[Fri Dec 3 23:35:11 2010] ERROR: doAttachVolume() returned FAIL
vblade does not seem to be running on the CC. If it is supposed to be essential, then this might be a potential root cause:
# ps aux | grep vblade
root 28097 0.0 0.0 8952 880 pts/0 S+ 01:44 0:00 grep --color=auto vblade
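Note that the only line returned above is the grep itself. A minimal sketch of a check that cannot match itself, plus a check for the aoe kernel module, is given below. The helper names are made up, and whether vblade should run on this particular box is an open question in this thread, not something the sketch decides.

```shell
# Succeed if a process with exactly this name is running.
# pgrep -x matches the whole process name, so the check never matches itself.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Succeed if the named kernel module is listed in a /proc/modules style file.
# The second argument defaults to /proc/modules and exists mainly for testing.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

Possible usage on the host expected to export the volumes: `is_running vblade || echo "vblade is not running"`, and on the node: `module_loaded aoe || echo "aoe module is not loaded"`.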
LIBVIRT LOGS
# cat /var/log/libvirt/qemu/i-4CFF0912.log
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 900 -smp 1,sockets=1,cores=1,threads=1 -name i-4CFF0912 -uuid 406b3fd4-d57f-82a5-c504-389e1e2c5d51 -nographic -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/i-4CFF0912.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot c -kernel /var/lib/eucalyptus/instances//admin/i-4CFF0912/kernel -initrd /var/lib/eucalyptus/instances//admin/i-4CFF0912/ramdisk -append root=/dev/sda1 console=ttyS0 -device lsi,id=scsi0,bus=pci.0,addr=0x3 -drive file=/var/lib/eucalyptus/instances//admin/i-4CFF0912/disk,if=none,id=drive-scsi0-0-0,boot=on,format=raw -device scsi-disk,bus=scsi0.0,scsi-id=0,drive=drive-scsi0-0-0,id=scsi0-0-0 -device e1000,vlan=0,id=net0,mac=d0:0d:4c:ff:09:12,bus=pci.0,addr=0x2 -net tap,fd=46,vlan=0,name=hostnet0 -chardev file,id=serial0,path=/var/lib/eucalyptus/instances//admin/i-4CFF0912/console.log -device isa-serial,chardev=serial0 -usb -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4
pci_add_option_rom: failed to find romfile "pxe-e1000.bin"
lsi_scsi: error: Unimplemented message 0x0c
VIRTUAL INSTANCE
acpi seems active:
ubuntu@172:~$ cat /var/log/dmesg | grep ACPI
[ 0.000000] ACPI: RSDP 00000000000fdb70 00014 (v00 BOCHS )
[ 0.000000] ACPI: RSDT 00000000383fde30 00034 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: FACP 00000000383ffe70 00074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.000000] ACPI: DSDT 00000000383fdfd0 01E22 (v01 BXPC BXDSDT 00000001 INTL 20090123)
[ 0.000000] ACPI: FACS 00000000383ffe00 00040
[ 0.000000] ACPI: SSDT 00000000383fdf90 00037 (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: APIC 00000000383fdeb0 00072 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
[ 0.000000] ACPI: HPET 00000000383fde70 00038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: PM-Timer IO Port: 0xb008
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[ 0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[ 0.000000] ACPI: IRQ0 used by override.
[ 0.000000] ACPI: IRQ2 used by override.
[ 0.000000] ACPI: IRQ5 used by override.
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] ACPI: IRQ10 used by override.
[ 0.000000] ACPI: IRQ11 used by override.
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[ 0.170894] ACPI: Core revision 20090521
[ 0.180000] ACPI: bus type pci registered
[ 0.180517] ACPI: EC: Look up EC in DSDT
[ 0.182124] ACPI: Interpreter enabled
[ 0.182858] ACPI: (supports S0 S3 S4 S5)
[ 0.183750] ACPI: Using IOAPIC for interrupt routing
[ 0.186715] ACPI: No dock devices found.
[ 0.187520] ACPI: PCI Root Bridge [PCI0] (0000:00)
[ 0.190768] pci 0000:00:01.3: quirk: region b000-b03f claimed by PIIX4 ACPI
[ 0.195310] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[ 0.197342] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
[ 0.198642] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
[ 0.199973] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
[ 0.200402] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
[ 0.206180] ACPI: WMI: Mapper loaded
[ 0.206894] PCI: Using ACPI for IRQ routing
[ 0.233307] pnp: PnP ACPI init
[ 0.233928] ACPI: bus type pnp registered
[ 0.235313] pnp: PnP ACPI: found 6 devices
[ 0.236126] ACPI: ACPI bus type pnp unregistered
[ 0.374779] ACPI: Power Button [PWRF]
[ 0.397890] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 1.025592] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10
[ 1.203607] ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
[ 5.211546] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
..but if I list the devices from within the instance I get no attached volumes:
ubuntu@172:~$ ls /dev/sd*
/dev/sda /dev/sda1 /dev/sda2 /dev/sda3
ubuntu@172:~$ ls /dev/vd*
ls: cannot access /dev/vd*: No such file or directory
Thank you very much for your time!
Best regards,
TritoLux
Hello there,
We set up a test environment on two 64-bit servers, with a simple CLC-CC-SC-Walrus box and a separate NC, in order to see if CentOS 5.5/Xen/Euca2.01 behaved better than UEC/Kvm/Euca2.0 regarding this EBS issue. We left every config at its default settings, apart from the network mode, as we tried both MANAGED and MANAGED-NOVLAN.
Unfortunately, the problem is still there, clearly reproduced, with pretty much the same error messages.
The only difference is that on CentOS I can occasionally attach previously created volumes, even after rebooting the machines, but all pre-existing partitions are gone anyway (together with their data, of course). In my view, this is an easily reproducible problem that has affected Eucalyptus since iSCSI was introduced. It should not be present in stable releases, considering that we could consistently reproduce this major bug on fresh Ubuntu, Debian and CentOS installations.
As this is a serious "show stopper" and it seems impossible to get back to AoE unless starting an installation from 1.6.2, I think that this is one of those problems that should be treated with the highest priority.
Any further comment from the Eucalyptus team would be surely appreciated.
Thank you very much for your time and attention.
Best regards,
TritoLux
I'm still trying to understand the exact issue you are complaining about and how to reproduce it. If you are saying on rebooting the Storage Controller (SC), volumes do not come back online if you are using iSCSI, that might be related to:
https://bugs.launchpad.net/bugs/683788
If this bug is a blocker, the workaround is to use AoE. A fix has been committed and will be in the next release. I'm still not exactly sure what is preventing you from using AoE. DISABLE_ISCSI is a parameter that only matters on the Storage Controller. It has no effect on the node. You will have to restart the SC when you change this value.
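A quick way to confirm what the SC will actually read is to pull the effective value out of its config file. This is only a sketch: `conf_value` is a made-up helper, and the path /etc/eucalyptus/eucalyptus.conf in the usage note is an assumption about where a package install keeps the file.

```shell
# Hypothetical helper: print the value of KEY from a KEY="value" style file.
# Commented-out lines are ignored; if KEY is assigned more than once, the
# last assignment is printed, as it would win if the file were sourced.
conf_value() {
    sed -n "s/^[[:space:]]*$1=\"\{0,1\}\([^\"]*\)\"\{0,1\}[[:space:]]*\$/\1/p" "$2" | tail -n 1
}
```

Possible usage on the SC: `conf_value DISABLE_ISCSI /etc/eucalyptus/eucalyptus.conf`. If this prints Y, the SC should be exporting over AoE once it has been restarted.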
hope that helps
neil
Hi Neil,
thank you for your response. I am sorry if my posts were not clear enough or came across as complaints only.
My main objective was to report what I thought was a serious problem with as many details as possible, so that Eucalyptus can be improved even more.
I am happy that a bug regarding this iSCSI issue has been already opened and it is being addressed for the next release.
However, falling back to AoE does not always work, as you can read in my post from Sat, 12/04/2010 - 03:41, where I only report error logs from our UEC systems after disabling iSCSI. I don't know what is causing them either. Maybe those logs will help you to better understand what is going on.
With neither iSCSI nor AoE working, EBS cannot be used at all, and I think you would agree that this is a serious show stopper for most users.
Fortunately, we just found out that falling back to AoE works fine on CentOS for some reason, as opposed to UEC or Debian, for instance. We haven't tested extensively yet, though. At least on CentOS we can finally use EBS with Euca2.x for the first time without losing any newly created AoE volumes after reboot, even though the original settings were based on iSCSI.
Still, we are getting thousands of weird warnings concerning AoE in the syslog, even though they have not prevented the SC from working so far. This is what we get on the NC (many times on every SC operation):
Dec 7 17:40:21 testCloud-NC01 kernel: BUG: warning at lib/kref.c:32/kref_get() (Not tainted)
Dec 7 17:40:21 testCloud-NC01 kernel:
Dec 7 17:40:21 testCloud-NC01 kernel: Call Trace:
Dec 7 17:40:21 testCloud-NC01 kernel: [] kref_get+0x38/0x3d
Dec 7 17:40:21 testCloud-NC01 kernel: [] kobject_get+0x12/0x17
Dec 7 17:40:21 testCloud-NC01 kernel: [] blk_get_queue+0x1f/0x26
Dec 7 17:40:21 testCloud-NC01 kernel: [] :blkbk:dispatch_rw_block_io+0x4db/0x5a2
Dec 7 17:40:21 testCloud-NC01 kernel: [] __next_cpu+0x19/0x28
Dec 7 17:40:21 testCloud-NC01 kernel: [] find_busiest_group+0x1db/0x44a
Dec 7 17:40:21 testCloud-NC01 kernel: [] dequeue_task+0x18/0x37
Dec 7 17:40:21 testCloud-NC01 kernel: [] deactivate_task+0x28/0x5f
Dec 7 17:40:21 testCloud-NC01 kernel: [] monotonic_clock+0x35/0x7b
Dec 7 17:40:21 testCloud-NC01 kernel: [] thread_return+0x6c/0x113
Dec 7 17:40:21 testCloud-NC01 kernel: [] kobject_cleanup+0x39/0x7e
Dec 7 17:40:21 testCloud-NC01 kernel: [] :blkbk:blkif_schedule+0x36e/0x456
Dec 7 17:40:21 testCloud-NC01 kernel: [] :blkbk:blkif_schedule+0x0/0x456
Dec 7 17:40:21 testCloud-NC01 kernel: [] keventd_create_kthread+0x0/0xc4
Dec 7 17:40:21 testCloud-NC01 kernel: [] kthread+0xfe/0x132
Dec 7 17:40:21 testCloud-NC01 kernel: [] child_rip+0xa/0x12
Dec 7 17:40:21 testCloud-NC01 kernel: [] keventd_create_kthread+0x0/0xc4
Dec 7 17:40:21 testCloud-NC01 kernel: [] kthread+0x0/0x132
Dec 7 17:40:21 testCloud-NC01 kernel: [] child_rip+0x0/0x12
And this is what I found on the net regarding this bug:
It seems this error is caused by the aoe module (v22) from linux kernel 2.6.18 in combination with 32 bit domU on 64 bit dom0. Using the aoe devices directly doesn't cause this error message. With the newest aoe module (v63) this error message is supposed to be gone.
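To see which aoe module version a node is running, something like the following sketch could be used. It assumes the module reports a plain integer version (such as 22 or 63) through modinfo; the threshold of 63 comes only from the report quoted above, and the function names are made up.

```shell
# Print the version of the aoe module as reported by modinfo
# (prints nothing if the module or modinfo is not available).
aoe_version() {
    modinfo -F version aoe 2>/dev/null
}

# Succeed if the given version is a plain number of at least 63, the version
# the quoted report says no longer shows the warning. Anything else fails.
aoe_new_enough() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 63 ]
}
```

Possible usage on the NC: `aoe_new_enough "$(aoe_version)" || echo "aoe module is older than v63 or its version is unknown"`.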
Hope this helps and that the problem is clearer now.
Best regards,
TritoLux
Hello there,
I noticed that the latest stable release is 2.0.2 now. Thank you for this update.
I would appreciate it if anybody could answer the following questions for me:
Could you confirm that bug https://bugs.launchpad.net/bugs/683788 is solved with the latest 2.0.2 release?
Would you recommend upgrading from 2.0.1 to 2.0.2 in order to safely make use of the iSCSI option and possibly be able to boot from a root partition on an EBS volume now?
It is not yet clear to me whether booting from an EBS volume is already possible with AoE, or whether iSCSI is essential for that.
Is it enough to change the fstab root entries to point to an external volume in order to boot from EBS?
What does the new option VIRTIO_ROOT exactly do to a virtual instance?
Thank you very much for your help and clarifications!
Kind regards,
TritoLux
Hi dseven, thanks for your response.
I had indeed seen the patch source, but I installed our cloud from binaries and I cannot afford to reinstall the whole cloud right now. I will have to wait a bit longer for an update (a pity, though, that such a simple and important fix was not already included). However, I doubt that I will move from AoE to iSCSI anyway, since, as far as I understand, none of the volumes already created would work afterwards. Basically, I would almost have to build a separate cloud if I wanted to move to iSCSI, which is not an ideal prospect.
I would still like to have an answer to the following questions, if you don't mind. Alternatively, if anybody could point me to some extensive documentation about VIRTIO_ROOT, that would also be appreciated. Please let me know if I need to open a new thread for these:
It is not yet clear to me whether booting from an EBS volume is already possible with AoE, or whether iSCSI is essential for that.
What does the new option VIRTIO_ROOT exactly do?
Does this create a sort of automatic chroot to an EBS volume at boot time?
I would like to understand whether the VIRTIO_ROOT option makes it possible to use an EBS volume as a permanent root partition, without having to run an I/O-intensive rsync every 5 minutes. If so, I would like confirmation that this can be achieved with AoE as well.
Thank you very much for your help and clarifications!
Cheers,
TritoLux
Hi everyone, I am having the exact same problem discussed in this thread, but with Eucalyptus version 2.0.3. Will this fix ever be applied to the mainline code for 2.0.x, or do I need to reinstall by compiling the source code and applying the patch myself? Since I am about to start creating my volumes, I would like to know whether I should go for the compiling option or stay with the Debian binaries and downgrade to AoE.
My setup:
FC-CC-SC-CLC-WALRUS on a Debian Squeeze machine: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC 2011 x86_64 GNU/Linux
NC on a Debian Squeeze machine: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC 2011 x86_64 GNU/Linux
Using Eucalyptus Debian 5.0 image (http://open.eucalyptus.com/wiki/EucalyptusUserImageCreatorGuide).
Hello,
Currently, the way to obtain this fix is to check out the source (https://code.launchpad.net/~eucalyptus-maintainers/eucalyptus/eucalyptus...) or obtain a nightly package: http://open.eucalyptus.com/participate/nightly-builds
2.0 is no longer an actively developed code branch. We are working on 3.0, which should be out in a few more weeks.
hope that helps.
neil
Hi everyone,
Thanks Neil for your answer, I wasn't aware that a 3.0 version was on its way, that's great news!!!
I was about to build the app from source when I came across this link:
https://bugs.launchpad.net/eucalyptus/+bug/733067
Although this has to be done every time the SC server is restarted, it seems to me faster than reinstalling the application, and a clean workaround to the problem. I have been trying it for a while and it seems to be working perfectly. By the way, I've added the "/run/lock" part, as that problem also seems to occur every time I restart the server.
Delete all iSCSI targets
tgtadm --mode target --op delete --tid=
Activate lvm2 volumes
vgscan
vgchange -ay
Also, after rebooting the OS we will need to run this command:
mkdir /run/lock
Otherwise we will get this error on cloud-error.log
00:25:26 ERROR [SystemUtil:pool-8-thread-1] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap pvcreate /dev/loop6 error: /var/lock/lvm: mkdir failed: No such file or directory
File-based locking initialisation failed.
Restart SC service
/etc/init.d/eucalyptus-cloud restart
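The steps above could be wrapped in a small script, sketched here under a few assumptions: the target IDs are read from `tgtadm --lld iscsi --mode target --op show` (the `--lld iscsi` option and the parsing of its "Target N:" lines are my additions), `mkdir -p` is used so the step can be repeated safely, and the directory is created before the LVM commands run. The function names are made up. It must be run as root on the SC.

```shell
# Read `tgtadm --lld iscsi --mode target --op show` output on stdin and
# print one target ID per line (target lines look like "Target 1: iqn...").
target_ids() {
    sed -n 's/^Target \([0-9][0-9]*\):.*/\1/p'
}

# Run the workaround. If DRY_RUN is set to any non-empty value, the
# commands are printed instead of executed.
recover_sc() {
    run=${DRY_RUN:+echo}
    # Delete every iSCSI target that tgtd currently knows about.
    for tid in $(tgtadm --lld iscsi --mode target --op show | target_ids); do
        $run tgtadm --lld iscsi --mode target --op delete --tid "$tid"
    done
    # Recreate the lock directory, then activate the LVM2 volumes.
    $run mkdir -p /run/lock
    $run vgscan
    $run vgchange -ay
    # Restart the SC service.
    $run /etc/init.d/eucalyptus-cloud restart
}
```

Possible usage: `DRY_RUN=1 recover_sc` to review the commands first, then `recover_sc` to apply them.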