Sunday, December 27, 2020

DPM

DPM troubleshooting

This post collects some notes on DPM.

Permission denied when running xrdcp

/var/log/xrootd/dpmdisk/xrootd.log

190219 06:59:31 22192 secgsi_GetSrvCertEnt: failed to load certificate for the issuing CA 'b459ca48.0|9cd75e87.0

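The contents of /etc/grid-security/certificates are wrong; copying the directory over from another machine fixes it.
The shared key may also differ between nodes; md5sum can confirm this.
Or the permissions on /data01 on the disk node are wrong.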
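
A minimal sketch of the md5sum check: the helper below compares the digests of two copies of a file. The function name same_md5 is made up for illustration, and which key file to compare is left to the caller.

```shell
# same_md5 FILE_A FILE_B
# Succeeds (exit 0) only when both files exist and have the same MD5 digest.
same_md5() {
    sum_a=$(md5sum < "$1" | awk '{print $1}')
    sum_b=$(md5sum < "$2" | awk '{print $1}')
    # An unreadable file yields an empty digest, which counts as a mismatch.
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example, after copying the disk node's key to a temporary location on the head node: same_md5 /path/to/local.key /tmp/disknode.key && echo "keys match" (both paths are placeholders).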

DN has NOT been authorized

dmlite.log:
dmlite dome processreq : DN '/C=TW/O=AS/OU=GRID/CN=cdvm2.twgrid.org' has NOT been authorized.
This is not actually a problem; entering dmlite-shell and running fsadd resolves it.

Host is not trusted, identity provided was (ID,"dpmmgr")

An entry needs to be added to /etc/hosts

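Could not connect! Can't connect to local MySQL server through socket

This error occurred on a disk node.
I was converting cdvm2, which used to be the head node, into a disk node, but forgot to clear /etc/dmlite.conf.d/mysql.cfg.
After that file was deleted, everything worked normally.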

Works sometimes, fails other times

  • Try dpm-rmfs
  • The permissions on the disk node are misconfigured

Permission refused

At this point the setup is usually almost there. Try regenerating the mapfile without the --safe option.

Hostname and certificate do not match

Fix DNS or /etc/hosts
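
A rough way to confirm the mismatch is to compare the CN of the host certificate with the fully qualified hostname. This is only a sketch: both function names are invented here, it looks only at the CN (not at subjectAltName), and the certificate path in the usage line is the conventional grid location, assumed rather than taken from these notes.

```shell
# cn_from_subject SUBJECT_LINE
# Prints the CN from an `openssl x509 -noout -subject` line. Handles both the
# older "/C=TW/.../CN=host" layout and the newer "C = TW, ..., CN = host" one.
cn_from_subject() {
    printf '%s\n' "$1" | sed -E 's|.*CN ?= ?([^,/]+).*|\1|'
}

# cert_matches_hostname CERT_FILE
# Succeeds when the certificate CN equals the output of `hostname -f`.
cert_matches_hostname() {
    subject=$(openssl x509 -noout -subject -in "$1") || return 2
    [ "$(cn_from_subject "$subject")" = "$(hostname -f)" ]
}
```

Usage (path assumed): cert_matches_hostname /etc/grid-security/hostcert.pem || echo "CN and hostname differ"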

Cannot verify AC signature! (or permission denied at gridftp)

A valid proxy was not used
or
/etc/grid-security/vomsdir/ is not set up correctly

DOME

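  • Remember to upgrade the MySQL server
  • In /etc/xrootd/xrootd-dpmredir|xrootd-dpmdisk, enable httpd and install any missing files
  • Configure /etc/dmlite.conf.d/domeadapter.conf, install any missing files, and remove adapter.conf
  • Configure /etc/domehead.conf|domedisk.conf
  • Configure /etc/sysconfig/dpminfo

rfio appears in dmlite.log

Check the items in the previous section. For example, with a log like this: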

1.7.1548810776 at #012[bt]: (3) /usr/lib64/dmlite/plugin_adapter.so : dmlite::StdRFIOHandler::StdRFIOHandler(std::string const&, int, unsigned int)+0x620 [0x7fefcc577700]#012[bt]: (4) /usr/lib64/dmlite/plugin_adapter.so : dmlite::StdRFIODriver::createIOHandler(std::string const&, int, dmlite::Extensible const&, unsigned int)+0x1be [0x7fefcc577aae]#012[bt]: (5) /lib64/libdmlite.so.0 : dmlite_fopen+0x1a1 [0x7fefd3846f41]#012[bt]: (6) /usr/lib64/httpd/modules/mod_lcgdm_disk.so : +0x658c [0x7fefd043058c]

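Use grep -r "/usr/lib64/dmlite/plugin_adapter.so" to find out what is loading it. It should be using /usr/lib64/dmlite/plugin_domeadapter.so instead. In this case the httpd configuration is the likely culprit; fix whatever the search turns up.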
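
The grep step can be wrapped into a small helper that lists only the files still pointing at the legacy plugin. This is a sketch: the function name is made up here, and the two directories in the usage line are simply the config locations that appear elsewhere in these notes.

```shell
# find_old_adapter_users DIR...
# Prints every file under the given directories that still references the
# legacy plugin_adapter.so. Returns non-zero when nothing references it.
# plugin_domeadapter.so does not match, because the pattern requires
# "plugin_" directly in front of "adapter.so".
find_old_adapter_users() {
    grep -rl -- 'plugin_adapter\.so' "$@"
}
```

Usage: find_old_adapter_users /etc/dmlite.conf.d /etc/httpd/conf.d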

Permission refuse

VOMS, the certificate, or the mapfile is not set up correctly

permission denied

The directory permissions are wrong. They should be set through dmlite (for example mkdir, quotatokenset)

HTTP 409 : Conflict, File Exist

The quotatoken is not set up correctly
Or IPv6 is not configured correctly (davix failed)

No space left on device

The quotatoken size is too small. Change it with quotatokenmod <quotatoken id> path <path> pool <pool> size <x GB>
Note that the unit is GB, not G

Failed at step RUNTIME_DIRECTORY spawning /usr/bin/xrootd: File exists

/usr/lib/tmpfiles.d/xrootd.conf is misconfigured; changing xrootd in it to dpmmgr fixes this
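
A sketch of that edit, assuming the file uses the standard tmpfiles.d column layout (type, path, mode, user, group, age). The function name is hypothetical, and the helper only prints the result instead of editing the file in place.

```shell
# retarget_tmpfiles_owner FILE
# Prints FILE with the user (4th) and group (5th) columns changed from
# xrootd to dpmmgr. Comment lines are passed through untouched.
retarget_tmpfiles_owner() {
    awk '
        !/^#/ && $4 == "xrootd" { $4 = "dpmmgr" }
        !/^#/ && $5 == "xrootd" { $5 = "dpmmgr" }
        { print }
    ' "$1"
}
```

One option, not from the original notes, is to write the output to /etc/tmpfiles.d/xrootd.conf, since a file there takes precedence over the same-named file in /usr/lib/tmpfiles.d and is not overwritten by package updates, and then run systemd-tmpfiles --create /etc/tmpfiles.d/xrootd.conf.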

Intermittent failures

One of the nodes may be broken; log in to that node and investigate

Error: Could not connect to server

The firewall is not configured correctly

Error when parsing json

-[#00.000022] Error when parsing json response: {
}
No pool is defined

Error: HTTP 400 : Server Error

Server-side error (/var/log/httpd/ssl_error_log): AH01964
The hostname may be the problem (it must not contain underscores)
LogLevel needs to be changed (/etc/httpd/conf/httpd.conf and /etc/httpd/conf.d/zlcgdm-dav.conf)

Missing token on pfn: /domehead/command/dome_getidmap

Check files such as /etc/dmlite.conf.d/domeadapter.conf. The head node port may not be set, so data requests end up at httpd instead of xrootd
This also triggers the error [3005] Unable to close file ; invalid argument dpm cern

fd is a NULL pointer at , Could not open…

SELinux may not be disabled; check it

Cannot stat lfn: '/dpm/twgrid.org' err: 2 what: '[#00.000002] Entry 'twgrid.org' not found under '/dpm'' and no volatile filesystem matches.

MariaDB may not have been stopped

SSL handshake failed: Connection timed out during SSL handshake’. No response to show

MTU => 1500?
Changing it back to 9000 afterwards made it work again, for no obvious reason
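
To check whether a given MTU actually passes end to end, one option is a ping with the Don't Fragment bit set. The sketch below assumes IPv4 and the Linux iputils ping; the function names are made up for illustration.

```shell
# ping_payload_for_mtu MTU
# Prints the largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# path_supports_mtu HOST MTU
# Sends three pings that may not be fragmented (-M do). Failure suggests that
# some hop on the path has a smaller MTU than the one being tested.
path_supports_mtu() {
    ping -M do -c 3 -s "$(ping_payload_for_mtu "$2")" "$1"
}
```

Usage (hostname is a placeholder): path_supports_mtu disknode.example.org 9000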

xrootd will not start

  • Check the permissions on tmpfs
  • It may be: Plugin version XrdOss v4.8.5 is incompatible with XrdDPMOss v4.9.0 (must be <= 4.8.x) in osslib libXrdDPMOss.so-4.3

httpd Unregistered Authentication Agent for unix-process:1669:9621 (system bus name :1.20, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)

The CA certificates may not be installed
Installation instructions: https://wiki.egi.eu/wiki/EGI_IGTF_Release

xrootd fails to start

190412 08:52:45 13452 XrdProtocol: Protocol XrdHttp could not be loaded
------ xrootd dpmredir@cddh.twgrid.org:-1 initialization failed.
190412 08:55:55 13487 Starting on Linux 3.10.0-862.14.4.el7.x86_64

The MySQL user's privileges may not be set up correctly

[dmlite/dmlite.log] No useable identity provided

This happens right after installation; wait a while, then restart

gridftp

Make sure port 80/tcp is open in the firewall

srmv2.2

DESTINATION SRM_PUT_TURL error on the turl

When an unexpected DN shows up in the dmlite log:
service dpm restart

Error: Error while copying to SFN: f-dpmp23.grid.sinica.edu.tw:/data01/atlas/2019-05-23/dd_50M.1558585079.8.1558591240 with error: Error Contacting the remote disknode

The adapter plugin was not switched off,
which causes errors during drainfs

OPS

20190506

smstor45 high swap usage => RAID card GUI process (uses Java)

=> systemd-journald => comment out everything in /etc/systemd/journald.conf

find /proc -maxdepth 2 -path "/proc/[0-9]*/status" -readable -exec awk -v FS=":" '{process[$1]=$2;sub(/^[ \t]+/,"",process[$1]);} END {if(process["VmSwap"] && process["VmSwap"] != "0 kB") printf "%10s %-30s %20s\n",process["Pid"],process["Name"],process["VmSwap"]}' '{}' \;

Token does not validate

Jun 11 06:37:11 f-dpmp1234 httpd[32387]: {139943286126336}!!! dmlite setMessage : DmException(…):[#00.000013] Missing token on pfn: /.noindex.html at #012[bt]: (3) /usr/lib64/dmlite/plugin_domeadapter.so : dmlite::DomeIODriver::createIOHandler(std::string const&, int, dmlite::Extensible const&, unsigned int)+0xae4 [0x7f4717d507f4]#012[bt]: (4) /lib64/libdmlite.so.0 : dmlite_fopen+0x249 [0x7f471e178bb9]#012[bt]: (5) /usr/lib64/httpd/modules/mod_lcgdm_disk.so : +0x70d8 [0x7f471ab400d8]#012[bt]: (6) /usr/lib64/httpd/modules/mod_lcgdm_dav.so : +0x455f [0x7f471e64055f]
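This is an IPv6 configuration problem.

The client first asks the head node for the data, and the head node builds a token from the client's IP address.
If both the head node and the disk node are IPv4-only, there is no problem.
If the head node is on IPv4 and the disk node on IPv6, a token built from the IPv4 address is presented to a machine reached over IPv6, and validation fails.
Disabling IPv6 on the disk node fixes this.
In short, the disk node and the head node must be consistent.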
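
The consistency rule can be checked mechanically by comparing the address family that each node is reached on. This is a minimal sketch: the function names are invented here, and telling the families apart by the presence of a colon is a simplification that is enough for plain IPv4 and IPv6 literals.

```shell
# ip_family ADDRESS
# Prints "v6" for an address containing a colon, otherwise "v4".
ip_family() {
    case "$1" in
        *:*) echo v6 ;;
        *)   echo v4 ;;
    esac
}

# same_family ADDR_A ADDR_B
# Succeeds when both addresses belong to the same family.
same_family() {
    [ "$(ip_family "$1")" = "$(ip_family "$2")" ]
}
```

For example, feed it the first address that getent ahosts returns for the head node and for the disk node (hostnames are placeholders): same_family "$(getent ahosts head.example.org | awk '{print $1; exit}')" "$(getent ahosts disk.example.org | awk '{print $1; exit}')" || echo "head and disk resolve to different families"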

Timeout while draining

Port 80 must be opened in iptables

No useable identity provided

fetch-crl

qryconf reports the wrong capacity

There is a problem with the hostname

Error: The file is pinned

Wait a few hours; the file may be unpinned automatically

install new disk-node

  • dpm setup script ( testing on vm )
  • dome setup script
  • ipv6 setting
    • nmcli con
  • hostname setting
    • hostnamectl
  • zfs
    • iozone
      • download and compile from source
    • list all devices: ls -altrh /dev/disk/by-id/ | grep wwn | grep -v sdbj | awk '{print $(NF-2)}' > /tmp/disk.txt
    • cat /tmp/disk.txt | awk 'BEGIN{print "zpool create -o ashift=12 data01 \\"} NR<57{if((NR-1)%7==0){printf("%s ", "raidz1")};printf("%s ", $0);if(NR%7==0){if(NR!=56){printf("\\\n")}else{printf("\n")}}} NR>56{if(NR==57){printf("zpool add data01 spare ")}printf("%s ",$0)}' > ~/cdli/create.sh
    • bash ~/cdli/create.sh
    • parted
    • zpool add log
    • [root@hpstor14 cdli]# echo "730000000000" > /sys/module/zfs/parameters/zfs_arc_min
      [root@hpstor14 cdli]# echo "760000000000" > /sys/module/zfs/parameters/zfs_arc_max
      [root@hpstor14 cdli]# echo "760000000000" > /sys/module/zfs/parameters/zfs_arc_meta_limit
      [root@hpstor14 cdli]# echo "730000000000" > /sys/module/zfs/parameters/zfs_arc_meta_min
      [root@hpstor14 cdli]# zfs set atime=off data01
      [root@hpstor14 cdli]# zfs set compression=lz4 data01
      [root@hpstor14 cdli]# zfs set mountpoint=/data01 data01

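drain does not respond

-Parameter(s): hpstor11.grid.sinica.edu.tw, /data01, dryrun, false
Restart rfiod, and remember to deploy shift.conf!
dpm has to be restarted too!

'NSS: private key from file not found'

It seems to clear up after a while? Just wait…
Apart from that, specifying threads = 1 helps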

with error: The Headnode Dav server reported error 403 when issuing the copy

fetch-crl
/usr/libexec/edg-mkgridmap/edg-mkgridmap.pl --conf=/etc/lcgdm-mkgridmap.conf --output=/etc/lcgdm-mapfile

Davix

The PROPFIND method returns a file's metadata. The response looks roughly like this:

<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:">
<D:response xmlns:lcgdm="LCGDM:" xmlns:lp1="DAV:" xmlns:lp2="http://apache.org/dav/props/" xmlns:lp3="LCGDM:">
<D:href>/dpm/grid.sinica.edu.tw/home/atlas/atlasdatadisk/rucio/data17_13TeV/57/40/DAOD_EGAM2.20516709._000082.pool.root.1</D:href>
<D:propstat>
<D:prop>
<lcgdm:checksum.adler32>7569f3d3</lcgdm:checksum.adler32><lp1:resourcetype/>
<lp1:creationdate>2020-02-09T03:14:43Z</lp1:creationdate><lp1:getlastmodified>Sun, 09 Feb 2020 03:14:41 GMT</lp1:getlastmodified><lp3:lastaccessed>Sun, 09 Feb 2020 06:22:13 GMT</lp3:lastaccessed><lp1:getetag>d870f44-5e3f7921</lp1:getetag><lp1:getcontentlength>115902905</lp1:getcontentlength><lp1:displayname>DAOD_EGAM2.20516709._000082.pool.root.1</lp1:displayname><lp1:getcontenttype>application/x-troff-man</lp1:getcontenttype><lp1:executable>F</lp1:executable><lp2:executable>F</lp2:executable><lp1:iscollection>0</lp1:iscollection><lp3:guid></lp3:guid><lp3:mode>0100664</lp3:mode><lp3:sumtype>AD</lp3:sumtype><lp3:sumvalue>7569f3d3</lp3:sumvalue><lp3:fileid>226955076</lp3:fileid><lp3:status></lp3:status><lp3:xattr>{"checksum.adler32": "7569f3d3"}</lp3:xattr><lp1:owner>3109</lp1:owner><lp1:group>110</lp1:group></D:prop>
<D:status>HTTP/1.1 200 OK</D:status>
</D:propstat>
</D:response>
</D:multistatus>
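
To pull a single field out of a response like the one above, a small text filter is often enough. The sketch below extracts the adler32 checksum; the function name is made up, the curl line in the comment is only one possible way to issue the request, and its URL and credential paths are placeholders.

```shell
# adler32_from_propfind
# Reads a PROPFIND multistatus body on stdin and prints the value of the
# first <lcgdm:checksum.adler32> element (prints nothing if it is absent).
#
# Example (placeholders throughout):
#   curl -s -X PROPFIND -H 'Depth: 0' \
#        --capath /etc/grid-security/certificates \
#        --cert "$X509_USER_PROXY" --key "$X509_USER_PROXY" \
#        "$URL" | adler32_from_propfind
adler32_from_propfind() {
    grep -o '<lcgdm:checksum\.adler32>[0-9a-fA-F]*' | head -n 1 | sed 's/.*>//'
}
```

This is plain pattern matching rather than real XML parsing, so it relies on the element name and namespace prefix appearing exactly as in the sample response.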

HTTP-TPC does not work

timeout => check whether /var/www/proxycache has been set up correctly

could not connect to remote node

Check the /etc/gridftp.conf settings on both the head node and the disk node

Too many connection…

/etc/my.cnf is not being read
Add this line to /usr/lib/systemd/system/mariadb.service:

ExecStart=/usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --basedir=/usr

eymir.dmlite.log:May 11 10:38:08 eymir globus-gridftp-server[720]: {140281692354304}[0] dmlite Memcache MemcacheFactory : MemcacheFactory started.

/etc/dmlite.conf.d/zmemcache.conf has been enabled; empty it.

FTS transfer error

TRANSFER [70] TRANSFER globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_ftp_control_data_write failed. 500-globus_ftp_control_data_write(): Handle not in proper state. PORT 500- 500 End.

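IPv6 may not be configured correctly; check whether everything is on the same server.

Looking up an IP

Site-to-site lookups use IPv6
Everything else uses IPv4; the hostname is of no use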

TRANSFER [2] SOURCE SRM_GET_TURL error on the turl request : [SE][StatusOfGetRequest][SRM_INVALID_PATH]

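On the transfer dashboard, some of the errors are concentrated on a few machines. This may be related to a network fault, an IPv6 fault, or something similar; restarting the services on those machines resolves it.

srm is unusable after a network problem

  1. Restart the services on the srm machine
  2. Restart the legacy services on the head node (dpns & dpm)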

Wednesday, December 9, 2020

On adjusting to each other

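This article makes a lot of sense, so I am quoting it here.

Once you realize the other person is testing you, there is no need to think about passing. If you are what she wants, your true nature will speak for itself; if you are not, then stay together if it works and part ways if it does not.

The reason things did not go smoothly this time ultimately comes down to having the mindset the article describes. I bent myself to fit for the sake of an outcome and overlooked too many things. I hope to do better in the future.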