Hadoop unable to create new native thread

This is the Nth time I've run into this problem. My team lead is a very cautious person: whenever he uses a machine he's terrified of breaking it, so he never dares to launch many jobs or start many processes. I'm the exact opposite; I like to kick off dozens or even hundreds of processes at once and let them run.

And so this happened:

2012-12-05 16:59:50,649 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.1.26:50010, storageID=DS-1495479526-10.4.1.26-50010-1351698048013, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:691)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:133)
    at java.lang.Thread.run(Thread.java:722)

2012-12-05 16:59:50,650 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-7993801208181269223_29312 terminating
2012-12-05 16:59:50,650 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5613326197766122812_29314 src: /10.4.1.26:50409 dest: /10.4.1.26:50010
2012-12-05 16:59:50,709 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting DataXceiveServer
2012-12-05 16:59:50,709 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-1331623754726596493_29314 java.io.EOFException: while trying to read 65557 bytes
2012-12-05 16:59:50,752 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-1331623754726596493_29314 1 : Thread is interrupted.
2012-12-05 16:59:50,752 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-1331623754726596493_29314 terminating
2012-12-05 16:59:50,752 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-9012273100946910402_29314 terminating
2012-12-05 16:59:50,752 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-1331623754726596493_29314 received exception java.io.EOFException: while trying to read 65557 bytes
2012-12-05 16:59:50,753 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.1.26:50010, storageID=DS-1495479526-10.4.1.26-50010-1351698048013, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
    at java.lang.Thread.run(Thread.java:722)
2012-12-05 16:59:50,754 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-5613326197766122812_29314 terminating
2012-12-05 16:59:51,101 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Exiting DataBlockScanner thread.
2012-12-05 16:59:51,989 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.1.26:50010, storageID=DS-1495479526-10.4.1.26-50010-1351698048013, infoPort=50075, ipcPort=50020) Starting thread to transfer block blk_-8090576313760428542_29286 to 10.4.1.30:50010 10.4.1.17:50010 10.4.1.19:50010 10.4.1.14:50010
2012-12-05 16:59:51,990 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.1.26:50010, storageID=DS-1495479526-10.4.1.26-50010-1351698048013, infoPort=50075, ipcPort=50020) Starting thread to transfer block blk_2442236886486298102_29277 to 10.4.1.19:50010 10.4.1.14:50010 10.4.1.30:50010
2012-12-05 16:59:52,352 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.1.26:50010, storageID=DS-1495479526-10.4.1.26-50010-1351698048013, infoPort=50075, ipcPort=50020):Transmitted block blk_-8090576313760428542_29286 to /10.4.1.30:50010
2012-12-05 16:59:52,363 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2012-12-05 16:59:56,204 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at test26.localdomain/10.4.1.26
************************************************************/
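Despite the name, this OutOfMemoryError usually has nothing to do with the JVM heap: Thread.start() failed because the operating system refused to create another native thread. In the DataNode, DataXceiverServer spawns one thread per incoming connection, so the thread count climbs quickly under load. A quick way to see how many threads a user is actually consuming (a sketch, assuming the DataNode runs as a "hadoop" account; substitute the real user on your cluster):

# Count native threads (LWPs) owned by the given user.
$ ps -u hadoop -L -o lwp= | wc -l

Compare that number against the per-user limits below.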

My current suspicion is that ulimit did it!

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62738
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
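Two numbers stand out: open files (-n) 1024 and max user processes (-u) 1024. On Linux, the nproc limit counts threads, not just processes, because every thread is a kernel task. So a DataNode plus a few dozen of my own processes can burn through 1024 threads in no time, and the next Thread.start() dies exactly as in the log above. (On RHEL/CentOS 6, this 1024 default typically comes from /etc/security/limits.d/90-nproc.conf.) A sketch of the fix, assuming the Hadoop daemons run as the "hadoop" user; the values are common starting points, not official recommendations:

# /etc/security/limits.conf (or a drop-in file under /etc/security/limits.d/)
hadoop  soft  nproc   32768
hadoop  hard  nproc   32768
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536

These limits only apply to new login sessions, so log out and back in, verify with ulimit -u and ulimit -n, and then restart the DataNode.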
