{"id":2483,"date":"2026-05-11T16:25:25","date_gmt":"2026-05-11T08:25:25","guid":{"rendered":"https:\/\/oknomad.blog\/?p=2483"},"modified":"2026-05-11T17:09:13","modified_gmt":"2026-05-11T09:09:13","slug":"testing-code-expolu-keeps-converging-to-1e-9-and-relu-cant-converge","status":"publish","type":"post","link":"https:\/\/oknomad.blog\/?p=2483","title":{"rendered":"testing code &#8211; ExpoLU keeps converging to 1e-9 and Relu cant converge"},"content":{"rendered":"\n<p>I wrote the code below line by line manually. Although I have 2 coding agents on my VS code I just used them for advice, and they did helped and teached a lot and made me get hands on quickly. And Gemini help add some annotations for me before posting.<\/p>\n\n\n\n<p>In the test code below, ExpoLU(a1b1p2) converged to 1e-9 in about 10 tests and ReLU cant converge at all in 4 or 5 tests. You can run it to see yourself.<\/p>\n\n\n\n<p>Below is the test code for ExpoLU and ReLU, and training data of 21 pairs of x and y is generated by me manually. Grok helped me find a bug on training data normalization, but with wrong normalization ExpoLU can work well too, haha.<\/p>\n\n\n\n<p>By the way I removes norma0 for input data&#8217;s initial normalization which doesnt help.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\nimport torch.nn as nn\n\n# Device Configuration\ndevice=torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nprint(f\"Using device: {device}\")\n\n# Hyperparameters\npowerlist=&#91;1] # power list including linear base (1) and nonlinear Polynomial expansion (>1)\nacttype=1  #activation type: 1 is Expolu and 9 is Relu\nn=512 #feature number\nlayers=30 # ffn layer number at least 2\nlr=0.001\/n #learning rate\nrounds=400 #training rounds\n\n# Dataset: 21 pairs of x and y\ntdata1=torch.tensor(&#91;\n    &#91;-10,6],&#91;-9,3],&#91;-8,1],&#91;-7,-2],&#91;-6,-5],&#91;-5,-2],&#91;-4,1],&#91;-3,3],&#91;-2,5],&#91;-1,2],&#91;0,0],\n    &#91;10,6],&#91;9,3],&#91;8,1],&#91;7,-2],&#91;6,-5],&#91;5,-2],&#91;4,1],&#91;3,3],&#91;2,5],&#91;1,2],\n    ],\n    dtype=torch.float32,\n    device=device,\n    )\n\n# Global Statistics for Normalization\nxmean=tdata1&#91;:,:-1].mean()\nxstd=tdata1&#91;:,:-1].std()\nymean=tdata1&#91;:,-1:].mean()\nystd=tdata1&#91;:,-1:].std()\n\nclass activation:\n    def __init__(self, type):\n\n        def ExpoLU(x,a=1,b=1,p=2):  # acttype=1\n            return torch.where(x&lt;-b,torch.tensor(0.0,device=x.device),\n                               torch.where(x&lt;b,(1\/a)*(x+b)**p,x+(1\/a)*(2*b)**p-b))\n        def PowerLU(x,xp=4): # acttype=2\n            return torch.where(x&lt;-1,torch.tensor(0.0,device=x.device),\n                           torch.where(x&lt;1,(x+1)**xp\/(2**xp),x))\n        def ParaLU(x, scope=0.1): # acttype=3\n            return torch.where(x&lt;-scope,torch.tensor(0.0,device=x.device),\n                           torch.where(x&lt;scope,(x+scope)**2\/(4*scope),x))\n        def QuartLU(x, scope=0.1): # acttype=4\n            return torch.where(x&lt;-scope,torch.tensor(0.0,device=x.device),\n                           torch.where(x&lt;(scope\/3),(x+scope)**4*27\/(256*scope**3),x))\n        \n        funcs={\n            0:lambda x:x,\n            1:ExpoLU,\n            2:PowerLU,\n            3:ParaLU,\n            4:QuartLU,\n            8:torch.sigmoid,\n            9:torch.relu,            \n        }\n        self.act=funcs&#91;type]\n\n# Initialize global activation reference\nact=activation(acttype).act       \n\nclass ffn(nn.Module):\n    def 
class ffn(nn.Module):
    def __init__(self, power, layer, tdata):
        super().__init__()
        self.powerlist = power
        self.layernum = layer
        self.tdata = tdata
        self.act = activation(acttype).act

        maxpower = max(self.powerlist)
        # Input projection: one weight slice per power term, mapping the input
        # features (tdata.shape[1] - 1 of them) to n hidden features
        self.W0 = torch.randn(maxpower, len(tdata[0]) - 1, n, dtype=torch.float32, device=tdata.device)
        self.W0.requires_grad = True
        self.B0 = torch.randn(1, n, dtype=torch.float32, device=tdata.device)
        self.B0.requires_grad = True
        self.Wh = torch.randn(maxpower, self.layernum - 1, n, n, dtype=torch.float32, device=tdata.device)
        self.Wh.requires_grad = True
        self.Bh = torch.randn(self.layernum - 1, 1, n, dtype=torch.float32, device=tdata.device)
        self.Bh.requires_grad = True
        self.Wo = torch.randn(maxpower, n, 1, dtype=torch.float32, device=tdata.device)
        self.Wo.requires_grad = True
        self.Bo = torch.randn(1, 1, dtype=torch.float32, device=tdata.device)
        self.Bo.requires_grad = True

    # Internal Layer Normalization: resets the signal to mean 0, std 1
    def norma(self, x):
        x = (x - x.mean()) / x.std()
        return x

    # Global Dataset Normalization
    def norma0(self, x, typeid):
        if typeid == "x":
            x = (x - xmean) / xstd
        if typeid == "y":
            x = (x - ymean) / ystd
        return x

    # Input Layer: Power Transformation + Double Normalization
    def ffn0(self, x):
        firstpower = 0
        for idx, power in enumerate(self.powerlist):
            if firstpower == 0:
                xh = torch.sign(x) * abs(x**power) @ self.W0[idx] + self.B0
                firstpower = power
            else:
                xh = xh + torch.sign(x) * abs(x**power) @ self.W0[idx]
            xh = self.norma(xh)
            xh = self.act(xh)
            xh = self.norma(xh)
        return xh

    # Hidden Layers: Power Transformation + Double Normalization
    def ffnh(self, xh):
        for laynum in range(self.layernum - 1):
            firstpower = 0
            for idx, power in enumerate(self.powerlist):
                if firstpower == 0:
                    x = torch.sign(xh) * abs(xh**power) @ self.Wh[idx, laynum] + self.Bh[laynum]
                    firstpower = power
                else:
                    # accumulate the higher-power term into the running sum x
                    x = x + torch.sign(xh) * abs(xh**power) @ self.Wh[idx, laynum]
            xh = self.norma(x)
            xh = self.act(xh)
            xh = self.norma(xh)
        return xh

    # Output Layer: Final Projection
    def ffno(self, x):
        firstpower = 0
        for idx, power in enumerate(self.powerlist):
            if firstpower == 0:
                xh = torch.sign(x) * abs(x**power) @ self.Wo[idx] + self.Bo
                firstpower = power
            else:
                xh = xh + torch.sign(x) * abs(x**power) @ self.Wo[idx]
        return xh

    def forward(self, xx):
        xh = self.ffn0(xx)
        xh = self.ffnh(xh)
        ho = self.ffno(xh)
        return ho
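# --- Editor's aside, not part of the original post: a quick shape check under
# the defaults above (powerlist=[1], n=512, one input feature). W0 is
# (1, 1, 512), Wh is (1, layers-1, 512, 512), Wo is (1, 512, 1), so a single
# sample x of shape (1,) flows (1,) -> (1, 512) -> ... -> (1, 1).
_probe = ffn(powerlist, layers, tdata1)
print(_probe.forward(tdata1[0, :-1]).shape)  # expected: torch.Size([1, 1])
del _probe  # free the probe model before training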
def training_1(inputdata):
    doffn = ffn(powerlist, layers, inputdata)

    for i in range(rounds):
        printloss = 0
        for item in inputdata:
            x = item[:-1]
            y = item[-1:]

            # Dataset-level Normalization (removed, see note above; it didn't help)
            # x = doffn.norma0(x, "x")
            # y = doffn.norma0(y, "y")

            # Forward Pass & MSE Loss
            ho = doffn.forward(x)
            loss = (ho - y)**2
            loss = loss.mean()

            printloss += loss.item()  # .item() detaches the scalar for logging
            loss.backward()

            # Manual Gradient Descent (No Optimizer)
            with torch.no_grad():
                for weight in [doffn.W0, doffn.B0, doffn.Wh, doffn.Bh, doffn.Wo, doffn.Bo]:
                    if weight.grad is not None:
                        weight -= lr * weight.grad
                        weight.grad.zero_()

        print(f"round {i}, loss {printloss/len(inputdata):.12f}")

# run training
training_1(tdata1)
```
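To reproduce the ExpoLU vs ReLU comparison in a single run, one option (a sketch, not part of my original script) is to replace the final `training_1(tdata1)` call with a small loop over activation types. `ffn` reads the global `acttype` when it is constructed, so reassigning it before each call is enough; the fixed seed (an arbitrary choice here) gives both runs the same weight initialization:

```python
# Hedged sketch: compare ExpoLU (acttype=1) and ReLU (acttype=9) back to back.
# Assumes the definitions above are in scope and replaces the final
# training_1(tdata1) line; the seed value is an arbitrary choice.
for name, t in [("ExpoLU", 1), ("ReLU", 9)]:
    acttype = t            # ffn.__init__ reads this module-level global
    torch.manual_seed(0)   # identical random init for a fair comparison
    print(f"\n=== {name} (acttype={t}) ===")
    training_1(tdata1)
```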