[Deep Learning] Optimization (Optimizer): (2) AdaGrad

📚 Table of Contents

1. Concept
2. Advantages
3. Disadvantages

1. Concept

AdaGrad, short for Adaptive Gradient, is one of the optimization techniques used in deep learning. Because features differ in importance, scale, and so on, applying the same learning rate to every feature is inefficient. AdaGrad was proposed from this perspective: its defining characteristic is that it adjusts the learning rate adaptively, that is, differently for each feature. Expressed in formulas, AdaGrad is as follows.

$$ g_{t} = g_{t-1} + (\nabla f(x_{t-1}))^{2} $$

$$ x_{t} = x_{t-1} - \frac{\eta}{\sqrt{g_{t} + \epsilon}} \cdot \nabla f(x_{t-1}) $$

Here, $\eta$ is the global learning rate, $g_{t}$ is the elementwise sum of squared gradients accumulated up to step $t$, and $\epsilon$ is a small constant that prevents division by zero.
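To make the update rule concrete, below is a minimal NumPy sketch of an AdaGrad step applied to a toy quadratic objective. The function name `adagrad_update`, the learning rate of 0.1, and the two-feature objective are illustrative assumptions, not part of the original post.

```python
import numpy as np

def adagrad_update(x, grad, g_acc, lr=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale the
    step size per element by 1 / sqrt(g_acc + eps)."""
    g_acc = g_acc + grad ** 2                 # g_t = g_{t-1} + (∇f(x_{t-1}))^2
    x = x - lr / np.sqrt(g_acc + eps) * grad  # x_t = x_{t-1} - η/√(g_t + ε) · ∇f(x_{t-1})
    return x, g_acc

# Toy objective (an assumption for illustration): f(x) = Σ s_i * x_i²,
# with very different scales per coordinate, so per-feature adaptation matters.
scales = np.array([1.0, 100.0])
x = np.array([1.0, 1.0])
g_acc = np.zeros_like(x)

for _ in range(100):
    grad = 2 * scales * x  # gradient of f
    x, g_acc = adagrad_update(x, grad, g_acc)

print(x)  # both coordinates shrink toward 0 despite the 100x scale gap
```

Because each coordinate divides by its own accumulated gradient history, the steeply scaled feature automatically receives a smaller effective step, which is exactly the per-feature adaptation described above.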