
๋ชฉ๋กMomentum (2)

DATA101

[Deep Learning] ์ตœ์ ํ™”(Optimizer): (4) Adam

1. Concept

Adaptive Moment Estimation (Adam) is a deep learning optimization technique that combines the strengths of Momentum and RMSProp. In other words, it improves both the direction and the size (i.e., the learning rate) of each update step, and it is known as the optimizer that "has been" the most widely used in deep learning. More recently, optimizers with even better performance, such as RAdam and AdamW, have been proposed, but this post covers Adam in the spirit of studying the deep learning field as a whole.

2. Formulas

Let's take a closer look at Adam along with its formulas.

$$ m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1}) \nabla f(x_{t-1}) $$
$$ g_{t} = \beta_{2} g_{t-1} + (1 - \beta_{2}) \left( \nabla f(x_{t-1}) \right)^{2} $$ …
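To make the update rule concrete, here is a minimal NumPy sketch of one Adam step (not the post's own code; the bias-correction step and the default hyperparameters are assumptions taken from the standard Adam formulation):

```python
import numpy as np

def adam_step(x, m, g, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: Momentum-style exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: RMSProp-style exponential moving average of squared gradients.
    g = beta2 * g + (1 - beta2) * grad ** 2
    # Bias correction, since m and g start at zero (t counts from 1).
    m_hat = m / (1 - beta1 ** t)
    g_hat = g / (1 - beta2 ** t)
    # Update: the first moment sets the direction, the second moment scales the step size.
    x = x - lr * m_hat / (np.sqrt(g_hat) + eps)
    return x, m, g
```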

[Deep Learning] ์ตœ์ ํ™”(Optimizer): (1) Momentum

๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ์ตœ์ ํ™”(optimizer) ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜์ธ Momentum์˜ ๊ฐœ๋…์— ๋Œ€ํ•ด ์•Œ์•„๋ด…๋‹ˆ๋‹ค. ๋จผ์ €, Momentum ๊ธฐ๋ฒ•์ด ์ œ์•ˆ๋œ ๋ฐฐ๊ฒฝ์ธ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(Gradient Descent)์˜ ํ•œ๊ณ„์ ์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ  ์•Œ์•„๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.๐Ÿ“š ๋ชฉ์ฐจ1. ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์˜ ํ•œ๊ณ„ 1.1. Local Minimum ๋ฌธ์ œ 1.2. Saddle Point ๋ฌธ์ œ2. Momentum 2.1. ๊ฐœ๋… 2.2. ์ˆ˜์‹1. ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์˜ ํ•œ๊ณ„๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(Gradient Descent)์€ ํฌ๊ฒŒ 2๊ฐ€์ง€ ํ•œ๊ณ„์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒซ์งธ, Local Minimum์— ๋น ์ง€๊ธฐ ์‰ฝ๋‹ค๋Š” ์ . ๋‘˜์งธ, ์•ˆ์žฅ์ (Saddle point)๋ฅผ ๋ฒ—์–ด๋‚˜์ง€ ๋ชปํ•œ๋‹ค๋Š” ์ . ๊ฐ๊ฐ์— ๋Œ€ํ•ด ์•Œ์•„๋ด…๋‹ˆ๋‹ค.1.1. Local Minimum..