
๋ชฉ๋กrmsprop (2)

DATA101

[Deep Learning] ์ตœ์ ํ™”(Optimizer): (4) Adam

1. ๊ฐœ๋…Adaptive Moment Estimation(Adam)์€ ๋”ฅ๋Ÿฌ๋‹ ์ตœ์ ํ™” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ์จ Momentum๊ณผ RMSProp์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ์ฆ‰, ํ•™์Šต์˜ ๋ฐฉํ–ฅ๊ณผ ํฌ๊ธฐ(=Learning rate)๋ฅผ ๋ชจ๋‘ ๊ฐœ์„ ํ•œ ๊ธฐ๋ฒ•์œผ๋กœ ๋”ฅ๋Ÿฌ๋‹์—์„œ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜์–ด "์˜ค๋˜" ์ตœ์ ํ™” ๊ธฐ๋ฒ•์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. ์ตœ๊ทผ์—๋Š” RAdam, AdamW๊ณผ ๊ฐ™์ด ๋”์šฑ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ด ์ œ์•ˆ๋˜์—ˆ์ง€๋งŒ, ๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹ ๋ถ„์•ผ ์ „๋ฐ˜์„ ๊ณต๋ถ€ํ•˜๋Š” ๋งˆ์Œ๊ฐ€์ง์œผ๋กœ Adam์— ๋Œ€ํ•ด ์•Œ์•„๋ด…๋‹ˆ๋‹ค.2. ์ˆ˜์‹์ˆ˜์‹๊ณผ ํ•จ๊ป˜ Adam์— ๋Œ€ํ•ด ์ž์„ธํžˆ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. $$ m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1}) \nabla f(x_{t-1}) $$$$ g_{t} = \beta_{..

[Deep Learning] ์ตœ์ ํ™”(Optimizer): (3) RMSProp

1. ๊ฐœ๋…RMSProp๋Š” ๋”ฅ๋Ÿฌ๋‹ ์ตœ์ ํ™” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ์จ Root Mean Sqaure Propagation์˜ ์•ฝ์ž๋กœ, ์•Œ์— ์—์Šคํ”„๋กญ(R.M.S.Prop)์ด๋ผ๊ณ  ์ฝ์Šต๋‹ˆ๋‹ค.โœ‹๋“ฑ์žฅ๋ฐฐ๊ฒฝ์ตœ์ ํ™” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜์ธ AdaGrad๋Š” ํ•™์Šต์ด ์ง„ํ–‰๋  ๋•Œ ํ•™์Šต๋ฅ (Learning rate)์ด ๊พธ์ค€ํžˆ ๊ฐ์†Œํ•˜๋‹ค ๋‚˜์ค‘์—๋Š” \(0\)์œผ๋กœ ์ˆ˜๋ ดํ•˜์—ฌ ํ•™์Šต์ด ๋” ์ด์ƒ ์ง„ํ–‰๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. RMSProp์€ ์ด๋Ÿฌํ•œ ํ•œ๊ณ„์ ์„ ๋ณด์™„ํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์œผ๋กœ์จ ์ œํ”„๋ฆฌ ํžŒํŠผ ๊ต์ˆ˜๊ฐ€ Coursea ๊ฐ•์˜ ์ค‘์— ๋ฐœํ‘œํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค.๐Ÿ›  ์›๋ฆฌRMSProp์€ AdaGrad์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ณ€์ˆ˜(feature)๋ณ„๋กœ ํ•™์Šต๋ฅ ์„ ์กฐ์ ˆํ•˜๋˜ ๊ธฐ์šธ๊ธฐ ์—…๋ฐ์ดํŠธ ๋ฐฉ์‹์—์„œ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด์ „ time step์—์„œ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋‹จ์ˆœํžˆ ๊ฐ™์€ ๋น„์œจ๋กœ ๋ˆ„์ ํ•˜์ง€ ์•Š๊ณ  ์ง€์ˆ˜์ด๋™..