Text Analysis (15)

DATA101

[NLP] Document Similarity Analysis: (3) Jaccard Similarity

📚 Contents
1. Concept of Jaccard similarity
2. Jaccard similarity practice

1. Concept of Jaccard similarity
Jaccard similarity is defined for \(2\) sets \(A\) and \(B\) as the ratio of the size of their intersection to the size of their union. That is, the Jaccard similarity is \(1\) when the two sets are identical and \(0\) when they have no elements in common. Writing the Jaccard similarity as \(J\), the formula for two sets is as follows.
$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$
Carried over directly to natural language processing, each set corresponds to one document. ..
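As a minimal sketch of the formula above, each document can be treated as a set of tokens. The whitespace tokenization, the function name `jaccard_similarity`, and the two sample documents are assumptions made for illustration; they are not taken from the post.

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of two documents treated as sets of tokens."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    union = set_a | set_b
    if not union:
        # Both documents are empty; returning 0.0 is a convention chosen here.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical documents: 2 shared tokens out of 12 distinct tokens.
doc1 = "apple banana everyone like likey watch card holder"
doc2 = "apple banana coupon passport love you"
print(jaccard_similarity(doc1, doc2))
```

Because the documents are reduced to sets, repeated words and word order do not affect the score.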

[NLP] Document Similarity Analysis: (2) Euclidean Distance

📚 Contents
1. Concept of Euclidean distance
2. Euclidean distance practice

1. Concept of Euclidean distance
A mathematical view
Euclidean distance is a method for computing the distance between two points. When two points \(p\) and \(q\) have coordinates \((p_1, p_2, ..., p_n)\) and \((q_1, q_2, ..., q_n)\) respectively, the distance between them is given by the Euclidean distance formula below.
$$ \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} = \sqrt{\displaystyle\sum_{i=1}^{n}(q_i - p_i)^2}$$
Let us look at Euclidean distance in the simpler setting of a 2-dimensional space rather than a higher-dimensional one (see Figure 1). Two points \..
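The formula above translates almost term by term into code. The following is a small sketch using only the standard library; the function name `euclidean_distance` and the sample points are illustrative assumptions, not the post's own practice code.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical 2-dimensional points: sqrt(3^2 + 4^2) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```

For document similarity, `p` and `q` would be the numeric vectors of two documents, and a smaller distance means the documents are closer.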

[NLP] Word2Vec: (3) Skip-gram Concept and Principles

📚 Contents
1. Creating the training dataset
2. Neural network model
3. Training process
4. CBOW vs Skip-gram
5. Limitations

Introduction
Word2Vec can be broadly divided into \(2\) types according to how it is trained: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the word in the middle from its surrounding words (context words). The word in the middle is called the center word or the target word. Conversely, Skip-gram predicts the surrounding words from the center word. According to prior studies, Skip-gram is generally known to outperform CBOW; the details are covered in this post under 'Chapter 4..
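The difference between the two training schemes is easiest to see in the shape of the training examples. The sketch below only builds the (input, target) examples; it does not train a network. The function names, the `window` parameter, and the sample sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (center word, one context word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, center word)."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


# Hypothetical sentence; whitespace tokenization is assumed.
sentence = "the fat cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:4])
print(cbow_examples(sentence, window=1)[:2])
```

Skip-gram yields one example per (center, context) pair, while CBOW yields one example per center word, so the same sentence produces more Skip-gram examples.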

[NLP] Word2Vec: (1) Concept

📚 Contents
1. Concept of Word2Vec
2. Differences from sparse representations
3. Differences from language models

1. Concept of Word2Vec
As the name Word to Vector suggests, Word2Vec is one of the techniques for representing a word as a numeric vector that a computer can process. More specifically, it is a word embedding technique based on distributed representation. A distributed representation spreads the meaning of a word across a low-dimensional vector, under the distributional hypothesis. The distributional hypothesis is the assumption that "words that appear in similar contexts have similar meanings." Here, the task of turning words into vectors is called word embedding ..

[NLP] Understanding Word Embedding: Sparse and Dense Representations

📚 Contents
1. Sparse representation
2. Dense representation
3. Word embedding

Introduction
Word embedding is one of the techniques for representing a word as a vector that a computer can process; in particular, it refers to techniques that use a dense representation. The opposite of a dense representation is a sparse representation. Before turning to word embedding, this post looks at sparse and dense representations.

1. Sparse representation
A sparse representation, when data is encoded numerically as a vector or matrix, expresses only a very small number of indices with specific values, and most of the ..
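To make the contrast concrete, the sketch below builds a one-hot vector (a common sparse representation) and a short dense vector for the same word. The vocabulary, the dimension of 3, and the random values are all assumptions for illustration; real dense embeddings are learned from data rather than drawn at random.

```python
import random

# Hypothetical toy vocabulary.
vocab = ["king", "queen", "apple", "banana"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Sparse representation: vector length equals vocabulary size, with a single 1."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


# Dense representation: a short real-valued vector per word.
# The values are random placeholders standing in for learned embeddings.
random.seed(0)
embedding_dim = 3
embeddings = {
    word: [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for word in vocab
}

print(one_hot("apple"))
print(embeddings["apple"])
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, much smaller dimension regardless of vocabulary size.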

[NLP] Understanding the Document-Term Matrix (DTM)

This post covers the concept of the document-term matrix (DTM), one of the count-based word representation methods.

📚 Contents
1. DTM concept
2. DTM example
3. DTM limitations

1. DTM concept
A document-term matrix (DTM) represents, as a matrix, the frequency of every word that appears in a collection of documents (a corpus). In other words, a DTM is the Bag of Words (BoW) of multiple documents expressed as a matrix. A DTM is a kind of local representation, or discrete representation, and is a count-based word representation method.

2. DTM example
Let us look at an example DTM. Suppose there are 4 documents as follows..
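A DTM as described above can be sketched with the standard library alone. The three documents below are hypothetical stand-ins, not the four documents from the post, and the whitespace tokenization and the function name `build_dtm` are likewise assumptions.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the count of term j in document i."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Hypothetical corpus for illustration only.
docs = ["i like apples", "i like bananas", "apples and bananas and apples"]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is the Bag of Words of one document, so most entries are 0 once the vocabulary grows, which ties the DTM to the sparse representations discussed elsewhere in this category.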