
Natural Language Processing (16)

DATA101

Fixing the Mecab installation error: "Exception: Install MeCab in order to use it: http://konlpy.org/en/latest/install/"

👨‍💻 Introduction: This post assumes that the KoNLPy and Mecab packages are already installed. If they are not, please refer to the post below. https://heytech.tistory.com/3 ([Python/NLP] How to install KoNLPy) Mecab installation method: bash Now for the error fix itself. 🤖 Error situation: from konlpy.tag import Mecab; Mecab().nouns("헤이 테크 블로그입니다.") Mecab morph..

[NLP] Document Similarity Analysis: (3) Jaccard Similarity

📚 Contents: 1. Jaccard similarity concept 2. Jaccard similarity practice. 1. Jaccard similarity concept: Given two sets \(A\) and \(B\), Jaccard similarity is the ratio of the size of their intersection to the size of their union. That is, the Jaccard similarity is \(1\) when the two sets are identical and \(0\) when they share no elements. Writing the Jaccard similarity as \(J\), the formula for two sets is as follows. $$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$ Carried over directly to natural language processing, each set corresponds to one document. ..
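The formula in the excerpt above can be tried out with a minimal sketch. It assumes each document is reduced to a set of lowercase, whitespace-separated tokens; the function name `jaccard_similarity` and the two sample documents are hypothetical and are not taken from the original post.

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of two documents treated as sets of tokens."""
    # Assumption: simple whitespace tokenization; a real pipeline would
    # likely use a proper tokenizer or morphological analyzer.
    set_a = set(doc_a.lower().split())
    set_b = set(doc_b.lower().split())
    union = set_a | set_b
    if not union:
        # Two empty documents: return 0.0 by convention in this sketch.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical sample documents: 2 shared tokens out of 12 distinct tokens.
doc1 = "apple banana everyone like likey watch card holder"
doc2 = "apple banana coupon passport love you"
print(jaccard_similarity(doc1, doc2))
```

Because the documents become sets, repeated words and word order have no effect on the score.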

[NLP] Document Similarity Analysis: (2) Euclidean Distance

📚 Contents: 1. Euclidean distance concept 2. Euclidean distance practice. 1. Euclidean distance concept. Mathematical perspective: Euclidean distance is a way of computing the distance between two points. When two points \(p\) and \(q\) have coordinates \((p_1, p_2, ..., p_n)\) and \((q_1, q_2, ..., q_n)\), the distance between them is given by the Euclidean distance formula below. $$ \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} = \sqrt{\displaystyle\sum_{i=1}^{n}(q_i - p_i)^2}$$ Let us look at Euclidean distance in a simple two-dimensional space rather than in many dimensions (see Figure 1). Two points \..
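As a small illustration of the formula above, here is a sketch that computes the distance term by term. The function name `euclidean_distance` and the sample points are hypothetical choices made for this example.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of coordinates")
    # Sum of squared coordinate differences, then the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


# Two-dimensional example: differences of 3 and 4 give a distance of 5.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```

Unlike a similarity score, a smaller distance means the two points (or document vectors) are closer to each other.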

[NLP] Document Similarity Analysis: (1) Cosine Similarity

📚 Contents: 1. Cosine similarity concept 2. Cosine similarity practice. 1. Cosine similarity concept: Cosine similarity measures how similar two vectors are based on the angle between them. In other words, as long as text can be represented numerically, for example with DTM, TF-IDF, or Word2Vec, cosine similarity can be used to compare the similarity between documents. The closer the cosine similarity is to \(1\), the more similar the two vectors are considered to be, and it has the advantage of comparing documents of different lengths relatively fairly. As shown in Figure 1 below, when two vectors point in the same direction, that is, when the angle between them is \(0^\circ\), the cosine similarity takes its maximum value of 1. Two vectors \(A\) and \(B\)..
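To make the point about document length concrete, here is a minimal sketch using the standard definition of cosine similarity: the dot product divided by the product of the vector norms. The function name `cosine_similarity` and the term-count vectors are hypothetical and are not taken from the original post.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term-count vectors: doc_long repeats doc_short twice, so the
# two vectors point in the same direction despite the different lengths.
doc_short = [1, 1, 0]
doc_long = [2, 2, 0]
print(cosine_similarity(doc_short, doc_long))  # approximately 1

# Vectors with no terms in common are orthogonal.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Only the direction of the vectors matters here, which is why the longer document scores about the same as the shorter one.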

[NLP] Word2Vec: (3) Skip-gram Concept and Principles

📚 Contents: 1. Generating the training dataset 2. Neural network model 3. Training process 4. CBOW vs Skip-gram 5. Limitations. Introduction: Word2Vec can be broadly divided into \(2\) types according to how it is trained: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the word in the middle from its surrounding words (context words). The word in the middle is called the center word or the target word. Conversely, Skip-gram predicts the surrounding words from the center word. According to prior studies, Skip-gram is generally known to outperform CBOW; details on this are in this post's 'Chapter 4..
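The Skip-gram idea of predicting surrounding words from the center word can be sketched as a function that builds (center, context) training pairs. The window size, the function name `skipgram_pairs`, and the sample sentence are assumptions made for illustration; the original post's own dataset construction is not visible in this excerpt.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center_word, context_word) pairs for Skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context positions within `window` of the center, clipped to the sentence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sample sentence with a window of 1 word on each side.
tokens = ["the", "fat", "cat", "sat"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
# ('the', 'fat')
# ('fat', 'the')
# ('fat', 'cat')
# ('cat', 'fat')
# ('cat', 'sat')
# ('sat', 'cat')
```

Each center word yields one training pair per context word, so the input is always a single word and the target is one of its neighbors.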

[NLP] Word2Vec: (2) CBOW Concept and Principles

📚 Contents: 1. Generating the training dataset 2. Neural network model 3. Training procedure 4. CBOW vs Skip-gram 5. Limitations. Introduction: Word2Vec can be broadly divided into \(2\) types according to how it is trained: Continuous Bag of Words (CBOW) and Skip-gram. CBOW predicts the word in the middle from its surrounding words (context words). The word in the middle is called the center word or the target word. Conversely, Skip-gram predicts the surrounding words from the center word. This post covers CBOW, and the next post covers Skip-gram in detail. 1. Generating the training dataset: In CBOW, the training dataset ..
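The CBOW setup of predicting the center word from its context words can be sketched as a function that builds (context words, center word) examples. The window size, the function name `cbow_examples`, and the sample sentence are assumptions for illustration only, since the excerpt is cut off before the original procedure is described.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, center_word) examples for CBOW training."""
    examples = []
    for i, center in enumerate(tokens):
        # Up to `window` words on each side of the center word.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, center))
    return examples


# Hypothetical sample sentence with a window of 1 word on each side.
tokens = ["the", "fat", "cat", "sat"]
for context, center in cbow_examples(tokens, window=1):
    print(context, "->", center)
# ['fat'] -> the
# ['the', 'cat'] -> fat
# ['fat', 'sat'] -> cat
# ['cat'] -> sat
```

Here every example has several input words and a single target, which is the reverse of the Skip-gram arrangement.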

[NLP] Word2Vec: (1) Concept

📚 Contents: 1. Word2Vec concept 2. Differences from sparse representations 3. Differences from language models. 1. Word2Vec concept: As the name "Word to Vector" suggests, Word2Vec is one of the techniques for representing a word as a numeric vector that a computer can process. More specifically, it is a word embedding technique based on distributed representation. A distributed representation spreads the meaning of a word across a low-dimensional vector, under the distributional hypothesis (Distributional Hypothesis). The distributional hypothesis is the assumption that "words that appear in similar contexts have similar meanings." Here, the task of converting words into vectors is called word embedding..