

DATA101

[NLP] Document Similarity Analysis: (3) Jaccard Similarity

📚 Contents
1. Jaccard similarity: the concept
2. Jaccard similarity: hands-on practice

1. Jaccard similarity: the concept

Given \(2\) sets \(A\) and \(B\), the Jaccard similarity is the ratio of the number of elements in their intersection to the number of elements in their union. When the two sets are identical, the Jaccard similarity is \(1\); when they share no elements, it is \(0\). Writing the Jaccard similarity as \(J\), the formula for two sets is as follows.

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} $$

Carrying this concept over to natural language processing unchanged, each set corresponds to one document. ..
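As a minimal sketch of the formula applied to documents, the snippet below treats each document as the set of its distinct tokens. The function name `jaccard_similarity`, the lowercase whitespace tokenization, and the two sample sentences are assumptions made for illustration, not taken from the original post.

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| for the token sets of two documents."""
    # Assumption: lowercase whitespace tokenization; a real pipeline
    # might use a proper tokenizer or morphological analyzer instead.
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())

    union = tokens_a | tokens_b
    if not union:
        # Both documents are empty, so the ratio is undefined.
        # Returning 1.0 (identical sets) is a convention chosen here.
        return 1.0

    intersection = tokens_a & tokens_b
    return len(intersection) / len(union)


# Hypothetical sample documents.
doc1 = "apple banana everyone like likey watch card holder"
doc2 = "apple banana coupon passport love you"

score = jaccard_similarity(doc1, doc2)
print(score)
```

Here the two documents share 2 tokens (`apple`, `banana`) out of 12 distinct tokens overall, so the score is 2/12, roughly 0.167. The second form of the formula gives the same denominator: 8 + 6 - 2 = 12.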