NLP overview

딸기스무디 2022. 3. 7. 12:07

NLP

NLU : Natural Language Understanding
NLG : Natural Language Generation

Task

NLP (major conferences: ACL, EMNLP, NAACL)

Low-level parsing

tokenization, stemming

Word level

Named entity recognition (NER, identifying proper nouns), POS tagging, noun-phrase chunking, dependency parsing, coreference resolution

Sentence level

Sentiment analysis, machine translation

Multi-sentence and paragraph level

Entailment prediction, question answering, dialog systems, summarization

Text mining (major conferences: KDD, The WebConf (formerly WWW), WSDM, CIKM, ICWSM)

Extract useful information and insights from text and document data
Document clustering
Highly related to computational social science

Information retrieval (major conferences: SIGIR, WSDM, CIKM, RecSys)

Highly related to computational social science

Trend

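Text data can be treated as sequential data (much like a time series), and each word can be represented as a vector through techniques such as Word2Vec and GloVe (word embedding).
RNN-family models (LSTMs, GRUs) were the main architecture for NLP tasks.
Attention and the Transformer architecture improved performance across NLP tasks and have largely replaced RNN models.
Today, tasks are usually solved by fine-tuning a pre-trained model.
Self-supervised learning: for example, masking words in the input and training the model to predict them.
Models such as BERT and GPT-3 can be used as general-purpose models across many tasks through transfer learning.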
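
The masking objective mentioned above can be sketched in a few lines. This is a simplified illustration, not how BERT is actually implemented: the function name `mask_tokens`, the `[MASK]` placeholder, whitespace tokenization, and the 15% default masking rate are all assumptions made for the example.

```python
import random

MASK = "[MASK]"  # placeholder token (assumed name for this sketch)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a masked-word prediction task.

    inputs: the tokens with some positions replaced by MASK
    labels: the original token at masked positions, None elsewhere
            (None means "no prediction target at this position")
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this word
        else:
            inputs.append(tok)
            labels.append(None)  # unmasked positions carry no target
    return inputs, labels

tokens = "i study natural language processing every day".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3)
```

No human annotation is needed: the labels come from the text itself, which is why the setup is called self-supervised.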

Bag-of-Words

Constructing the vocabulary containing unique words
Encoding unique words to one-hot vectors
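- The Euclidean distance between any two distinct words is √2
- The cosine similarity between any two distinct words is 0
Each sentence can be represented as the sum of the one-hot vectors of its words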
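
A minimal sketch of these steps in plain Python, assuming whitespace tokenization and two toy sentences chosen only for illustration. The helper names (`build_vocab`, `one_hot`, `bag_of_words`) are hypothetical.

```python
import math

def build_vocab(sentences):
    """Map each unique word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(sentence, vocab):
    """Sum of the one-hot vectors of the words in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[vocab[word]] += 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sentences = [
    "john really really loves this movie",
    "jane really likes this song",
]
vocab = build_vocab(sentences)  # 8 unique words

john, song = one_hot("john", vocab), one_hot("song", vocab)
dist = euclidean(john, song)    # sqrt(2) for any two distinct words
sim = cosine(john, song)        # 0.0 for any two distinct words

bow = bag_of_words(sentences[0], vocab)  # [1, 2, 1, 1, 1, 0, 0, 0]
```

Because every pair of distinct words is equally far apart, one-hot vectors carry no notion of semantic similarity, and the summed vector discards word order.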

Naive Bayes Classifier for document classification

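P(c|d): the probability that document d belongs to class c. By Bayes' rule, P(c|d) = P(d|c)P(c) / P(d). Since P(d) is the same for every class, the predicted class is the c that maximizes P(d|c)P(c).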
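
As a rough sketch of how this rule becomes a classifier: under the "naive" assumption that words are conditionally independent given the class, P(d|c) is the product of P(w|c) over the words w in d. The code below works in log space to avoid underflow. The add-one (Laplace) smoothing, the decision to skip unseen words, the toy training documents, and all function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label) pairs. Collect counts for P(c) and P(w|c)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def log_score(text, label, class_counts, word_counts, vocab):
    """log P(c) + sum over words of log P(w|c), i.e. log of P(d|c)P(c)."""
    total_docs = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_docs)
    for word in text.split():
        if word not in vocab:
            continue  # simplification: ignore words never seen in training
        # add-one (Laplace) smoothing so unseen (word, class) pairs are not zero
        p = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p)
    return score

def classify(text, class_counts, word_counts, vocab):
    """Return the class c that maximizes P(d|c)P(c)."""
    return max(
        class_counts,
        key=lambda c: log_score(text, c, class_counts, word_counts, vocab),
    )

# Toy training set, invented for illustration only
docs = [
    ("image recognition with convolution", "CV"),
    ("object detection in image", "CV"),
    ("language model for translation", "NLP"),
    ("text classification with language model", "NLP"),
]
model = train(docs)
prediction = classify("translation of text", *model)
```

P(d) never appears in the code because it is a constant across classes and does not change which class has the highest score.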
