Using Tika and the Attachments module to index PDFs, DOC files, etc.

If you would like to index the text content of file attachments on your Drupal nodes (PDFs, Word documents, spreadsheets, etc.), you can do so using either of the following modules: Search API Attachments or Apache Solr Search Attachments.

Hosted Apache Solr includes Apache Tika, a software library that extracts text from file attachments. The fastest and most customizable way to use Apache Tika is to install it on the same server as your Drupal site. If you would rather use the extraction handler running on Hosted Apache Solr's servers, configure the Solr modules as described below:

Search API Attachments

Make sure the Search API Attachments module is enabled, then add a new file search index and configure it as instructed below:

  1. Visit the Search API Attachments configuration page (admin/config/search/search_api/attachments) and set the following options:

    1. Extraction method: Solr (remote server)

    2. Solr extracting servlet path: extract/tika

  2. Create a new file search index (click 'Add index' on the admin/config/search/search_api page, and choose 'File' for 'Item type').

  3. After the index is created, add a couple of fields (like 'File name') under 'Fields to index', and save the changes.

  4. For the file search index's filters configuration, make sure 'File attachments' is checked under 'Data Alterations', then save the changes.

  5. Go back to the index's 'Fields' tab, and you should now see a 'File content' (attachments_content) field in the list, set to be indexed. This means the file attachments will be indexed correctly for this file search index.
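To see roughly what the 'Solr (remote server)' extraction method asks of the `extract/tika` path set in step 1, here is a minimal Python sketch that sends a file to Solr's extracting request handler and asks for the extracted text back without indexing it. The base URL passed in is a placeholder for your own index URL, and the function names are hypothetical. The `extractOnly`, `extractFormat`, `resource.name` and `wt` parameters are standard Solr Cell parameters, but whether the Drupal module sends exactly this set is an assumption.

```python
import json
import os
import urllib.parse
import urllib.request


def build_extract_url(base_url, servlet_path="extract/tika", filename=None):
    """Build the URL for an extract-only request to Solr's extraction handler."""
    params = {
        "extractOnly": "true",    # return the extracted text, do not index it
        "extractFormat": "text",  # plain text instead of XHTML
        "wt": "json",             # ask for a JSON response
    }
    if filename:
        # Passing the file name helps Tika guess the document type.
        params["resource.name"] = filename
    query = urllib.parse.urlencode(params)
    return f"{base_url.rstrip('/')}/{servlet_path.lstrip('/')}?{query}"


def extract_text(base_url, file_path, servlet_path="extract/tika"):
    """POST a local file to the handler and return the parsed JSON response.

    Assumption: the server accepts an unauthenticated raw-body POST. A hosted
    index may require credentials or other request options not shown here.
    """
    url = build_extract_url(
        base_url, servlet_path, filename=os.path.basename(file_path)
    )
    with open(file_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/octet-stream"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_text()` with your own index URL and a small PDF should return a JSON object that contains the extracted text alongside the file's metadata. The exact layout of that object depends on the Solr version, so inspect the response before relying on specific keys.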

Apache Solr Search Attachments

Make sure the Apache Solr Attachments module is enabled, then follow the instructions below:

  1. On the Apache Solr Search Attachments configuration page (admin/config/search/apachesolr/attachments), select 'Solr (remote server)' for the 'Extract using' option, and save the configuration.

  2. To verify that extraction is working, click the 'Test your tika extraction' button at the bottom of the same page.

If the test fails and your Drupal site's logs show an error like java.lang.ClassNotFoundException: solr.extraction.ExtractingRequestHandler, your Solr configuration may be incorrect. In some cases an older Apache Solr Search configuration is in use, and it incorrectly looks for a file named apache-solr-cell... when it should look for solr-cell... (on Solr 4.10.4 or later).
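If you have a copy of the solrconfig.xml in use, a short script can flag the stale jar name described above. This is a sketch under assumptions: it assumes the jar is loaded through `<lib>` elements in solrconfig.xml, and the function name and the attributes it inspects are illustrative choices. It only reports suspicious entries and does not rewrite the file.

```python
import xml.etree.ElementTree as ET

# Jar name prefix used by older configurations, as described above.
OLD_JAR_PREFIX = "apache-solr-cell"


def find_stale_lib_entries(solrconfig_xml):
    """Return the <lib> attributes that still mention the old jar prefix."""
    root = ET.fromstring(solrconfig_xml)
    stale = []
    for lib in root.iter("lib"):
        # A <lib> element can point at jars via dir + regex, or a single path.
        for attr in ("dir", "regex", "path"):
            value = lib.get(attr, "")
            if OLD_JAR_PREFIX in value:
                stale.append(f'{attr}="{value}"')
    return stale
```

Reading the file and passing its contents to `find_stale_lib_entries()` returns an empty list when no `<lib>` entry mentions the old prefix. Any entry it does return is a candidate for changing to the solr-cell... name.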


Source: https://hostedapachesolr.com/support/tika-attachments

