Forum Webscript.Ru

Programming => PHP => Topic started by: NiggaInDaStudio on 08 June 2007, 11:17:16

Title: How do I make the search case-insensitive?
Posted by: NiggaInDaStudio on 08 June 2007, 11:17:16: Could you help me, please?
I have a working site search script:
http://www.teploenergoplast.ru/search.php

Searching for "РАДИАТОРЫ" finds one document, namely http://www.teploenergoplast.ru/rad/st_kermi.php,
because that page contains the word written entirely in capital letters.
Searching for "Радиаторы" finds 10 documents: the ones where the word starts with a capital letter
and all the other letters are lowercase.
Searching for "радиаторы" finds 34 documents, which include all of the previous ones (this one too: http://www.teploenergoplast.ru/rad/st_kermi.php).
How can I make the script ignore case and return everything?
Simply converting the search query to lowercase will not do,
because, for example, the query "ппу" finds 9 documents while "ППУ" finds 10, so there it is the other way around.

search.php
include "config.php";
$time1 = getmicrotime();
read_template("template.htm");
$stpos = 0;
$stype = "AND";
$query = "";
$abort = 0;

get_query();
if (count($query_arr) > 0) {
get_results();
$time3 = getmicrotime();
$time = $time3-$time1;
# print "get_results() took $time sec.";

boolean();
$time4 = getmicrotime();
$time = $time4-$time3;
$search_time = $time4-$time1;
$search_time = sprintf("%2.4f", $search_time);
# print "boolean() took $time sec.";
}

print print_template("header");

if (count($query_arr) > 0) {
if ($rescount>0) {
print print_template("results_header");
print_results();
print print_template("results_footer");
} else {
print print_template("no_results");
}
} else {
print print_template("empty_query");
}

print print_template("footer");

#=====================================================================

function get_query() {

global $HTTP_GET_VARS;
global $query, $stpos, $stype, $query_arr, $wholeword, $querymode, $stop_words_array;
global $min_length;

$query = $HTTP_GET_VARS["query"];
$stpos = $HTTP_GET_VARS["stpos"];
$stype = $HTTP_GET_VARS["stype"];

$query = strtolower($query);
$query = preg_replace("/[^a-zа-яA-ZА-Я +!]/"," ",$query);
$query_arr_dum = preg_split("/\s+/",$query);

foreach($query_arr_dum as $word) {
if (strlen($word) < $min_length) { continue; }
if (array_key_exists($word,$stop_words_array)) { continue; }
$query_arr[] = $word;
}

for ($i=0; $i<count($query_arr); $i++) {
if (preg_match("/\!/", $query_arr[$i])) { $wholeword[$i] = 1;} # WholeWord
$query_arr[$i] = preg_replace("/[\! ]/","",$query_arr[$i]);
if ($stype == "AND") { $querymode[$i] = 2;} # AND
if (preg_match ("/^\-/", $query_arr[$i])) { $querymode[$i] = 1;} # NOT
if (preg_match ("/^\+/", $query_arr[$i])) { $querymode[$i] = 2;} # AND
$query_arr[$i] = preg_replace("/^[\+\- ]/","",$query_arr[$i]);
}

if ($stpos <0) {$stpos = 0;};
}
#=====================================================================

function get_results() {

global $HASHSIZE, $INDEXING_SCHEME, $HASH, $HASHWORDS, $FINFO, $SITEWORDS, $WORD_IND;

global $query_arr, $wholeword, $querymode;
global $res, $allres, $rescount, $query_statistics;

$fp_HASH = fopen ("$HASH", "rb");
$fp_HASHWORDS = fopen ("$HASHWORDS", "rb");
$fp_SITEWORDS = fopen ("$SITEWORDS", "rb");
$fp_WORD_IND = fopen ("$WORD_IND", "rb");

for ($j=0; $j<count($query_arr); $j++) {
$query = $query_arr[$j];
$allres[$j] = array();

if ($INDEXING_SCHEME == 1) {
   $substring_length = strlen($query);
} else {
   $substring_length = 4;
}
$hash_value = abs(hash(substr($query,0,$substring_length)) % $HASHSIZE);

fseek($fp_HASH,$hash_value*4,0);
$dum = fread($fp_HASH,4);
$dum = unpack("Ndum", $dum);
fseek($fp_HASHWORDS,$dum[dum],0);
$dum = fread($fp_HASHWORDS,4);
$dum1 = unpack("Ndum", $dum);

for ($i=0; $i<$dum1[dum]; $i++) {
$dum = fread($fp_HASHWORDS,8);
$arr_dum = unpack("Nwordpos/Nfilepos",$dum);
fseek($fp_SITEWORDS,$arr_dum[wordpos],0);
$word = fgets($fp_SITEWORDS,1024);
$word = preg_replace("/\x0A/","",$word);
$word = preg_replace("/\x0D/","",$word);

if ( ($wholeword[$j]==1) && ($word != $query) ) {$word = "";};
$pos = strpos($word, $query);
if ($pos !== false) {
fseek($fp_WORD_IND,$arr_dum[filepos],0);
$dum = fread($fp_WORD_IND,4);
$dum2 = unpack("Ndum",$dum);
$dum = fread($fp_WORD_IND,$dum2[dum]*4);
for($k=0; $k<$dum2[dum]; $k++){
$zzz = unpack("Ndum",substr($dum,$k*4,4));
$allres[$j][$zzz[dum]] = 1;
}
}
};
}
for ($j=0; $j<count($query_arr); $j++) {
$found_number = count($allres[$j]);
$query_statistics .= " $query_arr[$j]-$found_number\n";
}
}
#=====================================================================

function boolean() {

global $query_arr, $querymode, $stype;
global $res, $allres, $rescount;

if (count($query_arr) == 1) {
foreach ($allres[0] as $k => $v) {
if ($k) {
$res .= pack("N",$k);
}
}
$rescount = intval(strlen($res)/4);
unset($allres);
return;
} else {

if ($stype == "AND") {
for ($i=0; $i<count($query_arr); $i++) {
if ($querymode[$i] == 2) {
$min = $i;
break;
}
}
for ($i=$min+1; $i<count($query_arr); $i++) {
if (count($allres[$i]) < count($allres[$min]) && $querymode[$i] == 2) {
$min = $i;
}
}
for ($i=0; $i<count($query_arr); $i++) {
if ($i == $min) {
continue;
}
if ($querymode[$i] == 2) {
foreach ($allres[$min] as $k => $v) {
if (array_key_exists($k,$allres[$i])) {
} else {
unset($allres[$min][$k]);
}
}
} else {
foreach ($allres[$min] as $k => $v) {
if (array_key_exists($k,$allres[$i])) {
unset($allres[$min][$k]);
}
}
}
}
foreach ($allres[$min] as $k => $v) {
if ($k) {
$res .= pack("N",$k);
}
}
$rescount = intval(strlen($res)/4);
return;
}

if ($stype == "OR") {
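for ($i=0; $i<count($query_arr); $i++) {
if ($querymode[$i] != 1) {
$max = $i;
break;
}
}
for ($i=$max+1; $i<count($query_arr); $i++) {
# compare against $max here: $min is never set in the OR branch
if (count($allres[$i]) > count($allres[$max]) && $querymode[$i] != 1) {
$max = $i;
}
}
for ($i=0; $i<count($query_arr); $i++) {
if ($i == $max) {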
continue;
}
if ($querymode[$i] != 1) {
foreach ($allres[$i] as $k => $v) {
$allres[$max][$k] = 1;
}
} else {
foreach ($allres[$i] as $k => $v) {
if (array_key_exists($k,$allres[$max])) {
unset($allres[$max][$k]);
}
}
}
}
foreach ($allres[$max] as $k => $v) {
if ($k) {
$res .= pack("N",$k);
}
}
$rescount = intval(strlen($res)/4);
return;
}
}
}
#=====================================================================

function print_results() {

global $FINFO, $FINFO_IND, $query, $stpos, $stype, $res_num, $res;
global $url, $title, $size, $description, $rescount, $next_results;

$time1 = getmicrotime();

$fp_FINFO = fopen ("$FINFO", "rb");

for ($i=$stpos; $i<$stpos+$res_num; $i++) {
if ($i >= strlen($res)/4) {break;};
$strpos = unpack("Npos",substr($res,$i*4,4));
fseek($fp_FINFO,$strpos[pos],0);
$dum = fgets($fp_FINFO,4024);
list($url, $size, $title, $description) = explode("::",$dum);
print print_template("results");
}; # for

if ($rescount <= $res_num) {$next_results = ""; return 1;}

$mhits = 20 * $res_num;
$pos2 = $stpos - $stpos % $mhits;
$pos1 = $pos2 - $mhits;
$pos3 = $pos2 + $mhits;

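if ($pos1 < 0) { $prev = ""; }
else {
# The link markup did not survive in the posted code; the href below is an
# assumption rebuilt from the parameters that get_query() reads.
$prev = "<a href=\"search.php?query=".urlencode($query)."&stpos=".$pos1."&stype=".$stype."\">PREV</a> \n";
}

if ($pos3 > $rescount) { $next = ""; }
else {
$next = "<a href=\"search.php?query=".urlencode($query)."&stpos=".$pos3."&stype=".$stype."\">NEXT</a> \n";
}

$next_results .= $prev;
$next_results .= " |\n";
for ($i=$pos2; $i<$pos3; $i += $res_num) {
if ($i >= $rescount) {break;}
$page_number = $i/$res_num+1;
if ( $i != $stpos ) {
$next_results .= "<a href=\"search.php?query=".urlencode($query)."&stpos=".$i."&stype=".$stype."\">".$page_number."</a> |\n";
} else {
$next_results .= $page_number." |\n";
}
}
$next_results .= $next;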
}
#=====================================================================

function getmicrotime(){
list($usec, $sec) = explode(" ",microtime());
return ((float)$usec + (float)$sec);
}

#=====================================================================

function hash($key) {

$chars = preg_split("//",$key);
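# preg_split("//") yields an empty string at both ends, so real characters sit at 1..count-2
for($i=1;$i<count($chars)-1;$i++) {
$chars2[$i] = ord($chars[$i]);
}

$h = hexdec("00000000");
$f = hexdec("F0000000");

for($i=1;$i<count($chars)-1;$i++) {
$h = ($h << 4) + $chars2[$i];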
if ($g = $h & $f) { $h ^= $g >> 24; };
$h &= ~$g;
}
return $h;
}

#===================================================================

function read_template($filename) {

$size = filesize($filename);
$fd = fopen ($filename, "rb");
$template = fread ($fd, $size);
fclose ($fd);

global $templates;

$count = preg_match_all("/(.*?)/s", $template, $matches, PREG_SET_ORDER);
for($i=0; $i < count($matches); $i++) {
$templates[$matches[$i][1]] = $matches[$i][2];
}
}
#===================================================================

function print_template($part) {

global $templates;
global $query, $search_time, $query_statistics, $stpos, $url, $title, $size, $description, $rescount, $next_results;
$template = $templates[$part];

$template = preg_replace("/%query%/s","$query",$template);
$template = preg_replace("/%search_time%/s","$search_time",$template);
$template = preg_replace("/%query_statistics%/s","$query_statistics",$template);
$template = preg_replace("/%stpos%/s",$stpos+1,$template);
$template = preg_replace("/%url%/s","$url",$template);
$template = preg_replace("/%title%/s","$title",$template);
$template = preg_replace("/%size%/s","$size",$template);
$template = preg_replace("/%description%/s","$description",$template);
$template = preg_replace("/%rescount%/s","$rescount",$template);
$template = preg_replace("/%next_results%/s","$next_results",$template);

return $template;
}
#===================================================================

?>
Title: How do I make the search case-insensitive?
Posted by: NiggaInDaStudio on 08 June 2007, 11:18:15: And just in case:

config.php

# Directory where your HTML files are located
# In most cases you may use path relative to the location of script
# Or use absolute path
# Type "./" for the current directory
$base_dir = "./";

# Base URL of your site
$base_url = "http://www.teploenergoplast.ru/";

# site size
# 1 - Tiny ~1Mb
# 2 - Medium ~10Mb
# 3 - Big ~50Mb
# 4 - Large >100Mb
$site_size = 2;

# Path to index database files
$HASH = "db/0_hash";
$HASHWORDS = "db/0_hashwords";
$FINFO = "db/0_finfo";
$SITEWORDS = "db/0_sitewords";
$WORD_IND = "db/0_word_ind";

#===================================================================
#
# These variables are used by spider
#
#===================================================================

# Starting URL (used by spider)
$start_url = array(
"http://www.teploenergoplast.ru",
);

# Spider will index only files from these servers
$allow_url = array(
"http://www.teploenergoplast.ru",
);

#===================================================================
#
# All other variables are optional. Script should work fine
# with default settings.
# These variables control the indexing process.
#
#===================================================================

# File extensions to index
# Add "NONE" if you want to index files without extensions
$file_ext = 'php';

# List of directories, which should not be indexed
$no_index_dir = 'img db';

# List of files, which should not be indexed
$no_index_files = 'robots.txt';

#minimum word length to index
$min_length = 2;

# Whether to index numbers (set $numbers = "" if you don't want to index numbers)
# You may add here other non-letter characters, which you want to index
$numbers = "";

# Parts of documents, which should not be indexed
# Uncomment and edit, if you want to use this feature
$use_selective_indexing = "NO";
$no_index_strings = array(
"" => "",
"" => "",
);

# Cut default filenames from URL ("YES" or "NO")
$cut_default_filenames = 'YES';
$default_filenames = 'index.php';

# Convert URL to lower case ("YES" or "NO")
$url_to_lower_case = 'NO';

# Indexing scheme
# Whole word - 1
# Beginning of the word - 2
# Every substring - 3
$INDEXING_SCHEME = 2;

# Translate escape chars (like &Egrave; or &yuml;) ("YES" or "NO")
$use_esc = "YES";

# Index META tags ("YES" or "NO")
$use_META = "YES";

# List of stopwords ("YES" or "NO")
$use_stop_words = "YES";
$stop_words = "и или любой любые любых но ему их как от на с из когда где нее ее его него ей она он не без в для вас по";

#===================================================================
#
# These variables control the script output.
#
#===================================================================

# Number of results per page
$res_num=10;

# Define length of page description in output
# and use META description ("YES") or first "n" characters of page ("NO")
$descr_size = 256;
$use_META_descr = "NO";

#===================================================================
#
# --- end of configuration ---
#
# Please do not edit below this line unless you know what you are doing
#
#===================================================================

if ($site_size == 1) {
$HASHSIZE = 20001;
} elseif ($site_size == 3) {
$HASHSIZE = 100001;
} elseif ($site_size == 4) {
$HASHSIZE = 300001;
} else {
$HASHSIZE = 50001;
}

#===================================================================

function prepare_string($str) {
$str = preg_replace ("/^\s+|\s+$/", "", $str);
$str = preg_replace ("/\s+/", "|", $str);
$str = preg_replace ("/\./", "\\\.", $str);
$str = "(".$str.")";
return $str;
}

if (preg_match("/NONE/",$file_ext) ) {
$file_ext = preg_replace ("/NONE/", "", $file_ext);
$file_ext = prepare_string($file_ext);
$file_ext = '(\.'.$file_ext.'|^[^.]+|/[^.]*)$';
} else {
$file_ext = prepare_string($file_ext);
$file_ext = '\.'.$file_ext.'$';
}

$non_parse_ext = prepare_string($non_parse_ext);
$non_parse_ext = '\.'.$non_parse_ext.'$';

$no_index_dir = prepare_string($no_index_dir);

$no_index_files = prepare_string($no_index_files);

$default_filenames = prepare_string($default_filenames);
$default_filenames = '/'.$default_filenames.'$';

#===================================================================

$stop_words = preg_replace("/\s+/s"," ",$stop_words);
$pos = 0;
do {
$new_pos = strpos($stop_words," ",$pos);
if ($new_pos === FALSE) {
$word = substr($stop_words,$pos);
$stop_words_array[$word] = 1;
break;
};
$word = substr($stop_words,$pos,$new_pos-$pos);
$stop_words_array[$word] = 1;
$pos = $new_pos+1;
} while (1>0);

#===================================================================

$html_esc = array(
"&Agrave;" => chr(192),
"&Aacute;" => chr(193),
"&Acirc;" => chr(194),
"&Atilde;" => chr(195),
"&Auml;" => chr(196),
"&Aring;" => chr(197),
"&AElig;" => chr(198),
"&Ccedil;" => chr(199),
"&Egrave;" => chr(200),
"&Eacute;" => chr(201),
"&Ecirc;" => chr(202),
"&Euml;" => chr(203),
"&Igrave;" => chr(204),
"&Iacute;" => chr(205),
"&Icirc;" => chr(206),
"&Iuml;" => chr(207),
"&ETH;" => chr(208),
"&Ntilde;" => chr(209),
"&Ograve;" => chr(210),
"&Oacute;" => chr(211),
"&Ocirc;" => chr(212),
"&Otilde;" => chr(213),
"&Ouml;" => chr(214),
"&times;" => chr(215),
"&Oslash;" => chr(216),
"&Ugrave;" => chr(217),
"&Uacute;" => chr(218),
"&Ucirc;" => chr(219),
"&Uuml;" => chr(220),
"&Yacute;" => chr(221),
"&THORN;" => chr(222),
"&szlig;" => chr(223),
"&agrave;" => chr(224),
"&aacute;" => chr(225),
"&acirc;" => chr(226),
"&atilde;" => chr(227),
"&auml;" => chr(228),
"&aring;" => chr(229),
"&aelig;" => chr(230),
"&ccedil;" => chr(231),
"&egrave;" => chr(232),
"&eacute;" => chr(233),
"&ecirc;" => chr(234),
"&euml;" => chr(235),
"&igrave;" => chr(236),
"&iacute;" => chr(237),
"&icirc;" => chr(238),
"&iuml;" => chr(239),
"&eth;" => chr(240),
"&ntilde;" => chr(241),
"&ograve;" => chr(242),
"&oacute;" => chr(243),
"&ocirc;" => chr(244),
"&otilde;" => chr(245),
"&ouml;" => chr(246),
"&divide;" => chr(247),
"&oslash;" => chr(248),
"&ugrave;" => chr(249),
"&uacute;" => chr(250),
"&ucirc;" => chr(251),
"&uuml;" => chr(252),
"&yacute;" => chr(253),
"&thorn;" => chr(254),
"&yuml;" => chr(255),
"&nbsp;" => " ",
"&amp;" => " ",
"&quot;" => " ",
);

#===================================================================

function esc2char($str) {

global $html_esc;

$esc = $str[0];
$char = "";

if (preg_match ("/&[a-zA-Z]*;/", $esc)) {
$char = $html_esc[$esc];
} elseif (preg_match ("/&#([0-9]*);/", $esc, $matches)) {
   $char = chr($matches[1]);
} elseif (preg_match ("/&#x([0-9a-fA-F]*);/", $esc, $matches)) {
   $char = chr(hexdec($matches[1]));
}
return $char;
}
#=====================================================================

?>
Title: How do I make the search case-insensitive?
Posted by: Altaxar on 08 June 2007, 12:10:32: This is the line that actually performs the match, and it is the one to modify:
$pos = strpos($word, $query);

Try this:

if ( ($wholeword[$j]==1) && ($word != $query) ) {$word = "";};
$pos = strpos($word, $query);

replace with:
$word=strtolower($word);
if ( ($wholeword[$j]==1) && ($word != $query) ) {$word = "";};
$pos = strpos($word, $query);

and also (a little higher up, so that the characters of the query are converted to lowercase as well):

for ($j=0; $j<count($query_arr); $j++) {
$query = $query_arr[$j];

replace with:

for ($j=0; $j<count($query_arr); $j++) {
$query = strtolower($query_arr[$j]);
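
The change suggested above amounts to lowercasing both the indexed word and the query before comparing them. A minimal sketch of that idea follows. It assumes UTF-8 strings and an available mbstring extension; the helper name word_matches and its $encoding parameter are hypothetical and not part of the posted script. For an index stored in a single-byte encoding, pass that encoding instead (for example 'Windows-1251').

```php
<?php
// Hypothetical helper: mirrors the whole-word / substring check in get_results(),
// but lowercases both sides with mbstring before comparing.
function word_matches(string $word, string $query, bool $wholeWord, string $encoding = 'UTF-8'): bool
{
    $word  = mb_strtolower($word, $encoding);
    $query = mb_strtolower($query, $encoding);

    if ($wholeWord) {
        // Whole-word mode: the indexed word must equal the query exactly.
        return $word === $query;
    }

    // mb_strpos() returns false when nothing is found; offset 0 is a valid match,
    // so the comparison has to be strict.
    return mb_strpos($word, $query, 0, $encoding) !== false;
}
```

With this helper, "РАДИАТОРЫ", "Радиаторы" and "радиаторы" all match the same indexed word regardless of how the word was capitalized in the page.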
Title: How do I make the search case-insensitive?
Posted by: NiggaInDaStudio on 08 June 2007, 13:56:50: Altaxar, thanks, but it did not help. The case is still whatever I type in. And even if it had worked, that is not really the point. To repeat: converting the search query to lowercase will not do,
because, for example, the query "ппу" finds 9 documents while "ППУ" finds 10, so there it is the other way around.
Title: How do I make the search case-insensitive?
Posted by: Altaxar on 08 June 2007, 14:22:37:
$word=strtolower($word);
if ( ($wholeword[$j]==1) && ($word != $query) ) {$word = "";};
$pos = strpos($word, $query);

$word holds the string in which the query is searched for,
strpos() returns the position of one string inside another.

$pos = strpos($word, $query); - this is the line where the query is matched against the data read from the file.
$word=strtolower($word); - this line should have converted the string being searched to lowercase. Why it did not is something you will have to work out.

P.S. Note that the search goes word by word: each time, $word holds one indexed word read from the sitewords file, and the query is looked up inside it.
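
As a small illustration of the strpos() behavior described in this post: it returns the zero-based offset of the match, or false when there is no match, which is why the script compares the result with !== false. The sample strings are made up for the example.

```php
<?php
// strpos() works on bytes and is case-sensitive.
$found_at_start = strpos('radiator', 'rad');   // match at offset 0
$found_inside   = strpos('radiator', 'ator');  // match at offset 4
$not_found      = strpos('radiator', 'RAD');   // no match: the case differs

// A loose comparison treats offset 0 as "not found"; the strict one does not.
$loose_check  = ($found_at_start != false);
$strict_check = ($found_at_start !== false);
```

Here $loose_check ends up false even though the substring is present, while $strict_check is true.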
Title: How do I make the search case-insensitive?
Posted by: CGVictor on 08 June 2007, 16:27:09: Altaxar
Quote from Altaxar:
Why it did not is something you will have to work out

I had a similar problem, and it was caused by the locale.
I solved it with mb_strtolower.
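
For reference, a minimal sketch of the mb_strtolower approach. It assumes the mbstring extension is available and that the text is UTF-8; the thread does not say which encoding the site uses, so substitute the real one (for example 'Windows-1251'). Passing the encoding explicitly keeps the result independent of the server locale.

```php
<?php
$encoding = 'UTF-8'; // assumption: replace with the encoding of the indexed pages

// The query variants mentioned in the thread.
$queries = ['РАДИАТОРЫ', 'Радиаторы', 'радиаторы', 'ППУ', 'ппу'];

$normalized = [];
foreach ($queries as $q) {
    // mb_strtolower() handles Cyrillic letters in the given encoding.
    $normalized[] = mb_strtolower($q, $encoding);
}

// After normalization only two distinct search terms remain.
$distinct = array_values(array_unique($normalized));
```

The same normalization has to be applied to the indexed words as well as to the query, otherwise the two sides can still differ in case.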
Title: How do I make the search case-insensitive?
Posted by: NiggaInDaStudio on 09 June 2007, 11:07:45: CGVictor, thanks, it worked.